The advent of internet has made the volume of data going high everyday in all computer-based applications. This has entrusted researchers to design more powerful techniques to generate and manipulate large amounts of data to derive useful information. Indexing plays a vital role in the fast recovery of required data from large databases.
Among the many indexing techniques reported the signature file approach is preferred for its efficient evaluation of set-oriented queries and easy handling of insert and update operations. Initially applied on text data [2, 3, 12, 17, 21] it has now been used in other applications like office filing [6], relational and Object-Oriented Databases [14,16,26] and hypertext [9].
Signatures are hash coded abstractions of the original data. It is a binary pattern of predefined length with fixed number of ls. The attributes' signatures are superimposed to form object's signature. To resolve a query, the query signature say Sq is generated using the same hash function and compared with signatures in the signature file for l s sequentially and many non-qualifying objects are immediately rejected.
If all the Is of Sq matches with that of the signature in the file it is called a drop. The signature file method guarantees that all qualifying objects will pass through the filtering mechanism; however some non-qualifying objects may also pass the signature test. The drop that actually matches the Sq is called an actual drop and drop that fails the test is called false drop. The next step in the query processing is the false drop resolution. To remove false drops each drop is accessed and checked individually. The number of false drops can be statistically controlled by careful design of the signature extraction method [7] and by using long signatures [3,6].
1.1 Related work
Different approaches have been discussed by researchers to represent Signature file in a way conducive for evaluating queries, such as Sequential Signature File [31], Bit-Slice Signature file [31], Multilevel Signature file [25], Compressed Multi Framed Signature file [23], Parallel Signature file [20], S-Tree and its variants [13,24], Signature Graph [28] and Signature tree [27,29,30].
1.2 Motivation for the current work
The signature tree developed by Chen [30] is having the following drawbacks:
* Signatures are inserted considering both 0s and 1s whereas actual weight age is for set bits only.
* Insertion path is dictated by the existing tree structure.
* To process a query, bits appearing in the tree from root node are compared with query signature pattern for 0s and I s and not by its set bits.
* For a 0-bit in the query both left and right sub tree is followed leading to multiple traversals.
These observations laid the foundation for the current work. We study a new indexing technique for OODBSs using the dynamic balancing of B+ tree called Signature Declustering (SD)-tree in which the positions of 1s in the signatures are distributed over a set of leaf nodes. Using this for a given query signature all the matching signatures can be retrieved cumulatively in a single node.
The rest of the paper is organized as follows. In section 2 we discuss briefly the different approaches used to represent the signature file. In section 3 the structure of the proposed SD-tree is shown. Section 4 is devoted to the algorithms for insert, search and delete operations. Section 5 proposes a sample data set and queries to validate the new structure. Section 6 reports the results of the experiments conducted with the analytical comparison of SD-tree with that of Signature tree [30]. Finally section 7 concludes the work with further outlook on the work.
2 A summary of signature file techniques
2.1 Signature files
A Signature is a bit string formed from a given value. Compared to other index structures, signature file is more efficient in handling new insertions and queries on parts of words. Other advantages include its simple implementation and the ability to support a growing file. But it introduces information loss which can be minimized by carefully selecting the signature extraction method.
Techniques for signature extraction such as Word Signature (WS) [2,3,4,6], Superimposed Coding (SC) [1,2,3,4,5,6,10], Multilevel Superimposed Coding [12], Run Length Encoding (RL) [3,4,6], Bit-block Compression (BC) [3,4], Variable Bit-block Compression (VBC) [4,6] have been reported. The encoding scheme sets a constant number say m, of 1s in the range [1..F], where F is the length of the signature.
The resulting binary pattern with m number of 1s and (Fm) number of 0s is called a word signature. The signature of a text block or object can be obtained by superimposing (logical OR operation) all its constituent signatures (i.e) word signatures for text block and attributes' signatures for object. The set of all signatures form a signature file. An example of Superimposed Coding and a sample query evaluation is given below.
Information 0010 0100
Retrieval 0100 0001
Block Signature 0110 0101
Sample queries
Matching query
Keyword = Information 0010 0100
Query descriptor0010 0100
Block signature matches (Actual Drop)
False Match query
Keyword = Coding0010 0001
Query descriptor0010 0001
Block signature matches (False Drop)
but keyword does not
Non-matching query
Keyword = Information 0010 0100
Keyword = Science 0000 0110
Query descriptor0010 0110
Block signature does not match
2.2 Applications of signatures
Tuesday, January 18, 2011
Subscribe to:
Post Comments (Atom)
0 comments:
Post a Comment