Tuesday, January 18, 2011

Similarity measures for relational databases

The relational algebra (4; 15), a relational data model with five basic operations on relations, i.e., Cartesian product x, projection [pi], selection [sigma], union [union], and set difference --, and several additional operations such as [theta]-join or intersection, has three main advantages over non-relational data models (13):

--From the point of view of usability, the model has a simple interpretation in terms of real-world concepts, i.e., the essential data structure of the model is a relation, which can be visualized in a tabular format.

--From the point of view of applicability, the model is flexible and general, and can be easily adapted to many applications.

--From the point of view of formalism, the model is elegant enough to support extensive research and analysis.

Hence, the relational data models have gained acceptance from a broad range of users, they have gained popularity and credibility in a variety of application areas, and they facilitate better theoretical research in many fundamental issues arising from database query languages and dependency theory.

However, there are several applications that have evolved beyond the capabilities of traditional relational data models, such as applications that require databases to cooperate with the user by suggesting answers which may be helpful but were not explicitly asked for. The cooperative-behaviour or cooperative-answering techniques (5) may be differentiated into the following categories:

i.) consideration of specific information about a user's state of mind,

ii.) evaluation of presuppositions in a query,

iii.) detection and correction of misconceptions in a query,

iv.) formulation of intensional answers,

v.) generalization of queries and of responses.

The cooperative behaviour plays an important part, for instance, in information-providing dialogue systems (7), where the most vital cooperative-answering technique leading to user satisfaction is generalization of queries and of responses as shown by Hajdinjak and Mihelic (8). Generalization of queries and of responses, the aim of which is to capture possibly relevant information, is often achieved by query relaxation (6).

Another kind of applications not suitable for the traditional relational data models are applications which require the database to be enhanced with a notion of similarity that allows one to perform approximate searches (9). The goal in these applications is often one of the following:

i.) Find objects whose feature values fall within a given range or where the distance from some query object falls into a certain range (range queries).

ii.) Find objects whose features have values similar to those of a given query object or set of query objects (nearest neighbour queries and approximate nearest neighbour queries).

iii.) Find pairs of objects from the same set or different sets which are sufficiently similar to each other (closest pairs queries).

Examples of such approximate-matching or similarity-search applications are databases storing images, fingerprints, audio clips or time sequences, text databases with typographical or spelling errors, text databases where we look for documents that are similar to a given document, and computational-biology applications where we want to find a DNA or a protein sequence in a database allowing some errors due to typical variations.

Persuaded that many applications will never reach the limitations of the widespread relational data model this article focuses on traditional relational algebra equipped with extra features that allow query relaxation and similarity searches. Although a large body of work has addressed how to extend the relational data model to incorporate cooperativity, neighbouring information, and/or orderings (2; 3; 10; 11; 13), neither of them have succeeded to fit into the representational and operational uniformity of traditional relational algebra or even to reach a certain degree of generality.

Therefore, we are going to talk about domains, similarity, approximate answers, and nearness of data in a highly systematic and comprehensive way, which will lead us towards an usable, applicable, and a formaly strong generalization of the relational data model.

2 Sets with similarity

Most applications and proposed solutions of non-exact matches and similarity search, which are not covered by traditional relational algebra, have some common characteristics--there is a universe of objects and a non-negative distance or distance-like function defined among them. The distance function measures how close are the non-exact matches to the exact specifications that were given by the user willing to accept approximate answers.

Instead of restricting only to distance metrics, we consider more general similarity measures that satisfy the only condition of being reflexive, i.e., every object is most similar to itself. Hence, rather than focusing on (ordinary sets) or metric spaces, we will consider more general sets with similarity, where a measure of similarity assigns to a pair of objects a similarity value, which tells us how similar they are. Note, we speak of similarity instead of distance--if a point x moves toward a point y, the distance between x and y gets smaller, but their similarity gets larger.

0 comments:

Post a Comment

newer post older post Home