SRT-Rank: Ranking Keyword Query Results in Relational Databases Using the Strongly Related Tree

In-Joong KIM  Kyu-Young WHANG  Hyuk-Yoon KWON  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E97-D   No.9   pp.2398-2414
Publication Date: 2014/09/01
Online ISSN: 1745-1361
Type of Manuscript: PAPER
Category: Data Engineering, Web Information Systems
Keyword: 
keyword query,  relational database,  semantic relevancy,  lossless join,  strongly related tree,  

Full Text: PDF(3.3MB)
>>Buy this Article


Summary: 
A top-k keyword query in relational databases returns k trees of tuples — where the tuples containing the query keywords are connected via primary key-foreign key relationships — in the order of relevance to the query. Existing works are classified into two categories: 1) the schema-based approach and 2) the schema-free approach. We focus on the former utilizing database schema information for more effective ranking of the query results. Ranking measures used in existing works can be classified into two categories: 1) the size of the tree (i.e., the syntactic score) and 2) ranking measures, such as TF-IDF, borrowed from the information retrieval field. However, these measures do not take into account semantic relevancy among relations containing the tuples in the query results. In this paper, we propose a new ranking method that ranks the query results by utilizing semantic relevancy among relations containing the tuples at the schema level. First, we propose a structure of semantically strongly related relations, which we call the strongly related tree (SRT). An SRT is a tree that maximally connects relations based on the lossless join property. Next, we propose a new ranking method, SRT-Rank, that ranks the query results by a new scoring function augmenting existing ones with the concept of the SRT. SRT-Rank is the first research effort that applies semantic relevancy among relations to ranking the results of keyword queries. To show the effectiveness of SRT-Rank, we perform experiments on synthetic and real datasets by augmenting the representative existing methods with SRT-Rank. Experimental results show that, compared with existing methods, SRT-Rank improves performance in terms of four quality measures — the mean normalized discounted cumulative gain (nDCG), the number of queries whose top-1 result is relevant to the query, the mean reciprocal rank, and the mean average precision — by up to 46.9%, 160.0%, 61.7%, and 63.8%, respectively. In addition, we show that the query performance of SRT-Rank is comparable to or better than those of existing methods.