This paper presents an effective keyword search method for data-centric extensive markup language (XML) documents. The method divides an XML document into compact connected integral subtrees, called self-integral tr...This paper presents an effective keyword search method for data-centric extensive markup language (XML) documents. The method divides an XML document into compact connected integral subtrees, called self-integral trees (SI-Trees), to capture the structural information in the XML document. The SI-Trees are generated based on a schema guide. Meaningful self-integral trees (MSI-Trees) are identified, which contain all or some of the input keywords for the keyword search in the XML documents. Indexing is used to accelerate the retrieval of MSI-Trees related to the input keywords. The MSI-Trees are ranked to identify the top-k results with the highest ranks. Extensive tests demonstrate that this method costs 10-100 ms to answer a keyword query, and outperforms existing approaches by 1-2 orders of magnitude.展开更多
基金Partly Supported by the National High-Tech Research and Development (863) Program of China (No. 2007AA01Z152)the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList)2008 HP Labs Innovation Research Program
文摘This paper presents an effective keyword search method for data-centric extensive markup language (XML) documents. The method divides an XML document into compact connected integral subtrees, called self-integral trees (SI-Trees), to capture the structural information in the XML document. The SI-Trees are generated based on a schema guide. Meaningful self-integral trees (MSI-Trees) are identified, which contain all or some of the input keywords for the keyword search in the XML documents. Indexing is used to accelerate the retrieval of MSI-Trees related to the input keywords. The MSI-Trees are ranked to identify the top-k results with the highest ranks. Extensive tests demonstrate that this method costs 10-100 ms to answer a keyword query, and outperforms existing approaches by 1-2 orders of magnitude.