Funding: This work was supported by the National Key Technology R&D Program of China under Grant No. 2015BAH14F02, the National Natural Science Foundation of China under Grant Nos. 61572272, 61202008, 61325008, and 61370055, and the Tsinghua University Initiative Scientific Research Program.
Abstract: Large volumes of short text records, such as news highlights, scientific paper citations, and messages posted in discussion forums, are now prevalent and are often stored as set records in hidden-Web databases. Many interesting information retrieval tasks correspondingly arise from correlation queries over these short text records, such as finding hot topics in news highlights and searching for scientific papers related to a certain topic. However, current relational database management systems (RDBMS) do not directly support set correlation queries. In this paper, we therefore address both the effectiveness and the efficiency of set correlation queries over set records in databases. First, we present a framework for set correlation queries inside databases. To the best of our knowledge, only Pearson's correlation can be implemented to construct token correlations using RDBMS facilities. We therefore propose a novel correlation coefficient that extends Pearson's correlation and provide a pure-SQL implementation inside databases. We further propose optimal strategies for setting the correlation filtering threshold, which can greatly reduce query time. Our theoretical analysis proves that, with a properly set filtering threshold, query efficiency can be improved with only a small loss of effectiveness. Finally, we conduct extensive experiments to show the effectiveness and efficiency of the proposed correlation query and optimization strategies.
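As a rough illustration of token correlation with threshold filtering, the sketch below computes the standard Pearson correlation between binary token-occurrence indicators over a small collection of set records and keeps only pairs at or above a threshold. It is not the paper's novel coefficient or its pure-SQL implementation, neither of which is specified in the abstract; the function name `token_correlations`, the `threshold` parameter, and the sample records are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def token_correlations(records, threshold=0.0):
    """Pearson correlation between binary token-occurrence indicators.

    Hypothetical helper: each record is treated as a set of tokens, and
    only token pairs whose correlation is >= threshold are returned.
    """
    sets = [set(r) for r in records]
    n = len(sets)
    tokens = sorted(set().union(*sets))
    # df[t]: number of records that contain token t
    df = {t: sum(t in s for s in sets) for t in tokens}

    result = {}
    for a, b in combinations(tokens, 2):
        # n_ab: number of records that contain both tokens
        n_ab = sum(a in s and b in s for s in sets)
        denom = sqrt(df[a] * (n - df[a]) * df[b] * (n - df[b]))
        if denom == 0:
            continue  # a token occurs in every record, so correlation is undefined
        r = (n * n_ab - df[a] * df[b]) / denom
        if r >= threshold:  # filtering step: drop weakly correlated pairs
            result[(a, b)] = r
    return result


records = [["a", "b"], ["a", "b"], ["c"], ["c"]]
correlated = token_correlations(records, threshold=0.5)
```

Raising `threshold` shrinks the set of token pairs that later stages must consider, which is the general efficiency-versus-effectiveness trade-off the abstract refers to.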
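Abstract: Electronic medical record (EMR) retrieval is a new area of information retrieval research. Medical terms play an important role in EMR retrieval: they are typically used to restrict search conditions and to express the user's search intent. To address this, a query reformulation method based on adjusting the weights of medical terms is proposed to improve EMR retrieval performance. The method first selects the medical terms from the original query, then uses self-information to measure the weight of each medical term, and finally combines the weighted medical terms with the original query at a given weight ratio to construct a new query. Experiments on a TREC dataset show that, compared with the results of the original query, the reformulated query improves MAP, bpref, and P10 by 14.2%, 10.1%, and 9.6%, respectively, confirming the effectiveness of the method.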
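The three steps of the reformulation method (select medical terms, weight them by self-information, combine with the original query) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how term probabilities are estimated or how the two parts are mixed, so the `term_prob` table, the uniform weighting of the original query, the normalisation, and the `ratio` parameter are all hypothetical.

```python
from math import log2


def reformulate_query(query_terms, medical_terms, term_prob, ratio=0.5):
    """Return {term: weight} for a reformulated query (illustrative only).

    ratio is the assumed share of total weight given to the medical-term part.
    """
    # Original query part: uniform weight spread over the query terms.
    weights = {}
    for t in query_terms:
        weights[t] = weights.get(t, 0.0) + (1.0 - ratio) / len(query_terms)

    # Medical part: self-information I(t) = -log2 p(t); rarer terms score higher.
    info = {t: -log2(term_prob[t]) for t in set(query_terms) if t in medical_terms}
    total = sum(info.values())
    if total > 0:
        for t, i in info.items():
            # Normalise so the medical part contributes exactly `ratio` in total.
            weights[t] += ratio * i / total
    return weights


query = ["patient", "with", "diabetes", "hypertension"]
medical = {"diabetes", "hypertension"}
prob = {"diabetes": 0.25, "hypertension": 0.5}  # made-up probabilities
w = reformulate_query(query, medical, prob, ratio=0.5)
```

In this sketch the rarer medical term ("diabetes") receives the largest weight, and the weights sum to 1 whenever at least one medical term carries nonzero self-information.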