Funding: Supported by the National Basic Research Program of China (2003CB317002) and a grant from City University of Hong Kong (7002137).
Abstract: This paper proposes a passage retrieval strategy for web-based question answering (QA) systems. The strategy first analyzes the question using semantic patterns to obtain its syntactic and semantic information, and then forms initial queries. These queries are used to retrieve documents from the World Wide Web (WWW) through the Google search engine. The queries are then rewritten for passage retrieval in order to improve precision; the rewriting method exploits the relations between keywords in the question. Experimental results on the question set of the TREC-2003 passage task show that the system performs well on factoid questions.
Funding: Supported by the National High Technology Research and Development Program of China (2002AA119050).
Abstract: This paper proposes a novel method for subtopic segmentation of Web documents. Subtopic segmentation can improve retrieval results. The proposed method segments subtopics hierarchically and identifies the boundary of each subtopic. Based on the term frequency matrix, it measures the similarity between adjacent blocks such as paragraphs and passages. In an experiment on real-world samples, the macro-averaged precision and recall reach 73.4% and 82.5%, and the micro-averaged precision and recall reach 72.9% and 83.1%. Moreover, the method is equally effective for other Asian languages such as Japanese and Korean, as well as for Western languages.
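The adjacent-block comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not name the similarity measure or the boundary rule, so the cosine similarity, the fixed `threshold`, the simple tokenizer, and all function names below are assumptions made for illustration only; the hierarchical part of the method is not reproduced.

```python
import math
import re
from collections import Counter


def term_frequencies(block):
    """Count lowercase alphanumeric terms in one text block (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", block.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def subtopic_boundaries(blocks, threshold=0.2):
    """Return indices i such that a subtopic boundary lies just before blocks[i].

    A boundary is placed wherever the similarity between two adjacent
    blocks drops below the (hypothetical) threshold.
    """
    vectors = [term_frequencies(b) for b in blocks]
    return [
        i + 1
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]


blocks = [
    "cats purr and cats sleep",
    "cats sleep all day",
    "stock markets fell sharply",
    "markets fell on weak stock data",
]
print(subtopic_boundaries(blocks))
```

In this toy input the first two blocks share vocabulary, as do the last two, so the only boundary falls before the third block.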
Funding: Supported by the Key Research and Development Program of Hubei Province (2020BAB017), the Scientific Research Center Program of National Language Commission (ZDI135-135), and the Fundamental Research Funds for the Central Universities (CCNU22QN015).
Abstract: 1 Introduction. Inspired by the impressive success of BERT [1] in various NLP applications, researchers have attempted to apply pretrained language models to information retrieval, and existing BERT-based retrieval models obtain improved performance on passage retrieval [2-4]. However, because BERT accepts at most 512 tokens, simply applying those models to long document retrieval yields suboptimal results.
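To make the 512-token constraint concrete, the sketch below splits a long token sequence into overlapping windows that each fit within the limit. This is a common generic workaround and is not claimed to be the approach taken in the paper, which this excerpt does not describe; the function name and the `stride` value are assumptions.

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so neighbouring
    windows overlap by max_len - stride tokens when stride < max_len.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += stride
    return windows


# Integers stand in for token ids of a 1000-token document.
doc_tokens = list(range(1000))
windows = sliding_windows(doc_tokens)
print([len(w) for w in windows])
```

Each window could then be scored independently by a passage-level model, with the per-window scores aggregated into a document score; how to aggregate is a separate design choice not covered here.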
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104 and the National Natural Science Foundation of China under Grant Nos. 62076231 and U1811461.
Abstract: In this paper, we study the problem of extracting a variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of the logical document hierarchy is a vital step in supporting many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, which contain hundreds or even thousands of pages and a variable-depth hierarchy, challenge existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), in which we "sequentially" insert each physical object at the proper position in the current tree. Determining whether each possible position is proper can be formulated as a binary classification problem. To further improve effectiveness and efficiency, we study design variants of HELD, including the traversal order of the insertion positions, explicit or implicit heading extraction, tolerance to insertion errors in preceding steps, and so on. For evaluation, we find that previous studies ignore the error in which the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously harm downstream applications, a new measure is developed for a more careful evaluation. Experiments on thousands of long documents from the Chinese financial market, the English financial market, and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction offers the best tradeoff between effectiveness and efficiency, with accuracies of 0.9726, 0.7291, and 0.9578 on the Chinese financial, English financial, and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study of this task in terms of methods, evaluations, and applications.
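The sequential insertion idea in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the candidate positions are taken to be the nodes on the rightmost path of the current tree, visited root-to-leaf, and the learned binary classifier is replaced by a hand-written rule (`is_proper`) that relies on heading levels being known in advance. The `Node` class, the `BODY` level, and all function names are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass, field

BODY = math.inf  # assumed level for non-heading objects such as paragraphs


@dataclass
class Node:
    text: str
    level: float
    children: list = field(default_factory=list)


def rightmost_path(root):
    """Nodes from the root down to the most recently inserted leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def is_proper(parent, obj):
    """Hypothetical stand-in for the binary classifier.

    A position is accepted when the parent is shallower than the object
    and the parent's last child could not itself be an ancestor of it.
    """
    if parent.level >= obj.level:
        return False
    last = parent.children[-1] if parent.children else None
    return last is None or last.level >= obj.level


def build_hierarchy(objects):
    """Insert (text, level) objects one by one at the first proper position."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        for candidate in rightmost_path(root):  # root-to-leaf order
            if is_proper(candidate, obj):
                candidate.children.append(obj)
                break
    return root


objects = [
    ("1 Intro", 1),
    ("para a", BODY),
    ("1.1 Background", 2),
    ("para b", BODY),
    ("2 Method", 1),
    ("para c", BODY),
]
tree = build_hierarchy(objects)
print([child.text for child in tree.children])
```

In a learned setting the rule in `is_proper` would be replaced by a model that scores each candidate position from the content and layout of the object and its prospective parent.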