Abstract
Extracting phraseological sequences from corpora has long been a key technical step in phraseology research and has become a research hotspot in recent years. Because of the computational and operational complexity involved, previous studies have mostly relied on a single statistical measure to identify and extract such sequences, so the extracted data contain a large amount of noise. Using current big-data processing and computing techniques, this paper implements an extraction method that combines several statistical measures, including frequency, mutual information, and maximum boundary entropy, and develops a corresponding software system. Experimental results show that the system can process large corpora of tens of millions of word tokens on an ordinary computer and can significantly improve the quality of the extracted phraseological sequences.
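The three measures named in the abstract can be illustrated with a minimal Python sketch for two-word sequences. This is not the authors' system: the function name `score_bigrams`, the `min_freq` threshold, and the use of pointwise mutual information and separate left/right context entropies are assumptions made for illustration only; the abstract does not specify how the measures are defined or combined.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(tokens, min_freq=2):
    """Hypothetical helper: map each bigram with freq >= min_freq to
    (frequency, pointwise mutual information, left entropy, right entropy)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Collect the words seen immediately to the left and right of each bigram.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(n - 1):
        bg = (tokens[i], tokens[i + 1])
        if i > 0:
            left[bg][tokens[i - 1]] += 1
        if i + 2 < n:
            right[bg][tokens[i + 2]] += 1

    total_bigrams = n - 1
    scores = {}
    for bg, freq in bigrams.items():
        if freq < min_freq:
            continue
        p_xy = freq / total_bigrams
        p_x = unigrams[bg[0]] / n
        p_y = unigrams[bg[1]] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[bg] = (freq, pmi, entropy(left[bg]), entropy(right[bg]))
    return scores


tokens = "in terms of a b in terms of c d in terms of e f".split()
scores = score_bigrams(tokens)
```

In this toy input, "in terms" is always followed by "of", so its right-context entropy is zero, which suggests the sequence is incomplete on the right; a candidate whose left and right contexts both vary widely is more likely to be a self-contained unit.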
Source
《外语电化教学》 (Technology Enhanced Foreign Language Education)
CSSCI
Peking University Core Journal
2017, Issue 4, pp. 9-16 (8 pages)
Funding
Partial results of National Social Science Fund of China projects (Nos. 13BYY074, 14CYY049) and a Beijing Social Science Fund project (No. 16JDYYA001)
Keywords
Corpus-Driven Approach
Phraseological Sequence
Automatic Extraction
Design and Implementation