
Fast Substring Reduction Algorithm Based on Hashing (cited by: 4)

Fast Hash Algorithms on Statistical Substring Reduction
Abstract: Statistical approaches to multi-word units in Western and Eastern languages, and to unknown words in Eastern languages, require deleting equal-frequency substrings (substring reduction). Traditional substring reduction algorithms run in O(n^2) time, which is inefficient for large-scale corpora. This paper proposes a hash-based substring reduction algorithm with O(n) time complexity and proves mathematically that it is equivalent to the O(n^2) algorithm: given the same input, the two produce the same output. Experiments on corpora of different sizes show that the new algorithm greatly shortens the time needed for substring reduction, making it suitable for large-scale corpus processing.
Source: Journal of Fudan University: Natural Science (《复旦学报(自然科学版)》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2004, No. 5, pp. 948-951, 955 (5 pages)
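Funding: National High Technology Research and Development Program of China ("863" Program; 2001AA114019, 2001AA114210, 2002AA117010-08); National Natural Science Foundation of China (60083006); National Key Basic Research Development Program of China ("973" Program; G19980305011)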
Keywords: reduction (merging); hashing; algorithm; time complexity; large scale; deletion; corpus; processing; East and West; large scale corpus; text mining; multi-word unit; unknown word; statistical string frequency
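
The abstract states that a hash-based O(n) procedure gives the same output as the O(n^2) one, but this record does not reproduce either algorithm. The Python sketch below is only an illustration of how such a pair could look, not the paper's own procedure. It assumes that the input is a table of n-gram counts that is closed under substrings (every substring within the length range is also listed), and that a string is dropped when a longer listed string with the same frequency contains it. The function names `ngram_counts`, `reduce_naive`, and `reduce_hashed` are hypothetical.

```python
from collections import Counter


def ngram_counts(text, min_n, max_n):
    """Count every character n-gram of length min_n..max_n in text."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return dict(counts)


def reduce_naive(freq):
    """Quadratic reference: compare every string against every other one.

    A string is dropped if some other string with the same frequency
    contains it.
    """
    return {
        s: f
        for s, f in freq.items()
        if not any(s != t and f == g and s in t for t, g in freq.items())
    }


def reduce_hashed(freq):
    """Hash-based pass (assumed variant, not taken from the paper).

    Each string looks up only its two substrings that are one unit
    shorter (drop last unit, drop first unit) in the hash table and
    marks them redundant when the frequencies match.
    """
    redundant = set()
    for s, f in freq.items():
        if len(s) < 2:
            continue
        for sub in (s[:-1], s[1:]):
            if freq.get(sub) == f:
                redundant.add(sub)
    return {s: f for s, f in freq.items() if s not in redundant}


if __name__ == "__main__":
    counts = ngram_counts("abcabcabd", 1, 3)
    print(reduce_hashed(counts) == reduce_naive(counts))
```

Under the closure assumption, the two functions agree because frequency cannot increase as a string grows: if a string has the same count as a longer string containing it, it also has the same count as some listed string exactly one unit longer, which is what the hashed pass checks. Without that closure property the shortcut can miss redundant strings, so the sketch should not be applied to arbitrary string lists.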
