
A method for calculating the information entropy of Web pages (cited by: 1)

Method for calculating entropy of web information block
Abstract: To address topic drift in Web information extraction, this paper proposes hybrid entropy, a calculation method that reflects the information entropy of Web pages more accurately. The method treats the information block whose entropy is to be calculated in the context of a multi-page website. It accounts for the influence of information within the page and, at the same time, for the identical information distribution that a template produces across pages, which improves the accuracy of the entropy calculation. The method is applied to calculating the entropy of information blocks in information extraction, and the simulation results are compared with those of other algorithms. The results show that the method outperforms the others both in the accuracy of the calculated entropy and in how well it separates topic-related information blocks from topic-unrelated ones.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journal, 2010, No. 1, pp. 114-117 (4 pages)
Funding: National Social Science Fund of China project (08CTQ007)
Keywords: information entropy; information extraction; information block; template; term
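
The abstract does not give the hybrid entropy formula, so the following is only a minimal sketch of the cross-page half of the idea: a term that a template repeats evenly on every page has high entropy across the page set, while a topic-specific term concentrated on one page has low entropy. The function names, the averaging over a block's distinct terms, and the normalization to [0, 1] are illustrative assumptions, not the authors' method, and the intra-page component described in the abstract is not modeled here.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a count distribution, scaled to [0, 1]."""
    total = sum(counts)
    n = len(counts)
    if total == 0 or n < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    # Dividing by log(n) maps a uniform distribution to 1.0.
    return h / math.log(n)


def term_entropy_across_pages(term, pages):
    """Entropy of one term's frequency distribution over the page set.

    `pages` is a list of token lists, one list per page (assumed input shape).
    """
    return normalized_entropy([page.count(term) for page in pages])


def block_entropy(block_tokens, pages):
    """Mean cross-page entropy of the distinct terms in a block (assumed aggregation)."""
    terms = set(block_tokens)
    if not terms:
        return 0.0
    return sum(term_entropy_across_pages(t, pages) for t in terms) / len(terms)


if __name__ == "__main__":
    # Three toy pages sharing a template navigation bar (hypothetical data).
    pages = [
        ["home", "about", "contact", "python", "entropy"],
        ["home", "about", "contact", "football", "league"],
        ["home", "about", "contact", "recipe", "soup"],
    ]
    print(block_entropy(["home", "about", "contact"], pages))  # template block: close to 1.0
    print(block_entropy(["python", "entropy"], pages))         # content block: 0.0
```

Under these assumptions, a low block entropy suggests topic-related content and a high one suggests template material such as navigation; any threshold separating the two would have to be chosen empirically.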
