Abstract
To address the topic drift problem in Web information extraction, this paper proposes hybrid entropy, a method for computing information entropy that reflects the information content of Web pages more accurately. The method evaluates each information block in the context of a multi-page website: it accounts for the effect of information within the page and, at the same time, for the identical information distributions shared by pages generated from the same template. The method is applied to computing the entropy of information blocks in information extraction, and the simulation results are compared with those of other algorithms. The results indicate that it outperforms the other methods both in the accuracy of the computed entropy and in how well it separates topic-relevant information blocks from topic-irrelevant ones.
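The abstract names two ingredients, information within a page and information distributions repeated across template-generated pages, but does not give the hybrid entropy formula. The sketch below only illustrates those two ingredients with plain Shannon entropy; it is not the authors' method. The function names, the representation of a page as a list of terms, and the normalization by the number of pages are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def within_page_entropy(block_terms):
    """Entropy of the term distribution inside one information block."""
    return shannon_entropy(Counter(block_terms).values())


def cross_page_entropy(term, pages):
    """Normalized entropy of one term's occurrences across pages.

    Assumption: each page is a list of terms. A value near 1 means the
    term is spread evenly over the pages (template-like); a value near 0
    means it is concentrated in a few pages (content-like).
    """
    if len(pages) < 2:
        return 0.0
    counts = [page.count(term) for page in pages]
    return shannon_entropy(counts) / math.log2(len(pages))


def block_template_score(block_terms, pages):
    """Mean cross-page entropy over the distinct terms of a block
    (a hypothetical aggregate, not taken from the paper)."""
    terms = set(block_terms)
    if not terms:
        return 0.0
    return sum(cross_page_entropy(t, pages) for t in terms) / len(terms)


# Toy site: "home" and "news" come from a shared template,
# the remaining terms are page-specific content.
pages = [
    ["home", "news", "a", "b"],
    ["home", "news", "c", "d"],
    ["home", "news", "e", "f"],
]
nav_score = block_template_score(["home", "news"], pages)
body_score = block_template_score(["a", "b"], pages)
```

In this toy site the navigation block scores higher than the body block, which is the kind of separation between topic-irrelevant and topic-relevant blocks the abstract describes. How the paper actually combines the within-page and cross-page components is not stated here.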
Source
《计算机工程与设计》
CSCD
PKU Core Journal
2010, No. 1, pp. 114-117 (4 pages)
Computer Engineering and Design
Funding
National Social Science Fund of China project (08CTQ007)
Keywords
information entropy
information extraction
information block
template
feature term