Abstract
Web news content extraction underlies many "big data" and "big knowledge" applications and remains an open research problem. Tag paths and text block density are currently two of the most effective features for solving it. The tag path feature separates content from noise well at the level of the whole webpage, but it has difficulty recognizing noise inside a content block or content inside a noise block. The text block density feature recognizes high-density content blocks well, but it is not robust enough. To address these problems, we propose a Web information extraction model, CEDP, that effectively combines the tag path feature and the text block density feature. To exploit the merits of both, we design a tag path feature weighted by text block density, and we design a Web news extraction algorithm based on this feature, CEDP-NLTD. CEDP-NLTD is a fast, general-purpose, training-free, online Web news content extraction algorithm suited to extracting heterogeneous Web news pages from many sources, in many styles, and in many languages in the Web big data environment. Experiments on CleanEval and other test datasets show that CEDP-NLTD outperforms the state-of-the-art online extraction methods CETR, CETD, CEPR, and CEPF, and that it also outperforms CEDP-TD, CEDP-CTD, and CEDP-DSum, three algorithms obtained by plugging the three block density features designed for CETD directly into the CEDP model.
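The abstract does not give the CEDP-NLTD formula, so the following is only a minimal sketch of the general idea of weighting a tag path by a text block density. It assumes a CETD-style density (characters in a subtree divided by the number of tags in it) and a hypothetical scoring rule in which each text node contributes its length times the density of its parent block. The names `Node`, `TreeBuilder`, `text_density`, and `score_tag_paths` are illustrative and do not come from the paper; a real extractor would also need to skip script and style content.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text directly inside this node

    def path(self):
        """Tag path from the root down to this node, e.g. '#root/html/body/div/p'."""
        parts, node = [], self
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree with the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_chars += len(data.strip())


def subtree_stats(node):
    """Return (characters, tags) for everything below and including node's text."""
    chars, tags = node.text_chars, 0
    for child in node.children:
        c, t = subtree_stats(child)
        chars += c
        tags += t + 1
    return chars, tags


def text_density(node):
    # Assumed CETD-style density: characters per tag, with a floor of one tag.
    chars, tags = subtree_stats(node)
    return chars / max(tags, 1)


def iter_nodes(node):
    for child in node.children:
        yield child
        yield from iter_nodes(child)


def score_tag_paths(html):
    """Hypothetical weighting: text length times the density of the parent block."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scores = defaultdict(float)
    for node in iter_nodes(builder.root):
        if node.text_chars:
            scores[node.path()] += node.text_chars * text_density(node.parent)
    return dict(scores)


SAMPLE = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/w">World</a></li></ul>
<div><p>The city council approved the new transit plan after a long debate.</p>
<p>Officials said construction is expected to begin early next year.</p></div>
</body></html>"""

scores = score_tag_paths(SAMPLE)
for path, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:10.1f}  {path}")
```

On the sample page the paragraph path inside the `div` scores far higher than the navigation link path, which is the kind of separation a density-weighted tag path feature is meant to provide. Picking a threshold between the two groups is left out because the abstract does not say how CEDP-NLTD does it.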
Source
Scientia Sinica Informationis (Chinese title: 中国科学:信息科学)
Indexed in: CSCD; PKU Core Journals (北大核心)
2017, Issue 8, pp. 1078-1094 (17 pages)
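Funding
Supported by:
National Key Research and Development Program of China (Grant No. 2016YFB1000901)
Innovative Research Team Development Program of the Ministry of Education of China (Grant No. IRT13059)
National Natural Science Foundation of China (Grant Nos. 61273297, 61673152)
China Scholarship Council (Grant No. 201506695019)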