Abstract
Data-intensive dynamic web pages contain little text and are highly structured. To exploit these characteristics, this paper presents a web information extraction method based on HTML structure. The method first parses de-noised web pages into DOM trees, then computes the similarity between pages using tree edit distance and clusters the pages accordingly. Finally, it generates extraction rules for each cluster and uses them to extract data from the web pages.
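The parse, compare, and cluster steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the names `Node`, `TreeBuilder`, `parse`, `distance`, `similarity`, and `cluster` are hypothetical, the distance is a simplified top-down (Selkow-style) variant rather than a general tree edit distance, the normalization and the greedy threshold clustering are stand-ins, and rule generation is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "col", "wbr"}


class Node:
    """A tag-only tree node; text content is deliberately ignored."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree of tag names under a synthetic root."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def distance(a, b):
    """Top-down edit distance: relabeling a node costs 1,
    inserting or deleting a whole subtree costs its node count."""
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def similarity(a, b):
    """Distance normalized by combined tree size (an assumed normalization)."""
    return 1 - distance(a, b) / (size(a) + size(b))


def cluster(trees, threshold=0.7):
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster."""
    clusters = []
    for tree in trees:
        for members in clusters:
            if similarity(members[0], tree) >= threshold:
                members.append(tree)
                break
        else:
            clusters.append([tree])
    return clusters
```

In this sketch, two table-based pages that differ only in the number of rows would fall into one cluster, while a page built from `div` and `ul` elements would start its own. The threshold of 0.7 is an arbitrary illustrative value, not one taken from the paper.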
Authors
崔慧超
刘莉
CUI Hui-chao, LIU Li (College of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China)
Source
《电脑知识与技术》
2010, No. 1, pp. 212-213 (2 pages)
Computer Knowledge and Technology
Keywords
Web information extraction
tree edit distance
clustering
extraction rules