Abstract
Building on XML, HTML Tidy can be used to implement a lightweight method of Web data mining and transformation. The transformation mainly addresses the separation of the schema information expressed by an HTML document, or by a collection of such documents, from its content. The process consists of three steps: cleaning up the HTML documents with the standard class library provided by HTML Tidy, analyzing the structure of the HTML elements through the DOM tree, and finally extracting and transforming the data automatically with XSL and XPath.
Source
《华东理工大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2003, No. 6, pp. 613-616 (4 pages)
Journal of East China University of Science and Technology