Abstract
As research on Web information extraction has advanced, extraction techniques have gradually matured, and it has become feasible to use software to extract required information from Web pages. This paper studies a Web information extraction system implemented with .NET technology. It identifies and analyzes the key technical problems that must be solved (downloading and cleaning HTML documents, converting HTML to XML, locating and extracting data, and saving the extracted data) and discusses corresponding solutions.
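The four stages named in the abstract can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in .NET, and it is not the system described in the paper: every function name, the sample page, and the `prices` table layout are hypothetical and chosen for illustration. The cleaning step is deliberately simple (it drops scripts and styles and closes tags left open when an enclosing tag ends), so it should not be read as a general HTML repair tool.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have a close tag
SKIPPED_TAGS = {"script", "style"}                        # dropped during cleaning


def download_html(url):
    """Stage 1a: fetch a page (hypothetical helper, not called in the demo)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HtmlToXml(HTMLParser):
    """Stages 1b and 2: clean the markup and build a well-formed XML tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root, an assumption
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in [element.tag for element in self.stack[1:]]:
            # Also closes any children that were left open.
            while self.stack[-1].tag != tag:
                self.stack.pop()
            self.stack.pop()

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        parent = self.stack[-1]
        if len(parent):  # text after a child element belongs in its tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


def extract_rows(root):
    """Stage 3: locate the target table and pull out one record per row."""
    rows = []
    table = root.find(".//table[@id='prices']")
    if table is None:
        return rows
    for tr in table.iter("tr"):
        name = tr.find("td[@class='name']")
        price = tr.find("td[@class='price']")
        if name is not None and price is not None:
            rows.append({"name": "".join(name.itertext()).strip(),
                         "price": "".join(price.itertext()).strip()})
    return rows


def save_csv(rows, stream):
    """Stage 4: persist the extracted records (CSV here, as an assumption)."""
    writer = csv.DictWriter(stream, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)


# Inline sample with typical defects: a script, an unclosed <br>, a missing </tr>.
SAMPLE_HTML = """<html><head><script>var x = "<b>";</script></head>
<body><table id="prices">
<tr><td class="name">Widget<br></td><td class="price">9.50</td></tr>
<tr><td class="name">Gadget &amp; Co</td><td class="price">12.00</td>
</table></body></html>"""

root = html_to_xml(SAMPLE_HTML)
rows = extract_rows(root)
buffer = io.StringIO()
save_csv(rows, buffer)
print(buffer.getvalue())
```

Converting to XML before extraction means the data-location step can use path expressions instead of string matching on raw HTML. A production system would more likely rely on a dedicated HTML tidying library for the cleaning stage.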
Source
Software Guide (软件导刊)
2010, No. 12, pp. 120-122 (3 pages)
Funding
Research Project of the Education Department of Zhejiang Province (Y200803750)