Abstract
Extracting market data from Web pages is important for prediction and analysis. This paper proposes an algorithm for extracting market data from Web pages. It relies on empirical regularities such as "market data usually appear as the data table occupying the largest area of a Web page": the algorithm first automatically identifies the largest data table, then converts it into a DOM tree, and finally extracts the values of the tree's nodes. Unlike traditional algorithms, it locates the market data region automatically and does not require the user to define the extraction region. A prototype system for agricultural product price prediction was designed and developed. For a given agricultural product, the system automatically obtains price data from a specified website and predicts monthly prices. Experiments indicate that the prediction performance is good.
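The three steps named in the abstract (find the largest table, parse it, read out the cell values) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "largest" can be approximated by the number of cells, whereas the abstract speaks of the table's area on the page, and it uses Python's standard streaming HTML parser in place of an explicit DOM tree. The names `TableCollector` and `largest_table` and the sample page with its prices are hypothetical and made up for illustration.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> on a page.

    Simplification: a cell that itself contains a nested table is skipped;
    the nested table is still collected as a table of its own.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._stack = []   # tables that are currently open
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def largest_table(html):
    """Return the rows of the table with the most cells (assumed size proxy)."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return max(parser.tables,
               key=lambda rows: sum(len(row) for row in rows),
               default=[])


# Made-up sample page: a small navigation table and a larger price table.
page = """
<table><tr><td>Home</td><td>News</td></tr></table>
<table>
  <tr><th>Date</th><th>Price</th></tr>
  <tr><td>2009-01</td><td>3.20</td></tr>
  <tr><td>2009-02</td><td>3.35</td></tr>
</table>
"""

rows = largest_table(page)
```

On the sample page, the navigation table has two cells and the price table has six, so the price table is the one selected without the user marking any region. A real page would likely need a better size measure and encoding handling, which this sketch leaves out.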
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals list
2009, Issue 20, pp. 202-204, 248 (4 pages)
Funding
Anhui Province Scientific Research Project, No. KJ2008B033
Keywords
Web content mining
market data extraction
market data prediction