Design of an Engineering Collection Scheme for Government Procurement Data

Cited by: 2


Abstract: Government procurement generates large volumes of bidding and tendering data, most of which is presented to the public as Web text. Structured data is therefore hard to obtain, which seriously restricts the public's ability to stay informed about, analyze, and supervise the procurement process. This paper proposes an engineering collection scheme for government procurement data based on Web mining and builds a system for producing structured data from publicly disclosed government procurement data. First, by analyzing the sources and structure of bidding information, an engineering data crawling platform based on the Scrapy crawler framework is designed. Second, a dedicated information extractor is designed by combining rule-based and statistics-based extraction methods. Finally, a staged data cleaning center is established according to the characteristics of the domain; it filters the data layer by layer and outputs structured data that can be used for analysis and mining. System experiments demonstrate the feasibility and superiority of the scheme, which provides strong technical support for the supervisory and guiding functions of government procurement information disclosure.
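The extraction and cleaning stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field labels, regular expressions, sample notice, and function names below are all assumptions made for illustration. Only the rule-based half of the extractor is shown, and the Scrapy crawling stage and the statistics-based extraction are omitted.

```python
import re

# Hypothetical extraction rules; real notices differ by site and would
# need per-source patterns.
RULES = {
    "project_name": re.compile(r"项目名称[:：]\s*(.+)"),
    "budget": re.compile(r"预算金额[:：]\s*([\d,.]+)\s*(万元|元)"),
    "supplier": re.compile(r"中标供应商[:：]\s*(.+)"),
}


def extract(text):
    """Rule-based extraction: apply each pattern to the raw notice text."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        if match is None:
            record[field] = None
        elif field == "budget":
            record[field] = (match.group(1), match.group(2))
        else:
            record[field] = match.group(1).strip()
    return record


def normalize_budget(record):
    """Cleaning layer 1: convert the budget to a float amount in yuan."""
    raw = record.get("budget")
    if raw is not None:
        amount, unit = raw
        value = float(amount.replace(",", ""))
        record["budget"] = value * 10000 if unit == "万元" else value
    return record


def drop_incomplete(record):
    """Cleaning layer 2: reject records that lack any required field."""
    if any(record.get(field) is None for field in RULES):
        return None
    return record


def clean(record):
    """Run the cleaning layers in order; stop once a layer rejects the record."""
    for layer in (normalize_budget, drop_incomplete):
        record = layer(record)
        if record is None:
            return None
    return record


# Made-up sample notice text, not taken from any real procurement site.
notice = "项目名称：办公设备采购\n预算金额：12.5万元\n中标供应商：某某科技有限公司\n"
result = clean(extract(notice))
print(result)
```

Keeping each cleaning layer as a separate function mirrors the layer-by-layer filtering idea: a layer can be added, reordered, or removed without touching the extractor.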

Authors: WANG Hong; XIA Yu; CHANG Jingjing (College of Computer Science, Xi'an Shiyou University, Xi'an 710065, China)

Affiliation: College of Computer Science, Xi'an Shiyou University

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2020, No. 7, pp. 170-175 (6 pages)

Funding: Ministry of Education Industry-University Cooperative Education Program (201802224022)

Keywords: government procurement; Web mining; Scrapy crawler; information extraction; data cleaning

Classification: D630 [Politics and Law: Political Science]; TP311.13 [Politics and Law: Chinese and Foreign Political Systems]


