Abstract
Horizon scanning data are multi-source and heterogeneous, which gives rise to duplicate records and noisy records. To address these problems, a duplicate-detection method based on variable-length data chunking and a noise-detection method based on TF-IDF are selected to detect and delete duplicate and noisy data. Following the design ideas of service-oriented architecture (SOA), a deduplication and denoising system for horizon scanning data is designed and developed in Java. Preprocessing data with this system effectively raises the proportion of high-quality data and provides data-level support for subsequent industry analysis and technology identification.
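The abstract names variable-length chunking as the basis for duplicate detection but gives no implementation details. The following is only a minimal sketch of the general idea, not the authors' method: it assumes a polynomial rolling hash over a 16-byte window to pick chunk boundaries, SHA-256 chunk fingerprints, and Jaccard overlap of fingerprint sets as the similarity score. The class name `ChunkDedupSketch`, its methods, and all constants are hypothetical choices made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: content-defined (variable-length) chunking for duplicate detection.
class ChunkDedupSketch {
    // All constants are assumptions chosen for illustration, not values from the paper.
    static final int WINDOW = 16;     // rolling-hash window in bytes
    static final int BASE = 31;       // polynomial base
    static final int MASK = 0x3F;     // boundary when the low 6 bits of the hash are zero
    static final int MIN_CHUNK = 32;  // minimum chunk length in bytes
    static final int MAX_CHUNK = 256; // maximum chunk length in bytes

    // Splits data at content-defined boundaries and returns the set of chunk fingerprints.
    static Set<String> chunkFingerprints(byte[] data) {
        Set<String> fingerprints = new HashSet<>();
        int pow = 1; // BASE^(WINDOW-1); int overflow wraps, which is fine for hashing
        for (int i = 1; i < WINDOW; i++) pow *= BASE;
        int hash = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            // Drop the byte leaving the window, then shift in the new byte.
            if (i >= WINDOW) hash -= pow * (data[i - WINDOW] & 0xFF);
            hash = hash * BASE + (data[i] & 0xFF);
            int len = i - start + 1;
            boolean atBoundary = len >= MIN_CHUNK && (hash & MASK) == 0;
            if (atBoundary || len >= MAX_CHUNK || i == data.length - 1) {
                fingerprints.add(sha256Hex(data, start, len));
                start = i + 1;
            }
        }
        return fingerprints;
    }

    // Jaccard similarity of the two fingerprint sets; 1.0 means every chunk matches.
    static double similarity(byte[] a, byte[] b) {
        Set<String> fa = chunkFingerprints(a);
        Set<String> fb = chunkFingerprints(b);
        if (fa.isEmpty() && fb.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(fa);
        intersection.retainAll(fb);
        Set<String> union = new HashSet<>(fa);
        union.addAll(fb);
        return (double) intersection.size() / union.size();
    }

    static String sha256Hex(byte[] data, int offset, int length) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(data, offset, length);
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) sb.append(String.format("%02x", b & 0xFF));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "Horizon scanning record about solid-state batteries.".getBytes(StandardCharsets.UTF_8);
        byte[] b = "Horizon scanning record about perovskite solar cells.".getBytes(StandardCharsets.UTF_8);
        System.out.println("similarity = " + similarity(a, b));
    }
}
```

Because boundaries depend only on the bytes inside the rolling window, an insertion near the start of a record shifts at most the neighbouring chunks, so the remaining chunks can still match an earlier copy. A real system would compare the score against a tuned threshold before deleting a record.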
Authors
鄢天安
张文强
吴思
张英杰
YAN Tian’an; ZHANG Wenqiang; WU Si; ZHANG Yingjie (Engineering Center, Institute of Scientific and Technical Information of China, Beijing 100038, China)
Source
《微型电脑应用》
2023, No. 11, pp. 1-4 (4 pages)
Microcomputer Applications
Funding
National Key Research and Development Program of China, project topic (2019YFA0707202).