Abstract
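To address the high dimensionality, complex relationships, and anomalies in multi-source volatile organic compounds (VOCs) data, a VOCs data cleaning model was built on a hybrid of Kernel Principal Component Analysis (KPCA), Isolation Forest (IF), and Weighted Random Forest (WRF). First, the study area was divided into grids and a KPCA-IF model for dimensionality reduction and anomaly identification was established: KPCA reduces the dimension of the mixed multi-source VOCs data, and the IF algorithm identifies and removes anomalous records. Next, a WRF-based VOCs data compensation algorithm was designed to fill missing values by regression in the data set left after dimensionality reduction and anomaly identification. Finally, taking Xi'an as a case study, multi-source VOCs data including air quality and meteorological data were selected for cleaning. The results show that the hybrid model effectively reduces the dimension of multi-source VOCs data; its data cleaning achieves a mean absolute error of 5.08, a root mean square error of 10.24, and a median absolute error of 3.54, all better than the comparison models. This demonstrates that the KPCA-IF-WRF hybrid model is more robust and more accurate, and is scientifically sound and feasible.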
To address the high dimensionality, complex data relationships, and abnormal records found in multi-source Volatile Organic Compounds (VOCs) data, a data cleaning model based on a hybrid of Kernel Principal Component Analysis (KPCA), Isolation Forest (IF), and Weighted Random Forest (WRF) is proposed. First, a target area was selected and divided into 5 km × 5 km grid cells to support refined, visualized grid management of VOCs data. Then a KPCA-IF model for VOCs dimensionality reduction and abnormal data identification was established: KPCA reduced the dimension of the initial multi-source VOCs data, 10 principal components were retained according to the cumulative contribution rate, and the IF algorithm identified and removed abnormal data. Finally, a WRF-based VOCs data compensation algorithm was designed to fill missing values by regression in the data set left after dimensionality reduction and anomaly identification. Taking Xi'an as an example, multi-source VOCs data such as air quality data and meteorological data were selected for data cleaning. The results show that the KPCA-IF model identifies abnormal VOCs data points with an accuracy of 96.16%, which is 2.97%, 1.15% and 0.81% higher than the OCSVM, K-means and EllipticEnvelope methods, respectively. The Mean Absolute Error of the KPCA-IF-WRF model is 5.08, which is 0.80, 2.03 and 3.02 lower than that of the KPCA-OCSVM, RF and mean interpolation methods, respectively. Its Root Mean Square Error is 10.24, which is 3.57, 2.39 and 5.41 lower than those of the same three methods, and its Median Absolute Error is 3.54, which is 0.16, 2.01 and 2.45 lower. These results indicate that the KPCA-IF-WRF hybrid model is more robust and more accurate than the comparison methods and is scientifically sound and feasible. It can serve as a reference for subsequent VOCs source analysis, hazard assessment, concentration prediction, treatment scheme formulation, and environmental protection.
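The three stages named in the abstract (KPCA reduction by cumulative contribution rate, Isolation Forest screening, weighted random forest fill-in) can be sketched with scikit-learn as below. This is an illustrative sketch and not the authors' implementation. The synthetic data, the RBF kernel and its `gamma`, the 85% contribution threshold, the 5% contamination rate, and the inverse-validation-MSE tree weighting are all assumptions chosen for the example; the abstract does not specify them.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a multi-source feature table (NOT the paper's data).
n_samples, n_features = 400, 20
latent = rng.normal(size=(n_samples, 3))
X = latent @ rng.normal(size=(3, n_features))
X += 0.1 * rng.normal(size=X.shape)
# Hypothetical target column playing the role of a VOCs concentration.
y = 30 + 5 * latent[:, 0] - 3 * latent[:, 1] + rng.normal(scale=0.5, size=n_samples)
X[:10] += 8.0  # inject a few gross anomalies

# 1) KPCA: keep the leading components up to an assumed 85% cumulative
#    contribution rate. The variance of each projected component is
#    proportional to its kernel eigenvalue, so variance ratios are used.
X_std = StandardScaler().fit_transform(X)
Z_full = KernelPCA(n_components=None, kernel="rbf",
                   gamma=1.0 / n_features).fit_transform(X_std)
var = Z_full.var(axis=0)
cum_rate = np.cumsum(var) / var.sum()
k = int(np.searchsorted(cum_rate, 0.85) + 1)
Z = Z_full[:, :k]

# 2) Isolation Forest: fit_predict returns 1 for inliers and -1 for outliers.
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
keep = iso.fit_predict(Z) == 1
Zc, yc = Z[keep], y[keep]

# 3) Weighted random forest fill-in: hide about 20% of the targets, train on
#    the rest, and weight each tree by its inverse validation MSE (assumed).
missing = rng.random(len(yc)) < 0.2
Z_obs, y_obs = Zc[~missing], yc[~missing]
n_fit = int(0.8 * len(y_obs))
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(Z_obs[:n_fit], y_obs[:n_fit])

val_mse = np.array([mean_squared_error(y_obs[n_fit:], tree.predict(Z_obs[n_fit:]))
                    for tree in rf.estimators_])
w = 1.0 / (val_mse + 1e-12)
w /= w.sum()  # normalise so the weights sum to 1

tree_preds = np.stack([tree.predict(Zc[missing]) for tree in rf.estimators_])
y_filled = w @ tree_preds  # weighted average over trees

# The three error metrics reported in the abstract, on the hidden values.
mae = mean_absolute_error(yc[missing], y_filled)
rmse = np.sqrt(mean_squared_error(yc[missing], y_filled))
medae = median_absolute_error(yc[missing], y_filled)
print(f"components kept: {k}, rows kept: {int(keep.sum())}")
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MedAE={medae:.3f}")
```

The printed errors describe only this synthetic example and are not comparable to the values reported for the Xi'an data. Swapping the weighting rule (for instance, to out-of-bag error) changes only the computation of `w`.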
Authors
HUANG Guang-qiu (黄光球)
ZHAO Xi-xuan (赵羲轩)
LU Qiu-qin (陆秋琴)
School of Management, Xi'an University of Architecture and Technology, Xi'an 710055, China
Source
Journal of Safety and Environment (《安全与环境学报》)
CAS
CSCD
Peking University Core Journals
2022, Issue 6, pp. 3412-3423 (12 pages)
Funding
National Natural Science Foundation of China (71874134)
Natural Science Basic Research Program of Shaanxi Province (2019JZ-30)
Keywords
environmental engineering
volatile organic compounds
data cleaning
Kernel Principal Component Analysis (KPCA)
Isolation Forest (IF)
Weighted Random Forest (WRF)