摘要
大数据时代的到来催生了一门新的学科——数据科学。首先,探讨了数据科学的内涵、发展简史、学科地位及知识体系等基本问题,并提出了专业数据科学与专业中的数据科学之间的区别与联系。其次,分析现阶段数据科学的研究特点,并分别提出了专业数据科学、专业中的数据科学及大数据生态系统中的相对热门话题。接着,探讨了数据科学研究中的10个争议及挑战:思维模式的转变(知识范式还是数据范式)、对数据的认识(主动属性还是被动属性)、对智能的认识(更好的算法还是更多的数据)、主要瓶颈(数据密集型还是计算密集型)、数据准备(数据预处理还是数据加工)、服务质量(精准度还是用户体验)、数据分析(解释性分析还是预测性分析)、算法评价(复杂度还是扩展性)、研究范式(第三范式还是第四范式)、人才培养(数据工程师还是数据科学家)。然后,提出了数据科学研究的10个发展趋势:预测模型及相关分析的重视,模型集成及元分析的兴起,数据在先、模式在后或无模式的出现,数据一致性及现实主义的回归,多副本技术及靠近数据原则的广泛应用,多样化技术及一体化应用并存,简单计算及实用主义占据主导地位,数据产品开发及数据科学的嵌入式应用,专家余及公众数据科学的兴起,数据科学家与人才培养的探讨。最后,结合文中工作,对数据科学研究者给出了几点建议和注意事项。
The entering big data era gives rise to a novel discipline called data science.First,the differences between domain-general data science and domain-specific data science were proposed based upon conducting an in-depth discussion on its basic concept,brief history,scientific roles and the body of knowledge.Secondly,top ten challenges faced by data science were identified via describing the debates on paradoxical topics including the shifts of thinking pattern(knowledge pattern or data pattern),perspectives on data(active or negative),implementation of intelligence(via AI or via big data),bottlenecks of data products development(computing intensive or data intensive),data preparation(data preprocessing or data wrangling),quality of services(performance of services or user experiences),data analysis(explanatory or predictive),evaluation of algorithm(by complexity or by scalability),research paradigm(third paradigm or fourth paradigm)as well as main motivations of the education(in order to cultivate data engineer or data scientist).And then,the top ten trends in data science studies were proposed:to vale predictive models and correlation analysis,to give more attention on model integration and meta-analysis,to embrace data first,model later or never paradigm,to be led by realism and ensure data consistence,to support multi-copies and data locality,the coexistence of varieties in implementation techno logies and integrated applications,to be dominated by simple computing and pragmatism,to develop dataproducts and the embedded applications of data science,to embrace the Pro-Am and metadata,and cultivate data scientist and curriculums or majors.Finally,some suggestions on how do further studies were also proposed.
出处
《计算机科学》
CSCD
北大核心
2018年第1期1-13,共13页
Computer Science
基金
国家自然科学基金项目(91646202
71103020)
国家社会科学基金(15BTQ054
12&ZD220)资助
关键词
数据科学
大数据
数据产品开发
数据加工
数据驱动
Data science
Big data
Data products developement
Data wrangling
Data-driven