Nonsense Variable Names Detection Method Based on Lexical Features and Data Mining
Abstract Identifiers are an important part of source code and one of the key elements people rely on to understand its semantics. Variable names are among the most common identifiers, and their quality matters greatly for the readability and comprehensibility of code: a carefully chosen name serves as a major clue to the responsibility of the variable. However, for various reasons, developers frequently use meaningless variable names such as “a” and “var”. Such nonsense variable names severely harm the readability and maintainability of software applications, and they need to be detected and refactored (renamed). Automated identification of such bad smells is also one of the hot topics in the field of software refactoring. To identify nonsense names automatically, we first conduct an empirical study on open-source code to find the key features that distinguish nonsense names from well-constructed, meaningful ones. The results suggest that nonsense variable names are often short and rarely contain meaningful words. Based on this finding, we propose an approach based on lexical features and data mining for identifying nonsense variable names. It first uses lexical analysis to retrieve suspicious variable names, that is, names that are short and contain no meaningful words. If a suspicious name contains abbreviations, an abbreviation expansion algorithm expands it to the full name, which excludes names that were carefully constructed as abbreviations of meaningful words. It then applies data mining-based filtering to decide whether a suspicious name is a conventional, well-known name. Some common names, such as “i” and “e”, carry no explicit literal meaning, but developers understand them by convention, so they are not considered nonsense names and need no refactoring. A suspicious name that is not such a conventional name is judged to be a nonsense variable name, and the developer is reminded to rename it. Experimental results on open-source datasets show that the proposed approach is accurate: its average precision and recall are 85% and 91.5%, respectively.
Authors JIANG Yanjie (姜艳杰); DONG Chunhao (东春浩); LIU Hui (刘辉) (School of Computer Science and Technology, Peking University, Beijing 100087, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal list, 2024, No. 6, pp. 23-33 (11 pages)
Funding Key Program of the National Natural Science Foundation of China (62232003).
Keywords Software refactoring; Code quality; Data mining; Nonsense variable names; Lexical features
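
The three stages described in the abstract (lexical filtering, abbreviation expansion, and data mining of conventional names) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The word list `DICTIONARY`, the table `ABBREVIATIONS`, the length threshold, and the support/confidence rule used to decide that a name is conventional are all assumptions made for illustration; the abstract does not specify them.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in vocabulary; a real run would load a full dictionary.
DICTIONARY = {"count", "index", "name", "user", "file", "message", "buffer"}
# Assumption: a hypothetical lookup table standing in for abbreviation expansion.
ABBREVIATIONS = {"cnt": "count", "idx": "index", "msg": "message", "buf": "buffer"}
MAX_SUSPICIOUS_LENGTH = 4  # assumed cut-off for a "short" name


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)]


def is_suspicious(name):
    """Stage 1: the name is short and contains no dictionary word."""
    tokens = split_identifier(name)
    return len(name) <= MAX_SUSPICIOUS_LENGTH and not any(
        t in DICTIONARY for t in tokens
    )


def expands_to_words(name):
    """Stage 2: every token is a known abbreviation of a dictionary word."""
    tokens = split_identifier(name)
    return bool(tokens) and all(ABBREVIATIONS.get(t, t) in DICTIONARY for t in tokens)


def mine_conventions(declarations, min_support=3, min_confidence=0.8):
    """Stage 3 (assumed rule): a name is conventional for a type if the
    (name, type) pair is frequent and dominates the uses of that name."""
    pair_counts = Counter(declarations)
    name_counts = Counter(name for name, _ in declarations)
    return {
        name: var_type
        for (name, var_type), n in pair_counts.items()
        if n >= min_support and n / name_counts[name] >= min_confidence
    }


def detect_nonsense(declarations):
    """Return suspicious names that are neither abbreviations nor conventions."""
    conventions = mine_conventions(declarations)
    flagged = []
    for name, var_type in declarations:
        if not is_suspicious(name) or expands_to_words(name):
            continue
        if conventions.get(name) == var_type:
            continue
        if name not in flagged:
            flagged.append(name)
    return flagged


# Hypothetical (name, declared type) pairs, as if extracted from a code base.
sample = (
    [("i", "int")] * 4
    + [("e", "Exception")] * 3
    + [("a", "int"), ("a", "String"), ("var", "Object"),
       ("cnt", "int"), ("userName", "String")]
)
print(detect_nonsense(sample))  # prints ['a', 'var']
```

In this toy data, “i” and “e” survive because they recur with a single dominant type, “cnt” survives because it expands to “count”, and only “a” and “var” are reported, which matches the examples given in the abstract.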