Nonsense Variable Names Detection Method Based on Lexical Features and Data Mining

Abstract: Identifiers are an important part of source code and one of the key elements by which people understand its semantics. Variable names are among the most common identifiers, and their quality matters greatly for the readability and comprehensibility of code: a carefully chosen name is a major clue to the responsibility of the variable. However, for various reasons developers frequently use meaningless variable names such as "a" and "var". Such nonsense names severely reduce the readability and maintainability of software and should be detected and refactored (renamed), which makes the automated identification of such bad smells a hot topic in software refactoring. To this end, this paper proposes an automated approach based on lexical features and data mining to detect nonsense variable names. First, an empirical study of nonsense variable names in open-source code shows that they are usually short and rarely contain meaningful words, so lexical features can be used to retrieve suspicious names, that is, names that are short and contain no meaningful words. If a suspicious name contains abbreviations, an abbreviation expansion algorithm expands it into the full name, which excludes names that were deliberately constructed as abbreviations of meaningful words. Then, a data mining algorithm decides whether a suspicious name is a conventional, commonly used variable name. Some common names, such as "i" and "e", carry no explicit literal meaning, but developers understand them by convention; they are therefore not nonsense names and need no refactoring. If a suspicious name is not a conventional name, it is reported as a nonsense variable name and the developer is prompted to rename it. Experiments on open-source datasets show that the approach is accurate, with an average precision of 85% and an average recall of 91.5%.
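The three filtering stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the length threshold, the tiny `DICTIONARY` and `ABBREVIATIONS` tables, and the "seen in at least `min_projects` projects" rule that stands in for the data mining step are all assumptions made for illustration, and every function name is hypothetical.

```python
import re

# Assumed toy vocabulary; a real system would use a full English dictionary.
DICTIONARY = {"count", "index", "name", "file", "user", "message", "buffer"}
# Assumed lookup table standing in for an abbreviation expansion algorithm.
ABBREVIATIONS = {"cnt": "count", "idx": "index", "msg": "message", "buf": "buffer"}
# Assumed cut-off for what counts as a "short" name.
MAX_SUSPICIOUS_LENGTH = 4


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def is_suspicious(name):
    """Stage 1 (lexical): short names that contain no dictionary word."""
    has_word = any(token in DICTIONARY for token in split_identifier(name))
    return len(name) <= MAX_SUSPICIOUS_LENGTH and not has_word


def expand_abbreviation(name):
    """Stage 2: return the expanded name if every token is a known abbreviation."""
    tokens = split_identifier(name)
    if tokens and all(token in ABBREVIATIONS for token in tokens):
        return "_".join(ABBREVIATIONS[token] for token in tokens)
    return None


def mine_conventional_names(observations, min_projects=2):
    """Stage 3 (assumed rule): a name is conventional if it occurs in at
    least `min_projects` distinct projects of the mined corpus."""
    projects_by_name = {}
    for project, name in observations:
        projects_by_name.setdefault(name, set()).add(project)
    return {name for name, projects in projects_by_name.items()
            if len(projects) >= min_projects}


def detect_nonsense(names, conventional):
    """Flag names that survive all three filters."""
    flagged = []
    for name in names:
        if not is_suspicious(name):
            continue  # long enough or contains a meaningful word
        if expand_abbreviation(name) is not None:
            continue  # a deliberate abbreviation of meaningful words
        if name in conventional:
            continue  # understood by convention, e.g. a loop counter
        flagged.append(name)
    return flagged


# Hypothetical mined corpus of (project, variable name) observations.
corpus = [("p1", "i"), ("p2", "i"), ("p1", "e"), ("p3", "e"), ("p3", "a")]
conventional = mine_conventional_names(corpus)
print(detect_nonsense(["i", "e", "a", "var", "cnt", "userName"], conventional))
# prints ['a', 'var']
```

In this toy run, "cnt" is excluded because it expands to "count", "i" and "e" are excluded because they recur across projects, and "userName" never becomes suspicious; only "a" and "var" are reported for renaming.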

Authors: JIANG Yanjie; DONG Chunhao; LIU Hui (School of Computer Science and Technology, Peking University, Beijing 100087, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)

Affiliations: School of Computer Science, Peking University; School of Computer Science, Beijing Institute of Technology

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 6, pp. 23-33 (11 pages)

Funding: Key Program of the National Natural Science Foundation of China (62232003).

Keywords: software refactoring; code quality; data mining; nonsense variable names; lexical features

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

