Following the expanding of VSM and LSI, a text classification based on Concept Space is proposed in thispaper. Information gaining is applied to acquire concepts based on large training set. Concept Space is built by ...Following the expanding of VSM and LSI, a text classification based on Concept Space is proposed in thispaper. Information gaining is applied to acquire concepts based on large training set. Concept Space is built by acquir-ing latent semantic indexing data, building a latent semantic space by LSI, and then adding the class-basis vector. Thecalculating method of the word-similarity, the text-similarity, the similarity of the text vector and the class-basis vec-tor in Concept Space are presented. Experiment results show the Concept Space method is superior to Vector SpaceModel. This paper also discusses the future work the problem of concept space learning.展开更多
With the purpose of improving the accuracy of text categorization and reducing the dimension of the feature space,this paper proposes a two-stage feature selection method based on a novel category correlation degree(C...With the purpose of improving the accuracy of text categorization and reducing the dimension of the feature space,this paper proposes a two-stage feature selection method based on a novel category correlation degree(CCD)method and latent semantic indexing(LSI).In the first stage,a novel CCD method is proposed to select the most effective features for text classification,which is more effective than the traditional feature selection method.In the second stage,document representation requires a high dimensionality of the feature space and does not take into account the semantic relation between features,which leads to a poor categorization accuracy.So LSI method is proposed to solve these problems by using statistically derived conceptual indices to replace the individual terms which can discover the important correlative relationship between features and reduce the feature space dimension.Firstly,each feature in our algorithm is ranked depending on their importance of classification using CCD method.Secondly,we construct a new semantic space based on LSI method among features.The experimental results have proved that our method can reduce effectively the dimension of text vector and improve the performance of text categorization.展开更多
软件系统的实体演化耦合分析有助于共同变更预测、软件供应链风险识别、代码漏洞溯源、缺陷预测、架构问题定位等分析活动.两个代码实体之间存在演化耦合(evolutionary coupling)是指在软件修订历史中,这对实体倾向于共同变更(共变).已...软件系统的实体演化耦合分析有助于共同变更预测、软件供应链风险识别、代码漏洞溯源、缺陷预测、架构问题定位等分析活动.两个代码实体之间存在演化耦合(evolutionary coupling)是指在软件修订历史中,这对实体倾向于共同变更(共变).已有的演化耦合分析方法难以准确检测软件维护历史中频繁发生的、有“距离”的共变.为了解决这一问题,提出了基于关联规则挖掘、情节挖掘、潜在语义索引模型相结合的演化耦合分析方法(association rule,MINEPI and LSI based method,AR-MIM),以挖掘有“距离”的共同变更关系.实验收集了58个Python项目、242074条训练数据、330660条ground truth的数据集,与已有的4种baseline方法进行了比较,验证了AR-MIM的效果.结果表明:在预测共同变更候选项场景上,AR-MIM的准确性、召回率、F1分数均优于已有方法.展开更多
文摘Following the expanding of VSM and LSI, a text classification based on Concept Space is proposed in thispaper. Information gaining is applied to acquire concepts based on large training set. Concept Space is built by acquir-ing latent semantic indexing data, building a latent semantic space by LSI, and then adding the class-basis vector. Thecalculating method of the word-similarity, the text-similarity, the similarity of the text vector and the class-basis vec-tor in Concept Space are presented. Experiment results show the Concept Space method is superior to Vector SpaceModel. This paper also discusses the future work the problem of concept space learning.
基金the National Natural Science Foundation of China(Nos.61073193 and 61300230)the Key Science and Technology Foundation of Gansu Province(No.1102FKDA010)+1 种基金the Natural Science Foundation of Gansu Province(No.1107RJZA188)the Science and Technology Support Program of Gansu Province(No.1104GKCA037)
文摘With the purpose of improving the accuracy of text categorization and reducing the dimension of the feature space,this paper proposes a two-stage feature selection method based on a novel category correlation degree(CCD)method and latent semantic indexing(LSI).In the first stage,a novel CCD method is proposed to select the most effective features for text classification,which is more effective than the traditional feature selection method.In the second stage,document representation requires a high dimensionality of the feature space and does not take into account the semantic relation between features,which leads to a poor categorization accuracy.So LSI method is proposed to solve these problems by using statistically derived conceptual indices to replace the individual terms which can discover the important correlative relationship between features and reduce the feature space dimension.Firstly,each feature in our algorithm is ranked depending on their importance of classification using CCD method.Secondly,we construct a new semantic space based on LSI method among features.The experimental results have proved that our method can reduce effectively the dimension of text vector and improve the performance of text categorization.
文摘软件系统的实体演化耦合分析有助于共同变更预测、软件供应链风险识别、代码漏洞溯源、缺陷预测、架构问题定位等分析活动.两个代码实体之间存在演化耦合(evolutionary coupling)是指在软件修订历史中,这对实体倾向于共同变更(共变).已有的演化耦合分析方法难以准确检测软件维护历史中频繁发生的、有“距离”的共变.为了解决这一问题,提出了基于关联规则挖掘、情节挖掘、潜在语义索引模型相结合的演化耦合分析方法(association rule,MINEPI and LSI based method,AR-MIM),以挖掘有“距离”的共同变更关系.实验收集了58个Python项目、242074条训练数据、330660条ground truth的数据集,与已有的4种baseline方法进行了比较,验证了AR-MIM的效果.结果表明:在预测共同变更候选项场景上,AR-MIM的准确性、召回率、F1分数均优于已有方法.