微博情感分析是社会媒体挖掘中的重要任务之一,在恐怖组织识别、个性化推荐、舆情分析等方面具有重要的理论和应用价值.但与传统文本数据不同,微博消息短小而凌乱,包含着大量诸如微博表情符号之类的特有信息,同时微博情感是与其讨论主...微博情感分析是社会媒体挖掘中的重要任务之一,在恐怖组织识别、个性化推荐、舆情分析等方面具有重要的理论和应用价值.但与传统文本数据不同,微博消息短小而凌乱,包含着大量诸如微博表情符号之类的特有信息,同时微博情感是与其讨论主题是密切相关的.多数现有的微博情感分析方法都没有将微博主题与微博情感进行协同分析,或者在微博主题情感分析过程中没有考虑将用户关系、用户性格情绪等特征数据,从而导致微博情感分析与主题检测的效果难尽人意.为此,提出了一个基于多特征融合的微博主题情感挖掘模型TSMMF(Topic Sentiment Model based on Multi-feature Fusion),该模型将情感表情符号与微博用户性格情绪特征纳入到图模型LDA中实现微博主题与情感的同步推导.实验结果表明,与当前用于短文本情感主题挖掘的最优模型(JST,SLDA与DPLDA)相比较,TSMMF具有更优的微博主题情感检测性能.展开更多
微博实体链接是把微博中给定的指称链接到知识库的过程,广泛应用于信息抽取、自动问答等自然语言处理任务(Natural language processing,NLP).由于微博内容简短,传统长文本实体链接的算法并不能很好地用于微博实体链接任务.以往研究大...微博实体链接是把微博中给定的指称链接到知识库的过程,广泛应用于信息抽取、自动问答等自然语言处理任务(Natural language processing,NLP).由于微博内容简短,传统长文本实体链接的算法并不能很好地用于微博实体链接任务.以往研究大都基于实体指称及其上下文构建模型进行消歧,难以识别具有相似词汇和句法特征的候选实体.本文充分利用指称和候选实体本身所含有的语义信息,提出在词向量层面对任务进行抽象建模,并设计一种基于词向量语义分类的微博实体链接方法.首先通过神经网络训练词向量模板,然后通过实体聚类获得类别标签作为特征,再通过多分类模型预测目标实体的主题类别来完成实体消歧.在NLPCC2014公开评测数据集上的实验结果表明,本文方法的准确率和召回率均高于此前已报道的最佳结果,特别是实体链接准确率有显著提升.展开更多
In this research paper,we propose a corpus for the task of detecting religious extremism in social networks and open sources and compare various machine learning algorithms for the binary classification problem using ...In this research paper,we propose a corpus for the task of detecting religious extremism in social networks and open sources and compare various machine learning algorithms for the binary classification problem using a previously created corpus,thereby checking whether it is possible to detect extremist messages in the Kazakh language.To do this,the authors trained models using six classic machine-learning algorithms such as Support Vector Machine,Decision Tree,Random Forest,K Nearest Neighbors,Naive Bayes,and Logistic Regression.To increase the accuracy of detecting extremist texts,we used various characteristics such as Statistical Features,TF-IDF,POS,LIWC,and applied oversampling and undersampling techniques to handle imbalanced data.As a result,we achieved 98%accuracy in detecting religious extremism in Kazakh texts for the collected dataset.Testing the developed machine learningmodels in various databases that are often found in everyday life“Jokes”,“News”,“Toxic content”,“Spam”,“Advertising”has also shown high rates of extremism detection.展开更多
文摘微博情感分析是社会媒体挖掘中的重要任务之一,在恐怖组织识别、个性化推荐、舆情分析等方面具有重要的理论和应用价值.但与传统文本数据不同,微博消息短小而凌乱,包含着大量诸如微博表情符号之类的特有信息,同时微博情感是与其讨论主题是密切相关的.多数现有的微博情感分析方法都没有将微博主题与微博情感进行协同分析,或者在微博主题情感分析过程中没有考虑将用户关系、用户性格情绪等特征数据,从而导致微博情感分析与主题检测的效果难尽人意.为此,提出了一个基于多特征融合的微博主题情感挖掘模型TSMMF(Topic Sentiment Model based on Multi-feature Fusion),该模型将情感表情符号与微博用户性格情绪特征纳入到图模型LDA中实现微博主题与情感的同步推导.实验结果表明,与当前用于短文本情感主题挖掘的最优模型(JST,SLDA与DPLDA)相比较,TSMMF具有更优的微博主题情感检测性能.
文摘微博实体链接是把微博中给定的指称链接到知识库的过程,广泛应用于信息抽取、自动问答等自然语言处理任务(Natural language processing,NLP).由于微博内容简短,传统长文本实体链接的算法并不能很好地用于微博实体链接任务.以往研究大都基于实体指称及其上下文构建模型进行消歧,难以识别具有相似词汇和句法特征的候选实体.本文充分利用指称和候选实体本身所含有的语义信息,提出在词向量层面对任务进行抽象建模,并设计一种基于词向量语义分类的微博实体链接方法.首先通过神经网络训练词向量模板,然后通过实体聚类获得类别标签作为特征,再通过多分类模型预测目标实体的主题类别来完成实体消歧.在NLPCC2014公开评测数据集上的实验结果表明,本文方法的准确率和召回率均高于此前已报道的最佳结果,特别是实体链接准确率有显著提升.
基金This work was supported by the grant“Development of models,algorithms for semantic analysis to identify extremist content in web resources and creation the tool for cyber forensics”funded by the Ministry of Digital Development,Innovations and Aerospace industry of the Republic of Kazakhstan.Grant No.IRN AP06851248.Supervisor of the project is Shynar Mussiraliyeva,email:mussiraliyevash@gmail.com.
文摘In this research paper,we propose a corpus for the task of detecting religious extremism in social networks and open sources and compare various machine learning algorithms for the binary classification problem using a previously created corpus,thereby checking whether it is possible to detect extremist messages in the Kazakh language.To do this,the authors trained models using six classic machine-learning algorithms such as Support Vector Machine,Decision Tree,Random Forest,K Nearest Neighbors,Naive Bayes,and Logistic Regression.To increase the accuracy of detecting extremist texts,we used various characteristics such as Statistical Features,TF-IDF,POS,LIWC,and applied oversampling and undersampling techniques to handle imbalanced data.As a result,we achieved 98%accuracy in detecting religious extremism in Kazakh texts for the collected dataset.Testing the developed machine learningmodels in various databases that are often found in everyday life“Jokes”,“News”,“Toxic content”,“Spam”,“Advertising”has also shown high rates of extremism detection.