In generative dialog systems, learning representations for the dialog context is a crucial step in generating high quality responses. The dialog systems are required to capture useful and compact information from mutu...In generative dialog systems, learning representations for the dialog context is a crucial step in generating high quality responses. The dialog systems are required to capture useful and compact information from mutually dependent sentences such that the generation process can effectively attend to the central semantics. Unfortunately, existing methods may not effectively identify importance distributions for each lower position when computing an upper level feature, which may lead to the loss of information critical to the constitution of the final context representations. To address this issue, we propose a transfer learning based method named transfer hierarchical attention network(THAN). The THAN model can leverage useful prior knowledge from two related auxiliary tasks, i.e.,keyword extraction and sentence entailment, to facilitate the dialog representation learning for the main dialog generation task. During the transfer process, the syntactic structure and semantic relationship from the auxiliary tasks are distilled to enhance both the wordlevel and sentence-level attention mechanisms for the dialog system. Empirically, extensive experiments on the Twitter Dialog Corpus and the PERSONA-CHAT dataset demonstrate the effectiveness of the proposed THAN model compared with the state-of-the-art methods.展开更多
The cybersecurity report provides unstructured actionable cyber threat intelligence(CTI)with detailed threat attack procedures and indicators of compromise(IOCs),e.g.,malware hash or URL(uniform resource locator)of co...The cybersecurity report provides unstructured actionable cyber threat intelligence(CTI)with detailed threat attack procedures and indicators of compromise(IOCs),e.g.,malware hash or URL(uniform resource locator)of command and control server.The actionable CTI,integrated into intrusion detection systems,can not only prioritize the most urgent threats based on the campaign stages of attack vectors(i.e.,IOCs)but also take appropriate mitigation measures based on contextual information of the alerts.However,the dramatic growth in the number of cybersecurity reports makes it nearly impossible for security professionals to find an efficient way to use these massive amounts of threat intelligence.In this paper,we propose a trigger-enhanced actionable CTI discovery system(TriCTI)to portray a relationship between IOCs and campaign stages and generate actionable CTI from cybersecurity reports through natural language processing(NLP)technology.Specifically,we introduce the“campaign trigger”for an effective explanation of the campaign stages to improve the performance of the classification model.The campaign trigger phrases are the keywords in the sentence that imply the campaign stage.The trained final trigger vectors have similar space representations with the keywords in the unseen sentence and will help correct classification by increasing the weight of the keywords.We also meticulously devise a data augmentation specifically for cybersecurity training sets to cope with the challenge of the scarcity of annotation data sets.Compared with state-of-the-art text classification models,such as BERT,the trigger-enhanced classification model has better performance with accuracy(86.99%)and F1 score(87.02%).We run TriCTI on more than 29k cybersecurity reports,from which we automatically and efficiently collect 113,543 actionable CTI.In particular,we verify the actionability of discovered CTI by using large-scale field data from VirusTotal(VT).The results demonstrate that the threat intelligence provided by VT lacks a part of 展开更多
TTPs (Tactics, Techniques, and Procedures), which represent an attacker’s goals and methods, are the long period and essential feature of the attacker. Defenders can use TTP intelligence to perform the penetration te...TTPs (Tactics, Techniques, and Procedures), which represent an attacker’s goals and methods, are the long period and essential feature of the attacker. Defenders can use TTP intelligence to perform the penetration test and compensate for defense deficiency. However, most TTP intelligence is described in unstructured threat data, such as APT analysis reports. Manually converting natural language TTPs descriptions to standard TTP names, such as ATT&CK TTP names and IDs, is time-consuming and requires deep expertise. In this paper, we define the TTP classification task as a sentence classification task. We annotate a new sentence-level TTP dataset with 6 categories and 6061 TTP descriptions from 10761 security analysis reports. We construct a threat context-enhanced TTP intelligence mining (TIM) framework to mine TTP intelligence from unstructured threat data. The TIM framework uses TCENet (Threat Context Enhanced Network) to find and classify TTP descriptions, which we define as three continuous sentences, from textual data. Meanwhile, we use the element features of TTP in the descriptions to enhance the TTPs classification accuracy of TCENet. The evaluation result shows that the average classification accuracy of our proposed method on the 6 TTP categories reaches 0.941. The evaluation results also show that adding TTP element features can improve our classification accuracy compared to using only text features. TCENet also achieved the best results compared to the previous document-level TTP classification works and other popular text classification methods, even in the case of few-shot training samples. Finally, the TIM framework organizes TTP descriptions and TTP elements into STIX 2.1 format as final TTP intelligence for sharing the long-period and essential attack behavior characteristics of attackers. In addition, we transform TTP intelligence into sigma detection rules for attack behavior detection. Such TTP intelligence and rules can help defenders deploy long-term effective threat detection and perform 展开更多
Offensive messages on social media,have recently been frequently used to harass and criticize people.In recent studies,many promising algorithms have been developed to identify offensive texts.Most algorithms analyze ...Offensive messages on social media,have recently been frequently used to harass and criticize people.In recent studies,many promising algorithms have been developed to identify offensive texts.Most algorithms analyze text in a unidirectional manner,where a bidirectional method can maximize performance results and capture semantic and contextual information in sentences.In addition,there are many separate models for identifying offensive texts based on monolin-gual and multilingual,but there are a few models that can detect both monolingual and multilingual-based offensive texts.In this study,a detection system has been developed for both monolingual and multilingual offensive texts by combining deep convolutional neural network and bidirectional encoder representations from transformers(Deep-BERT)to identify offensive posts on social media that are used to harass others.This paper explores a variety of ways to deal with multilin-gualism,including collaborative multilingual and translation-based approaches.Then,the Deep-BERT is tested on the Bengali and English datasets,including the different bidirectional encoder representations from transformers(BERT)pre-trained word-embedding techniques,and found that the proposed Deep-BERT’s efficacy outperformed all existing offensive text classification algorithms reaching an accuracy of 91.83%.The proposed model is a state-of-the-art model that can classify both monolingual-based and multilingual-based offensive texts.展开更多
The conversation machine comprehension(MC)task aims to answer questions in the multi-turn conversation for a single passage.However,recent approaches don’t exploit information from historical conversations effectivel...The conversation machine comprehension(MC)task aims to answer questions in the multi-turn conversation for a single passage.However,recent approaches don’t exploit information from historical conversations effectively,which results in some references and ellipsis in the current question cannot be recognized.In addition,these methods do not consider the rich semantic relationships between words when reasoning about the passage text.In this paper,we propose a novel model GraphFlow+,which constructs a context graph for each conversation turn and uses a unique recurrent graph neural network(GNN)to model the temporal dependencies between the context graphs of each turn.Specifically,we exploit three different ways to construct text graphs,including the dynamic graph,static graph,and hybrid graph that combines the two.Our experiments on CoQA,QuAC and DoQA show that the GraphFlow+model can outperform the state-of-the-art approaches.展开更多
As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects in...As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT’s applicability in the NLP domain.This review paper systematically investigates the role of ChatGPT in diverse NLP tasks,including information extraction,Name Entity Recognition(NER),event extraction,relation extraction,Part of Speech(PoS)tagging,text classification,sentiment analysis,emotion recognition and text annotation.The novelty of this work lies in its comprehensive analysis of the existing literature,addressing a critical gap in understanding ChatGPT’s adaptability,limitations,and optimal application.In this paper,we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses(PRISMA)framework to direct our search process and seek relevant studies.Our review reveals ChatGPT’s significant potential in enhancing various NLP tasks.Its adaptability in information extraction tasks,sentiment analysis,and text classification showcases its ability to comprehend diverse contexts and extract meaningful details.Additionally,ChatGPT’s flexibility in annotation tasks reducesmanual efforts and accelerates the annotation process,making it a valuable asset in NLP development and research.Furthermore,GPT-4 and prompt engineering emerge as a complementary mechanism,empowering users to guide the model and enhance overall accuracy.Despite its promising potential,challenges persist.The performance of ChatGP Tneeds tobe testedusingmore extensivedatasets anddiversedata structures.Subsequently,its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.展开更多
Cybercriminals often use fraudulent emails and fictitious email accounts to deceive individuals into disclosing confidential information,a practice known as phishing.This study utilizes three distinct methodologies,Te...Cybercriminals often use fraudulent emails and fictitious email accounts to deceive individuals into disclosing confidential information,a practice known as phishing.This study utilizes three distinct methodologies,Term Frequency-Inverse Document Frequency,Word2Vec,and Bidirectional Encoder Representations from Transform-ers,to evaluate the effectiveness of various machine learning algorithms in detecting phishing attacks.The study uses feature extraction methods to assess the performance of Logistic Regression,Decision Tree,Random Forest,and Multilayer Perceptron algorithms.The best results for each classifier using Term Frequency-Inverse Document Frequency were Multilayer Perceptron(Precision:0.98,Recall:0.98,F1-score:0.98,Accuracy:0.98).Word2Vec’s best results were Multilayer Perceptron(Precision:0.98,Recall:0.98,F1-score:0.98,Accuracy:0.98).The highest performance was achieved using the Bidirectional Encoder Representations from the Transformers model,with Precision,Recall,F1-score,and Accuracy all reaching 0.99.This study highlights how advanced pre-trained models,such as Bidirectional Encoder Representations from Transformers,can significantly enhance the accuracy and reliability of fraud detection systems.展开更多
Entity linking refers to linking a string in a text to corresponding entities in a knowledge base through candidate entity generation and candidate entity ranking.It is of great significance to some NLP(natural langua...Entity linking refers to linking a string in a text to corresponding entities in a knowledge base through candidate entity generation and candidate entity ranking.It is of great significance to some NLP(natural language processing)tasks,such as question answering.Unlike English entity linking,Chinese entity linking requires more consideration due to the lack of spacing and capitalization in text sequences and the ambiguity of characters and words,which is more evident in certain scenarios.In Chinese domains,such as industry,the generated candidate entities are usually composed of long strings and are heavily nested.In addition,the meanings of the words that make up industrial entities are sometimes ambiguous.Their semantic space is a subspace of the general word embedding space,and thus each entity word needs to get its exact meanings.Therefore,we propose two schemes to achieve better Chinese entity linking.First,we implement an ngram based candidate entity generation method to increase the recall rate and reduce the nesting noise.Then,we enhance the corresponding candidate entity ranking mechanism by introducing sense embedding.Considering the contradiction between the ambiguity of word vectors and the single sense of the industrial domain,we design a sense embedding model based on graph clustering,which adopts an unsupervised approach for word sense induction and learns sense representation in conjunction with context.We test the embedding quality of our approach on classical datasets and demonstrate its disambiguation ability in general scenarios.We confirm that our method can better learn candidate entities’fundamental laws in the industrial domain and achieve better performance on entity linking through experiments.展开更多
Objective This study aimed to examine and propagate the medication experience and group formula of traditional Chinese medicine(TCM)Master XIONG Jibo in diagnosing and treat-ing arthralgia syndrome(AS)through data min...Objective This study aimed to examine and propagate the medication experience and group formula of traditional Chinese medicine(TCM)Master XIONG Jibo in diagnosing and treat-ing arthralgia syndrome(AS)through data mining.Methods Data of outpatient cases of Professor XIONG Jibo were collected from January 1,2014 to December 31,2018,along with cases recorded in A Real Famous Traditional Chinese Medicine Doctor:XIONG Jibo's Clinical Medical Record 1,which was published in December 2019.The five variables collected from the patients’data were TCM diagnostic information,TCM and western medicine diagnoses,syndrome,treatment,and prescription.A database was established for the collected data with Excel.Using the Python environment,a custom-ized modified natural language processing(NLP)model for the diagnosis and treatment of AS by Professor XIONG Jibo was established to preprocess the data and to analyze the word cloud.Frequency analysis,association rule analysis,cluster analysis,and visual analysis of AS cases were performed based on the Traditional Chinese Medicine Inheritance Computing Platform(V3.0)and RStudio(V4.0.3).Results A total of 610 medical records of Professor XIONG Jibo were collected from the case database.A total of 103 medical records were included after data screening criteria,which comprised 187 times(45 kinds)of prescriptions and 1506 times(125 kinds)of Chinese herbs.The main related meridians were the liver,spleen,and kidney meridians.The properties of Chinese herbs used most were mainly warm,flat,and cold,while the flavors of herbs were mainly bitter,pungent,and sweet.The main patterns of AS included the damp heat,phlegm stasis,and neck arthralgia.The most commonly used herbs for AS were Chuanniuxi(Cyathu-lae Radix),Huangbo(Phellodendri Chinensis Cortex),Cangzhu(Atractylodis Rhizoma),Qinjiao(Gentianae Macrophyllae Radix),Gancao(Glycyrrhizae Radix et Rhizoma),Huangqi(Astragali Radix),and Chuanxiong(Chuanxiong Rhizoma).The most common effect of the herbs was“promoting blood circulation and removin展开更多
The pre-training-then-fine-tuning paradigm has been widely used in deep learning.Due to the huge computation cost for pre-training,practitioners usually download pre-trained models from the Internet and fine-tune them...The pre-training-then-fine-tuning paradigm has been widely used in deep learning.Due to the huge computation cost for pre-training,practitioners usually download pre-trained models from the Internet and fine-tune them on downstream datasets,while the downloaded models may suffer backdoor attacks.Different from previous attacks aiming at a target task,we show that a backdoored pre-trained model can behave maliciously in various downstream tasks without foreknowing task information.Attackers can restrict the output representations(the values of output neurons)of trigger-embedded samples to arbitrary predefined values through additional training,namely neuron-level backdoor attack(NeuBA).Since fine-tuning has little effect on model parameters,the fine-tuned model will retain the backdoor functionality and predict a specific label for the samples embedded with the same trigger.To provoke multiple labels in a specific task,attackers can introduce several triggers with predefined contrastive values.In the experiments of both natural language processing(NLP)and computer vision(CV),we show that NeuBA can well control the predictions for trigger-embedded instances with different trigger designs.Our findings sound a red alarm for the wide use of pre-trained models.Finally,we apply several defense methods to NeuBA and find that model pruning is a promising technique to resist NeuBA by omitting backdoored neurons.展开更多
Fine-tuning pre-trained language models like BERT have become an effective way in natural language processing(NLP)and yield state-of-the-art results on many downstream tasks.Recent studies on adapting BERT to new task...Fine-tuning pre-trained language models like BERT have become an effective way in natural language processing(NLP)and yield state-of-the-art results on many downstream tasks.Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure,re-designing the pre-training tasks,and leveraging external data and knowledge.The fine-tuning strategy itself has yet to be fully explored.In this paper,we improve the fine-tuning of BERT with two effective mechanisms:self-ensemble and self-distillation.The self-ensemble mechanism utilizes the checkpoints from an experience pool to integrate the teacher model.In order to transfer knowledge from the teacher model to the student model efficiently,we further use knowledge distillation,which is called self-distillation because the distillation comes from the model itself through the time dimension.Experiments on the GLUE benchmark and the Text Classification benchmark show that our proposed approach can significantly improve the adaption of BERT without any external data or knowledge.We conduct exhaustive experiments to investigate the efficiency of the self-ensemble and self-distillation mechanisms,and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.展开更多
Background:In this investigation,we explore the literature regarding neuroregeneration from the 1700s to the present.The regeneration of central nervous system neurons or the regeneration of axons from cell bodies and...Background:In this investigation,we explore the literature regarding neuroregeneration from the 1700s to the present.The regeneration of central nervous system neurons or the regeneration of axons from cell bodies and their reconnection with other neurons remains a major hurdle.Injuries relating to war and accidents attracted medical professionals throughout early history to regenerate and reconnect nerves.Early literature till 1990 lacked specific molecular details and is likely provide some clues to conditions that promoted neuron and/or axon regeneration.This is an avenue for the application of natural language processing(NLP)to gain actionable intelligence.Post 1990 period saw an explosion of all molecular details.With the advent of genomic,transcriptomics,proteomics,and other omics-there is an emergence of big data sets and is another rich area for application of NLP.How the neuron and/or axon regeneration related keywords have changed over the years is a first step towards this endeavor.Methods:Specifically,this article curates over 600 published works in the field of neuroregeneration.We then apply a dynamic topic modeling algorithm based on the Latent Dirichlet allocation(LDA)algorithm to assess how topics cluster based on topics.Results:Based on how documents are assigned to topics,we then build a recommendation engine to assist researchers to access domain-specific literature based on how their search text matches to recommended document topics.The interface further includes interactive topic visualizations for researchers to understand how topics grow closer and further apart,and how intra-topic composition changes over time.Conclusions:We present a recommendation engine and interactive interface that enables dynamic topic modeling for neuronal regeneration.展开更多
Current research on metaphor analysis is generally knowledge-based and corpus-based,which calls for methods of automatic feature extraction and weight calculation.Combining natural language processing(NLP),latent sema...Current research on metaphor analysis is generally knowledge-based and corpus-based,which calls for methods of automatic feature extraction and weight calculation.Combining natural language processing(NLP),latent semantic analysis(LSA),and Pearson correlation coefficient,this paper proposes a metaphor analysis method for extracting the content words from both literal and metaphorical corpus,calculating correlation degree,and analyzing their relationships.The value of the proposed method was demonstrated through a case study by using a corpus with keyword“飞翔(fly)”.When compared with the method of Pearson correlation coefficient,the experiment shows that the LSA can produce better results with greater significance in correlation degree.It is also found that the number of common words that appeared in both literal and metaphorical word bags decreased with the correlation degree.The case study also revealed that there are more nouns appear in literal corpus,and more adjectives and adverbs appear in metaphorical corpus.The method proposed will benefit NLP researchers to develop the required step-by-step calculation tools for accurate quantitative analysis.展开更多
The Memorable Tourist Experience(MTE)is a scientific concept within the studies on tourism that is developed based on several related constructions:Perceived Confidence,Sincerity,Authenticity,and Satisfaction.This wor...The Memorable Tourist Experience(MTE)is a scientific concept within the studies on tourism that is developed based on several related constructions:Perceived Confidence,Sincerity,Authenticity,and Satisfaction.This work takes this model established by the work of Dr.Babak Taheri in 2018 on Monuments World Heritage of UNESCO,adopting an alternative data collection method to the face-to-face survey.Therefore,this work takes as a source of data the reviews collected in the recommendation platform TripAdvisor,working the same constructions of the MTE,with the collection of similar terms and the relationships between them.In order to highlight the terms,a first step is established with the use of Natural Language Processing(NLP),followed by the use of Machine Learning(ML)techniques to generate the relationships between the constructors defined in the models.The study makes a comparison using the method,in immaterial nature such as a flamenco show in the city of Seville;Flamenco has been declared by UNESCO an intangible World Heritage Site since 2010.The results of the study go in two directions:on the one hand to find similarities in the study of the specific MTE of both monuments with the hypotheses worked in the original model of Taheri.In addition to highlighting possible distinctive elements of each case and,and furthermore within the value contribution of the visit when it is led by an official tour guide,on the other hand,give presence to the model of obtaining data by reviews as a complementary data source of any tourist study.The data collection and analysis from both NLP and ML techniques permit the scientific study and the tourist operators to develop better value propositions to users and understanding of heterogeneous behaviors in the tourism industry.The study of reviews within the MTE allows identifying the stimulus that leads the user to choose an activity and hire it.These studies are extendable to other industries and business models,given the importance that references acquire within the consumer will展开更多
Purpose:Patent classification is one of the areas in Intellectual Property Analytics(IPA),and a growing use case since the number of patent applications has been increasing worldwide.We propose using machine learning ...Purpose:Patent classification is one of the areas in Intellectual Property Analytics(IPA),and a growing use case since the number of patent applications has been increasing worldwide.We propose using machine learning algorithms to classify Portuguese patents and evaluate the performance of transfer learning methodologies to solve this task.Design/methodology/approach:We applied three different approaches in this paper.First,we used a dataset available by INPI to explore traditional machine learning algorithms and ensemble methods.After preprocessing data by applying TF-IDF,FastText and Doc2Vec,the models were evaluated by cross-validation in 5 folds.In a second approach,we used two different Neural Networks architectures,a Convolutional Neural Network(CNN)and a bi-directional Long Short-Term Memory(BiLSTM).Finally,we used pre-trained BERT,DistilBERT,and ULMFiT models in the third approach.Findings:BERTTimbau,a BERT architecture model pre-trained on a large Portuguese corpus,presented the best results for the task,even though with a performance of only 4%superior to a LinearSVC model using TF-IDF feature engineering.Research limitations:The dataset was highly imbalanced,as usual in patent applications,so the classes with the lowest samples were expected to present the worst performance.That result happened in some cases,especially in classes with less than 60 training samples.Practical implications:Patent classification is challenging because of the hierarchical classification system,the context overlap,and the underrepresentation of the classes.However,the final model presented an acceptable performance given the size of the dataset and the task complexity.This model can support the decision and improve the time by proposing a category in the second level of ICP,which is one of the critical phases of the grant patent process.Originality/value:To our knowledge,the proposed models were never implemented for Portuguese patent classification.展开更多
For projects with thousands of files, finding the locations of bugs is time-consuming and labor-intensive. Bug reports as a potential resource to help locate bugs in source codes have been used to design automatic too...For projects with thousands of files, finding the locations of bugs is time-consuming and labor-intensive. Bug reports as a potential resource to help locate bugs in source codes have been used to design automatic tools to solve this problem. Existing information retrieval(IR)-based bug localization methods rely heavily on the similarity score between bug report and historical reports. As deep learning methods show great advantages in calculating text semantic similarity, we adapt the transformer network with IR-based bug localization methods to design a novel approach, TSLocator, to bug localization. In TSLocator, we propose five new features between bug reports and source codes. We use SVMRank to model the relation between all the six features and the actual buggy file. Given a new bug report, TSLocator automatically calculates the features and linearly weights the features to produce a suspicious score for all candidate files. TSLocator recommends a list of suspicious buggy files ranked by the score. The experimental results show that TSLocator outperforms existing methods in accuracy and performance of bug localization.展开更多
Language disorder,a common manifestation of Alzheimer’s disease(AD),has attracted widespread attention in recent years.This paper uses a novel natural language processing(NLP)method,compared with latest deep learning...Language disorder,a common manifestation of Alzheimer’s disease(AD),has attracted widespread attention in recent years.This paper uses a novel natural language processing(NLP)method,compared with latest deep learning technology,to detect AD and explore the lexical performance.Our proposed approach is based on two stages.First,the dialogue contents are summarized into two categories with the same category.Second,term frequency—inverse document frequency(TF-IDF)algorithm is used to extract the keywords of transcripts,and the similarity of keywords between the groups was calculated separately by cosine distance.Several deep learning methods are used to compare the performance.In the meanwhile,keywords with the best performance are used to analyze AD patients’lexical performance.In the Predictive Challenge of Alzheimer’s Disease held by iFlytek in 2019,the proposed AD diagnosis model achieves a better performance in binary classification by adjusting the number of keywords.The F1 score of the model has a considerable improvement over the baseline of 75.4%,and the training process of which is simple and efficient.We analyze the keywords of the model and find that AD patients use less noun and verb than normal controls.A computer-assisted AD diagnosis model on small Chinese dataset is proposed in this paper,which provides a potential way for assisting diagnosis of AD and analyzing lexical performance in clinical setting.展开更多
Machine Learning is revolutionizing the era day by day and the scope is no more limited to computer science as the advancements are evident in the field of healthcare.Disease diagnosis,personalized medicine,and Recomm...Machine Learning is revolutionizing the era day by day and the scope is no more limited to computer science as the advancements are evident in the field of healthcare.Disease diagnosis,personalized medicine,and Recommendation system(RS)are among the promising applications that are using Machine Learning(ML)at a higher level.A recommendation system helps inefficient decision-making and suggests personalized recommendations accordingly.Today people share their experiences through reviews and hence designing of recommendation system based on users’sentiments is a challenge.The recommendation system has gained significant attention in different fields but considering healthcare,little is being done from the perspective of drugs,disease,and medical recommendations.This study is engrossed in designing a recommendation system that is based on the fusion of sentiment analysis and radiant boosting.The polarity of the sentiments is analyzed through user reviews and the processed data is fed into the Extreme Gradient Boosting(XGBOOST)framework to generate the drug recommendation.To establish the applicability of the concept a comparative study is performed between the proposed approach and the existing approaches.展开更多
This research examines industry-based dissertation research in a doctoralcomputing program through the lens of machine learning algorithms todetermine if natural language processing-based categorization on abstractsal...This research examines industry-based dissertation research in a doctoralcomputing program through the lens of machine learning algorithms todetermine if natural language processing-based categorization on abstractsalone is adequate for classification. This research categorizes dissertationby both their abstracts and by their full-text using the GraphLabCreate library from Apple’s Turi to identify if abstract analysis is anadequate measure of content categorization, which we found was not. Wealso compare the dissertation categorizations using IBM’s Watson Discoverydeep machine learning tool. Our research provides perspectiveson the practicality of the manual classification of technical documents;and, it provides insights into the: (1) categories of academic work createdby experienced fulltime working professionals in a Computing doctoralprogram, (2) viability and performance of automated categorization of theabstract analysis against the fulltext dissertation analysis, and (3) natuallanguage processing versus human manual text classification abstraction.展开更多
UML Class diagram generation from textual requirements is an important task in object-oriented design and programing course.This study proposes a method for automatically generating class diagrams from Chinese textual...UML Class diagram generation from textual requirements is an important task in object-oriented design and programing course.This study proposes a method for automatically generating class diagrams from Chinese textual requirements on the basis of Natural Language Processing(NLP)and mapping rules for sentence pattern matching.First,classes are identified through entity recognition rules and candidate class pruning rules using NLP from requirements.Second,class attributes and relationships between classes are extracted using mapping rules for sentence pattern matching on the basis of NLP.Third,we developed an assistant tool integrated into a precision micro classroom system for automatic generation of class diagram,to effectively assist the teaching of object-oriented design and programing course.Results are evaluated with precision,accuracy and recall from eight requirements of object-oriented design and programing course using truth values created by teachers.Our research should benefit beginners of object-oriented design and programing course,who may be students or software developers.It helps them to create correct domain models represented in the UML class diagram.展开更多
文摘In generative dialog systems, learning representations for the dialog context is a crucial step in generating high quality responses. The dialog systems are required to capture useful and compact information from mutually dependent sentences such that the generation process can effectively attend to the central semantics. Unfortunately, existing methods may not effectively identify importance distributions for each lower position when computing an upper level feature, which may lead to the loss of information critical to the constitution of the final context representations. To address this issue, we propose a transfer learning based method named transfer hierarchical attention network(THAN). The THAN model can leverage useful prior knowledge from two related auxiliary tasks, i.e.,keyword extraction and sentence entailment, to facilitate the dialog representation learning for the main dialog generation task. During the transfer process, the syntactic structure and semantic relationship from the auxiliary tasks are distilled to enhance both the wordlevel and sentence-level attention mechanisms for the dialog system. Empirically, extensive experiments on the Twitter Dialog Corpus and the PERSONA-CHAT dataset demonstrate the effectiveness of the proposed THAN model compared with the state-of-the-art methods.
基金Our research was supported by the National Key Research and Development Program of China(Nos.2019QY1301,2018YFB0805005,2018YFC0824801).
文摘The cybersecurity report provides unstructured actionable cyber threat intelligence(CTI)with detailed threat attack procedures and indicators of compromise(IOCs),e.g.,malware hash or URL(uniform resource locator)of command and control server.The actionable CTI,integrated into intrusion detection systems,can not only prioritize the most urgent threats based on the campaign stages of attack vectors(i.e.,IOCs)but also take appropriate mitigation measures based on contextual information of the alerts.However,the dramatic growth in the number of cybersecurity reports makes it nearly impossible for security professionals to find an efficient way to use these massive amounts of threat intelligence.In this paper,we propose a trigger-enhanced actionable CTI discovery system(TriCTI)to portray a relationship between IOCs and campaign stages and generate actionable CTI from cybersecurity reports through natural language processing(NLP)technology.Specifically,we introduce the“campaign trigger”for an effective explanation of the campaign stages to improve the performance of the classification model.The campaign trigger phrases are the keywords in the sentence that imply the campaign stage.The trained final trigger vectors have similar space representations with the keywords in the unseen sentence and will help correct classification by increasing the weight of the keywords.We also meticulously devise a data augmentation specifically for cybersecurity training sets to cope with the challenge of the scarcity of annotation data sets.Compared with state-of-the-art text classification models,such as BERT,the trigger-enhanced classification model has better performance with accuracy(86.99%)and F1 score(87.02%).We run TriCTI on more than 29k cybersecurity reports,from which we automatically and efficiently collect 113,543 actionable CTI.In particular,we verify the actionability of discovered CTI by using large-scale field data from VirusTotal(VT).The results demonstrate that the threat intelligence provided by VT lacks a part of
基金Our research was supported by the National Key Research and Development Program of China(Grant No.2018YFC0824801,No.2019QY1302)the National Natural Science Foundation of China(No.61802404).
文摘TTPs (Tactics, Techniques, and Procedures), which represent an attacker’s goals and methods, are the long period and essential feature of the attacker. Defenders can use TTP intelligence to perform the penetration test and compensate for defense deficiency. However, most TTP intelligence is described in unstructured threat data, such as APT analysis reports. Manually converting natural language TTPs descriptions to standard TTP names, such as ATT&CK TTP names and IDs, is time-consuming and requires deep expertise. In this paper, we define the TTP classification task as a sentence classification task. We annotate a new sentence-level TTP dataset with 6 categories and 6061 TTP descriptions from 10761 security analysis reports. We construct a threat context-enhanced TTP intelligence mining (TIM) framework to mine TTP intelligence from unstructured threat data. The TIM framework uses TCENet (Threat Context Enhanced Network) to find and classify TTP descriptions, which we define as three continuous sentences, from textual data. Meanwhile, we use the element features of TTP in the descriptions to enhance the TTPs classification accuracy of TCENet. The evaluation result shows that the average classification accuracy of our proposed method on the 6 TTP categories reaches 0.941. The evaluation results also show that adding TTP element features can improve our classification accuracy compared to using only text features. TCENet also achieved the best results compared to the previous document-level TTP classification works and other popular text classification methods, even in the case of few-shot training samples. Finally, the TIM framework organizes TTP descriptions and TTP elements into STIX 2.1 format as final TTP intelligence for sharing the long-period and essential attack behavior characteristics of attackers. In addition, we transform TTP intelligence into sigma detection rules for attack behavior detection. Such TTP intelligence and rules can help defenders deploy long-term effective threat detection and perform
文摘Offensive messages on social media,have recently been frequently used to harass and criticize people.In recent studies,many promising algorithms have been developed to identify offensive texts.Most algorithms analyze text in a unidirectional manner,where a bidirectional method can maximize performance results and capture semantic and contextual information in sentences.In addition,there are many separate models for identifying offensive texts based on monolin-gual and multilingual,but there are a few models that can detect both monolingual and multilingual-based offensive texts.In this study,a detection system has been developed for both monolingual and multilingual offensive texts by combining deep convolutional neural network and bidirectional encoder representations from transformers(Deep-BERT)to identify offensive posts on social media that are used to harass others.This paper explores a variety of ways to deal with multilin-gualism,including collaborative multilingual and translation-based approaches.Then,the Deep-BERT is tested on the Bengali and English datasets,including the different bidirectional encoder representations from transformers(BERT)pre-trained word-embedding techniques,and found that the proposed Deep-BERT’s efficacy outperformed all existing offensive text classification algorithms reaching an accuracy of 91.83%.The proposed model is a state-of-the-art model that can classify both monolingual-based and multilingual-based offensive texts.
文摘The conversation machine comprehension(MC)task aims to answer questions in the multi-turn conversation for a single passage.However,recent approaches don’t exploit information from historical conversations effectively,which results in some references and ellipsis in the current question cannot be recognized.In addition,these methods do not consider the rich semantic relationships between words when reasoning about the passage text.In this paper,we propose a novel model GraphFlow+,which constructs a context graph for each conversation turn and uses a unique recurrent graph neural network(GNN)to model the temporal dependencies between the context graphs of each turn.Specifically,we exploit three different ways to construct text graphs,including the dynamic graph,static graph,and hybrid graph that combines the two.Our experiments on CoQA,QuAC and DoQA show that the GraphFlow+model can outperform the state-of-the-art approaches.
文摘As Natural Language Processing(NLP)continues to advance,driven by the emergence of sophisticated large language models such as ChatGPT,there has been a notable growth in research activity.This rapid uptake reflects increasing interest in the field and induces critical inquiries into ChatGPT’s applicability in the NLP domain.This review paper systematically investigates the role of ChatGPT in diverse NLP tasks,including information extraction,Name Entity Recognition(NER),event extraction,relation extraction,Part of Speech(PoS)tagging,text classification,sentiment analysis,emotion recognition and text annotation.The novelty of this work lies in its comprehensive analysis of the existing literature,addressing a critical gap in understanding ChatGPT’s adaptability,limitations,and optimal application.In this paper,we employed a systematic stepwise approach following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses(PRISMA)framework to direct our search process and seek relevant studies.Our review reveals ChatGPT’s significant potential in enhancing various NLP tasks.Its adaptability in information extraction tasks,sentiment analysis,and text classification showcases its ability to comprehend diverse contexts and extract meaningful details.Additionally,ChatGPT’s flexibility in annotation tasks reducesmanual efforts and accelerates the annotation process,making it a valuable asset in NLP development and research.Furthermore,GPT-4 and prompt engineering emerge as a complementary mechanism,empowering users to guide the model and enhance overall accuracy.Despite its promising potential,challenges persist.The performance of ChatGP Tneeds tobe testedusingmore extensivedatasets anddiversedata structures.Subsequently,its limitations in handling domain-specific language and the need for fine-tuning in specific applications highlight the importance of further investigations to address these issues.
文摘Cybercriminals often use fraudulent emails and fictitious email accounts to deceive individuals into disclosing confidential information,a practice known as phishing.This study utilizes three distinct methodologies,Term Frequency-Inverse Document Frequency,Word2Vec,and Bidirectional Encoder Representations from Transform-ers,to evaluate the effectiveness of various machine learning algorithms in detecting phishing attacks.The study uses feature extraction methods to assess the performance of Logistic Regression,Decision Tree,Random Forest,and Multilayer Perceptron algorithms.The best results for each classifier using Term Frequency-Inverse Document Frequency were Multilayer Perceptron(Precision:0.98,Recall:0.98,F1-score:0.98,Accuracy:0.98).Word2Vec’s best results were Multilayer Perceptron(Precision:0.98,Recall:0.98,F1-score:0.98,Accuracy:0.98).The highest performance was achieved using the Bidirectional Encoder Representations from the Transformers model,with Precision,Recall,F1-score,and Accuracy all reaching 0.99.This study highlights how advanced pre-trained models,such as Bidirectional Encoder Representations from Transformers,can significantly enhance the accuracy and reliability of fraud detection systems.
基金supported by the National Natural Science Foundation of China under Grant Nos.61932004 and 62072205.
文摘Entity linking refers to linking a string in a text to corresponding entities in a knowledge base through candidate entity generation and candidate entity ranking.It is of great significance to some NLP(natural language processing)tasks,such as question answering.Unlike English entity linking,Chinese entity linking requires more consideration due to the lack of spacing and capitalization in text sequences and the ambiguity of characters and words,which is more evident in certain scenarios.In Chinese domains,such as industry,the generated candidate entities are usually composed of long strings and are heavily nested.In addition,the meanings of the words that make up industrial entities are sometimes ambiguous.Their semantic space is a subspace of the general word embedding space,and thus each entity word needs to get its exact meanings.Therefore,we propose two schemes to achieve better Chinese entity linking.First,we implement an ngram based candidate entity generation method to increase the recall rate and reduce the nesting noise.Then,we enhance the corresponding candidate entity ranking mechanism by introducing sense embedding.Considering the contradiction between the ambiguity of word vectors and the single sense of the industrial domain,we design a sense embedding model based on graph clustering,which adopts an unsupervised approach for word sense induction and learns sense representation in conjunction with context.We test the embedding quality of our approach on classical datasets and demonstrate its disambiguation ability in general scenarios.We confirm that our method can better learn candidate entities’fundamental laws in the industrial domain and achieve better performance on entity linking through experiments.
基金Project of State Administration of Traditional Chinese Medicine(GZY-YZS-2019-45)The Horizontal Project of Hunan Medical College(HYH-2021Y-KJ-6-33)+1 种基金Scientific Research Project of Hunan Provincial Department of Education in 2021(21C0223)Natural Science Foundation of Hunan Province in 2022(1524)。
文摘Objective This study aimed to examine and propagate the medication experience and group formula of traditional Chinese medicine(TCM)Master XIONG Jibo in diagnosing and treat-ing arthralgia syndrome(AS)through data mining.Methods Data of outpatient cases of Professor XIONG Jibo were collected from January 1,2014 to December 31,2018,along with cases recorded in A Real Famous Traditional Chinese Medicine Doctor:XIONG Jibo's Clinical Medical Record 1,which was published in December 2019.The five variables collected from the patients’data were TCM diagnostic information,TCM and western medicine diagnoses,syndrome,treatment,and prescription.A database was established for the collected data with Excel.Using the Python environment,a custom-ized modified natural language processing(NLP)model for the diagnosis and treatment of AS by Professor XIONG Jibo was established to preprocess the data and to analyze the word cloud.Frequency analysis,association rule analysis,cluster analysis,and visual analysis of AS cases were performed based on the Traditional Chinese Medicine Inheritance Computing Platform(V3.0)and RStudio(V4.0.3).Results A total of 610 medical records of Professor XIONG Jibo were collected from the case database.A total of 103 medical records were included after data screening criteria,which comprised 187 times(45 kinds)of prescriptions and 1506 times(125 kinds)of Chinese herbs.The main related meridians were the liver,spleen,and kidney meridians.The properties of Chinese herbs used most were mainly warm,flat,and cold,while the flavors of herbs were mainly bitter,pungent,and sweet.The main patterns of AS included the damp heat,phlegm stasis,and neck arthralgia.The most commonly used herbs for AS were Chuanniuxi(Cyathu-lae Radix),Huangbo(Phellodendri Chinensis Cortex),Cangzhu(Atractylodis Rhizoma),Qinjiao(Gentianae Macrophyllae Radix),Gancao(Glycyrrhizae Radix et Rhizoma),Huangqi(Astragali Radix),and Chuanxiong(Chuanxiong Rhizoma).The most common effect of the herbs was“promoting blood circulation and removin
基金supported by the National Key Research and Development Program of China(No.2020AAA0106500)the National Natural Science Foundation of China(NSFC No.62236004).
文摘The pre-training-then-fine-tuning paradigm has been widely used in deep learning.Due to the huge computation cost for pre-training,practitioners usually download pre-trained models from the Internet and fine-tune them on downstream datasets,while the downloaded models may suffer backdoor attacks.Different from previous attacks aiming at a target task,we show that a backdoored pre-trained model can behave maliciously in various downstream tasks without foreknowing task information.Attackers can restrict the output representations(the values of output neurons)of trigger-embedded samples to arbitrary predefined values through additional training,namely neuron-level backdoor attack(NeuBA).Since fine-tuning has little effect on model parameters,the fine-tuned model will retain the backdoor functionality and predict a specific label for the samples embedded with the same trigger.To provoke multiple labels in a specific task,attackers can introduce several triggers with predefined contrastive values.In the experiments of both natural language processing(NLP)and computer vision(CV),we show that NeuBA can well control the predictions for trigger-embedded instances with different trigger designs.Our findings sound a red alarm for the wide use of pre-trained models.Finally,we apply several defense methods to NeuBA and find that model pruning is a promising technique to resist NeuBA by omitting backdoored neurons.
基金supported by the National Key Research and Development Program of China under Grant No.2020AAA0106700the National Natural Science Foundation of China under Grant No.62022027.
文摘Fine-tuning pre-trained language models like BERT have become an effective way in natural language processing(NLP)and yield state-of-the-art results on many downstream tasks.Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure,re-designing the pre-training tasks,and leveraging external data and knowledge.The fine-tuning strategy itself has yet to be fully explored.In this paper,we improve the fine-tuning of BERT with two effective mechanisms:self-ensemble and self-distillation.The self-ensemble mechanism utilizes the checkpoints from an experience pool to integrate the teacher model.In order to transfer knowledge from the teacher model to the student model efficiently,we further use knowledge distillation,which is called self-distillation because the distillation comes from the model itself through the time dimension.Experiments on the GLUE benchmark and the Text Classification benchmark show that our proposed approach can significantly improve the adaption of BERT without any external data or knowledge.We conduct exhaustive experiments to investigate the efficiency of the self-ensemble and self-distillation mechanisms,and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.
文摘Background:In this investigation,we explore the literature regarding neuroregeneration from the 1700s to the present.The regeneration of central nervous system neurons or the regeneration of axons from cell bodies and their reconnection with other neurons remains a major hurdle.Injuries relating to war and accidents attracted medical professionals throughout early history to regenerate and reconnect nerves.Early literature till 1990 lacked specific molecular details and is likely provide some clues to conditions that promoted neuron and/or axon regeneration.This is an avenue for the application of natural language processing(NLP)to gain actionable intelligence.Post 1990 period saw an explosion of all molecular details.With the advent of genomic,transcriptomics,proteomics,and other omics-there is an emergence of big data sets and is another rich area for application of NLP.How the neuron and/or axon regeneration related keywords have changed over the years is a first step towards this endeavor.Methods:Specifically,this article curates over 600 published works in the field of neuroregeneration.We then apply a dynamic topic modeling algorithm based on the Latent Dirichlet allocation(LDA)algorithm to assess how topics cluster based on topics.Results:Based on how documents are assigned to topics,we then build a recommendation engine to assist researchers to access domain-specific literature based on how their search text matches to recommended document topics.The interface further includes interactive topic visualizations for researchers to understand how topics grow closer and further apart,and how intra-topic composition changes over time.Conclusions:We present a recommendation engine and interactive interface that enables dynamic topic modeling for neuronal regeneration.
基金Fundamental Research Funds for the Central Universities of Ministry of Education of China(No.19D111201)。
文摘Current research on metaphor analysis is generally knowledge-based and corpus-based,which calls for methods of automatic feature extraction and weight calculation.Combining natural language processing(NLP),latent semantic analysis(LSA),and Pearson correlation coefficient,this paper proposes a metaphor analysis method for extracting the content words from both literal and metaphorical corpus,calculating correlation degree,and analyzing their relationships.The value of the proposed method was demonstrated through a case study by using a corpus with keyword“飞翔(fly)”.When compared with the method of Pearson correlation coefficient,the experiment shows that the LSA can produce better results with greater significance in correlation degree.It is also found that the number of common words that appeared in both literal and metaphorical word bags decreased with the correlation degree.The case study also revealed that there are more nouns appear in literal corpus,and more adjectives and adverbs appear in metaphorical corpus.The method proposed will benefit NLP researchers to develop the required step-by-step calculation tools for accurate quantitative analysis.
文摘The Memorable Tourist Experience(MTE)is a scientific concept within the studies on tourism that is developed based on several related constructions:Perceived Confidence,Sincerity,Authenticity,and Satisfaction.This work takes this model established by the work of Dr.Babak Taheri in 2018 on Monuments World Heritage of UNESCO,adopting an alternative data collection method to the face-to-face survey.Therefore,this work takes as a source of data the reviews collected in the recommendation platform TripAdvisor,working the same constructions of the MTE,with the collection of similar terms and the relationships between them.In order to highlight the terms,a first step is established with the use of Natural Language Processing(NLP),followed by the use of Machine Learning(ML)techniques to generate the relationships between the constructors defined in the models.The study makes a comparison using the method,in immaterial nature such as a flamenco show in the city of Seville;Flamenco has been declared by UNESCO an intangible World Heritage Site since 2010.The results of the study go in two directions:on the one hand to find similarities in the study of the specific MTE of both monuments with the hypotheses worked in the original model of Taheri.In addition to highlighting possible distinctive elements of each case and,and furthermore within the value contribution of the visit when it is led by an official tour guide,on the other hand,give presence to the model of obtaining data by reviews as a complementary data source of any tourist study.The data collection and analysis from both NLP and ML techniques permit the scientific study and the tourist operators to develop better value propositions to users and understanding of heterogeneous behaviors in the tourism industry.The study of reviews within the MTE allows identifying the stimulus that leads the user to choose an activity and hire it.These studies are extendable to other industries and business models,given the importance that references acquire within the consumer will
基金This work was supported by national funds through FCT(Fundação para a Ciência e a Tecnologia),under the project-UIDB/04152/2020-Centro de Investigação em Gestão de Informação(MagIC)/NOVA IMS.
文摘Purpose:Patent classification is one of the areas in Intellectual Property Analytics(IPA),and a growing use case since the number of patent applications has been increasing worldwide.We propose using machine learning algorithms to classify Portuguese patents and evaluate the performance of transfer learning methodologies to solve this task.Design/methodology/approach:We applied three different approaches in this paper.First,we used a dataset available by INPI to explore traditional machine learning algorithms and ensemble methods.After preprocessing data by applying TF-IDF,FastText and Doc2Vec,the models were evaluated by cross-validation in 5 folds.In a second approach,we used two different Neural Networks architectures,a Convolutional Neural Network(CNN)and a bi-directional Long Short-Term Memory(BiLSTM).Finally,we used pre-trained BERT,DistilBERT,and ULMFiT models in the third approach.Findings:BERTTimbau,a BERT architecture model pre-trained on a large Portuguese corpus,presented the best results for the task,even though with a performance of only 4%superior to a LinearSVC model using TF-IDF feature engineering.Research limitations:The dataset was highly imbalanced,as usual in patent applications,so the classes with the lowest samples were expected to present the worst performance.That result happened in some cases,especially in classes with less than 60 training samples.Practical implications:Patent classification is challenging because of the hierarchical classification system,the context overlap,and the underrepresentation of the classes.However,the final model presented an acceptable performance given the size of the dataset and the task complexity.This model can support the decision and improve the time by proposing a category in the second level of ICP,which is one of the critical phases of the grant patent process.Originality/value:To our knowledge,the proposed models were never implemented for Portuguese patent classification.
文摘For projects with thousands of files, finding the locations of bugs is time-consuming and labor-intensive. Bug reports as a potential resource to help locate bugs in source codes have been used to design automatic tools to solve this problem. Existing information retrieval(IR)-based bug localization methods rely heavily on the similarity score between bug report and historical reports. As deep learning methods show great advantages in calculating text semantic similarity, we adapt the transformer network with IR-based bug localization methods to design a novel approach, TSLocator, to bug localization. In TSLocator, we propose five new features between bug reports and source codes. We use SVMRank to model the relation between all the six features and the actual buggy file. Given a new bug report, TSLocator automatically calculates the features and linearly weights the features to produce a suspicious score for all candidate files. TSLocator recommends a list of suspicious buggy files ranked by the score. The experimental results show that TSLocator outperforms existing methods in accuracy and performance of bug localization.
基金the Natural Science Foundation of Zhejiang Province(No.GF20F020063)the Fujian Province Young and Middle-Aged Teacher Education Research Project(No.JAT170480)。
文摘Language disorder,a common manifestation of Alzheimer’s disease(AD),has attracted widespread attention in recent years.This paper uses a novel natural language processing(NLP)method,compared with latest deep learning technology,to detect AD and explore the lexical performance.Our proposed approach is based on two stages.First,the dialogue contents are summarized into two categories with the same category.Second,term frequency—inverse document frequency(TF-IDF)algorithm is used to extract the keywords of transcripts,and the similarity of keywords between the groups was calculated separately by cosine distance.Several deep learning methods are used to compare the performance.In the meanwhile,keywords with the best performance are used to analyze AD patients’lexical performance.In the Predictive Challenge of Alzheimer’s Disease held by iFlytek in 2019,the proposed AD diagnosis model achieves a better performance in binary classification by adjusting the number of keywords.The F1 score of the model has a considerable improvement over the baseline of 75.4%,and the training process of which is simple and efficient.We analyze the keywords of the model and find that AD patients use less noun and verb than normal controls.A computer-assisted AD diagnosis model on small Chinese dataset is proposed in this paper,which provides a potential way for assisting diagnosis of AD and analyzing lexical performance in clinical setting.
文摘Machine Learning is revolutionizing the era day by day and the scope is no more limited to computer science as the advancements are evident in the field of healthcare.Disease diagnosis,personalized medicine,and Recommendation system(RS)are among the promising applications that are using Machine Learning(ML)at a higher level.A recommendation system helps inefficient decision-making and suggests personalized recommendations accordingly.Today people share their experiences through reviews and hence designing of recommendation system based on users’sentiments is a challenge.The recommendation system has gained significant attention in different fields but considering healthcare,little is being done from the perspective of drugs,disease,and medical recommendations.This study is engrossed in designing a recommendation system that is based on the fusion of sentiment analysis and radiant boosting.The polarity of the sentiments is analyzed through user reviews and the processed data is fed into the Extreme Gradient Boosting(XGBOOST)framework to generate the drug recommendation.To establish the applicability of the concept a comparative study is performed between the proposed approach and the existing approaches.
文摘This research examines industry-based dissertation research in a doctoralcomputing program through the lens of machine learning algorithms todetermine if natural language processing-based categorization on abstractsalone is adequate for classification. This research categorizes dissertationby both their abstracts and by their full-text using the GraphLabCreate library from Apple’s Turi to identify if abstract analysis is anadequate measure of content categorization, which we found was not. Wealso compare the dissertation categorizations using IBM’s Watson Discoverydeep machine learning tool. Our research provides perspectiveson the practicality of the manual classification of technical documents;and, it provides insights into the: (1) categories of academic work createdby experienced fulltime working professionals in a Computing doctoralprogram, (2) viability and performance of automated categorization of theabstract analysis against the fulltext dissertation analysis, and (3) natuallanguage processing versus human manual text classification abstraction.
基金This work is supported by the Collaborative education project of QST Innovation Technology Group Co.,Ltd and the Ministry of Education of PRC(NO.201801243022).
文摘UML Class diagram generation from textual requirements is an important task in object-oriented design and programing course.This study proposes a method for automatically generating class diagrams from Chinese textual requirements on the basis of Natural Language Processing(NLP)and mapping rules for sentence pattern matching.First,classes are identified through entity recognition rules and candidate class pruning rules using NLP from requirements.Second,class attributes and relationships between classes are extracted using mapping rules for sentence pattern matching on the basis of NLP.Third,we developed an assistant tool integrated into a precision micro classroom system for automatic generation of class diagram,to effectively assist the teaching of object-oriented design and programing course.Results are evaluated with precision,accuracy and recall from eight requirements of object-oriented design and programing course using truth values created by teachers.Our research should benefit beginners of object-oriented design and programing course,who may be students or software developers.It helps them to create correct domain models represented in the UML class diagram.