Funding: Supported by the National Science and Technology Innovation 2030 New Generation Artificial Intelligence Major Project (Grant No. 2018AAA0101800) and the National Natural Science Foundation of China (Grant No. 72271188).
Abstract: As production scenarios grow more complex, enterprises in the industrial domain retain vast amounts of production information. This raises the question of how to extract value from complex document information and establish coherent links between pieces of information. In this work, we present a framework for knowledge graph construction in the industrial domain based on knowledge-enhanced document-level entity and relation extraction. This approach alleviates the shortage of annotated data in the industrial domain and models the interplay of industrial documents. To improve the accuracy of named entity recognition, domain-specific knowledge is incorporated into the initialization of the word embedding matrix within the bidirectional long short-term memory conditional random field (BiLSTM-CRF) framework. For relation extraction, this paper introduces the knowledge-enhanced graph inference (KEGI) network, a method designed for long paragraphs in the industrial domain. This method captures intricate interactions among entities by constructing a document graph and integrates knowledge representation into both node construction and path inference through TransR. At the application level, BiLSTM-CRF and KEGI are used to build a knowledge graph from a knowledge representation model and Chinese fault reports for a steel production line, namely SPOnto and SPFRDoc. The F1 score for entity and relation extraction improves by 2% to 6%. The quality of the extracted knowledge graph meets the requirements of real-world production applications. The results demonstrate that KEGI can mine production reports in depth, extracting a wealth of knowledge and patterns, thereby providing a comprehensive solution for production management.
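The abstract above names TransR as the knowledge representation model used in node construction and path inference. As a rough illustration only, the standard TransR scoring idea (project head and tail entities into a relation-specific space, then check whether head plus relation lands near tail) can be sketched as follows. The vectors, the projection matrix, and the function names are toy assumptions for this sketch and are not taken from the KEGI implementation.

```python
def mat_vec(matrix, vec):
    """Multiply a (k x d) matrix by a length-d vector using plain lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transr_score(head, relation, tail, projection):
    """Squared L2 norm of (M_r h + r - M_r t); a lower score means a more plausible triple."""
    h_r = mat_vec(projection, head)  # head entity projected into relation space
    t_r = mat_vec(projection, tail)  # tail entity projected into relation space
    return sum((h + r - t) ** 2 for h, r, t in zip(h_r, relation, t_r))


# Hypothetical 3-dimensional entity embeddings and a 2 x 3 relation projection.
HEAD = [1.0, 0.0, 2.0]
TAIL = [0.0, 1.0, 1.0]
PROJECTION = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]]
GOOD_RELATION = [-1.0, 0.0]  # translates the projected head exactly onto the tail
BAD_RELATION = [1.0, 1.0]    # leaves a gap between head + relation and tail

if __name__ == "__main__":
    print(transr_score(HEAD, GOOD_RELATION, TAIL, PROJECTION))
    print(transr_score(HEAD, BAD_RELATION, TAIL, PROJECTION))
```

In a trained model the embeddings and projection matrices are learned so that observed triples score lower than corrupted ones; how KEGI feeds these representations into its document graph is not specified in the abstract.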
Funding: Supported by the National Natural Science Foundation of China (Nos. U19A2059 and 62176046).
Abstract: Relation Extraction (RE) aims to obtain the predefined relation type of two entities mentioned in a piece of text, e.g., a sentence-level or document-level text. Most existing studies suffer from noise in the text, so pruning is of great importance. The conventional sentence-level RE task addresses this issue with a denoising method that uses the shortest dependency path to build a long-range semantic dependency between entity pairs. However, this kind of denoising method is scarce in document-level RE. In this work, we explicitly model a denoised document-level graph based on linguistic knowledge to capture various long-range semantic dependencies among entities. We first formalize a Syntactic Dependency Tree forest (SDT-forest) by introducing syntax and discourse dependency relations. Then, the Steiner tree algorithm extracts a mention-level denoised graph, the Steiner Graph (SG), removing linguistically irrelevant words from the SDT-forest. We then devise a slide residual attention to highlight word-level evidence in the text and the SG. Finally, classification is performed on the SG to infer the relations of entity pairs. We conduct extensive experiments on three public datasets. The results show that our method helps establish long-range semantic dependencies and improves classification performance on longer texts.
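The abstract above refers to the sentence-level denoising baseline that keeps only the shortest dependency path between two entities. A minimal sketch of that baseline (not of the paper's Steiner tree method) is a breadth-first search over an undirected dependency graph. The toy sentence, its edges, and the function name are assumptions made up for this illustration.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the shortest path between two tokens in an undirected dependency graph."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two tokens are not connected


# Hypothetical parse of "Marie Curie discovered polonium in Paris" as (head, dependent) pairs.
EDGES = [
    ("discovered", "Curie"),
    ("Curie", "Marie"),
    ("discovered", "polonium"),
    ("discovered", "Paris"),
    ("Paris", "in"),
]

if __name__ == "__main__":
    print(shortest_dependency_path(EDGES, "Curie", "polonium"))
```

Tokens off the returned path (here "Marie", "in", and "Paris") are the words such a denoiser would prune. A Steiner tree generalizes this idea from one entity pair to a set of mentions that must all stay connected.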
Abstract: This paper explores the potential of applying online collaborative documents to foster critical thinking skills in EFL college-level classrooms. Considering the limitations of traditional teacher-centered approaches and the need for innovative methods, the study examines the integration of online collaborative tools, using Tencent Docs as an example. The discussion highlights the importance of critical thinking in the academic and professional spheres and introduces the concept of online collaborative documents for enhancing this cognitive skill. Through a detailed exploration, the paper presents a model for employing collaborative documents within a college English class, demonstrating how students collaboratively study an article. The paper then discusses the pros and cons of employing this technology in the classroom. The conclusion emphasizes the transformative potential of integrating technology into pedagogy and its role in creating a dynamic learning environment. The paper underscores the importance of striking a balance between technology and traditional methods and identifies avenues for further research and development.
Funding: This research is supported by the Fundamental Research Grant Scheme (FRGS), Ministry of Education Malaysia (MOE), under project code FRGS/1/2018/ICT02/USM/02/9, titled "Automated Big Data Annotation for Training Semi-Supervised Deep Learning Model in Sentiment Classification".
Abstract: Sentiment classification is a useful tool for classifying reviews by the sentiments and attitudes they express towards a product or service. Existing studies rely heavily on sentiment classification methods that require fully annotated inputs. However, limited labelled text is available, which makes acquiring fully annotated input costly and labour-intensive. Recently, semi-supervised methods have emerged, as they require only partially labelled input but perform comparably to supervised methods. Nevertheless, some works have reported that the performance of a semi-supervised model degraded after unlabelled instances were added to training. The literature also shows that not all unlabelled instances are equally useful; thus, identifying the informative unlabelled instances is beneficial when training a semi-supervised model. To achieve this, an informative score is proposed and incorporated into semi-supervised sentiment classification. The evaluation is performed on a semi-supervised method with and without the informative score. When the informative score is used in the instance selection strategy to identify informative unlabelled instances, semi-supervised models perform better than models that do not incorporate informative scores into their training. Although semi-supervised models that incorporate an informative score do not surpass the supervised models, the results are still promising: the difference in performance is small, at 2% to 5%, while the proportion of labelled instances used is reduced from 100% to 40%. The best result of the proposed instance selection strategy is achieved when an informative score is combined with a baseline confidence score at a 0.5:0.5 ratio using only 40% labelled data.
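The abstract above reports that the best result came from combining an informative score with a baseline confidence score at a 0.5:0.5 ratio. The abstract does not define how either score is computed, so the sketch below takes both as given inputs and only illustrates the weighted blend and the resulting top-k selection. The candidate tuples, the score values, and the function name are hypothetical.

```python
def select_instances(candidates, top_k, informative_weight=0.5):
    """Rank unlabelled instances by a weighted blend of two scores and keep the top_k ids.

    candidates: list of (instance_id, informative_score, confidence_score),
    with both scores assumed to lie in [0, 1].
    """
    confidence_weight = 1.0 - informative_weight

    def blended(candidate):
        _, informative, confidence = candidate
        return informative_weight * informative + confidence_weight * confidence

    ranked = sorted(candidates, key=blended, reverse=True)
    return [instance_id for instance_id, _, _ in ranked[:top_k]]


# Hypothetical unlabelled reviews with made-up scores.
CANDIDATES = [
    ("r1", 0.9, 0.5),
    ("r2", 0.2, 0.9),
    ("r3", 0.6, 0.7),
    ("r4", 0.1, 0.3),
]

if __name__ == "__main__":
    print(select_instances(CANDIDATES, top_k=2))                          # 0.5:0.5 blend
    print(select_instances(CANDIDATES, top_k=2, informative_weight=0.0))  # confidence only
```

Setting `informative_weight` to 0.0 falls back to selection by confidence alone, which makes it easy to compare the two strategies on the same candidate pool.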
Abstract: Sentiment analysis is the process of determining the intention or emotion behind an article. It analyzes the subjective information in people's opinions in context. The analyzed data quantifies reactions or sentiments and reveals the contextual polarity of the information. In social behavior, sentiment can be thought of as a latent variable. Measuring and understanding this behavior could help us better understand social issues. Because sentiments are domain specific, sentiment analysis in a specific context is critical in any real-world scenario. Textual sentiment analysis is performed at the sentence, document, and feature levels. This work introduces a new Information Gain based Feature Selection (IGbFS) algorithm for selecting highly correlated features while eliminating irrelevant and redundant ones. Extensive textual sentiment analysis at the sentence, document, and feature levels is performed using the proposed Information Gain based Feature Selection algorithm. The analysis is based on datasets from the Cornell and Kaggle repositories. Compared with existing baseline classifiers, the proposed Information Gain based classifier achieved higher accuracy: 96% at the document level, 97.4% at the sentence level, and 98.5% at the feature level. The proposed method is also tested on the IMDB, Yelp 2013, and Yelp 2014 datasets. On these high-dimensional datasets, the proposed Information Gain based classifier achieved higher accuracy than existing baseline classifiers: 95%, 96%, and 98% at the document, sentence, and feature levels, respectively.
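The abstract above builds its feature selection on information gain. The details of the IGbFS algorithm are not given there, so the following is only a sketch of the textbook criterion: the information gain of a feature is the entropy of the labels minus the weighted entropy of the labels after splitting on that feature. The toy labels, the two binary word-presence features, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [label for f, label in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_features(features, labels):
    """Return feature names sorted from highest to lowest information gain."""
    return sorted(features, key=lambda name: information_gain(features[name], labels), reverse=True)


# Hypothetical four-review corpus: 1 means the word occurs in the review.
LABELS = ["pos", "pos", "neg", "neg"]
FEATURES = {
    "the": [1, 0, 1, 0],    # occurs equally in both classes: uninformative
    "great": [1, 1, 0, 0],  # occurs only in positive reviews: fully informative
}

if __name__ == "__main__":
    for name in rank_features(FEATURES, LABELS):
        print(name, information_gain(FEATURES[name], LABELS))
```

A selection step would then keep the top-ranked features and drop those whose gain is near zero; any redundancy handling that IGbFS adds on top of this ranking is not described in the abstract.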
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61751206, 61673290 and 61876118, the Postgraduate Research & Practice Innovation Program of Jiangsu Province of China under Grant No. KYCX20_2669, and a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Abstract: Document-level machine translation (MT) remains challenging because it is difficult to use document-level global context efficiently for translation. In this paper, we propose a hierarchical model to learn the global context for document-level neural machine translation (NMT). This is done through a sentence encoder that captures intra-sentence dependencies and a document encoder that models document-level inter-sentence consistency and coherence. With this hierarchical architecture, we feed the extracted document-level global context back to each word in a top-down fashion to distinguish different translations of a word according to its specific surrounding context. Notably, we explore the effect of three popular attention functions during the information backward-distribution phase to take a closer look at how our model distributes global context information. In addition, since large-scale in-domain document-level parallel corpora are usually unavailable, we use a two-step training strategy that takes advantage of a large-scale corpus of out-of-domain parallel sentence pairs and a small-scale corpus of in-domain parallel document pairs to achieve domain adaptability. Experimental results on Chinese-English and English-German corpora show that our model significantly improves on the Transformer baseline by 4.5 BLEU points on average, which demonstrates the effectiveness of the proposed hierarchical model in document-level NMT.
Funding: Supported by the National High Technology Research and Development Programme of China (No. 2015AA015405).
Abstract: Social media such as Twitter serve as a novel news medium and have become increasingly popular since their establishment. Large-scale, first-hand, user-generated tweets motivate automatic event detection on Twitter. Previous unsupervised approaches detected events by clustering words. These methods detect events using burstiness, which measures surging frequencies of words in certain time windows. However, event clusters represented by a set of individual words are difficult to understand. This issue is addressed by building a document-level event detection model that directly calculates the burstiness of tweets, leveraging distributed word representations to model semantic information and thereby avoiding sparsity. Results show that the document-level model not only offers event summaries that are directly human-readable, but also gives significantly improved accuracy on unsupervised tweet event detection compared with previous methods, which are based on words or segments.
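The abstract above describes burstiness as a measure of surging word frequencies in certain time windows but does not give a formula. One common way to make that concrete, assumed here purely for illustration, is a z-score: how many standard deviations the current window's count lies above the mean of earlier windows. This sketch covers the word-level notion used by the earlier methods, not the paper's tweet-level model, and the counts and function name are hypothetical.

```python
import statistics


def burstiness(history, current):
    """Z-score of the current window's count against counts from earlier windows."""
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)  # population standard deviation
    if spread == 0:
        # Flat history: any increase is treated as an unbounded burst.
        return 0.0 if current <= mean else float("inf")
    return (current - mean) / spread


# Hypothetical hourly counts of one word over eight past windows.
HISTORY = [2, 4, 4, 4, 5, 5, 7, 9]

if __name__ == "__main__":
    print(burstiness(HISTORY, 11))  # a surge well above the usual level
    print(burstiness(HISTORY, 5))   # an ordinary window
```

A detector built this way would flag windows whose score exceeds a chosen threshold; the threshold itself, like the z-score definition, is an assumption rather than something stated in the abstract.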