Recently, the emergence of pre-trained models(PTMs) has brought natural language processing(NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language rep...Recently, the emergence of pre-trained models(PTMs) has brought natural language processing(NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next,we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.展开更多
The leakage of medical audio data in telemedicine seriously violates the privacy of patients.In order to avoid the leakage of patient information in telemedicine,a two-stage reversible robust audio watermarking algori...The leakage of medical audio data in telemedicine seriously violates the privacy of patients.In order to avoid the leakage of patient information in telemedicine,a two-stage reversible robust audio watermarking algorithm is proposed to protect medical audio data.The scheme decomposes the medical audio into two independent embedding domains,embeds the robust watermark and the reversible watermark into the two domains respectively.In order to ensure the audio quality,the Hurst exponent is used to find a suitable position for watermark embedding.Due to the independence of the two embedding domains,the embedding of the second-stage reversible watermark will not affect the first-stage watermark,so the robustness of the first-stage watermark can be well maintained.In the second stage,the correlation between the sampling points in the medical audio is used to modify the hidden bits of the histogram to reduce the modification of the medical audio and reduce the distortion caused by reversible embedding.Simulation experiments show that this scheme has strong robustness against signal processing operations such as MP3 compression of 48 db,additive white Gaussian noise(AWGN)of 20 db,low-pass filtering,resampling,re-quantization and other attacks,and has good imperceptibility.展开更多
针对受字数限定影响的文本特征表达能力弱成为短文本分类中制约效果的主要问题,提出基于word2vec维基百科词模型的中文短文本分类方法(chinese short text classification method based on embedding trained by word2vec from wikipedi...针对受字数限定影响的文本特征表达能力弱成为短文本分类中制约效果的主要问题,提出基于word2vec维基百科词模型的中文短文本分类方法(chinese short text classification method based on embedding trained by word2vec from wikipedia, CSTC-EWW),并针对新浪爱问4个主题的短文本集进行相关试验。首先训练维基百科语料库并获取word2vec词模型,然后建立基于此模型的短文本特征,通过SVM、贝叶斯等经典分类器对短文本进行分类。试验结果表明:本研究提出的方法可以有效进行短文本分类,最好情况下的F-度量值可达到81.8%;和词袋(bag-of-words, BOW)模型结合词频-逆文件频率(term frequency-inverse document frequency, TF-IDF)加权表达特征的短文本分类方法以及同样引入外来维基百科语料扩充特征的短文本分类方法相比,本研究分类效果更好,最好情况下的F-度量提高45.2%。展开更多
基金the National Natural Science Foundation of China(Grant Nos.61751201 and 61672162)the Shanghai Municipal Science and Technology Major Project(Grant No.2018SHZDZX01)and ZJLab。
文摘Recently, the emergence of pre-trained models(PTMs) has brought natural language processing(NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next,we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
基金This work was supported,in part,by the Natural Science Foundation of Jiangsu Province under Grant Numbers BK20201136,BK20191401in part,by the National Nature Science Foundation of China under Grant Numbers 61502240,61502096,61304205,61773219in part,by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD)fund.Conflicts of Interest:The aut。
文摘The leakage of medical audio data in telemedicine seriously violates the privacy of patients.In order to avoid the leakage of patient information in telemedicine,a two-stage reversible robust audio watermarking algorithm is proposed to protect medical audio data.The scheme decomposes the medical audio into two independent embedding domains,embeds the robust watermark and the reversible watermark into the two domains respectively.In order to ensure the audio quality,the Hurst exponent is used to find a suitable position for watermark embedding.Due to the independence of the two embedding domains,the embedding of the second-stage reversible watermark will not affect the first-stage watermark,so the robustness of the first-stage watermark can be well maintained.In the second stage,the correlation between the sampling points in the medical audio is used to modify the hidden bits of the histogram to reduce the modification of the medical audio and reduce the distortion caused by reversible embedding.Simulation experiments show that this scheme has strong robustness against signal processing operations such as MP3 compression of 48 db,additive white Gaussian noise(AWGN)of 20 db,low-pass filtering,resampling,re-quantization and other attacks,and has good imperceptibility.
文摘针对受字数限定影响的文本特征表达能力弱成为短文本分类中制约效果的主要问题,提出基于word2vec维基百科词模型的中文短文本分类方法(chinese short text classification method based on embedding trained by word2vec from wikipedia, CSTC-EWW),并针对新浪爱问4个主题的短文本集进行相关试验。首先训练维基百科语料库并获取word2vec词模型,然后建立基于此模型的短文本特征,通过SVM、贝叶斯等经典分类器对短文本进行分类。试验结果表明:本研究提出的方法可以有效进行短文本分类,最好情况下的F-度量值可达到81.8%;和词袋(bag-of-words, BOW)模型结合词频-逆文件频率(term frequency-inverse document frequency, TF-IDF)加权表达特征的短文本分类方法以及同样引入外来维基百科语料扩充特征的短文本分类方法相比,本研究分类效果更好,最好情况下的F-度量提高45.2%。