According to the reports of"Top Ten Emerging Technologies in Chemistry 2022"released by the International Union of Pure and Applied Chemistry,sodium-ion battery(SIB)technology is identified as a crucial emer...According to the reports of"Top Ten Emerging Technologies in Chemistry 2022"released by the International Union of Pure and Applied Chemistry,sodium-ion battery(SIB)technology is identified as a crucial emerging technology,indicating its promising development for future energy-storage applications[1].In practical applications,commercialized lithium-ion batteries(LIBs)with lithium cobalt oxide and ternary oxide as cathode materials have assumed a dominant position[2].However,these cathode materials of LIBs are highly dependent on expensive cobalt and nickel,rendering them less sustainable for grid-scale energy storage.Conversely,cathode materials in SIBs appear more sustainable due to their lower dependence on cobalt.Furthermore,the strategic importance of reducing over-dependence on lithium resources cannot be overstated.Hence,SIB technology can serve as one of the potential solutions to mitigate this issue[3].展开更多
Abstract: Since OpenAI opened access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting researchers' attention from numerous domains. However, public researchers face difficulties when developing LLMs, given that most LLMs are produced by industry and their training details are typically not disclosed. Since datasets are an essential component of LLM training, this paper presents a holistic survey of the training datasets used in both the pre-training and fine-tuning processes. The paper first summarizes 16 pre-training datasets and 16 fine-tuning datasets used in state-of-the-art LLMs. Secondly, based on the properties of the pre-training and fine-tuning processes, it comments on pre-training datasets in terms of quality, quantity, and their relation to models, and on fine-tuning datasets in terms of quality, quantity, and concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps public researchers train and investigate LLMs through illustrative cases and provides useful comments to the research community regarding data development. To the best of our knowledge, this paper is the first to summarize and discuss datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out the existing problems of LLM studies from the perspective of data.
Funding: Supported by the National Key R&D Program of China (2023YFE0202000), the National Natural Science Foundation of China (52173246), and the Double-Thousand Talents Plan of Jiangxi Province (jxsq2023102005).
Abstract: According to the "Top Ten Emerging Technologies in Chemistry 2022" report released by the International Union of Pure and Applied Chemistry, sodium-ion battery (SIB) technology is identified as a crucial emerging technology, indicating its promising development for future energy-storage applications [1]. In practical applications, commercialized lithium-ion batteries (LIBs) with lithium cobalt oxide and ternary oxides as cathode materials have assumed a dominant position [2]. However, these LIB cathode materials depend heavily on expensive cobalt and nickel, rendering them less sustainable for grid-scale energy storage. Conversely, cathode materials in SIBs appear more sustainable due to their lower dependence on cobalt. Furthermore, the strategic importance of reducing over-dependence on lithium resources cannot be overstated. Hence, SIB technology can serve as one of the potential solutions to mitigate this issue [3].
Abstract: The interfacial tension (IFT) of CO2 + aqueous-solution systems is one of the key parameters governing gas-water two-phase transport in formations and is critical to CO2 capture and storage. To determine the IFT of CO2 + aqueous-solution systems quickly and accurately, published IFT measurements were compiled into a dataset of 1,677 samples, and a wavelet neural network (WNN) prediction model was built that accounts for six factors influencing IFT: pressure, temperature, the methane and nitrogen contents of the gas phase, and the concentrations of monovalent cations (Na+, K+) and divalent cations (Ca2+, Mg2+) in the aqueous solution. With 839 randomly selected samples used as the training set, the resulting wavelet neural network had a 6-16-1 structure, and the model predicted IFT with a mean absolute error (MAE) of 1.23 mN/m, a mean absolute relative error (MARE) of 3.30%, a mean squared error (MSE) of 2.30 mN²/m², and a coefficient of determination (R²) of 0.988. Comparison with a recently proposed multivariate fitting model and a BP neural network model shows that the wavelet neural network model achieves the highest prediction accuracy.
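The abstract specifies only the 6-16-1 network structure, not the wavelet type or training setup, so the following is a minimal sketch of such a wavelet neural network regressor in PyTorch, assuming a Morlet mother wavelet with learnable dilation and translation per hidden node; the class name, hyperparameters, and placeholder data are illustrative rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a 6-16-1 wavelet neural network
# for IFT regression. Assumptions: Morlet mother wavelet, Adam optimizer,
# MSE loss; the real model's training details are not given in the abstract.
import torch
import torch.nn as nn

class WaveletNN(nn.Module):
    def __init__(self, n_in=6, n_hidden=16):
        super().__init__()
        self.linear_in = nn.Linear(n_in, n_hidden)        # weights feeding each wavelet node
        self.scale = nn.Parameter(torch.ones(n_hidden))   # dilation a_j of each wavelet
        self.shift = nn.Parameter(torch.zeros(n_hidden))  # translation b_j of each wavelet
        self.linear_out = nn.Linear(n_hidden, 1)          # combine wavelet activations into IFT

    @staticmethod
    def morlet(x):
        # Morlet mother wavelet: cos(1.75 x) * exp(-x^2 / 2)
        return torch.cos(1.75 * x) * torch.exp(-0.5 * x ** 2)

    def forward(self, x):
        z = (self.linear_in(x) - self.shift) / self.scale
        return self.linear_out(self.morlet(z))

# Hypothetical usage: X would hold the six normalized inputs (pressure, temperature,
# CH4 and N2 contents, monovalent and divalent cation concentrations), y the measured IFT.
model = WaveletNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X = torch.randn(839, 6)   # placeholder for the 839 training samples
y = torch.randn(839, 1)   # placeholder for measured IFT values (mN/m)
for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```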