Gene duplications provide evolutionary potential for generating novel functions, while polyploidization or whole genome duplication (WGD) initially doubles the chromosomes and results in hundreds to thousands of retained duplicates. WGDs are strongly supported by evidence commonly found in many species-rich lineages of eukaryotes, and are thus considered a major driving force in species diversification. We performed comparative genomic and phylogenomic analyses of 59 public genomes/transcriptomes and 46 newly sequenced transcriptomes covering major lineages of angiosperms to detect large-scale gene duplication events by surveying tens of thousands of gene family trees. These analyses confirmed most of the previously reported WGDs and provided strong evidence for novel ones in many lineages. The detected WGDs supported a model of exponential gene loss during evolution with an estimated half-life of approximately 21.6 million years, and were correlated with both the emergence of lineages with high degrees of diversification and periods of global climate change. The new datasets and analyses detected many novel WGDs spread widely across angiosperm evolution, uncovered preferential retention of gene functions in essential cellular metabolism, and provided clues to the roles of WGD in promoting angiosperm radiation and enhancing adaptation to environmental changes.
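The exponential-loss model reduces to a standard half-life computation; a minimal sketch, using only the ~21.6 Myr half-life reported above (the sampled time points are illustrative):

```python
import math

# Exponential gene-loss model: the fraction of duplicates retained after
# time t (in million years, Myr) decays as exp(-lambda * t), where the
# decay constant follows from the reported half-life of ~21.6 Myr.
HALF_LIFE_MYR = 21.6
DECAY_RATE = math.log(2) / HALF_LIFE_MYR  # per Myr

def retained_fraction(t_myr: float) -> float:
    """Fraction of WGD-derived duplicates still retained after t_myr."""
    return math.exp(-DECAY_RATE * t_myr)

if __name__ == "__main__":
    for t in (10, 21.6, 50, 100):
        print(f"after {t:5.1f} Myr: {retained_fraction(t):.3f} retained")
```

At exactly one half-life (21.6 Myr) the function returns 0.5, as expected; after 100 Myr only about 4% of the duplicates remain, which is why very old WGDs are detected mainly through surviving gene family trees rather than bulk synteny.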
Using electrophoresis, three isozymes (lactate dehydrogenase, malate dehydrogenase and esterase) of three species of the genus Gymnocypris from North Tibet were described and analyzed in this paper. The results showed that all three isozymes presented interspecific differences and distinct differentiation among individuals within the same population, and there was no electrophoretic difference between males and females. Analysis of the relationships among the three naked carps indicated a high degree of similarity between G. selincuoensis and G. cuoensis, but a low degree between G. selincuoensis and G. namensis. Furthermore, the three isozymes presented expression of null alleles, and the duplicate genes LDH-A2, LDH-B2, sMDH-A2 and mMDH-B2 were also expressed in some individuals. Compared to other tetraploid fishes, the three naked carps retained more functional duplicate genes and null alleles. This suggests that fishes of the genus Gymnocypris are at an earlier stage of evolution after polyploidization than fishes of the Catostomidae, which is directly related to the more recent origin of the schizothoracine fishes as well as to their severe environment.
Stack Overflow is a popular online question-and-answer site for software developers to share their experience and expertise. Among the numerous questions posted on Stack Overflow, two or more may express the same point and thus be duplicates of one another. Duplicate questions make Stack Overflow site maintenance harder, waste resources that could have been used to answer other questions, and cause developers to wait unnecessarily for answers that are already available. To reduce the problem of duplicate questions, Stack Overflow allows questions to be manually marked as duplicates of others. Since thousands of questions are submitted to Stack Overflow every day, manually identifying duplicate questions is difficult work. Thus, there is a need for an automated approach that can help detect these duplicate questions. To address this need, in this paper we propose an automated approach named DupPredictor that takes a new question as input and detects potential duplicates of this question by considering multiple factors. DupPredictor extracts the title and description of a question and also the tags that are attached to the question. These pieces of information (title, description, and a few tags) are mandatory information that a user needs to input when posting a question. DupPredictor then computes the latent topics of each question by using a topic model. Next, for each pair of questions, it computes four similarity scores by comparing their titles, descriptions, latent topics, and tags. These four similarity scores are finally combined into a new similarity score that comprehensively considers the multiple factors. To examine the benefit of DupPredictor, we perform an experiment on a Stack Overflow dataset which contains a total of more than two million questions. The result shows that DupPredictor can achieve a recall-rate@20 score of 63.8%. We compare our approach with the standard search engine of Stack Overflow, and DupPredictor improves its recall-rate…
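A minimal sketch of the combination step described above, assuming bag-of-words cosine similarity for titles and descriptions, cosine similarity over topic distributions, Jaccard similarity for tag sets, and illustrative weights (DupPredictor tunes its actual composer weights on training data; none of the names below come from the paper):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words (or topic) vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two tag sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(q1, q2, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of title, description, topic, and tag similarities.
    q1/q2 are dicts with 'title', 'body' (strings), 'topics' (Counter)
    and 'tags' (set); weights here are purely illustrative."""
    scores = (
        cosine(Counter(q1["title"].lower().split()), Counter(q2["title"].lower().split())),
        cosine(Counter(q1["body"].lower().split()), Counter(q2["body"].lower().split())),
        cosine(q1["topics"], q2["topics"]),
        jaccard(q1["tags"], q2["tags"]),
    )
    return sum(w * s for w, s in zip(weights, scores))

q1 = {"title": "How to sort a dict by value", "body": "I need to sort a dict",
      "topics": Counter({"python": 3, "sorting": 2}), "tags": {"python", "dictionary"}}
q2 = {"title": "Sort dict by value in Python", "body": "Sorting a dict by its values",
      "topics": Counter({"python": 3, "sorting": 1}), "tags": {"python", "sorting"}}
print(combined_similarity(q1, q2))  # high score -> likely duplicate candidates
```

Candidate questions are then ranked by this combined score and the top-k are surfaced as potential duplicates, which is what the recall-rate@20 metric evaluates.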
3D geological modeling is an inevitable choice for coal exploration to adapt to the transformation of coal mining toward green, fine, transparent and intelligent mining. In traditional coalfield exploration geological reports, the spatial expression form for the coal seams and their surrounding rocks is 2D maps. These 2D maps are excellent data sources for constructing 3D geological models of coalfield exploration areas. How to construct 3D models from these 2D maps has been studied in the coal exploration industry for a long time, and still no breakthrough has been achieved so far. This paper discusses the principle, method and software design ideas for constructing a 3D geological model of an exploration area from 2D maps made with AutoCAD/MapGIS. First, the paper analyzes the 3D geological surface expression modes in 3D geological modeling software. It is pointed out that although the contour method has unique advantages in coalfield exploration, TIN (Triangular Irregular Network) is still the standard configuration of 3D modeling software for coalfields. Then, the paper discusses the method by which 2D line features obtain elevation, upgrading a 2D curve to a 3D curve. Next, a semi-automatic partition method is introduced to build the boundary ring of a surface patch: the user clicks and selects the line features that form the outer boundary ring of the surface patch. An automatic processing method for fault lines inside the outer boundary ring is then discussed, including the construction of fault rings, the determination of whether a fault ring is a normal-fault ring or a reverse-fault ring, and an algorithm for dealing with normal-fault rings. An algorithm for dealing with reverse-fault rings is discussed in detail, introducing the method of expanding the reverse-fault ring and dividing the duplicated area in the reverse fault into two portions. The paper also discusses the extraction of ridge lines/valley lines, the construction of fault planes, and the construction of strata and coal bodies. The above ideas and methods have been initially implemented…
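As a rough illustration of the "lift 2D curves to 3D, then build a TIN" steps, the sketch below attaches each contour's labeled elevation as a z-coordinate and triangulates in the XY plane (assuming SciPy's Delaunay triangulation; fault rings and boundary partitioning are omitted entirely, and all names are illustrative):

```python
import numpy as np
from scipy.spatial import Delaunay  # assumes SciPy is available

# Each contour is a 2D polyline plus the elevation read from its label;
# "upgrading" it to 3D simply attaches that elevation as the z-coordinate.
def lift_contours(contours):
    """contours: list of (elevation, [(x, y), ...]) -> (N, 3) point array."""
    pts = [(x, y, z) for z, line in contours for x, y in line]
    return np.asarray(pts)

def build_tin(points_3d):
    """Triangulate in the XY plane; each triangle keeps its vertices' z."""
    tri = Delaunay(points_3d[:, :2])
    return points_3d, tri.simplices  # vertex array + triangle index list

contours = [
    (100.0, [(0, 0), (10, 0), (10, 10)]),
    (110.0, [(2, 3), (8, 3), (8, 8)]),
]
verts, faces = build_tin(lift_contours(contours))
print(f"TIN with {len(verts)} vertices and {len(faces)} triangles")
```

A production coalfield modeler must additionally constrain the triangulation at fault rings so that the surface is discontinuous across faults, which is precisely the hard part the paper's normal-/reverse-fault ring algorithms address.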
The emergence of next-generation sequencing (NGS) technologies has significantly improved sequencing throughput and reduced costs. However, the short read length, duplicate reads and massive volume of data make data processing much more difficult and complicated than with first-generation sequencing technology. Although some software packages have been developed to assess data quality, those packages either are not easily available to users or require bioinformatics skills and computer resources. Moreover, almost all the quality assessment software currently available does not take sequencing errors into account when assessing duplicates in NGS data. Here, we present a new user-friendly quality assessment software package called BIGpre, which works for both Illumina and 454 platforms. BIGpre contains all the functions of other quality assessment software, such as the correlation between forward and reverse reads, read GC-content distribution, and base N quality. More importantly, BIGpre incorporates associated programs to detect and remove duplicate reads after taking sequencing errors into account, and to trim low-quality reads from raw data as well. BIGpre is primarily written in Perl and integrates graphical capability from the statistics package R. The package produces both tabular and graphical summaries of data quality for sequencing datasets from Illumina and 454 platforms. Processing hundreds of millions of reads within minutes, it provides immediate diagnostic information for users to manipulate sequencing data for downstream analyses. BIGpre is freely available at http://bigpre.sourceforge.net/. This work was supported by the National Natural Science Foundation of China (Grants No. 31000561 and 30900825) and the Knowledge Innovation Program of the Chinese Academy of Sciences (Grant No. KSCX2-EW-R-01-04).
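The core idea of error-aware duplicate removal is that two reads differing in a base or two may be duplicates corrupted by sequencing error rather than distinct fragments. A minimal sketch of that idea (not BIGpre's actual Perl implementation; the prefix length and mismatch budget are illustrative, and reads that differ inside the bucketing prefix would need an extra pass):

```python
from collections import defaultdict

def hamming(a: str, b: str) -> int:
    """Number of mismatching bases between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def dedup_with_errors(reads, max_mismatch=2, prefix=8):
    """Keep one representative per near-duplicate group.
    Reads sharing an exact prefix are compared base-by-base; a read within
    max_mismatch of a kept representative is treated as a duplicate caused
    by sequencing error rather than as a distinct fragment."""
    kept, buckets = [], defaultdict(list)
    for r in reads:
        bucket = buckets[r[:prefix]]
        if not any(len(k) == len(r) and hamming(k, r) <= max_mismatch for k in bucket):
            bucket.append(r)
            kept.append(r)
    return kept

reads = ["ACGTACGTAA", "ACGTACGTAT", "TTTTACGTAA"]
print(dedup_with_errors(reads))  # the second read collapses into the first
```

An exact-match deduplicator would keep all three reads above; tolerating mismatches merges the first two, which is the behavior the abstract argues other quality-assessment tools lack.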
Evidence-based literature reviews play a vital role in contemporary research, facilitating the synthesis of knowledge from multiple sources to inform decision-making and scientific advancement. Within this framework, de-duplication emerges as part of the process for ensuring the integrity and reliability of evidence extraction. This opinion review delves into the evolution of de-duplication, highlights its importance in evidence synthesis, explores various de-duplication methods, discusses evolving technologies, and proposes best practices. By addressing ethical considerations, this paper emphasizes the significance of de-duplication as a cornerstone of quality in evidence-based literature reviews.
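As a concrete example of the simplest automated de-duplication strategy such reviews rely on, the sketch below matches records first by DOI and otherwise by a normalized title (real reference managers add fuzzier matching on authors, year and journal; all names here are illustrative):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and strip punctuation/whitespace so formatting differences
    between databases do not hide a duplicate record."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def dedupe_records(records):
    """Keep the first record per DOI (when present) or per normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get("doi", "").lower() or normalize_title(rec.get("title", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"doi": "10.1000/x1", "title": "A Study of Something"},
    {"doi": "10.1000/X1", "title": "A study of something"},   # same DOI, kept once
    {"doi": "", "title": "De-duplication: Why It Matters"},
    {"doi": "", "title": "De-Duplication - why it matters"},  # same title, kept once
]
print(len(dedupe_records(records)))  # 2
```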
To address the problem that, under the multi-tenant shared storage model of cloud Software as a Service (SaaS), a malicious service provider may forge, delete or tamper with the replicas of data that tenants have stored under customized replication, and taking into account the characteristics of multi-tenant shared data storage as well as inter-tenant privacy and isolation requirements, a tenant-oriented duplication integrity checking scheme (TDIC) is proposed. TDIC reduces the cost of generating verification objects by periodically taking random samples of tenant replica tuples. To accommodate dynamic updates of tenant data, a tenant duplication authentication structure (TDAS) is established; TDAS isolates the replica verification information of different tenants on each data node, guaranteeing isolation throughout the verification of tenant replicas. By combining homomorphic tags over tenant tuples with TDAS, TDIC can delegate the sampled checking of tenant replicas to a trusted third party without disclosing the content of tenant data. Analysis shows that when a tenant's logical view contains ten thousand data tuples and the tuple corruption rate is 1%, the number of random samples required to detect the corruption is at most about 5% of the total number of tuples, which effectively reduces system resource consumption compared with verifying all replicas.
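The sampling figure in this abstract follows from elementary probability: if a fraction p of tuples is corrupted, a random sample of n tuples misses every corrupted tuple with probability roughly (1 − p)^n. A minimal sketch of that calculation (a with-replacement approximation of the exact hypergeometric case; function names are illustrative):

```python
import math

def detection_probability(sampled: int, corruption_rate: float) -> float:
    """Probability that a random sample hits at least one corrupted tuple."""
    return 1.0 - (1.0 - corruption_rate) ** sampled

def samples_needed(corruption_rate: float, target: float) -> int:
    """Smallest sample size whose detection probability reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - corruption_rate))

# The abstract's scenario: 10,000 tuples, 1% corrupted, 5% sampled.
print(f"{detection_probability(500, 0.01):.3f}")  # ~0.993
print(samples_needed(0.01, 0.99))                 # 459 tuples, under 5% of 10,000
```

Sampling 500 of 10,000 tuples thus detects a 1% corruption with over 99% probability, which is why spot-checking is so much cheaper than verifying every replica.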
Detecting duplicates in data streams is an important problem with a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios and, on the other hand, the elements in data streams are always time-sensitive. This makes it particularly significant to approximately detect duplicates among the newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, which effectively removes stale elements as new elements continuously arrive over sliding windows. On the basis of the DBF we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O(√G/W). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to previous results. This work was supported by the "Hundred Talents Program" of CAS and the National Natural Science Foundation of China under Grant No. 60772034.
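A simplified sketch in the spirit of the DBF, not the paper's exact structure: it stores per-cell arrival indices instead of decrementing counters, which keeps the same no-false-negatives-within-the-window behaviour with less bookkeeping (cell count, hash count and window size below are illustrative):

```python
import hashlib

class DecayingBloomFilter:
    """Sliding-window duplicate detector: each cell stores the index of the
    last arrival that set it, and a cell older than the window has 'decayed'
    back to empty. False positives are possible; false negatives within the
    window are not."""

    def __init__(self, size=1 << 16, hashes=4, window=10_000):
        self.cells = [0] * size  # 0 means never set
        self.size, self.hashes, self.window = size, hashes, window
        self.clock = 0  # number of elements seen so far

    def _positions(self, item: str):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.size

    def seen_then_add(self, item: str) -> bool:
        """Report whether item probably appeared within the last `window`
        arrivals, then record the item."""
        self.clock += 1
        fresh = self.clock - self.window  # arrivals at or before this decayed
        dup = all(self.cells[p] > 0 and self.cells[p] > fresh
                  for p in self._positions(item))
        for p in self._positions(item):
            self.cells[p] = self.clock
        return dup

dbf = DecayingBloomFilter(window=3)
for item in ["a", "b", "a", "c", "d", "e", "a"]:
    print(item, dbf.seen_then_add(item))
# The second "a" is flagged as a duplicate; the final "a" is not,
# because it has aged out of the 3-element sliding window.
```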
A duplicate identification model is presented to deal with semi-structured or unstructured data extracted from multiple data sources in the deep web. First, the extracted data are transformed into entity records in the data preprocessing module; then, the heterogeneous records processing module calculates the similarity degree of the entity records to obtain the duplicate records, based on the weights calculated in the homogeneous records processing module. Unlike traditional methods, the proposed approach is implemented without schema matching in advance, and multiple estimators with selective algorithms are adopted to reach better matching efficiency. The experimental results show that the duplicate identification model is feasible and efficient. This work was supported by the National Natural Science Foundation of China (No. 60673139).
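A minimal sketch of weight-based record matching of the kind described, assuming the per-field weights are already available (hard-coded below, whereas the paper derives them in its homogeneous records processing module) and using a generic string similarity per field:

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities over the shared fields."""
    shared = r1.keys() & r2.keys() & weights.keys()
    total = sum(weights[f] for f in shared)
    if not total:
        return 0.0
    return sum(weights[f] * field_similarity(r1[f], r2[f]) for f in shared) / total

weights = {"title": 0.5, "author": 0.3, "year": 0.2}  # illustrative weights
a = {"title": "Deep Web Data Extraction", "author": "Li Wei", "year": "2009"}
b = {"title": "Deep-web data extraction", "author": "LI Wei", "year": "2009"}
print(record_similarity(a, b, weights) > 0.9)  # True: likely a duplicate pair
```

Because matching operates on shared field names rather than a pre-aligned global schema, records from heterogeneous sources can be compared directly, which mirrors the abstract's "no schema matching in advance" claim.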