Modern applications require large databases to be searched for regions that are similar to a given pattern. The DNA sequence analysis, speech and text recognition, artificial intelligence, Internet of Things, and many...Modern applications require large databases to be searched for regions that are similar to a given pattern. The DNA sequence analysis, speech and text recognition, artificial intelligence, Internet of Things, and many other applications highly depend on pattern matching or similarity searches. In this paper, we discuss some of the string matching solutions developed in the past. Then, we present a novel mathematical model to search for a given pattern and it’s near approximates in the text.展开更多
This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for docu...This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when applying to non-segmented document, the challenge is to identify any interesting pattern efficiently. There are two main phases in the propose method: preprocessing phase and clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest called Frequent Max substrings that are long and frequent substrings, rather than individual words from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms together with their number of occurrences form a document vector. In the clustering phase, SOM is used to generate the document cluster map by using the feature vector of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering the Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find the relevant documents more efficiently.展开更多
Ensuring the correctness of answers to substring queries has not been a concern for consumers working within the traditional confines of their own organisational infrastructure. This is due to the fact that organisati...Ensuring the correctness of answers to substring queries has not been a concern for consumers working within the traditional confines of their own organisational infrastructure. This is due to the fact that organisations generally trust their handling of their own data hosted on their own servers and networks. With cloud computing however, where both data and processing are delegated to unknown servers, guarantees of the correctness of queries need to be available. The verification of the results of substring searches has not been given much focus to date within the wider scope of data and query, verification. We present a verification scheme for existential substring searc, hes on text files, which is the first of its kind to satisfy the desired properties of authenticity, completeness, and freshness. The scheme is based on suffix arrays, Merkle hash trees and cryptographic hashes to provide strong guarantees of correctness for the consumer, even in fully untrusted environments. We provide a description of our scheme, along with the results of experiments conducted on a fully-working prototype.展开更多
本文介绍了一种基于最大公共子串(Longest Common Substring,LCS)算法的术语抽取方法:按标点符号对领域文档进行切分;抽取切分后的语句片断的所有最大公共子串作为候选术语集;通过停用词过滤、对照领域词筛选和术语嵌套子串筛选等规...本文介绍了一种基于最大公共子串(Longest Common Substring,LCS)算法的术语抽取方法:按标点符号对领域文档进行切分;抽取切分后的语句片断的所有最大公共子串作为候选术语集;通过停用词过滤、对照领域词筛选和术语嵌套子串筛选等规则进行判别,得到最终的术语集。通过学前教育领域术语抽取的实验,验证了该算法可以有效地抽取中文领域术语:术语抽取平均准确率达84.2%;4~6字符双词术语抽取的效果尤佳,准确率接近100%。展开更多
高效求解2个字符串的最长公共子串(Longest Common Substring)是实现很多字符串算法的关键。文中首先给出了求解LCP问题的动态规划算法,广义后缀树算法,研究并分析了这两种算法,得出动态规划算法易于理解,但时间复杂度较高;广义后缀树...高效求解2个字符串的最长公共子串(Longest Common Substring)是实现很多字符串算法的关键。文中首先给出了求解LCP问题的动态规划算法,广义后缀树算法,研究并分析了这两种算法,得出动态规划算法易于理解,但时间复杂度较高;广义后缀树算法的时间复杂度较低,但实现较为复杂并且广义后缀树占用的空间也较多。最后提出了一个新算法,该算法使用2个字符串的广义后缀数组,在保持和广义后缀树时间复杂度相等的基础上,可以简单地实现并且占用较少的空间。展开更多
文摘Modern applications require large databases to be searched for regions that are similar to a given pattern. The DNA sequence analysis, speech and text recognition, artificial intelligence, Internet of Things, and many other applications highly depend on pattern matching or similarity searches. In this paper, we discuss some of the string matching solutions developed in the past. Then, we present a novel mathematical model to search for a given pattern and it’s near approximates in the text.
文摘This paper proposes a non-segmented document clustering method using self-organizing map (SOM) and frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and is successful in many applications. However, when applying to non-segmented document, the challenge is to identify any interesting pattern efficiently. There are two main phases in the propose method: preprocessing phase and clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest called Frequent Max substrings that are long and frequent substrings, rather than individual words from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms together with their number of occurrences form a document vector. In the clustering phase, SOM is used to generate the document cluster map by using the feature vector of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering the Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find the relevant documents more efficiently.
文摘Ensuring the correctness of answers to substring queries has not been a concern for consumers working within the traditional confines of their own organisational infrastructure. This is due to the fact that organisations generally trust their handling of their own data hosted on their own servers and networks. With cloud computing however, where both data and processing are delegated to unknown servers, guarantees of the correctness of queries need to be available. The verification of the results of substring searches has not been given much focus to date within the wider scope of data and query, verification. We present a verification scheme for existential substring searc, hes on text files, which is the first of its kind to satisfy the desired properties of authenticity, completeness, and freshness. The scheme is based on suffix arrays, Merkle hash trees and cryptographic hashes to provide strong guarantees of correctness for the consumer, even in fully untrusted environments. We provide a description of our scheme, along with the results of experiments conducted on a fully-working prototype.
文摘本文介绍了一种基于最大公共子串(Longest Common Substring,LCS)算法的术语抽取方法:按标点符号对领域文档进行切分;抽取切分后的语句片断的所有最大公共子串作为候选术语集;通过停用词过滤、对照领域词筛选和术语嵌套子串筛选等规则进行判别,得到最终的术语集。通过学前教育领域术语抽取的实验,验证了该算法可以有效地抽取中文领域术语:术语抽取平均准确率达84.2%;4~6字符双词术语抽取的效果尤佳,准确率接近100%。
文摘高效求解2个字符串的最长公共子串(Longest Common Substring)是实现很多字符串算法的关键。文中首先给出了求解LCP问题的动态规划算法,广义后缀树算法,研究并分析了这两种算法,得出动态规划算法易于理解,但时间复杂度较高;广义后缀树算法的时间复杂度较低,但实现较为复杂并且广义后缀树占用的空间也较多。最后提出了一个新算法,该算法使用2个字符串的广义后缀数组,在保持和广义后缀树时间复杂度相等的基础上,可以简单地实现并且占用较少的空间。