Abstract: Many data sets contain temporal records that span a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at a particular time (e.g., author information in DBLP). In such cases, we often wish to identify records that describe the same entity over time, so that we can perform interesting longitudinal data analysis. However, existing record linkage techniques ignore temporal information and fall short for temporal data. This article studies linking temporal records. First, we apply time decay to capture the effect of elapsed time on entity value evolution. Second, instead of comparing each pair of records locally, we propose clustering methods that consider the time order of the records and make global decisions. Experimental results show that our algorithms significantly outperform traditional linkage methods on various temporal data sets.
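The time-decay idea in the abstract above can be illustrated with a minimal sketch. The exponential decay form, the `half_life` parameter, the attribute weights, and the function names are all assumptions made for illustration; the abstract does not give the authors' decay definition. The sketch only shows the intuition that a mismatch between two records should count less as the time gap between them grows.

```python
import math

def disagreement_decay(gap_years: float, half_life: float = 5.0) -> float:
    """Hypothetical decay: 0 for no time gap, approaching 1 for long gaps."""
    return 1.0 - math.exp(-math.log(2) * gap_years / half_life)

def temporal_similarity(rec_a: dict, rec_b: dict, weights: dict,
                        half_life: float = 5.0) -> float:
    """Weighted attribute agreement where mismatches are discounted by elapsed time."""
    gap = abs(rec_a["year"] - rec_b["year"])
    # Fraction of the mismatch penalty that still applies after `gap` years.
    keep = 1.0 - disagreement_decay(gap, half_life)
    agree = sum(w for attr, w in weights.items() if rec_a[attr] == rec_b[attr])
    disagree = sum(w for attr, w in weights.items() if rec_a[attr] != rec_b[attr])
    total = agree + keep * disagree
    return agree / total if total else 0.0

# Illustrative records (made up): same name, different affiliation.
weights = {"name": 1.0, "affiliation": 1.0}
r2000 = {"name": "X. Li", "affiliation": "Univ. A", "year": 2000}
same_year = {"name": "X. Li", "affiliation": "Univ. B", "year": 2000}
five_years_later = {"name": "X. Li", "affiliation": "Univ. B", "year": 2005}

print(temporal_similarity(r2000, same_year, weights))
print(temporal_similarity(r2000, five_years_later, weights))
```

With these assumed settings, the affiliation mismatch lowers the score less for the record dated five years later than for the record from the same year, which is the behaviour a time-aware linkage method would want before clustering.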
Abstract: AIM: To establish the hospitalized prevalence of severe Crohn's disease (CD) and ulcerative colitis (UC) in Wales from 1999 to 2007, and to investigate long-term mortality after hospitalization and associations with social deprivation and other socio-demographic factors. METHODS: Record linkage of administrative inpatient and mortality data for 1467 and 1482 people hospitalized as emergencies for ≥ 3 d for CD and UC, respectively. The main outcome measures were hospitalized prevalence, mortality rates and standardized mortality ratios for up to 5 years of follow-up after hospitalization. RESULTS: Hospitalized prevalence was 50.1 per 100 000 population for CD and 50.6 for UC. The hospitalized prevalence of CD was significantly higher (P < 0.05) in females (57.4) than in males (42.2), and was highest in people aged 16-29 years, but the prevalence of UC was similar in males (51.0) and females (50.1), and increased continuously with age. The hospitalized prevalence of CD was slightly higher in the most deprived areas, but there was no association between social deprivation and hospitalized prevalence of UC. Mortality was 6.8% and 14.6% after 1 and 5 years of follow-up for CD, and 9.2% and 20.8% after 1 and 5 years for UC. For both CD and UC, there was little discernible association between mortality and social deprivation, distance from hospital, urban/rural residence and geography. CONCLUSION: CD and UC have distinct demographic profiles. The higher prevalence of hospitalized CD in more deprived areas may reflect higher prevalence and higher hospital dependency.
Funding: This research project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2022R234), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Abstract: Systems based on electronic health records (EHRs) have been in use for many years, and their wider adoption has been felt recently. They have pioneered the collection of massive volumes of health data. Duplicate detection involves discovering records that refer to the same real-world entities, a task that generally depends on several input parameters supplied by experts. Record linkage addresses the problem of finding identical records across various data sources. The similarity between two records is characterized by domain-based similarity functions over different features. De-duplication of one data set, or linkage of multiple data sets, has become a highly significant operation in the data processing stages of different data mining programmes. The objective is to match all the records associated with the same entity. Various measures have been used to represent the quality and complexity of data linkage algorithms, and many novel metrics have been introduced. An outline of the problems in measuring data linkage and de-duplication quality and complexity is presented. This article focuses on the reprocessing of health data that is horizontally divided among data custodians, with the custodians providing similar features for sets of patients. The first step of the technique is the automatic selection of high-quality training examples from the compared record pairs, and the second step is training the reciprocal neuro-fuzzy inference system (RANFIS) classifier. For the Optimal Threshold classifier, it is presumed that the original match status is known for all compared record pairs (i.e., Ant Lion Optimization), so an optimal threshold can be computed based on the respective RANFIS. The Febrl, Clinical Decision (CD), and Cork Open Research Archive (CORA) data repositories are used to evaluate the proposed method against benchmarks of current techniques.
Abstract: Cloud storage is essential for managing user data, which is stored in and retrieved from distributed data centres. The storage service is offered on a pay-per-use basis according to the amount of data stored. Because the massive amount of data held in a data centre contains similar information and file structures kept in multiple copies, duplication increases the storage space required. Existing deduplication systems do not reduce data efficiently because they identify similar data inaccurately, which increases storage consumption and cost. To resolve this problem, this paper proposes an efficient storage reduction scheme called Hash-Indexing Block-based Deduplication (HIBD), based on Segmented Bind Linkage (SBL) methods, for reducing storage in a cloud environment. Initially, preprocessing is done using the sparse augmentation technique. Next, the preprocessed files are segmented into blocks to build a hash index. The blocks are compared with other files through Semantic Content Source Deduplication (SCSD), which identifies similar content between files. Based on the content presence count, the Distance Vector Weightage Correlation (DVWC) estimates the document similarity weight, and related files are grouped into a cluster. Finally, the segmented bind linkage compares the documents to find duplicate content in the cluster, using a similarity weight based on the coefficient match case. This implementation helps identify data redundancy efficiently and reduces the service cost of distributed cloud storage.
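Abstract: Multi-party privacy-preserving record linkage (PPRL) is the process of identifying, under privacy protection, the records in multiple data sources that represent the same real-world entity. Apart from the final matching result, which is shared among the data sources, no other information is disclosed. As data volumes grow and real-world data quality problems persist (e.g., spelling errors and transposed order), the scalability and fault tolerance of multi-party PPRL methods are challenged. Most existing multi-party PPRL methods are exact-matching methods and are not fault tolerant. A few approximate multi-party PPRL methods are fault tolerant, but when handling data with quality problems they cannot effectively find the entities common to the data sources, because of poor fault tolerance and excessive time cost. Therefore, an approximate multi-party PPRL method is proposed that combines Bloom filters, secure summation, a dynamic threshold, a check mechanism, and an improved Dice similarity function. First, Bloom filters are used to convert each record in each data source into a bit array of 0s and 1s. Then, the ratio of 1 bits at each corresponding position is computed, and the dynamic threshold and check mechanism are used to determine which positions match. Finally, the improved Dice similarity function computes the similarity between records, which determines whether the records match. Experimental results show that the proposed method scales well and, while maintaining precision, achieves higher fault tolerance than existing approximate multi-party PPRL methods.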
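The first and last steps described above (Bloom filter encoding, then a Dice comparison of bit arrays) can be sketched as follows. This is the standard two-record Dice coefficient on Bloom filters, not the improved multi-party function proposed in the paper, and it omits secure summation, the dynamic threshold, and the check mechanism. The bigram tokenisation, filter size, number of hash functions, and the use of seeded SHA-256 are all assumptions chosen for illustration.

```python
import hashlib

def bigrams(value: str) -> set:
    """Split a padded, lower-cased string into character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value: str, size: int = 256, num_hashes: int = 2) -> list:
    """Map each bigram to `num_hashes` positions of a 0/1 array (assumed parameters)."""
    bits = [0] * size
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits

def dice(a: list, b: list) -> float:
    """Standard Dice coefficient: 2 * |common 1 bits| / (|1 bits in a| + |1 bits in b|)."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0

# Identical values encode to identical filters, so their Dice similarity is 1.0.
print(dice(bloom_encode("smith"), bloom_encode("smith")))
# A misspelling shares most bigrams, so its score stays well above an unrelated name's.
print(dice(bloom_encode("smith"), bloom_encode("smyth")))
print(dice(bloom_encode("smith"), bloom_encode("jones")))
```

Because similar strings share bigrams and therefore share set bits, the comparison tolerates spelling errors without the parties exchanging the plain-text values, which is the property approximate PPRL methods rely on.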
Funding: Supported in part by the National Institute for Health Research, England, Grant No. NCCRCD ZRC/002/002/026.
Abstract: AIM: To investigate associations between perinatal risk factors and subsequent inflammatory bowel disease (IBD) in children and young adults. METHODS: Record-linked abstracts of birth registrations, maternity, day case and inpatient admissions in a defined population of southern England. Investigation of 20 perinatal factors relating to the maternity or the birth: maternal age, Crohn's disease (CD) or ulcerative colitis (UC) in the mother, maternal social class, marital status, smoking in pregnancy, ABO blood group and rhesus status, pre-eclampsia, parity, the infant's presentation at birth, caesarean delivery, forceps delivery, sex, number of babies delivered, gestational age, birthweight, head circumference, breastfeeding and Apgar scores at one and five minutes. RESULTS: Maternity records were present for 180 children who subsequently developed IBD. Univariate analysis showed increased risks of CD among children of mothers with CD (P = 0.011, based on two cases of CD in both mother and child) and children of mothers who smoked during pregnancy. Multivariate analysis confirmed increased risks of CD among children of mothers who smoked (odds ratio = 2.04, 95% CI = 1.06-3.92) and for older mothers aged 35+ years (4.81, 2.32-9.98). Multivariate analysis showed that there were no significant associations between CD and 17 other perinatal risk factors investigated. It also showed that, for UC, there were no significant associations with the perinatal factors studied. CONCLUSION: This study shows an association between CD in mother and child, and elevated risks of CD in children of older mothers and of mothers who smoked.
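Abstract: Linking records that represent the same entity across the databases of different organizations, while protecting the privacy of the entities stored in those databases, is one of the core techniques for integrating multi-source data securely and effectively. However, the blocking methods in existing privacy-preserving record linkage (PPRL) techniques cannot guarantee high recall and high precision at the same time, matching methods with strong privacy incur excessive time cost, and matching across more than two databases has received little study. To address these problems, a multi-party strong-privacy-preserving record linkage method (MP-SPPRL) is proposed. First, a two-pass blocking method that combines locality-sensitive hashing (LSH) with suffix blocking is proposed, and a block dispersion measure is introduced to tune the two blocking passes, which effectively improves precision while preserving the high recall of MP-SPPRL. Next, a sliding window merges blocks to generate candidate record groups, which maintains the fault tolerance of MP-SPPRL. Then, using a Hamming distance computation based on homomorphic encryption, a scalable multi-party record matching algorithm based on secure multi-party computation (SMC) is designed for large data sets; by reducing the number of encrypted records and terminating early the distance computation for candidate record groups that cannot match, it significantly reduces the time cost of matching and improves the efficiency of MP-SPPRL. Finally, extensive experiments verify the high recall, high precision, and efficiency of MP-SPPRL.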
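The early-termination idea mentioned above can be sketched on plain bit vectors. This sketch is an assumption-laden simplification: it works on unencrypted lists and two records only, whereas the paper computes Hamming distances under homomorphic encryption across multiple parties. The function name and the `max_distance` threshold are hypothetical.

```python
from typing import Optional, Sequence

def hamming_within(a: Sequence[int], b: Sequence[int],
                   max_distance: int) -> Optional[int]:
    """Return the Hamming distance if it is at most max_distance, else None.

    The loop stops as soon as the running count exceeds the threshold,
    so pairs that cannot match are abandoned without a full comparison.
    """
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    distance = 0
    for x, y in zip(a, b):
        if x != y:
            distance += 1
            if distance > max_distance:
                return None  # cannot match; stop early
    return distance

print(hamming_within([1, 0, 1, 1], [1, 1, 1, 0], max_distance=2))  # 2
print(hamming_within([1, 0, 1, 1], [1, 1, 1, 0], max_distance=1))  # None
```

In an encrypted setting each position comparison is far more expensive than here, which is why skipping the remainder of a hopeless comparison would matter for total running time.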
Abstract: The use of unique identifiers such as name, home address and social security number to link different datasets is common and well published. The theoretical concepts of the probabilistic algorithm in record linkage are also well defined in the literature. However, few studies have reported applications of the probabilistic algorithm using non-unique identifiers. In this paper, we investigate several variables (weight, height, waist, age, sex, smoking and alcohol habit) as non-unique identifiers, using a Japanese cohort dataset with a three-year baseline of 1989-1991, to observe how effectively these identifiers can be used and what influence they may have on record linkage. Moreover, we modify the conditions of these identifiers and estimate the sensitivity, specificity and accuracy for comparison. We further investigate this using an extended ten-year baseline of 1989-1999. We conclude that the combination of age, sex, weight and height yields better sensitivity, specificity and accuracy than other combinations in both men and women with the three-year baseline, whereas the combination of age, sex and height performs better in both men and women with the ten-year baseline.
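The three evaluation measures named above follow directly from the counts of correct and incorrect link decisions. The sketch below uses the standard definitions; the function name and the counts in the usage example are made up for illustration and are not results from the study.

```python
def linkage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity and accuracy from link-decision counts.

    tp: true matches linked        fp: non-matches wrongly linked
    tn: non-matches left unlinked  fn: true matches missed
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Guard against empty classes rather than dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }

# Hypothetical counts for one identifier combination.
print(linkage_metrics(tp=80, fp=10, tn=90, fn=20))
# {'sensitivity': 0.8, 'specificity': 0.9, 'accuracy': 0.85}
```

Computing the same three numbers for each identifier combination gives a like-for-like basis for the kind of comparison the abstract describes.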
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62002262, 61672142, 61602103, 62072086, 62072084) and the National Key Research and Development Project of China (2018YFB1003404).
Abstract: 1 Introduction. Record linkage (RL) groups records corresponding to the same entities in datasets, and is a long-standing topic in the data management and mining communities [1-2]. In the big data era, real-time data applications have become popular and call for pay-as-you-go RL (PRL), which produces as many match pairs as possible in a very limited time (much shorter than the overall RL runtime).