Parallel computing techniques have been introduced into digital image correlation (DIC) in recent years and have led to a surge in computation speed. Graphics processing unit (GPU)-based parallel computing has demonstrated a surprising effect in accelerating iterative subpixel DIC compared with CPU-based parallel computing. In this paper, the performance of the two kinds of parallel computing techniques is compared for the previously proposed path-independent DIC method, in which the initial guess for the inverse compositional Gauss-Newton (IC-GN) algorithm at each point of interest (POI) is estimated through the fast Fourier transform-based cross-correlation (FFT-CC) algorithm. Based on the performance evaluation, a heterogeneous parallel computing (HPC) model with a hybrid mode of parallelism is proposed to combine the computing power of the GPU and the multicore CPU. A trial computation scheme is developed to optimize the configuration of the HPC model on a specific computer. The proposed HPC model shows excellent performance on a mid-range desktop computer for real-time subpixel DIC at a high resolution of more than 10000 POIs per frame.
Porous materials present significant advantages for absorbing radioactive isotopes in nuclear waste streams. To improve absorption efficiency in nuclear waste treatment, a thorough understanding of the diffusion-advection process within porous structures is essential for material design. In this study, we present advancements in the volumetric lattice Boltzmann method (VLBM) for modeling and simulating pore-scale diffusion-advection of radioactive isotopes within geopolymer porous structures. These structures are created using the phase field method (PFM) to precisely control pore architectures. In our VLBM approach, we introduce a concentration field of an isotope coupled with the velocity field and solve it through the time evolution of its particle population function. To address the computational intensity inherent in the coupled lattice Boltzmann equations for the velocity and concentration fields, we implement graphics processing unit (GPU) parallelization. The developed model is validated by examining the flow and diffusion fields in porous structures. Good agreement is observed both between the velocity fields from VLBM and the multiphysics object-oriented simulation environment (MOOSE) and between the concentration fields from VLBM and the finite difference method (FDM). Furthermore, we investigate the effects of background flow, species diffusivity, and porosity on the diffusion-advection behavior by varying the background flow velocity, diffusion coefficient, and pore volume fraction, respectively. All three parameters influence the diffusion-advection process. Increased background flow and diffusivity markedly accelerate the process due to increased advection intensity and enhanced diffusion capability, respectively. Conversely, increasing the porosity has a less significant effect, causing a slight slowdown of the diffusion-advection process due to the expanded pore volume. This comprehensive parametric study provides valuable insights into the kinetics of isotope uptake in porous structures, facilitating the de...
Particle breakage greatly influences the mechanical behavior of granular materials (GM), such as ballast, rockfill and sand, under external loads. The discrete element method (DEM) is one of the most popular methods for simulating GM because each particle is represented individually. To study the mechanism of particle breakage, a cohesive contact model is developed based on the GPU-accelerated DEM code Blaze-DEM. A database of 3D geometry models of rock blocks is established using 3D scanning. An agglomerate that describes each rock block with a series of non-overlapping spherical particles is used to build the DEM numerical model of a railway ballast sample, which is then used in a DEM oedometric test to study the particle breakage characteristics of the sample under external load. Furthermore, to obtain the meso-mechanical parameters used in DEM, a back-analysis method is applied based on laboratory tests of the rock sample. Based on the DEM numerical tests, the particle breakage process and mechanisms of the railway ballast are studied. The results show that the developed code is well suited to large-scale simulation of particle breakage in granular materials.
In the design of a graphics processing unit (GPU), the processing speed of triangle rasterization is an important factor that determines the performance of the GPU. An architecture for a multi-tile parallel-scan rasterization accelerator is proposed in this paper. The accelerator uses a bounding box algorithm to improve scanning efficiency. It rasterizes multiple tiles in parallel and scans multiple lines at the same time within each tile. This highly parallel approach drastically improves rasterization performance. Using the 65 nm process standard cell library of Semiconductor Manufacturing International Corporation (SMIC), the accelerator can be synthesized to a maximum clock frequency of 220 MHz. An implementation on the Genesys2 field programmable gate array (FPGA) board fully verifies the functionality of the accelerator. The implementation shows a significant improvement in rendering speed and efficiency and proves its suitability for high-performance rasterization.
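The bounding-box scan and per-tile decomposition mentioned in the abstract above can be illustrated with a minimal software sketch. This is not the accelerator's hardware design: the function names (`edge`, `rasterize_tile`, `rasterize`), the 4-pixel tile size, and the pixel-centre sampling rule are all assumptions chosen for illustration, and the tiles are processed in a serial loop where the hardware would process them in parallel.

```python
import math

def edge(ax, ay, bx, by, px, py):
    """Edge function: positive when P lies to the left of the directed edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize_tile(tri, x_lo, y_lo, x_hi, y_hi):
    """Return pixels (x, y) in [x_lo, x_hi) x [y_lo, y_hi) whose centres are
    inside the triangle. `tri` holds three (x, y) vertices in either winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return set()  # degenerate triangle covers nothing
    # Intersect the triangle's bounding box with the tile so that only
    # pixels that can possibly be covered are scanned.
    bx_lo = max(x_lo, math.floor(min(x0, x1, x2)))
    bx_hi = min(x_hi, math.ceil(max(x0, x1, x2)))
    by_lo = max(y_lo, math.floor(min(y0, y1, y2)))
    by_hi = min(y_hi, math.ceil(max(y0, y1, y2)))
    covered = set()
    for y in range(by_lo, by_hi):          # scan lines within the tile
        for x in range(bx_lo, bx_hi):
            px, py = x + 0.5, y + 0.5      # sample at the pixel centre
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.add((x, y))
    return covered

def rasterize(tri, width, height, tile=4):
    """Rasterize tile by tile; each tile is independent, so the loop body
    is the unit of work that could run in parallel."""
    covered = set()
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            covered |= rasterize_tile(tri, tx, ty,
                                      min(tx + tile, width),
                                      min(ty + tile, height))
    return covered
```

Because each tile only reads the triangle and writes its own pixels, the per-tile results can be merged without coordination, which is what makes this decomposition a natural fit for parallel hardware.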
When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), a partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between memory and disk. To address the high IO overhead between disk and memory, we optimized twice partitioning with fine-grained GPU scheduling to reduce the IO overhead between CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we proposed balanced partitioning and dynamic scheduling methods to accelerate training in different cases. Building on these methods, we proposed fine-grained partitioning KGE, an efficient KGE training framework for multiple GPUs. We conducted experiments on several knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks for KGE training.
A new Graphics Processing Unit (GPU) parallelization strategy is proposed to accelerate sparse finite element computation for three-dimensional electromagnetic analysis. The parallelization strategy is based on a new compression format called sliced ELL Four (sliced ELL-F). The sliced ELL-F format-based parallelization strategy is designed to accelerate the many addition, dot product, and Sparse Matrix Vector Product (SMVP) operations in the Conjugate Gradient Norm (CGN) calculation of finite element equations. The new implementation of SMVP on GPUs is evaluated. The proposed strategy executed on a GPU can efficiently solve sparse finite element equations, especially when the equations are highly sparse (the size of most rows in the coefficient matrix is less than 8). Numerical results show that the sliced ELL-F format-based parallelization strategy can reach significant speedups compared to the Compressed Sparse Row (CSR) format.
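The abstract above does not define the sliced ELL-F layout, so the following is only a sketch of the general sliced ELLPACK idea that such formats build on, not the paper's format. Rows are grouped into slices, and each slice is padded only to the length of its own longest row. The function names and the `(start, cols, vals)` tuple layout are assumptions made for illustration.

```python
def to_sliced_ell(dense, slice_height):
    """Convert a dense row-major matrix (list of lists) to a sliced ELLPACK
    layout: a list of (first_row, column_indices, values) per slice."""
    slices = []
    for start in range(0, len(dense), slice_height):
        rows = dense[start:start + slice_height]
        # Nonzero (column, value) pairs of each row in this slice.
        entries = [[(j, v) for j, v in enumerate(row) if v != 0] for row in rows]
        # Pad every row only to the longest row of this slice.
        width = max((len(e) for e in entries), default=0)
        cols = [[e[k][0] if k < len(e) else 0 for k in range(width)] for e in entries]
        vals = [[e[k][1] if k < len(e) else 0.0 for k in range(width)] for e in entries]
        slices.append((start, cols, vals))
    return slices

def spmv(slices, x, n_rows):
    """Compute y = A @ x from the sliced layout. Padded entries hold the
    value 0.0, so they contribute nothing to the row sum."""
    y = [0.0] * n_rows
    for start, cols, vals in slices:
        for r, (col_row, val_row) in enumerate(zip(cols, vals)):
            acc = 0.0
            for c, v in zip(col_row, val_row):
                acc += v * x[c]
            y[start + r] = acc
    return y
```

On a GPU, one thread would typically handle one row, and the fixed width within a slice keeps the threads of that slice in lockstep. Per-slice padding wastes less storage than padding every row to the global maximum, which matters most when row lengths are short and uneven.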
In this study, insights into the effect of interfacial anisotropy on complex hexagonal close-packed (hcp) dendritic growth during alloy solidification were gained by graphics processing unit (GPU)-accelerated three-dimensional (3D) phase-field simulations, as demonstrated for a Mg-Gd alloy. An anisotropic phase-field model with finite interface dissipation was developed by incorporating the contribution of the anisotropy of interfacial energy into the total free energy functional. A modified spherical harmonic anisotropy function was then chosen for the hcp crystal. A GPU parallel computing algorithm was implemented in the present phase-field model, and a corresponding code was developed on the compute unified device architecture parallel computing platform. Benchmark tests indicated that the calculation efficiency of a single TESLA V100 GPU could be ~80 times that of open multi-processing (OpenMP) with eight central processing unit cores. By coupling the phase-field model with reliable thermodynamic and interfacial energy descriptions, the 3D phase-field simulation of α-Mg dendritic growth in the Mg-6Gd (in wt%) alloy during solidification was performed. Various two-dimensional dendrite morphologies were revealed by cutting the simulated 3D dendrite along different crystallographic planes. Typical sixfold equiaxed and butterfly-shaped microstructures observed in experiments were well reproduced.
A graphics processing unit (GPU)-accelerated vector-form particle-element method, i.e., the finite particle method (FPM), is proposed for 3D elastoplastic contact of structures involving strong nonlinearities and computationally expensive contact calculations. A hexahedral FPM element with reduced integration and anti-hourglass control is developed to model structural elastoplastic behaviors. The 3D space containing contact surfaces is decomposed into cubic cells, and the contact search is performed between adjacent cells to improve search efficiency. A connected list data structure is used for storing contact particles to facilitate the parallel contact search procedure. The contact constraints are enforced by explicitly applying normal and tangential contact forces to the contact particles. The proposed method is fully accelerated by GPU-based parallel computing. After verification, the performance of the proposed method is compared with the serial finite element code Abaqus/Explicit by testing two large-scale contact examples. The maximum speedup of the proposed method over Abaqus/Explicit is approximately 80 for the overall computation and 340 for contact calculations. Therefore, the proposed method is shown to be effective and efficient.
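The cell decomposition and list-based particle storage described in the abstract above can be sketched as a generic cell-linked-list neighbor search. This is a serial illustration under stated assumptions, not the paper's implementation: the names `build_cells` and `candidate_pairs`, the use of a dictionary for cell heads, and the choice of a search radius equal to the cell size are all hypothetical.

```python
def build_cells(points, cell):
    """Bin particles into cubic cells of edge `cell`. `head` maps a cell
    index to the first particle in it; `nxt[i]` is the next particle in the
    same cell, or -1 at the end of the chain."""
    head = {}
    nxt = [-1] * len(points)
    for i, (x, y, z) in enumerate(points):
        key = (int(x // cell), int(y // cell), int(z // cell))
        nxt[i] = head.get(key, -1)   # push particle i onto the cell's chain
        head[key] = i
    return head, nxt

def candidate_pairs(points, cell):
    """Return all pairs (i, j), i < j, closer than or equal to `cell`,
    checking only particles in the same or adjacent cells."""
    head, nxt = build_cells(points, cell)
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
               for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    pairs = set()
    for (cx, cy, cz), first in head.items():
        for dx, dy, dz in offsets:
            other = head.get((cx + dx, cy + dy, cz + dz), -1)
            i = first
            while i != -1:               # walk this cell's chain
                j = other
                while j != -1:           # walk the neighbor cell's chain
                    if i < j:
                        d2 = sum((a - b) ** 2
                                 for a, b in zip(points[i], points[j]))
                        if d2 <= cell * cell:
                            pairs.add((i, j))
                    j = nxt[j]
                i = nxt[i]
    return pairs
```

Restricting the search to the 27 surrounding cells avoids the all-pairs comparison, and the flat `nxt` array is the kind of structure that maps well onto GPU memory because each cell's chain can be traversed independently.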
The unstructured-grid Finite Volume Coastal Ocean Model (FVCOM) is converted from its original FORTRAN code to Compute Unified Device Architecture (CUDA) C code and optimized on the Graphics Processing Unit (GPU). The proposed GPU-FVCOM is tested against analytical solutions for two standard cases in a rectangular basin: a tide-induced flow and a wind-induced circulation. It is then applied to the coastal waters of Ningbo to simulate the tidal motion and analyze the flow field and the vertical tidal velocity structure. The simulation results agree well with the measured data. The proposed 3-D model runs 30 times faster than a single-threaded program, and the GPU-FVCOM implemented on a Tesla K20 device is faster than on a workstation with 20 CPU cores, which shows that the GPU-FVCOM is efficient for solving large-scale, high-resolution engineering problems in sea areas.
This paper presents a novel geometrical voxelization algorithm for polygonal models. First, distance computation is performed slice by slice on graphics processing units (GPUs) between geometrical primitives and voxels for line/surface voxelization. A novel solid filling process is then proposed to assist surface voxelization and achieve solid voxelization. Furthermore, using the proposed transfer functions, both binary and anti-aliased voxelizations are achievable. Finally, the proposed approach can be applied to voxelize streamlines for 3D vector fields using line voxelization. The proposed approach achieves the desired experimental results.
A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. Techniques for implementing CUDA kernels, a double-layered thread hierarchy and a varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
As a continuation of the authors' previous studies, a new parallel computation approach is proposed to simulate the coupled dynamics of a rigid-flexible multibody system and a compressible fluid. In this approach, the smoothed particle hydrodynamics (SPH) method is used to model the compressible fluid, and the natural coordinate formulation (NCF) and absolute nodal coordinate formulation (ANCF) are used to model the rigid and flexible bodies, respectively. To model the compressible fluid properly and efficiently via the SPH method, three measures are taken. The first is to use a Riemann solver to cope with the fluid compressibility, the second is to define virtual SPH particles to model the dynamic interaction between the fluid and the multibody system, and the third is to impose periodic inflow and outflow boundary conditions to reduce the number of SPH particles involved in the computation. A parallel computation strategy based on the graphics processing unit (GPU) is then proposed to detect neighboring SPH particles and to solve the dynamic equations of the SPH particles in order to improve computational efficiency. Meanwhile, the generalized-alpha algorithm is used to solve the dynamic equations of the multibody system. Finally, four case studies are given to validate the proposed parallel computation approach.
Phase-field simulation has been actively studied as a powerful method for investigating microstructural evolution during solidification. However, performing phase-field simulations at large length and time scales is a great challenge. The developed graphics processing unit (GPU) calculation is used in the phase-field simulation and greatly accelerates the calculation. The results show that computation with the GPU is about 36 times faster than with a single central processing unit (CPU) core. This demonstrates the feasibility of GPU-accelerated phase-field simulation on a desktop computer. The GPU-accelerated strategy will bring new opportunities for the application of phase-field simulation.
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among phase-contrast imaging modalities, grating-based phase-contrast imaging has been widely accepted because it allows a wide range of samples and does not require a coherent source. However, its downside is the substantially larger amount of data generated by the phase-stepping method, which slows down reconstruction. Graphics processing units (GPUs) allow parallel computing, which is very useful for processing large quantities of data. In this paper, a GPU-based compute unified device architecture (CUDA) C program is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speedup over the standard C program on the same Visual Studio 2010 platform, and the speedup ratio increases as the data size increases.
A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the small-input limitation of traditional neural networks. The whole image is taken as the input of the neural network, so the maximum number of features can be kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, and it gained a speedup of 343 over its CPU counterpart. Compared with a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming to compute and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may provide an approach to solving time-consuming problems. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with the main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to the commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). Contact calculation accounts for only 18% of the total calculation time with the FPM, much less than the 50% with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
We present a novel algorithm, BADF (Bounding Volume Hierarchy Based Adaptive Distance Fields), for accelerating the construction of adaptive distance fields (ADFs) of rigid and deformable models on graphics processing units (GPUs). Our approach constructs a bounding volume hierarchy (BVH) and uses that hierarchy to generate an octree-based ADF. We exploit the coherence between successive frames and sort the grid points of the octree to accelerate the computation. Our approach is applicable to rigid and deformable models. Our GPU-based algorithm is about 20x to 50x faster than current mainstream central processing unit based algorithms. Our BADF algorithm can construct the distance fields for deformable models with 60k triangles at interactive rates on an NVIDIA GTX GeForce 1060. Moreover, we observe a 3x speedup over prior GPU-based ADF algorithms.
The gravity gradient is the second derivative of the gravity potential and contains more high-frequency information about Earth's gravity field. Gravity gradient observation data require their prior and intrinsic parts to be deducted to obtain more variational information. A model generated from a topographic surface database is more appropriate for representing gradiometric effects derived from near-surface mass, as other kinds of data can hardly reach the required spatial resolution. The rectangular prism method, namely an analytic integration of Newtonian potential integrals, is a reliable and commonly used approach to modeling the gravity gradient, but its computing efficiency is extremely low. A modified rectangular prism method and a graphics processing unit (GPU) parallel algorithm were proposed to speed up the modeling process. The modified method avoided massive redundant computations by transforming the formulas according to the symmetries of the prisms' integral regions, and the proposed algorithm parallelized this method's computing process. The parallel algorithm was compared with a conventional serial algorithm using 100 elevation data in two topographic areas (rough and moderate terrain). Modeling differences between the two algorithms were less than 0.1 E, which is attributed to precision differences between single-precision and double-precision floating-point numbers. The parallel algorithm showed computational efficiency approximately 200 times higher than the serial algorithm in experiments, demonstrating its effectiveness in speeding up the modeling process. Further analysis indicates that both the modified method and GPU parallelism contributed to the proposed algorithm's performance in experiments.
Graphics processing units (GPUs) employ single instruction multiple data (SIMD) hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after conditional branches. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate the waste of SIMD lanes, threads from multiple warps can be collected together to improve SIMD lane utilization by compacting threads into idle lanes. However, this mechanism induces extra barrier synchronizations, since warps have to be stalled to wait for other warps for compaction, with the result that no warps are scheduled in some cases. In this paper, we propose an approach to reduce the overhead of barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after branches. Moreover, warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing. In addition, a compaction is canceled if idle lanes cannot be reduced via this compaction. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering the performance loss induced by compactions by 13% on average for applications with many non-divergent control flows.
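Two of the rules in the abstract above, bypassing compaction for warps with no divergence and canceling a compaction that cannot reduce idle lanes, can be illustrated with a toy issue-slot count. This is a heavily idealized model and not the paper's mechanism: it assumes a single two-way branch, assumes threads on the same path can be packed densely regardless of lane position, and ignores the rule about warps bypassing while no warp is ready to issue. All function names are hypothetical.

```python
def is_uniform(mask):
    """True when every thread of the warp takes the same branch path."""
    return all(mask) or not any(mask)

def baseline_slots(masks):
    """Issue slots without compaction: a divergent warp executes the
    taken and not-taken paths one after the other."""
    return sum(1 if is_uniform(m) else 2 for m in masks)

def compacted_slots(masks, width):
    """Issue slots with ideal compaction: threads on the same path are
    packed into as few warps of `width` lanes as possible."""
    taken = sum(sum(m) for m in masks)
    not_taken = sum(len(m) for m in masks) - taken
    return -(-taken // width) + -(-not_taken // width)  # ceiling divisions

def schedule(masks, width):
    """Return (compaction_used, total_issue_slots) for one branch.
    Uniform warps bypass compaction; compaction of the divergent warps is
    canceled when it would not reduce their issue slots."""
    divergent = [m for m in masks if not is_uniform(m)]
    bypassing = len(masks) - len(divergent)
    base = baseline_slots(divergent)
    comp = compacted_slots(divergent, width)
    use_compaction = comp < base
    return use_compaction, bypassing + (comp if use_compaction else base)

def utilization(masks, width, slots):
    """Fraction of SIMD lanes doing useful work over the issued slots."""
    return sum(len(m) for m in masks) / (slots * width)
```

For example, with 4-lane warps and branch masks `[[True, True, False, False], [True, False, False, False], [True, True, True, True]]`, the third warp bypasses compaction, and compacting the other two reduces their four issue slots to three.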
For electromagnetic scattering from 3-D complex, electrically large conducting targets, a new hybrid algorithm, the MoM-PO/SBR algorithm, is presented to realize the exchange of information between the method of moments (MoM) and physical optics (PO)/shooting and bouncing ray (SBR). In the algorithm, the COC file, which is based on the Huygens equivalence principle, is introduced, and the conversion interface between the equivalent surface and the target is established. The multi-task flow model presented in this paper is then adopted to conduct CPU/graphics processing unit (GPU) tests of the algorithm under three modes, i.e., MPI/OpenMP, MPI/compute unified device architecture (CUDA) and the multi-task programming model (MTPM). Numerical results are presented and compared with reference solutions to illustrate the accuracy and efficiency of the proposed algorithm.
Funding (DIC heterogeneous parallel computing study): supported by the National Natural Science Foundation of China (Grant Nos. 11772131, 11772132, 11772134 & 11472109); the Natural Science Foundation of Guangdong Province, China (Grant Nos. 2015A030308017, 2015A030311046 & 2015B010131009); the Opening Fund of the State Key Laboratory of Nonlinear Mechanics (LNM), CAS; and the State Key Lab of Subtropical Building Science, South China University of Technology (Grant Nos. 2014ZC17 & 2017ZD096).
Funding (VLBM isotope diffusion-advection study): supported as part of the Center for Hierarchical Waste Form Materials, an Energy Frontier Research Center funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences under Award No. DE-SC0016574.
Funding (railway ballast particle breakage study): Natural Science Foundation of China (Nos. 5187914, 51679123, 51479095).
Funding: Scientific Research Program funded by the Shaanxi Provincial Education Department (No. 20JY058).
Abstract: In the design of a graphics processing unit (GPU), the processing speed of triangle rasterization is an important factor that determines GPU performance. This paper proposes an architecture for a multi-tile parallel-scan rasterization accelerator. The accelerator uses a bounding box algorithm to improve scanning efficiency. It rasterizes multiple tiles in parallel and scans multiple lines simultaneously within each tile. This highly parallel approach drastically improves rasterization performance. Using the 65 nm process standard cell library of Semiconductor Manufacturing International Corporation (SMIC), the accelerator can be synthesized to a maximum clock frequency of 220 MHz. An implementation on the Genesys2 field programmable gate array (FPGA) board fully verifies the functionality of the accelerator. The implementation shows a significant improvement in rendering speed and efficiency and proves its suitability for high-performance rasterization.
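The bounding-box scan mentioned in the rasterization abstract above can be illustrated in software. The following is a minimal sketch only: the abstract does not describe the accelerator's internals, so the edge-function coverage test, the pixel-center sampling rule, and the names `edge` and `rasterize` are assumptions chosen for illustration, not the paper's hardware design.

```python
import math


def edge(ax, ay, bx, by, px, py):
    """Edge function: positive when P lies to the left of the directed edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Return the pixels of a width x height screen whose centers lie inside tri.

    Only pixels inside the triangle's bounding box (clipped to the screen)
    are tested, which is the point of the bounding-box approach.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return []  # degenerate triangle covers nothing

    min_x = max(math.floor(min(x0, x1, x2)), 0)
    max_x = min(math.ceil(max(x0, x1, x2)), width - 1)
    min_y = max(math.floor(min(y0, y1, y2)), 0)
    max_y = min(math.ceil(max(y0, y1, y2)), height - 1)

    covered = []
    for y in range(min_y, max_y + 1):        # hardware could scan several lines at once
        for x in range(min_x, max_x + 1):
            px, py = x + 0.5, y + 0.5        # sample at the pixel center
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:                            # clockwise winding: signs flip
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.append((x, y))
    return covered
```

Every scan line in the double loop is independent of the others, which is the property a multi-line parallel scanner would exploit; splitting the screen into tiles and calling `rasterize` per tile would be the software analogue of multi-tile parallelism.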
Abstract: When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), a partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between memory and disk. To address the high IO overhead, we optimized twice partitioning with fine-grained GPU scheduling to reduce the IO overhead between CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we proposed balanced partitioning and dynamic scheduling methods to accelerate training in different cases. With these methods, we proposed fine-grained partitioning KGE, an efficient KGE training framework with multiple GPUs. We conducted experiments on several knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks in KGE training.
Funding: Supported by the National Natural Science Foundation of China (No. 60801039)
Abstract: A new graphics processing unit (GPU) parallelization strategy is proposed to accelerate sparse finite element computation for three-dimensional electromagnetic analysis. The strategy is based on a new compression format called sliced ELL Four (sliced ELL-F). The sliced ELL-F format-based parallelization strategy is designed to accelerate the many addition, dot product, and Sparse Matrix Vector Product (SMVP) operations in the Conjugate Gradient Norm (CGN) calculation of finite element equations. The new implementation of SMVP on GPUs is evaluated. The proposed strategy executed on a GPU can efficiently solve sparse finite element equations, especially when the equations are very sparse (most rows of the coefficient matrix have fewer than 8 nonzero entries). Numerical results show that the sliced ELL-F format-based parallelization strategy achieves significant speedups compared with the Compressed Sparse Row (CSR) format.
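To make the role of an ELL-type storage format in the SMVP concrete, here is a minimal sketch of the plain ELLPACK (ELL) layout and its matrix-vector product. The abstract gives no details of the sliced ELL-F variant, so this shows only the basic ELL idea it presumably builds on; the function names `to_ell` and `ell_spmv` are hypothetical.

```python
def to_ell(dense):
    """Convert a dense matrix (list of rows) to plain ELL storage.

    Every row is padded to the length of the longest row, giving two
    rectangular arrays: column indices and values. Padding uses column 0
    with value 0.0, so it contributes nothing to the product.
    """
    width = max(sum(1 for v in row if v != 0) for row in dense)
    cols, vals = [], []
    for row in dense:
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        pad = width - len(nonzeros)
        cols.append([j for j, _ in nonzeros] + [0] * pad)
        vals.append([v for _, v in nonzeros] + [0.0] * pad)
    return cols, vals


def ell_spmv(cols, vals, x):
    """Compute y = A @ x from ELL storage.

    Each row is independent and has the same padded length, so a GPU
    version would typically assign one thread per row.
    """
    return [
        sum(v * x[j] for j, v in zip(col_row, val_row))
        for col_row, val_row in zip(cols, vals)
    ]
```

The regular, padded row length is what makes this layout attractive when most rows hold only a few nonzeros, as in the case highlighted in the abstract; the cost is wasted storage when a few rows are much longer than the rest, which is the usual motivation for slicing the matrix into row blocks with separate widths.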
Funding: Supported by the Natural Science Foundation of Hunan Province for Distinguished Young Scholars (No. 2021JJ10062), the National Key Research and Development Program of China (No. 2016YFB0301101), the Science and Technology Program of Guangxi Province, China (No. AB21220028), the Fundamental Research Funds for the Central Universities of Central South University (No. 2019zzts050), and the Postgraduate Scientific Research Innovation Project of Hunan Province (No. CX20190106).
Abstract: In this study, insights into the effect of interfacial anisotropy on complex hexagonal close-packed (hcp) dendritic growth during alloy solidification were gained by graphics processing unit (GPU)-accelerated three-dimensional (3D) phase-field simulations, as demonstrated for a Mg-Gd alloy. An anisotropic phase-field model with finite interface dissipation was developed by incorporating the contribution of the anisotropy of interfacial energy into the total free energy functional. The modified spherical harmonic anisotropy function was then chosen for the hcp crystal. The GPU parallel computing algorithm was implemented in the present phase-field model, and a corresponding code was developed on the Compute Unified Device Architecture (CUDA) parallel computing platform. Benchmark tests indicated that the calculation efficiency of a single Tesla V100 GPU could be about 80 times that of open multi-processing (OpenMP) with eight central processing unit cores. By coupling the phase-field model with reliable thermodynamic and interfacial energy descriptions, the 3D phase-field simulation of α-Mg dendritic growth in the Mg-6Gd (in wt%) alloy during solidification was performed. Various two-dimensional dendrite morphologies were revealed by cutting the simulated 3D dendrite along different crystallographic planes. Typical sixfold equiaxed and butterfly-shaped microstructures observed in experiments were well reproduced.
Funding: Supported by the National Natural Science Foundation of China (Nos. 51908492, 52008366, and 52238001) and the Zhejiang Provincial Natural Science Foundation of China (Nos. LY21E080022 and LQ21E080019).
Abstract: A graphics processing unit (GPU)-accelerated vector-form particle-element method, i.e., the finite particle method (FPM), is proposed for 3D elastoplastic contact of structures involving strong nonlinearities and computationally expensive contact calculations. A hexahedral FPM element with reduced integration and anti-hourglass is developed to model structural elastoplastic behaviors. The 3D space containing contact surfaces is decomposed into cubic cells, and the contact search is performed between adjacent cells to improve search efficiency. A connected list data structure is used for storing contact particles to facilitate the parallel contact search procedure. The contact constraints are enforced by explicitly applying normal and tangential contact forces to the contact particles. The proposed method is fully accelerated by GPU-based parallel computing. After verification, the performance of the proposed method is compared with the serial finite element code Abaqus/Explicit on two large-scale contact examples. The maximum speedup of the proposed method over Abaqus/Explicit is approximately 80 for the overall computation and 340 for contact calculations. Therefore, the proposed method is shown to be effective and efficient.
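The cell decomposition and list-based particle storage described in the abstract above follow a widely used pattern for neighbor search. The sketch below shows that generic pattern under stated assumptions: the cell edge is set equal to the search radius, particles are stored per cell in a `head`/`nxt` linked list, and the names `build_cells` and `contact_pairs` are hypothetical. It is a serial illustration, not the paper's GPU implementation.

```python
from itertools import product


def build_cells(points, cell):
    """Bin particles into cubic cells using a linked list.

    head[key] is the index of the most recently inserted particle in the
    cell, and nxt[i] is the next particle in the same cell (-1 ends the list).
    """
    head, nxt = {}, [-1] * len(points)
    for i, p in enumerate(points):
        key = tuple(int(c // cell) for c in p)
        nxt[i] = head.get(key, -1)
        head[key] = i
    return head, nxt


def contact_pairs(points, radius):
    """Return all index pairs (i, j), i < j, closer than or equal to radius.

    Because the cell edge equals the radius, any partner of a particle
    must lie in its own cell or one of the 26 adjacent cells.
    """
    cell = radius
    head, nxt = build_cells(points, cell)
    pairs = set()
    for i, p in enumerate(points):           # one GPU thread per particle in a parallel version
        kx, ky, kz = (int(c // cell) for c in p)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            j = head.get((kx + dx, ky + dy, kz + dz), -1)
            while j != -1:                   # walk the cell's linked list
                if i < j:
                    d2 = sum((a - b) ** 2 for a, b in zip(p, points[j]))
                    if d2 <= radius ** 2:
                        pairs.add((i, j))
                j = nxt[j]
    return pairs
```

Restricting the search to adjacent cells reduces the work from all-pairs comparison to a number of checks roughly proportional to the particle count when particles are evenly spread, and the fixed-size `nxt` array is straightforward to fill and traverse in parallel.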
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 51279028 and 51479175) and the Public Science and Technology Research Funds Projects of Ocean (Grant No. 201405025)
Abstract: The unstructured-grid Finite Volume Coastal Ocean Model (FVCOM) is converted from its original FORTRAN code to Compute Unified Device Architecture (CUDA) C code and optimized for the graphics processing unit (GPU). The proposed GPU-FVCOM is tested against analytical solutions for two standard cases in a rectangular basin: a tide-induced flow and a wind-induced circulation. It is then applied to the coastal waters of Ningbo to simulate the tidal motion and analyze the flow field and the vertical structure of the tidal velocity. The simulation results agree well with the measured data. The proposed 3-D model runs 30 times faster than a single-thread program, and GPU-FVCOM on a Tesla K20 device is faster than the model on a workstation with 20 CPU cores, which shows that GPU-FVCOM is efficient for large-scale sea areas and high-resolution engineering problems.
Funding: Supported by the "National Science Council" under Grant No. 095-2917-I-259-001.
Abstract: This paper presents a novel geometrical voxelization algorithm for polygonal models. First, distance computation between geometrical primitives and voxels is performed slice by slice on graphics processing units (GPUs) for line/surface voxelization. A novel solid filling process is then proposed to assist surface voxelization and achieve solid voxelization. Furthermore, using the proposed transfer functions, both binary and anti-aliased voxelizations are achievable. Finally, the proposed approach can be applied to voxelize streamlines of 3D vector fields using line voxelization. Experiments show that the proposed approach produces the desired results.
Funding: Supported by the National Natural Science Foundation of China (No. 11172134) and the Funding of Jiangsu Innovation Program for Graduate Education (No. CXLX13_132)
Abstract: Programmable graphics processing units (GPUs) make it possible to build a personal desktop platform with thousands of cores and teraflops peak performance at the price of a conventional workstation. A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. Techniques for implementing CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated on a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved on a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
Funding: Supported by the 111 China Project (Grant No. B16003) and the National Natural Science Foundation of China (Grant Nos. 11290151, 11702022, and 11221202)
Abstract: Following the authors' previous studies, a new parallel computation approach is proposed to simulate the coupled dynamics of a rigid-flexible multibody system and a compressible fluid. In this approach, the smoothed particle hydrodynamics (SPH) method is used to model the compressible fluid, and the natural coordinate formulation (NCF) and absolute nodal coordinate formulation (ANCF) are used to model the rigid and flexible bodies, respectively. To model the compressible fluid properly and efficiently with the SPH method, three measures are taken. The first is to use a Riemann solver to handle the fluid compressibility, the second is to define virtual SPH particles to model the dynamic interaction between the fluid and the multibody system, and the third is to impose periodic inflow and outflow boundary conditions to reduce the number of SPH particles involved in the computation. A parallel computation strategy based on the graphics processing unit (GPU) is then proposed to detect neighboring SPH particles and to solve the dynamic equations of the SPH particles in order to improve computational efficiency. Meanwhile, the generalized-alpha algorithm is used to solve the dynamic equations of the multibody system. Finally, four case studies are given to validate the proposed parallel computation approach.
Funding: Supported by the China Postdoctoral Science Foundation (Grant No. 2013M540772) and the Young Scientists Fund of the National Natural Science Foundation of China (Grant Nos. 61203233, 51101124, and 51101125)
Abstract: Phase field simulation has been actively studied as a powerful method for investigating microstructural evolution during solidification. However, performing phase field simulations at large length and time scales is a great challenge. A graphics processing unit (GPU)-based computation scheme is developed and used in the phase field simulation, greatly improving computational efficiency. The results show that computation with the GPU is about 36 times faster than with a single central processing unit (CPU) core. This demonstrates the feasibility of GPU-accelerated phase field simulation on a desktop computer. The GPU-accelerated strategy will bring new opportunities to the application of phase field simulation.
Funding: National Basic Research Program (973) of China (No. 2010CB834300) and the Biomedical Engineering Cross-Research Fund of Shanghai Jiao Tong University (Nos. YG2011MS49 and YG2013MS65)
Abstract: Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted because it accommodates a wide range of samples and does not require a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. Graphics processing units (GPUs) support parallel computing, which is very useful for processing large quantities of data. In this paper, a Compute Unified Device Architecture (CUDA) C program running on a GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program achieves different speed-ups over the standard C program on the same Visual Studio 2010 platform, and the speed-up ratio increases as the data size increases.
Funding: National Natural Science Foundation of China (No. 60975084) and the Natural Science Foundation of Fujian Province, China (No. 2011J05159)
Abstract: A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper overcomes the small-input limitation of traditional neural networks. The whole image is used as the input of the neural network, so the maximum number of features is kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA Compute Unified Device Architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in both recognition rate and speed and achieved a speedup of 343 over the CPU implementation. Compared with a feature-based recognition method on the same task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Funding: This work was supported by the National Key Research and Development Program of China [Grant No. 2016YFC0800200], the National Natural Science Foundation of China [Grant Nos. 51778568, 51908492, and 52008366], and the Zhejiang Provincial Natural Science Foundation of China [Grant Nos. LQ21E080019 and LY21E080022]. This work was also supported by the Key Laboratory of Space Structures of Zhejiang Province (Zhejiang University) and the Center for Balance Architecture of Zhejiang University.
Abstract: Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming to solve and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may offer a way to reduce the computation time. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with the main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to the commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). Contact calculation accounts for only 18% of the total computation time with the FPM, far less than the 50% with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
Funding: National Key Research and Development Program of China under Grant No. 2018AAA0102703 and the National Natural Science Foundation of China under Grant Nos. 61972341, 61972342, and 61732015.
Abstract: We present a novel algorithm, BADF (Bounding Volume Hierarchy Based Adaptive Distance Fields), for accelerating the construction of adaptive distance fields (ADFs) of rigid and deformable models on graphics processing units (GPUs). Our approach constructs a bounding volume hierarchy (BVH) and uses that hierarchy to generate an octree-based ADF. We exploit the coherence between successive frames and sort the grid points of the octree to accelerate the computation. Our approach is applicable to rigid and deformable models. Our GPU-based algorithm is about 20x to 50x faster than current mainstream central processing unit-based algorithms. Our BADF algorithm can construct the distance fields for deformable models with 60k triangles at interactive rates on an NVIDIA GeForce GTX 1060. Moreover, we observe a 3x speedup over prior GPU-based ADF algorithms.
Abstract: The gravity gradient is the second derivative of the gravity potential and contains more high-frequency information about Earth's gravity field. Gravity gradient observation data require their prior and intrinsic parts to be deducted in order to reveal more variational information. A model generated from a topographic surface database is more appropriate for representing gradiometric effects derived from near-surface mass, as other kinds of data can hardly meet the spatial resolution requirement. The rectangular prism method, namely an analytic integration of Newtonian potential integrals, is a reliable and commonly used approach to modeling the gravity gradient, but its computational efficiency is extremely low. A modified rectangular prism method and a graphics processing unit (GPU) parallel algorithm were proposed to speed up the modeling process. The modified method avoided massive redundant computations by transforming the formulas according to the symmetries of the prisms' integral regions, and the proposed algorithm parallelized this method's computation. The parallel algorithm was compared with a conventional serial algorithm using 100 elevation data in two topographic areas (rough and moderate terrain). Modeling differences between the two algorithms were less than 0.1 E, which is attributed to precision differences between single-precision and double-precision floating-point numbers. The parallel algorithm was approximately 200 times more efficient than the serial algorithm in experiments, demonstrating that it effectively speeds up the modeling process. Further analysis indicates that both the modified method and GPU parallelism contributed to the proposed algorithm's performance in experiments.
Funding: National Natural Science Foundation of China (No. 61702521), the Natural Science Foundation of Tianjin (No. 18JCQNJC00400), the Scientific Research Foundation of Civil Aviation University of China (No. 2017QD12S), and the Fundamental Research Funds for the Central Universities of Civil Aviation University of China (Nos. 3122018C023 and 3122018C021).
Abstract: Graphics processing units (GPUs) employ single instruction multiple data (SIMD) hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after conditional branches. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate the waste of SIMD lanes, threads from multiple warps can be collected and compacted into idle lanes to improve SIMD lane utilization. However, this mechanism induces extra barrier synchronizations because warps must stall to wait for other warps before compaction, so that in some cases no warps can be scheduled. In this paper, we propose an approach to reduce the overhead of barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after branches. Moreover, warps waiting for a compaction can also bypass it when no warps are ready to issue. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering the performance loss induced by compactions by 13% on average for applications with many non-divergent control flows.
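The effect of branch divergence and thread compaction on SIMD lane utilization, as described in the abstract above, can be estimated with a small counting model. This is an idealized sketch under explicit assumptions: each warp is a list of booleans (branch taken or not), a divergent warp issues both paths one after the other, compaction can pack any thread into any free lane (real hardware usually restricts threads to their home lane), and uniform warps bypass the compaction. All function names are hypothetical, and the model ignores the synchronization cost that the paper sets out to reduce.

```python
from math import ceil


def is_divergent(warp):
    """A warp diverges when some, but not all, of its threads take the branch."""
    return any(warp) and not all(warp)


def passes_without_compaction(warps):
    """Issue slots needed when every divergent warp runs both paths serially."""
    return sum(2 if is_divergent(w) else 1 for w in warps)


def passes_with_compaction(warps, width):
    """Issue slots needed when divergent warps pool their threads per path.

    Uniform warps bypass the compaction and issue once each.
    """
    divergent = [w for w in warps if is_divergent(w)]
    uniform = len(warps) - len(divergent)
    taken = sum(sum(w) for w in divergent)
    not_taken = sum(len(w) - sum(w) for w in divergent)
    return uniform + ceil(taken / width) + ceil(not_taken / width)


def utilization(warps, passes, width):
    """Fraction of issued lane slots that carry an active thread."""
    return sum(len(w) for w in warps) / (passes * width)
```

For example, with 4-lane warps `[[True]*4, [True, False, False, False], [False, True, True, True]]`, the model needs 5 issue slots without compaction and 3 with it. A single divergent warp needs 2 slots either way, which corresponds to the case in which a compaction brings no reduction in idle lanes and could be canceled.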
Abstract: For electromagnetic scattering from 3-D complex electrically large conducting targets, a new hybrid algorithm, the MoM-PO/SBR algorithm, is presented to exchange information between the method of moments (MoM) and physical optics (PO)/shooting and bouncing ray (SBR). In the algorithm, a COC file based on the Huygens equivalence principle is introduced, and the conversion interface between the equivalent surface and the target is established. The multi-task flow model presented in this paper is then adopted to conduct CPU/graphics processing unit (GPU) tests of the algorithm under three modes, i.e., MPI/OpenMP, MPI/Compute Unified Device Architecture (CUDA), and the multi-task programming model (MTPM). Numerical results are presented and compared with reference solutions to illustrate the accuracy and efficiency of the proposed algorithm.