Two popular systems, Hadoop and Spark, do not achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computation, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design that better overlaps computation and communication. We implement this design as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows that DataMPI-Iteration achieves a 9X-21X speedup over Apache Hadoop and a 2X-3X speedup over Apache Spark for PageRank and K-means.
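To make concrete why PageRank stresses the shuffle stage, the following is a minimal single-process sketch of the iteration, not the DataMPI-Iteration API. The function name `pagerank`, the damping factor of 0.85, and the iteration count are illustrative assumptions. Every iteration regroups rank contributions by destination node, which is the step a distributed system performs as a shuffle across the network.

```python
from collections import defaultdict


def pagerank(links, iterations=20, damping=0.85):
    """Toy PageRank. `links` maps each node to its list of outgoing neighbours.

    Hypothetical helper for illustration only; dangling nodes (no out-links)
    are not redistributed, so use graphs where every node has an out-link.
    """
    nodes = set(links) | {dst for outs in links.values() for dst in outs}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # "Map" step: each node splits its rank evenly among its neighbours.
        # Accumulating by destination key stands in for the shuffle.
        contributions = defaultdict(float)
        for src, outs in links.items():
            if outs:
                share = ranks[src] / len(outs)
                for dst in outs:
                    contributions[dst] += share

        # "Reduce" step: combine the grouped contributions into new ranks.
        ranks = {
            node: (1.0 - damping) / n + damping * contributions[node]
            for node in nodes
        }
    return ranks


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for node, rank in sorted(pagerank(graph).items()):
        print(node, round(rank, 4))
```

Because the output of one iteration is the input of the next, any time spent waiting on the regrouping step is paid once per iteration, which is why overlapping it with computation matters for this class of workload.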
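With the rapid development of modern information technologies such as cloud computing and the Internet of Things, data in every industry is growing sharply, and unstructured data with low value density is growing especially fast. A high-performance distributed system is urgently needed to mine the value hidden in this massive data. This paper discusses the data transfer mechanism of a data analysis platform based on MongoDB and Hadoop MapReduce, built with the MongoDB-Connector for Hadoop. It proposes measures to improve the data analysis system in terms of chunk size settings, sharding method, MongoDB sharded cluster deployment, CAP, hybrid partitioning, directed acyclic graphs, computation locality, and configuring a prediction mechanism. Finally, it analyzes the effect of applying these measures in practical projects such as public opinion analysis and supermarket customer purchasing behavior analysis, demonstrating their feasibility for improving performance. The results can serve as a reference for practitioners in big data related fields.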
In this paper we study a seismic sensing platform using Shakebox, a low-noise, low-power 24-bit wireless accelerometer sensor. Advances in wireless sensors offer the potential to monitor earthquakes in California at unprecedented spatial and temporal scales. We are exploring the possibility of incorporating Shakebox into the California Seismic Network (CSN), a new earthquake monitoring system based on a dense array of low-cost acceleration seismic sensors. Compared to the Phidget/Sheevaplug sensors currently used in CSN, the Shakebox sensors have several advantages. However, each Shakebox sensor collects 4K bytes of seismic data per second, or around 0.4G bytes in a single day. Processing such a large amount of seismic data therefore becomes a new challenge. We adopt Hadoop/MapReduce, a popular software framework for processing vast amounts of data in parallel on large clusters of commodity hardware. In this research, we report the generation of seismic data on the testbed, present the design of the map and reduce functions, illustrate the application of MapReduce to the testbed-generated data, and analyze the results.
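As a rough illustration of the data-volume figure and of what a map and reduce pair for accelerometer samples could look like, here is a minimal in-memory sketch. It is not the paper's actual design: the record layout `(sensor_id, timestamp_seconds, acceleration)`, the per-minute peak statistic, and all function names are assumptions, and "4K Bytes" is read as 4,000 bytes.

```python
from collections import defaultdict

# Assumption: "4K Bytes" per second is taken as 4,000 bytes (4,096 is also plausible).
BYTES_PER_SECOND = 4 * 1000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(bytes_per_second=BYTES_PER_SECOND):
    """Data produced by one sensor per day, in decimal gigabytes."""
    return bytes_per_second * SECONDS_PER_DAY / 1e9


def map_sample(record):
    """Hypothetical map function: key each sample by (sensor, minute)."""
    sensor_id, timestamp, accel = record
    minute = int(timestamp // 60)
    yield (sensor_id, minute), abs(accel)


def reduce_peak(key, values):
    """Hypothetical reduce function: keep the peak absolute acceleration."""
    return key, max(values)


def run_mapreduce(records):
    """Single-process stand-in for the framework's map, shuffle, and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_sample(record):
            groups[key].append(value)  # grouping by key plays the role of the shuffle
    return dict(reduce_peak(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(daily_volume_gb())
    samples = [("s1", 0.0, 0.1), ("s1", 30.0, -0.5), ("s1", 61.0, 0.2), ("s2", 5.0, 0.3)]
    print(run_mapreduce(samples))
```

Under the 4,000-byte reading, one sensor produces about 0.35 GB per day, consistent with the "around 0.4G bytes" quoted above. On a real cluster the grouping step would be handled by the Hadoop framework rather than an in-memory dictionary.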