Big Data Management and Processing in Data Centre Clouds
As we delve deeper into the ‘Digital Age’, we witness explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created every day. The data originated from many types of sources, including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises, cameras, and software logs. Such ‘data explosions’ have led to one of the most challenging research issues of the current Information and Communication Technology (ICT) era: how to manage such large amounts of data effectively and optimally, and how to identify new ways of analyzing them to unlock information. The issue is also known as the ‘Big Data’ problem, which is defined as the practice of collecting complex data sets so large that they become difficult to analyze and interpret manually or with on-hand data management applications (e.g., Microsoft Excel). From the perspective of real-world applications, the Big Data problem has also become a common phenomenon in the domains of science, engineering, and commerce. Representative applications include social media analytics, high energy physics, earth observation, genomics, connectomics, automobile simulations, medical imaging, and body area networks.
To illustrate a Big Data application, we give a typical example scenario that arises in the healthcare domain: the problem of managing petabytes of multimedia content produced by advanced medical imaging devices. In conjunction with traditional X-rays, medical imaging can now delve deeper into the human body, discovering and analysing smaller and smaller details. A research team from Williams College at Harvard University (http://www.gizmag.com/medical-imaging-tracking-molecules-live-tissue-video-rate/17202/) has developed a new type of optical medical imaging device that captures high-resolution live video of human cells and molecules. A recent report (www.corp.att.com/healthcare/docs/medical-imaging-cloud.pdf) from AT&T reveals that archives of medical content (including X-rays, CT scans, MRIs, mammograms, and other pathology test reports) are growing by 20-40 percent each year. In 2012, there were 1 billion such items in the United States alone, accounting for one-third of global storage demand.
Another important class of Big Data applications in the healthcare domain is Medical Body Area Networks (MBANs). According to the market intelligence company ABI Research, close to five million disposable wireless MBAN sensors will be shipped over the next five years. MBANs enable continuous monitoring of a patient’s condition by sensing and transmitting measurements such as heart rate, electrocardiogram (ECG), body temperature, respiratory rate, chest sounds, and blood pressure. MBANs will allow: (i) real-time and historical monitoring of a patient’s health; (ii) infection control; (iii) patient identification and tracking; and (iv) geo-fencing and vertical alarming. However, to manage and analyze such massive MBAN data from millions of patients in real time, healthcare providers will need access to an intelligent and highly secure ICT infrastructure.
In all of the aforementioned application scenarios, hundreds of petabytes of heterogeneous data (images, text, video, and the like) will be generated and will need to be processed efficiently (stored, distributed, and indexed with a schema and semantics) in a way that does not compromise end-users’ Quality of Service (QoS) in terms of data availability, data search delay, data analysis delay, and the like. ICT systems that can store, process, distribute, and index hundreds of petabytes of heterogeneous data either fall short of this challenge or do not yet exist. There has been a paradigm shift in how high-performance Big Data applications are executed: from platforms built on physical hardware and locally managed software to virtualized data centre clouds. This migration is driven by two facts:
1. Data-intensive computing has become the fourth paradigm of scientific discovery; it therefore requires computing infrastructures (such as specialized data centres) and software frameworks (data processing technologies such as MapReduce and workflow systems, distributed storage systems, etc.) that are specifically optimized for Big Data applications.
2. Cloud computing assembles large networks of virtualized services: hardware resources (CPU, storage, and network) and software resources (e.g., databases, message-queuing systems, monitoring systems, load-balancers). In industry these services are referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Cloud computing services are hosted in large data centres, often referred to as data farms, operated by companies such as Amazon, Apple, GoGrid, Google, and Microsoft. Cloud computing gives application developers the ability to marshal virtually infinite resources on a pay-per-use, as-needed basis, instead of requiring upfront investment in resources that may never be optimally used. Once applications are hosted on cloud resources, users can access them from anywhere at any time, using devices ranging from a wide class of mobile devices (smartphones, tablets) to desktop computers. The data centre cloud provides virtual centralization of applications, computing, and data. While cloud computing optimises the use of resources, it does not (yet) provide an effective solution for hosting Big Data applications.
Despite the aforementioned technological advances in data processing paradigms (e.g., the MapReduce paradigm, workflow technologies) and cloud computing, large-scale reliable system-level software for Big Data applications is yet to become commonplace. Large-scale distributed data-intensive applications, e.g., 3D model reconstruction in medical imaging, medical body area networks, earth observation applications, distributed blog analysis, and high energy physics simulation, need to process and manage massive data sets across geographically distributed data centres. These applications must be provisioned across multiple data centre resources because they combine multiple, independent, and geographically distributed software and hardware resources such as data sources, middleware frameworks, and experiment models. The capability of existing data centre computing paradigms (e.g., MapReduce, workflow technologies) for data processing, however, is limited to compute and storage infrastructures within a local area network, e.g., a single cluster within a data centre. This leads to unsatisfactory Quality of Service (QoS) in terms of timeliness of results, dissemination of results across organizations, and administrative costs. There are multiple reasons for this state of affairs, including: (i) lack of software frameworks and services that allow portability of such applications across multiple data centres (e.g., public data centres, private data centres, hybrid data centres, etc.); (ii) unavailability of required resources within a data centre; (iii) manual approaches leading to non-optimal resource provisioning; and (iv) lack of the right set of programming abstractions, which could extend the capability of existing data processing paradigms to multiple data centres.
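To make the idea of extending a data processing paradigm beyond a single cluster more concrete, the following is a minimal Python sketch, not taken from any of the systems discussed here. It assumes a simple hierarchical scheme: each data centre runs the map and combine steps on its own data, and only compact partial aggregates cross the wide area network to a global reduce step. The data centre names, the sample heart-rate readings, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: each data centre holds its own
# (patient_id, heart_rate) readings, e.g. collected from MBAN sensors.
DATA_CENTRES = {
    "dc-europe": [("p1", 72), ("p2", 88), ("p1", 76)],
    "dc-asia": [("p2", 92), ("p3", 64)],
    "dc-america": [("p1", 80), ("p3", 66), ("p3", 68)],
}


def local_map_combine(readings):
    """Runs inside one data centre: map each reading to a (sum, count)
    contribution and combine the contributions per patient."""
    partial = defaultdict(lambda: [0, 0])
    for patient_id, heart_rate in readings:
        partial[patient_id][0] += heart_rate
        partial[patient_id][1] += 1
    return {pid: (total, count) for pid, (total, count) in partial.items()}


def global_reduce(partials):
    """Merges the per-data-centre partial aggregates into one mean per patient."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for pid, (total, count) in partial.items():
            totals[pid][0] += total
            totals[pid][1] += count
    return {pid: total / count for pid, (total, count) in totals.items()}


def mean_heart_rate(data_centres):
    """Only the small partial aggregates leave each data centre,
    not the raw readings."""
    partials = [local_map_combine(readings) for readings in data_centres.values()]
    return global_reduce(partials)


if __name__ == "__main__":
    for pid, mean in sorted(mean_heart_rate(DATA_CENTRES).items()):
        print(pid, round(mean, 1))
```

In this sketch the raw readings never leave the data centre that holds them. A real multi-data-centre framework would additionally have to handle scheduling, security, and failures, none of which is modelled here.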
Copyright on my articles is held by the respective publishers. They are posted here for educational purposes only. If you want to use them for commercial purposes, please consult the copyright owners.
- R. Ranjan, O. Rana, S. Nepal, M. Yousif, P. James, Z. Wen, S. Barr, P. Watson, P. P. Jayaraman, D. Georgakopoulos, M. Villari, M. Fazio, S. Garg, R. Buyya, L. Wang, A. Y. Zomaya, and S. Dustdar, “The Next Grand Challenges: Integrating the Internet of Things and Data Science,” IEEE Cloud Computing, Volume 5, Issue 3, Pages 12-26, May/Jun. 2018, doi: 10.1109/MCC.2018.032591612. (Reviewed by Editorial Board) [ISI impact factor: 2.92]
- R. Ranjan, S. Garg, A. Khoskbar, E. Solaiman, P. James, and D. Georgakopoulos, “Orchestrating BigData Analysis Workflows,” IEEE Cloud Computing, IEEE Computer Society. (To Appear May 2017, Reviewed by Editorial Board) [ISI impact factor: 2.92]
- A. Khoshkbarforoushha, R. Ranjan, R. Gaire, E. Abbasnejad, L. Wang, and A. Y. Zomaya, “Distribution Based Workload Modelling of Continuous Queries in Clouds,” IEEE Transactions on Emerging Topics in Computing, IEEE Computer Society. (Accepted July 2016). [ISI impact factor: 3.6]
- A. Khoshkbarforoushha, A. Khosravian, and R. Ranjan, “Elasticity Management of Streaming Data Analytics Flows on Clouds,” Journal of Computer and System Sciences, Elsevier. [ISI Impact Factor 1.58, ERA A*] (Accepted November 2016, in press)
- K. Alwasel, Y. Li, P. P. Jayaraman, S. Garg, R. N. Calheiros, and R. Ranjan, “Programming SDN-Native Big Data Applications: Research Gap Analysis,” IEEE Cloud Computing, IEEE Computer Society. (To Appear September 2017, Reviewed by Editorial Board) [ISI impact factor: 2.92]
- Z. Deng, W. Han, L. Wang, R. Ranjan, A. Y. Zomaya, and W. Jie, “An Efficient Online Direction Preserving Compression Approach for Trajectory Streaming Data”, Future Generation Computer Systems Journal, Elsevier Press. [ERA A Journal, ISI impact factor: 2.6] (accepted September 2016, in press)
- Z. Deng, L. Wang, W. Han, R. Ranjan, and A. Zomaya, “G-ML-Octree: An Update-Efficient Index Structure for Simulating 3D Moving Objects across GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-1, doi: 10.1109/TPDS.2017.2787747. [ERA A* Journal, ISI impact factor 2.1]
- S. Nepal, R. Ranjan, and K-K. R. Choo, “Trustworthy Processing of Healthcare Big Data in Hybrid Clouds,” IEEE Cloud Computing, Volume 2, Issue 2, 2015, BlueSkies Column, IEEE Computer Society. [ISI impact factor: 2.92]
- L. Wang and R. Ranjan, “Processing Distributed Internet of Things Data in Clouds,” IEEE Cloud Computing, Volume 2, Issue 1, BlueSkies Column, IEEE Computer Society. [ISI impact factor: 2.92]
- R. Ranjan, “Streaming Big Data Processing in Datacenter Clouds”. IEEE Cloud Computing 1(1): 78-83 (2014). [ISI impact factor: 2.92]
- E. Bertino, S. Nepal, and R. Ranjan, “Building Sensor-Based Big Data Cyberinfrastructures,” IEEE Cloud Computing, Volume 2, Issue 5, 2015, IEEE Computer Society. [ISI impact factor: 2.92]
- P. Liu, L. Wang, R. Ranjan, “IK-SVD: Towards Sparse Representation of Spatial-Temporal Remote Sensing Big Data”, IEEE Computing in Science and Engineering Magazine, IEEE Computer Society Press. (Accepted April 2014)
- Z. Deng, X. Wu, L. Wang, X. Chen, R. Ranjan, A. Zomaya, and D. Chen, “Parallel Processing of Dynamic Continuous Queries over Streaming Data Flows”, IEEE Transactions on Parallel and Distributed Systems, IEEE Computer Society Press. [ERA A* Journal, ISI impact factor 1.4] (Accepted February 27, 2014 and to appear)
- W. Song, L. Wang, R. Ranjan, J. Kolodziej, and D. Chen, “Towards Modeling Large-scale Data flows in a Multi-datacentre Computing System with Petri Net”, IEEE Systems Journal, IEEE Computer Society Press. [ISI Impact Factor 1.27] (Accepted September 2013)
- L. Wang, J. Tao, R. Ranjan, H. Marten, A. Streit, J. Chen, and D. Chen, “G-Hadoop: MapReduce across Distributed Data Centers for Data-intensive Computing”, Future Generation Computer Systems Journal, Volume 29, Issue 3, Pages 739-750, March 2013, DOI: dx.doi.org/10.1016/j.future.2012.09.001, Elsevier Press. [ERA A Ranked Journal, ISI impact factor: 1.978]
- J. Zhao, L. Wang, J. Tao, J. Chen, R. Ranjan, J. Kolodziej, A. Streit, and D. Georgakopoulos, “A Security Framework in G-Hadoop for Data-intensive Computing across Distributed Clusters”, Journal of Computer and System Sciences, Elsevier Press, Accepted May 2013. [ERA A* Journal, ISI impact factor 1.00]
- A. Guabtni, R. Ranjan, F. Rabhi, “A Workload-driven Approach to Database Query Processing in the Cloud”, In the Journal of Supercomputing, Volume 63, Issue 3, March 2013, Pages 722-736, Springer Netherlands, doi: 10.1007/s11227-011-07. [ERA B Ranking, ISI impact factor: 0.91]
- C. Liu, J. Chen, L. Yang, X. Yang, R. Ranjan, and R. Kotagiri, “Authorized Public Auditing of Dynamic Big Data Storage on Cloud with Efficient Verifiable Fine-grained Updates”, IEEE Transactions on Parallel and Distributed Systems, IEEE Computer Society Press. [ERA A* Journal, ISI impact factor 1.4] (Accepted August 2013 and to appear)
- L. Li, W. Xue, R. Ranjan, Z. Jin, “A Scalable Helmholtz Solver in GRAPES over Large Scale Multi-core Cluster”, In the Journal of Concurrency and Computation: Practice and Experience (CCPE), Wiley Press, Published online: 18 JAN 2013, DOI: 10.1002/cpe.2979. [ERA A Ranked Journal, ISI impact factor: 0.84]
- C. Liu, X. Zhang, C. Liu, Y. Yang, R. Ranjan, D. Georgakopoulos, and J. Chen, “An Iterative Hierarchical Key Exchange Scheme for Secure Scheduling of Big Data Applications in Cloud Computing”, The 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (IEEE TrustCom-13), IEEE Computer Society. [ERA A Ranking]
- L. Wang, R. Ranjan, S. Khan and J. Kołodziej, “Parallel Processing of Massive EEG Data with MapReduce”, 18th IEEE International Conference on Parallel and Distributed Systems, Singapore, December 17-19, 2012, IEEE Computer Society. [ERA B Ranking]
- L. Wang, D. Chen, Z. Deng, and R. Ranjan, “A Simulation Study on Urban Water Threat Detection in Modern Cyberinfrastructures”, High-Performance Grid and Cloud Computing Workshop, In conjunction with IPDPS 2012, May 21-25, 2012, IEEE Computer Society. [not ranked]