Last week, in Part 1 of this two-part blog, we looked at trends in Big Data and analytics, and started to touch on the relationship with HPC (High Performance Computing). In this week’s blog we take a look at the usage of Big Data in HPC and what commercial and HPC Big Data environments have in common, as well as their differences.
High Performance Computing has been the breeding ground for many important mainstream computing and IT developments, including:
- The Web
- Cluster computing
- Cloud computing
- High-quality visualization and animation
- Parallel computing
- and arguably, Big Data itself
Big Data has indeed been a reality in many HPC disciplines for decades, including:
- Particle physics
- Weather and climate modeling
- Petroleum seismic processing
All of these fields and others generate massive amounts of data, which must be cleaned, calibrated, reduced and analyzed in great depth in order to extract knowledge. This might be a new genetic sequence, the identification of a new particle such as the Higgs Boson, the location and characteristics of an oil reservoir, or a more accurate weather forecast. And naturally the data volumes and velocity are growing continually as scientific and engineering instrumentation becomes more advanced.
A recent article in the July 2015 issue of Communications of the ACM is titled “Exascale computing and Big Data”. Authors Daniel A. Reed and Jack Dongarra note that “scientific discovery and engineering innovation requires unifying traditionally separated high-performance computing and big data analytics”.
(n.b. Exascale is 1000 x Petascale, which in turn is 1000 x Terascale. HPC and Big Data are already well into the Petascale era. Europe, Japan, China and the U.S. have all announced Exascale HPC initiatives spanning the next several years.)
What’s in common between Big Data environments and HPC environments? Both are characterized by racks and racks of commodity x86 systems configured as compute clusters. Both face system management challenges in power, cooling, reliability and administration, at scales of up to hundreds of thousands of cores and many Petabytes of data. Both are characterized by large amounts of local node storage, increasing use of flash memory for fast data access, and high-bandwidth switches between compute nodes. And both run Linux or other flavors of Unix. Open source software is generally favored up through the middleware level.
What’s different? Big Data analytics environments use VMs above the OS, SANs as well as local storage, the Hadoop Distributed File System (HDFS), key-value stores, and a different middleware environment including MapReduce, Hive and the like. Higher-level languages (R, Python, Pig Latin) are preferred for development purposes.
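To make the Map-Reduce idea a little more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are made up for illustration, and a real framework would spread the map and reduce steps across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (word, 1) pair per word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical input, purely for illustration.
    lines = ["big data meets HPC", "HPC meets big data at scale"]
    print(word_count(lines))
```

The point of the pattern is that each call to `map_phase` and each call to `reduce_phase` is independent of the others, which is what lets a cluster run them in parallel on whichever node holds the data.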
HPC uses traditional compiled languages (C, C++ and Fortran) and their development environments, along with numerical and parallel libraries, batch schedulers and the Lustre parallel file system. In some cases HPC systems also employ accelerators, such as Nvidia GPUs or Intel Xeon Phi processors, to enhance floating point performance. (Prediction: we’ll start seeing more and more of these used in Big Data analytics as well – http://www.nvidia.com/object/data-science-analytics-database.html).
But in both cases the pipeline is essentially:
Data acquisition -> Data processing -> Model / Simulation -> Analytics -> Results
The analytics must be based on and informed by a model that is attempting to capture the essence of the phenomena being measured and analyzed. There is always a model — it may be simple or complex; it may be implicit or explicit.
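As a toy illustration of that pipeline, the sketch below walks through each stage in plain Python. Everything in it is an assumption made for the example: the sensor readings are invented, the "model" is a simple least-squares line, and the function names are hypothetical. Real Big Data or HPC pipelines differ enormously in scale and sophistication, but the stages line up the same way.

```python
from statistics import mean


def acquire():
    """Data acquisition: invented (hour, temperature) readings; None marks a dropped sample."""
    return [(0, 10.0), (1, 12.0), (2, None), (3, 16.0), (4, 18.0)]


def process(raw):
    """Data processing: discard incomplete samples."""
    return [(x, y) for x, y in raw if y is not None]


def fit_model(data):
    """Model: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def analyze(model, x_new):
    """Analytics: use the fitted model to predict a value at x_new."""
    slope, intercept = model
    return slope * x_new + intercept


def run_pipeline(x_new):
    """Results: chain acquisition, processing, modeling and analytics."""
    return analyze(fit_model(process(acquire())), x_new)


if __name__ == "__main__":
    print(run_pipeline(5))
```

Here the model is explicit (a straight line). Swap `fit_model` for a plain average and the model becomes simpler and more implicit, but it is still there.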
Human behavior is highly complex, and every user, every customer, every patient, is unique. As applications become more complex in search of higher accuracy and greater insight, and as compute clusters and data management capabilities become more powerful, the models or assumptions behind the analytics will in turn become more complex and more capable. This will result in more predictive and prescriptive power.
Our general conclusion is that while there are some distinct differences between Big Data and HPC, there are significant commonalities. Big Data is more the province of social sciences and HPC more the province of physical sciences and engineering, but they overlap, and especially so when it comes to the life sciences. Is bioinformatics HPC or Big Data? Yes, both. How about the analysis of clinical trials for new pharmaceuticals? Arguably, both again.
So cross-fertilization and areas of convergence will continue, while Big Data and HPC each continue to develop new methods appropriate to their specific disciplines. And many of these new methods will cross over to the other area when appropriate.
The National Science Foundation believes in the convergence of Big Data and HPC and is investing $2.4 million in it at the University of Michigan, in support of various applications including climate science, cardiovascular disease, and dark matter and dark energy. See:
What are your thoughts on the convergence (or not) of Big Data and HPC?