HPC|Scale Special Interest Subgroup

Chair: Richard Graham

Computational science is the field of study concerned with constructing mathematical models and numerical techniques that represent scientific, social-scientific or engineering problems, and with employing those models on computers, or clusters of computers, to analyze, explore or solve them. Numerical simulation enables the study of complex phenomena that would be too expensive or too dangerous to investigate by direct experimentation. The quest for ever higher levels of detail and realism in such simulations requires enormous computational capacity, and has provided the impetus for breakthroughs in computer algorithms and architectures.

Among high-performance computing systems, HPC clusters have become the preferred solution, as they deliver efficient compute performance from industry-standard hardware connected by a high-speed network. The main benefits of clusters are affordability, flexibility and availability. While the cluster architecture is used by the majority of HPC systems today (for example, more than 80% of the TOP500 supercomputer list is classified as clusters), very large-scale systems have tended to use proprietary solutions. Historically, proprietary solutions were necessary to provide the low latency, high bandwidth and high reliability needed to scale to tens and hundreds of thousands of CPUs. Today, commodity-based clusters provide a low-latency, high-bandwidth solution, together with advanced scalability and reliability capabilities, meeting the requirements of the world’s largest supercomputers.

The HPC|Scale working group's mission is to explore the capabilities of upcoming advanced clustering technologies that allow HPC commodity clusters to replace expensive, inflexible proprietary systems and to provide a better solution for future large-scale HPC systems.

The HPC|Scale subgroup includes the following organizations from the HPC Advisory Council membership: Mellanox Technologies, Oak Ridge National Laboratory, the University of Wisconsin-Madison, and Ohio State University.


MPI Collectives Acceleration
Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability, and computer systems are becoming more heterogeneous, with increasing node and core-per-node counts. Scientific applications use collective communications to satisfy a range of needs, such as determining the magnitude of residual vectors in iterative numerical solvers, performing Fourier transforms, and performing distributed-data reductions. The global nature of these communication patterns makes them a major factor in determining the performance and scalability of simulation codes, and their impact grows with process count. Developing effective strategies for collective operations is therefore essential for applications to make effective use of available compute resources, and offers the potential for substantial improvements in application performance; these benefits increase with system size and complexity. The HPC|Scale working group is investigating the capability to offload MPI collective operations from the MPI library to the network, and its effect on scalability and performance.


Towards Exascale Computing
Now that the Petaflop performance barrier has been broken (the top three systems on the TOP500 supercomputers list have demonstrated sustained performance above 1 Petaflop), the HPC community is exploring development efforts aimed at breaking the Exaflop barrier. Many organizations and industry collaborations have been established around the world, and multiple media outlets have been created to cover the development and collaboration progress. (Read more)

Exploring weather/atmospheric research applications at Scale
The High Order Method Modeling Environment (HOMME) and the modified version of The Parallel Ocean Program (POPperf) are two important applications for atmospheric and weather research. With an emphasis on efficiency, portability, maintainability and, most importantly, scalability, HOMME and POPperf have been successfully deployed over the years on a wide variety of high-performance systems, such as Cray and Blue Gene. With the increased adoption of HPC commodity clusters based on the high-speed InfiniBand network, understanding HOMME and POPperf scalability and the optimization options available for clusters at scale is crucial for utilizing the capability of InfiniBand-based systems to serve as cost-effective, high-performance and highly scalable solutions. Our results identify HOMME and POPperf scaling capabilities on one of the world’s largest InfiniBand networks and demonstrate the critical elements for optimizing these applications at scale. (Read more)

Exascale: The Beginning of the Great HPC Disruption
With the eventual arrival of exascale systems, we face a level of disruption that is unlike anything this community has ever experienced. As we examine the potential disruptive impact of exascale computing, we have to keep in mind that we're looking at much more than bigger and faster computers or innovative new technology. (Read more)

CFD applications (OpenFOAM) at Scale and the Effect of Oversubscribed Fabrics
The presentation explores OpenFOAM scaling capabilities and reviews the effect of non-blocking versus blocking networks. (Read more)