The HPC Technology Behind the World's Top Supercomputers is in Your Data Center
Dave Turek, vice president for HPC and OpenPOWER, shares how POWER9 runs the world's top supercomputers.
Dave Turek, vice president for HPC and OpenPOWER, IBM Systems. Image by Dan Bigelow
By Adam Oxford, 11/01/2019
When French oil company Total goes prospecting, it has a not-so-secret weapon to improve its chances of hitting black gold. In June of this year, the company cut the ribbon on Pangea III to much worldwide acclaim. Pangea III is the fastest supercomputer on the planet that’s not owned by a government or academic institution, and the 11th fastest overall.
Pangea III increases the computing resources available to the Total group by a factor of five. That gives it an unprecedented capability to analyze geological data, run simulations for planning potential drill sites, and support the research and development team's work on new algorithms and artificial intelligence (AI) for the future.
And it does all of this while consuming less than a tenth of the power per petaflop of Total's previous Pangea machines.
The IBM-built supercomputer features hundreds of compute nodes, each combining multiple 18-core IBM POWER9* processors with NVIDIA Volta V100 GPUs. In total, the system boasts 291,024 processor cores, which deliver up to 25 thousand million million floating-point operations per second (25 petaflops). It runs Red Hat's open-source operating system; Red Hat was recently acquired by IBM, and continued development of the software should make the machine even more efficient and effective over the course of its lifetime.
That Total’s machine is the most powerful in the commercial world shouldn’t come as too much of a surprise, however. Supercomputing website Top500.com lists the first and second most capable high-performance computers (HPCs) as Oak Ridge National Laboratory’s Summit and the Lawrence Livermore National Laboratory’s Sierra. Both Summit and Sierra were switched on last year, and are based on the same architecture as Pangea III, albeit one that’s been scaled up by—in Summit’s case—a factor of eight.
What Makes a Successful Supercomputer?
To put the power of these three machines into context, when measured by peak petaflops, Summit is twice as powerful as the No. 4 supercomputer on the Top500 list and draws a little over half of the energy. Pangea III is 38% faster than its nearest commercial rival, with similar energy needs.
These achievements are made possible because IBM has been thinking big for decades about how to scale supercomputers, making two design decisions over the last 20 years that have proved critical to its current success.
“Sometimes being lucky is as important as being smart,” says Dave Turek, vice president for HPC and OpenPOWER, IBM Systems, “We made design decisions in 1999 around the evolving landscape and guessed that the future for supercomputing would focus on energy efficiency. It’s easy to build a small supercomputer with a single frame of 18 compute nodes, and it doesn’t matter whether the amount of energy that it draws is 50 kW or 100 kW. In terms of cost to run, there’s not a lot of difference. But when you look at scaling that up to 200 frames—the size of Summit—you’re saving 10 MW.”
The focus on energy efficiency led to another early realization: As supercomputing moved into the big data era, optimizing the way information flows became just as important as optimizing how energy flows.
“Around about 2011, we reviewed some literature to look at the cost of moving a byte of data from storage to an HPC application,” Turek says, “and found that it was around 6,000 pico-cents. That’s trivial, until you start to scale up the amount of data to exabytes. The costs start to rise exponentially as more data is processed.”
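Turek's figure is easy to put in perspective with some back-of-the-envelope arithmetic (illustrative only: 1 pico-cent = 10^-12 cents, so 6,000 pico-cents works out to roughly $6 × 10^-11 per byte moved):

```python
# Illustrative arithmetic only, based on the ~6,000 pico-cents/byte
# figure Turek cites. 1 pico-cent = 1e-12 cents = 1e-14 dollars.
COST_PER_BYTE_DOLLARS = 6_000 * 1e-14  # ≈ $6e-11 per byte moved

# Cost of moving a gigabyte, a petabyte, and an exabyte once:
for label, nbytes in [("1 GB", 1e9), ("1 PB", 1e15), ("1 EB", 1e18)]:
    print(f"{label}: ${COST_PER_BYTE_DOLLARS * nbytes:,.2f}")
```

At gigabyte scale the cost is pennies; at exabyte scale a single pass over the data already runs into tens of millions of dollars, and the iterative workflows Turek describes re-read the same data many times over.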
As well as the financial costs, moving exabytes of data around takes time—and that can be measured in days or even weeks. A traditional supercomputing application, Turek explains, is a constant flow of information, more akin to a human body’s circulatory system than a process of putting data in and collecting results out.
“It’s an iterative process,” he says, “You run a simulation, then you take the data that it produces and run a new simulation with those parameters. Data is always moving in the system, so the cost of moving that data was important to know.”
More importantly, however, supercomputers were beginning to be turned to new uses for data processing and analytics. Big data was about to become a major factor in supercomputer design, and IBM anticipated the paradigm shift with uncanny timing.
HPC: It's all About the Workflow
One of the most important factors guiding the design strategy at IBM has been to stay focused on the ways in which clients actually use supercomputers in the field. As an example, Turek recalls working with an oil company that wanted to improve its ability to locate viable fields. The application in question analyzed the seismic waveforms produced by a series of controlled explosions in a particular area, to work out what the underground structure and geology looked like.
It was taking the exploration team 13 months from the time the data was recorded to the decision to greenlight drilling.
“I thought that they would be looking for a faster way to run the algorithms that parsed waveforms,” Turek says, “But the actual time spent using the supercomputer was small. Even if they had had a machine that could calculate results instantly, the overall time saving would have been around 5%.”
Total needed faster ways to process the data before it was analyzed. “The historical fixation on algorithms created the risk that people would keep doing things the way they always have been done,” Turek adds. “But would you rather tell the CEO you’ve saved 2% of your time with this impressive new compute capability, or that you did something boring to optimize data transfer that shaved four months off the process?”
Partnerships Powering Performance
Another key insight was the importance of partnerships, and of incorporating other companies' technologies into a system. For example, the Summit design integrates a large number of NVIDIA Volta V100 GPU accelerators, each with 5,120 compute cores designed for parallel processing.
Getting data from the CPU to the GPU and back again as efficiently as possible is critical, and this led to innovations in the transfer bus. Likewise, the design offloads elements of the data processing directly to Mellanox network interface cards (NICs), which have accelerators built in.
The Power of AI
These design decisions helped give the machines an unprecedented ability to use AI and deep learning to speed up the way they process information.
As an illustration, Turek cites the ability of Summit, Sierra and Pangea III to build Bayesian analysis into big data problems. Bayesian mathematics is a branch of statistics that uses probabilities to validate a hypothesis, rather than running multiple models sequentially and then picking the best fit. This proves useful in fields such as materials science.
“A commercial client approached us a few years ago with a challenge around designing a new shampoo,” Turek says, “It sounds trivial, but it’s a sophisticated problem in chemical formulation. Chemicals can be combined in millions of ways and millions of simulations could be run. With a Bayesian methodology, you choose the next simulation based on what you already know from the ones that you’ve already performed. Being able to optimize this on an accelerator can reduce the compute time needed by 95%.”
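The selection loop Turek describes can be sketched in a few lines. This is a deliberately crude illustration, not IBM's implementation: the toy objective function, the nearest-neighbor surrogate, and the exploration bonus are all invented for the example, standing in for an expensive chemistry simulation and a proper Gaussian-process model.

```python
import math

# Toy 1-D "formulation quality" function, standing in for one
# expensive chemistry simulation per call. Quality peaks near x = 0.63.
def simulate(x):
    return math.exp(-(x - 0.63) ** 2 / 0.02)

candidates = [i / 200 for i in range(201)]            # 201 possible recipes
observed = {0.0: simulate(0.0), 1.0: simulate(1.0)}   # two seed runs

def acquisition(x):
    # Estimate value from the nearest tried point, plus an exploration
    # bonus that grows with distance from anything tried so far.
    nearest = min(observed, key=lambda o: abs(o - x))
    return observed[nearest] + 2.0 * abs(nearest - x)

# Each round, run the single most promising untried simulation --
# 15 extra runs instead of exhaustively sweeping all 201 candidates.
for _ in range(15):
    x = max((c for c in candidates if c not in observed), key=acquisition)
    observed[x] = simulate(x)

best = max(observed, key=observed.get)
print(f"best recipe ≈ {best:.3f} after {len(observed)} simulations")
```

With 17 evaluations instead of 201, the loop homes in on the peak; the roughly 95% reduction in compute time Turek quotes comes from exactly this kind of pruning, done with far more sophisticated surrogate models running on GPU accelerators.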
Another emerging use for the architecture is “cognitive discovery.” The sheer volume of research conducted in most fields today means that it’s becoming impossible for any one researcher—or even a team of researchers—to stay completely up to date with the latest literature. Their work may overlook recent discoveries or advances, or even repeat previous experiments, because they simply don’t know that they’ve happened.
STM, the association for academic and professional publishers, estimates that some 43,000 journals and 8 million researchers worldwide publish a total of 3 million papers per year—and the number is growing by 5% annually. Somewhat ironically, the term “scientific overload” returns some 556,000 results in Google Scholar.
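Those STM figures compound quickly. A quick calculation (assuming the growth rate simply holds steady at 5% per year) shows annual output doubling in about a decade and a half:

```python
# Assumes STM's figures: ~3 million papers/year, growing 5% annually.
papers_per_year = 3_000_000
years = 0
while papers_per_year < 6_000_000:  # until annual output doubles
    papers_per_year *= 1.05
    years += 1
print(f"annual paper output doubles in ~{years} years")
```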
Cognitive discovery tries to tackle this problem by drawing upon the big data and deep learning capabilities of the IBM architecture to ingest huge amounts of scientific material and return information relevant to a particular problem researchers are trying to solve. It’s proving to be highly effective.
Turek points to a team from IBM’s research facility in Zurich who worked on materials science problems relating to lithium ion batteries. The challenge they faced was that the mathematics was proving difficult to scale: They couldn’t take advantage of all of the compute nodes available to them, because the algorithm would only run on a handful. After months of work, the team refined the programming to the point where it took advantage of the entire machine.
To get there, however, they had to read more than 4,000 recently published papers on the subject. Attempting the problem again using a cognitive discovery process, they were able to build up a corpus of relevant information from a collection of 45 million papers in a matter of days.
Another example of cognitive discovery is the cloud-based IBM RXN for chemistry application, which uses semantic analysis of literature to predict the results of chemical reactions. According to Turek, this allows researchers to explore many problems without the need for developing parallel computing programs of their own.
“The way we’ve previously thought of HPC is passe,” says Turek, “It’s progressively informed by AI to make it run better or even replace it entirely.”
These kinds of advances will be even more important in the future as supercomputing becomes more decentralized. In the same way that processes are offloaded on to GPU accelerators for highly parallel problem solving today, other forms of accelerator—such as quantum processors—will become more common tomorrow. And with that level of complexity, letting AI decide which part of a machine is most appropriate to run a particular part of an application will become essential.
HPC for All
Such tools are also democratizing supercomputing, enabling smaller businesses to access time on HPCs more cost-effectively.
“It’s getting to the stage where you can do a stupendous number of things with relatively little effort,” says Turek, “Most of our proof of concept deployments have been made with small-to-medium-sized companies. It will help address the skills gap, too. If you can solve a problem in 80 hours using cognitive discovery, you can hire a really good data scientist to do it.”
The challenge with data science skills, Turek continues, isn’t just that they are costly. They’re also concentrated in a few geographical areas, which puts firms operating in some regions or countries at a disadvantage. But the more the tools and techniques for automating supercomputing and AI are used, the better they’re understood, revealing more opportunities for simplifying HPC tasks. Already, this is helping smaller firms that don’t have access to expensive talent to compete with firms that do. In the future, the hope is that supercomputing and AI will be accessible to all.
And thanks to the work of IBM, that future is getting closer all of the time.
Lawrence Livermore National Laboratory's Sierra. Photography courtesy of the Lawrence Livermore National Laboratory
The Summit supercomputer is photographed at Oak Ridge National Laboratory. Photography courtesy of Oak Ridge National Laboratory