TACC Uses Longhorn Supercomputer to Solve Pressing Problems
The Texas Advanced Computing Center is using POWER9 technology to more quickly provide answers to pressing problems
Image by Texas Advanced Computing Center
By Jim Utsler
10/01/2020
Big problems require big solutions, and in today’s world, big systems are often needed to reach those solutions. And that’s how the Texas Advanced Computing Center (TACC) at the University of Texas (UT) in Austin is approaching the research it supports, by going big in true Texas fashion.
In fact, as of June 2020, TACC had three systems on the TOP500 list of the world’s most powerful supercomputers: Stampede2, Frontera and Longhorn, the last of which is powered by IBM POWER9™ technology. This would be an achievement for any large commercial or governmental organization, but it’s a true triumph for an academic institution.
“TACC has some support from the university and other agencies but our big machines, including the IBM system, were also supported by the National Science Foundation,” says Dan Stanzione, associate vice president for research, UT Austin and executive director, TACC. “We’re part of what we call a ‘cyberinfrastructure,’ which is an open-science computing environment. We have users here at UT Austin and also around the country and internationally. We support several thousand different projects, including more than a hundred researchers working on coronavirus.”
Effectively Deploying GPUs
Although the coronavirus is currently one of its top computing priorities, researchers using these systems also work on basic science, computational biology, materials, molecular dynamics and various engineering specialties. All of the different TACC systems support almost 3,000 projects from about 450 institutions worldwide.
UT’s sciences department builds instruments such as cryogenic electron microscopes to help solve the largest science and engineering questions in the world. Broadly, these can be categorized as:
- Simulation problems, where complete mathematical models of physical phenomena exist and can be queried
- AI problems, with no complete physical models but mounds of data; this requires the construction of statistical models based on the data against which questions can be asked
- Analytics problems, where experimental infrastructures or apparatuses generate massive amounts of data researchers want to ask questions about
“We deployed an updated ecosystem for working on these largest problems last year and Longhorn (our IBM/GPU cluster) was one component of this,” Stanzione says. “We particularly chose the POWER9 solution because it was the most effective way to deploy GPUs, with NVIDIA NVLink maximizing host-to-GPU connectivity and GPU-to-GPU connectivity. At the moment, this is truly compelling for workloads in molecular dynamics and deep learning. So, we combined the IBM solution to be part of our larger infrastructure to handle specific workloads.”
This is because some workloads require faster response times. Although the x86-based Stampede2 and Frontera systems are more powerful in terms of sheer brute force (and rank higher on the TOP500 list), Longhorn, which is based on the same technologies as the Summit and Sierra supercomputers (ranked No. 1 and No. 2 on the TOP500 list), is nimbler. This is thanks to the heavy use of GPUs and the very quick interconnectivity afforded by NVLink.
“Our main reason for selecting those servers is having NVLink linking back to the main processor. With any sort of Intel- or AMD-based solution with GPUs, NVLink can only connect between the GPUs; you can’t get them to talk back to the processor and main memory,” Stanzione explains. “You’re limited to a PCI-type connection, a regular system bus in the computer. So, that distinguishing feature was what made us go with POWER9. The faster we can go from GPU via NVLink to the processor, the faster we can get results.”
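The effect of a faster host-to-GPU link can be illustrated with a back-of-the-envelope calculation. The sketch below is only that: the bandwidth values and the batch size are illustrative assumptions, not figures from TACC or measurements taken on Longhorn, and the function name `transfer_seconds` is hypothetical. It ignores latency and protocol overhead and simply divides data volume by link bandwidth.

```python
def transfer_seconds(data_gb: float, link_gb_per_s: float) -> float:
    """Ideal time to move data_gb gigabytes over a link.

    Ignores latency, protocol overhead and contention.
    """
    if link_gb_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return data_gb / link_gb_per_s


# Illustrative assumptions only; not taken from the article or from Longhorn.
ASSUMED_PCIE_GB_PER_S = 16.0    # a PCI-type host-to-GPU connection
ASSUMED_NVLINK_GB_PER_S = 75.0  # an NVLink-type host-to-GPU connection
BATCH_GB = 32.0                 # hypothetical amount of data staged per step

pcie_time = transfer_seconds(BATCH_GB, ASSUMED_PCIE_GB_PER_S)
nvlink_time = transfer_seconds(BATCH_GB, ASSUMED_NVLINK_GB_PER_S)

print(f"PCI-type link:    {pcie_time:.2f} s per batch")
print(f"NVLink-type link: {nvlink_time:.2f} s per batch")
print(f"Ratio:            {pcie_time / nvlink_time:.1f}x")
```

Under these assumed numbers, the time a GPU spends waiting on host memory shrinks in proportion to the link bandwidth, which is the point Stanzione makes about getting results faster.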
For example, if an organization is constantly streaming data through the GPUs, the data stream may stop. This is because the system processors are sitting idly by, waiting for the data they need to process workloads. Having a fast and closely integrated connection from GPU to GPU, GPU to system memory, system memory to other systems, and those systems to storage matters. An organization may have a very fast computer, but it’s being starved as the CPUs wait for GPU cycles. That’s why the TACC’s Longhorn system focuses on
“We have other machines that focus on smaller problems,” Stanzione says, “but big problems aren’t going to fit in the memory of a single GPU or in the memory of a single server, so balanced I/O performance back to remote memory or storage is critically important. You can get answers much quicker in such an environment.”
When developing Longhorn, TACC looked at a budget cap, watching what the U.S. Department of Energy (DoE) did with Summit in terms of how many GPUs per node it was deploying. That was a crucial architectural consideration, making sure to keep everything affordable and in balance. For instance, if an organization has too many GPUs working with a particular CPU, problems regarding processing speeds can arise.
Ideally, an organization should find the balance that, on average, best suits its workload. TACC felt that four GPUs per node was the correct ratio, with two GPUs per CPU socket. If considerations such as that aren’t part of the design equation, some nodes won’t be used as effectively as they could be.
“We were watching the DoE as its machines came up, watching what their experiences were and which design elements benchmarked better,” Stanzione remarks. “Following those observations, we decided to work with IBM to build and deploy Longhorn.”
TACC chose to go with a relatively small cluster by node count, but those nodes have more horsepower than a typical CPU-based cluster node does. The x86-based Frontera system, for example, has more than 8,000 nodes, but Longhorn has only 112, with a total of 448 GPUs. Even so, it has the horsepower equivalent of a conventional 700- or 800-node cluster.
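The cluster totals follow directly from the per-node design figures quoted in the article: 112 nodes, four GPUs per node and two GPUs per CPU socket. A minimal sketch of that arithmetic is shown below; the function and key names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the article.
NODES = 112
GPUS_PER_NODE = 4
GPUS_PER_SOCKET = 2


def cluster_totals(nodes: int, gpus_per_node: int, gpus_per_socket: int) -> dict:
    """Derive cluster-wide GPU and CPU-socket counts from per-node figures."""
    if gpus_per_node % gpus_per_socket != 0:
        raise ValueError("GPUs per node must divide evenly across sockets")
    sockets_per_node = gpus_per_node // gpus_per_socket
    return {
        "sockets_per_node": sockets_per_node,
        "total_gpus": nodes * gpus_per_node,
        "total_cpu_sockets": nodes * sockets_per_node,
    }


totals = cluster_totals(NODES, GPUS_PER_NODE, GPUS_PER_SOCKET)
for name, value in totals.items():
    print(f"{name}: {value}")
```

With these inputs, the 448 figure corresponds to GPUs (112 × 4), while the number of CPU sockets works out to two per node.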
TACC supports many different types of scientific disciplines and instruments, such as genome sequencing for the novel coronavirus. It also processes information from scientific tools like the Large Hadron Collider, the Laser Interferometer Gravitational-Wave Observatory and the various devices involved in the Internet of Things, which have a variety of scientific purposes, such as wildfire and hurricane modeling.
Some researchers, depending on their code and workloads, will spread their work across the TACC super-system spectrum while others focus on one system or another. As Stanzione explains, “A researcher here at UT Austin does a type of fast tumor identification, and there’s a GPU-centric part to that workflow, such as using AI to deduce what’s in an image. It will run on the IBM GPU system on Longhorn, but his preprocessing steps will run on the traditional CPUs on Frontera.”
Large and Small Scale
Whatever the case, it’s clear that the POWER9-based Longhorn has become a key component of TACC’s supercomputing infrastructure. This is especially true now as simulation, AI and analytics problems become more mainstream. They often demand the quick GPU-to-GPU and GPU-to-CPU interconnects provided by NVLink.
Although TACC is primarily focused on big, seemingly intractable scientific problems, other organizations can take advantage of the same technology, even if on a smaller scale. Financial institutions, for example, can leverage real-time AI to more quickly respond to potential instances of fraud, and medical facilities can use the technology to help healthcare providers diagnose a variety of conditions.
As Stanzione puts it, “Many of our users who were already big GPU fans think Longhorn is great. It’s faster than the previous GPU systems they’ve had access to. It gives them an additional tool with which to work, and the numbers have been really encouraging.”
Jim Utsler, IBM Systems magazine senior writer, has been writing for IBM since the mid-1990s.