IBM Researchers Maximize Apache Spark’s Capabilities
Illustration by Lonnie Busch
Big data is growing exponentially. While this is driving the transformation of businesses, organizations are struggling to simplify all of that data and ultimately unlock its value.
Many organizations are turning to Apache Spark to address this challenge. Widely used to gain insights from complex data, the open-source big data project completes full data analytics operations in memory and excels at analyzing streaming data that requires multiple operations.
In 2015, IBM redesigned more than 15 of its core analytics and commerce solutions with Apache Spark to push the company’s real-time data processing capabilities into high gear. But IBM Research teams are going a step further by figuring out ways to maximize Spark’s role in big data analytics.
As more data is produced, analyzing it in main memory becomes harder. Researchers at the IBM Austin Research Laboratory (ARL) have created, and are conducting trials on, a solution that uses flash storage attached through the Coherent Accelerator Processor Interface (CAPI) introduced with POWER8 technology.
CAPI Flash technology connects the flash device to the memory hierarchy through an interface with lower latency and higher bandwidth. Many jobs today are larger than the physical memory that fits on the nodes of most in-memory databases, and CAPI Flash gives those jobs more capacity to work with. Because a typical Spark application is highly iterative, a job that outgrows main memory must repeatedly page data in and out, or spill it to disk. Researchers at ARL have developed an alternative that takes advantage of CAPI Flash and reduces the latency of moving data between memory and storage by completely bypassing the OS’s I/O subsystem.
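To make the spilling problem concrete, here is a minimal toy sketch in Python. It is not Spark's memory manager and it does not model CAPI Flash; the class `TwoTierStore` and the function `run_iterations` are hypothetical names invented for illustration. It assumes a small fast tier standing in for main memory, with least-recently-used entries spilled to a slow tier standing in for secondary storage, and it counts how often an iterative job touches the slow tier.

```python
from collections import OrderedDict


class TwoTierStore:
    """Toy model (hypothetical): a bounded fast tier that spills
    least-recently-used entries to an unbounded slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for main memory
        self.slow = {}             # stands in for disk or flash
        self.spills = 0            # entries moved from fast to slow
        self.slow_reads = 0        # entries fetched back from slow

    def put(self, key, value):
        self.slow.pop(key, None)   # drop any stale copy in the slow tier
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key]
        value = self.slow.pop(key)      # raises KeyError for unknown keys
        self.slow_reads += 1
        self.fast[key] = value
        self._evict()
        return value

    def _evict(self):
        # Spill the least recently used entries until the fast tier fits.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
            self.spills += 1


def run_iterations(store, keys, iterations):
    """Read every key once per iteration, as an iterative job would."""
    for _ in range(iterations):
        for key in keys:
            store.get(key)
    return store.spills, store.slow_reads


def demo(fast_capacity):
    store = TwoTierStore(fast_capacity)
    for key in ("a", "b", "c"):
        store.put(key, key.upper())
    return run_iterations(store, ["a", "b", "c"], iterations=2)
```

In this toy, `demo(3)` never touches the slow tier, while `demo(2)` goes to the slow tier on every single read, because each pass over the data evicts exactly the entry the next read needs. That is why the cost of each trip to secondary storage matters so much for iterative workloads that are only slightly larger than memory.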
“The thing about Spark is as long as everything remains in memory—which is the fastest data storage we have in a computer—everything is fine, but we’re seeing systems spill data frequently as problem sizes grow,” says Jan Rellermeyer, research scientist, ARL, and adjunct assistant professor at the University of Texas at Austin. “Another issue is that main memory can easily become the most expensive part of a computer system these days. With CAPI Flash, we’re providing an alternative to using something as slow as a hard disk drive or even a traditionally attached flash device, thereby taking less of a performance penalty when having to spill to a secondary storage. CAPI Flash can be used to take the same size of problem and run it with less main memory.”
As problem sizes grow larger than the main memory of their computers, businesses must scale out and run their workloads on a cluster of machines, which is what Spark was ultimately designed to support. But with CAPI Flash on POWER8 systems, businesses can bridge this gap and extend their computer’s main memory with an alternative that may not be as fast as dynamic RAM (DRAM) but offers a more competitive price per gigabyte.