Succinct Spark: Queries on Compressed RDDs

We are very excited to announce the release of Succinct Spark, a Spark package that integrates Succinct within the Apache Spark ecosystem. Succinct Spark enables search, count, range and random access queries on compressed RDDs. This allows users to use Spark as a document store (with search on documents) similar to ElasticSearch, a key-value store (with search on values) similar to HyperDex, and an experimental DataFrame interface (with search along columns in a table). When used as a document store, Succinct Spark is 2.75x faster than ElasticSearch for search queries while requiring 2.5x lower storage, and over 75x faster than native Spark.

Succinct Spark Overview

As we discussed in our last post, search is becoming an increasingly powerful primitive in big data analytics and web services. Many web services support some form of search, including LinkedIn searchTwitter searchFacebook search, Netflix search, airlines, hotels, as well as services specifically built around search — Google, Bing, Yelp, to name a few. Apache Spark supports search via full RDD scans. While fast enough for small datasets, data scans become inefficient as dataset become even moderately large. One way to avoid data scans is to implement indexes, but can significantly increase the memory overhead.

Succinct Spark achieves a unique tradeoff — storage overhead no worse (and often lower) than data-scan based techniques and query latency comparable to index-based techniques. Succinct Spark enables search (and a wide range of other queries) directly on compressed representation of the RDDs. What differentiates Succinct Spark is that queries are supported without storing any secondary indexes, without data scans and without data decompression — all the required information is embedded within the compressed RDD and queries are executed directly on the compressed RDD.

In addition, Succinct Spark supports random access of records without scanning the entire RDD, a functionality that we believe will significantly speed up a large number of applications.

An example

Consider a collection of Wikipedia articles stored on HDFS as a flat unstructured file. Let us see how Succinct Spark supports the above functionalities:

// Import SuccinctRDD
import edu.berkeley.cs.succinct._

// Create a Spark RDD as a collection of articles; ctx is the SparkContext
val articlesRDD = ctx.textFile("/path/to/data").map(_.getBytes)

// Compress the Spark RDD into a Succinct Spark RDD, and persist it in memory
// Note that this is a time consuming step (usually at 8GB/hour/core) since data needs to be compressed. 
// We are actively working on making this step faster.
val succinctRDD = articlesRDD.succcinct.persist()

// SuccinctRDD supports a set of powerful primitives directly on compressed RDD
// Let us start by counting the number of occurrences of "Berkeley" across all Wikipedia articles
val count = succinctRDD.count("Berkeley")

// Now suppose we want to find all offsets in the collection at which “Berkeley” occurs; and 
// create an RDD containing all resulting offsets 
val offsetsRDD = succinctRDD.search("Berkeley")

// Let us look at the first ten results in the above RDD
val offsets = offsetsRDD.take(10)

// Finally, let us extract 20 bytes before and after one of the occurrences of “Berkeley”
val offset = offsets(0)
val data = succinctRDD.extract(offset - 20, 40)

Many more examples on using Succinct Spark are outlined here.

Performance

kv-document-search-2The figure compares the search performance of Succinct Spark against ElasticSearch and native Spark. We use a 40GB collection of Wikipedia documents over a 4-server Amazon EC2 cluster with 120GB RAM (so that all systems fit in memory). The search queries use words with varying number of occurrences (1–10,000) with uniform random distribution across 10 bins (1–1000, 1000-2000, etc). Note that the y-axis is on log scale.

Interestingly, Succinct Spark is roughly 2.75x faster than Elasticsearch. This is when ElasticSearch does not have the overhead of Spark job execution, and have all the data fit in memory. Succinct Spark achieves this speed up while requiring roughly 2.5x lower memory than ElasticSearch (due to compression, and due to storing no additional indexes)! Succinct Spark is over two orders of magnitude faster than Spark’s native RDDs due to avoiding data scans. Random access on documents has similar performance gains (with some caveats).

Below, we describe a few interesting use cases for Succinct Spark, including a number of interfaces exposed in the release. For more details on the Succinct Spark release (and Succinct in general), usage and benchmark results, please see +Spark webpagethe NSDI paper, or a more detailed technical report.

Succinct Spark abstractions and use cases

Succinct Spark exposes three interfaces, each of which may have several interesting use cases. We outline some of them below:

  • SuccinctRDD
    • Interface: Flat (unstructured) files
    • Example application: log analytics
    • Example: one can search across logs (e.g., errors for debugging), or perform random access (e.g., extract logs at certain timestamps).
    • System with similar functionality: Lucene
  • SuccinctKVRDD
    • Interface: Semi-structured data
    • Example application: document stores, key-value stores
    • Example:
      • (document stores) search across a collection of Wikipedia documents and return all documents that contain, say, string “University of California at Berkeley”. Extract all (or a part of) documents.
      • (key-value stores) search across a set of tweets stored in a key-value store for tweets that contain “Succinct”. Extract all tweets from the user “_ragarwal_”.
    • System with similar functionality: ElasticSearch
  • (An experimental) DataFrame interface
    • Interface: Search and random access on structured data like tables
    • Example applications: point queries on columnar stores
    • Example: given a table with schema {userID, location, date-of-birth, salary, ..}, find all users who were born between 1980 and 1985.
    • Caveat: We are currently working on some very exciting projects to support a number of additional SQL operators efficiently directly on compressed RDDs.

When not to use Succinct Spark

There are a few applications that are not suitable for Succinct Spark — long sequential reads, and search for strings that occur very frequently (you may not want to search for “a” or “the”).

Looking Ahead

We at AMPLab are working on several interesting projects to make Succinct Spark more memory efficient, faster and more expressive. To give you an idea about what is next, we are going to close this post with a hint on our next post: executing Regular Expression queries directly on compressed RDDs. Stay tuned!

Introduction to the Succinct project

Web applications and services today collect, store and analyze an immense amount of data. The sheer size of the data has fundamentally changed the bottlenecks in systems for big data analytics. In particular, memory bandwidth and CPU performance continue to grow at a rate much faster than bandwidth between CPU and slower storage devices (SSD, disk, etc.). The result is an I/O bottleneck, that is (and will continue to be) getting worse!

A fundamental approach to alleviating the I/O bottleneck is to use data compression. Traditional compression techniques have led to significant gains in terms of storage costs, energy costs, and performance for a wide variety of batch processing jobs. Traditional compression techniques have also been used for reducing I/O bottlenecks in columnar stores with significant performance improvements for OLAP workloads that typically require scanning the entire dataset (see Daniel Abadi’s thesis, this paper, and  references within).

However, the aforementioned compression and query execution techniques are unsuitable for a wide variety of workloads that do not necessarily require data scans (e.g., point queries). One example is search, a fundamental primitive supported by many web applications and services. Examples include Facebook search, Twitter search, LinkedIn search, Airline and hotel search, and services that are specifically built around search (Google, Bing, Yelp, to name a few). Another example is random access as typically performed via get() interface in key-value stores, NoSQL stores, document stores, etc. Queries in such workloads are often short-lived (ideally sub-millisecond), and data scans and/or decompression lead to significant performance degradation. Given the large number of applications that run such workloads, we at AMPLab decided to take a stab at this problem and asked the following fundamental question:

Is it possible to execute point queries (e.g., search and random access) directly on compressed data without performing data scans?

Exploring the above question led to the Succinct project! At a high-level, Succinct enables a wide range of queries including search, range and wildcard queries over arbitrary strings as well as random access into the input data directly on a compressed representation of the inputWhat differentiates Succinct from previous systems that support point queries is that Succinct supports these queries without storing any indexes, without data scans and without data decompression — all the required information is embedded within the compressed representation and queries are executed directly on the compressed representation.

On real-world and benchmark datasets, Succinct can execute sub-millisecond search queries while keeping as much as an order of magnitude more input data in faster storage compared to state-of-the-art systems that provide similar functionality using indexes. For example, on a server with 128GB RAM, Succinct can push as much as 163 — 250GB of raw data, depending on the dataset, while executing search queries within a millisecond. Thus, Succinct executes more queries in faster storage, leading to lower query latency than existing systems for a much larger range of input sizes.

Over next couple of weeks, we will be providing much more information on Succinct — the techniques, tradeoffs and benchmark results over several real-world applications! We are also very excited about the upcoming open-source release of Succinct Spark, making point queries extremely efficient on Spark. Stay tuned!

Finally, over next couple of weeks, I will write a lot more about several very exciting follow up projects on Succinct that we have been working on at AMPLab.