• Data Science
  • Mike Vedomske
  • NOV 13, 2017

Why I Think Spark Will Have the Staying Power of SQL

Spark is to SQL what calculus is to algebra.

Old-timer Just Keeps on Tickin’

SQL has been around in commercial form since 1979, almost 40 years. That’s when Relational Software, Inc. (which later became Oracle) released Oracle version 2 (a marketing name for what was really version 1).

Think about that for a second. A hotshot, fresh-out-of-undergrad new hire with SQL skills in 1979 would be retiring in just a couple of years. Some people still build their careers on SQL skills. In other words, SQL has had incredible staying power. It still does.

Enter the Young Gun

So what does this have to do with AI? Well, I’m going to go out on a limb and say that we’re three years into a similar journey with another landscape-shifting technology: Spark. Spark 1.0 was released as an Apache project in May of 2014. I happened to be a fresh hire (albeit PhD, not undergrad) and Spark was HOT. I mean, it was exactly what our company (and many others) needed, and every release just got better.

Here are a few reasons I believe Spark will stay meaningful through the years.

1. Daddy Warbucks Got Yer Back

SQL was supported by a strong company (read: it had commercial support) while also taking advantage of the open source efforts of outside contributors, and it was eventually standardized. Spark has the commercial support of Databricks, which is currently valued at nearly $1B only a few years into its existence. And as an Apache project, it is developed at an extremely rapid clip by a vibrant open-source community.

What’s probably even more important is the fact that it is used at so many large companies. In other words, it has weaseled its way into the core toolset of the companies behind much of the world’s GDP. And that’s just the beginning, because according to Databricks, they’re still working on reaching the other 99%.

2. One Stop ML Shop

One of the things that made Spark really great was that it could pass through HiveQL (HQL) and soon had its own SQL-like language, Spark SQL. For the first time, data wrangling and machine learning could be executed on big data in one place, in well-known languages, at extraordinary speed. It was the holy grail of big data science.

Spark meets two primary needs:

  1. Easy data wrangling (in a familiar approach: SQL)
  2. Many of your favorite machine learning algorithms at scale.

In other words, SQL’s staying power, and its natural way of thinking about data, is what will help Spark also have staying power. Yes, many data stores are NoSQL, but the fact that you can use non-relational databases and think about the data in them as if they were relational is what makes Spark so powerful. Notice that SQL is still the reference here. Databases are still described by their relation to SQL, and that’s saying something.

Spark can handle pretty much any data store you throw at it, and data scientists can use a common way of thinking about data (SQL) for handling it. You don’t have to use the SQL-like interface, but it’s there, and many take advantage of it. Don’t care for the SQL/HQL approach? That’s fine, you can use Spark like many use bash for data wrangling. Spark spans many skill levels.

3. It Feels Familiar

Because Spark has a machine learning library, you can use it much like you would familiar data science languages like R and Python. The usefulness here goes beyond just syntax; it’s the process that makes it so user-friendly.

Interactively playing with and exploring data is one of the most powerful parts of R and Python. You can very quickly start to peel back the layers and find the stories within the data. Before Spark, that process was painful and slow on big data (sorry, MapReduce 🙁 … ). Suddenly, with Spark, working with very large data sets felt much more like what we experienced in R and Python. Sure, there was still some waiting, but nothing close to what it was before.

The second powerful part of R and Python is the set of packages that contain numerous algorithms for machine learning (and just about any other data-related task you can think of). Spark does this as well, though in a more limited way (due to the parallelization it requires). Spark makes big data feel a little smaller. In today’s parlance, the user experience is solid.


See You In 40 Years

SQL made working with data much simpler. For the first time, people could use a straightforward logic and language for getting at previously hidden knowledge. Spark is the next natural step of that evolution. In this step, the hidden knowledge is less explicit, and is found via feature engineering, machine learning, and dipping into vast stores of previously untapped data. Spark makes doing these things simple in the way that SQL made the first step of data exploration simple.

Spark is to SQL what calculus is to algebra. And that’s why I think Spark will have the staying power of SQL.

Originally published at IoT For All
