Why I Think Spark Will Have the Staying Power of SQL

Mike Vedomske
February 15, 2019 Big Data, Cloud & DevOps

Spark is to SQL what calculus is to algebra.

Old-timer Just Keeps on Tickin’

SQL has been around in commercial form for almost 40 years, since 1979. That’s when Relational Software, Inc. (which later became Oracle) released Oracle version 2 (which was the marketing name for what was really version 1).

Think about that for a second. A hotshot, fresh-out-of-undergrad, SQL-skilled new hire from back then would be retiring in just a couple of years. Some people still build their careers on SQL skills. In other words, SQL has had incredible staying power. It still does.

Enter the Young Gun

So what does this have to do with AI? Well, I’m going to go out on a limb and say that we’re three years into a similar journey with another landscape-shifting technology: Spark. Spark was initially released as an Apache project in May of 2014. I happened to be a fresh hire (albeit a PhD, not an undergrad) and Spark was HOT. I mean, it was exactly what our company (and many others) needed, and every release just got better.

Here are a few reasons I believe Spark will stay meaningful through the years.

1. Daddy Warbucks Got Yer Back

SQL was supported by a strong company (read: it had commercial support) while also taking advantage of the open source efforts of outside contributors, and it was eventually standardized. Spark has the commercial support of Databricks, which is currently valued at nearly $1B only a few years into its existence. And as an Apache project, it is developed at an extremely rapid clip by a vibrant open-source community.

What’s probably even more important is the fact that it is used at so many large companies. In other words, it has weaseled its way into the core toolset of much of the world’s GDP. And that’s just the beginning, because according to Databricks, they’re still working on reaching the other 99%.

2. One Stop ML Shop

One of the things that made Spark really great was that it could pass through HQL and then soon had its own SQL-like language, Spark SQL. For the first time, data wrangling and machine learning could be executed on big data in one place in well-known languages at extraordinary speed. It was the holy grail of big data science.

Spark meets two primary needs:

  1. Easy data wrangling (in a familiar approach: SQL)
  2. Many of your favorite machine learning algorithms at scale.

In other words, SQL’s staying power, and its natural way of thinking about data, is what will help Spark also have staying power. Yes, most data stores are NoSQL, but the fact that you can use non-relational databases and think about the data in them as if they were relational is what makes it so powerful. Notice that SQL is still the reference here. All databases are described by their relation to SQL, and that’s saying something.

Spark can handle pretty much any data store you throw at it, and data scientists can use a common way of thinking about data (SQL) for handling it. You don’t have to use the SQL-like interface, but it’s there, and many take advantage of it. Don’t care for the SQL/HQL approach? That’s fine, you can use Spark like many use bash for data wrangling. Spark spans many skill levels.

3. It Feels Familiar

Because Spark has a machine learning library, you can use it much like you would familiar data science languages like R and Python. The usefulness here goes beyond just syntax; it’s the process that makes it so user-friendly.

Interactively playing with and exploring data is one of the most powerful parts of R and Python. You can very quickly start to peel back the layers and find the stories within the data. Before Spark, that process was painful and slow (sorry MapReduce 🙁 … ). Suddenly with Spark, working with very large data sets felt much more like what we experienced in R and Python. Sure, there was still some waiting, but nothing close to what it was before.

The second powerful part of R and Python is the packages, which contain numerous algorithms for machine learning (and just about any other data-related task you can think of). Spark does this as well, though in a more limited way (due to the parallelization it requires). Spark makes big data feel a little smaller. In today’s parlance, the user experience is solid.

 

See You In 40 Years

SQL made working with data much simpler. For the first time, people could use a straightforward logic and language for getting at previously hidden knowledge. Spark is the next natural step of that evolution. In this step, the hidden knowledge is less explicit, and is found via feature engineering, machine learning, and dipping into vast stores of previously untapped data. Spark makes doing these things simple in the way that SQL made the first step of data exploration simple.

Spark is to SQL what calculus is to algebra. And that’s why I think Spark will have the staying power of SQL.

Originally published at IoT For All
