With the scale of data growing at a rapid, even ominous pace, we needed a way to process potentially petabytes of data quickly, and a single computer simply can't churn through that volume at a reasonable pace. The solution is to create a cluster of machines to perform the work for you, but how do those machines work together to solve a common problem? Spark is a cluster-computing framework for large-scale data processing.
We have a lot of data, and we aren't getting rid of any of it. We need a way to store ever-increasing amounts of data at scale, with protection against data loss stemming from hardware failure. Finally, we need a way to digest all this data with a quick feedback loop. Thank the cosmos we have Hadoop and Spark. Apache Spark is a wonderfully powerful tool for data analysis and transformation.