Eric Girouard

About Me

Eric Girouard is a Data Engineer at BHE, an independent health analytics company.

High Level Overview of Apache Spark

With the scale of data growing at a rapid and ominous pace, we needed a way to process potentially petabytes of data quickly, and a single computer simply couldn’t process that much data at a reasonable pace. A cluster of machines can share the work, but how do those machines cooperate on a common problem? Spark is a cluster computing framework for large-scale data processing.

Why We Need Apache Spark

We have a lot of data, and we aren’t getting rid of any of it. We need a way to store increasing amounts of data at scale, with protection against data loss caused by hardware failure. We also need a way to digest all this data with a quick feedback loop. Thank the cosmos we have Hadoop and Spark. Apache Spark is a wonderfully powerful tool for data analysis and transformation.

Made in Boston @ The Harvard Innovation Lab


Matching Providers

Matching providers 2