The Challenge of Data Scale is Driving Data Center Automation

Joan Wrabetz
July 3, 2018 Big Data, Cloud & DevOps

Data scale is one of the top challenges that is changing the face of data centers

In the previous blog, 5 top data challenges that are changing the face of data centers, I introduced a set of challenges presented by “New Data”: data that is both transactional and unstructured, both publicly available and privately collected, and whose value comes from the ability to aggregate and analyze it. Loosely speaking, we can divide this new data into two categories: big data (large aggregated data sets used for batch analytics) and fast data (data collected from many sources and used to drive immediate decision making). The big data–fast data paradigm is driving a completely new architecture for data centers, both public and private.

In this blog, I would like to focus on Challenge #2: data scale is driving data center automation.

Public cloud providers already operate at massive scale. As smaller organizations move to the public cloud, the remaining private datacenters are also getting much larger. A big driver for this scale is data: both the data storage itself and the compute capacity required to analyze that data.

Unfortunately, at the scale of petabytes of storage and thousands of compute nodes, manual management is simply cost-prohibitive. This is leading to a completely new set of storage architectures that can operate at large scale while requiring very little management of the data: moving the data, protecting the data, and making the data available at the right performance level for the analysis that is needed at any point in time. For example, big data needs different performance characteristics than fast data.
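To make the scale problem concrete, here is a rough back-of-the-envelope sketch in Python; the capacity, drive size, replication factor, and failure rate below are illustrative assumptions, not figures from any particular deployment.

```python
# Back-of-the-envelope estimate of why manual storage management breaks down
# at petabyte scale. All numbers are illustrative assumptions.

usable_pb = 10               # assumed usable capacity, in petabytes
drive_tb = 8                 # assumed capacity per drive, in terabytes
replication_factor = 3       # assumed copies kept for reliability
annual_failure_rate = 0.02   # assumed ~2% of drives fail per year

raw_tb = usable_pb * 1000 * replication_factor
drives = raw_tb / drive_tb
failures_per_week = drives * annual_failure_rate / 52

print(f"Drives in the fleet:        {drives:,.0f}")
print(f"Expected failures per week: {failures_per_week:.1f}")
# With thousands of drives, failures become a routine weekly event, so
# rebuilds and data placement have to be automated rather than handled by hand.
```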

A new class of storage vendor has emerged whose solutions accomplish this goal through a combination of 1) software-defined storage, 2) commodity building-block hardware, 3) distributed scalable storage architectures, and 4) application awareness. This combination ensures that many of the storage management needs of a datacenter are essentially automated as part of the data solution itself and do not require outside manual intervention. Let's look at each of these characteristics and how they make large-scale datacenter operations cost-effective:

Software-defined storage on commodity building-block hardware

By separating data management functions completely from hardware, new software-defined storage solutions make it possible to build storage systems from a small set of common hardware building blocks in a datacenter. This dramatically reduces the cost of scaling. No pre-planning is required: just pop a new compute and/or storage node into a rack and turn it on. The software-defined storage layer starts using the new capacity and performance immediately and, in most cases, makes that capacity available to all users. If new computing capabilities become available, there is no “lift and shift” of the existing nodes, because the software doesn't really care about the underlying hardware. Storage is provisioned and managed in the software layer, so configuration is done once rather than every time hardware is added. Eliminating hardware planning and management, and separating storage configuration from hardware, dramatically cuts the management costs associated with large-scale storage.
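As a rough illustration of the idea that capacity is pooled in software and new nodes are absorbed automatically, here is a minimal, hypothetical sketch; the `StoragePool` and `Node` classes are invented for illustration and do not correspond to any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A commodity building block: some amount of raw capacity, any vendor."""
    name: str
    capacity_tb: int

@dataclass
class StoragePool:
    """Software layer that aggregates whatever hardware is plugged in."""
    nodes: list[Node] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # "Pop a node into the rack and turn it on": the pool absorbs it
        # immediately, with no per-node configuration by an administrator.
        self.nodes.append(node)

    @property
    def capacity_tb(self) -> int:
        return sum(n.capacity_tb for n in self.nodes)

pool = StoragePool()
pool.add_node(Node("rack1-node1", capacity_tb=64))
pool.add_node(Node("rack1-node2", capacity_tb=96))  # newer, bigger hardware
print(f"Pooled capacity: {pool.capacity_tb} TB")    # 160 TB, no lift and shift
```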

Distributed scalable storage architectures 

Managing and provisioning individual storage units at large scale is essentially impossible. This has led to two compute and storage architectures that address the issue: hyperconverged and hyperscale. Both provide small building-block units of compute and storage that can be scaled out as the datacenter grows, and software-defined storage on top of those architectures takes advantage of the building blocks to grow capacity and performance. A critical component of these solutions is a distributed storage software architecture: storage is aggregated from across the entire hyperconverged or hyperscale hardware system and then presented to users and applications as logical units of storage that can be any size. The majority of these software-defined architectures are also smart enough to take advantage of different types of storage that operate at different performance levels (such as flash, hard drives, and cold storage). The software layer provides data management functions like placing data on the right tier, caching data for performance, replicating data for reliability, and deduplicating data for efficiency. Smart storage software that can take advantage of large-scale, non-uniform hardware eliminates another big swath of management cost.
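To make the tiering idea concrete, here is a minimal, hypothetical placement policy; the tier names and the `place` rule are illustrative assumptions, not a description of any specific product.

```python
from enum import Enum

class Tier(Enum):
    FLASH = "flash"   # low latency, expensive
    HDD = "hdd"       # bulk capacity
    COLD = "cold"     # archival, cheapest

def place(latency_sensitive: bool, days_since_last_access: int) -> Tier:
    """Toy policy: hot, latency-sensitive data lands on flash; data that
    hasn't been touched in a while drains down to cheaper tiers."""
    if latency_sensitive and days_since_last_access < 7:
        return Tier.FLASH
    if days_since_last_access < 90:
        return Tier.HDD
    return Tier.COLD

# A background process would apply this policy continuously, so no
# administrator has to move data between tiers by hand.
print(place(latency_sensitive=True, days_since_last_access=1))     # Tier.FLASH
print(place(latency_sensitive=False, days_since_last_access=200))  # Tier.COLD
```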

Application awareness

At the end of the day, data is accessed and used by applications, which means that most storage is provisioned to individual applications. Historically, this was a manual operation performed by specialized administrators such as DBAs (database administrators). But in a world of large-scale analytics frameworks like Hadoop, memory-centric engines like Spark, and databases like MongoDB, manually provisioning storage to applications in units like LUNs is simply not feasible. Many of these applications also run in some kind of virtual environment, such as VMs or containers. New storage architectures often present logical storage from a large aggregated pool directly to the application, which eliminates a couple of big manual provisioning steps that were historically required. In some cases, software-defined storage architectures can give each application logical storage that is specifically created to match its unique performance requirements. For example, a MongoDB installation may require all-flash storage, while a large Hadoop cluster can operate quite effectively on hard drives. The software-defined storage solution can manage these choices automatically, with some guidance from the application.
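As a sketch of what application-aware provisioning might look like, here is a hypothetical example in which an application declares its performance profile and the storage layer picks a matching tier; the profile names and the `provision` helper are invented for illustration.

```python
# Hypothetical application-aware provisioning: the application declares a
# profile, and the software-defined storage layer chooses the media for it.

PROFILES = {
    "low-latency-db": {"media": "flash", "replicas": 3},  # e.g. a MongoDB-style workload
    "batch-analytics": {"media": "hdd", "replicas": 2},   # e.g. a Hadoop-style cluster
}

def provision(app_name: str, profile: str, size_tb: int) -> dict:
    """Return a logical volume description matched to the app's profile,
    instead of an administrator hand-carving LUNs."""
    spec = PROFILES[profile]
    return {"app": app_name, "size_tb": size_tb, **spec}

print(provision("orders-db", "low-latency-db", size_tb=2))
print(provision("clickstream", "batch-analytics", size_tb=500))
```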

The bottom line for large-scale datacenters is the need to move away from storage “in the box” and toward hyperconverged- and hyperscale-friendly software-defined storage architectures that operate at scale with a minimum of manual intervention and management.

Keep an eye out for the next blog in this series on challenges that are changing the face of the datacenter, coming soon.
