The Case for Using Data Simulators to Drive Big Data Success

Ravi Condamoor
February 15, 2019 · Big Data, Cloud & DevOps


Business Rationale for a Data Simulator

While much has been said and written about the business value of unifying data silos for insights, Big Data solution providers often struggle to convince their customers to break down these silos. It is not that customers are unwilling to act by finding data in-house or procuring third-party data that can be brought together; rather, they need to be convinced that this extra effort will result in significantly better business outcomes. In a complex corporate environment, the speed of vendor access to data is inversely proportional to the number of legacy systems and proprietary mechanisms in the company's IT operations. Case studies and generic demos help start a conversation, but they rarely close the deal. As a solutions provider trying to get work done, this became a challenging issue, and to address it we started using Data Simulators to break the impasse. A simulator allowed us to generate synthetic data in numerous types, shapes, and values to suit most business cases. It also helped our Proofs of Concept (POCs) focus more on insights and action, and less on ingestion and ETL processes. While ingestion of data is a real problem that needs to be solved, we found that our customers prefer deferring it to the production phase rather than the POC phase. This way the insight horse was always ahead of the ingestion cart! Data simulators are also accelerating our deployments in newer domains where there is little historical data or where data is hard to obtain: wearables, the industrial internet, expensive third-party data, and more.

Having made the business case for a Data Simulator, I present below a deep dive into the architecture and implementation of our simulator, BigSim.

Under The Hood – BigSim

BigSim is designed to provide flexibility and control in generating large data sets through templates and minimal coding. Users need only provide the data specifications in an XML template defining the semantic type, range, volume, velocity, and shape. Since much of the data generation process is an independent task, multiple simulator instances can run on different machines, creating large data sets that can be pushed to a common data store or streamed. These simulated data sets can be used for capacity planning, what-if scenario testing, extrapolating small data sets with a certain amount of randomness so as to mimic real-world data, filling in missing data in incomplete data sets, and more.
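To make this concrete, here is a minimal sketch of what such a data specification template might look like. The element and attribute names below are illustrative assumptions, not BigSim's actual schema:

```xml
<!-- Hypothetical BigSim data specification (names are illustrative).
     Defines semantic types, ranges, volume, velocity, and shape. -->
<dataset name="smart_meter_readings">
  <field name="meterId"   type="uuid"/>
  <field name="timestamp" type="datetime" start="2019-01-01T00:00:00" step="15m"/>
  <field name="kwh"       type="double"   min="0.0" max="12.5" shape="gaussian"/>
  <volume records="1000000"/>
  <velocity recordsPerSecond="500"/>
</dataset>
```

A template like this could be handed to each simulator instance unchanged, letting several machines generate disjoint partitions of the same logical data set in parallel.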

 

Key Features of BigSim

Extensibility and Adaptability

The simulator can easily be extended and adapted to generate custom data patterns using a library of pre-built primitive and user-defined types. The XML snippets below show examples of how this can be done.
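As an illustration of the idea, a user-defined type might be composed from pre-built primitives like this. The tag and type names are assumptions for the sake of example, not BigSim's actual vocabulary:

```xml
<!-- Illustrative only: a user-defined composite type built from
     primitive generators (element names are hypothetical). -->
<typedef name="usAddress">
  <field name="street" type="streetName"/>
  <field name="city"   type="cityName"/>
  <field name="state"  type="enum"  values="MA,NY,CA,TX"/>
  <field name="zip"    type="regex" pattern="[0-9]{5}"/>
</typedef>
```

Once registered, a composite type like `usAddress` can be referenced from any dataset template just like a primitive, which is what makes the library approach extensible.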

Fine Grain Control

A robust simulation platform should support easy control of the volume and velocity of data across multiple usage scenarios. Smart grids, Black Friday sales, high-frequency trading, and the Twitter firehose all generate data of varying types, volumes, and velocities. BigSim provides adequate dials and knobs to handle such needs.

The load distribution template shown below generates data records for an hour, with varying loads distributed across different time slices.
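A sketch of such a template, with hypothetical element names standing in for BigSim's actual schema, might look like:

```xml
<!-- Hypothetical load-distribution template: one hour split into
     time slices, each carrying a weighted share of the total load. -->
<loadProfile duration="1h" totalRecords="360000">
  <slice from="00:00" to="00:15" weight="10"/>
  <slice from="00:15" to="00:30" weight="40"/> <!-- simulated peak -->
  <slice from="00:30" to="00:45" weight="30"/>
  <slice from="00:45" to="01:00" weight="20"/>
</loadProfile>
```

Weighting slices rather than fixing absolute rates makes it easy to reshape the same total volume into a morning spike, a flash-sale burst, or a steady trickle by editing only the weights.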

Support for Data in Motion and Data at Rest

With streaming analytics gaining popularity alongside batch analytics, simulators are expected to generate large volumes of data for both forms. BigSim can push data into a CSV file or into various SQL and NoSQL databases. It can also stream the generated data in real time, or at desired intervals, for consumption by stream-based services.

The snippet below shows the configuration for batch (CSV, Cassandra) and streaming data generation.
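As a rough sketch of what such a sink configuration could look like (the element names, hosts, and topic here are invented for illustration, not BigSim's real schema):

```xml
<!-- Illustrative sink configuration covering both batch targets
     (CSV file, Cassandra table) and a streaming output. -->
<sinks>
  <csv       path="/data/out/readings.csv" mode="batch"/>
  <cassandra hosts="10.0.0.1,10.0.0.2" keyspace="sim"
             table="readings" mode="batch"/>
  <stream    transport="kafka" topic="sim.readings" intervalMs="100"/>
</sinks>
```

Keeping the sinks declarative means the same generated data set can be replayed into a batch store for historical analysis and into a stream for real-time pipeline testing without changing the data specification itself.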

Conclusion

For a long time now, simulators have played a vital role in the engineering domain, with offerings such as wind tunnels, flight simulators, and load and stress testers. These have, without a doubt, brought innovative and safer products to market faster. Our experience has shown that rolling out data-driven products and services, targeting both enterprises and consumers, can likewise be accelerated through a robust data simulator. Big Data projects no longer have to be stymied by insufficient data, inaccessible data, missing data, or incorrect data.
