Synthetic Data: Useful, Privacy-Risk-Free Data

Hugo Ponte
November 23, 2020 AI & Machine Learning

Computer vision models need to be trained on vast data sets, and synthetic data—images generated using the same CGI software as big budget movies and games—can train that AI without compromising anyone’s personal information.

The people have spoken. They want stricter privacy guarantees when it comes to the collection, use, and dissemination of their personal details.

Traditionally, the problem has been that compiling useful data sets requires intruding on people’s personal information, while guaranteeing privacy means either smaller or lower-quality data sets, or data sets stripped of information to the point that they are no longer useful.

How can we increase both data utility and privacy?

At the risk of simplifying a complex problem: synthetic data is the solution.

Let’s take a step back. First, why do we need more data? To train AI[1]. AI makes up our today. And our tomorrow: we are already leveraging AI towards a future of self-driving cars, robot surgeons and virtual assistants. Machine learning and deep learning, as subsets of AI, make up the new programming paradigm, where engineers ask how a computer can automatically learn and make its own performance rules just by looking at data. With machine learning, humans input data as well as the answers expected from the data, and the computer figures out the rules that connect them (the resulting model is the AI, so to speak). This model can then be applied to new data to produce original answers. Bottom line: in general, the more data a model can train on, the better the model will perform.
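To make the "data plus answers in, rules out" idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not how production models are trained: the temperature readings are made-up example data, and `fit_line` and `predict` are hypothetical helper names. The program is never told the conversion formula; it recovers it from the examples.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Inputs (degrees Celsius) and the expected answers (degrees Fahrenheit).
# These values are illustrative; no conversion formula is supplied.
celsius = [-10.0, 0.0, 8.0, 15.0, 22.0, 38.0]
fahrenheit = [14.0, 32.0, 46.4, 59.0, 71.6, 100.4]

# "Training": the rule is estimated from the data and the answers.
slope, intercept = fit_line(celsius, fahrenheit)


def predict(c):
    """Apply the learned rule to an input the model has not seen."""
    return slope * c + intercept


print(f"learned rule: F = {slope:.2f} * C + {intercept:.2f}")
print(f"prediction for 100 C: {predict(100.0):.1f} F")
```

Real computer vision models learn far more complicated rules from far more data, but the division of labor is the same: people supply examples and answers, and the fitting procedure supplies the rule.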

So, to push technological development, we need more data. But not just any data: we need quality data. A model will only be as learned as the data on which it is trained.[2]

Which leads us to our second question: where do we get the data now? Today, the norm is to use real data sets. Walk down any street in San Francisco and you’re guaranteed to see at least one car outfitted with sensors and cameras, gathering data to train its autonomous vehicle brethren. Also par for the course is data scraped off the internet. That old picture you uploaded to that website you built in third grade? Publicly available, so yeah, that’s fair game.

There are many problems with using real data to train AI. Besides the more technical problems (e.g., the necessary labelling/annotating of data is a tedious and imprecise manual exercise that falls short of the detail and richness we need to meet the increasingly complex tasks we demand from our AI), using real data to train models is rife with privacy risks (especially now with the rise of comprehensive privacy regimes like the GDPR in Europe and the CCPA in California). To counteract these risks, real data must undergo a de-identification process, which, as mentioned above, reduces the utility of the data set.

De-identification, sometimes referred to as anonymization, strips a data set of personal identifiers. The extent of what data is anonymized, and how, is important: if the data elements used to identify an individual are removed from a data set (i.e., the data is anonymized), the remaining data becomes nonpersonal information, and privacy and data protection laws generally do not apply. But the data set is now less rich and has less information on which an AI can train.
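As a rough sketch of what that stripping looks like in practice, the Python below drops direct identifiers and coarsens two other fields. Everything here is an assumption for illustration: the records are invented, and the field names, the ten-year age bands, and the three-digit ZIP prefix are arbitrary choices, not a legal standard for de-identification.

```python
# Fields treated as direct identifiers in this toy example (an assumption).
DIRECT_IDENTIFIERS = {"name", "email"}


def generalize_age(age):
    """Replace an exact age with a ten-year band, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def deidentify(record):
    """Suppress direct identifiers and generalize age and ZIP code."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["age"] = generalize_age(cleaned["age"])
    cleaned["zip"] = cleaned["zip"][:3] + "**"
    return cleaned


# Invented records, for illustration only.
records = [
    {"name": "Alice Example", "email": "alice@example.com",
     "age": 34, "zip": "94110", "diagnosis": "asthma"},
    {"name": "Bob Example", "email": "bob@example.com",
     "age": 58, "zip": "94703", "diagnosis": "diabetes"},
]

deidentified = [deidentify(r) for r in records]
for row in deidentified:
    print(row)
```

The trade-off is visible in the output: the rows no longer name anyone, but a model training on them can no longer tell a 30-year-old from a 39-year-old, or one neighborhood from the next.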

Further, while there is a regulatory distinction between de-identified/anonymized information and pseudonymized data (the legal term for data whose de-identification can be reversed to re-identify individuals), the truth of the matter is that all anonymized data is subject to reversal. The only real bar is the state of technology at a given point in time. Anonymized data today becomes pseudonymized data tomorrow as AI becomes better at re-identifying data points. In the future, algorithms will likely be capable of linking seemingly innocuous data points to construct very intimate profiles of us.
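Linking "seemingly innocuous data points" does not even require sophisticated AI. The sketch below, in plain Python, joins a de-identified table to a second, public-looking table on the fields they share. All records are invented, and the choice of ZIP code, birth year, and sex as the linking fields is an assumption made for the example; `link` is a hypothetical helper, not a real library function.

```python
# A "de-identified" data set: names removed, other attributes kept.
deidentified = [
    {"zip": "94110", "birth_year": 1986, "sex": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1986, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94703", "birth_year": 1992, "sex": "F", "diagnosis": "migraine"},
]

# A separate, invented public list that still carries names.
public = [
    {"name": "Alice Example", "zip": "94110", "birth_year": 1986, "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1986, "sex": "M"},
    {"name": "Carol Example", "zip": "94703", "birth_year": 1992, "sex": "F"},
    {"name": "Dana Example", "zip": "94703", "birth_year": 1992, "sex": "F"},
]

# Fields present in both tables (assumed for this example).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(row):
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(deid_rows, aux_rows):
    """Return {row index: name} for rows matching exactly one person."""
    names_by_key = {}
    for person in aux_rows:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = {}
    for index, row in enumerate(deid_rows):
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # a unique match re-identifies the row
            matches[index] = candidates[0]
    return matches


reidentified = link(deidentified, public)
for index, name in reidentified.items():
    print(name, "->", deidentified[index]["diagnosis"])
```

In this toy data, the first two rows are re-identified because only one person in the public list matches each of them. The third row stays ambiguous only because two people happen to share its attributes, which is the kind of protection that erodes as more auxiliary data becomes available.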

And thus our third question: where can we get data that is useful and not inevitably subject to re-identification? Enter synthetic data.

Synthetic data is useful: it is computer generated and thus inherently boasts pixel-perfect labels and annotations, and has the potential to cover all edge cases, utilizing ML techniques to augment real distributions.
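Why are the labels "pixel-perfect"? Because the same program that draws the image also knows exactly where it drew each object. The following is a deliberately tiny sketch of that idea in plain Python, not a stand-in for a real CGI pipeline: the "object" is a bright rectangle on a dark, noisy grid, and `render_sample`, the image size, and the brightness ranges are all assumptions made for illustration.

```python
import random


def render_sample(width, height, rng):
    """Render one synthetic grayscale image with its mask and bounding box."""
    # Randomize the object's size and position for each sample.
    obj_w = rng.randint(2, width // 2)
    obj_h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - obj_w)
    y0 = rng.randint(0, height - obj_h)

    # Dark background with a little noise.
    image = [[rng.randint(0, 50) for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Draw the object and write its label in the same loop, so the
    # annotation cannot drift from what was actually rendered.
    for y in range(y0, y0 + obj_h):
        for x in range(x0, x0 + obj_w):
            image[y][x] = rng.randint(200, 255)
            mask[y][x] = 1

    bbox = (x0, y0, x0 + obj_w - 1, y0 + obj_h - 1)  # inclusive corners
    return image, mask, bbox


rng = random.Random(0)  # fixed seed so the data set is reproducible
dataset = [render_sample(16, 16, rng) for _ in range(100)]

image, mask, bbox = dataset[0]
print("bounding box of first sample:", bbox)
print("labeled pixels in first sample:", sum(map(sum, mask)))
```

No human annotator is involved, and no real person appears in any sample. Covering an edge case is a matter of changing the sampling ranges rather than waiting for it to show up on the street.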

Synthetic data also erases privacy concerns. We can snooze the consequences of using real data by trying to strip it (generalizing and suppressing it) to the point where, today, we can no longer identify the discrete real data points within the set. But this is a temporary band-aid. Synthetic data is fake data: it contains no personal identifiers that could be susceptible to re-identification down the road. Synthetic data guarantees privacy by changing the paradigm and getting rid of any need to use real data.

So yes, a generalization of a complex problem, but synthetic data may be how we strike the balance between privacy and utility.

With synthetic data, we can have our cake and eat it: more precise, accurate, and complex AI (which necessitates detailed data), and guaranteed privacy.


[1]   In the words of Francois Chollet, AI & deep learning researcher and developer of Keras: “A concise definition of the field of [AI] would be as follows: the effort to automate intellectual tasks normally performed by humans. AI is a general field that encompasses machine learning and deep learning…” See Chollet, F. Deep Learning with Python. Manning Publications (2017).

[2]   Moreover, models are notoriously ‘stupid’: they will find the path of least resistance and follow that until taught differently. AI models are creatures of statistics – they will output the statistics of the data set they were trained on. Check out this nifty convolutional neural network (CNN) visualizer and see for yourself: https://www.cs.ryerson.ca/~aharley/vis/conv/. Because of this reality, anyone training AI needs to be very careful and cognizant of the limits and biases inherent in each data set.

