{"id":22460,"date":"2020-11-23T09:34:19","date_gmt":"2020-11-23T09:34:19","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/synthetic-data-useful-privacy-risk-free-data\/"},"modified":"2023-10-04T17:21:28","modified_gmt":"2023-10-04T17:21:28","slug":"synthetic-data-useful-privacy-risk-free-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/synthetic-data-useful-privacy-risk-free-data\/","title":{"rendered":"Synthetic Data: Useful, Privacy-Risk-Free Data"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22460\" class=\"elementor elementor-22460\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-73c1ec63 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"73c1ec63\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a8bd722\" data-id=\"a8bd722\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5c33d8d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"5c33d8d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-744bec0\" data-id=\"744bec0\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div 
class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a1d8ce7 elementor-widget elementor-widget-text-editor\" data-id=\"a1d8ce7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Computer vision models need to be trained on vast data sets, and synthetic data\u2014images generated using the same CGI software as big budget movies and games\u2014can <a href=\"https:\/\/www.experfy.com\/blog\/ai-ml\/training-ai-with-cgi\/\" target=\"_blank\" rel=\"noreferrer noopener\">train that AI<\/a>\u00a0<em>without\u00a0<\/em>compromising anyone\u2019s personal information.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1cb708c elementor-widget elementor-widget-text-editor\" data-id=\"1cb708c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The people have spoken. 
They want stricter privacy guarantees when it comes to the collection, use, and dissemination of their personal details.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5325b3c elementor-widget elementor-widget-text-editor\" data-id=\"5325b3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Traditionally, the problem has been that compiling useful data sets requires intruding on people\u2019s personal information, while guaranteeing privacy means either smaller or lower-quality data sets, or data sets stripped of information to the point that they are no longer useful.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1e6197b elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1e6197b\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-b23b956\" data-id=\"b23b956\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-de29d0b elementor-widget elementor-widget-heading\" data-id=\"de29d0b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">How can we increase both data 
privacy and data utility?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-f9cad52 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"f9cad52\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a5a786b\" data-id=\"a5a786b\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-893e22b elementor-widget elementor-widget-text-editor\" data-id=\"893e22b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>At the risk of oversimplifying a complex problem: synthetic data is the solution.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-80cd6bc elementor-widget elementor-widget-text-editor\" data-id=\"80cd6bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Let\u2019s take a step back. First,\u00a0<strong>why do we need more data?<\/strong>\u00a0To train AI<a href=\"https:\/\/zumolabs.ai\/2020\/04\/24\/synthetic-data-useful-privacy-risk-free-data\/#_ftn1\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">[1]<\/a>. AI makes up our today. And our tomorrow: we are already leveraging AI towards a future of self-driving cars, robot surgeons, and virtual assistants. 
Machine learning and deep learning, as subsets of AI, make up the new programming paradigm, where engineers ask how a computer can automatically learn and make its own performance rules just by looking at data. With machine learning, humans input data along with the answers expected from that data, and the computer figures out the rules (this is the AI, so to speak). This model can then be applied to new data to produce original answers. Bottom line: generally, the more data a model can train on, the better the model will perform.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e71da27 elementor-widget elementor-widget-text-editor\" data-id=\"e71da27\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So, to push technological development, we need more data. But not just any data \u2013 we need\u00a0<em>quality<\/em>\u00a0data. A model is only as good as the data on which it is trained.<a href=\"https:\/\/zumolabs.ai\/2020\/04\/24\/synthetic-data-useful-privacy-risk-free-data\/#_ftn2\" class=\"broken_link\" rel=\"noopener\">[2]<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6988925 elementor-widget elementor-widget-text-editor\" data-id=\"6988925\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Which leads us to our second question:\u00a0<strong>where do we get the data now?<\/strong>\u00a0Today, the norm is to use real data sets. Walk down any street in San Francisco and you\u2019re all but guaranteed to see at least one car outfitted with sensors and cameras, gathering data to train its autonomous vehicle brethren. Also par for the course is data scraped off the internet. 
That old picture you uploaded to that website you built in third grade? Publicly available, so yeah, that\u2019s fair game.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d39209 elementor-widget elementor-widget-text-editor\" data-id=\"3d39209\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>There are many problems with using real data to train AI. Besides the more technical problems (e.g., the necessary labelling\/annotating of data is a tedious and imprecise manual exercise that falls short of the detail and richness we need to meet the increasingly complex tasks we demand from our AI), using real data to train models is rife with privacy risks (especially now with the rise of comprehensive privacy regimes like the GDPR in Europe and the CCPA in California). To counteract these risks, real data must undergo a de-identification process, which, as mentioned above, reduces the utility of the data set.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0fd3296 elementor-widget elementor-widget-text-editor\" data-id=\"0fd3296\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>De-identification, sometimes referred to as anonymization, strips a data set of personal identifiers. What data is anonymized, and how, is important: if the data elements used to identify an individual are removed from a data set (i.e., the set is anonymized), the remaining data becomes non-personal information, and privacy and data protection laws generally do not apply. 
But the data set is now less rich and has less information on which an AI can train.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b612067 elementor-widget elementor-widget-text-editor\" data-id=\"b612067\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Further, while there is a regulatory distinction between de-identified\/anonymized information and pseudonymized data (the legal term for data whose de-identification can be reversed to re-identify individuals), the truth of the matter is that all anonymized data is subject to reversal. The only real bar is the state of technology at a given point in time. Anonymized data today becomes pseudonymized data tomorrow as AI becomes better at re-identifying data points. In the future, algorithms will likely be capable of linking seemingly innocuous data points to construct very intimate profiles of us.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9bf792e elementor-widget elementor-widget-text-editor\" data-id=\"9bf792e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>And thus our third question:\u00a0<strong>where can we get data that is useful and not inevitably subject to re-identification?<\/strong>\u00a0Enter synthetic data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-153d0e6 elementor-widget elementor-widget-text-editor\" data-id=\"153d0e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Synthetic data is useful: it is computer-generated and thus inherently boasts pixel-perfect labels and annotations, and has the 
potential to cover all edge cases, utilizing ML techniques to augment real distributions.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c31fbf elementor-widget elementor-widget-text-editor\" data-id=\"7c31fbf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Synthetic data also erases privacy concerns. We can snooze the consequences of using real data by trying to strip it (generalize and suppress it) to the point where, today, we can no longer identify the discrete real data points within the set. But this is a temporary band-aid. Synthetic data is fake data: it contains no personal identifiers that could be re-identified down the road. Synthetic data guarantees privacy by changing the paradigm and getting rid of any need to use real data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dc9fe10 elementor-widget elementor-widget-text-editor\" data-id=\"dc9fe10\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>So yes, a generalization of a complex problem, but synthetic data may be how we strike the balance between privacy and utility.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b168f83 elementor-widget elementor-widget-text-editor\" data-id=\"b168f83\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>With synthetic data, we can have our cake and eat it too: more precise, accurate, and complex AI (which necessitates detailed data), and guaranteed privacy.<\/p>\n<hr class=\"wp-block-separator\" \/>\n<p><a 
href=\"https:\/\/zumolabs.ai\/2020\/04\/24\/synthetic-data-useful-privacy-risk-free-data\/#_ftnref1\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">[1]<\/a>\u00a0\u00a0 In the words of Francois Chollet, AI &amp; deep learning researcher and developer of Keras: \u201c<em>A concise definition of the field of [AI] would be as follows: the effort to automate intellectual tasks normally performed by humans. AI is a general field that encompasses machine learning and deep learning\u2026<\/em>\u201d See Chollet, F. Deep Learning with Python. Manning Publications (2017).\n\n<p><a href=\"https:\/\/zumolabs.ai\/2020\/04\/24\/synthetic-data-useful-privacy-risk-free-data\/#_ftnref2\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">[2]<\/a>\u00a0\u00a0 Moreover, models are notoriously \u2018stupid\u2019: they will find the path of least resistance and follow that until taught differently. AI models are creatures of statistics \u2013 they will output the statistics of the data set they were trained on. Check out this nifty convolutional neural network (CNN) visualizer and see for yourself:\u00a0<a href=\"https:\/\/www.cs.ryerson.ca\/~aharley\/vis\/conv\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.cs.ryerson.ca\/~aharley\/vis\/conv\/<\/a>. Because of this reality, anyone training AI needs to be very careful and cognizant of the limits and biases inherent in each data set.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Computer vision models need to be trained on vast data sets, and synthetic data\u2014images generated using the same CGI software as big budget movies and games\u2014can train that AI without compromising anyone\u2019s personal 
information.<\/p>\n","protected":false},"author":949,"featured_media":17996,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[226,487,1046,853],"ppma_author":[3796],"class_list":["post-22460","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-ai","tag-computer-vision","tag-data-sets","tag-synthetic-data"],"authors":[{"term_id":3796,"user_id":949,"is_guest":0,"slug":"hugo-ponte","display_name":"Hugo Ponte","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/10\/Hugo-Ponte-150x150.jpg","user_url":"http:\/\/zumolabs.ai","last_name":"Ponte","first_name":"Hugo","job_title":"","description":"Hugo Ponte is Co-Founder at Zumo Labs that generates custom synthetic data sets for more robust and reliable computer vision models."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22460","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/949"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22460"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22460\/revisions"}],"predecessor-version":[{"id":33247,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22460\/revisions\/33247"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/17996"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22460"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=
22460"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22460"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22460"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}