{"id":2310,"date":"2020-03-11T02:15:43","date_gmt":"2020-03-11T02:15:43","guid":{"rendered":"http:\/\/kusuaks7\/?p=1915"},"modified":"2024-01-01T16:11:04","modified_gmt":"2024-01-01T16:11:04","slug":"data-privacy-in-the-age-of-big-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/data-privacy-in-the-age-of-big-data\/","title":{"rendered":"Data Privacy in the Age of Big Data"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2310\" class=\"elementor elementor-2310\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1f398bcc elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1f398bcc\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-64b5ce5f\" data-id=\"64b5ce5f\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-6e2b594 elementor-widget elementor-widget-heading\" data-id=\"6e2b594\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 style=\"color: #aaa;font-style: italic\">Learn how little privacy you have and how differential privacy aims to help.<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f89be4 elementor-widget elementor-widget-text-editor\" data-id=\"0f89be4\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<section>\n<blockquote>\n<p id=\"4c2a\" data-selectable-paragraph=\"\">\u201cArguing that you don\u2019t care about the right to privacy because you have nothing to hide is no different than saying you don\u2019t care about free speech because you have nothing to say.\u201d \u2015\u00a0<strong>Edward Snowden<\/strong><\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-904be03 elementor-widget elementor-widget-text-editor\" data-id=\"904be03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"31b1\" data-selectable-paragraph=\"\">In November 2017, the running app Strava released a data visualization map showing every activity ever uploaded to their system. This amounted to over 3 trillion GPS points, which came from devices ranging from smartphones to fitness trackers such as Fitbits and smartwatches. 
One feature of the app is that you can see popular routes in major cities, or find individuals in remote areas with unusual exercise patterns.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-37fb775 elementor-widget elementor-widget-image\" data-id=\"37fb775\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1459\/1*YUP-3Rjk5MLkWH0pKV9Suw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0d459d5 elementor-widget elementor-widget-text-editor\" data-id=\"0d459d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\"><span style=\"background-color: rgba(0, 0, 0, 0.05);\">The Strava global heatmap released in November 2017.\u00a0<\/span><a style=\"background-color: rgba(0, 0, 0, 0.05);\" href=\"https:\/\/blog.strava.com\/zi\/press\/strava-community-creates-ultimate-map-of-athlete-playgrounds\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d13fd64 elementor-widget elementor-widget-text-editor\" data-id=\"d13fd64\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c34d\" data-selectable-paragraph=\"\">The idea of sharing your exercise activities with others may seem fairly innocuous, but this map exposed the locations of military bases and personnel on active service. 
One such location was a secret (not anymore) military base in the Helmand province of Afghanistan.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2246c38 elementor-widget elementor-widget-image\" data-id=\"2246c38\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1240\/1*xHiKcJ69cTwGgsXeeuqikQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4ceffa9 elementor-widget elementor-widget-text-editor\" data-id=\"4ceffa9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">A military base in Helmand Province, Afghanistan, with routes taken by joggers highlighted by Strava. Photograph: Strava Heatmap<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1942060 elementor-widget elementor-widget-text-editor\" data-id=\"1942060\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"aed9\" data-selectable-paragraph=\"\">This was not the only base exposed; in fact, in locations such as Syria and Djibouti, the users were almost exclusively U.S. military personnel. The Strava data visualization was thus, to some extent, a highly detailed map of U.S. 
military personnel stationed worldwide.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-467b682 elementor-widget elementor-widget-text-editor\" data-id=\"467b682\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6757\" data-selectable-paragraph=\"\">In this age of ever-increasing data, it is becoming ever more difficult to maintain any semblance of privacy. In exchange for convenience in our lives, we now provide a constant stream of information to companies, which they use to improve their business operations.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-15cf909 elementor-widget elementor-widget-text-editor\" data-id=\"15cf909\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fa32\" data-selectable-paragraph=\"\">Whilst this has many benefits, such as an iPhone knowing my location and being able to tell me about nearby restaurants or the quickest route to a destination, it can also be used in adverse ways which can result in substantial privacy violations for the individual. 
This problem will only escalate as we move towards an increasingly data-driven society.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-eec8977 elementor-widget elementor-widget-image\" data-id=\"eec8977\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/960\/0*BvVrYl2HJoZBo0-Y.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f58caa0 elementor-widget elementor-widget-text-editor\" data-id=\"f58caa0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Image courtesy of\u00a0<a href=\"https:\/\/www.forbes.com\/sites\/tomcoughlin\/2018\/11\/27\/175-zettabytes-by-2025\/#332366895459\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">Forbes<\/a>.<\/p>\n<p id=\"c949\" data-selectable-paragraph=\"\">In this article, I will talk about some of the biggest blunders in terms of privacy leaks from publicly released datasets, the different types of attacks that can be made on these datasets to re-identify individuals, as well as introduce the current best defense we have for maintaining our privacy in a data-driven society:\u00a0<em>differential privacy<\/em>.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed960e5 elementor-widget elementor-widget-heading\" data-id=\"ed960e5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"181d\" 
data-selectable-paragraph=\"\">What is Data Privacy?<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3e634a4 elementor-widget elementor-widget-text-editor\" data-id=\"3e634a4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fc39\" data-selectable-paragraph=\"\">In 1977,\u00a0<a href=\"https:\/\/archives.vrdc.cornell.edu\/info7470\/2011\/Readings\/dalenius-1977.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Tore Dalenius<\/a>\u00a0articulated a desideratum for statistical databases:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4770bd2 elementor-widget elementor-widget-text-editor\" data-id=\"4770bd2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"c167\" data-selectable-paragraph=\"\">Nothing about an individual should be learnable from the database that cannot be learned without access to the database. 
\u2014\u00a0<strong><em>Tore Dalenius<\/em><\/strong><\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-091833f elementor-widget elementor-widget-text-editor\" data-id=\"091833f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7b42\" data-selectable-paragraph=\"\">This idea hoped to extend the well-known concept of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Semantic_security\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">semantic security<\/a>\u00a0\u2014 the idea that when a message is encoded into\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Ciphertext\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">ciphertext<\/a>\u00a0using a cryptographic algorithm, the ciphertext contains no information about the underlying message, often referred to as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Plaintext\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">plaintext<\/a>\u00a0\u2014 to databases.<\/p>\n<p id=\"bc4a\" data-selectable-paragraph=\"\">However, for various reasons, it has been proven in a paper by\u00a0<a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/11787006_1\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Cynthia Dwork<\/a>, professor of Computer Science at Harvard University, that this idea put forth by Dalenius is a mathematical impossibility.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ecea6cd elementor-widget elementor-widget-text-editor\" data-id=\"ecea6cd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"59cb\" data-selectable-paragraph=\"\">One of the reasons behind this is that whilst in cryptography we often talk about two 
users communicating on a secure channel whilst protected from bad actors, in the case of databases, it is the users themselves who must be considered the bad actors.<\/p>\n<p id=\"9f3a\" data-selectable-paragraph=\"\">So, we don\u2019t have very many options. One thing we can do is identify aspects of our dataset that are particularly important in identifying a person and deal with these in such a way that the dataset effectively becomes \u201canonymized\u201d.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-547e016 elementor-widget elementor-widget-text-editor\" data-id=\"547e016\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6b13\" data-selectable-paragraph=\"\"><strong>Anonymization\/de-identification<\/strong>\u00a0of a dataset essentially means that it is not possible (at least ostensibly) to identify any given individual in the dataset. 
We use the term\u00a0<strong>reidentification\u00a0<\/strong>to refer to the process of identifying an individual within an anonymized dataset.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ea7416a elementor-widget elementor-widget-heading\" data-id=\"ea7416a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"a388\" data-selectable-paragraph=\"\">The Pyramid of Identifiability<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60217fb elementor-widget elementor-widget-text-editor\" data-id=\"60217fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f0d5\" data-selectable-paragraph=\"\">Personal data exists on a spectrum of identifiability. Think of a pyramid, where at the top of the pyramid we have data that can directly identify an individual: a name, phone number, or social security number.<\/p>\n<p id=\"bc8f\" data-selectable-paragraph=\"\">These forms of data are collectively referred to as \u2018<strong><em>direct identifiers<\/em>.<\/strong>\u2019<\/p>\n<p id=\"d545\" data-selectable-paragraph=\"\">Below the direct identifiers on the pyramid is data that can be indirectly, yet unambiguously, linked to an individual. Only a small amount of data is needed, such as gender, date of birth, and zip code \u2014 which in combination can uniquely identify 87% of the U.S. 
population.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-03e5991 elementor-widget elementor-widget-text-editor\" data-id=\"03e5991\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"35f8\" data-selectable-paragraph=\"\">These data are collectively called \u2018<strong><em>indirect identifiers<\/em><\/strong>\u2019 or \u2018<strong><em>quasi-identifiers.<\/em><\/strong>\u2019<\/p>\n<p id=\"31c5\" data-selectable-paragraph=\"\">Below the quasi-identifiers is data that can be ambiguously connected to multiple people \u2014 physical measurements, restaurant preferences, or an individual\u2019s favorite movies.<\/p>\n<p id=\"b738\" data-selectable-paragraph=\"\">The fourth level of our pyramid is data that cannot be linked to any specific person \u2014 aggregated census data, or broad survey results.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5a85c45 elementor-widget elementor-widget-text-editor\" data-id=\"5a85c45\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2744\" data-selectable-paragraph=\"\">Lastly, at the bottom of the pyramid, there is data that is not directly related to individuals at all: weather reports and geographic data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e9073c elementor-widget elementor-widget-image\" data-id=\"6e9073c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/994\/1*SHz_nsvqdjEcoPbh0xowBQ.png\" alt=\"\" 
\/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58cb688 elementor-widget elementor-widget-text-editor\" data-id=\"58cb688\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">The levels of identifiability of data.\u00a0<a href=\"https:\/\/georgetownlawtechreview.org\/re-identification-of-anonymized-data\/GLTR-04-2017\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\n<p id=\"857c\" data-selectable-paragraph=\"\">In this way, information that is more difficult to relate to an individual is placed lower on the pyramid of identifiability. However, as data becomes more and more scrubbed of personal information, its usefulness for research and analytics directly decreases. As a result, privacy and utility are on opposite ends of this spectrum \u2014 maximum usefulness from the data at the top of the pyramid and maximum privacy at the bottom of the pyramid. 
<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-43aa425 elementor-widget elementor-widget-image\" data-id=\"43aa425\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/850\/0*hN4YS-JOPg57r84v.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-ef67123 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"ef67123\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-65b38d2\" data-id=\"65b38d2\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-53c9a56 elementor-widget elementor-widget-text-editor\" data-id=\"53c9a56\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">The trade-off between utility and privacy.\u00a0<a href=\"https:\/\/www.researchgate.net\/publication\/318866074_Practical_Implications_of_Sharing_Data_A_Primer_on_Data_Privacy_Anonymization_and_De-Identification\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">Source<\/a><\/p>\n<p 
id=\"e37c\" data-selectable-paragraph=\"\">As a data scientist, this seems to be an unsatisfying answer. We would always like our data to be as accurate as possible in order that we might make the most accurate inferences. However, from the other viewpoint, the less we can discern about an individual, the better it is in terms of their privacy.<\/p>\n<p id=\"0357\" data-selectable-paragraph=\"\">So how do we balance this trade-off? Are we doomed to either lose our accuracy or our privacy? Are we playing a zero-sum game? We will come to find that the answer is, actually, yes. However, there are now ways of mathematically determining the privacy level of a dataset such that we obtain the maximal amount of information whilst providing a suitable level of privacy for individuals present in the data. More on this later.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-185e5fb elementor-widget elementor-widget-heading\" data-id=\"185e5fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\">\n<h1 id=\"ca61\" data-selectable-paragraph=\"\"><strong>Scrubbing Techniques for Anonymizing Data<\/strong><\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe236d6 elementor-widget elementor-widget-text-editor\" data-id=\"fe236d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"05b6\" data-selectable-paragraph=\"\">There are four common techniques that can be used for deidentifying a dataset:<\/p>\n<p id=\"fbe5\" data-selectable-paragraph=\"\"><strong>[1] Deletion or Redaction<\/strong><\/p>\n<p id=\"7937\" data-selectable-paragraph=\"\">This technique is most commonly used for direct 
identifiers that you do not want to release, such as phone numbers, social security numbers, or home addresses. This can often be done automatically as it directly corresponds with primary keys in a database. Simply put, deletion means that if we had an Excel spreadsheet, we would delete the columns corresponding to direct identifiers.<\/p>\n<p id=\"2a69\" data-selectable-paragraph=\"\">In the table below, the first column, \u201cName,\u201d can be removed without compromising the usefulness of the data for future research.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-55d3b96 elementor-widget elementor-widget-image\" data-id=\"55d3b96\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1218\/1*ouZaUy39loOALZP3Ux14Uw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8be2e56 elementor-widget elementor-widget-text-editor\" data-id=\"8be2e56\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4723\" data-selectable-paragraph=\"\">This technique is not foolproof. Direct identifiers are often not clearly marked, and important information may be mistaken for personal information and deleted accidentally.<\/p>\n<p id=\"91e9\" data-selectable-paragraph=\"\"><strong>[2] Pseudonymization<\/strong><\/p>\n<p id=\"0a68\" data-selectable-paragraph=\"\">The second approach entails merely changing the \u2018Name\u2019 category to a unique but anonymous value, such as the hashed value of a column, or a user ID. These can be randomly generated or determined by an algorithm. However, this should be done with caution. 
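To make the idea concrete, here is a minimal sketch of both flavors of pseudonymization. The records, column names, and hashing scheme are invented for illustration; the point is that a deterministic hash over a small identifier space can be brute-forced by anyone who guesses the scheme.

```python
import hashlib
import secrets

records = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 29}]

# Option 1: random pseudonyms. The mapping table must itself be kept
# secret (or discarded) for the pseudonymization to hold.
id_map = {r["name"]: secrets.token_hex(8) for r in records}
randomized = [{"id": id_map[r["name"]], "age": r["age"]} for r in records]

# Option 2: a deterministic hash of the identifier.
def pseudonymize(name: str) -> str:
    return hashlib.sha256(name.encode()).hexdigest()

hashed = [{"id": pseudonymize(r["name"]), "age": r["age"]} for r in records]

# The weakness: when the identifier space is small (names on a class
# list, taxi medallion numbers), an attacker can hash every candidate
# identifier and simply invert the mapping.
candidates = ["Alice", "Bob", "Carol"]
lookup = {pseudonymize(c): c for c in candidates}
reidentified = [lookup[r["id"]] for r in hashed]
print(reidentified)  # ['Alice', 'Bob']
```

Note that the random variant is only as strong as the secrecy of `id_map`; the deterministic variant is only as strong as the attacker's ignorance of the scheme.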
If you have a list of students and you release their grades using an anonymous ID, it is probably a good idea not to do it in alphabetical order, as that makes it fairly easy to reidentify people!<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-639e61c elementor-widget elementor-widget-text-editor\" data-id=\"639e61c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bf1a\" data-selectable-paragraph=\"\">Similarly, if a deterministic algorithm is used to perform the pseudonymization, and the nature of the algorithm used is uncovered, it then compromises the anonymity of the individuals.<\/p>\n<p id=\"6ba4\" data-selectable-paragraph=\"\">For example, in 2014 the New York City Taxi and Limousine Commission released a dataset of all taxi trips taken in New York City that year. Before releasing the data, the Commission attempted to scrub it of identifying information; specifically, they pseudonymized the taxicab medallion numbers and driver\u2019s license numbers. 
Bloggers were, however, able to discover the algorithm used to alter the medallion numbers and then reverse the pseudonymization.<\/p>\n<p id=\"cedc\" data-selectable-paragraph=\"\">This approach also shares the same weaknesses as the first approach \u2014 direct identifiers can be difficult to identify and replace, and indirect identifiers may be inadvertently left in the dataset.<\/p>\n<p id=\"8efc\" data-selectable-paragraph=\"\">Pseudonyms also cease to be effective if the same unique pseudonym is continually used throughout a dataset, in multiple datasets, or for a long period.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-42f9355 elementor-widget elementor-widget-text-editor\" data-id=\"42f9355\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"54e8\" data-selectable-paragraph=\"\"><strong>[3] Statistical Noise<\/strong><\/p>\n<p id=\"677e\" data-selectable-paragraph=\"\">Whilst the first two approaches apply almost exclusively to direct identifiers, the latter two apply almost exclusively to indirect identifiers.<\/p>\n<p id=\"21bd\" data-selectable-paragraph=\"\">We can envisage the third approach, adding statistical noise, as pixelating someone\u2019s face in an image. We essentially allow the data to still exist there, but it is somewhat obscured by random noise. 
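As a toy sketch of what this obscuring looks like in practice (the ages, the noise scale, and the ten-year bucket width are all made up for illustration):

```python
import random

ages = [34, 29, 71, 45]
random.seed(0)

# Perturbation: add zero-mean noise to each value. Individual records
# become unreliable, while aggregate statistics stay roughly intact.
noisy_ages = [round(a + random.gauss(0, 3)) for a in ages]

# Generalization: report each age as a 10-year range instead of a value.
def generalize(age: int) -> str:
    lo = age // 10 * 10
    return f"{lo}-{lo + 9}"

age_ranges = [generalize(a) for a in ages]
print(age_ranges)  # ['30-39', '20-29', '70-79', '40-49']

# Swapping: exchange values between records, so no row is guaranteed
# to describe the person it appears next to.
swapped = ages[:]
random.shuffle(swapped)
```

The trade-off is visible immediately: the coarser the buckets or the larger the noise, the harder reidentification becomes, and the less accurate any analysis of the released data will be.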
Depending on the way this is done, this can be a very effective technique.<\/p>\n<p id=\"164b\" data-selectable-paragraph=\"\">Some ways statistical noise is introduced into datasets include:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-98a4377 elementor-widget elementor-widget-text-editor\" data-id=\"98a4377\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"1831\" data-selectable-paragraph=\"\"><strong>Generalization:<\/strong>\u00a0Specific values can be reported as a range. For instance, a patient\u2019s age can be reported as 70\u201380 instead of giving a full birthdate.<\/li>\n \t<li id=\"6bc7\" data-selectable-paragraph=\"\"><strong>Perturbation:<\/strong>\u00a0Specific values can be adjusted for all patients in a dataset. For example, adding or subtracting the same number of days from each patient\u2019s admission date, or adding noise drawn from a normal distribution.<\/li>\n \t<li id=\"b4cb\" data-selectable-paragraph=\"\"><strong>Swapping:\u00a0<\/strong>Data can be exchanged between individual records within a dataset.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b497d4d elementor-widget elementor-widget-text-editor\" data-id=\"b497d4d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"66bb\" data-selectable-paragraph=\"\">As you may have suspected, the more direct or indirect identifiers that are removed and\/or obscured with statistical noise, the lower the accuracy of our data becomes.<\/p>\n<p id=\"2f60\" data-selectable-paragraph=\"\"><strong>[4] Aggregation<\/strong><\/p>\n<p id=\"9508\" data-selectable-paragraph=\"\">The fourth technique is similar 
to the idea of generalization discussed in the statistical noise section. Instead of releasing raw data, the dataset is aggregated and only a summary statistic or subset is released.<\/p>\n<p id=\"01c6\" data-selectable-paragraph=\"\">For example, a dataset might only provide the total number of patients treated, rather than each patient\u2019s individual record. However, if only a small subsample is released, the probability of reidentification increases \u2014 such as a subset that only contains a single individual.<\/p>\n<p id=\"ce8d\" data-selectable-paragraph=\"\">In an aggregated dataset, an individual\u2019s direct or indirect identifiers are withheld from publication. However, the summary data must be based on a broad enough range of data to not lead to the identification of a specific individual. For instance, suppose that in a released subset only one female patient visited the hospital: she would be far easier to re-identify than if the data included thirty women who had spent time at the hospital.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-aec1210 elementor-widget elementor-widget-heading\" data-id=\"aec1210\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"8789\" data-selectable-paragraph=\"\">Privacy Leaks<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c79e45b elementor-widget elementor-widget-text-editor\" data-id=\"c79e45b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9d60\" data-selectable-paragraph=\"\">There are so many privacy leaks I could bring up that it is genuinely concerning, so I have cherry-picked only a few of these stories to 
illustrate certain important points.<\/p>\n<p id=\"fa24\" data-selectable-paragraph=\"\">It is important to note that I am not talking about\u00a0<em>data leaks<\/em>\u00a0here: where some bad actor hacks into a company or government database and steals confidential information about customers, although this is also incredibly common and becoming an increasing concern with the advent of the Internet of Things (IoT).<\/p>\n<p id=\"26a7\" data-selectable-paragraph=\"\">We are talking explicitly about data that is publicly available\u2014 i.e. you could (at least at the time) go and download it \u2014 or business data that was then used to identify individuals. This also extends to business intelligence used to reveal personal information about a person.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-db93f96 elementor-widget elementor-widget-heading\" data-id=\"db93f96\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"4e56\" data-selectable-paragraph=\"\">Netflix Prize<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a788833 elementor-widget elementor-widget-image\" data-id=\"a788833\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1840\/0*-u9NcEvm6Qq1BL8Q.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-75c1cd8 elementor-widget elementor-widget-text-editor\" data-id=\"75c1cd8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7e94\" data-selectable-paragraph=\"\">The $1 million Netflix Prize was a competition started by Netflix to improve the company\u2019s movie recommendation system. For the competition, the company released a large anonymized database, which contenders were to use as input data for the recommendation engine.<\/p>\n<p id=\"4058\" data-selectable-paragraph=\"\">Two researchers, graduate student Arvind Narayanan and Professor Vitaly Shmatikov of the University of Texas at Austin, were able to reidentify individuals in the dataset published by Netflix.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-43aa19b elementor-widget elementor-widget-text-editor\" data-id=\"43aa19b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"59c9\" data-selectable-paragraph=\"\">\u201cReleasing the data and just removing the names does nothing for privacy\u2026 if you know their name and a few records, then you can identify that person in the other (private) database.\u201d\u00a0<em>\u2014\u00a0<\/em><strong><em>Vitaly Shmatikov<\/em><\/strong><\/p>\n<\/blockquote>\n<p id=\"dea4\" data-selectable-paragraph=\"\">Netflix did not include names in their dataset, and instead used an anonymous identifier for each user. 
Combining a user\u2019s collection of movie ratings with a public database of ratings turned out to be enough to identify people.<\/p>\n<p id=\"8286\" data-selectable-paragraph=\"\">Narayanan and Shmatikov demonstrated the danger by using public reviews published by a \u201cfew dozen\u201d people in the Internet Movie Database (IMDb) to identify the movie ratings of two of the users in Netflix\u2019s data in a\u00a0<a href=\"http:\/\/arxiv.org\/abs\/cs\/0610105\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">paper<\/a>\u00a0they published soon after Netflix released the data.<\/p>\n<p id=\"5266\" data-selectable-paragraph=\"\">Revealing movie ratings that a reviewer thought were private could expose significant details about the person. For example, the researchers found that one of the people had strong \u2014 ostensibly private \u2014 opinions about some liberal and gay-themed films and also had ratings for some religious films.<\/p>\n<p id=\"a13f\" data-selectable-paragraph=\"\">More generally, the research demonstrated that information that a person believes to be benign could be used to identify them in other private databases. 
In privacy and intelligence circles, the result has been understood for decades, but this case brought the subject to the attention of the mass media.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0b1d567 elementor-widget elementor-widget-heading\" data-id=\"0b1d567\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"aed7\" data-selectable-paragraph=\"\"><strong>America Online<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec604d4 elementor-widget elementor-widget-text-editor\" data-id=\"ec604d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0204\" data-selectable-paragraph=\"\">In August 2006, AOL Research released a file containing 20 million search keywords from over 658,000 users over a 3-month period. The dataset was intended for research purposes but was removed three days later after gaining significant notoriety. By then, however, it was too late: the data had already been mirrored and distributed on the internet. The leak culminated in the resignation of AOL\u2019s CTO, Maureen Govern.<\/p>\n<p id=\"d5ed\" data-selectable-paragraph=\"\">What went wrong? 
The data was thought to be anonymous but was found to reveal sensitive details of the searchers&#8217; private lives, including social security numbers, credit card numbers, addresses, and, in one case, apparently a searcher\u2019s\u00a0<a href=\"http:\/\/plentyoffish.wordpress.com\/2006\/08\/07\/aol-search-data-shows-users-planning-to-commit-murder\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">intent to kill their wife<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-afac780 elementor-widget elementor-widget-image\" data-id=\"afac780\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/646\/0*NmFk59MfmovrmTt9.gif\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-097db4f elementor-widget elementor-widget-text-editor\" data-id=\"097db4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Some miscellaneous queries extracted from the AOL dataset.\u00a0<a href=\"https:\/\/www.somethingawful.com\/weekend-web\/aol-search-log-2\/3\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\n<p id=\"31c8\" data-selectable-paragraph=\"\">Whilst AOL did not specifically identify users,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Personally_identifiable_information\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">personally identifiable information<\/a>\u00a0was nonetheless present in many of the queries. Have you ever tried to Google yourself? That is essentially what resulted in privacy leaks here. 
Some people were even naive enough to type their social security numbers and addresses into the search box.<\/p>\n<p id=\"0d9a\" data-selectable-paragraph=\"\">As the queries were attributed by AOL to particular numerically identified user accounts, an individual could be identified and matched to their account and search history using such information.\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/The_New_York_Times\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>The New York Times<\/em><\/a>\u00a0was able to locate an individual from the released and anonymized search records by cross-referencing them with phonebook listings.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e5be35d elementor-widget elementor-widget-heading\" data-id=\"e5be35d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"671a\" data-selectable-paragraph=\"\"><strong>Target Teenage Pregnancy Leak<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-004631d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"004631d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fdee3ce\" data-id=\"fdee3ce\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-c53b4ea elementor-widget 
elementor-widget-text-editor\" data-id=\"c53b4ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c3a8\" data-selectable-paragraph=\"\">In an\u00a0<a href=\"https:\/\/www.nytimes.com\/2012\/02\/19\/magazine\/shopping-habits.html?pagewanted=1&amp;_r=1&amp;hp\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">article released by New York Times writer Charles Duhigg<\/a>\u00a0in 2012, it was disclosed that one of Target\u2019s statisticians, Andrew Pole, was asked to develop a product prediction model to figure out if a customer was pregnant based on the purchases she made in-store.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b816a52 elementor-widget elementor-widget-image\" data-id=\"b816a52\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/480\/0*i1_PiDUWwe82MeYz.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b7a659c elementor-widget elementor-widget-text-editor\" data-id=\"b7a659c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">News story by WCPO 9 about Target\u2019s pregnancy prediction model.\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=XH1wQEgROg4\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\n<p id=\"7951\" data-selectable-paragraph=\"\">The system would analyze buying habits and use this to discern the likelihood that a customer was pregnant, and would then mail coupons and 
advertisements for pregnancy-related items. This seems fairly innocuous, and even a positive thing to many people. However, one shopper who started to receive these coupons was a teenage girl whose outraged father called the store to complain about the advertisements. Target knew, but the father did not, that the teenage girl was pregnant.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1d38ec1 elementor-widget elementor-widget-text-editor\" data-id=\"1d38ec1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"02ce\" data-selectable-paragraph=\"\">The data supporting Target\u2019s classification of \u201cpregnant\u201d was unrelated to the teenager\u2019s identity. Instead, Target based its conclusion on the teenager\u2019s purchases from a group of twenty-five products correlated with pregnant shoppers, such as unscented lotion and vitamin supplements.<\/p>\n<p id=\"58ca\" data-selectable-paragraph=\"\">Although such data standing alone might not reveal the shopper\u2019s identity, Target\u2019s big data prediction system derived a \u201ccreepy\u201d and likely unwelcome inference from her shopping patterns.<\/p>\n<p id=\"a7a1\" data-selectable-paragraph=\"\">This is essentially an example of a classification task in machine learning, and you would be surprised how easy it can be to use seemingly unrelated information to predict personal features such as age, gender, race, political affiliation, etc. In fact, this is broadly what was done by Cambridge Analytica during the 2016 U.S. 
presidential election, using data provided by Facebook.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0d3a03 elementor-widget elementor-widget-heading\" data-id=\"b0d3a03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"9195\" data-selectable-paragraph=\"\">Latanya Sweeney<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f25562b elementor-widget elementor-widget-text-editor\" data-id=\"f25562b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"65eb\" data-selectable-paragraph=\"\">In 1996, an MIT Ph.D. student named Latanya Sweeney combined anonymized medical data publicly released by the Group Insurance Commission (GIC) with public voter records (which she purchased for $20) and was able to reidentify Massachusetts governor William Weld in the dataset.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-635bca4 elementor-widget elementor-widget-image\" data-id=\"635bca4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/444\/0*lfX7hnkhgdek6Ozg.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0c2cb0f elementor-widget elementor-widget-text-editor\" data-id=\"0c2cb0f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p 
style=\"text-align: center;\" data-selectable-paragraph=\"\">Information used by Latanya Sweeney to reidentify governor William Weld.<\/p>\n<p id=\"c7fa\" data-selectable-paragraph=\"\">She then used the newly acquired information to send a letter to his home address, explaining what she had done and that the way the data had been anonymized was clearly inadequate.<\/p>\n<p id=\"5822\" data-selectable-paragraph=\"\">The results of her reidentification experiment had a significant impact on privacy-centered policymaking, including the health privacy legislation HIPAA.<\/p>\n<p id=\"0dfe\" data-selectable-paragraph=\"\">A full article describing the events in detail can be found\u00a0here.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-36af9ea elementor-widget elementor-widget-text-editor\" data-id=\"36af9ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"b75a\" data-selectable-paragraph=\"\">Since then, she has published one of the most important papers on the topic of data privacy,\u00a0<a href=\"http:\/\/dataprivacylab.org\/projects\/identifiability\/paper1.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">\u201cSimple Demographics Often Identify People Uniquely (Data Privacy Working Paper 3) Pittsburgh 2000\u201d<\/a>.<\/p>\n<p id=\"0e39\" data-selectable-paragraph=\"\">This paper concludes that 87% of individuals in the U.S. 
can be uniquely identified from just three pieces of information: gender, date of birth, and zip code.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-352e2e1 elementor-widget elementor-widget-image\" data-id=\"352e2e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/643\/0*TUXwM3Sm3SSHcYji.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1845aa5 elementor-widget elementor-widget-text-editor\" data-id=\"1845aa5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">A figure from Latanya Sweeney\u2019s paper showing the percentage of the population identifiable via different quasi-identifiers.<\/p>\n<p id=\"833e\" data-selectable-paragraph=\"\">Latanya is now a professor and runs the\u00a0<a href=\"https:\/\/dataprivacylab.org\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Data Privacy Lab<\/a>\u00a0at Harvard University. 
Some of the techniques for anonymizing datasets are outlined later in this article, such as \u2018<a href=\"https:\/\/en.wikipedia.org\/wiki\/K-anonymity\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>k-anonymity<\/strong><\/a>\u2019.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0604290 elementor-widget elementor-widget-heading\" data-id=\"0604290\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"82b7\" data-selectable-paragraph=\"\">Golden State Killer<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fee807c elementor-widget elementor-widget-text-editor\" data-id=\"fee807c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"d959\" data-selectable-paragraph=\"\">This story is probably the most interesting and has the most far-reaching implications. The infamous Golden State Killer, a murderer and serial rapist active in Sacramento County, California, between 1978 and 1986, was finally caught in April 2018 at age 72 when a relative of his uploaded genetic information from a personal genomics test by 23andMe to a public online database.<\/p>\n<p id=\"7f6d\" data-selectable-paragraph=\"\">Investigators uploaded the criminal\u2019s DNA to the database in the hope that they would find a partial match. To their amazement, they did. The investigators began exploring family trees to match DNA collected from one of the crime scenes. 
From there, they investigated individuals within those trees in order to narrow down a suspect.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef71e82 elementor-widget elementor-widget-text-editor\" data-id=\"ef71e82\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0006\" data-selectable-paragraph=\"\">Eventually, they found an individual in the right age group who had lived in the areas where the Golden State Killer had been active. They gathered the DNA of Joseph James DeAngelo from items he had thrown away, which were then analyzed in a lab and found to be a direct match to the killer.<\/p>\n<p id=\"8cb7\" data-selectable-paragraph=\"\">Whilst catching a serial killer is clearly a positive thing, this use of genetic data raised significant ethical questions about how such information should be used:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bdf86d5 elementor-widget elementor-widget-text-editor\" data-id=\"bdf86d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"92b0\" data-selectable-paragraph=\"\"><strong>[1]\u00a0<\/strong>The most obvious of these is that your genetic information directly identifies you \u2014 you cannot use mathematical techniques to \u2018anonymize\u2019 it, and if you did, it would no longer be representative of you in any way.<\/p>\n<p id=\"3f76\" data-selectable-paragraph=\"\"><strong>[2]\u00a0<\/strong>The second and perhaps more haunting idea is that a fairly distant relative of yours can violate your own privacy. 
If your cousin is found to have a genetic predisposition to ovarian cancer and that information is in an online database, it can be linked directly to you by an insurance company to raise your premium. This destroys the whole concept of informed consent, which is present in most data collection processes.<\/p>\n<p id=\"6783\" data-selectable-paragraph=\"\">In 2008, the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Genetic_Information_Nondiscrimination_Act\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Genetic Information Nondiscrimination Act (GINA)<\/a>\u00a0was introduced in the U.S., prohibiting certain types of genetic discrimination. Under this act, health insurance providers cannot discriminate on the basis of genetics.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-73e8164 elementor-widget elementor-widget-text-editor\" data-id=\"73e8164\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"715f\" data-selectable-paragraph=\"\">For example, if it was found that an individual had a mutated BRCA2 gene \u2014 a gene commonly associated with an increased risk of developing breast cancer \u2014 a health insurer would be forbidden from using this information to discriminate against the individual in any way.<\/p>\n<p id=\"29c5\" data-selectable-paragraph=\"\">However, the act said nothing about discrimination in life insurance policies, disability insurance policies, or long-term health care policies. 
If you uploaded a personal genomics test to a public online database, you had better believe that your life insurance company knows about it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c41ad6 elementor-widget elementor-widget-heading\" data-id=\"7c41ad6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"c120\" data-selectable-paragraph=\"\">Reidentification Attacks<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8097f91 elementor-widget elementor-widget-text-editor\" data-id=\"8097f91\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"b13a\" data-selectable-paragraph=\"\">In the previous section, we saw that even anonymized datasets can be subject to reidentification attacks. This can cause harm to the individuals in the dataset, as well as to those associated with analyzing or producing the dataset.<\/p>\n<p id=\"29d7\" data-selectable-paragraph=\"\">Of the leaks discussed, several occurred in public datasets that were released for research purposes or as part of a company\u2019s commercial activities.<\/p>\n<p id=\"d80f\" data-selectable-paragraph=\"\">This presents significant issues for both companies and academia. 
The only two real options are:<\/p>\n<p id=\"d1e6\" data-selectable-paragraph=\"\"><strong>(1)<\/strong>\u00a0to forgo or severely restrict the use of public data, which is a non-starter in a global society and would significantly stifle scientific progress, or<\/p>\n<p id=\"dd66\" data-selectable-paragraph=\"\"><strong>(2)<\/strong>\u00a0to develop viable privacy methods that allow public data to be used without substantial privacy risks to the individual participants who are part of it.<\/p>\n<p id=\"1fb0\" data-selectable-paragraph=\"\">However, if we are going to develop privacy methods, we need to be aware of what types of attacks individuals might be able to perform on a database.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04e56df elementor-widget elementor-widget-heading\" data-id=\"04e56df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"f0e0\" data-selectable-paragraph=\"\">Linkage Attack<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e07b8cc elementor-widget elementor-widget-image\" data-id=\"e07b8cc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2600\/0*6E8ywYstmcKojqZb.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b1a5251 elementor-widget elementor-widget-text-editor\" data-id=\"b1a5251\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"68df\" data-selectable-paragraph=\"\">A 
linkage attack attempts to re-identify individuals in an anonymized dataset by combining that data with background information. The \u2018linking\u2019 uses quasi-identifiers, such as zip or postcode, gender, salary, etc., that are present in both datasets to establish identifying connections.<\/p>\n<p id=\"cc41\" data-selectable-paragraph=\"\">Many organizations aren\u2019t aware of the linkage risk involving quasi-identifiers, and while they may mask direct identifiers, they often don\u2019t think of masking or generalizing the quasi-identifiers. This is exactly how Latanya Sweeney was able to find the address of governor William Weld, and it is also what got Netflix into trouble during the Netflix Prize competition!<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-eae96b3 elementor-widget elementor-widget-heading\" data-id=\"eae96b3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"3eff\" data-selectable-paragraph=\"\">Differencing Attack<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-92692b9 elementor-widget elementor-widget-text-editor\" data-id=\"92692b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6314\" data-selectable-paragraph=\"\">In a differencing attack, an attacker isolates an individual value by combining multiple aggregate statistics about a dataset. This essentially attacks the aggregation method that was discussed in the data scrubbing methods section.<\/p>\n<p id=\"0cc0\" data-selectable-paragraph=\"\">A simple example of this would be querying a database about users who have cancer. 
We ask the database how many users have cancer, and we then ask how many users not named John have cancer. Subtracting the two results tells us whether John has cancer.<\/p>\n<p id=\"2c7c\" data-selectable-paragraph=\"\">Here is another example: consider a fictional loyalty-card data product that contains the total amount spent by all customers on a given day and the total amount spent by the subgroup of customers using a loyalty card. If there is exactly one customer who makes a purchase without a loyalty card, some simple arithmetic on the two statistics for that day reveals this customer\u2019s precise total spend, even though only aggregate values were released.<\/p>\n<p id=\"e726\" data-selectable-paragraph=\"\">A more general version of this attack is known as a\u00a0<strong>composition attack.\u00a0<\/strong>This attack involves combining the results of many queries to perform a differencing attack.<\/p>\n<p id=\"980a\" data-selectable-paragraph=\"\">For example, imagine a database that uses statistical noise to perturb its results. If we ask the database the same query 10,000 times, depending on the way it creates the statistical noise, we may be able to average out the noise and obtain a result close to the true value. 
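The loyalty-card differencing arithmetic and the repeated-query noise averaging described above can be sketched in a few lines of Python. All of the data, names, and the noise scale here are hypothetical, purely for illustration:

```python
import random

# Differencing attack sketch: two "safe" aggregates leak the exact
# spend of the one customer who did not use a loyalty card.
purchases = {
    "alice": (120, True),   # (amount spent, used loyalty card)
    "bob":   (75,  True),
    "carol": (43,  False),  # the only non-loyalty-card customer
}
total_all = sum(amount for amount, _ in purchases.values())
total_loyalty = sum(amount for amount, card in purchases.values() if card)
print(total_all - total_loyalty)  # 43 -> Carol's exact spend

# Composition attack sketch: if the interface adds fresh random noise
# to every answer, repeating the same query averages the noise away.
random.seed(0)
TRUE_COUNT = 43

def noisy_query():
    return TRUE_COUNT + random.gauss(0, 5)  # one perturbed answer

estimate = sum(noisy_query() for _ in range(10_000)) / 10_000
print(estimate)  # converges toward the true value of 43
```

The standard error of the averaged estimate shrinks with the square root of the number of queries, which is why query limits or deterministic (repeated) noise are the usual countermeasures.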
One way of tackling this might be to limit the number of queries allowed or to return the same noise for identical queries, but doing so introduces additional issues to the system.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6680323 elementor-widget elementor-widget-heading\" data-id=\"6680323\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"689f\" data-selectable-paragraph=\"\">Homogeneity Attack<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6690528 elementor-widget elementor-widget-text-editor\" data-id=\"6690528\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Example of a homogeneity attack on a de-identified database<\/p>\n<p id=\"5c33\" data-selectable-paragraph=\"\">A homogeneity attack leverages the case where all values of a sensitive attribute within a group are the same. The above table provides an example where the sex and zip code in a database have been redacted and generalized, but we can still get a pretty good idea of the salaries of every individual in zip code 537**. 
Despite the fact that we cannot explicitly identify individuals, we may still be able to find out that one of the entries relates to them.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6db13c4 elementor-widget elementor-widget-heading\" data-id=\"6db13c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"ec0f\" data-selectable-paragraph=\"\">Background Knowledge Attack<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04e19c1 elementor-widget elementor-widget-text-editor\" data-id=\"04e19c1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bc47\" data-selectable-paragraph=\"\">Background knowledge attacks are particularly difficult to defend against. They rely on background knowledge about an individual, which may be instrumental in reidentifying someone in a dataset.<\/p>\n<p id=\"4e8c\" data-selectable-paragraph=\"\">For example, imagine an individual who knows that their neighbor goes to a specific hospital, knows certain attributes about them such as their zip code, age, and sex, and wants to know what medical condition the neighbor may have. They find two entries in the database that correspond to this information, one of which lists cancer and the other heart disease. They have already narrowed down the results in the database considerably, but can they fully identify the individual?<\/p>\n<p id=\"4585\" data-selectable-paragraph=\"\">Now imagine that the neighbor is Japanese, and the individual knows that Japanese people are much less likely to contract heart disease than a typical individual. 
They may conclude with reasonable certainty that their neighbor has cancer.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-09cc8ac elementor-widget elementor-widget-text-editor\" data-id=\"09cc8ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c160\" data-selectable-paragraph=\"\">Background knowledge attacks are often modeled using Bayesian statistics, involving prior and posterior beliefs based on attributes within the dataset, since this is essentially what an attacker is doing when performing this kind of attack. A condition known as Bayes-optimal privacy helps defend against this kind of attack.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7cdcbae elementor-widget elementor-widget-heading\" data-id=\"7cdcbae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"fbc3\" data-selectable-paragraph=\"\">K-Anonymity, L-Diversity, and T-Closeness<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-733f440 elementor-widget elementor-widget-text-editor\" data-id=\"733f440\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e7e2\" data-selectable-paragraph=\"\">In this section, I will introduce three techniques that can be used to reduce the probability that certain attacks can be performed. The simplest of these methods is\u00a0<strong>k-anonymity<\/strong>, followed by\u00a0<strong>l-diversity<\/strong> and then\u00a0<strong>t-closeness<\/strong>. 
Other methods have been proposed to form a sort of alphabet soup, but these are the three most commonly utilized. With each of these, the analysis that must be performed on the dataset becomes increasingly complex and undeniably has implications for the statistical validity of the dataset.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-97b9753 elementor-widget elementor-widget-heading\" data-id=\"97b9753\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"eae2\" data-selectable-paragraph=\"\">K-Anonymity<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04fe444 elementor-widget elementor-widget-text-editor\" data-id=\"04fe444\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e7a8\" data-selectable-paragraph=\"\">As Latanya Sweeney states in her seminal paper:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d1308b7 elementor-widget elementor-widget-text-editor\" data-id=\"d1308b7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"90e9\" data-selectable-paragraph=\"\">A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.<\/p>\n<\/blockquote>\n<p id=\"b19e\" data-selectable-paragraph=\"\">This essentially means that as long as my dataset contains at least k entries for every combination of quasi-identifier values, it is k-anonymous. 
If we take the quasi-identifiers of zip code, gender, and age as done previously, a dataset would be 3-anonymous if every combination of zip code, gender, and age that appears in the dataset has at least 3 entries sharing it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c0b2b1 elementor-widget elementor-widget-image\" data-id=\"5c0b2b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2600\/0*kJGc9nF7BV58K_ZW.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bfd0bf1 elementor-widget elementor-widget-text-editor\" data-id=\"bfd0bf1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Original dataset (left) and 4-anonymous dataset (right).\u00a0Source<\/p>\n<p id=\"9e6f\" data-selectable-paragraph=\"\">This helps to prevent linkage attacks since an attacker cannot make links to another database with a high degree of certainty. However, it is still possible for someone to perform a homogeneity attack, as in the example where all k individuals have the same value. For example, if three individuals who happen to share the same age group, gender, and zip code also all happen to have the same type of cancer, k-anonymity does not protect their privacy. 
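To make the definition concrete, here is a minimal sketch in Python (the records, column names, and values are all hypothetical) that computes the k for which a table is k-anonymous by counting the rows that share each combination of quasi-identifier values:

```python
from collections import Counter

def k_anonymity_level(rows, quasi_identifiers):
    """Return the k for which `rows` is k-anonymous: the size of the
    smallest group of rows sharing one combination of quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical toy records, with zip codes already generalized to a prefix.
records = [
    {"zip": "537**", "gender": "F", "age": "20-30", "salary": 54000},
    {"zip": "537**", "gender": "F", "age": "20-30", "salary": 61000},
    {"zip": "537**", "gender": "F", "age": "20-30", "salary": 58000},
    {"zip": "148**", "gender": "M", "age": "30-40", "salary": 72000},
    {"zip": "148**", "gender": "M", "age": "30-40", "salary": 69000},
    {"zip": "148**", "gender": "M", "age": "30-40", "salary": 75000},
]

print(k_anonymity_level(records, ["zip", "gender", "age"]))  # → 3
```

The smallest group determines k; generalization and suppression are then aimed at raising that minimum.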
Similarly, it does not defend against background knowledge attacks.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d70e73b elementor-widget elementor-widget-image\" data-id=\"d70e73b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/659\/0*tM-p9117hHPDWgpI.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de9e55c elementor-widget elementor-widget-text-editor\" data-id=\"de9e55c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">An example of homogeneity and background knowledge attacks on k-anonymous datasets.\u00a0<a href=\"https:\/\/elf11.github.io\/2017\/05\/22\/kanonymity.html\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\n<p id=\"03dd\" data-selectable-paragraph=\"\">What is an acceptable level of k-anonymity? There is no clear-cut definition, but several papers have argued that a level of k=5 or k=10 is preferable. 
Most individuals in the academic realm seem to agree that k=2 is insufficient and k=3 is the bare minimum needed to preserve privacy (although it does not necessarily guarantee it).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f26026f elementor-widget elementor-widget-text-editor\" data-id=\"f26026f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"4d89\" data-selectable-paragraph=\"\">As you have probably realized, the higher the level of k you choose, the lower the utility of the data, since we must perform generalization (reducing the number of unique values in a column), blurring (obscuring certain data features or combining them), and suppression (deletion of row tuples that cannot satisfy k=3 after other de-identification methods are applied). Instead of suppression, synthetic rows can be added that take samples from either the marginal distribution of each column or the joint distribution of all columns.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e254e99 elementor-widget elementor-widget-text-editor\" data-id=\"e254e99\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"edfa\" data-selectable-paragraph=\"\">Clearly, all of these mechanisms will substantially skew or bias the statistics of a dataset, and the trade-offs of these techniques are still a subject of academic research. 
This is discussed for HarvardX data in the publication \u201c<a href=\"https:\/\/ieeexplore.ieee.org\/document\/7552278\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\"><em>Statistical Tradeoffs between Generalization and Suppression in the De-identification of Large-Scale Data Sets<\/em><\/a>\u201d by Olivia Angiuli and Jim Waldo.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d5e8018 elementor-widget elementor-widget-text-editor\" data-id=\"d5e8018\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a2bd\" data-selectable-paragraph=\"\">Making a dataset k-anonymous can result in huge proportions of the data being lost (80% of the data can be removed via suppression alone) or added (such as adding 3 rows for every row currently in a dataset to make it 4-anonymous) when the number of quasi-identifiers becomes large, which occurs in most large public datasets.<\/p>\n<p id=\"b94c\" data-selectable-paragraph=\"\">Having tried to make a k-anonymous dataset myself before I can tell you that it is by no means easy to achieve anonymity and still have data that is not useless.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7d7bc69 elementor-widget elementor-widget-heading\" data-id=\"7d7bc69\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"6b3b\" data-selectable-paragraph=\"\">L-Diversity<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a0628a1 elementor-widget elementor-widget-text-editor\" data-id=\"a0628a1\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0abf\" data-selectable-paragraph=\"\">Several people have noted the possibility of attacks that can be performed on k-anonymous datasets, so privacy researchers have taken it a step further and proposed l-diversity. The authors define l-diversity to be:<\/p>\n\n<blockquote>\n<p id=\"5d42\" data-selectable-paragraph=\"\">\u2026the requirement that the values of the sensitive attributes are well-represented in each group.<\/p>\n<\/blockquote>\n<p id=\"164d\" data-selectable-paragraph=\"\">They expand on this in mathematical detail to essentially mean that any attribute that is considered \u2018sensitive\u2019, such as what medical conditions an individual has, or whether a student passed or failed a class, takes on atleast L distinct values within each subset k.<\/p>\n<p id=\"819f\" data-selectable-paragraph=\"\">In simpler terms, this means that if we take a block of four individual\u2019s data from a university class that are found to have the same quasi-identifiers (such as same zip code, gender, and age), then there must be at least L distinct values within that group \u2014 we cannot have all of the individuals in the group with just a passing grade.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6859576 elementor-widget elementor-widget-text-editor\" data-id=\"6859576\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"fc47\" data-selectable-paragraph=\"\">This helps to ensure that individuals cannot be uniquely identified from a homogeneity attack. 
However, it may still violate someone\u2019s privacy if all of the values of the sensitive attribute are unfavorable \u2014 such as all having different but low grades in a \u2018grade\u2019 column, or all having different types of cancer in a \u2018medical condition\u2019 column.<\/p>\n<p id=\"2d29\" data-selectable-paragraph=\"\">This does not make things perfect, but it is a step beyond k-anonymity. However, once again, it raises additional questions about the statistical validity of the resulting dataset, since it involves suppressing or adding rows, which alters the distribution of the data and acts as a form of self-sampling bias.<\/p>\n<p id=\"ccec\" data-selectable-paragraph=\"\">For example, in EdX data, it was found that students who completed classes provided much more information about themselves than casual observers of classes, so most of them could be uniquely identified. Thus, when k-anonymity and l-diversity were applied to the dataset, most of the people who had completed the classes were removed! 
Clearly, this is not an ideal circumstance, and there are still open questions about how this should be handled to minimize the introduction of bias to datasets in this manner.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9c5220a elementor-widget elementor-widget-heading\" data-id=\"9c5220a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"f994\" data-selectable-paragraph=\"\">T-Closeness<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-59dddd5 elementor-widget elementor-widget-text-editor\" data-id=\"59dddd5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7151\" data-selectable-paragraph=\"\">As yet another extension of k-anonymity and l-diversity, privacy researchers proposed t-closeness. They describe it as:<\/p>\n\n<blockquote>\n<p id=\"5707\" data-selectable-paragraph=\"\">We propose a novel privacy notion called t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We choose to use the Earth Mover Distance measure for our t-closeness requirement.<\/p>\n<\/blockquote>\n<p id=\"f7f5\" data-selectable-paragraph=\"\">In the quotation above, an equivalence class means the group of k individuals in a k-anonymous subset. The idea is essentially to ensure that not only are the values in an equivalence class l-diverse, but that the distribution of those L distinct values should be as close to the overall data distribution as possible. 
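For an attribute with ordered values, the Earth Mover's Distance the authors use reduces to a normalized sum of cumulative differences between the two distributions. A minimal sketch (the distributions, grade bins, and threshold t are all hypothetical):

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two discrete distributions over the
    same *ordered* support, normalized to [0, 1] as in the t-closeness paper:
    the sum of absolute running (cumulative) differences, divided by m - 1."""
    carried, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi   # probability mass that must still be moved
        total += abs(carried)
    return total / (len(p) - 1)

# Hypothetical sensitive-attribute distributions over five ordered grade
# bins (F, D, C, B, A), expressed as fractions of records.
overall = [0.1, 0.2, 0.4, 0.2, 0.1]   # whole table
eq_class = [0.0, 0.0, 0.5, 0.3, 0.2]  # one equivalence class

t = 0.25  # hypothetical threshold
print(round(emd_ordered(overall, eq_class), 3))       # → 0.175
print(emd_ordered(overall, eq_class) <= t)            # class satisfies t-closeness
```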
This would aim to help remove some of the bias that was introduced into the EdX dataset with regards to removing people who had successfully completed courses.<\/p>\n<p id=\"ae19\" data-selectable-paragraph=\"\">Here are reference scientific papers for these and other algorithms aimed at protecting data privacy:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e13655 elementor-widget elementor-widget-text-editor\" data-id=\"6e13655\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1235\" data-selectable-paragraph=\"\"><a href=\"https:\/\/dataprivacylab.org\/projects\/kanonymity\/index.html\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>k-Anonymity: a model for protecting privacy<\/strong><\/a><\/p>\n<p id=\"268d\" data-selectable-paragraph=\"\"><a href=\"https:\/\/personal.utdallas.edu\/~mxk055100\/courses\/privacy08f_files\/ldiversity.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>\u2113-Diversity: Privacy Beyond k-Anonymity<\/strong><\/a><\/p>\n<p id=\"1abd\" data-selectable-paragraph=\"\"><a href=\"https:\/\/www.cs.purdue.edu\/homes\/ninghui\/papers\/t_closeness_icde07.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>t-Closeness: Privacy Beyond k-Anonymity and -Diversity<\/strong><\/a><\/p>\n<p id=\"877e\" data-selectable-paragraph=\"\"><a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.100.3713&amp;rep=rep1&amp;type=pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets<\/strong><\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f5fad98 elementor-widget elementor-widget-heading\" data-id=\"f5fad98\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"d9c4\" data-selectable-paragraph=\"\">What Regulations Exist?<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d4c8ac9 elementor-widget elementor-widget-text-editor\" data-id=\"d4c8ac9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ebae\" data-selectable-paragraph=\"\">There are a number of privacy regulations that exist and they can vary substantially, it would be a bit arduous to write and explain them all, so I have just chosen as an example to compare the HIPAA and FERPA regulations in the United States.<\/p>\n<p id=\"e6be\" data-selectable-paragraph=\"\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Health_Insurance_Portability_and_Accountability_Act\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>Health Insurance Portability and Accountability Act (HIPAA)<\/strong><\/a><\/p>\n<p id=\"d047\" data-selectable-paragraph=\"\">This regulation covers all medical data in the U.S. and initially (when released in 1996) specified that all personally identifiable information (PII) must be removed from a dataset before it is released. From our discussion early, this corresponds to direct identifiers \u2014 those pieces of information that directly identify who you are, such as name, address, and phone number. 
Thus HIPAA could be satisfied by merely suppressing (deleting\/removing) these features from the publicly released data.<\/p>\n<p id=\"ce38\" data-selectable-paragraph=\"\">Since then, the regulations have been updated and now require a dataset to be de-identified in compliance with the HIPAA Privacy Rule, which states that either:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ad4863f elementor-widget elementor-widget-text-editor\" data-id=\"ad4863f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<ol>\n \t<li id=\"8c75\" data-selectable-paragraph=\"\">The 18 specific identifiers listed below are removed (Safe Harbor Method)<\/li>\n \t<li id=\"b237\" data-selectable-paragraph=\"\">An experienced statistical expert validates and documents that the statistical risk of re-identification is very small (Statistical Method)<\/li>\n<\/ol>\n<p id=\"eb2e\" data-selectable-paragraph=\"\">The 18 attributes that come under the purview of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Protected_health_information\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">protected health information<\/a>\u00a0are:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5510d21 elementor-widget elementor-widget-text-editor\" data-id=\"5510d21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ol>\n \t<li id=\"99d0\" data-selectable-paragraph=\"\">Names<\/li>\n \t<li id=\"6f4e\" data-selectable-paragraph=\"\">All geographical identifiers smaller than a state, except for the initial three digits of a zip code if, according to the current publicly available data from the U.S. 
Bureau of the Census, the geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people.<\/li>\n \t<li id=\"60be\" data-selectable-paragraph=\"\">Dates (other than year) directly related to an individual<\/li>\n \t<li id=\"6a13\" data-selectable-paragraph=\"\">Phone numbers<\/li>\n \t<li id=\"478d\" data-selectable-paragraph=\"\">Fax numbers<\/li>\n \t<li id=\"29e4\" data-selectable-paragraph=\"\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Email\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Email<\/a>\u00a0addresses<\/li>\n \t<li id=\"ab23\" data-selectable-paragraph=\"\">Social security numbers<\/li>\n \t<li id=\"4c5f\" data-selectable-paragraph=\"\">Medical record numbers<\/li>\n \t<li id=\"088d\" data-selectable-paragraph=\"\">Health insurance beneficiary numbers<\/li>\n \t<li id=\"85d0\" data-selectable-paragraph=\"\">Account numbers<\/li>\n \t<li id=\"361d\" data-selectable-paragraph=\"\">Certificate\/license numbers<\/li>\n \t<li id=\"fe7b\" data-selectable-paragraph=\"\">Vehicle identifiers and serial numbers, including license plate numbers<\/li>\n \t<li id=\"7754\" data-selectable-paragraph=\"\">Device identifiers and serial numbers<\/li>\n \t<li id=\"91da\" data-selectable-paragraph=\"\">Web\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Uniform_Resource_Locator\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Uniform Resource Locators<\/a>\u00a0(URLs)<\/li>\n \t<li id=\"4575\" data-selectable-paragraph=\"\">Internet Protocol (IP) address numbers<\/li>\n \t<li id=\"0362\" data-selectable-paragraph=\"\">Biometric identifiers, including finger, retinal and voiceprints<\/li>\n \t<li id=\"4edf\" data-selectable-paragraph=\"\">Full face photographic images and any comparable images<\/li>\n \t<li id=\"e7a4\" data-selectable-paragraph=\"\">Any other unique identifying number, characteristic, or code except the unique code assigned by the investigator to code the 
data<\/li>\n<\/ol>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-24cf4fb elementor-widget elementor-widget-text-editor\" data-id=\"24cf4fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"953d\" data-selectable-paragraph=\"\">Note that HIPAA data does not explicitly have to be k-anonymous.<\/p>\n<p id=\"5b6a\" data-selectable-paragraph=\"\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/Family_Educational_Rights_and_Privacy_Act\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>Family Educational Rights and Privacy Act (FERPA)<\/strong><\/a><\/p>\n<p id=\"6967\" data-selectable-paragraph=\"\">This regulation covers all education data in the U.S., including K-12 and university information. This regulation specifies that not only must all PII be removed, but there must also be no possibility of anyone being identified to a high degree of certainty. This means we must satisfy a more stringent requirement than HIPAA. 
If we are looking at university medical information, the more stringent regulation is the one that must be followed, so in this case, FERPA would be dominant instead of HIPAA.<\/p>\n<p id=\"c229\" data-selectable-paragraph=\"\">The requirements of FERPA can be satisfied by using\u00a0<strong>k-anonymity.<\/strong><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5827005 elementor-widget elementor-widget-heading\" data-id=\"5827005\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"674b\" data-selectable-paragraph=\"\">Differential Privacy<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-184c14c elementor-widget elementor-widget-text-editor\" data-id=\"184c14c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c91c\" data-selectable-paragraph=\"\">The whole point of this article has been to build you up to differential privacy and how it aims to revolutionize the notion of privacy in a data-driven world. Differential privacy, in various forms, has been adopted by Google, Apple, Uber, and even the US Census Bureau.<\/p>\n<p id=\"b6aa\" data-selectable-paragraph=\"\">We have seen that the methods of k-anonymity, l-diversity, and t-closeness are by no means perfect and cannot guarantee future-proof privacy. We want to enable statistical analysis of datasets \u2014 such as inference about populations, machine learning training, useful descriptive statistics \u2014 whilst still protecting individual-level data against all attack strategies and auxiliary information that is available about the individual. 
With the previous methods, someone could come along with a new technology or algorithm in 10 years\u2019 time and re-identify the entire dataset \u2014 there are no formal guarantees of privacy.<\/p>\n<p id=\"f9d7\" data-selectable-paragraph=\"\">This is what differential privacy offers us: a<strong>\u00a0mathematical guarantee of privacy that is measurable and future-proof<\/strong>. With differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on the data of any one individual.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-54fc5c2 elementor-widget elementor-widget-text-editor\" data-id=\"54fc5c2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2d3a\" data-selectable-paragraph=\"\">Cryptography alone does not solve this problem for datasets, because the potential adversaries are the dataset users themselves. Thus, privacy researchers have developed a mathematical framework that assumes the data analyst is an adversary and aims to minimize the possibility that sensitive information is disclosed to the analyst, even when the analyst asks multiple sequential queries of the dataset.<\/p>\n<p id=\"20ad\" data-selectable-paragraph=\"\">The revelation came when privacy researchers stopped trying to ensure that privacy was a property of the data output, and instead began to think of it as a property of the data analysis itself. 
This led to the formulation of differential privacy, which offers a form of \u2018privacy by design\u2019 instead of tacking privacy on at the end as an afterthought.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-958203b elementor-widget elementor-widget-image\" data-id=\"958203b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1722\/1*8IopPAx12xCHgWj-fMedwA.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-35d1bb0 elementor-widget elementor-widget-text-editor\" data-id=\"35d1bb0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">A curator mediates the interfacing of the data analyst (adversary) with the raw data source (our database system).<\/p>\n<p id=\"5330\" data-selectable-paragraph=\"\">Thus our requirement is that an adversary shouldn\u2019t be able to tell if any single individual\u2019s data were changed arbitrarily. 
In simpler terms, if I remove the second entry in the dataset, the adversary would not be able to tell the difference between the two datasets.<\/p>\n<p id=\"a6a2\" data-selectable-paragraph=\"\">We measure the difference between (1) the dataset with individual X, and (2) the dataset without individual X, using a variable known as \u03f5.<\/p>\n<p id=\"cf64\" data-selectable-paragraph=\"\">How does this work?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bd4bdaa elementor-widget elementor-widget-image\" data-id=\"bd4bdaa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1816\/1*wAnefgX5Ay_0Qoj6sZfA-w.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0f20ae0 elementor-widget elementor-widget-text-editor\" data-id=\"0f20ae0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"644e\" data-selectable-paragraph=\"\">Let&#8217;s imagine that our data analyst asks the curator what fraction of the people in the dataset are HIV positive and have a blood type that is type B. 
A differentially private system would respond to this question after adding a known level of random noise to the answer.\u00a0<strong>The algorithm is transparent<\/strong>, so the data analyst is allowed to know exactly what distribution this random noise is sampled from, and it makes no difference to the algorithmic privacy.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c718e95 elementor-widget elementor-widget-text-editor\" data-id=\"c718e95\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"39ca\" data-selectable-paragraph=\"\">The amount of noise that must be added so that the data analyst cannot tell whether one individual has been removed from the dataset is of the order of 1\/n, where n is the number of people in the dataset. This should make sense: adding, removing, or changing one individual shifts a fractional query by at most about 1\/n, so noise on that scale leaves the analyst unable to tell whether a difference in the answer comes from an individual\u2019s data or simply from the noise itself.<\/p>\n<p id=\"663a\" data-selectable-paragraph=\"\">As the number of individuals in the dataset increases, the amount of noise that must be added to protect the privacy of a single individual gets progressively smaller.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-855fbe9 elementor-widget elementor-widget-image\" data-id=\"855fbe9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1946\/1*InTeBbUKPOIfK7c7K1klFQ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-17362da elementor-widget elementor-widget-text-editor\" 
data-id=\"17362da\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6b3a\" data-selectable-paragraph=\"\">The noise added is typically drawn from a Laplace distribution, and the level of privacy can be controlled with our \u2018privacy parameter\u2019, which we call \u03f5. We can think of this value as bounding how much the system\u2019s answers can differ between two datasets that differ in only one way: an individual X is present in one of the datasets and not in the other.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9bdd3ef elementor-widget elementor-widget-image\" data-id=\"9bdd3ef\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1347\/0*3WiQd3CgnbxZeBNJ.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f229e52 elementor-widget elementor-widget-text-editor\" data-id=\"f229e52\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">The intuitive definition of the privacy parameter.<\/p>\n<p id=\"d092\" data-selectable-paragraph=\"\">When the value of \u03f5 is very small, we have greater privacy \u2014 we are effectively adding a larger amount of noise to the dataset to mask the presence of specific individuals. When the value of \u03f5 is very large, we have weaker privacy. 
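<\/p>\n<p data-selectable-paragraph=\"\">The Laplace mechanism just described can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function name and the numeric values are my own:<\/p>

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return a differentially private answer: truth plus Laplace noise.

    The noise scale is sensitivity / epsilon, so a smaller epsilon
    (stronger privacy) means more noise is added.
    """
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A fraction query over n people shifts by roughly 1/n when one person
# is added or removed, so its sensitivity is on the order of 1/n.
n = 10_000
true_fraction = 0.13   # hypothetical true answer to the query
epsilon = 0.1          # a typical privacy parameter value
private_answer = laplace_mechanism(true_fraction, sensitivity=1.0 / n, epsilon=epsilon)
```

<p data-selectable-paragraph=\"\">Rerunning the same query produces a different noised answer each time; the rng parameter exists only to make the sketch reproducible.<\/p>\n<p data-selectable-paragraph=\"\">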
Typically, the value of \u03f5 is less than 1, often close to zero, in the range 0.01\u20130.1.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b42c3a8 elementor-widget elementor-widget-text-editor\" data-id=\"b42c3a8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"1a9d\" data-selectable-paragraph=\"\">What has been created here is an algorithm that ensures that whatever an adversary learns about me, it could have learned from everyone else\u2019s data. Thus, inferences such as whether there is a link between smoking and lung cancer would still come up clearly in the data, but information about whether a specific individual smokes or has cancer would be masked.<\/p>\n<p id=\"c81f\" data-selectable-paragraph=\"\">How does this work for datasets that allow multiple queries?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0328408 elementor-widget elementor-widget-text-editor\" data-id=\"0328408\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7037\" data-selectable-paragraph=\"\">This is an important question: if I ask the dataset enough questions, will it eventually confess everyone\u2019s information to me? This is where the concept of the privacy budget comes in.<\/p>\n<p id=\"b8dd\" data-selectable-paragraph=\"\">The \u2018mechanism\u2019 (our data curator, which interacts with the database and the data analyst) is constrained from leaking individual-specific information. Each time you ask a question and it responds to your query, you use up some of your privacy budget. 
You can continue to ask questions about the same group of data until you reach the maximum privacy budget, at which point the mechanism will refuse to answer your questions.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48799df elementor-widget elementor-widget-text-editor\" data-id=\"48799df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0364\" data-selectable-paragraph=\"\">Notice that this does not stop you from ever querying the data again; it means that the mechanism will enforce that any data you access has a fresh amount of noise added to it, to prevent privacy leaks.<\/p>\n<p id=\"e4a9\" data-selectable-paragraph=\"\">The privacy budget works because the privacy parameter composes gracefully \u2014 the privacy values of successive queries can simply be added together.<\/p>\n<p id=\"c379\" data-selectable-paragraph=\"\"><strong>Thus, for k queries, we have a differential privacy of k\u03b5. As long as k\u03b5 &lt; privacy budget, the mechanism will still respond to queries.<\/strong><\/p>\n<p id=\"c344\" data-selectable-paragraph=\"\">Hopefully, by this point you realize the implications of such a mathematically assured privacy guarantee, and understand why it is superior to the notions of k-anonymity, l-diversity, and t-closeness. 
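<\/p>\n<p data-selectable-paragraph=\"\">The privacy-budget bookkeeping described above can be sketched as follows. The class and its names are illustrative, not any real system\u2019s API:<\/p>

```python
import math
import random

class PrivateMechanism:
    """Sketch of a data curator that enforces a privacy budget.

    By basic composition, answering k queries at epsilon each spends
    k * epsilon of the budget; once the budget would be exceeded,
    the mechanism refuses to answer.
    """

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def _laplace_noise(self, scale):
        # Inverse-CDF sampling of a Laplace(0, scale) random variable.
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def answer(self, true_answer, sensitivity, epsilon):
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return true_answer + self._laplace_noise(sensitivity / epsilon)

# Four queries at epsilon = 0.25 exactly exhaust a budget of 1.0;
# a fifth query is refused.
mech = PrivateMechanism(budget=1.0)
answers = [mech.answer(0.5, sensitivity=0.001, epsilon=0.25) for _ in range(4)]
```

<p data-selectable-paragraph=\"\">A real curator would also have to account for queries over overlapping subsets of the data and for more sophisticated composition theorems, which can give tighter bounds than simple addition.<\/p>\n<p data-selectable-paragraph=\"\">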
This has been implemented by companies in different forms, and we will quickly look at the differences between the formulations.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04d6b89 elementor-widget elementor-widget-heading\" data-id=\"04d6b89\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"d7de\" data-selectable-paragraph=\"\">Global Differential Privacy<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a71c65d elementor-widget elementor-widget-text-editor\" data-id=\"a71c65d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a97e\" data-selectable-paragraph=\"\">This is the most intuitive form of differential privacy, the one I alluded to in the examples above: the database is managed by a company or government curator in a centralized (single) location.<\/p>\n<p id=\"170e\" data-selectable-paragraph=\"\">As an example, the US Census Bureau is planning to use global differential privacy for the 2020 US Census. This means the bureau will collect all of the data and place it into a database unadulterated. Following this, researchers and interested parties will be able to query the census database and retrieve information as long as they do not surpass their privacy budget.<\/p>\n<p id=\"07e3\" data-selectable-paragraph=\"\">Some people are concerned by global differential privacy, as the data exists at its source in raw form. 
If a private company did this (<a href=\"https:\/\/arxiv.org\/pdf\/1809.07750.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Uber<\/a>\u00a0is currently the only company I am aware of that uses global differential privacy), then if that data were subpoenaed, the company would have to hand over sensitive information about individuals. Fortunately, US Census Bureau employees take a vow when joining the bureau never to violate the privacy of individuals through census data, and they take this pretty seriously. Additionally, the bureau cannot be subpoenaed by other government agencies, even the FBI or CIA, so your data is in fairly safe hands.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-795f47b elementor-widget elementor-widget-heading\" data-id=\"795f47b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"683b\" data-selectable-paragraph=\"\">Local Differential Privacy<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-54b99f1 elementor-widget elementor-widget-text-editor\" data-id=\"54b99f1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"cd7e\" data-selectable-paragraph=\"\">This form of differential privacy is used by\u00a0<a href=\"https:\/\/static.googleusercontent.com\/media\/research.google.com\/en\/\/pubs\/archive\/42852.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Google in their RAPPOR system<\/a>\u00a0for Google Chrome, as well as\u00a0<a href=\"https:\/\/www.apple.com\/privacy\/docs\/Differential_Privacy_Overview.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Apple iPhones running iOS 
10<\/a>\u00a0and above. The idea is that the noise is added at the source, on the individual\u2019s device, and the data is sent to Apple\u2019s or Google\u2019s database already in its adulterated form. Thus, Google and Apple never have access to the raw, sensitive data, and even if they were subpoenaed and the data were acquired by someone else, your privacy would still not be violated.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-91d9757 elementor-widget elementor-widget-image\" data-id=\"91d9757\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1302\/0*DBRtjaa2UFmxJvxw.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-40330d6 elementor-widget elementor-widget-text-editor\" data-id=\"40330d6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">The difference between local and global differential privacy.\u00a0<a href=\"https:\/\/www.accessnow.org\/understanding-differential-privacy-matters-digital-rights\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Source<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e74212a elementor-widget elementor-widget-heading\" data-id=\"e74212a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"872b\" data-selectable-paragraph=\"\">Final 
Comments<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-415817c elementor-widget elementor-widget-text-editor\" data-id=\"415817c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8dbe\" data-selectable-paragraph=\"\">Congratulations on getting to the end of the article! This has been quite a deep dive into data privacy and I urge you to keep up to date on what is happening in the privacy world \u2014 knowing where and how your data is used and protected by companies and governments is likely to become an important topic in the data-driven societies of the future.<\/p>\n\n<blockquote>\n<p id=\"76d2\" data-selectable-paragraph=\"\">\u201cTo be left alone is the most precious thing one can ask of the modern world.\u201d\n\u2015\u00a0<strong>Anthony Burgess<\/strong><\/p>\n<\/blockquote>\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>This article takes you to a deep dive into data privacy and urges you to keep up to date on what is happening in the privacy world &mdash; knowing where and how your data is used and protected by companies and governments is likely to become an important topic in the data-driven societies of the future. 
Learn how little privacy you have and how differential privacy aims to help.<\/p>\n","protected":false},"author":682,"featured_media":8268,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[95],"ppma_author":[3471],"class_list":["post-2310","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data-amp-technology"],"authors":[{"term_id":3471,"user_id":682,"is_guest":0,"slug":"matthew-stewart","display_name":"Matthew Stewart","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_c57055f3-5301-4262-af65-4cc7d40cbf3d-150x150.jpg","user_url":"https:\/\/criticalfutureglobal.com\/","last_name":"Stewart","first_name":"Matthew","job_title":"","description":"Matthew Stewart is a Machine Learning consultant on AI for\u00a0<a href=\"https:\/\/www.criticalfutureglobal.com\/\" target=\"_blank\" rel=\"noopener\">Critical Future<\/a>, and machine learning engineer at Scalable Magic, an AI-based digital media startup. He is also a Graduate Teaching Assistant and a Ph.D. 
Candidate at Harvard University."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2310","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/682"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2310"}],"version-history":[{"count":7,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2310\/revisions"}],"predecessor-version":[{"id":35275,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2310\/revisions\/35275"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/8268"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2310"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2310"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2310"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2310"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}