{"id":2258,"date":"2020-02-14T02:25:57","date_gmt":"2020-02-14T02:25:57","guid":{"rendered":"http:\/\/kusuaks7\/?p=1863"},"modified":"2024-01-10T16:05:22","modified_gmt":"2024-01-10T16:05:22","slug":"good-pipelines-bad-data","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/good-pipelines-bad-data\/","title":{"rendered":"Good pipelines, bad data"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2258\" class=\"elementor elementor-2258\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1b093ae7 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1b093ae7\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-f7aa54c\" data-id=\"f7aa54c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1f7f7c5 elementor-widget elementor-widget-heading\" data-id=\"1f7f7c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>How to start trusting data in your company.<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-76d0887 elementor-widget elementor-widget-image\" data-id=\"76d0887\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/miro.medium.com\/max\/6016\/0*n3sdHq2D6wsNab08\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3478e02 elementor-widget elementor-widget-text-editor\" data-id=\"3478e02\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@element5digital?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Element5 Digital<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Unsplash<\/a><\/p>\n<p id=\"df95\" data-selectable-paragraph=\"\"><em>It\u2019s 2020, and we\u2019re still using \u201cphotos and a paper trail\u201d to validate data. In the recent\u00a0<\/em><a href=\"https:\/\/www.nytimes.com\/2020\/02\/03\/us\/politics\/iowa-caucuses.html?action=click&amp;module=Spotlight&amp;pgtype=Homepage\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>Iowa election<\/em><\/a><em>, data inconsistencies eroded trust in the results. 
This is just one of the many recent, and prominent, examples of pervasive \u201c<\/em><a href=\"https:\/\/towardsdatascience.com\/the-rise-of-data-downtime-841650cedfd5\" target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\"><em>data downtime<\/em><\/a><em>.\u201d<\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d497983 elementor-widget elementor-widget-text-editor\" data-id=\"d497983\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9869\" data-selectable-paragraph=\"\"><em>Data downtime refers to periods of time when your data is partial, erroneous, missing or otherwise inaccurate \u2014 and almost every data organization I know struggles with it. In fact, this\u00a0<\/em><a href=\"https:\/\/hbr.org\/2020\/02\/data-driven-decisions-start-with-these-4-questions?utm_source=linkedin&amp;utm_campaign=hbr&amp;utm_medium=social\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>HBR article<\/em><\/a><em>\u00a0cites a study that found companies lose an average of $15M per year due to bad data. 
In this blog post, I will cover an approach to managing data downtime that has been adopted by some of the best teams in the industry.<\/em><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe95443 elementor-widget elementor-widget-heading\" data-id=\"fe95443\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\"><h1 id=\"2096\" data-selectable-paragraph=\"\">So, what does it mean to measure data downtime?<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed1afd8 elementor-widget elementor-widget-text-editor\" data-id=\"ed1afd8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"2611\" data-selectable-paragraph=\"\">To begin unpacking that, let\u2019s look into what counts as \u201cdowntime\u201d. Data downtime refers to any time data is \u201cdown\u201d, i.e. 
when data teams find themselves answering \u201cno\u201d to common questions such as:<\/p>\n\n<ul>\n \t<li id=\"e325\" data-selectable-paragraph=\"\">Is the data in this report up-to-date?<\/li>\n \t<li id=\"d6cf\" data-selectable-paragraph=\"\">Is the data complete?<\/li>\n \t<li id=\"0e67\" data-selectable-paragraph=\"\">Are fields within reasonable ranges?<\/li>\n \t<li id=\"5412\" data-selectable-paragraph=\"\">Do my assumptions about upstream sources still hold true?<\/li>\n \t<li id=\"d73e\" data-selectable-paragraph=\"\">\u2026 and more<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-498a04b elementor-widget elementor-widget-text-editor\" data-id=\"498a04b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"c3c8\" data-selectable-paragraph=\"\">Or in other words\u2026\u00a0<em>Can I trust my data?<\/em><\/p>\n<p id=\"4fed\" data-selectable-paragraph=\"\"><strong>Answering these questions in real time is hard.<\/strong><\/p>\n<p id=\"7b84\" data-selectable-paragraph=\"\">Data organizations large and small struggle with these questions because (1) consistently tracking this information across data pipelines requires substantial resources; (2) at best, information is limited to a small subset of the data that has been laboriously instrumented; and (3) even when available, sifting through this information is tedious enough that teams often find out about data issues in hindsight.<\/p>\n<p id=\"6a67\" data-selectable-paragraph=\"\">In fact, it is typical that data consumers \u2014 product managers, marketing experts, executives, data scientists, or even customers \u2014 identify data downtime right at the moment when they need to use the data.
And somehow, that always happens late on a Friday afternoon\u2026<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7eeaf3b elementor-widget elementor-widget-heading\" data-id=\"7eeaf3b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h1 class=\"elementor-heading-title elementor-size-default\">\n<h1 id=\"d49e\" data-selectable-paragraph=\"\">How come we know everything about how well our data infrastructure is performing, but so little about whether the data is right?<\/h1><\/h1>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58d4400 elementor-widget elementor-widget-text-editor\" data-id=\"58d4400\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f421\" data-selectable-paragraph=\"\">A helpful analogy here comes from the world of infrastructure observability. Almost every engineering team has tools to monitor and track infrastructure and guarantee that it is performing as expected. This is often referred to as observability \u2014 the ability to determine a system\u2019s health based on its outputs.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2e3dae elementor-widget elementor-widget-text-editor\" data-id=\"d2e3dae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3323\" data-selectable-paragraph=\"\">Great data teams make investments in data observability \u2014 the ability to determine whether the data flowing in the system is healthy.
With observability comes the opportunity to detect issues before they impact data consumers, and to then pinpoint and fix problems in minutes instead of days and weeks.<\/p>\n<p id=\"24e6\" data-selectable-paragraph=\"\">So what makes for great data observability? Based on learnings from over 100 data teams, we\u2019ve identified the following:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-69c0529 elementor-widget elementor-widget-image\" data-id=\"69c0529\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1084\/0*nghcK_GapAgigOEa\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-7f84e52 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"7f84e52\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-4061104\" data-id=\"4061104\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a8e5625 elementor-widget elementor-widget-text-editor\" data-id=\"a8e5625\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"af33\" data-selectable-paragraph=\"\">Each pillar encapsulates a series of questions which, in aggregate, provide a 
holistic view of data health.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ee9f5c elementor-widget elementor-widget-text-editor\" data-id=\"7ee9f5c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"3bd8\" data-selectable-paragraph=\"\"><strong>Freshness:<\/strong>\u00a0is the data recent? When was the last time it was generated? What upstream data is included\/omitted?<\/li>\n \t<li id=\"7139\" data-selectable-paragraph=\"\"><strong>Distribution:\u00a0<\/strong>is the data within accepted ranges? Is it properly formatted? Is it complete?<\/li>\n \t<li id=\"d53c\" data-selectable-paragraph=\"\"><strong>Volume:<\/strong>\u00a0has all the data arrived?<\/li>\n \t<li id=\"5aea\" data-selectable-paragraph=\"\"><strong>Schema:<\/strong>\u00a0what is the schema, and how has it changed? Who has made these changes and for what reasons?<\/li>\n \t<li id=\"569b\" data-selectable-paragraph=\"\"><strong>Lineage:<\/strong>\u00a0for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision making?<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9f41ea9 elementor-widget elementor-widget-text-editor\" data-id=\"9f41ea9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6206\" data-selectable-paragraph=\"\">Admittedly, data can break in endless ways and for a wide range of reasons. 
Surprisingly, we have found time and time again that these pillars \u2014 if tracked and monitored \u2014 will surface almost any meaningful data downtime event.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Data downtime refers to periods of time when your data is partial, erroneous, missing or otherwise inaccurate, and almost every data organization struggles with it. It covers any time data teams find themselves answering &ldquo;no&rdquo; to common questions such as whether the data in a report is up-to-date or whether the data is complete. This blog post will cover an approach to managing data downtime that has been adopted by some of the best teams in the industry.<\/p>\n","protected":false},"author":727,"featured_media":3677,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3567],"class_list":["post-2258","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3567,"user_id":727,"is_guest":0,"slug":"barr-moses","display_name":"Barr Moses","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Moses","first_name":"Barr","job_title":"","description":"Barr Moses is Co-Founder &amp; CEO at <a href=\"http:\/\/montecarlodata.com\/\">Monte Carlo<\/a>.
She is an entrepreneur, speaker at O&rsquo;Reilly for Machine learning training, and worked at the Statistics Department at Stanford."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2258","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/727"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2258"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2258\/revisions"}],"predecessor-version":[{"id":35452,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2258\/revisions\/35452"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3677"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2258"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2258"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2258"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2258"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}