{"id":2267,"date":"2020-02-19T01:16:01","date_gmt":"2020-02-18T22:16:01","guid":{"rendered":"http:\/\/kusuaks7\/?p=1872"},"modified":"2024-01-08T14:18:38","modified_gmt":"2024-01-08T14:18:38","slug":"why-you-need-data-quality-automation-to-make-data-driven-decisions-with-confidence","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/why-you-need-data-quality-automation-to-make-data-driven-decisions-with-confidence\/","title":{"rendered":"Why You Need Data Quality Automation To Make Data-Driven Decisions With Confidence"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2267\" class=\"elementor elementor-2267\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-57439474 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"57439474\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7a59a70a\" data-id=\"7a59a70a\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3f291d40 elementor-widget elementor-widget-text-editor\" data-id=\"3f291d40\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe quarterly report showed a 40% drop in sales over the last quarter. The executive team panicked. The entire company is collapsing! Or is it? After the investigation concluded, they discovered that the sales were okay. The report was incorrect because an automated system had failed. 
The infrastructure that copied order files had failed to copy 6 out of 10 files to the data lake. It was a false alarm. The problem was not sales, but rather the quality of the data behind the dashboards.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9a97887 elementor-widget elementor-widget-heading\" data-id=\"9a97887\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"dataqualityisessentialfortrustworthydecisions\">Data quality is essential for trustworthy decisions<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-991faa4 elementor-widget elementor-widget-text-editor\" data-id=\"991faa4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tData quality is critical because data is used for decision making and powering AI models. Models and decisions are only as good as the data behind them, so any lack of confidence in the data makes them less useful for prediction and insight, and it slows down and undermines decision making. Trust in data is hard to get and easy to lose, so data quality must be maintained for models and dashboards to be useful at all times. In the last 10 years, the volume, variety, and velocity of business data have increased dramatically, making it impossible to control the quality of data with static testing alone. 
Therefore, to ensure confidence in analytics and predictive models at all times, data-driven firms need to be able to monitor data quality in production.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d15b1dd elementor-widget elementor-widget-text-editor\" data-id=\"d15b1dd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn this blog post, we review a typical data flow, the causes of data corruption, and how to set confidence levels and data quality goals. Lastly, we go under the hood and explain our solution for monitoring data quality in production in real time.\n\nBelow is a basic diagram of a typical data flow. Data enters the flow from internal databases and other sources. It lands in a central repository consisting mainly of the enterprise data warehouse and data lake. Data is then distributed to various outputs as required.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2551737 elementor-widget elementor-widget-image\" data-id=\"2551737\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-129.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-204aca1 elementor-widget elementor-widget-text-editor\" data-id=\"204aca1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAnywhere along this data flow, corruption may occur. 
We look at protections at each step.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-257fc80 elementor-widget elementor-widget-heading\" data-id=\"257fc80\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"theleadingcausesofdatacorruption\">The leading causes of data corruption<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-760c52e elementor-widget elementor-widget-text-editor\" data-id=\"760c52e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tData corruption takes on many forms and has many causes. Three of the most prevalent causes of data corruption are code, data sources, and infrastructure. Catching data corruption requires data monitoring in production in realtime.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1902762 elementor-widget elementor-widget-heading\" data-id=\"1902762\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"codecorruption\">Code\u00a0corruption<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7defc33 elementor-widget elementor-widget-text-editor\" data-id=\"7defc33\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe application itself may cause some data corruption. 
Traditional debugging techniques for application code may find some bugs, but they are unlikely to find them all. Static testing of code in data pipelines is not enough: 100% coverage is expensive to achieve, so some defects go undetected.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c0887b4 elementor-widget elementor-widget-heading\" data-id=\"c0887b4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"datasourcesandextractioncorruption\">Data\u00a0sources\u00a0and\u00a0extraction\u00a0corruption<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85c7707 elementor-widget elementor-widget-text-editor\" data-id=\"85c7707\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tData may already be corrupt when it leaves the data source. 
Sensors may fail; human error during data input could occur.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2debc2d elementor-widget elementor-widget-heading\" data-id=\"2debc2d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3 id=\"infrastructurecorruption\">Infrastructure\u00a0corruption<\/h3><\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8c50cc4 elementor-widget elementor-widget-text-editor\" data-id=\"8c50cc4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSources out of the direct control of the application can cause data corruption. Networks can go down; power glitches can affect data integrity; even sunspots have been known to cause network mishaps.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7a0e5fb elementor-widget elementor-widget-heading\" data-id=\"7a0e5fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"usingconfidencelevelstofinetunedataquality\">Using confidence levels to fine-tune data quality<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3462bb8 elementor-widget elementor-widget-text-editor\" data-id=\"3462bb8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tData quality is not a binary measurement. Some data may be difficult to resolve as valid or corrupt. 
For example, an expected value of a column may be between 5 and 50, but 0.3% of values in the dataset are out of range. Does this indicate value corruption or simply outliers? To better understand these marginal values, a \u201cconfidence level\u201d can be attached along with the data quality flag.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-325184f elementor-widget elementor-widget-heading\" data-id=\"325184f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"thegoalsofdataqualitymonitoring\">The goals of data quality monitoring<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0ee8147 elementor-widget elementor-widget-text-editor\" data-id=\"0ee8147\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet\u2019s focus on the data lake. 
With no data quality monitoring, the data lake would look like this:\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f0c16c elementor-widget elementor-widget-image\" data-id=\"5f0c16c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-130.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fd5b0d3 elementor-widget elementor-widget-text-editor\" data-id=\"fd5b0d3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOn the left are jobs supplying data to the lake. Data is stored and manipulated with no quality checking. Any corrupted data would not be detected, and it can spread the corruption to adjoining modules.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2c933f6 elementor-widget elementor-widget-text-editor\" data-id=\"2c933f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn the figure below, data quality jobs get inserted between regular data processing jobs. 
Each data processing job is isolated from the others with the data quality jobs:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1577d29 elementor-widget elementor-widget-image\" data-id=\"1577d29\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-131.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c6d94f6 elementor-widget elementor-widget-text-editor\" data-id=\"c6d94f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen the data quality job detects an anomaly, we immediately instigate action before the corruption spreads. 
Depending on the severity of the corruption, the remedy may be to flag the error and continue or, for more severe issues, to stop the pipeline and prevent further corruption.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c8f6661 elementor-widget elementor-widget-text-editor\" data-id=\"c8f6661\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLet&#8217;s look at these steps in more detail.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0e3d83 elementor-widget elementor-widget-heading\" data-id=\"b0e3d83\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"detectdatacorruption\">Detect data corruption<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-df0faa9 elementor-widget elementor-widget-text-editor\" data-id=\"df0faa9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis post deals mainly with this step. 
The first step in controlling data quality is always detecting the offending data point.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-78cdacc elementor-widget elementor-widget-heading\" data-id=\"78cdacc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"preventitfromspreading\">Prevent it from spreading<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cadcbed elementor-widget elementor-widget-text-editor\" data-id=\"cadcbed\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOnce we find an issue, it should be isolated from causing further corruption. Depending on the degree of severity, this could mean tagging the offending data point, disabling a sub-system, or even shutting down the entire system.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-60f6a7a elementor-widget elementor-widget-heading\" data-id=\"60f6a7a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"fixit\">Fix it<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b49c3c elementor-widget elementor-widget-text-editor\" data-id=\"4b49c3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOnce we identify an issue and have it isolated, the next step is to fix it. The support team responsible for the system is alerted. 
They may take one of several actions depending on the source of the issue. A system or subsystem may need rebooting. A sensor may need repair. The offending datapoint may need removal. These are a sampling of possible remedies.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-102a007 elementor-widget elementor-widget-heading\" data-id=\"102a007\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"methodstoimplementdataqualityinrealtime\">Methods to implement data quality in real-time<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-68a5d06 elementor-widget elementor-widget-text-editor\" data-id=\"68a5d06\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe proper way to enforce data quality in real-time is to embed data quality assurance logic in the data processing pipeline, after each step of the pipeline.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f24bf66 elementor-widget elementor-widget-image\" data-id=\"f24bf66\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-132.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-39e39cb elementor-widget elementor-widget-text-editor\" data-id=\"39e39cb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ab70b0f elementor-widget elementor-widget-heading\" data-id=\"ab70b0f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"controldivergencefromsorattheinput\">Control divergence from SoR at the input<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9606d40 elementor-widget elementor-widget-text-editor\" data-id=\"9606d40\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>Validate correctness of the imported data<\/li>\n \t<li>Prevent stale data from entering the system<\/li>\n \t<li>Prevent corruption accumulation in stream processing use cases<\/li>\n \t<li>Check data before it gets in the lake<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0aa63e6 elementor-widget elementor-widget-text-editor\" data-id=\"0aa63e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tEvery system of record (SoR) shares the following characteristics: it provides the most complete, most accurate and most timely data, it has the best structural conformance to the data model, it is nearest to the point of operational entry, and it can be used to feed other systems.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element 
elementor-element-78cfdbc elementor-widget elementor-widget-text-editor\" data-id=\"78cfdbc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe data quality job should compare the system of record with the datasets in the analytical data platform to ensure completeness and to control the staleness of data in the platform. Comparing the SoR with the datasets is especially useful when data is imported by streaming, since errors may accumulate over time.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c2b97f6 elementor-widget elementor-widget-heading\" data-id=\"c2b97f6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"validatebusinessrulesofstoreddata\">Validate business rules of stored data<\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c0caa2f elementor-widget elementor-widget-text-editor\" data-id=\"c0caa2f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n<li>Enforce schema<\/li>\n<li>Check for nulls<\/li>\n<li>Validate data ranges<\/li>\n<li>Specify and enforce data invariants<\/li>\n<\/ul>\n<p>Business rules are manually set up for data validation purposes. The system looks for issues such as nulls, out-of-range values in numerical fields, or violations of other business validation rules for specific data fields. 
A simple business rule specifies the range of values expected for a field.<\/p>\n<p><br><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9eef973 elementor-widget elementor-widget-text-editor\" data-id=\"9eef973\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf the actual value falls outside of this predetermined range, it should be investigated as possible data corruption.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-062ec6b elementor-widget elementor-widget-text-editor\" data-id=\"062ec6b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA null is a lack of data. In short, if you expect data and do not get it, that is an error. While null detection sounds trivial, a null may not get flagged as a data quality error unless it is checked for explicitly.\n<h4 id=\"detectanomaliesattheoutput\">Detect anomalies at the output<\/h4>\n<ul>\n \t<li>Fully automatic data quality enforcement<\/li>\n \t<li>Collect data profiles, metrics, and statistics<\/li>\n \t<li>Train ML models<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c62528 elementor-widget elementor-widget-text-editor\" data-id=\"1c62528\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\u201cSomething does not look quite right.\u201d Humans are naturally inclined to seek patterns and can notice when those patterns are disrupted. 
Computers can be trained to do the same thing to a much finer degree.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-43cf1fc elementor-widget elementor-widget-text-editor\" data-id=\"43cf1fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBelow is an example of a data graph bounded by a rolling average. An anomaly well outside of the normal range causes the rolling average to take a substantial departure from the norm and becomes immediately suspect:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5375a01 elementor-widget elementor-widget-image\" data-id=\"5375a01\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-quality.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5bcf5fb elementor-widget elementor-widget-heading\" data-id=\"5bcf5fb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"dataqualityreferencedemo\">Data quality reference demo<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1df5d73 elementor-widget elementor-widget-text-editor\" data-id=\"1df5d73\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tGrid Dynamics created a demonstration anomaly detection application using open-source modules Griffin, 
Grafana, and ElasticSearch, along with custom code and ML models. The goal is to minimize the manual definition of rules by automatically generating anomaly detection logic and embedding it into the data pipelines. The figure below shows the demo reference architecture and technology stack:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6b692b3 elementor-widget elementor-widget-image\" data-id=\"6b692b3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/blog.griddynamics.com\/content\/images\/2019\/10\/data-133.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e53c5ed elementor-widget elementor-widget-text-editor\" data-id=\"e53c5ed\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong><a href=\"https:\/\/griffin.apache.org\/#\" rel=\"noopener\">Apache Griffin<\/a><\/strong>\u00a0&#8211; An open-source Data Quality solution for Big Data, which supports both batch and streaming mode. 
It offers a unified process to measure your data quality from different perspectives, helping you build trusted data assets, therefore boost your confidence for your business.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0204bb6 elementor-widget elementor-widget-text-editor\" data-id=\"0204bb6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong><a href=\"https:\/\/grafana.com\/\" rel=\"noopener\">Grafana<\/a><\/strong>\u00a0&#8211; Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a7e8868 elementor-widget elementor-widget-text-editor\" data-id=\"a7e8868\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong><a href=\"https:\/\/www.elastic.co\/\" rel=\"noopener\">ElasticSearch<\/a><\/strong>\u00a0&#8211; Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9b33227 elementor-widget elementor-widget-text-editor\" data-id=\"9b33227\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tData quality jobs can be embedded in any data processing pipeline, implemented in Apache Spark, Apache Flink, Apache Beam, traditional Hadoop MapReduce jobs, or commercial data engineering tools. 
They can be integrated into on-premises data platforms or cloud API data platforms such as Google Dataproc, Google Dataflow, Google BigQuery, Amazon EMR, Azure Databricks Spark, and Azure HDInsight.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-52536ed elementor-widget elementor-widget-heading\" data-id=\"52536ed\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Conclusion<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-34a95df elementor-widget elementor-widget-text-editor\" data-id=\"34a95df\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWithout trust, you have nothing\u2026\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8de19fd elementor-widget elementor-widget-text-editor\" data-id=\"8de19fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn this blog post, we covered the causes of data corruption as well as methods to implement data quality in real time. Trusting the quality of your data is as important as the data itself; any decision based on mistrusted data becomes mistrusted itself. The cascade effect caused by the lack of data quality monitoring can become severe enough to impact the entire enterprise negatively. This issue is not isolated to any one industry but can affect virtually any company. 
For a smooth-running organization, data quality is essential.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7cafed8 elementor-widget elementor-widget-text-editor\" data-id=\"7cafed8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThis article originally appeared at <a href=\"https:\/\/blog.griddynamics.com\/why-you-need-data-quality-automation-to-make-data-driven-decisions\/\" rel=\"noopener\">Grid Dynamics Blog<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Data quality is critical because data is used for decision making and powering AI models. Models and decisions are only as good as the data behind them, so any lack of confidence in the data means they are less useful in predicting and providing insights, slowing down, and undermining fast decision making. Trust in data is hard to get and easy to lose, so data quality must be maintained for models and dashboards to be useful at all times.<\/p>\n","protected":false},"author":729,"featured_media":3720,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97],"ppma_author":[3571],"class_list":["post-2267","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence"],"authors":[{"term_id":3571,"user_id":729,"is_guest":0,"slug":"max-martynov","display_name":"Max Martynov","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Martynov","first_name":"Max","job_title":"","description":"Max Martynov is CTO at Grid Dynamics. 
Over the last decade, his focus evolved from HPC and scalable distributed platforms to Digital Transformation, Cloud, BigData, DevOps, Microservices architecture, and AI."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/729"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2267"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2267\/revisions"}],"predecessor-version":[{"id":35420,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2267\/revisions\/35420"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3720"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2267"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}