Piketty Revisited: Improving Economics through Data Science

Michael Brodie
February 15, 2019


How Data Curation Can Enable More Faithful Data Science (In Much Less Time)


Thomas Piketty’s monumental Capital in the Twenty-First Century offers deep challenges and lessons beyond the world of economics.

In adhering to strict economic principles, Piketty achieved the nearly impossible with no guidance from Data Science, let alone its tools. He instinctively followed many as-yet-unwritten principles of this emerging field. As an unwitting Data Science pioneer, Piketty spent a decade on the work. Following his model, emerging data curation tools may enable others to achieve similar results in two years or less.

Let’s look at his challenge: 

For each of his many hypotheses in his ~700-page book, Piketty had to: 

  • Discover, ingest, clean, analyze, transform, consolidate, enrich / interpolate and validate relevant economic data …
  • From ~200 external, official data sources with incomplete schemas and minor data heterogeneity …
  • To form a meaningful, consolidated, and curated target data set that met functional and data governance requirements (e.g., veracity, data quality, and significance) for the hypothesis. (A minimal code sketch of such a pipeline follows this list.)
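To make this concrete, here is a minimal sketch of such a curation pipeline in pandas. The source names, columns, units, and validation threshold are hypothetical illustrations, not Piketty’s actual data or requirements:

```python
# Hypothetical sketch of a Piketty-style curation pipeline.
import pandas as pd

# Two "official" sources for the same series, with minor heterogeneity:
# different column names, different units, and a gap in one of them.
src_a = pd.DataFrame({"year": [1950, 1960, 1970],
                      "capital_income_ratio": [2.9, 3.1, 3.4]})
src_b = pd.DataFrame({"Year": [1970, 1980, 1990],
                      "K/Y (%)": [345.0, None, 410.0]})

# Clean and transform: normalize schema and units.
src_b = src_b.rename(columns={"Year": "year", "K/Y (%)": "capital_income_ratio"})
src_b["capital_income_ratio"] /= 100.0  # percent -> ratio

# Consolidate and de-duplicate: prefer source A where years overlap.
merged = (pd.concat([src_a, src_b])
            .drop_duplicates(subset="year", keep="first")
            .sort_values("year")
            .reset_index(drop=True))

# Enrich / interpolate: fill the 1980 gap, flagging it so provenance
# can distinguish observed values from interpolated ones.
merged["interpolated"] = merged["capital_income_ratio"].isna()
merged["capital_income_ratio"] = merged["capital_income_ratio"].interpolate()

# Validate against a (hypothetical) plausibility requirement.
assert merged["capital_income_ratio"].between(0, 10).all(), "implausible ratio"
print(merged)
```

The point is not the handful of pandas calls but the discipline: every step (schema normalization, unit conversion, de-duplication, interpolation, validation) is explicit, and therefore auditable.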

For Piketty, faithful data science here was a dynamic process, requiring him to explore, acquire and curate the data he needed; model and verify it; modify his hypothesis and all the data that went with it; document its provenance; and then and only then publish it. 

Rigorous? Yes. Protracted? Yup – a decade. Worth it? Absolutely, especially in the context of the data controversy that ensued. 


Typically, debates about science, medicine, and economics focus on modelling and the related assumptions. However, the rapidly growing role, use, and value of Big Data adds Data Science to the debate: were the appropriate scientific, statistical, and economic methods followed in conducting the research? Specifically, did the data curation (the discovery, analysis, and combination of data sources into a curated data set for use in the analysis) achieve the veracity, data quality, and significance requirements of the economic analysis? Does the data curation provenance demonstrate that adequate data governance was applied?

Let’s address these questions through the example of economics. 

Computational Economics (a term coined or at least endorsed by Piketty, joining Computational Biology, Computational Social Science and many more) should be based on sound data science principles just as scientific experiments are based on the principles of the scientific method. 

Piketty’s work is an example of using computational economics to test collections of economic hypotheses against available data. Good data science practice in these cases involves determining data requirements from the economic hypotheses, including the veracity, data quality, and significance requirements. These controls, as they are called in physical science experiments and clinical studies, are derived from the model and the intended analysis, and should be verified, together with the model and the intended analysis, by expert economists. The economic analysis then proceeds in two steps: data curation and data analysis.
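As a sketch of what such controls might look like when written down as machine-checkable requirements before curation begins (every field name and threshold below is an assumption for illustration):

```python
# Hypothetical controls derived from an economic hypothesis.
from dataclasses import dataclass
import pandas as pd

@dataclass
class Controls:
    required_columns: tuple       # schema the hypothesis needs
    year_range: tuple             # (first, last) year the analysis must cover
    max_missing_fraction: float   # data-quality threshold
    min_sources: int              # corroboration / significance threshold

def check_controls(df: pd.DataFrame, n_sources: int, c: Controls) -> dict:
    """Per-requirement pass/fail, for review by expert economists."""
    schema_ok = set(c.required_columns) <= set(df.columns)
    missing = (df[list(c.required_columns)].isna().mean().max()
               if schema_ok else 1.0)
    return {
        "schema_ok": schema_ok,
        "coverage_ok": ("year" in df.columns
                        and df["year"].min() <= c.year_range[0]
                        and df["year"].max() >= c.year_range[1]),
        "quality_ok": missing <= c.max_missing_fraction,
        "significance_ok": n_sources >= c.min_sources,
    }

controls = Controls(required_columns=("year", "capital_income_ratio"),
                    year_range=(1950, 1990),
                    max_missing_fraction=0.05,
                    min_sources=2)
```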

As the Piketty data controversy illustrates, raw data sources are seldom in the form required for analysis. They are replete with errors, inconsistencies, and gaps, and must be augmented and combined to meet the requirements of the analysis. The economic facts that are analyzed are those curated from the raw data sources.
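One concrete instance of this problem, as a sketch: where two sources report overlapping years, disagreements beyond a tolerance should be flagged for adjudication rather than silently averaged away. The column names and tolerance here are assumptions:

```python
# Hypothetical inconsistency check between overlapping sources.
import pandas as pd

def flag_conflicts(a: pd.DataFrame, b: pd.DataFrame, key: str = "year",
                   value: str = "capital_income_ratio",
                   tol: float = 0.05) -> pd.DataFrame:
    """Rows where two sources disagree on the same year by more than tol."""
    overlap = a.merge(b, on=key, suffixes=("_a", "_b"))
    diff = (overlap[f"{value}_a"] - overlap[f"{value}_b"]).abs()
    return overlap[diff > tol]
```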

The data curation step involves discovering, analyzing, cleaning, transforming, combining, and de-duplicating data sources to produce target data sources that meet the requirements for input to the analysis. Every data curation step should be documented as data provenance that is then compared against the controls to determine the extent to which the appropriate data governance was followed and the required data quality was achieved. 
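A minimal sketch of what step-level provenance could look like in practice: each curation step appends an auditable record of what was done, to which inputs, and with what result. The record schema is an assumption, not a standard:

```python
# Hypothetical provenance log for curation steps.
import hashlib
import datetime

provenance = []

def record(step: str, df, sources: list) -> None:
    """Append an auditable provenance entry for one curation step."""
    digest = hashlib.sha256(df.to_csv(index=False).encode()).hexdigest()[:12]
    provenance.append({
        "step": step,              # what was done
        "sources": sources,        # which inputs it drew on
        "rows": len(df),           # size of the result
        "result_sha256": digest,   # fingerprint of the resulting data
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

# e.g., after the interpolation step in the earlier sketch:
# record("interpolate 1980 gap", merged, ["src_a", "src_b"])
```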

All of this must be verifiable by reviewers, supporters, and detractors alike. The analytical results should be accompanied by measures of data quality, backed by the data provenance and data governance records, showing the extent to which the requirements were met. Variations from the requirements should be factored into the level of confidence placed in the analytical results.
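As a sketch, that comparison can be mechanical: every unmet requirement is surfaced alongside the results and lowers the stated confidence. The categories and wording here are illustrative:

```python
# Hypothetical confidence note derived from requirement checks.
def confidence(checks: dict) -> str:
    """Turn per-requirement pass/fail results into a confidence note."""
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return "high: all data governance requirements met"
    return "qualified: unmet requirements -> " + ", ".join(failed)

# Using (hypothetical) results such as those from check_controls above:
checks = {"schema_ok": True, "coverage_ok": True,
          "quality_ok": False, "significance_ok": True}
print(confidence(checks))  # qualified: unmet requirements -> quality_ok
```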

Data provenance that demonstrates sound data science principles and data curation practices offers the best transparency for a worthy economic theory. Anything less can be assumed to conceal poor data practices. 

In conclusion, as the example of computational economics suggests, data curation (as supported by Tamr) must be thought of as an activity that precedes analysis: independent of it, yet partnered with it. Approached this way, the analyst has a far better chance of remaining faithful to strong data science principles. Otherwise, the analyst risks violating the principles of their analysis and of their domain, in Piketty’s case economics.

Have a look at Piketty’s groundbreaking results in economics and data science. 

A previous version of this article was published at Tamr.
 
