Why Is Documentation Important in Data Science?

Admond Lee
April 19, 2019 Big Data, Cloud & DevOps

Especially in Data Science Project Management

Whenever we talk about data science, chances are we’ll first think of fancy things like AI, deep learning, and machine learning.

But nobody talks about documentation, for a reason we all know: it doesn’t sound sexy (or it sounds downright boring).

And I completely agree that documentation is often not one of the most interesting things for data scientists. But still, its importance is no less than that of any other part of the data science workflow, especially in terms of data science project management.

In fact, documentation is no longer just a task done by programmers or developers. It’s something that we as data scientists should know how to do, and should do on a regular basis.

Interestingly, the 2017 GitHub Open Source Survey showed that “Incomplete or Confusing Documentation” was the top complaint about open source software.

[Figure: 2017 GitHub Open Source Survey]

Documentation is highly valued, but often overlooked.

And this doesn’t apply only to open source projects. The same holds true in our actual workplace, where we have to document our data science workflow and keep that documentation up to date.

In this article, you’ll see why documentation is important in data science.

Let’s get started!

So why is documentation important in data science?

1. Reproducibility

Yes, reproducibility.

I especially agree with the article written by Matt.0, The “Gold Standard” for Data Science Project Management. He has a GitHub repo with a Gold Standard workflow for setting up a new data science project directory. Go check that out!

In his article, he writes:

This excellent post on a Gold Standard for software documentation by Daniele Procida sums it up nicely when he says:

“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.

Even if for some reason they have to use it because they have no choice, without good documentation, they won’t use it effectively or the way you’d like them to.”

Without good, organized documentation, people will not use the code that you’ve worked on day and night, let alone reproduce your results.

Because let’s face it.

Say you did a great job writing the code base for a project and you feel proud of it. Then one fine day a colleague on your team wants to make improvements on top of your existing code.

Your colleague wants to reproduce your results but has no idea how to do so because of the poor documentation. Heck, your colleague might not even understand your code.

Reading someone else’s code is always torture without good documentation and comments.

The key point here is this: Make documentation as clear and simple as possible. Document your workflow consistently while you’re doing your projects (not after them). Document for the sake of helping others understand your workflow and code so that they can reproduce your results in the future, whether to fix bugs or make improvements. If possible, ask someone to read your documentation to make sure that others are able to follow your explanation.
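To make "document while you work" concrete, here is a minimal sketch of a small data-cleaning helper documented with a docstring and inline comments. The function name `clean_ages`, its parameters, the `max_age` threshold, and the scenario in the docstring are all hypothetical, made up purely for illustration; your own projects will look different.

```python
from typing import Iterable, List, Optional


def clean_ages(raw_ages: Iterable[Optional[str]], max_age: int = 120) -> List[int]:
    """Convert raw age strings to integers, dropping unusable entries.

    Why this exists (hypothetical scenario): the raw export stores ages
    as free text, so blanks and typos must be removed before modelling.

    Args:
        raw_ages: Age values as read from the source file; may contain
            None, empty strings, or non-numeric text.
        max_age: Upper bound for a plausible age (an assumed threshold).

    Returns:
        The valid ages as integers, in their original order.
    """
    cleaned = []
    for value in raw_ages:
        if value is None:
            continue  # missing entry: skip rather than guess a value
        text = value.strip()
        if not text.isdecimal():
            continue  # empty or non-numeric text such as "n/a" or "-5"
        age = int(text)
        if age <= max_age:
            cleaned.append(age)  # keep only plausible ages
    return cleaned


if __name__ == "__main__":
    print(clean_ages(["34", " 27 ", "", None, "n/a", "250"]))
```

The docstring records why the function exists and what it assumes, which is exactly the information a colleague needs to reproduce or change the cleaning step later.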

2. Ensure Successful Project Completion

This is important.

It doesn’t matter if you are an intern or in a full time position.

Because guess what?

No employee will be in a company forever. If people leave the company and pass their half-finished projects on to others without any documentation, there are only two outcomes: the projects will be left hanging in the air, or they will take much more time than expected to complete.

I learned this important lesson during my first internship as a data scientist.

At first I simply couldn’t understand why I was asked to document everything I did during my internship, from how I collected, cleaned, and analyzed data to which machine learning models I used and compared, as well as the results obtained.

Only after I’d gained more experience in this field did I realize how important documentation is for data science projects, and I couldn’t be more grateful for what I learned in my first internship. And I mean detailed documentation: all the mistakes made, methods attempted, insights obtained, suggested future actions, and so on.
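One lightweight way to keep that kind of record is a structured project log that you fill in as you go. The sketch below is only one possible shape for it: the class name `ExperimentNote`, its fields, and the sample entry are hypothetical and not taken from any particular tool or from my internship.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentNote:
    """One entry in a project log (hypothetical structure)."""

    title: str
    methods_attempted: List[str] = field(default_factory=list)
    mistakes: List[str] = field(default_factory=list)
    insights: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the note as plain text for a README or handover document."""
        sections = [
            ("Methods attempted", self.methods_attempted),
            ("Mistakes made", self.mistakes),
            ("Insights obtained", self.insights),
            ("Suggested next steps", self.next_steps),
        ]
        lines = [self.title]
        for heading, items in sections:
            lines.append(f"{heading}:")
            # Mark empty sections explicitly so readers know nothing was skipped.
            shown = items if items else ["(none recorded)"]
            lines.extend(f"  - {item}" for item in shown)
        return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative entry only; the contents are invented for this example.
    note = ExperimentNote(
        title="Baseline model comparison",
        methods_attempted=["logistic regression", "random forest"],
        mistakes=["forgot to hold out a test set on the first run"],
        next_steps=["try feature scaling before the linear model"],
    )
    print(note.to_text())
```

Keeping mistakes and next steps next to the methods means whoever inherits the project can see what has already been tried, not just what finally worked.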

We can’t assume others will understand what our projects are all about without showing them what the projects are and how they can build on our legacy when we’re no longer there.

This will make sure your projects continue to operate and stay in good hands.
