{"id":1348,"date":"2019-02-15T10:32:04","date_gmt":"2019-02-15T07:32:04","guid":{"rendered":"http:\/\/kusuaks7\/?p=953"},"modified":"2023-08-08T14:13:40","modified_gmt":"2023-08-08T14:13:40","slug":"the-effect-of-naming-in-data-science-code","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-effect-of-naming-in-data-science-code\/","title":{"rendered":"The Effect of Naming in Data Science Code"},"content":{"rendered":"<p style=\"margin-left: -1.95pt; text-align: center;\"><img decoding=\"async\" style=\"width: 650px; height: 300px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*oD0Kb93iePyHdefQJx31lQ.jpeg\" alt=\"experfy-blog\" \/><\/p>\n<p>Even though there are tools that let you practice data science without coding, they are far from sufficient. Data scientists will keep writing and reading code, and reading code that has poor readability is a horrible experience. This post focuses on the importance of naming entities (e.g. variables, functions) and how easily good naming improves the quality of your code.<\/p>\n<h3 style=\"margin-left: -1.6pt;\"><strong>\u201cThere will be\u00a0code\u201d<\/strong><\/h3>\n<h4><em>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u201c\u2026 I also expect that the number of domain-specific <\/em><\/h4>\n<h4><em>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0languages will continue to grow. This will be a good thing. 
<\/em><\/h4>\n<h4><em>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0But it will not eliminate code.\u201d<\/em><\/h4>\n<p>wrote\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Robert_C._Martin\" target=\"_blank\" rel=\"noopener noreferrer\">Robert C. Martin<\/a>\u00a0in the opening pages of his book\u00a0<a href=\"https:\/\/www.amazon.com\/Clean-Code-Handbook-Software-Craftsmanship\/dp\/0132350882\" target=\"_blank\" rel=\"noopener noreferrer\">Clean Code: A Handbook of Agile Software Craftsmanship<\/a>, in the opening section of the first chapter, titled \u201cThere Will Be Code\u201d. We data scientists write and read code. Even though there are tools that help, we will still be writing and reading code. It is one of our core practices. This is how we analyze data, train models, predict outcomes, and much more. I strongly believe that there is no escape from code for a data scientist.<\/p>\n<p>We can still practice data science without coding, at some level. From the early days until today, there have been graphical tools that allow you to analyze data or practice machine learning. One example of such tools is\u00a0<a href=\"https:\/\/www.cs.waikato.ac.nz\/ml\/weka\/\" target=\"_blank\" rel=\"noopener noreferrer\">WEKA<\/a>. WEKA is a bundle of machine learning tools with a graphical user interface. According to Wikipedia, its development started in 1993<a href=\"https:\/\/en.wikipedia.org\/wiki\/Weka_%28machine_learning%29\" target=\"_blank\" rel=\"noopener noreferrer\">*<\/a>. It allows users to conduct machine learning experiments and more without writing a single line of code. Then why do I insist that writing and reading code is fundamental to data scientists? Because there will be custom operations.<\/p>\n<p>Graphical tools have only so many operations. If you are not doing the same tasks every day, there will come a time when the tools do not have the operation you need. This can be an analysis, a machine learning model, or some other operation. 
You need control, customization, and expansion over your operations at some level. The level depends on the task, libraries, or tools. As many problems require new or custom approaches, you will soon outgrow those graphical tools.<\/p>\n<p>We seldom work alone. Organizations often have data science teams rather than a single data scientist. Even when that is not the case, data scientists work with people from other disciplines. This collaboration requires good code quality.<\/p>\n<p>Moreover, if you are working for a company, your colleagues from other teams will need your code. In order to embed your models into a backend, a frontend, or another system, you will need code. A model that cannot be deployed or integrated would be useless for your company.<\/p>\n<p>If you agree that there will be code in data science, let us talk about how to write good code for data science. Writing good code is a hard task. There are well-written texts explaining why it is necessary and how to achieve it. In this post, my goal is to focus on a tiny bit of that: a bit that requires no training or education but will improve your code quality significantly. That bit is\u00a0<strong>naming<\/strong>, and it will improve the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Computer_programming#Readability_of_source_code\" target=\"_blank\" rel=\"noopener noreferrer\">readability of your code<\/a>.<\/p>\n<div align=\"center\">\n<hr align=\"center\" size=\"0\" width=\"100%\" \/>\n<\/div>\n<h3 style=\"margin-left: -1.6pt;\"><strong>One Easy Trick:\u00a0Renaming<\/strong><\/h3>\n<p>Reading a piece of code that has poor readability is a horrible experience. Let us look at this simple example below:<\/p>\n<p style=\"text-align: center;\">\n<p>Try to guess what this code does. The author of this code knew the goal of each line, and of the code as a whole, while she was writing it. However, as a reader you see a code snippet you need to decipher. 
Because the author did not pay attention to readability, you will spend more time and energy trying to understand this code. Furthermore, you will be more prone to making mistakes. Let us dissect this example to see why it is bad:<\/p>\n<ul>\n<li><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">import pandas as pd<\/span><\/span> I know importing pandas as pd is very standard these days. All the data scientists who work with pandas will understand this shorthand. It is not the biggest problem in the code, but I believe that this can be improved. In addition, short (e.g. two-character) names are troublesome in autocompletion (e.g. pd vs. pdb).<\/li>\n<li><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">df =\u00a0&#8230;\u00a0<\/span><\/span> Again, probably because of the pandas tutorials, it is a widely applied practice to name a DataFrame df. However, the name hides what the data is in this context. What kind of data is it?<\/li>\n<li><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">&quot;f1&quot;:&#8230; &quot;f2&quot;:&#8230;<\/span><\/span>\u00a0These are the columns of our data frame. However, the names are not informative. What kind of data do those columns hold? Why are they numbered 1 and 2? Does the numbering have a purpose (e.g. first and second of something), or not?<\/li>\n<li><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">df[df.f1 &gt;= 18]<\/span><\/span>\u00a0What is the meaning of 18 in this line? Is it some kind of magic number? 
Why are we keeping values greater than or equal to 18, and why are we filtering on the f1 column?<\/li>\n<li><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #e6e6fa;\">out = df2.f2.mean()<\/span><\/span>\u00a0Same problem all over again. What is the significance of column f2? Why are we taking its mean?<\/li>\n<\/ul>\n<p>Now, instead of explaining its purpose, let me rewrite it in a more readable way. I will just rename the variables and keep the rest of the code the same.<\/p>\n<p style=\"text-align: center;\">\n<p>With just renaming, you understand the goal of the snippet and the operation in each line at a glance. I believe that an explanation of the code is no longer necessary, as it is very clear.<\/p>\n<h3 style=\"margin-left: -1.6pt;\"><strong>data =\u00a0\u2026, But Which\u00a0Data?<\/strong><\/h3>\n<p>I believe that naming in data science code can be harder than in generic software code. We have fewer concepts that fit into object design, more abstract entities, and diverse collections. This makes naming harder in data science. As an example, think about working with a table that has lots of columns: for instance, a table with columns about the customer as a person (e.g. name, age), customer behavior (e.g. purchases), geographic information (e.g. address of the purchase), and temporal information (e.g. time of the purchase). How would you name this table?<\/p>\n<p>You can come up with different names for this table, but there are common bad ones. For instance, do not name it\u00a0dt\u00a0or\u00a0df. Even though you will see df everywhere in the pandas documentation, you must understand that those are short code examples that do not belong to a project. Also, do not name it\u00a0data. Yes, it is data, but would you name a variable that holds an age \u201cinteger\u201d? \u201cData\u201d as a name is very vague. Which data is it? What kind of information does it hold?<\/p>\n<p>Do not be afraid to use long names. 
Longer names usually carry more information about the entity. Having longer names is not a burden: many decent IDEs, like\u00a0<a href=\"https:\/\/www.jetbrains.com\/pycharm\/download\/\" target=\"_blank\" rel=\"noopener noreferrer\">PyCharm<\/a>, have auto-complete features, so you do not have to type the full name.<\/p>\n<p>When coding, think about the next person who will read your code. Will she understand the main goal of the script and how each line serves that goal? More specifically for naming: will she quickly grasp the meaning of an entity just from its name?<\/p>\n<div align=\"center\">\n<hr align=\"center\" size=\"0\" width=\"100%\" \/>\n<\/div>\n<h3 style=\"margin-left: -1.6pt;\"><strong>No Excuse for Bad\u00a0Code<\/strong><\/h3>\n<p>Better names are not only for your teammates or only for yourself; both benefit from this habit. Your teammate will read the code you wrote more easily and quickly. You will write your code more easily, as you will carry less cognitive load memorizing the meaning of your entities. As a result, your team and organization will benefit as well.<\/p>\n<p>The habit of naming better might seem hard to build at first. You may not want to spend your time finding better names. However, this is a habit that pays off. You should practice it even if the code you write is a prototype or part of a tiny project. Without your knowing, such code may turn into a bigger project. There are many cases where projects that were supposed to be very small kept being developed for years, and projects that were designed as \u201cfire and forget\u201d ended up being very important for the organization. 
This is why even your shortest code should have good naming.<\/p>\n<p>If you have never thought about better naming, I hope that after reading this you will try naming your entities better and see how it improves the quality of your code.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ready to learn Data Science? Browse courses\u00a0like\u00a0Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab. Even though there are tools allowing to practice data science without coding, they are far from sufficient. Data scientists will be writing and reading code. Reading code that has poor readability is a<\/p>\n","protected":false},"author":263,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1779],"class_list":["post-1348","post","type-post","status-publish","format-standard","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1779,"user_id":263,"is_guest":0,"slug":"kemal-yesilbek","display_name":"Kemal Yesilbek","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Yesilbek","first_name":"Kemal","job_title":"","description":"Kemal Tugrul Yesilbek, data scientist at Lone Rooftop, is focused on machine learning and data science practices. He published multiple research papers on machine learning and its applications in academic journals and conferences. 
He is experienced in building machine learning solutions from idea to operation."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1348","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/263"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1348"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1348\/revisions"}],"predecessor-version":[{"id":30060,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1348\/revisions\/30060"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1348"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1348"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1348"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1348"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}