{"id":1654,"date":"2019-04-24T05:45:49","date_gmt":"2019-04-24T05:45:49","guid":{"rendered":"http:\/\/kusuaks7\/?p=1259"},"modified":"2023-08-24T07:51:19","modified_gmt":"2023-08-24T07:51:19","slug":"decision-trees-how-to-optimize-my-decision-making-process","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/decision-trees-how-to-optimize-my-decision-making-process\/","title":{"rendered":"Decision Trees: How to Optimize My Decision Making Process?"},"content":{"rendered":"<p>Let&#39;s imagine you are playing a game of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Twenty_Questions\" rel=\"noopener\">Twenty Questions<\/a>. Your opponent has secretly chosen a subject, and you must figure out what he\/she chose. At each turn, you may ask a yes-or-no question, and your opponent must answer truthfully. How do you find out the secret in the fewest number of questions?<\/p>\n<p>It should be obvious some questions are better than others. For example, asking &quot;Can it fly?&quot; as your first question is likely to be unfruitful, whereas asking &quot;Is it alive?&quot; is a bit more useful. Intuitively, you want each question to significantly narrow down the space of possibly secrets, eventually leading to your answer.<\/p>\n<p>That is the basic idea behind&nbsp;<strong>decision trees<\/strong>. At each point, you consider a set of questions that can partition your data set. You choose the question that provides the best split and again find the best questions for the partitions. You stop once all the points you are considering are of the same class. Then the task of classification is easy. You can simply grab a point, and chuck it down the tree. The questions will guide it to its appropriate class.<\/p>\n<h3><strong>IMPORTANT TERMS<\/strong><\/h3>\n<p>Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. It works for both categorical and continuous input and output variables.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"decision-tree.png\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a658c2241b0df2099c68\/1536796307457\/decision-tree.png\" data-image-dimensions=\"592x326\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99a658c2241b0df2099c68\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a658c2241b0df2099c68\/1536796307457\/decision-tree.png\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a658c2241b0df2099c68\/1536796307457\/decision-tree.png?format=750w\" style=\"width: 592px; height: 326px;\" \/><\/p>\n<p>Let&rsquo;s identify important terminologies on Decision Tree, looking at the image above:<\/p>\n<ul data-rte-list=\"default\">\n<li>\n<p><strong>Root Node<\/strong> represents the entire population or sample. It further gets divided into 2 or more homogeneous sets.<\/p>\n<\/li>\n<li>\n<p><strong>Splitting<\/strong> is a process of dividing a node into 2 or more sub-nodes.<\/p>\n<\/li>\n<li>\n<p>When a sub-node splits into further sub-nodes, it is called a <strong>Decision Node<\/strong>.<\/p>\n<\/li>\n<li>\n<p>Nodes that do not split is called a <strong>Terminal Node<\/strong> or a <strong>Leaf<\/strong>.<\/p>\n<\/li>\n<li>\n<p>When you remove sub-nodes of a decision node, this process is called <strong>pruning<\/strong>. The opposite of pruning is <strong>splitting<\/strong>.<\/p>\n<\/li>\n<li>\n<p>A sub-section of an entire tree is called a <strong>branch<\/strong>.<\/p>\n<\/li>\n<li>\n<p>A node, which is divided into sub-nodes is called a <strong>Parent Node<\/strong> of the sub-nodes; whereas the sub-nodes are called the <strong>Child<\/strong> of the parent node.<\/p>\n<\/li>\n<\/ul>\n<h3><strong>HOW IT WORKS<\/strong><\/h3>\n<p>Here I only talk about <strong>classification tree<\/strong>, which is used to predict a qualitative response. Regression tree is the one used to predict quantitative values.<\/p>\n<p>In a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.&nbsp;The task of growing a classification tree relies on using one of these 3 methods&nbsp;as a criterion for making the binary splits:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a68140ec9a28d64c90ef\/1536796294090\/\" data-image-dimensions=\"1200x800\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99a68140ec9a28d64c90ef\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a68140ec9a28d64c90ef\/1536796294090\/\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a68140ec9a28d64c90ef\/1536796294090\/?format=750w\" style=\"width: 700px; height: 467px;\" \/><\/p>\n<p><strong>1 &#8211; Classification Error Rate<\/strong>: We can define the &ldquo;hit rate&rdquo; as the fraction of training observations in a particular region that don&#39;t belong to the most widely occurring class. The error is given by this equation:<\/p>\n<p><em>E = 1 &#8211; argmax_{c}(&pi;\u0302<\/em> <em>mc)<\/em><\/p>\n<p>in which &pi;\u0302 mc&nbsp;represents the fraction of training data in region <em>R_m<\/em> that belong to class <em>c<\/em>.<\/p>\n<p><strong>2 &#8211; Gini Index<\/strong>: The Gini Index is an alternative error metric that is designed to show how &ldquo;pure&rdquo; a region is. &quot;Purity&quot; in this case means how much of the training data in a particular region belongs to a single class. If a region <em>R_m<\/em> contains data that is mostly from a single class <em>c<\/em> then the Gini Index value will be small:<\/p>\n<p><img decoding=\"async\" alt=\"gini.png\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6a22b6a288509f20944\/1536796328803\/gini.png\" data-image-dimensions=\"168x54\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99a6a22b6a288509f20944\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6a22b6a288509f20944\/1536796328803\/gini.png\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6a22b6a288509f20944\/1536796328803\/gini.png?format=750w\" \/><\/p>\n<p><strong>3 &#8211; Cross-Entropy<\/strong>: A third alternative, which is similar to the Gini Index, is known as the Cross-Entropy or Deviance:<\/p>\n<p><img decoding=\"async\" alt=\"cross-entropy.png\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6b56d2a73dd96d69b35\/1536796346592\/cross-entropy.png\" data-image-dimensions=\"168x54\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99a6b56d2a73dd96d69b35\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6b56d2a73dd96d69b35\/1536796346592\/cross-entropy.png\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99a6b56d2a73dd96d69b35\/1536796346592\/cross-entropy.png?format=750w\" \/><\/p>\n<p>The cross-entropy will take on a value near zero if the &pi;\u0302 mc&rsquo;s are all near <em>0<\/em> or near <em>1<\/em>. Therefore, like the Gini index, the cross-entropy will take on a small value if the m-th node is pure. In fact, it turns out that the Gini index and the cross-entropy are quite similar numerically.<\/p>\n<p>When building a classification tree, either the Gini index or the cross-entropy are typically used to evaluate the quality of a particular split, since they are more sensitive to node purity than is the classification error rate. Any of these 3 approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.<\/p>\n<h3><strong>IMPLEMENTATION FROM SCRATCH<\/strong><\/h3>\n<p>Let&rsquo;s&nbsp;walk through the decision tree&ndash;building algorithm, with all its fine details.&nbsp;To build a decision tree, we need to make an initial decision on the dataset to dictate which feature is used to split the data. To determine this, we must try every feature and measure which split will give us the best results. After that, we&rsquo;ll split the dataset into subsets. The subsets will then traverse down the branches of the first decision node. If the data on the branches is the same class, then we&rsquo;ve properly classified it and don&rsquo;t need to continue splitting it. If the data isn&rsquo;t the same, then we need to repeat the splitting process on this subset. The decision on how to split this subset is done the same way as the original dataset, and we repeat this process until we&rsquo;ve classified all the data.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c57270a6adaccf76e24c\/1536804213692\/\" data-image-dimensions=\"1024x675\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99c57270a6adaccf76e24c\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c57270a6adaccf76e24c\/1536804213692\/\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c57270a6adaccf76e24c\/1536804213692\/?format=750w\" style=\"width: 700px; height: 461px;\" \/><\/p>\n<p>How do we split our dataset?&nbsp;One way to organize this messiness is to measure the information. Using <strong>information theory<\/strong>, we can measure the information before and after the split.&nbsp;The change in information before and after the split is known as the <strong>information gain<\/strong>. When we know how to calculate the information gain, we can split our data across every feature to see which split gives us the highest information gain. The split with the highest information gain is our best option.<\/p>\n<p>In order to calculate the information gain, we can use the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Entropy_(information_theory)#Definition\" rel=\"noopener\">Shannon Entropy<\/a>, which is the expected value of all the information of all possible values of our class. Let&rsquo;s see the Python code for it:<\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">from math import log<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">def ShannonEntropy(data):<br \/>\n&nbsp; # Number of instances in the dataset<br \/>\n&nbsp; labelCounts = {}<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; # Create a dictionary of all possible classes<br \/>\n&nbsp; for featVec in data:<br \/>\n&nbsp;&nbsp;&nbsp; currentLabel = featVec[-1]<br \/>\n&nbsp;&nbsp;&nbsp; if currentLabel not in labelCounts.keys():<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; labelCounts[currentLabel] = 0<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; labelCounts[currentLabel] += 1<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; shannonEntropy = 0.0<br \/>\n&nbsp; for key in labelCounts:<br \/>\n&nbsp;&nbsp;&nbsp; # Probability of a label in our dataset<br \/>\n&nbsp;&nbsp;&nbsp; prob = float(labelCounts[key]) \/ len(data)<br \/>\n&nbsp;&nbsp;&nbsp; # Calculate the Shannon Entropy<br \/>\n&nbsp;&nbsp;&nbsp; shannonEntropy -= prob * log(prob, 2) # logarithm base 2<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; return shannonEntropy<\/span><\/p>\n<p>As you can see, we first calculate a count of the number of instances in the dataset. Then we create a dictionary whose keys are the values in the final column. If a key was not encountered previously, one is created. For each key, we keep track of how many times this label occurs. Finally, we use the frequency of all the different labels to calculate the probability of that label. This probability is used to calculate the Shannon entropy, and we sum this up for all the labels.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c5ca758d4663942a9491\/1536804302857\/\" data-image-dimensions=\"1738x966\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99c5ca758d4663942a9491\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c5ca758d4663942a9491\/1536804302857\/\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c5ca758d4663942a9491\/1536804302857\/?format=750w\" style=\"width: 700px; height: 389px;\" \/><\/p>\n<p>After figuring out a way to measure the entropy of the dataset, we need to split the dataset, measure the entropy on the split sets, and see if splitting it was the right thing to do. Here&rsquo;s the code to do so:<\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">def splitDataset(data, axis, value):<br \/>\n&nbsp; # Create a separate list<br \/>\n&nbsp; retData = []<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; for featVec in data:<br \/>\n&nbsp;&nbsp;&nbsp; if featVec[axis] == value:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Cut out the feature to split on<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; reducedFeatVec = featVec[:axis]<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; reducedFeatVec.extend(featVec[axis+1:])<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; retData.append(reducedFeatVec)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; return retData<\/span><\/p>\n<p>This code takes 3 input: the data to be split on, the feature to be split on, and the value of the feature to return. We create a new list at the beginning each time because we&rsquo;ll be calling this function multiple times on the same dataset and we don&rsquo;t want the original dataset modified.&nbsp;Our dataset is a list of lists; as we iterate over every item in the list and if it contains the value we&rsquo;re looking for, we&rsquo;ll add it to our newly created list. Inside the if statement, we cut out the feature that we split on.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"choose-feature-to-split.jpg\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6090e2e72572880a738\/1536804366950\/choose-feature-to-split.jpg\" data-image-dimensions=\"2500x943\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99c6090e2e72572880a738\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6090e2e72572880a738\/1536804366950\/choose-feature-to-split.jpg\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6090e2e72572880a738\/1536804366950\/choose-feature-to-split.jpg?format=750w\" style=\"width: 700px; height: 264px;\" \/><\/p>\n<p>Now, we&rsquo;re going to combine 2 functions: ShannonEntropy and splitDataset to cycle through the dataset and decided which feature is the best to split on.<\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">def chooseBestFeatureToSplit(data):<br \/>\n&nbsp; numFeatures = len(data[0]) &#8211; 1<br \/>\n&nbsp; baseEntropy = ShannonEntropy(data)<br \/>\n&nbsp; bestInfoGain = 0.0<br \/>\n&nbsp; bestFeature = -1<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; for i in range(numFeatures):<br \/>\n&nbsp;&nbsp;&nbsp; # Create unique list of class labels<br \/>\n&nbsp;&nbsp;&nbsp; featList = [example[i] for example in data]<br \/>\n&nbsp;&nbsp;&nbsp; uniqueVals = set(featList)<br \/>\n&nbsp;&nbsp;&nbsp; newEntropy = 0.0<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp;&nbsp;&nbsp; # Calculate entropy for each split<br \/>\n&nbsp;&nbsp;&nbsp; for value in uniqueVals:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; subData = splitDataset(data, i, value)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; prob = len(subData) \/ float(len(data))<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; newEntropy += prob * ShannonEntropy(subData)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp;&nbsp;&nbsp; infoGain = baseEntropy &#8211; newEntropy<br \/>\n&nbsp;&nbsp;&nbsp; if (infoGain &gt; bestInfoGain):<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # Find the best information gain<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bestInfoGain = infoGain<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bestFeature = 1<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; return bestFeature<\/span><\/p>\n<p>Let&rsquo;s quickly examine the code:<\/p>\n<ul data-rte-list=\"default\">\n<li>\n<p>We calculate the Shannon entropy of the whole dataset before any splitting has occurred and assign the value to variable <strong>baseEntropy.<\/strong><\/p>\n<\/li>\n<li>\n<p>The 1st for loop loops over all the features in our data. We use list comprehensions to create a list (<strong>featList<\/strong>) of all the i-th entries in the data or all the possible values present in the data.<\/p>\n<\/li>\n<li>\n<p>Next, we create a new set from a list to get the unique values out (<strong>uniqueVals<\/strong>).<\/p>\n<\/li>\n<li>\n<p>Then, we go through all the unique values of this feature and split the data for each feature (<strong>subData<\/strong>). The new entropy is calculated (<strong>newEntropy<\/strong>) and summed up for all the unique values of that feature. The information gain (<strong>infoGain<\/strong>) is the reduction in entropy (<strong>baseEntropy &#8211; newEntropy<\/strong>).<\/p>\n<\/li>\n<li>\n<p>Finally, we compare the information gain among all the features and return the index of the best feature to split on (<strong>bestFeature<\/strong>).<\/p>\n<\/li>\n<\/ul>\n<p>Now that we can measure how organized a dataset is and we can split the data, it&rsquo;s time to put all of this together and build the decision tree. From a dataset, we split it based on the best attribute to split. Once split, the data will traverse down the branches of the tree to another node. This node will then split the data again. We&rsquo;re going to use recursion to handle this.<\/p>\n<p>We&rsquo;ll only stop under the following conditions: (1) Run out of attributes on which to split or (2) all the instances in a branch are the same class. If all instances have the same class, then we&rsquo;ll create a leaf node. Any data that reaches this leaf node is deemed to belong to the class of that leaf node.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c659575d1f3c47433e96\/1536804443337\/\" data-image-dimensions=\"1140x660\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99c659575d1f3c47433e96\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c659575d1f3c47433e96\/1536804443337\/\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c659575d1f3c47433e96\/1536804443337\/?format=750w\" style=\"width: 700px; height: 405px;\" \/><\/p>\n<p>Here&rsquo;s the code to build our decision trees:<\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">def createTree(data, labels):<br \/>\n&nbsp; classList = [example[-1] for example in data]<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; # Stop when all classes are equal<br \/>\n&nbsp; if classList.count(classList[0]) == len(classList):<br \/>\n&nbsp;&nbsp;&nbsp; return classList[0]<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; # When no more features, return majority<br \/>\n&nbsp; if len(data[0]) == 1:<br \/>\n&nbsp;&nbsp;&nbsp; # A function that returns the class that occurs with the greatest frequency<br \/>\n&nbsp;&nbsp;&nbsp; # You can write your own<br \/>\n&nbsp;&nbsp;&nbsp; return majorityCount(classList)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; bestFeat = chooseBestFeatureToSplit(data)<br \/>\n&nbsp; bestFeatLabel = labels[bestFeat]<br \/>\n&nbsp; myTree = {bestFeatLabel:{}}<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; # Get list of unique values<br \/>\n&nbsp; del(labels[bestFeat])<br \/>\n&nbsp; featValues = [example[bestFeat] for example in data]<br \/>\n&nbsp; uniqueVals = set(featValues)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; for value in uniqueVals:<br \/>\n&nbsp;&nbsp;&nbsp; subLabels = labels[:]<br \/>\n&nbsp;&nbsp;&nbsp; myTree[bestFeatLabel][value] = createTree(splitDataset(data, bestFeat, value), subLabels)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">&nbsp; return myTree<\/span><\/p>\n<p>Our code takes 2 inputs: the data and a list of labels:<\/p>\n<ul data-rte-list=\"default\">\n<li>\n<p>We first create a list of all the class labels in the dataset and call this <strong>classList<\/strong>.&nbsp;The first stopping condition is that if all the class labels are the same, then we return this label.&nbsp;The second stopping condition is the case when there are no more features to split. If we don&rsquo;t meet the stopping conditions, then we use the function <strong>chooseBestFeatureToSplit<\/strong> to choose the best feature.<\/p>\n<\/li>\n<li>\n<p>To create the tree, we&rsquo;ll store it in a dictionary (<strong>myTree<\/strong>). Then we get&nbsp;get all the unique values from the dataset for our chosen feature (<strong>bestFeat<\/strong>).&nbsp;The unique value code uses sets (<strong>uniqueVals<\/strong>).<\/p>\n<\/li>\n<li>\n<p>Finally, we iterate over all the unique values from our chosen feature and recursively call <strong>createTree()<\/strong> for each split of the dataset. This value is inserted into <strong>myTree<\/strong> dictionary, so we end up with a lot of nested dictionaries representing our tree.<\/p>\n<\/li>\n<\/ul>\n<h3><strong>IMPLEMENTATION VIA SCIKIT-LEARN<\/strong><\/h3>\n<p>Now that we know how to implement the algorithm from scratch, let&rsquo;s make use of <strong>scikit-learn<\/strong>. In particular, we&rsquo;ll use the <em>DecisionTreeClassifier<\/em> class.&nbsp;Working with the iris dataset, we first import the data and split it into a training and a test part. Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure). We fix the random_state in the tree, which is used for tie-breaking internally:<\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\">from sklearn import datasets<br \/>\nfrom sklearn.model_selection import train_test_split<br \/>\nfrom sklearn.tree import DecisionTreeClassifier<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Generate the iris dataset<br \/>\niris = datasets.load_iris()<br \/>\nX = iris.data<br \/>\ny = iris.target<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Split data into training and test sets<br \/>\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Build a model using the default setting of fully developing the tree<br \/>\ntree = DecisionTreeClassifier(random_state=0)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Fit the classifier on training set<br \/>\ntree.fit(X_train, y_train)<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Make the predictions on test set<br \/>\npredictions = tree.predict(X_test)<br \/>\nprint(&quot;Test set predictions: {}&quot;.format(predictions))<\/span><\/p>\n<p><span style=\"font-family:courier new,courier,monospace;\"># Evaluate the model<br \/>\naccuracy = tree.score(X_test, y_test)<br \/>\nprint(&quot;Test set accuracy: {:.2f}&quot;.format(accuracy))<\/span><\/p>\n<p>Running the model should gives us a test set accuracy of 95%, meaning the model predicted the class correctly for 95% of the samples in the test dataset.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" alt=\"decision-trees-iris.jpg\" data-image=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6aacd836605b060be80\/1536804525954\/decision-trees-iris.jpg\" data-image-dimensions=\"1200x638\" data-image-focal-point=\"0.5,0.5\" data-image-id=\"5b99c6aacd836605b060be80\" data-image-resolution=\"750w\" data-load=\"false\" data-src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6aacd836605b060be80\/1536804525954\/decision-trees-iris.jpg\" data-type=\"image\" src=\"https:\/\/static1.squarespace.com\/static\/59d9b2749f8dce3ebe4e676d\/t\/5b99c6aacd836605b060be80\/1536804525954\/decision-trees-iris.jpg?format=750w\" style=\"width: 700px; height: 372px;\" \/><\/p>\n<h3><strong>STRENGTHS AND WEAKNESSES<\/strong><\/h3>\n<p>The major advantage of using decision trees is that they are intuitively very easy to explain. They closely mirror human decision-making compared to other regression and classification approaches. They can be displayed graphically, and they can easily handle qualitative predictors without the need to create dummy variables.<\/p>\n<p>Another huge advantage is such the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don&rsquo;t depend on scaling, no pre-processing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when we have features that are on completely different scales, or a mix of binary and continuous features.<\/p>\n<p>However, decision trees generally do not have the same level of predictive accuracy as other approaches, since they aren&#39;t quite robust. A small change in the data can cause a large change in the final estimated tree. Even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. Therefore, in most applications, by aggregating many decision trees, using methods like&nbsp;<em>bagging<\/em>,&nbsp;<em>random forests<\/em>, and&nbsp;<em>boosting<\/em>, the predictive performance of decision trees can be substantially improved.<\/p>\n<p><em>Reference Sources:<\/em><\/p>\n<ul data-rte-list=\"default\">\n<li>\n<p><a href=\"http:\/\/www-bcf.usc.edu\/~gareth\/ISL\/\" rel=\"noopener\"><em>Introduction to Statistical Learning<\/em><\/a><em> by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2014) <\/em><\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/www.manning.com\/books\/machine-learning-in-action\" rel=\"noopener\"><em>Machine Learning In Action<\/em><\/a><em> by Peter Harrington (2012) <\/em><\/p>\n<\/li>\n<li>\n<p><a href=\"http:\/\/shop.oreilly.com\/product\/0636920030515.do\" rel=\"noopener\"><em>Introduction to Machine Learning with Python<\/em><\/a><em> by Sarah Guido and Andreas Muller (2016) <\/em><\/p>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Decision tree is a type of supervised learning algorithm that can be used in both regression and classification problems. It works for both categorical and continuous input and output variables. The major advantage of using decision trees is that they are intuitively very easy to explain. They closely mirror human decision-making compared to other regression and classification approaches. They can be displayed graphically, and they can easily handle qualitative predictors without the need to create dummy variables.<\/p>\n","protected":false},"author":86,"featured_media":2554,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[1842],"class_list":["post-1654","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":1842,"user_id":86,"is_guest":0,"slug":"james-le","display_name":"James Le","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Le","first_name":"James","job_title":"","description":"James Le is a Software Developer with experiences in Product Management and Data Analytics. He played a pivotal role in the operation of a start-up organization at Denison University."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/86"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1654"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1654\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2554"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1654"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1654"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1654"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}