{"id":997,"date":"2018-11-22T02:23:47","date_gmt":"2018-11-21T23:23:47","guid":{"rendered":"http:\/\/kusuaks7\/?p=602"},"modified":"2021-05-17T20:39:37","modified_gmt":"2021-05-17T20:39:37","slug":"train-test-split-and-cross-validation-in-python","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/train-test-split-and-cross-validation-in-python\/","title":{"rendered":"Train\/Test Split and Cross Validation in Python"},"content":{"rendered":"<p><strong><em>Ready to learn Data Science? Browse&nbsp;<a href=\"https:\/\/www.experfy.com\/training\/tracks\/data-science-training-certification\">Data Science Training and Certification<\/a> courses developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<section name=\"ab77\">\n<p id=\"960b\" name=\"960b\">This post is about Train\/Test Split and Cross Validation. As usual, I am going to give a short overview of the topic and then give an example of implementing it in Python. These are two rather important concepts in data science and data analysis and are used as tools to prevent (or at least minimize)&nbsp;<a data-href=\"https:\/\/en.wikipedia.org\/wiki\/Overfitting\" href=\"https:\/\/en.wikipedia.org\/wiki\/Overfitting\" rel=\"noopener noreferrer\" target=\"_blank\">overfitting<\/a>. I\u2019ll explain what that is \u2014 when we\u2019re using a statistical model (like linear regression, for example), we usually fit the model on a training set in order to make predictions on data that the model wasn\u2019t trained on (general data). Overfitting means that we\u2019ve fit the model too closely to the training data. 
It will all make sense pretty soon, I promise!<\/p>\n<h3 id=\"edd3\" name=\"edd3\">What is Overfitting\/Underfitting a&nbsp;Model?<\/h3>\n<p id=\"7fe1\" name=\"7fe1\">As mentioned, in statistics and machine learning we usually split our data into two subsets: training data and testing data (and sometimes into three: train, validate and test), and fit our model on the train data in order to make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit our model. We don\u2019t want either of these things to happen, because they affect the predictability of our model \u2014 we might be using a model that has lower accuracy and\/or is ungeneralized (meaning you can\u2019t generalize your predictions to other data). Let\u2019s see what under- and overfitting actually mean:<\/p>\n<h4 id=\"7b33\" name=\"7b33\">Overfitting<\/h4>\n<p id=\"8708\" name=\"8708\">Overfitting means that the model we trained has trained \u201ctoo well\u201d and is now, well, fit too closely to the training dataset. This usually happens when the model is too complex (i.e. too many features\/variables compared to the number of observations). This model will be very accurate on the training data but will probably be very inaccurate on untrained or new data. That is because this model is not generalized (or not AS generalized), meaning you can\u2019t generalize the results and can\u2019t make any inferences on other data, which is, ultimately, what you are trying to do. Basically, when this happens, the model learns or describes the \u201cnoise\u201d in the training data instead of the actual relationships between variables in the data. 
This noise, obviously, isn\u2019t part of any new dataset, and cannot be applied to it.<\/p>\n<h4 id=\"28fb\" name=\"28fb\">Underfitting<\/h4>\n<p id=\"d5ed\" name=\"d5ed\">In contrast to overfitting, when a model is underfitted, it means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a very simple model (not enough predictors\/independent variables). It could also happen when, for example, we fit a linear model (like&nbsp;<a data-href=\"https:\/\/medium.com\/towards-data-science\/simple-and-multiple-linear-regression-in-python-c928425168f9\" href=\"https:\/\/medium.com\/towards-data-science\/simple-and-multiple-linear-regression-in-python-c928425168f9\" target=\"_blank\" rel=\"noopener noreferrer\">linear regression<\/a>) to data that is not linear. It almost goes without saying that this model will have poor predictive ability (on the training data, and it can\u2019t be generalized to other data).<\/p>\n<figure id=\"a4a0\" name=\"a4a0\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*tBErXYVvTw2jSUYK7thU2A.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*tBErXYVvTw2jSUYK7thU2A.png\"><\/figure>\n<p name=\"8a3b\" style=\"text-align: center;\">An example of overfitting, underfitting and a model that\u2019s \u201cjust&nbsp;right!\u201d<\/p>\n<p id=\"8a3b\" name=\"8a3b\">It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we want to avoid both of these problems in data analysis. You might say we are trying to find the middle ground between under- and overfitting our model. As you will see, train\/test split and cross validation help to avoid overfitting more than underfitting. 
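To make the train\/test gap concrete, here is a minimal sketch (the synthetic data and polynomial degrees are my own illustration, not from the original post): a high-degree polynomial fit to noisy linear data scores better than a straight line on its training points, but far worse on held-out points \u2014 exactly the overfitting pattern described above.

```python
# Minimal overfitting sketch (illustrative, synthetic data):
# fit polynomials of two degrees to noisy linear data and compare
# the error on the training points vs. the held-out points.
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.2, size=x.shape)  # truly linear + noise

train_idx, test_idx = np.arange(0, 20), np.arange(20, 30)

def train_test_mse(degree):
    # Fit on the training points only, then measure error on both subsets
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    pred = np.polyval(coeffs, x)
    train_err = np.mean((pred[train_idx] - y[train_idx]) ** 2)
    test_err = np.mean((pred[test_idx] - y[test_idx]) ** 2)
    return train_err, test_err

for degree in (1, 15):
    train_err, test_err = train_test_mse(degree)
    print(degree, train_err, test_err)
```

The degree-15 fit chases the noise, so its training error is lower than the straight line\u2019s, while its test error is far higher. 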
Let\u2019s dive into both of them!<\/p>\n<h3 id=\"a5ba\" name=\"a5ba\">Train\/Test Split<\/h3>\n<p id=\"6a0b\" name=\"6a0b\">As I said before, the data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model\u2019s predictions on this subset.<\/p>\n<figure id=\"b916\" name=\"b916\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*-8_kogvwmL1H6ooN1A1tsQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*-8_kogvwmL1H6ooN1A1tsQ.png\"><\/figure>\n<p name=\"bf8a\" style=\"text-align: center;\">Train\/Test Split<\/p>\n<p id=\"bf8a\" name=\"bf8a\">Let\u2019s see how to do this in Python. We\u2019ll do this using the&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/index.html\" href=\"http:\/\/scikit-learn.org\/stable\/index.html\" rel=\"noopener noreferrer\" target=\"_blank\">Scikit-Learn library<\/a>&nbsp;and specifically the&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\" href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\" rel=\"noopener noreferrer\" target=\"_blank\">train_test_split method<\/a>. 
We\u2019ll start with importing the necessary libraries:<\/p>\n<div id=\"3c12\" name=\"3c12\"><span style=\"font-family:courier new,courier,monospace;\">import pandas as pd<br \/>\nfrom sklearn import datasets, linear_model<br \/>\nfrom sklearn.model_selection import train_test_split<br \/>\nfrom matplotlib import pyplot as plt<\/span><\/div>\n<div name=\"3c12\">&nbsp;<\/div>\n<p id=\"2a74\" name=\"2a74\">Let\u2019s quickly go over the libraries I\u2019ve imported:<\/p>\n<ul>\n<li id=\"4253\" name=\"4253\"><strong>Pandas<\/strong> \u2014 to load the data file as a Pandas data frame and analyze the data. 
If you want to read more on Pandas, feel free to check out&nbsp;<a data-href=\"https:\/\/medium.com\/@adi.bronshtein\/a-quick-introduction-to-the-pandas-python-library-f1b678f34673\" href=\"https:\/\/medium.com\/@adi.bronshtein\/a-quick-introduction-to-the-pandas-python-library-f1b678f34673\" target=\"_blank\" rel=\"noopener noreferrer\">my post<\/a>!<\/li>\n<li id=\"5f55\" name=\"5f55\">From&nbsp;<strong>Sklearn<\/strong>, I\u2019ve imported the&nbsp;<em>datasets<\/em>&nbsp;module, so I can load a sample dataset, and the&nbsp;<em>linear_model<\/em>, so I can run a linear regression<\/li>\n<li id=\"bca9\" name=\"bca9\">From&nbsp;<strong>Sklearn<\/strong>\u2019s sub-library&nbsp;<strong>model_selection<\/strong>, I\u2019ve imported&nbsp;<em>train_test_split<\/em>&nbsp;so I can, well, split the data into training and test sets<\/li>\n<li id=\"8d86\" name=\"8d86\">From&nbsp;<strong>Matplotlib&nbsp;<\/strong>I\u2019ve imported&nbsp;<em>pyplot<\/em>&nbsp;in order to plot graphs of the data<\/li>\n<\/ul>\n<p id=\"c9ab\" name=\"c9ab\">OK, all set! 
Let\u2019s load in the&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.load_diabetes.html\" href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.load_diabetes.html\" rel=\"noopener noreferrer\" target=\"_blank\">diabetes dataset<\/a>, turn it into a data frame and define the columns\u2019 names:<\/p>\n<div id=\"97ce\" name=\"97ce\"><span style=\"font-family:courier new,courier,monospace;\"># Load the diabetes dataset<br \/>\ncolumns = \"age sex bmi map tc ldl hdl tch ltg glu\".split() # Declare the columns names<br \/>\ndiabetes = datasets.load_diabetes() # Call the diabetes dataset from sklearn<br \/>\ndf = pd.DataFrame(diabetes.data, columns=columns) # load the dataset as a pandas data frame<br \/>\ny = diabetes.target # define the target variable (dependent variable) as y<\/span><\/div>\n<div name=\"97ce\">&nbsp;<\/div>\n<p id=\"d0f5\" name=\"d0f5\">Now we can use the train_test_split function in order to make the split. The&nbsp;<em>test_size=0.2<\/em>&nbsp;inside the function indicates the percentage of the data that should be held out for testing. The split is usually around 80\/20 or 70\/30.<\/p>\n<div id=\"653b\" name=\"653b\"><span style=\"font-family:courier new,courier,monospace;\"># create training and testing vars<br \/>\nX_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)<br \/>\nprint(X_train.shape, y_train.shape)<br \/>\nprint(X_test.shape, y_test.shape)<\/span><\/div>\n<div name=\"a443\"><span style=\"font-family:courier new,courier,monospace;\">(353, 10) (353,)<br \/>\n(89, 10) (89,)<\/span><\/div>\n<div name=\"a443\">&nbsp;<\/div>\n<p id=\"49d9\" name=\"49d9\">Now we\u2019ll fit the model on the training data:<\/p>\n<div id=\"baf8\" name=\"baf8\"><span style=\"font-family:courier new,courier,monospace;\"># fit a model<br \/>\nlm = linear_model.LinearRegression()<\/span><\/div>\n<div name=\"9dca\"><span style=\"font-family:courier new,courier,monospace;\">model = lm.fit(X_train, y_train)<br \/>\npredictions = lm.predict(X_test)<\/span><\/div>\n<div name=\"9dca\">&nbsp;<\/div>\n<p id=\"c9fb\" name=\"c9fb\">As you can see, we\u2019re fitting the model on the training data and trying to predict the test data. Let\u2019s see what (some of) the predictions are:<\/p>\n<div id=\"0f1b\" name=\"0f1b\"><span style=\"font-family:courier new,courier,monospace;\">predictions[0:5]<br \/>\narray([ 205.68012533,&nbsp;&nbsp; 64.58785513,&nbsp; 175.12880278,&nbsp; 169.95993301,<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 128.92035866])<\/span><\/div>\n<div name=\"0f1b\">&nbsp;<\/div>\n<p id=\"1844\" name=\"1844\">Note: because I used [0:5] after predictions, it only showed the first five predicted values. 
Removing the [0:5] would have made it print all of the predicted values that our model created.<\/p>\n<p id=\"707b\" name=\"707b\">Let\u2019s plot the model:<\/p>\n<div id=\"99da\" name=\"99da\"><span style=\"font-family:courier new,courier,monospace;\">## The line \/ model<br \/>\nplt.scatter(y_test, predictions)<br \/>\nplt.xlabel(\"True Values\")<br \/>\nplt.ylabel(\"Predictions\")<\/span><\/div>\n<figure id=\"dc7c\" name=\"dc7c\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*2f6x7rNuN0_zbW3_O5vOEA.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*2f6x7rNuN0_zbW3_O5vOEA.png\"><\/figure>\n<p id=\"0105\" name=\"0105\">And print the accuracy score:<\/p>\n<div id=\"ae35\" name=\"ae35\"><span style=\"font-family:courier new,courier,monospace;\">print(\"Score:\", model.score(X_test, y_test))<\/span><\/div>\n<div name=\"fd38\"><span style=\"font-family:courier new,courier,monospace;\">Score: 0.485829586737<\/span><\/div>\n<div name=\"fd38\">&nbsp;<\/div>\n<p id=\"7d1c\" name=\"7d1c\">There you go! Here is a summary of what I did: I\u2019ve loaded in the data, split it into training and testing sets, fitted a regression model to the training data, made predictions based on this data and tested the predictions on the test data. Seems good, right?&nbsp;But train\/test split does have its dangers \u2014 what if the split we make isn\u2019t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women, or only people of a certain age? (Imagine a file ordered by one of these.) 
This will result in overfitting, even though we\u2019re trying to avoid it! This is where cross validation comes in.<\/p>\n<h3 id=\"7a51\" name=\"7a51\">Cross Validation<\/h3>\n<p id=\"5280\" name=\"5280\">In the previous paragraph, I mentioned the caveats of the train\/test split method. In order to avoid this, we can perform something called&nbsp;<a data-href=\"https:\/\/en.wikipedia.org\/wiki\/Cross-validation_(statistics)\" href=\"https:\/\/en.wikipedia.org\/wiki\/Cross-validation_%28statistics%29\" rel=\"noopener noreferrer\" target=\"_blank\">cross validation<\/a>. It\u2019s very similar to train\/test split, but it\u2019s applied to more subsets. Meaning, we split our data into k subsets and train on k-1 of those subsets, holding the last subset out for testing. We\u2019re able to do this for each of the subsets.<\/p>\n<figure id=\"47b3\" name=\"47b3\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*4G__SV580CxFj78o9yUXuQ.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*4G__SV580CxFj78o9yUXuQ.png\"><\/figure>\n<p name=\"6af2\" style=\"text-align: center;\">Visual representation of Train\/Test Split and Cross Validation. 
H\/t to my&nbsp;<a data-href=\"https:\/\/generalassemb.ly\/education\/data-science-immersive\" href=\"https:\/\/generalassemb.ly\/education\/data-science-immersive\" rel=\"noopener noreferrer\" target=\"_blank\">DSI<\/a>&nbsp;instructor,&nbsp;<a data-action=\"show-user-card\" data-action-type=\"hover\" data-action-value=\"690e8d667b35\" data-anchor-type=\"2\" data-href=\"https:\/\/medium.com\/@josephofiowa\" data-user-id=\"690e8d667b35\" href=\"https:\/\/medium.com\/@josephofiowa\" target=\"_blank\" rel=\"noopener noreferrer\">Joseph&nbsp;Nelson<\/a>!<\/p>\n<p id=\"6af2\" name=\"6af2\">There are a bunch of cross validation methods; I\u2019ll go over two of them: the first is&nbsp;<strong>K-Folds Cross Validation<\/strong>&nbsp;and the second is&nbsp;<strong>Leave One Out Cross Validation<\/strong>&nbsp;(LOOCV).<\/p>\n<h4 id=\"5bb4\" name=\"5bb4\">K-Folds Cross Validation<\/h4>\n<p id=\"e6fc\" name=\"e6fc\">In K-Folds Cross Validation we split our data into k different subsets (or folds). We use k-1 subsets to train our model and leave the last subset (or the last fold) as test data. We repeat this so that each fold serves as the test set once, then average the model\u2019s performance across the folds to finalize our evaluation.<\/p>\n<figure id=\"05a3\" name=\"05a3\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*J2B_bcbd1-s1kpWOu_FZrg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*J2B_bcbd1-s1kpWOu_FZrg.png\"><\/figure>\n<p name=\"0b46\" style=\"text-align: center;\">Visual representation of K-Folds. 
Again, H\/t to&nbsp;<a data-action=\"show-user-card\" data-action-type=\"hover\" data-action-value=\"690e8d667b35\" data-anchor-type=\"2\" data-href=\"https:\/\/medium.com\/@josephofiowa\" data-user-id=\"690e8d667b35\" href=\"https:\/\/medium.com\/@josephofiowa\" target=\"_blank\" rel=\"noopener noreferrer\">Joseph&nbsp;Nelson<\/a>!<\/p>\n<p id=\"0b46\" name=\"0b46\">Here is a very simple example from the&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold\" href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold\" rel=\"noopener noreferrer\" target=\"_blank\">Sklearn documentation<\/a>&nbsp;for K-Folds:<\/p>\n<div id=\"523c\" name=\"523c\"><span style=\"font-family:courier new,courier,monospace;\">import numpy as np # the example uses numpy arrays<br \/>\nfrom sklearn.model_selection import KFold # import KFold<br \/>\nX = np.array([[1, 2], [3, 4], [1, 2], [3, 4]]) # create an array<br \/>\ny = np.array([1, 2, 3, 4]) # Create another array<br \/>\nkf = KFold(n_splits=2) # Define the split &#8211; into 2 folds<br \/>\nkf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator<\/span><\/div>\n<div name=\"637f\"><span style=\"font-family:courier new,courier,monospace;\">print(kf)&nbsp;<\/span><\/div>\n<div name=\"0b89\"><span style=\"font-family:courier new,courier,monospace;\">KFold(n_splits=2, random_state=None, shuffle=False)<\/span><\/div>\n<div name=\"0b89\">&nbsp;<\/div>\n<p id=\"cfb9\" name=\"cfb9\">And let\u2019s see the result \u2014 the folds:<\/p>\n<div id=\"b6f3\" name=\"b6f3\"><span style=\"font-family:courier new,courier,monospace;\">for train_index, test_index in kf.split(X):<br \/>\n&nbsp;&nbsp;&nbsp; print(\"TRAIN:\", train_index, \"TEST:\", test_index)<br \/>\n&nbsp;&nbsp;&nbsp; X_train, X_test = X[train_index], X[test_index]<br \/>\n&nbsp;&nbsp;&nbsp; y_train, y_test = y[train_index], y[test_index]<\/span><\/div>\n<div name=\"d82c\"><span style=\"font-family:courier new,courier,monospace;\">TRAIN: [2 3] TEST: [0 1]<br \/>\nTRAIN: [0 1] TEST: [2 3]<\/span><\/div>\n<div name=\"d82c\">&nbsp;<\/div>\n<p id=\"9ab3\" name=\"9ab3\">As you can see, the function split the original data into different subsets. Again, a very simple example, but I think it explains the concept pretty well.<\/p>\n<h4 id=\"03b1\" name=\"03b1\">Leave One Out Cross Validation (LOOCV)<\/h4>\n<p id=\"8111\" name=\"8111\">This is another method for cross validation,&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut\" href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut\" rel=\"noopener noreferrer\" target=\"_blank\">Leave One Out Cross Validation<\/a>&nbsp;(by the way, these are not the only two methods; there are a bunch of other methods for cross validation. Check them out on the&nbsp;<a data-href=\"http:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.model_selection\" href=\"http:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.model_selection\" rel=\"noopener noreferrer\" target=\"_blank\">Sklearn website<\/a>). In this type of cross validation, the number of folds (subsets) equals the number of observations we have in the dataset. Each observation is held out once as the test set while the model is trained on all the remaining observations, and the results are then averaged. Because we would get a big number of training sets (equal to the number of samples), this method is very computationally expensive and should be used on small datasets. 
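To get a feel for that cost, one can simply count how many train\/test splits (and therefore model fits) each method requires. This is a minimal sketch; the sample size of 100 and 5 folds are my own illustration:

```python
# Sketch: compare how many model fits LOOCV needs vs. 5-fold CV
# on a dummy dataset of 100 observations.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((100, 3))  # 100 observations, 3 features (dummy data)

loo_fits = LeaveOneOut().get_n_splits(X)      # one fit per observation
kfold_fits = KFold(n_splits=5).get_n_splits(X)  # one fit per fold
print(loo_fits, kfold_fits)  # 100 5
```

LOOCV scales linearly with the number of samples, while k-fold stays at k fits no matter how large the dataset grows. 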
If the dataset is big, it would most likely be better to use a different method, like k-fold.<\/p>\n<p id=\"f2c7\" name=\"f2c7\">Let\u2019s check out another example from Sklearn:<\/p>\n<div name=\"58fe\"><span style=\"font-family:courier new,courier,monospace;\">from sklearn.model_selection import LeaveOneOut<br \/>\nX = np.array([[1, 2], [3, 4]])<br \/>\ny = np.array([1, 2])<br \/>\nloo = LeaveOneOut()<br \/>\nloo.get_n_splits(X)<\/span><\/div>\n<div name=\"58fe\">&nbsp;<\/div>\n<div name=\"58fe\"><span style=\"font-family:courier new,courier,monospace;\">for train_index, test_index in loo.split(X):<br \/>\n&nbsp;&nbsp;&nbsp; print(\"TRAIN:\", train_index, \"TEST:\", test_index)<br \/>\n&nbsp;&nbsp;&nbsp; X_train, X_test = X[train_index], X[test_index]<br \/>\n&nbsp;&nbsp;&nbsp; y_train, y_test = y[train_index], y[test_index]<br \/>\n&nbsp;&nbsp;&nbsp; print(X_train, X_test, y_train, y_test)<\/span><\/div>\n<div name=\"58fe\">&nbsp;<\/div>\n<p id=\"9703\" name=\"9703\">And this is the output:<\/p>\n<div id=\"56e3\" name=\"56e3\"><span style=\"font-family:courier new,courier,monospace;\">TRAIN: [1] TEST: [0]<br \/>\n[[3 4]] [[1 2]] [2] [1]<br \/>\nTRAIN: [0] TEST: [1]<br \/>\n[[1 2]] [[3 4]] [1] [2]<\/span><\/div>\n<div name=\"56e3\">&nbsp;<\/div>\n<p id=\"355d\" name=\"355d\">Again, a simple example, but I really do think it helps in understanding the basic concept of this method.<\/p>\n<p id=\"b0f5\" name=\"b0f5\">So, what method should we use? How many folds? 
Well, the more folds we have, the more we reduce the error due to bias, but we increase the error due to variance; the computational price goes up too, obviously \u2014 the more folds you have, the longer it takes to compute and the more memory you need. With a lower number of folds, we reduce the error due to variance, but the error due to bias will be bigger. It will also be computationally cheaper. Therefore, for big datasets, k=3 is usually advised. For smaller datasets, as I\u2019ve mentioned before, it\u2019s best to use LOOCV.<\/p>\n<\/section>\n<section name=\"0c07\">\n<hr>\n<p id=\"0a02\" name=\"0a02\">Let\u2019s check out the example I used before, this time using cross validation. I\u2019ll use the&nbsp;<em>cross_val_predict&nbsp;<\/em>function to return the predicted values for each data point when it\u2019s in the testing slice.<\/p>\n<div id=\"52d3\" name=\"52d3\"><span style=\"font-family:courier new,courier,monospace;\"># Necessary imports (in current sklearn versions these live in model_selection):<br \/>\nfrom sklearn.model_selection import cross_val_score, cross_val_predict<br \/>\nfrom sklearn import metrics<\/span><\/div>\n<div name=\"52d3\">&nbsp;<\/div>\n<p id=\"6135\" name=\"6135\">As you remember, earlier on I created the train\/test split for the diabetes dataset and fitted a model. 
Let\u2019s see what the score is after cross validation:<\/p>\n<div id=\"aacf\" name=\"aacf\"><span style=\"font-family:courier new,courier,monospace;\"># Perform 6-fold cross validation<br \/>\nscores = cross_val_score(model, df, y, cv=6)<br \/>\nprint(\"Cross-validated scores:\", scores)<\/span><\/div>\n<div name=\"0ae5\">&nbsp;<\/div>\n<div name=\"0ae5\"><span style=\"font-family:courier new,courier,monospace;\">Cross-validated scores: [ 0.4554861&nbsp;&nbsp; 0.46138572&nbsp; 0.40094084&nbsp; 0.55220736&nbsp; 0.43942775&nbsp; 0.56923406]<\/span><\/div>\n<div name=\"0ae5\">&nbsp;<\/div>\n<p id=\"c77c\" name=\"c77c\">As you can see, the last fold improved on the score of the original model \u2014 from 0.485 to 0.569. Not an amazing result, but hey, we\u2019ll take what we can get&nbsp;\ud83d\ude42<\/p>\n<p id=\"ad4f\" name=\"ad4f\">Now, let\u2019s plot the new predictions, after performing cross validation:<\/p>\n<div id=\"74fa\" name=\"74fa\"><span style=\"font-family:courier new,courier,monospace;\"># Make cross validated predictions<br \/>\npredictions = cross_val_predict(model, df, y, cv=6)<br \/>\nplt.scatter(y, predictions)<\/span><\/div>\n<figure id=\"9941\" name=\"9941\"><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*iHQeqD1jkhi1ihuxX_30lg.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*iHQeqD1jkhi1ihuxX_30lg.png\"><\/figure>\n<p id=\"fa8e\" name=\"fa8e\">You can see it\u2019s very different from the original plot from earlier. 
It has many more points than the original plot, because&nbsp;<em>cross_val_predict<\/em>&nbsp;returns a prediction for every observation in the dataset, not just for the 20% we held out as a test set earlier.<\/p>\n<p id=\"1b02\" name=\"1b02\">Finally, let\u2019s check the R\u00b2 score of the model (R\u00b2 is a \u201cnumber that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s)\u201d \u2014 basically, how accurate our model is):<\/p>\n<div id=\"5689\" name=\"5689\"><span style=\"font-family:courier new,courier,monospace;\">accuracy = metrics.r2_score(y, predictions)<br \/>\nprint(\"Cross-Predicted Accuracy:\", accuracy)<\/span><\/div>\n<div name=\"5689\">&nbsp;<\/div>\n<div name=\"8953\"><span style=\"font-family:courier new,courier,monospace;\">Cross-Predicted Accuracy: 0.490806583864<\/span><\/div>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model&rsquo;s prediction on this subset. Let&rsquo;s see how to do this in Python. 
Following a short overview on the topic, an example on implementing it in Python is given.<\/p>\n","protected":false},"author":397,"featured_media":22163,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[94],"ppma_author":[2225],"class_list":["post-997","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":2225,"user_id":397,"is_guest":0,"slug":"adi-bronshtein","display_name":"Adi Bronshtein","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Bronshtein","first_name":"Adi","job_title":"","description":"Adi Bronshtein&nbsp;is Data Scientist and Data Science Instructor Associate at General Assembly"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/997","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/397"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=997"}],"version-history":[{"count":2,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/997\/revisions"}],"predecessor-version":[{"id":22165,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/997\/revisions\/22165"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/22163"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=997"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=997"},{"taxonomy":"post_tag","embeddable":true,"hr
ef":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=997"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=997"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}