{"id":1047,"date":"2018-12-31T02:32:30","date_gmt":"2018-12-30T23:32:30","guid":{"rendered":"http:\/\/kusuaks7\/?p=652"},"modified":"2021-05-11T14:02:50","modified_gmt":"2021-05-11T14:02:50","slug":"tpot-automated-machine-learning-in-python","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/tpot-automated-machine-learning-in-python\/","title":{"rendered":"TPOT Automated Machine Learning in Python"},"content":{"rendered":"<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*dCD9QwVjhVnKKz6U.jpg\" \/><\/p>\n<p style=\"text-align: center;\">TPOT graphic from the&nbsp;<a data-href=\"http:\/\/epistasislab.github.io\/tpot\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/\" rel=\"noopener noreferrer\" target=\"_blank\">docs<\/a><\/p>\n<p>&nbsp;<\/p>\n<p id=\"ea69\" name=\"ea69\">Automated machine learning doesn&rsquo;t replace the data scientist (at least not yet), but it might be able to help you find good models faster. TPOT bills itself as your Data Science Assistant.<\/p>\n<blockquote id=\"cb7a\" name=\"cb7a\"><p>TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.<\/p><\/blockquote>\n<p id=\"105d\" name=\"105d\">So TPOT helps you find good algorithms. 
Note that it isn&rsquo;t designed for automating deep learning; something like AutoKeras might be helpful there.<\/p>\n<figure id=\"abe1\" name=\"abe1\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*iYQTbI4WVGUF1_F1.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*iYQTbI4WVGUF1_F1.png\" \/><\/p>\n<\/figure>\n<p name=\"f60d\" style=\"text-align: center;\"><a data-href=\"http:\/\/epistasislab.github.io\/tpot\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>An example machine learning pipeline (source: TPOT&nbsp;docs)<\/strong><\/a><\/p>\n<p id=\"f60d\" name=\"f60d\">TPOT is built on the scikit-learn library and follows the scikit-learn API closely. It can be used for regression and classification tasks and has special implementations for medical research.<\/p>\n<p id=\"99d6\" name=\"99d6\">TPOT is open source, well documented, and under active development. Its development was spearheaded by researchers at the University of Pennsylvania. 
TPOT appears to be one of the most popular autoML libraries, with nearly&nbsp;<a data-href=\"https:\/\/github.com\/EpistasisLab\/tpot\/\" href=\"https:\/\/github.com\/EpistasisLab\/tpot\/\" rel=\"noopener noreferrer\" target=\"_blank\">4,500 GitHub stars<\/a>&nbsp;as of August 2018.<\/p>\n<h3 id=\"47d3\" name=\"47d3\">How does TPOT&nbsp;work?<\/h3>\n<figure id=\"56b5\" name=\"56b5\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*vqs4500HEIa4eN4D.png\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/0*vqs4500HEIa4eN4D.png\" \/><\/p>\n<\/figure>\n<p name=\"7ff3\" style=\"text-align: center;\"><a data-href=\"http:\/\/epistasislab.github.io\/tpot\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/\" rel=\"noopener noreferrer\" target=\"_blank\">An example TPOT Pipeline<\/a>&nbsp;(source: TPOT&nbsp;docs)<\/p>\n<p id=\"7ff3\" name=\"7ff3\">TPOT uses what its developers call a genetic search algorithm to find the best parameters and model ensembles. It could also be thought of as a natural selection or evolutionary algorithm. TPOT tries a pipeline, evaluates its performance, and randomly changes parts of the pipeline in search of better-performing algorithms.<\/p>\n<blockquote id=\"72b5\" name=\"72b5\"><p>AutoML algorithms aren&rsquo;t as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) 
in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.<em>&nbsp;(source:&nbsp;<\/em><a data-href=\"http:\/\/epistasislab.github.io\/tpot\/using\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/using\/\" rel=\"noopener noreferrer\" target=\"_blank\"><em>TPOT docs<\/em><\/a><em>)<\/em><\/p><\/blockquote>\n<p id=\"67d4\" name=\"67d4\">The power of TPOT comes from evaluating all kinds of possible pipelines automatically and efficiently. Doing the same search by hand is cumbersome and much slower.<\/p>\n<h3 id=\"eb6d\" name=\"eb6d\">Running TPOT<\/h3>\n<p id=\"717e\" name=\"717e\">Instantiating, fitting, and scoring the TPOT classifier works the same way as for any other sklearn classifier. Here&rsquo;s the format:<\/p>\n<div id=\"5b37\" name=\"5b37\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">from tpot import TPOTClassifier<\/span><\/span><br \/>\n<span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">tpot<\/span><\/span><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\"> = TPOTClassifier()<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">tpot.fit(X_train, y_train)<\/span><\/span><br \/>\n<span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">tpot<\/span><\/span><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">.score(X_test, y_test)<\/span><\/span><\/div>\n<p name=\"12c6\">&nbsp;<\/p>\n<p id=\"12c6\" name=\"12c6\">TPOT comes with its own variation of one-hot encoding. 
Note that TPOT may add this encoder to a pipeline automatically because it treats features with&nbsp;<a data-href=\"https:\/\/github.com\/discdiver\/tpot\/blob\/master\/tpot\/builtins\/one_hot_encoder.py\" href=\"https:\/\/github.com\/discdiver\/tpot\/blob\/master\/tpot\/builtins\/one_hot_encoder.py\" rel=\"noopener noreferrer\" target=\"_blank\">fewer than 10 unique values as categorical<\/a>. If you want to use your own encoding strategy, you can encode your data and&nbsp;<em>then<\/em>&nbsp;feed it into TPOT.<\/p>\n<p id=\"d227\" name=\"d227\">You can choose the scoring criterion through the scoring parameter (although a bug prevents you from combining a custom scoring function with multiple processor cores in a Jupyter notebook).<\/p>\n<p id=\"461a\" name=\"461a\">According to the API docs, TPOT uses this same criterion internally, through cross-validation, as it searches for the best pipeline, and tpot.score reports it on the test set after TPOT has chosen the best algorithms. If you want to judge the final pipeline on a different metric, you can compute it yourself from the fitted pipeline&rsquo;s predictions.<\/p>\n<p id=\"e43d\" name=\"e43d\">TPOT writes information about the best-performing algorithm and its accuracy score to a file with tpot.export(). You can choose the verbosity level you would like to see as TPOT runs and have it write pipelines to an output file as it runs in case it terminates early for some reason (e.g. 
your Kaggle Kernel crashes).<\/p>\n<h3 id=\"6e93\" name=\"6e93\">How long does TPOT take to&nbsp;run?<\/h3>\n<figure id=\"d860\" name=\"d860\">\n<p><img decoding=\"async\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*aZPBqIana8u1uf3RjxlUlA.jpeg\" src=\"https:\/\/cdn-images-1.medium.com\/max\/640\/1*aZPBqIana8u1uf3RjxlUlA.jpeg\" \/><\/p>\n<\/figure>\n<p id=\"0cf3\" name=\"0cf3\">The short answer is that it depends.<\/p>\n<p id=\"1c90\" name=\"1c90\">TPOT was designed to run for a while: hours or even a day, although less complex problems with smaller datasets can see great results in minutes. You can adjust several parameters so that TPOT finishes its search faster, but at the expense of a less thorough search for an optimal pipeline. It was not designed to be a comprehensive search of preprocessing steps, feature selection, algorithms, and parameters, but it can come close if you set its parameters to be more exhaustive.<\/p>\n<p id=\"77b5\" name=\"77b5\">As the docs explain:<\/p>\n<blockquote id=\"1843\" name=\"1843\"><p>&hellip;TPOT will take a while to run on larger datasets, but it&rsquo;s important to realize why. With the default TPOT settings (100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. 
That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search.<\/p><\/blockquote>\n<p id=\"431f\" name=\"431f\">Some of the data sets we&rsquo;ll see below only need a few minutes to find algorithms that score well; others might need days.<\/p>\n<p id=\"9510\" name=\"9510\">Here are the default TPOTClassifier parameters:<\/p>\n<div id=\"100d\" name=\"100d\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">generations=100, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">population_size=100, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">offspring_size=None&nbsp; # Jeff notes this gets set to population_size<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">mutation_rate=0.9, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">crossover_rate=0.1, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">scoring=&quot;Accuracy&quot;,&nbsp; # for Classification<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">cv=5, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">subsample=1.0, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">n_jobs=1,<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">max_time_mins=None, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">max_eval_time_mins=5,<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">random_state=None, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">config_dict=None,<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">warm_start=False, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">memory=None,<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">periodic_checkpoint_folder=None, <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">early_stop=None<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">verbosity=0<\/span><br 
\/>\n<span style=\"background-color:#E6E6FA;\">disable_update_check=False<\/span><\/span><\/div>\n<p name=\"dd56\">&nbsp;<\/p>\n<p id=\"dd56\" name=\"dd56\">A description of each parameter can be found in the&nbsp;<a data-href=\"http:\/\/epistasislab.github.io\/tpot\/api\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/api\/\" rel=\"noopener noreferrer\" target=\"_blank\">docs<\/a>. Here are a few key ones that determine the number of pipelines TPOT will search through:<\/p>\n<div id=\"8ecc\" name=\"8ecc\"><span style=\"font-family:courier new,courier,monospace;\"><strong><span style=\"background-color:#E6E6FA;\">generations:<\/span><\/strong><span style=\"background-color:#E6E6FA;\"> int, optional (default: 100)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.&nbsp;<\/span><\/span><\/div>\n<div name=\"65d9\">&nbsp;<\/div>\n<div name=\"65d9\"><span style=\"font-family:courier new,courier,monospace;\"><strong><span style=\"background-color:#E6E6FA;\">TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total <\/span><\/strong><span style=\"background-color:#E6E6FA;\">(emphasis mine).<\/span><\/span><\/div>\n<div name=\"e290\">&nbsp;<\/div>\n<div name=\"e290\"><span style=\"font-family:courier new,courier,monospace;\"><strong><span style=\"background-color:#E6E6FA;\">population_size:<\/span><\/strong><span style=\"background-color:#E6E6FA;\"> int, optional (default: 100)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">Number of individuals to retain in the GP population every generation. 
<\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.&nbsp;<\/span><\/span><\/div>\n<div name=\"ca8d\">&nbsp;<\/div>\n<div name=\"ca8d\"><span style=\"font-family:courier new,courier,monospace;\"><strong><span style=\"background-color:#E6E6FA;\">offspring_size:<\/span><\/strong><span style=\"background-color:#E6E6FA;\"> int, optional (default: None)<\/span><\/span><br \/>\n<span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">Number<\/span><\/span><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\"> of offspring to produce in each GP generation.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <\/span><br \/>\n<span style=\"background-color:#E6E6FA;\">By default, offspring_size = population_size.<\/span><\/span><\/div>\n<p name=\"2d40\">&nbsp;<\/p>\n<p id=\"2d40\" name=\"2d40\">When starting out with TPOT it&rsquo;s worth setting&nbsp;<em>verbosity=3<\/em>&nbsp;and&nbsp;<em>periodic_checkpoint_folder=&ldquo;any_string_you_like&rdquo;&nbsp;<\/em>so that you can watch the models evolve and training scores improve. 
You&rsquo;ll see some errors as some combinations of pipeline elements are incompatible, but don&rsquo;t sweat that.<\/p>\n<p id=\"b1e8\" name=\"b1e8\">If you&rsquo;re running on multiple cores and not using a custom scoring function, set n_jobs=-1 to use all available cores and speed up TPOT.<\/p>\n<h4 id=\"294c\" name=\"294c\">Search Space<\/h4>\n<p id=\"d957\" name=\"d957\">Here are the classification algorithms and parameters TPOT chooses from as of version 0.9:<\/p>\n<div id=\"f433\" name=\"f433\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.naive_bayes.<\/span><strong><span style=\"background-color:#E6E6FA;\">BernoulliNB<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;alpha&rsquo;: [1e-3, 1e-2, 1e-1, 1., 10., 100.], &lsquo;fit_prior&rsquo;: [True, False] },&nbsp;<\/span><\/span><\/div>\n<div name=\"80a6\">&nbsp;<\/div>\n<div name=\"80a6\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.naive_bayes.<\/span><strong><span style=\"background-color:#E6E6FA;\">MultinomialNB<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;alpha&rsquo;: [1e-3, 1e-2, 1e-1, 1., 10., 100.], &lsquo;fit_prior&rsquo;: [True, False] },&nbsp;<\/span><\/span><\/div>\n<div name=\"1d86\">&nbsp;<\/div>\n<div name=\"1d86\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.tree.<\/span><strong><span style=\"background-color:#E6E6FA;\">DecisionTreeClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;criterion&rsquo;: [&ldquo;gini&rdquo;, &ldquo;entropy&rdquo;], &lsquo;max_depth&rsquo;: range(1, 11), &lsquo;min_samples_split&rsquo;: range(2, 21), &lsquo;min_samples_leaf&rsquo;: range(1, 21) },&nbsp;<\/span><\/span><\/div>\n<div name=\"81e4\">&nbsp;<\/div>\n<div name=\"81e4\"><span 
style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.ensemble.<\/span><strong><span style=\"background-color:#E6E6FA;\">ExtraTreesClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;n_estimators&rsquo;: [100], &lsquo;criterion&rsquo;: [&ldquo;gini&rdquo;, &ldquo;entropy&rdquo;], &lsquo;max_features&rsquo;: np.arange(0.05, 1.01, 0.05), &lsquo;min_samples_split&rsquo;: range(2, 21), &lsquo;min_samples_leaf&rsquo;: range(1, 21), &lsquo;bootstrap&rsquo;: [True, False] },<\/span><\/span><\/div>\n<div name=\"7e15\">&nbsp;<\/div>\n<div name=\"7e15\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.ensemble.<\/span><strong><span style=\"background-color:#E6E6FA;\">RandomForestClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;n_estimators&rsquo;: [100], &lsquo;criterion&rsquo;: [&ldquo;gini&rdquo;, &ldquo;entropy&rdquo;], &lsquo;max_features&rsquo;: np.arange(0.05, 1.01, 0.05), &lsquo;min_samples_split&rsquo;: range(2, 21), &lsquo;min_samples_leaf&rsquo;: range(1, 21), &lsquo;bootstrap&rsquo;: [True, False] },&nbsp;<\/span><\/span><\/div>\n<div name=\"987d\">&nbsp;<\/div>\n<div name=\"987d\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.ensemble.<\/span><strong><span style=\"background-color:#E6E6FA;\">GradientBoostingClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;n_estimators&rsquo;: [100], &lsquo;learning_rate&rsquo;: [1e-3, 1e-2, 1e-1, 0.5, 1.], &lsquo;max_depth&rsquo;: range(1, 11), &lsquo;min_samples_split&rsquo;: range(2, 21), &lsquo;min_samples_leaf&rsquo;: range(1, 21), &lsquo;subsample&rsquo;: np.arange(0.05, 1.01, 0.05), &lsquo;max_features&rsquo;: np.arange(0.05, 1.01, 0.05) },<\/span><\/span><\/div>\n<div name=\"fe30\">&nbsp;<\/div>\n<div name=\"fe30\"><span 
style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.neighbors.<\/span><strong><span style=\"background-color:#E6E6FA;\">KNeighborsClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;n_neighbors&rsquo;: range(1, 101), &lsquo;weights&rsquo;: [&ldquo;uniform&rdquo;, &ldquo;distance&rdquo;], &lsquo;p&rsquo;: [1, 2] },&nbsp;<\/span><\/span><\/div>\n<div name=\"8b1c\">&nbsp;<\/div>\n<div name=\"8b1c\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.svm.<\/span><strong><span style=\"background-color:#E6E6FA;\">LinearSVC<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;penalty&rsquo;: [&ldquo;l1&rdquo;, &ldquo;l2&rdquo;], &lsquo;loss&rsquo;: [&ldquo;hinge&rdquo;, &ldquo;squared_hinge&rdquo;], &lsquo;dual&rsquo;: [True, False], &lsquo;tol&rsquo;: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], &lsquo;C&rsquo;: [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] 
},&nbsp;<\/span><\/span><\/div>\n<div name=\"f11c\">&nbsp;<\/div>\n<div name=\"f11c\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.linear_model.<\/span><strong><span style=\"background-color:#E6E6FA;\">LogisticRegression<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;penalty&rsquo;: [&ldquo;l1&rdquo;, &ldquo;l2&rdquo;], &lsquo;C&rsquo;: [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], &lsquo;dual&rsquo;: [True, False] },&nbsp;<\/span><\/span><\/div>\n<div name=\"3a48\">&nbsp;<\/div>\n<div name=\"3a48\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;xgboost.<\/span><strong><span style=\"background-color:#E6E6FA;\">XGBClassifier<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;n_estimators&rsquo;: [100], &lsquo;max_depth&rsquo;: range(1, 11), &lsquo;learning_rate&rsquo;: [1e-3, 1e-2, 1e-1, 0.5, 1.], &lsquo;subsample&rsquo;: np.arange(0.05, 1.01, 0.05), &lsquo;min_child_weight&rsquo;: range(1, 21), &lsquo;nthread&rsquo;: [1] }<\/span><\/span><\/div>\n<p name=\"e5b6\">&nbsp;<\/p>\n<p id=\"e5b6\" name=\"e5b6\">And TPOT can stack classifiers, including the same classifier multiple times. 
One of the core developers of TPOT explains how it works in&nbsp;<a data-href=\"https:\/\/github.com\/EpistasisLab\/tpot\/issues\/360\" href=\"https:\/\/github.com\/EpistasisLab\/tpot\/issues\/360\" rel=\"noopener noreferrer\" target=\"_blank\">this issue<\/a>:<\/p>\n<blockquote>\n<p name=\"e5b6\">The pipeline&nbsp;<code><em>ExtraTreesClassifier(ExtraTreesClassifier(input_matrix, True, &#39;entropy&#39;, 0.10000000000000001, 13, 6), True, &#39;gini&#39;, 0.75, 17, 4)<\/em><\/code>&nbsp;does the following:<\/p>\n<\/blockquote>\n<blockquote id=\"17e7\" name=\"17e7\"><p>Fit all of the original features using an ExtraTreesClassifier<\/p><\/blockquote>\n<blockquote id=\"20b4\" name=\"20b4\"><p>Take the predictions from that ExtraTreesClassifier and create a new feature using those predictions<\/p><\/blockquote>\n<blockquote id=\"1dc1\" name=\"1dc1\"><p>Pass the original features plus the new &ldquo;predicted feature&rdquo; to the 2nd ExtraTreesClassifier and use its predictions as the final predictions of the pipeline<\/p><\/blockquote>\n<blockquote id=\"0288\" name=\"0288\"><p>This process is called stacking classifiers, which is a fairly common tactic in machine learning.<\/p><\/blockquote>\n<p id=\"cac6\" name=\"cac6\">And here are the 14 preprocessors that could be applied by TPOT as of version 0.9:<\/p>\n<div id=\"d8a6\" name=\"d8a6\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.preprocessing.<\/span><strong><span style=\"background-color:#E6E6FA;\">Binarizer<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;threshold&rsquo;: np.arange(0.0, 1.01, 0.05) },&nbsp;<\/span><\/span><\/div>\n<div name=\"0959\">&nbsp;<\/div>\n<div name=\"0959\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.decomposition.<\/span><strong><span 
style=\"background-color:#E6E6FA;\">FastICA<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;tol&rsquo;: np.arange(0.0, 1.01, 0.05) },&nbsp;<\/span><\/span><\/div>\n<div name=\"652e\">&nbsp;<\/div>\n<div name=\"652e\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.cluster.<\/span><strong><span style=\"background-color:#E6E6FA;\">FeatureAgglomeration<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;linkage&rsquo;: [&lsquo;ward&rsquo;, &lsquo;complete&rsquo;, &lsquo;average&rsquo;], &lsquo;affinity&rsquo;: [&lsquo;euclidean&rsquo;, &lsquo;l1&rsquo;, &lsquo;l2&rsquo;, &lsquo;manhattan&rsquo;, &lsquo;cosine&rsquo;] },&nbsp;<\/span><\/span><\/div>\n<div name=\"a716\">&nbsp;<\/div>\n<div name=\"a716\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.preprocessing.<\/span><strong><span style=\"background-color:#E6E6FA;\">MaxAbsScaler<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { },&nbsp;<\/span><\/span><\/div>\n<div name=\"8af9\">&nbsp;<\/div>\n<div name=\"8af9\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.preprocessing.<\/span><strong><span style=\"background-color:#E6E6FA;\">MinMaxScaler<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { },&nbsp;<\/span><\/span><\/div>\n<div name=\"5224\">&nbsp;<\/div>\n<div name=\"5224\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.preprocessing.<\/span><strong><span style=\"background-color:#E6E6FA;\">Normalizer<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;norm&rsquo;: [&lsquo;l1&rsquo;, &lsquo;l2&rsquo;, &lsquo;max&rsquo;] },&nbsp;<\/span><\/span><\/div>\n<div name=\"41ee\">&nbsp;<\/div>\n<div name=\"41ee\"><span 
style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.kernel_approximation.<\/span><strong><span style=\"background-color:#E6E6FA;\">Nystroem<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;kernel&rsquo;: [&lsquo;rbf&rsquo;, &lsquo;cosine&rsquo;, &lsquo;chi2&rsquo;, &lsquo;laplacian&rsquo;, &lsquo;polynomial&rsquo;, &lsquo;poly&rsquo;, &lsquo;linear&rsquo;, &lsquo;additive_chi2&rsquo;, &lsquo;sigmoid&rsquo;], &lsquo;gamma&rsquo;: np.arange(0.0, 1.01, 0.05), &lsquo;n_components&rsquo;: range(1, 11) },&nbsp;<\/span><\/span><\/div>\n<div name=\"5f67\">&nbsp;<\/div>\n<div name=\"5f67\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.decomposition.<\/span><strong><span style=\"background-color:#E6E6FA;\">PCA<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;svd_solver&rsquo;: [&lsquo;randomized&rsquo;], &lsquo;iterated_power&rsquo;: range(1, 11) }, &lsquo;sklearn.preprocessing.PolynomialFeatures&rsquo;: { &lsquo;degree&rsquo;: [2], &lsquo;include_bias&rsquo;: [False], &lsquo;interaction_only&rsquo;: [False] },&nbsp;<\/span><\/span><\/div>\n<div name=\"3bdc\">&nbsp;<\/div>\n<div name=\"3bdc\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.kernel_approximation.<\/span><strong><span style=\"background-color:#E6E6FA;\">RBFSampler<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;gamma&rsquo;: np.arange(0.0, 1.01, 0.05) }, &lsquo;sklearn.preprocessing.RobustScaler&rsquo;: { },&nbsp;<\/span><\/span><\/div>\n<div name=\"2b61\">&nbsp;<\/div>\n<div name=\"2b61\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;sklearn.preprocessing.<\/span><strong><span style=\"background-color:#E6E6FA;\">StandardScaler<\/span><\/strong><span 
style=\"background-color:#E6E6FA;\">&rsquo;: { }, &lsquo;tpot.builtins.ZeroCount&rsquo;: { },&nbsp;<\/span><\/span><\/div>\n<div name=\"624b\">&nbsp;<\/div>\n<div name=\"624b\"><span style=\"font-family:courier new,courier,monospace;\"><span style=\"background-color:#E6E6FA;\">&lsquo;<\/span><strong><span style=\"background-color:#E6E6FA;\">tpot.builtins.OneHotEncoder<\/span><\/strong><span style=\"background-color:#E6E6FA;\">&rsquo;: { &lsquo;minimum_fraction&rsquo;: [0.05, 0.1, 0.15, 0.2, 0.25], &lsquo;sparse&rsquo;: [False] } (emphasis mine)<\/span><\/span><\/div>\n<p name=\"11a7\">&nbsp;<\/p>\n<p id=\"11a7\" name=\"11a7\">That&rsquo;s a pretty comprehensive list of sklearn machine learning algorithms, and it even includes a few preprocessors you might not have used, such as Nystroem and RBFSampler. The final preprocessing algorithm listed is the custom OneHotEncoder mentioned before. Note that the list contains no neural network algorithms.<\/p>\n<p id=\"ad49\" name=\"ad49\">The number of combinations appears to be nearly infinite: you can stack algorithms, including instances of the same algorithm. There may be an internal cap on the number of steps in the pipeline, but suffice it to say there is a plethora of possible pipelines.<\/p>\n<p id=\"97a4\" name=\"97a4\">TPOT will likely not result in the same algorithm selection if you run it twice (maybe not even if random_state is set, I found). As the&nbsp;<a data-href=\"http:\/\/epistasislab.github.io\/tpot\/using\/\" href=\"http:\/\/epistasislab.github.io\/tpot\/using\/\" rel=\"noopener noreferrer\" target=\"_blank\">docs<\/a>&nbsp;explain:<\/p>\n<blockquote id=\"22d3\" name=\"22d3\"><p>If you&rsquo;re working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. TPOT&rsquo;s optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space. 
When two TPOT runs recommend different pipelines, this means that the TPOT runs didn&rsquo;t converge due to lack of time&nbsp;<em>or<\/em>&nbsp;that multiple pipelines perform more-or-less the same on your dataset.<\/p><\/blockquote>\n<p id=\"36f2\" name=\"36f2\">Less talk &mdash; more action. Try out TPOT on some data!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Automated machine learning doesn&rsquo;t replace the data scientist, but it might be able to help you find good models faster. TPOT bills itself as your Data Science Assistant. TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, and then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.<\/p>\n","protected":false},"author":369,"featured_media":3783,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[2134],"class_list":["post-1047","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":2134,"user_id":369,"is_guest":0,"slug":"jeff-hale","display_name":"Jeff Hale","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Hale","first_name":"Jeff","job_title":"","description":"Jeff Hale is a co-founder of Rebel Desk, where he oversees technology, finance, and operations for this company. 
He&nbsp;is an experienced entrepreneur who has managed technology, operations, and finances for several companies.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1047","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/369"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1047"}],"version-history":[{"count":1,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions"}],"predecessor-version":[{"id":7141,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions\/7141"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3783"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1047"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1047"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1047"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}