{"id":1519,"date":"2019-02-18T04:39:42","date_gmt":"2019-02-18T04:39:42","guid":{"rendered":"http:\/\/kusuaks7\/?p=1124"},"modified":"2023-08-09T14:16:29","modified_gmt":"2023-08-09T14:16:29","slug":"machine-learning-with-pyspark-and-mlib-solving-a-binary-classification-problem","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/machine-learning-with-pyspark-and-mlib-solving-a-binary-classification-problem\/","title":{"rendered":"Machine Learning with PySpark and MLlib\u200a\u2014\u200aSolving a Binary Classification Problem"},"content":{"rendered":"<p><strong><em>Ready to learn Machine Learning? <a href=\"https:\/\/www.experfy.com\/training\/courses\">Browse courses<\/a>\u00a0like\u00a0<a href=\"https:\/\/www.experfy.com\/training\/courses\/machine-learning-foundations-supervised-learning\">Machine Learning Foundations: Supervised Learning<\/a> developed by industry thought leaders and Experfy in Harvard Innovation Lab.<\/em><\/strong><\/p>\n<h2 id=\"e46e\" style=\"text-align: center;\"><img decoding=\"async\" style=\"width: 650px; height: 249px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*J_4joYwf_HHMbBt8s1Kuqw.jpeg\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/2000\/1*J_4joYwf_HHMbBt8s1Kuqw.jpeg\" \/><\/h2>\n<p style=\"text-align: center;\">Photo Credit: Pixabay<\/p>\n<p id=\"7101\"><a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/spark.apache.org\/\" data->Apache Spark<\/a>, once a component of the\u00a0<a href=\"http:\/\/hadoop.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"http:\/\/hadoop.apache.org\/\" data->Hadoop<\/a>\u00a0ecosystem, is now becoming the big-data platform of choice for enterprises. 
It is a powerful open source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing, with high speed, ease of use, and a standard interface.<\/p>\n<p id=\"4719\">In the industry, there is a big demand for a powerful engine that can do all of the above. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that would enable you to discover new opportunities or avoid risk. Spark is not hard to learn: if you already know Python and SQL, it is very easy to get started. Let\u2019s give it a try today!<\/p>\n<h3 id=\"314b\"><strong>Exploring The\u00a0Data<\/strong><\/h3>\n<p id=\"34b8\">We will use the same data set we used when we\u00a0<a href=\"https:\/\/towardsdatascience.com\/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/towardsdatascience.com\/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8\" data->built a Logistic Regression in Python<\/a>; it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (Yes\/No) to a term deposit. 
The dataset can be downloaded from\u00a0<a href=\"https:\/\/www.kaggle.com\/rouseguy\/bankbalanced\/data\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/www.kaggle.com\/rouseguy\/bankbalanced\/data\" data->Kaggle<\/a>.<\/p>\n<p id=\"1f59\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.sql import SparkSession<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">spark = SparkSession.builder.appName(&#8216;ml-bank&#8217;).getOrCreate()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df = spark.read.csv(&#8216;bank.csv&#8217;, header = True, inferSchema = True)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df.printSchema()<\/span><\/span><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*TULzFyy4X2Y4ijn1_Evy1A.png\" \/><\/p>\n<p style=\"text-align: center;\">Figure 1<\/p>\n<p id=\"bdb0\">Input variables: age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome.<\/p>\n<p id=\"2559\">Output variable: deposit<\/p>\n<p id=\"35a4\">Have a peek of the first five observations. 
A pandas DataFrame displays more neatly than the output of Spark DataFrame.show().<\/p>\n<p id=\"2796\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">import pandas as pd<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">pd.DataFrame(df.take(5), columns=df.columns).transpose()<\/span><\/span><\/p>\n<figure id=\"d03f\"><canvas width=\"75\" height=\"72\"><\/canvas><img decoding=\"async\" style=\"width: 508px; height: 494px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*xhryNF7_oYQr8p8kFKcztQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*xhryNF7_oYQr8p8kFKcztQ.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 2<\/p>\n<p id=\"0835\">Our classes are perfectly balanced.<\/p>\n<p id=\"8469\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">df.groupBy(&#8216;deposit&#8217;).count().toPandas()<\/span><\/span><\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*3tToVqZhiZEQYJnn-nVgeQ.png\" \/><\/p>\n<p style=\"text-align: center;\">Figure 3<\/p>\n<p id=\"d9ce\"><strong>Summary statistics for numeric variables<\/strong><\/p>\n<p id=\"689b\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">numeric_features = [t[0] for t in df.dtypes if t[1] == &#8216;int&#8217;]<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df.select(numeric_features).describe().toPandas().transpose()<\/span><\/span><\/p>\n<figure id=\"a3cf\"><canvas width=\"75\" height=\"33\"><\/canvas><img decoding=\"async\" style=\"width: 553px; height: 255px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*isbhHT-K4BXHGkRIMvuVOQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*isbhHT-K4BXHGkRIMvuVOQ.png\" \/><\/figure>\n<p style=\"text-align: 
center;\">Figure 4<\/p>\n<p id=\"fedf\"><strong>Correlations between independent variables<\/strong>.<\/p>\n<p id=\"f75b\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">numeric_data = df.select(numeric_features).toPandas()<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">axs = pd.plotting.scatter_matrix(numeric_data, figsize=(8, 8))<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">n = len(numeric_data.columns)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">for i in range(n):<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 v = axs[i, 0]<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 v.yaxis.label.set_rotation(0)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 v.yaxis.label.set_ha(&#8216;right&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 v.set_yticks(())<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 h = axs[n-1, i]<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 h.xaxis.label.set_rotation(90)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 h.set_xticks(())<\/span><\/span><\/p>\n<figure id=\"3c9e\"><canvas width=\"75\" height=\"67\"><\/canvas><img decoding=\"async\" style=\"width: 560px; height: 508px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zdOEcTGPGEYPWR0Nte4nvQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zdOEcTGPGEYPWR0Nte4nvQ.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 5<\/p>\n<p id=\"d32c\">The scatter matrix shows no highly correlated pairs of numeric variables. Therefore, we will keep all of them for the model. 
However, the day and month columns are not really useful, so we will remove them.<\/p>\n<div><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">df = df.select(&#8216;age&#8217;, &#8216;job&#8217;, &#8216;marital&#8217;, &#8216;education&#8217;, &#8216;default&#8217;, <\/span><\/span><\/div>\n<div><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">&#8216;balance&#8217;, &#8216;housing&#8217;, &#8216;loan&#8217;, &#8216;contact&#8217;, &#8216;duration&#8217;, &#8216;campaign&#8217;,<\/span><\/span><\/div>\n<div><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">&#8216;pdays&#8217;, &#8216;previous&#8217;, &#8216;poutcome&#8217;, &#8216;deposit&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">cols = df.columns<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df.printSchema()<\/span><\/span><\/div>\n<figure id=\"6402\"><canvas width=\"75\" height=\"48\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*lNOW_Pgz36NDeHvteQeBEQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*lNOW_Pgz36NDeHvteQeBEQ.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 6<\/p>\n<h3 id=\"a1ef\"><strong>Preparing Data for Machine\u00a0Learning<\/strong><\/h3>\n<p id=\"c0b6\">The process includes Category Indexing, One-Hot Encoding and VectorAssembler, a feature transformer that merges multiple columns into a vector column.<\/p>\n<p id=\"e5b8\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">categoricalColumns = 
[&#8216;job&#8217;, &#8216;marital&#8217;, &#8216;education&#8217;, &#8216;default&#8217;, <\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">&#8216;housing&#8217;, &#8216;loan&#8217;, &#8216;contact&#8217;, &#8216;poutcome&#8217;]<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">stages = []<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">for categoricalCol in categoricalColumns:<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 stringIndexer = StringIndexer(inputCol = categoricalCol, <\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">outputCol = categoricalCol + &#8216;Index&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 encoder = OneHotEncoderEstimator(inputCols=<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">[stringIndexer.getOutputCol()], outputCols=[categoricalCol +<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0&#8220;classVec&#8221;])<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0 stages += [stringIndexer, encoder]<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">label_stringIdx = StringIndexer(inputCol = &#8216;deposit&#8217;, outputCol = &#8216;label&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">stages += [label_stringIdx]<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">numericCols = [&#8216;age&#8217;, &#8216;balance&#8217;, &#8216;duration&#8217;, &#8216;campaign&#8217;, &#8216;pdays&#8217;,<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0&#8216;previous&#8217;]<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">assemblerInputs = [c + &#8220;classVec&#8221; for c in categoricalColumns] +<\/span><br \/>\n<span 
style=\"background-color: #f0f8ff;\">\u00a0numericCols<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">assembler = VectorAssembler(inputCols=assemblerInputs, <\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">outputCol=&#8221;features&#8221;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">stages += [assembler]<\/span><\/span><\/p>\n<p id=\"2b3e\">The above code is taken from\u00a0<a href=\"https:\/\/docs.databricks.com\/spark\/latest\/mllib\/binary-classification-mllib-pipelines.html\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/docs.databricks.com\/spark\/latest\/mllib\/binary-classification-mllib-pipelines.html\" data->Databricks\u2019 official site<\/a>. It indexes each categorical column using the StringIndexer, then converts the indexed categories into one-hot encoded variables. The resulting output has the binary vectors appended to the end of each row. We use the StringIndexer again to encode our labels to label indices. Next, we use the VectorAssembler to combine all the feature columns into a single vector column.<\/p>\n<p id=\"8fae\"><strong>Pipeline<\/strong><\/p>\n<p id=\"a6ba\">We use Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. 
A Pipeline\u2019s stages are specified as an ordered array.<\/p>\n<p id=\"39fa\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml import Pipeline<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">pipeline = Pipeline(stages = stages)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">pipelineModel = pipeline.fit(df)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df = pipelineModel.transform(df)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">selectedCols = [&#8216;label&#8217;, &#8216;features&#8217;] + cols<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df = df.select(selectedCols)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">df.printSchema()<\/span><\/span><\/p>\n<figure id=\"cdca\"><canvas width=\"75\" height=\"58\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*DxErv3vt9xYXBKNybxhTLw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*DxErv3vt9xYXBKNybxhTLw.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 7<\/p>\n<p id=\"09d8\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">pd.DataFrame(df.take(5), columns=df.columns).transpose()<\/span><\/span><\/p>\n<figure id=\"2b02\"><canvas width=\"75\" height=\"37\"><\/canvas><img decoding=\"async\" style=\"width: 650px; height: 332px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*FKhz3gm81yaZM6ivhrAfWg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*FKhz3gm81yaZM6ivhrAfWg.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 8<\/p>\n<p id=\"d4d4\">As you can see, we now have features column and label column.<\/p>\n<p id=\"83ed\">Randomly split data into train and test sets, and set seed for reproducibility.<\/p>\n<p id=\"c561\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">train, 
test = df.randomSplit([0.7, 0.3], seed = 2018)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">print(&#8220;Training Dataset Count: &#8221; + str(train.count()))<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">print(&#8220;Test Dataset Count: &#8221; + str(test.count()))<\/span><\/span><\/p>\n<p id=\"8bb3\"><strong><em>Training Dataset Count: 7764<br \/>\nTest Dataset Count: 3398<\/em><\/strong><\/p>\n<h3 id=\"0aca\">Logistic Regression Model<\/h3>\n<p id=\"3ccc\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.classification import LogisticRegression<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">lr = LogisticRegression(featuresCol = &#8216;features&#8217;, labelCol = &#8216;label&#8217;, maxIter=10)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">lrModel = lr.fit(train)<\/span><\/span><\/p>\n<p id=\"9515\">We can obtain the\u00a0<strong>coefficients<\/strong>\u00a0by using LogisticRegressionModel\u2019s attributes.<\/p>\n<p id=\"2a9b\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">import matplotlib.pyplot as plt<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">import numpy as np<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">beta = np.sort(lrModel.coefficients)<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">plt.plot(beta)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.ylabel(&#8216;Beta Coefficients&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.show()<\/span><\/span><\/p>\n<figure id=\"5fe4\"><canvas width=\"75\" height=\"45\"><\/canvas><img decoding=\"async\" 
src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Z3ihFTH7jIE-9WyL6sVuIg.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*Z3ihFTH7jIE-9WyL6sVuIg.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 9<\/p>\n<p id=\"2e7a\">By summarizing the model over the training set, we can also obtain the\u00a0<strong>receiver-operating characteristic and areaUnderROC<\/strong>.<\/p>\n<p id=\"d6e9\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">trainingSummary = lrModel.summary<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">roc = trainingSummary.roc.toPandas()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.plot(roc[&#8216;FPR&#8217;],roc[&#8216;TPR&#8217;])<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.ylabel(&#8216;True Positive Rate&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.xlabel(&#8216;False Positive Rate&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.title(&#8216;ROC Curve&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.show()<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">print(&#8216;Training set areaUnderROC: &#8216; + str(trainingSummary.areaUnderROC))<\/span><\/span><\/p>\n<figure id=\"5b48\"><canvas width=\"75\" height=\"48\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*PcKw72q-RncZWOwhPItUYA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*PcKw72q-RncZWOwhPItUYA.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 10<\/p>\n<p id=\"e63f\"><strong>Precision and recall<\/strong>.<\/p>\n<p id=\"9718\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">pr = trainingSummary.pr.toPandas()<\/span><br \/>\n<span 
style=\"background-color: #f0f8ff;\">plt.plot(pr[&#8216;recall&#8217;],pr[&#8216;precision&#8217;])<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.ylabel(&#8216;Precision&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.xlabel(&#8216;Recall&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">plt.show()<\/span><\/span><\/p>\n<figure id=\"551b\"><canvas width=\"75\" height=\"45\"><\/canvas><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*6phlTrWSfaT53zh8_3LWRQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*6phlTrWSfaT53zh8_3LWRQ.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 11<\/p>\n<p id=\"d223\"><strong>Make predictions on the test set<\/strong>.<\/p>\n<p id=\"7f66\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">predictions = lrModel.transform(test)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions.select(&#8216;age&#8217;, &#8216;job&#8217;, &#8216;label&#8217;, &#8216;rawPrediction&#8217;, &#8216;prediction&#8217;, &#8216;probability&#8217;).show(10)<\/span><\/span><\/p>\n<figure id=\"2d81\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" style=\"width: 650px; height: 264px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*23XUWWuvbjp5ADgIoaZApw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*23XUWWuvbjp5ADgIoaZApw.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 12<\/p>\n<p id=\"3962\"><strong>Evaluate our Logistic Regression model<\/strong>.<\/p>\n<p id=\"1eb4\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from <\/span><span style=\"background-color: #f0f8ff;\">pyspark<\/span><span style=\"background-color: #f0f8ff;\">.ml.evaluation import BinaryClassificationEvaluator<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span 
style=\"background-color: #f0f8ff;\">evaluator = BinaryClassificationEvaluator()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">print(&#8216;Test Area Under ROC&#8217;, evaluator.evaluate(predictions))<\/span><\/span><\/p>\n<p id=\"3868\"><strong><em>Test Area Under ROC 0.8858324614449619<\/em><\/strong><\/p>\n<h3 id=\"a190\"><strong>Decision Tree Classifier<\/strong><\/h3>\n<p id=\"9fc6\">Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multi-class classification, do not require feature scaling, and are able to capture non-linearities and feature interactions.<\/p>\n<p id=\"8786\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.classification import DecisionTreeClassifier<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">dt = DecisionTreeClassifier(featuresCol = &#8216;features&#8217;, labelCol = &#8216;label&#8217;, maxDepth = 3)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">dtModel = dt.fit(train)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions = dtModel.transform(test)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions.select(&#8216;age&#8217;, &#8216;job&#8217;, &#8216;label&#8217;, &#8216;rawPrediction&#8217;, &#8216;prediction&#8217;, &#8216;probability&#8217;).show(10)<\/span><\/span><\/p>\n<figure id=\"a94e\"><canvas width=\"75\" height=\"31\"><\/canvas><img decoding=\"async\" style=\"width: 622px; height: 269px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*9v_WdJYGdwNwvikEBPAfaw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*9v_WdJYGdwNwvikEBPAfaw.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 13<\/p>\n<p id=\"a0db\"><strong>Evaluate our Decision Tree model<\/strong>.<\/p>\n<p id=\"f71d\"><span style=\"font-family: courier 
new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">evaluator = BinaryClassificationEvaluator()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">print(&#8220;Test Area Under ROC: &#8221; + str(evaluator.evaluate(predictions, {evaluator.metricName: &#8220;areaUnderROC&#8221;})))<\/span><\/span><\/p>\n<p id=\"fe86\"><strong><em>Test Area Under ROC: 0.7807240050065357<\/em><\/strong><\/p>\n<p id=\"4f6a\">One simple decision tree performed poorly because it is too weak given the range of different features. The prediction accuracy of decision trees can be improved by Ensemble methods, such as Random Forest and Gradient-Boosted Tree.<\/p>\n<h3 id=\"67ec\"><strong>Random Forest Classifier<\/strong><\/h3>\n<p id=\"5995\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.classification import RandomForestClassifier<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">rf = RandomForestClassifier(featuresCol = &#8216;features&#8217;, labelCol = &#8216;label&#8217;)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">rfModel = rf.fit(train)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions = rfModel.transform(test)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions.select(&#8216;age&#8217;, &#8216;job&#8217;, &#8216;label&#8217;, &#8216;rawPrediction&#8217;, &#8216;prediction&#8217;, &#8216;probability&#8217;).show(10)<\/span><\/span><\/p>\n<figure id=\"b777\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" style=\"width: 650px; height: 265px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*rmkfZ6sW2DqNVEENdqQZrw.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*rmkfZ6sW2DqNVEENdqQZrw.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 14<\/p>\n<p id=\"8d53\"><strong>Evaluate our Random Forest 
Classifier<\/strong>.<\/p>\n<p id=\"a9a3\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">evaluator = BinaryClassificationEvaluator()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">print(&#8220;Test Area Under ROC: &#8221; + str(evaluator.evaluate(predictions, {evaluator.metricName: &#8220;areaUnderROC&#8221;})))<\/span><\/span><\/p>\n<p id=\"8c99\"><strong><em>Test Area Under ROC: 0.8846453518867426<\/em><\/strong><\/p>\n<h3 id=\"ee12\"><strong>Gradient-Boosted Tree Classifier<\/strong><\/h3>\n<p id=\"fb42\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.classification import GBTClassifier<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">gbt = GBTClassifier(maxIter=10)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">gbtModel = gbt.fit(train)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions = gbtModel.transform(test)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions.select(&#8216;age&#8217;, &#8216;job&#8217;, &#8216;label&#8217;, &#8216;rawPrediction&#8217;, &#8216;prediction&#8217;, &#8216;probability&#8217;).show(10)<\/span><\/span><\/p>\n<figure id=\"c6ce\"><canvas width=\"75\" height=\"30\"><\/canvas><img decoding=\"async\" style=\"width: 650px; height: 260px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*g2uohpSKLFm4cO8qnso9SA.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*g2uohpSKLFm4cO8qnso9SA.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 15<\/p>\n<p id=\"4935\"><strong>Evaluate our Gradient-Boosted Tree Classifier.<\/strong><\/p>\n<p id=\"c2a7\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">evaluator = BinaryClassificationEvaluator()<\/span><br \/>\n<span style=\"background-color: 
#f0f8ff;\">print(&#8220;Test Area Under ROC: &#8221; + str(evaluator.evaluate(predictions, {evaluator.metricName: &#8220;areaUnderROC&#8221;})))<\/span><\/span><\/p>\n<p id=\"baf1\"><strong><em>Test Area Under ROC: 0.8940728473145346<\/em><\/strong><\/p>\n<p id=\"0596\">Gradient-Boosted Tree achieved the best results, so we will try tuning this model with the ParamGridBuilder and the CrossValidator. Before that, we can use explainParams() to print a list of all params and their definitions to see which params are available for tuning.<\/p>\n<p id=\"7188\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">print(gbt.explainParams())<\/span><\/span><\/p>\n<figure id=\"8f79\"><canvas width=\"75\" height=\"31\"><\/canvas><img decoding=\"async\" style=\"width: 650px; height: 279px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*voLTp-IKD8u5cQkdwYc2Ag.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*voLTp-IKD8u5cQkdwYc2Ag.png\" \/><\/figure>\n<p style=\"text-align: center;\">Figure 16<\/p>\n<p id=\"1f22\"><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">from pyspark.ml.tuning import ParamGridBuilder, CrossValidator<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">paramGrid = (ParamGridBuilder()<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .addGrid(gbt.maxDepth, [2, 4, 6])<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .addGrid(gbt.maxBins, [20, 60])<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .addGrid(gbt.maxIter, [10, 20])<\/span><br \/>\n<span style=\"background-color: 
#f0f8ff;\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .build())<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\">cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)<\/span><\/span><\/p>\n<p><span style=\"font-family: courier new,courier,monospace;\"><span style=\"background-color: #f0f8ff;\"># Run cross validations.\u00a0 This can take about 6 minutes since it is training over 20 trees!<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">cvModel = cv.fit(train)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">predictions = cvModel.transform(test)<\/span><br \/>\n<span style=\"background-color: #f0f8ff;\">evaluator.evaluate(predictions)<\/span><\/span><\/p>\n<p id=\"c81e\"><strong><em>0.8981050997838095<\/em><\/strong><\/p>\n<p id=\"ffc0\">To sum it up, we have learned how to build a binary classification application using PySpark and MLlib Pipelines API. We tried four algorithms and gradient boosting performed best on our data set.<\/p>\n<p id=\"7434\">Source code can be found on\u00a0<a href=\"https:\/\/github.com\/susanli2016\/PySpark-and-MLlib\/blob\/master\/Machine%20Learning%20PySpark%20and%20MLlib.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/github.com\/susanli2016\/PySpark-and-MLlib\/blob\/master\/Machine%20Learning%20PySpark%20and%20MLlib.ipynb\" data->Github<\/a>.<\/p>\n<p><a href=\"https:\/\/spark.apache.org\/docs\/2.1.0\/ml-classification-regression.html#linear-regression\" target=\"_blank\" rel=\"noopener noreferrer\" data-href=\"https:\/\/spark.apache.org\/docs\/2.1.0\/ml-classification-regression.html#linear-regression\" data->Reference: Apache Spark 2.1.0<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Spark is now becoming the big-data platform of choice for enterprises. 
It is a powerful open source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing, with high speed, ease of use, and a standard interface. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that would enable you to discover new opportunities or avoid risk. Spark is not hard to learn if you already know Python and SQL. Learn how to build a binary classification application using PySpark and MLlib Pipelines API.&nbsp;<\/p>\n","protected":false},"author":255,"featured_media":14359,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-post-2.php","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97],"ppma_author":[2872],"class_list":["post-1519","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence"],"authors":[{"term_id":2872,"user_id":255,"is_guest":0,"slug":"susan-li","display_name":"Susan Li","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Li","first_name":"Susan","job_title":"","description":"Susan Li, Data Scientist at <a href=\"http:\/\/www.waveapps.com\/\">Wave Financial<\/a>, is helping organizations realize the potential of big data and advanced analytics. 
&nbsp;Her specialities include Machine learning, data mining, and predictive modeling, &nbsp;R, Python, SQL, and data visualization tools."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1519","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/255"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1519"}],"version-history":[{"count":3,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1519\/revisions"}],"predecessor-version":[{"id":30153,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1519\/revisions\/30153"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/14359"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1519"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1519"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1519"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1519"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}