Machine Learning With PySpark And MLlib — Solving A Binary Classification Problem

by Susan Li
February 18, 2019
in AI & Machine Learning
16 min read



Photo Credit: Pixabay

Apache Spark, which began as a component of the Hadoop ecosystem, has become the big-data platform of choice for enterprises. It is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing, with high speed, ease of use, and a standard interface.

There is a big demand in industry for a powerful engine that can do all of the above. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that let you discover new opportunities or avoid risk. Spark is not hard to learn; if you already know Python and SQL, it is very easy to get started. Let’s give it a try today!

Exploring The Data

We will use the same data set as when we built a logistic regression model in Python: it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe (yes/no) to a term deposit. The dataset can be downloaded from Kaggle.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ml-bank').getOrCreate()
df = spark.read.csv('bank.csv', header=True, inferSchema=True)
df.printSchema()

Figure 1

Input variables: age, job, marital, education, default, balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome.

Output variable: deposit

Have a peek at the first five observations. A pandas DataFrame renders more readably than Spark’s DataFrame.show().

import pandas as pd
pd.DataFrame(df.take(5), columns=df.columns).transpose()

Figure 2

Our classes are perfectly balanced.

# Check the class balance of the target column.
df.groupBy('deposit').count().show()

Figure 3

Summary statistics for numeric variables

numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']
df.select(numeric_features).describe().toPandas().transpose()

Figure 4

Correlations between independent variables.

numeric_data = df.select(numeric_features).toPandas()

# scatter_matrix lives in pandas.plotting in current pandas versions
axs = pd.plotting.scatter_matrix(numeric_data, figsize=(8, 8))

# Rotate the axis labels and drop the ticks so the matrix stays readable.
n = len(numeric_data.columns)
for i in range(n):
    v = axs[i, 0]
    v.yaxis.label.set_rotation(0)
    v.yaxis.label.set_ha('right')
    v.set_yticks(())
    h = axs[n-1, i]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())

Figure 5

It’s clear that there are no highly correlated numeric variables, so we will keep all of them for the model. However, the day and month columns are not really useful, so we will remove these two columns.
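
To put a number on that visual check, one can also print the pandas correlation matrix; a minimal sketch reusing the numeric_data frame from above:

# Pairwise Pearson correlations; values near zero back up the claim
# that no pair of numeric variables is highly correlated.
print(numeric_data.corr())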

df = df.select('age', 'job', 'marital', 'education', 'default',
               'balance', 'housing', 'loan', 'contact', 'duration',
               'campaign', 'pdays', 'previous', 'poutcome', 'deposit')
cols = df.columns
df.printSchema()

Figure 6

Preparing Data for Machine Learning

The process includes Category Indexing, One-Hot Encoding and VectorAssembler — a feature transformer that merges multiple columns into a vector column.

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ['job', 'marital', 'education', 'default',
                      'housing', 'loan', 'contact', 'poutcome']
stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + 'Index')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + 'classVec'])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]

numericCols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
assemblerInputs = [c + 'classVec' for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol='features')
stages += [assembler]

The above code is adapted from Databricks’ official site. It indexes each categorical column using the StringIndexer, then converts the indexed categories into one-hot encoded variables; the resulting binary vectors are appended to the end of each row. We use the StringIndexer again to encode our labels to label indices. Next, we use the VectorAssembler to combine all the feature columns into a single vector column.
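
To see what the indexing stage does in isolation, here is a minimal sketch (the choice of the marital column is purely illustrative) that fits a single StringIndexer and shows the mapping it learns:

# Each distinct category is mapped to a numeric index, ordered by frequency.
indexed = StringIndexer(inputCol='marital', outputCol='maritalIndex').fit(df).transform(df)
indexed.select('marital', 'maritalIndex').distinct().show()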

Pipeline

We use Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. A Pipeline’s stages are specified as an ordered array.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
df.printSchema()

Figure 7

pd.DataFrame(df.take(5), columns=df.columns).transpose()

Figure 8

As you can see, we now have a features column and a label column.

Randomly split data into train and test sets, and set seed for reproducibility.

train, test = df.randomSplit([0.7, 0.3], seed=2018)
print('Training Dataset Count: ' + str(train.count()))
print('Test Dataset Count: ' + str(test.count()))

Training Dataset Count: 7764
Test Dataset Count: 3398
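
Since the split is random, a quick sanity check that both subsets keep roughly the same class balance is worthwhile; a minimal sketch:

# Class counts per subset; the proportions should mirror the full data set.
train.groupBy('deposit').count().show()
test.groupBy('deposit').count().show()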

Logistic Regression Model

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lrModel = lr.fit(train)

We can obtain the coefficients by using LogisticRegressionModel’s attributes.

import matplotlib.pyplot as plt
import numpy as np

beta = np.sort(lrModel.coefficients)

plt.plot(beta)
plt.ylabel('Beta Coefficients')
plt.show()

Figure 9

By summarizing the model over the training set, we can also obtain the receiver operating characteristic (ROC) curve and the areaUnderROC.

trainingSummary = lrModel.summary

roc = trainingSummary.roc.toPandas()
plt.plot(roc['FPR'], roc['TPR'])
# Label the axes to match what is plotted: FPR on x, TPR on y.
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve')
plt.show()

print('Training set areaUnderROC: ' + str(trainingSummary.areaUnderROC))

Figure 10

Precision and recall.

pr = trainingSummary.pr.toPandas()
plt.plot(pr['recall'], pr['precision'])
plt.ylabel('Precision')
plt.xlabel('Recall')
plt.show()

Figure 11

Make predictions on the test set.

predictions = lrModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

Figure 12

Evaluate our Logistic Regression model.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))

Test Area Under ROC 0.8858324614449619
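
BinaryClassificationEvaluator also supports the area under the precision-recall curve; as a small sketch, the same evaluator can report it by switching the metric name:

# 'areaUnderPR' is the evaluator's other supported metric.
print('Test Area Under PR', evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'}))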

Decision Tree Classifier

Decision trees are widely used because they are easy to interpret, handle categorical features, extend to multi-class classification, do not require feature scaling, and are able to capture non-linearities and feature interactions.

from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol='features', labelCol='label', maxDepth=3)
dtModel = dt.fit(train)
predictions = dtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

Figure 13
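
Since interpretability is one of the selling points mentioned above, it is worth noting that the fitted tree can print its decision rules; a quick sketch using the dtModel fitted above:

# Prints the tree as nested if/else splits; feature indices refer to
# positions in the assembled 'features' vector.
print(dtModel.toDebugString)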

Evaluate our Decision Tree model.

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC: ' + str(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))

Test Area Under ROC: 0.7807240050065357

A single decision tree performed poorly because it is too weak a learner given the range of different features. The prediction accuracy of decision trees can be improved by ensemble methods, such as Random Forest and Gradient-Boosted Trees.

Random Forest Classifier

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='features', labelCol='label')
rfModel = rf.fit(train)
predictions = rfModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

Figure 14
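
Random forests also expose per-feature importance scores, which help explain what drives the predictions; a minimal sketch using the rfModel fitted above:

# A sparse vector of importances, indexed like the assembled 'features' column.
print(rfModel.featureImportances)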

Evaluate our Random Forest Classifier.

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC: ' + str(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))

Test Area Under ROC: 0.8846453518867426

Gradient-Boosted Tree Classifier

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
predictions = gbtModel.transform(test)
predictions.select('age', 'job', 'label', 'rawPrediction', 'prediction', 'probability').show(10)

Figure 15

Evaluate our Gradient-Boosted Tree Classifier.

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC: ' + str(evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderROC'})))

Test Area Under ROC: 0.8940728473145346

The Gradient-Boosted Tree achieved the best result, so we will try tuning this model with the ParamGridBuilder and the CrossValidator. Before that, we can use explainParams() to print a list of all params and their definitions, to understand which parameters are available for tuning.

print(gbt.explainParams())

Figure 16

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())

cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(train)
predictions = cvModel.transform(test)
evaluator.evaluate(predictions)

0.8981050997838095
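
To see which parameter combination cross-validation selected, the avgMetrics of cvModel line up one-to-one with the entries of paramGrid; a minimal sketch:

# Pair each parameter map with its average cross-validated AUC and take the best.
bestMetric, bestParams = max(zip(cvModel.avgMetrics, paramGrid), key=lambda x: x[0])
print('Best average AUC:', bestMetric)
for param, value in bestParams.items():
    print(param.name, '=', value)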

To sum it up, we have learned how to build a binary classification application using PySpark and the MLlib Pipelines API. We tried four algorithms, and gradient boosting performed best on our data set.
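
As a compact recap, the four untuned models can be compared in a single loop; a minimal sketch, assuming the train and test splits from above:

from pyspark.ml.classification import (LogisticRegression, DecisionTreeClassifier,
                                       RandomForestClassifier, GBTClassifier)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

candidates = {'Logistic Regression': LogisticRegression(maxIter=10),
              'Decision Tree': DecisionTreeClassifier(maxDepth=3),
              'Random Forest': RandomForestClassifier(),
              'Gradient-Boosted Trees': GBTClassifier(maxIter=10)}
evaluator = BinaryClassificationEvaluator()
for name, clf in candidates.items():
    # Fit on the training split, then score test AUC for each model.
    print(name, evaluator.evaluate(clf.fit(train).transform(test)))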

Source code can be found on GitHub.

Reference: Apache Spark 2.1.0

Tags: Artificial Intelligence

Susan Li

Susan Li, Data Scientist at Wave Financial, is helping organizations realize the potential of big data and advanced analytics. Her specialties include machine learning, data mining, predictive modeling, R, Python, SQL, and data visualization tools.
