Scalable Pipeline: PCA and Logistic Regression using PySpark
Recently I was working on a POC to pipeline PCA followed by Logistic Regression using PySpark.
Even though it seems trivial now, I went through a lot of brainstorming before arriving at this code.
In my POC, I used the following code to read the data from a CSV file:
# 'spark' is an existing SparkSession; my_file_path points to the CSV file
data = spark.read.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load(my_file_path)
# Rename the columns; the last one, 'class', is the label
inp_cols = ['a', 'b', 'c', 'd', 'e', 'class']
df = data.toDF(*inp_cols)
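As an aside, on Spark 2.0+ the CSV reader is built in, so the same read works without the Databricks package (a minimal equivalent, assuming the same my_file_path):
data = spark.read.csv(my_file_path, header=True, inferSchema=True)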
To create a column named “features”, I used VectorAssembler, which assembles the feature columns (everything except the “class” label) into a single vector.
from pyspark.ml.feature import VectorAssembler

# Assemble all columns except the label into one vector column
assembler = VectorAssembler(inputCols=inp_cols[:-1],
                            outputCol='features')
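If you want to sanity-check the assembled vectors before building the pipeline, you can run the assembler on its own (an illustrative check, not part of the final pipeline):
assembler.transform(df).select('features', 'class').show(3, truncate=False)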
Now I can create a pipeline containing VectorAssembler, PCA and Logistic Regression and pass my data-frame as input. One subtle point: Logistic Regression reads the “features” column by default, so it has to be pointed at the PCA output column instead, otherwise the PCA stage is silently ignored.
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.ml.classification import LogisticRegression

pca = PCA(k=2, inputCol='features', outputCol='pcaFeature')
# Train on the PCA output, not the raw assembled features
lr = LogisticRegression(maxIter=10, regParam=0.3,
                        featuresCol='pcaFeature', labelCol='class')
pipeline = Pipeline(stages=[assembler, pca, lr])
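The prediction step below needs a held-out test_df with the same schema as df; a minimal way to produce one is a random split (the 80/20 ratio and seed here are just illustrative):
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)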
Now you can fit the pipeline to get a pipeline model and then use it to perform prediction:
# Fit on the training split, predict on the held-out rows
model = pipeline.fit(train_df)
prediction = model.transform(test_df)
prediction.show()
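To get a quick read on the result, the fitted stages are available on the pipeline model. As a small extension of my own (not strictly part of the POC), the snippet below prints the variance explained by the two principal components and the test accuracy:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Stage 1 of the fitted pipeline is the PCAModel
print(model.stages[1].explainedVariance)

evaluator = MulticlassClassificationEvaluator(labelCol='class',
                                              predictionCol='prediction',
                                              metricName='accuracy')
print('Test accuracy:', evaluator.evaluate(prediction))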
I hope my post saved you a lot of time.