Scalable Pipeline: PCA and Logistic Regression using PySpark
Recently I was working on a POC to pipeline PCA followed by Logistic Regression using PySpark.
Even though it seems trivial now, I went through a lot of brainstorming before arriving at this code.
In my POC, I used the following code to read the data from a CSV file:
# 'spark' is an existing SparkSession; my_file_path points to the CSV file
data = spark.read.format('com.databricks.spark.csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load(my_file_path)
# Rename the columns; the last one, 'class', is the label
inp_cols = ['a', 'b', 'c', 'd', 'e', 'class']
df = data.toDF(*inp_cols)
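As an aside, on Spark 2.0+ the CSV reader is built in, so the same read works without the Databricks package (a minimal equivalent, assuming the same my_file_path):
data = spark.read.csv(my_file_path, header=True, inferSchema=True)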
To create a column named “features”, I used VectorAssembler, which assembles the feature columns (everything except the “class” label) into a single vector.
from pyspark.ml.feature import VectorAssembler

# Assemble all columns except the label into one vector column
assembler = VectorAssembler(inputCols=inp_cols[:-1],
                            outputCol='features')
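If you want to sanity-check the assembled vectors before building the pipeline, you can run the assembler on its own (an illustrative check, not part of the final pipeline):
assembler.transform(df).select('features', 'class').show(3, truncate=False)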
Now I can create a pipeline containing VectorAssembler, PCA and Logistic Regression and pass my data-frame as input. One subtle point: Logistic Regression reads the “features” column by default, so it has to be pointed at the PCA output column instead, otherwise the PCA stage is silently ignored.
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from pyspark.ml.classification import LogisticRegression

pca = PCA(k=2, inputCol='features', outputCol='pcaFeature')
# Train on the PCA output, not the raw assembled features
lr = LogisticRegression(maxIter=10, regParam=0.3,
                        featuresCol='pcaFeature', labelCol='class')
pipeline = Pipeline(stages=[assembler, pca, lr])
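The prediction step below needs a held-out test_df with the same schema as df; a minimal way to produce one is a random split (the 80/20 ratio and seed here are just illustrative):
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)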
Now you can fit the pipeline to get a pipeline model and then use it to perform prediction:
# Fit on the training split, predict on the held-out rows
model = pipeline.fit(train_df)
prediction = model.transform(test_df)
prediction.show()
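To get a quick read on the result, the fitted stages are available on the pipeline model. As a small extension of my own (not strictly part of the POC), the snippet below prints the variance explained by the two principal components and the test accuracy:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Stage 1 of the fitted pipeline is the PCAModel
print(model.stages[1].explainedVariance)

evaluator = MulticlassClassificationEvaluator(labelCol='class',
                                              predictionCol='prediction',
                                              metricName='accuracy')
print('Test accuracy:', evaluator.evaluate(prediction))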
I hope my post saved you a lot of time.