Create feature vector programmatically in Spark ML / pyspark

I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in PySpark when the features are spread across multiple numeric columns.

For example, a row of the Iris dataset:

(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

I'd like to use KMeans without rebuilding the DataFrame by hand, i.e. without manually adding the feature vector as a new column and hardcoding the original column names repeatedly in the code.

The solution I'd like to improve:

from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel
iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
df = iris.map(lambda r: Row(
    id=r.id, a1=r.a1, a2=r.a2, a3=r.a3, a4=r.a4,
    label=r.label, binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))).toDF()
kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)

I'm looking for a solution along these lines:

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<dataframe independent code for KMeans>
<New dataframe is created, extended with the `prediction` column.>

1 Answer

You can use VectorAssembler. Here df is assumed to be the original DataFrame without a features column (iris in the question):

from pyspark.ml.feature import VectorAssembler
ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')
assembler.transform(df)

It can be combined with k-means using an ML Pipeline:

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
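To get closer to the "DataFrame-independent" shape asked for in the question, the two steps can be wrapped in a small function. This is only a sketch: feature_columns and add_kmeans_prediction are hypothetical helper names (not part of PySpark), k=3 is an assumed default, and the temporary column name features_tmp is assumed not to clash with an existing column.

```python
def feature_columns(all_columns, ignore):
    """Return the columns to assemble, preserving their original order."""
    ignored = set(ignore)
    return [c for c in all_columns if c not in ignored]


def add_kmeans_prediction(df, feature_cols, prediction_col="prediction", k=3):
    """Fit k-means on feature_cols and return df plus a prediction column.

    Hypothetical helper for illustration, not a PySpark API.
    """
    # Imported lazily so feature_columns() stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    tmp_col = "features_tmp"  # assumption: no existing column has this name
    assembler = VectorAssembler(inputCols=feature_cols, outputCol=tmp_col)
    kmeans = KMeans(featuresCol=tmp_col, predictionCol=prediction_col, k=k)
    model = Pipeline(stages=[assembler, kmeans]).fit(df)
    # transform() appends the vector and the prediction; drop the vector again.
    return model.transform(df).drop(tmp_col)
```

With the DataFrame from the question, this would be called as add_kmeans_prediction(iris, ["a1", "a2", "a3", "a4"]), or with feature_columns(iris.columns, ["id", "label", "binomial_label"]) as the second argument if you prefer an ignore list.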
