Evaluating Feature Importance for PySpark Models

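After training a model with Spark, we often want to know how important each feature is to it. This post uses a random forest as an example to show how to evaluate feature importance in PySpark.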

The code below trains a random forest model. It is taken from the example code on the official Spark website.

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Load and parse the data file, converting it to a DataFrame.
# Assumes an active SparkSession named `spark`, as in the pyspark shell.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(
    inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model.  This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(
    labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

rfModel = model.stages[1]
print(rfModel)  # summary only


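After training, the fitted random forest is available as rfModel, the second stage of the pipeline. Its feature importances can be read with the following code: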

rfModel.featureImportances

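The output is a vector of scores indexed by feature position, with no feature names, so it is hard to read. The following function converts it into a readable table: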

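def ExtractFeatureImp(featureImp, dataset, featuresCol="indexedFeatures"):
    # featuresCol must be the vector column the model was trained on; its schema
    # metadata ("ml_attr") lists every feature with its index (and name, if any).
    import pandas as pd
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    list_extract = []
    # attrs is grouped by attribute type (e.g. "numeric", "nominal"); flatten the groups.
    for i in attrs:
        list_extract = list_extract + attrs[i]
    varlist = pd.DataFrame(list_extract)
    # Look up each feature's importance by its index in the feature vector.
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[int(x)])
    return varlist.sort_values('score', ascending=False)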

The function takes two arguments: featureImp is the rfModel.featureImportances vector described above, and dataset is the predictions DataFrame produced by the model.

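The function returns a pandas DataFrame sorted by importance. Its name and score columns hold each feature's name and its importance to the model. The name column is present only when the feature metadata records names, for example when the vector was built with VectorAssembler; otherwise features are identified by the idx column.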
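The metadata lookup inside ExtractFeatureImp can be illustrated without Spark. The sketch below is not part of the original post: the `attrs` dictionary is a hand-written stand-in for `dataset.schema[...].metadata["ml_attr"]["attrs"]`, the feature names and scores are made up, and `rank_features` is a hypothetical helper that mirrors the same flatten-then-score logic in plain Python.

```python
# Hand-written stand-in for the "attrs" entry of a feature column's ml_attr metadata.
# The feature names and scores are made up for illustration.
attrs = {
    "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
    "nominal": [{"idx": 1, "name": "city", "vals": ["a", "b"]}],
}
# Stand-in for rfModel.featureImportances, indexed by feature position.
importances = [0.5, 0.125, 0.375]


def rank_features(attrs, importances):
    """Flatten the per-type attribute groups and attach a score to each feature."""
    rows = []
    for group in attrs.values():
        for attr in group:
            rows.append({
                "idx": attr["idx"],
                "name": attr.get("name"),  # None when the metadata has no name
                "score": importances[attr["idx"]],
            })
    # Highest score first, as in ExtractFeatureImp.
    return sorted(rows, key=lambda row: row["score"], reverse=True)


ranked = rank_features(attrs, importances)
for row in ranked:
    print(row["idx"], row["name"], row["score"])
```

The key point is that the attributes are grouped by type, so their order in the metadata does not match their position in the feature vector; the idx field is what ties each attribute back to its score.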

The following code shows the 30 features that matter most to the model:

ExtractFeatureImp(rfModel.featureImportances, predictions).head(30)
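When the feature column carries no ml_attr metadata, ExtractFeatureImp has nothing to look up, but the scores can still be ranked by position. The sketch below is an addition to the original post: `top_k_by_position` is a hypothetical helper, and the scores are made-up numbers standing in for `rfModel.featureImportances.toArray()`, which converts the vector to a dense NumPy array.

```python
def top_k_by_position(scores, k):
    """Return the k highest (index, score) pairs from a sequence of scores."""
    pairs = [(idx, float(score)) for idx, score in enumerate(scores)]
    # Sort by score, highest first; ties keep their original position order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Made-up scores standing in for rfModel.featureImportances.toArray().
scores = [0.0, 0.25, 0.0, 0.5, 0.25]
top = top_k_by_position(scores, 3)
print(top)
```

The indices in the result are positions in the feature vector, so they still have to be mapped back to column names by hand.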
