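After training a model with Spark, we often want to know how important each feature is to it. Using a random forest as an example, this article shows how to evaluate feature importance in PySpark.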
The code below trains a random forest model. It is taken from the example code on the official Spark website.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

# Reuse the active SparkSession, or create one if none exists.
spark = SparkSession.builder.appName("RandomForestRegressorExample").getOrCreate()

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="indexedFeatures")

# Chain indexer and forest in a Pipeline
pipeline = Pipeline(stages=[featureIndexer, rf])

# Train model. This also runs the indexer.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

# The fitted random forest is the second stage of the pipeline model.
rfModel = model.stages[1]
print(rfModel)  # summary only
Once training has finished, the fitted random forest is available as rfModel. Its feature importances can be read with the following code:
rfModel.featureImportances
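The result is a vector of scores indexed by feature position, with no feature names attached, so it is not very readable. The helper function below converts it into a friendlier form: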
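import pandas as pd

def ExtractFeatureImp(featureImp, dataset):
    # The column must carry "ml_attr" attribute metadata
    # (for example, as written by VectorAssembler).
    featuresCol = "features"
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    # attrs maps an attribute type to a list of {"idx": ..., "name": ...} entries.
    list_extract = []
    for i in attrs:
        list_extract = list_extract + attrs[i]
    varlist = pd.DataFrame(list_extract)
    # Look up each feature's importance by its position in the feature vector.
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[int(x)])
    return varlist.sort_values('score', ascending=False)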
The function takes two arguments: featureImp is the rfModel.featureImportances value mentioned above, and dataset is the model's prediction output, predictions.
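The function returns a pandas DataFrame sorted by importance in descending order. Its name and score columns hold the feature name and the feature's importance to the model; the idx column read from the metadata is kept as well.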
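To make the metadata lookup inside the helper easier to follow, here is a minimal pure-Python sketch of the same idea. The dictionary example_ml_attr is a hand-written stand-in for what dataset.schema["features"].metadata["ml_attr"] might contain; the feature names, the importance values, and the function names flatten_attrs and rank_features are all hypothetical and only serve the illustration.

```python
def flatten_attrs(ml_attr):
    """Merge the per-type attribute lists into one list ordered by feature index."""
    flat = []
    for attr_list in ml_attr["attrs"].values():
        flat.extend(attr_list)
    return sorted(flat, key=lambda attr: attr["idx"])


def rank_features(ml_attr, importances):
    """Pair each feature name with its score and sort by score, highest first."""
    rows = [(attr["name"], importances[attr["idx"]]) for attr in flatten_attrs(ml_attr)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical metadata: attributes are grouped by type, each with an index and a name.
example_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
        "binary": [{"idx": 1, "name": "is_member"}],
    },
    "num_attrs": 3,
}

# Hypothetical importance scores, one per feature index.
example_importances = [0.2, 0.5, 0.3]

ranked = rank_features(example_ml_attr, example_importances)
for name, score in ranked:
    print(name, score)
```

The grouping by attribute type is why the helper has to concatenate several lists before it can build one table of features.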
The following code displays the 30 features that matter most to the model:
ExtractFeatureImp(rfModel.featureImportances, predictions).head(30)
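If the feature column carries no per-feature attribute metadata, the name lookup in ExtractFeatureImp has nothing to read. In that case one possible fallback is to rank features by position only. The sketch below assumes the scores have already been converted to a plain sequence, for example with rfModel.featureImportances.toArray(); the function name top_features_by_index and the demo scores are hypothetical.

```python
def top_features_by_index(scores, top_n=30):
    """Return the top_n (feature index, score) pairs, highest score first."""
    pairs = [(idx, float(score)) for idx, score in enumerate(scores)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_n]


# Hypothetical scores standing in for rfModel.featureImportances.toArray().
demo_scores = [0.0, 0.4, 0.1, 0.5]

top2 = top_features_by_index(demo_scores, top_n=2)
print(top2)
```

The indices returned here refer to positions in the feature vector, so they still have to be mapped back to column names by hand.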