Kaggle hosts an example dataset on the quality of red wine. I wrote some code for it using scikit-learn and pandas:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Read dataset
wine = pd.read_csv('~/Downloads/winequality-red.csv', sep = ';')
attrs = wine.drop(['quality'], axis = 1)
header = list(attrs)
attrs = attrs.values
# Use scaler to normalize data
# (note: scaled_attrs is never used below; both classifiers are fitted on the unscaled attrs)
scaler = StandardScaler()
scaled_attrs = scaler.fit_transform(attrs)
quality = wine['quality'].values
# SVM classifier
svr = SVC(kernel = 'rbf', max_iter = -1)
svr.fit(attrs, quality)
# Randomized decision trees classifier
dt = ExtraTreesClassifier()
dt.fit(attrs, quality)
ls = list(zip(dt.feature_importances_, header))
# Sort alphabetically by feature name (x[1]), not by importance
ls.sort(key = lambda x: x[1])
for importance, name in ls:
    print(name, importance)
print('\n\n')
# Cross validation on these two classifiers
for reg in [svr, dt]:
    scores = cross_val_score(reg, attrs, quality, scoring = 'neg_mean_squared_error', cv = 10)
    # Negating neg_mean_squared_error gives the per-fold MSE (not the RMSE)
    mse = -scores
    print(reg)
    print(mse.mean(), mse.std())
    print('\n')
The results reported by the snippet above:
alcohol 0.1438906634767823
chlorides 0.07953780339531004
citric acid 0.07979101058207233
density 0.0846765183778148
fixed acidity 0.07686725880938272
free sulfur dioxide 0.07178658192019563
pH 0.07797509374376276
residual sugar 0.0796105749270121
sulphates 0.11872569296381115
total sulfur dioxide 0.0993798893196299
volatile acidity 0.08775891248422625

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.6983420378445301 0.04803296683789781

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
    oob_score=False, random_state=None, verbose=0, warm_start=False)
It looks like the most important feature for predicting the quality of red wine is ‘alcohol’. Intuitive, right?
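As a possible follow-up, here is a minimal sketch of two variations on the code above: applying the StandardScaler inside the cross-validation (through a scikit-learn pipeline, so the scaler is fitted on each training fold only), and listing the features ordered by importance rather than by name. The helper names `scaled_svc_mse` and `rank_importances`, the `n_estimators=100` setting, and the synthetic data are my own assumptions for illustration; the synthetic data only stands in for the wine CSV so the snippet runs on its own, and its numbers say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def scaled_svc_mse(X, y, cv=10):
    """Cross-validate an RBF SVC with scaling fitted inside each fold.

    Returns the per-fold mean squared error (hypothetical helper).
    """
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error', cv=cv)
    return -scores  # negate to get the MSE of each fold


def rank_importances(X, y, names, random_state=0):
    """Fit extra-trees and return (name, importance) pairs, largest first.

    Hypothetical helper; n_estimators=100 is an assumed setting.
    """
    dt = ExtraTreesClassifier(n_estimators=100, random_state=random_state)
    dt.fit(X, y)
    pairs = zip(names, dt.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)


# Synthetic stand-in for the wine data (assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
names = ['f0', 'f1', 'f2', 'f3', 'f4']

mse = scaled_svc_mse(X, y, cv=5)
print(mse.mean(), mse.std())

ranked = rank_importances(X, y, names)
for name, importance in ranked:
    print(name, importance)
```

With the real data, `X` would be `attrs`, `y` would be `quality`, and `names` would be `header` from the code above. The scores would then reflect the scaled features, so they should not be expected to match the numbers reported earlier.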