scikit-learn

Some thoughts about cuDF and cuML

I just received an email from NVIDIA about RAPIDS. Although cuDF and cuML look fantastic for data scientists, I am still doubtful about them.

In our daily work, we usually process small DataFrames with pandas, so cuDF would be too expensive because it needs a GPU. And even when we need to join two large DataFrames, we tend to use BigQuery, since it is distributed and relatively cheap. The only proper use case for cuDF, I think, is running heavy operations on less than 8GB of data. Who needs so many heavy operations on a DataFrame? I don’t know.

cuML is more like a GPU version of scikit-learn. In practice, for tabular data we use XGBoost/LightGBM, and for unstructured data we use PyTorch/TensorFlow. Who even uses scikit-learn, not to mention cuML?

LinearSVC versus SVC in scikit-learn

In the ‘Quora Insincere Questions Classification’ competition, I wanted to use simple TF-IDF statistics as a baseline.

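import time
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC
# benchmark() is a helper defined elsewhere; the version below is an assumed
# stand-in: refit on the training split and return the F1 score on the test split.
def benchmark(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))
def grid_search(classifier, parameters, X, y, X_train, y_train, X_test, y_test, name = 'SVC'):
    begin = time.time()
    # 10-fold stratified grid search over the whole dataset, with 36 parallel jobs
    clf = GridSearchCV(classifier, parameters, cv = StratifiedKFold(n_splits = 10), n_jobs = 36)
    clf.fit(X, y)
    print('f1 for ' + name +': ', benchmark(clf.best_estimator_, X_train, y_train, X_test, y_test), clf.best_estimator_)
    print('cost time: ', time.time() - begin)
# Load and shuffle the training data
data = pd.read_csv('train.csv')
data = data.sample(frac = 1.0)
corpus = data['question_text']
y = data['target']
# Build TF-IDF features and hold out 10% of the samples for testing
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify = y)
# Parameter grid for LinearSVC
C_set = [0.4, 0.6, 0.8]
tol_set = [2, 1.5, 1.4, 1.3, 1.2, 1, 0.8, 0.6, 0.4]
parameters = {
    'penalty': ['l1', 'l2'],
    'C': C_set,
    'tol': tol_set }
classifier = LinearSVC(dual = False)
grid_search(classifier, parameters, X, y, X_train, y_train, X_test, y_test, 'LinearSVC')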

The result is not bad:

f1 for LinearSVC:  0.5255644347893553 LinearSVC(C=0.8, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l1', random_state=None, tol=0.4,
     verbose=0)
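
One detail of the search above is worth noting: GridSearchCV ranks parameter combinations by the estimator's default score (accuracy for classifiers) unless `scoring` is set, while the number printed here is F1. The sketch below shows one way to make the search select on F1 directly, with TF-IDF placed inside a Pipeline so that it is refit on each training fold. The toy corpus, its labels, and the step names `tfidf` and `clf` are made up for illustration; this is not the competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical): 1 = insincere, 0 = sincere
corpus = [
    "why are all politicians such liars",
    "why are people from that city so stupid",
    "why do all teenagers act like idiots",
    "why are my neighbours such stupid liars",
    "why do idiots always vote for liars",
    "why are all managers so stupid",
    "how do I learn python quickly",
    "what is the best way to learn statistics",
    "how does a linear model work",
    "what is the best book about python",
    "how do I install a linear algebra library",
    "what is a good way to learn linux",
]
y = [1] * 6 + [0] * 6

# Vectorizer and classifier in one estimator, so each fold refits TF-IDF
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(dual=False)),
])

# scoring="f1" makes the search pick the parameters with the best F1
search = GridSearchCV(
    pipeline,
    {"clf__C": [0.4, 0.6, 0.8]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(corpus, y)
print(search.best_params_, search.best_score_)
```

On the real data, the same pattern would apply with `corpus = data['question_text']` and `y = data['target']`.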

But after I changed LinearSVC to SVC(kernel=’linear’), the program could not produce any result even after 12 hours!
Was I doing anything wrong? On the documentation page for sklearn.svm.LinearSVC, there is a note:

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

The page for sklearn.svm.SVC has another note:

The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

That’s the answer: with its more-than-quadratic fit time, SVC cannot cope with a dataset of this size, so LinearSVC is the right choice for a large number of samples.
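
To see the difference on a small scale, here is a minimal sketch that times both estimators on the same data. The synthetic dataset from `make_classification`, its size, and the helper name `time_fit` are assumptions for illustration; the data is far smaller than the Quora training set, so both fits finish quickly.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC


def time_fit(model, X, y):
    """Return the wall-clock seconds needed to fit `model` on (X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Synthetic data (an assumption; not the competition data)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

timings = {
    "LinearSVC": time_fit(LinearSVC(dual=False), X, y),
    "SVC(kernel='linear')": time_fit(SVC(kernel="linear"), X, y),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Increasing `n_samples` should widen the gap, in line with the documentation notes quoted above; the exact timings depend on the machine and the scikit-learn version.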

Prediction of Red Wine Quality

On Kaggle, there is an example dataset about the quality of red wine. I wrote some code for it using scikit-learn and pandas:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Read dataset
wine = pd.read_csv('~/Downloads/winequality-red.csv', sep = ';')
attrs = wine.drop(['quality'], axis = 1)
header = list(attrs)
attrs = attrs.values
# Use scaler to normalize data
# (scaled_attrs is computed here, but the models below are fit on the unscaled attrs)
scaler = StandardScaler()
scaled_attrs = scaler.fit_transform(attrs)
quality = wine['quality'].values
# SVM classifier
svc = SVC(kernel = 'rbf', max_iter = -1)
svc.fit(attrs, quality)
# Randomized decision trees classifier
dt = ExtraTreesClassifier()
dt.fit(attrs, quality)
# Print the feature importances, sorted by feature name
ls = list(zip(dt.feature_importances_, header))
ls.sort(key = lambda x: x[1])
for importance, name in ls:
    print(name, importance)
print('\n\n')
# Cross validation on these two classifiers
for clf in [svc, dt]:
    # Negating 'neg_mean_squared_error' gives the MSE (not the RMSE) of each fold
    scores = cross_val_score(clf, attrs, quality, scoring = 'neg_mean_squared_error', cv = 10)
    mse = -scores
    print(clf)
    print(mse.mean(), mse.std())
    print('\n')

The results reported by the snippet above:

alcohol 0.1438906634767823
chlorides 0.07953780339531004
citric acid 0.07979101058207233
density 0.0846765183778148
fixed acidity 0.07686725880938272
free sulfur dioxide 0.07178658192019563
pH 0.07797509374376276
residual sugar 0.0796105749270121
sulphates 0.11872569296381115
total sulfur dioxide 0.0993798893196299
volatile acidity 0.08775891248422625
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.6983420378445301 0.04803296683789781
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

It looks like the most important feature for predicting the quality of red wine is ‘alcohol’. Intuitive, right?
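
The snippet above computes `scaled_attrs` but fits and cross-validates both models on the unscaled `attrs`. One common way to make sure the scaling is actually used, and is fit only on each training fold, is to put the scaler and the classifier in a Pipeline. The sketch below is a minimal illustration that uses scikit-learn's bundled `load_wine` dataset as a stand-in (a different wine dataset from the Kaggle CSV), so its scores are not comparable to the results above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (an assumption): the wine dataset bundled with scikit-learn
X, y = load_wine(return_X_y=True)

# Same classifier with and without standardization
raw_svc = SVC(kernel="rbf")
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# The pipeline refits the scaler on the training part of every fold
raw_scores = cross_val_score(raw_svc, X, y, cv=5)
scaled_scores = cross_val_score(scaled_svc, X, y, cv=5)
print("unscaled accuracy:", raw_scores.mean())
print("scaled accuracy:  ", scaled_scores.mean())
```

For the red wine CSV, the same pipeline could be passed to `cross_val_score` together with `attrs` and `quality`.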

Use PCA (Principal Component Analysis) to blur a color image

I wrote an example of blurring a color picture using PCA from scikit-learn:

import cv2
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.96)
img = cv2.imread("input.jpg")
reduced = pca.fit_transform(img)
res = pca.inverse_transform(reduced)
cv2.imwrite('output.jpg', res.reshape(shape))

But it reports

ValueError: Found array with dim 3. Estimator expected <= 2.

The correct solution is to reshape the image into a 2-dimensional array before PCA, and to reshape the result back after the inverse transform:

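img = cv2.imread('input.jpg')
shape = img.shape
# Merge the width and channel axes so that each image row is one sample
img_r = img.reshape((shape[0], shape[1] * shape[2]))
reduced = pca.fit_transform(img_r)
# Rebuild the image from the kept components and restore the 3-D shape
res = pca.inverse_transform(reduced)
cv2.imwrite('output.jpg', res.reshape(shape))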

It works very well now. Here are the original image and the blurred image:

Original Image

Blurred Image
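
For anyone who wants to try the reshaping trick without an image file, here is a minimal sketch that runs the same steps on a synthetic image. The helper name `blur_with_pca`, the synthetic gradient image, and the clipping to the 0 to 255 range are my assumptions for illustration, not part of the snippet above.

```python
import numpy as np
from sklearn.decomposition import PCA


def blur_with_pca(img, variance=0.96):
    """Hypothetical helper: rebuild an H x W x C image from the principal
    components that explain the requested fraction of the variance."""
    h, w, c = img.shape
    # PCA expects a 2-D array, so merge the width and channel axes
    flat = img.reshape(h, w * c).astype(np.float64)
    pca = PCA(n_components=variance)
    restored = pca.inverse_transform(pca.fit_transform(flat))
    # Clip to the valid pixel range before converting back to 8-bit
    blurred = np.clip(restored, 0, 255).astype(np.uint8).reshape(h, w, c)
    return blurred, pca.n_components_


# Synthetic test image (an assumption): a vertical gradient plus noise
rng = np.random.default_rng(0)
gradient = np.linspace(0, 255, 64).reshape(64, 1, 1)
noise = rng.normal(0, 20, size=(64, 48, 3))
img = np.clip(gradient + noise, 0, 255).astype(np.uint8)

blurred, n_kept = blur_with_pca(img)
print(img.shape, blurred.shape, n_kept)
```

A lower `variance` keeps fewer components and should give a stronger blur.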

Robin on Linux
