In the 'Quora Insincere Questions Classification' competition, I want to use simple TF-IDF features as a baseline.
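For reference, benchmark() in the code below is a small helper that isn't shown in full here; a minimal sketch, assuming it simply refits the estimator on the training split and returns the F1 score on the held-out split, would be:

from sklearn.metrics import f1_score

def benchmark(estimator, X_train, y_train, X_test, y_test):
    # Assumed behavior: refit on the training split and score F1 on the held-out split.
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    return f1_score(y_test, predictions)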
import time
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC

def grid_search(classifier, parameters, X, y, X_train, y_train, X_test, y_test, name='SVC'):
    # Exhaustive search over the grid with stratified 10-fold CV, then report
    # the held-out F1 of the best estimator found.
    begin = time.time()
    clf = GridSearchCV(classifier, parameters, cv=StratifiedKFold(n_splits=10), n_jobs=36)
    clf.fit(X, y)
    print('f1 for ' + name + ': ', benchmark(clf.best_estimator_, X_train, y_train, X_test, y_test), clf.best_estimator_)
    print('cost time: ', time.time() - begin)

# Load and shuffle the competition training data.
data = pd.read_csv('train.csv')
data = data.sample(frac=1.0)
corpus = data['question_text']
y = data['target']

# Plain TF-IDF features as the baseline representation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Hold out 10% of the data, stratified on the imbalanced target.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

C_set = [0.4, 0.6, 0.8]
tol_set = [2, 1.5, 1.4, 1.3, 1.2, 1, 0.8, 0.6, 0.4]
parameters = {
    'penalty': ['l1', 'l2'],
    'C': C_set,
    'tol': tol_set,
}

classifier = LinearSVC(dual=False)
grid_search(classifier, parameters, X, y, X_train, y_train, X_test, y_test, 'LinearSVC')
The result is not bad:
f1 for LinearSVC: 0.5255644347893553 LinearSVC(C=0.8, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.4, verbose=0)
But after I changed LinearSVC to SVC(kernel='linear'), the program could not produce any result even after 12 hours!
Am I doing anything wrong? On the documentation page for sklearn.svm.LinearSVC, there is a note:
Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
And on the page for sklearn.svm.SVC, there is another note:
The implementation is based on libsvm. The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
That's the answer: LinearSVC, backed by liblinear, is the right choice when working with a large number of samples.
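To see the scaling difference without waiting 12 hours, one rough check is to time both estimators on growing subsamples of the TF-IDF matrix. This is just a sketch reusing X and y from above, not something the reported numbers come from:

import time
from sklearn.svm import SVC, LinearSVC

# Fit both estimators on growing subsamples of the shuffled data and
# watch how the fit time grows with the number of samples.
for n in [5000, 10000, 20000, 40000]:
    for model in [LinearSVC(dual=False), SVC(kernel='linear')]:
        begin = time.time()
        model.fit(X[:n], y.iloc[:n])
        print(type(model).__name__, n, 'samples:', round(time.time() - begin, 1), 's')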