Machine learning: soft classifiers and ROC

This post explains the concept of soft classifiers (in their simplest form) and offers examples in sklearn.

Soft classifiers

In classification problems, a hard classifier outputs a single predicted class for each instance.

A soft classifier instead outputs a probability estimate for every class. A prediction can then be made by applying a threshold to these probabilities. This also makes multi-label classification possible.
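As a minimal sketch of the thresholding step, the snippet below turns soft outputs into hard labels. The probability values and the helper name threshold_predict are made up for illustration; they do not come from the iris example later in the post.

```python
import numpy as np

# Hypothetical soft outputs for 4 samples over 2 classes (each row sums to 1).
proba = np.array([[0.2, 0.8],
                  [0.6, 0.4],
                  [0.4, 0.6],
                  [0.8, 0.2]])

def threshold_predict(proba, threshold=0.5):
    """Label a sample positive (1) when P(class 1) >= threshold."""
    return (proba[:, 1] >= threshold).astype(int)

default_labels = threshold_predict(proba)      # threshold 0.5
strict_labels = threshold_predict(proba, 0.7)  # stricter threshold
print(default_labels, strict_labels)
```

Raising the threshold makes the classifier more conservative about predicting the positive class, so the same soft output can yield several different hard predictions.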

Code in sklearn:

This is a sample program in Python using a KNN classifier.

import numpy as np
import pandas

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

# Change to binary classification by replacing the species with random 0/1 labels
new_class = np.random.randint(0, 2, len(dataset), dtype='l')
dataset['class'] = new_class

 

Now build the classifier.


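from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed split (not shown in the original): hold out 20% for validation
array = dataset.values
X, Y = array[:, 0:4], array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=7)

# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)  # hard
predictions_prob = knn.predict_proba(X_validation)  # soft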

 

Results of the hard classifier:


predictions

array([ 1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,
        0.,  1.,  1.,  1.])

 

Results of soft classifier:

array([[ 0.2,  0.8],
       [ 0.2,  0.8],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.8,  0.2],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.4,  0.6],
       [ 0.2,  0.8],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.2,  0.8]])

 

This difference shows up in the ROC curve. For the hard classifier, the ROC is a single operating point joined by straight lines to (0, 0) and (1, 1). For the soft classifier, the ROC has one point per distinct threshold, so it traces out a stepped curve. Note the input for the soft classifier: pred = predictions_prob[:, 1] selects the probability of the positive class, which is what gets passed to the ROC function.

 

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from sklearn.metrics import roc_curve, auc

plt.figure(figsize = (12, 8))

truth = Y_validation
pred = predictions
fpr, tpr, thresholds = roc_curve(truth, pred)
roc_auc = auc(fpr, tpr)
c = (np.random.rand(), np.random.rand(), np.random.rand())
plt.plot(fpr, tpr, color=c, label= 'HARD'+' (AUC = %0.2f)' % roc_auc)

truth = Y_validation
pred = predictions_prob[:, 1]
fpr, tpr, thresholds = roc_curve(truth, pred)
roc_auc = auc(fpr, tpr)
c = (np.random.rand(), np.random.rand(), np.random.rand())
plt.plot(fpr, tpr, color=c, label= 'SOFT'+' (AUC = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC')
plt.legend(loc="lower right")

plt.show()

 

Why is this the case?

It comes down to what is passed to roc_curve as y_score. From the sklearn documentation:

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)

Parameters:

y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
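To see how the choice of y_score changes the output, here is a small sketch on made-up labels and scores (not the iris data above). The hard labels are obtained by thresholding the same scores at 0.5, which is an assumption made for this illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up ground truth and soft scores for 8 samples.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.2, 0.4, 0.8, 0.6, 0.6, 0.4, 1.0, 0.0])

# Hard predictions: only two distinct values (0 and 1) remain.
y_hard = (y_score >= 0.5).astype(int)

fpr_h, tpr_h, thr_h = roc_curve(y_true, y_hard)
fpr_s, tpr_s, thr_s = roc_curve(y_true, y_score)

# Hard input: (0, 0), one interior point, and (1, 1).
print("hard:", list(zip(fpr_h, tpr_h)))
# Soft input: one candidate point per distinct score value.
print("soft:", list(zip(fpr_s, tpr_s)))
```

With 0/1 predictions as y_score there are only two distinct score values, so roc_curve can only produce one point between the two corners. With probabilities, every distinct value is a candidate threshold.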

 

But still, why does the soft classifier's ROC curve have so many more points? To answer that, recall what an ROC curve is.

ROC = receiver operating characteristic

For an ROC graph, the x-axis is the false positive rate, the y-axis is the true positive rate.

Each point corresponds to a particular classification strategy.

(0, 0) = classifying all instances as negative.

(1, 1) = classifying all instances as positive.

A ranking model produces a set of points in ROC space. Each point corresponds to one threshold, and each threshold produces a different point. A soft classifier is thus in effect equivalent to many hard classification strategies, one per threshold.
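The threshold sweep can be sketched by hand without sklearn. The helper name roc_points and the four-sample data are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return one (FPR, TPR) pair per candidate threshold, highest first."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Threshold above every score: everything is classified negative.
    points = [(0.0, 0.0)]
    for t in np.unique(y_score)[::-1]:
        pred = y_score >= t  # hard strategy induced by threshold t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        points.append((float(fp / n_neg), float(tp / n_pos)))
    return points

y_true = [0, 0, 1, 1]
y_score = [0.2, 0.6, 0.4, 0.8]
points = roc_points(y_true, y_score)
print(points)
```

Each loop iteration is one hard classifier. The lowest threshold classifies everything as positive and lands on (1, 1), while the starting point (0, 0) is the strategy that classifies everything as negative.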

Final plot:

Update 20180428:

sklearn.metrics.roc_curve has an argument “drop_intermediate=True”, which drops suboptimal thresholds that would not change the shape of the plotted curve. This results in a different number of points in the plot. Keep this in mind when comparing results from different classifiers!
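A quick way to check the effect is to call roc_curve twice on the same input. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up ground truth and soft scores for 8 samples.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.2, 0.4, 0.8, 0.6, 0.6, 0.4, 1.0, 0.0])

# Default behaviour: collinear intermediate points are dropped.
fpr_drop, tpr_drop, thr_drop = roc_curve(y_true, y_score)
# Keep every distinct score value as a threshold.
fpr_all, tpr_all, thr_all = roc_curve(y_true, y_score, drop_intermediate=False)

print("points with drop_intermediate=True: ", len(thr_drop))
print("points with drop_intermediate=False:", len(thr_all))
print("AUC:", auc(fpr_drop, tpr_drop), auc(fpr_all, tpr_all))
```

The dropped points lie on straight segments of the curve, so the AUC is the same in both cases; only the number of plotted points differs.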
