
Improve handling of unbalanced confusion matrices #41

Merged · 1 commit into reiinakano:master · Aug 24, 2017

Conversation

@ExcaliburZero (Contributor)

Here I have made a few changes that make it easier to plot confusion matrices where the true and predicted sets of labels are not the same. This case can occur when, for example, applying "new" categories to a dataset labeled with an older set of categories.

The changes included are the following:

Fix an issue with NaN values showing up when unbalanced confusion matrices are normalized: rows with no entries sum to zero, so normalizing each cell divides by zero (a sketch of the guard follows this list).

Add options to limit the labels displayed on the true and predicted axes, since with unbalanced confusion matrices some labels may appear only in the set of true labels or only in the set of predicted labels.
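
A minimal sketch of the divide-by-zero guard, assuming row-wise normalization like plot_confusion_matrix performs; the actual fix (later split into PR #42) may differ:

import numpy as np

cm = np.array([[2., 0., 0.],
               [0., 0., 0.],   # a class with no true instances: empty row
               [0., 1., 1.]])

row_sums = cm.sum(axis=1)[:, np.newaxis]
safe_sums = np.where(row_sums == 0, 1, row_sums)  # never divide by zero
cm_normalized = cm / safe_sums  # all-zero rows stay zero instead of NaN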

You can see the effect of the new options here:

import numpy as np
import matplotlib.pyplot as plt
import scikitplot as sciplt

y_true = np.array(["A", "A", "B", "B", "B", "C", "D"])
y_pred = np.array(["A", "A", "Ba", "Bb", "Ba", "C", "D"])

print(y_true.shape)
print(y_pred.shape)

# The true and predicted label sets differ: "B" appears only in y_true,
# while "Ba" and "Bb" appear only in y_pred.
true_labels = np.unique(y_true)
pred_labels = np.unique(y_pred)

labels = np.sort(np.unique(np.concatenate([true_labels, pred_labels])))

# Indexes into `labels` of the entries to show on each axis.
true_label_indexes = np.where(np.isin(labels, true_labels))
pred_label_indexes = np.where(np.isin(labels, pred_labels))

sciplt.plotters.plot_confusion_matrix(
    y_true, y_pred, hide_zeros=True, normalize=True,
    true_label_indexes=true_label_indexes,
    pred_label_indexes=pred_label_indexes, labels=labels)
plt.show()

[figure_1: the resulting confusion matrix plot]

@reiinakano (Owner) commented Aug 4, 2017

Hi @ExcaliburZero, thanks for taking the time to write this. Apologies for the late response; I've been quite busy.

First off, do you think you could open a separate PR for the NaN values bugfix so I can merge that immediately and keep this new feature suggestion separate? Thanks!

Anyway, I think allowing people to select which classes to show on the CM is a useful feature. However, I'm not a fan of the new arguments introduced here. The need to use indices would be confusing for anyone not familiar with the internals. Nobody knows what the index of a class is unless they explicitly see that it's the index in the array you get from np.sort(np.unique(np.concatenate([true_labels, pred_labels]))), and nobody's going to go dig that info out of the source code.

I think a better way is to let them pass in a list of the actual classes they want included in the x-axis and y-axis respectively. But you have to be careful to handle edge cases: classes that are not in the data at all, duplicate classes, etc.

@ExcaliburZero (Contributor, Author)

I have made a separate PR for the NaN values bugfix (#42).

I agree that the arguments are a bit confusing to use. They should instead take the names of the categories, though as you noted this is complicated by the range of values that can be passed in.

I'll make those changes to the arguments and add in some good validation. (Though I'll be a bit busy, so I will probably get around to working on this on Monday.)

@ExcaliburZero (Contributor, Author)

I have changed the options to take in the names of the labels instead.

true_labels = ["A", "B", "C", "D"]
pred_labels = ["A", "Ba", "Bb", "C", "D"]

sciplt.plotters.plot_confusion_matrix(y_true, y_pred, true_labels=true_labels,
                                      pred_labels=pred_labels)

I have also added validation for both true_labels and pred_labels, checking that there are no duplicate labels and no labels that are absent from classes. If either issue is present, a ValueError is raised with a descriptive error message.
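
A minimal sketch of what that validation might look like, reconstructed from the fragments quoted in the review below; names follow the diff, but the merged code may differ in details:

import numpy as np

def validate_labels(known_classes, passed_labels, argument_name):
    """Raise a ValueError if passed_labels contains duplicates or labels
    that do not appear in known_classes."""
    known_classes = np.asarray(known_classes)
    passed_labels = np.asarray(passed_labels)

    # np.unique returns the index of each label's first occurrence, so any
    # index outside that set points at a duplicate.
    indexes = np.arange(len(passed_labels))
    unique_indexes = np.unique(passed_labels, return_index=True)[1]
    duplicate_labels = passed_labels[indexes[~np.in1d(indexes, unique_indexes)]]
    if duplicate_labels.size > 0:
        raise ValueError(
            "The following duplicate labels were passed into {0}: "
            "{1}".format(argument_name, ", ".join(map(str, duplicate_labels))))

    # Labels that never occur in the data at all.
    absent_labels = passed_labels[~np.in1d(passed_labels, known_classes)]
    if absent_labels.size > 0:
        raise ValueError(
            "The following labels were passed into {0}, but were not found "
            "in labels: {1}".format(
                argument_name, ", ".join(map(str, absent_labels))))

map(str, ...) keeps the messages working for the non-string labels that come up later in the thread.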

@ExcaliburZero (Contributor, Author)

Here are some examples of the error messages.

Duplicate labels:

true_labels = ["A", "B", "C", "D", "D", "A"]
pred_labels = ["A", "Ba", "Bb", "C", "D", "F", "G"]

sciplt.plotters.plot_confusion_matrix(y_true, y_pred, true_labels=true_labels,
                                      pred_labels=pred_labels)
ValueError: The following duplicate labels were passed into true_labels: D, A

Missing labels:

true_labels = ["A", "B", "C", "D"]
pred_labels = ["A", "Ba", "Bb", "C", "D", "F", "G"]

sciplt.plotters.plot_confusion_matrix(y_true, y_pred, true_labels=true_labels,
                                      pred_labels=pred_labels)

ValueError: The following labels were passed into pred_labels, but were not found in labels: F, G

@reiinakano (Owner)

Sorry for the late review. I've been very busy the past few days. Will add my comments now.

@@ -87,6 +94,50 @@ def plot_confusion_matrix(y_true, y_pred, labels=None, title=None, normalize=Fal
    else:
        classes = np.asarray(labels)

def validate_labels(known_classes, passed_labels, argument_name):
@reiinakano (Owner) · Aug 19, 2017

Please put this outside of the function, add a detailed docstring, and add a unit test.

duplicate_indexes = indexes[~np.isin(indexes, unique_indexes)]
duplicate_labels = passed_labels[duplicate_indexes]

msg = "The following duplicate labels were passed into %s: %s" % (argument_name, ", ".join(duplicate_labels))
@reiinakano (Owner) · Aug 19, 2017

This line is too long. Also, since the rest of the codebase uses .format instead of % string formatting, I'd prefer it if you use that as well.

if np.any(passed_labels_absent):
    absent_labels = passed_labels[passed_labels_absent]

    msg = "The following labels were passed into %s, but were not found in labels: %s" % (argument_name, ", ".join(absent_labels))
@reiinakano (Owner) · Aug 19, 2017

This line is too long. Also, since the rest of the codebase uses .format instead of % string formatting, I'd prefer it if you use that as well.


pred_classes = classes[pred_label_indexes]
cm = cm[:,pred_label_indexes][:,0,:]

if normalize:
@reiinakano (Owner) · Aug 19, 2017

I think you should calculate the normalized values before slicing the array according to pred_classes and true_classes; otherwise the calculated normalized values might be wrong.
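
A small illustration of the suggested ordering, assuming normalization divides each row by its row sum (variable names taken from the quoted diff):

# Normalize first, while cm still contains every class...
if normalize:
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# ...and only then slice out the requested predicted-label columns.
# Slicing first would drop counts from the row sums and skew the ratios.
cm = cm[:, pred_label_indexes][:, 0, :]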

pred_label_indexes = np.where(np.isin(classes, pred_labels))

pred_classes = classes[pred_label_indexes]
cm = cm[:,pred_label_indexes][:,0,:]
@reiinakano (Owner) · Aug 19, 2017

stylefix: cm = cm[:, pred_label_indexes][:, 0, :]

@reiinakano (Owner) commented Aug 19, 2017

Aside from my comments, this looks pretty good.

The only thing missing is the appropriate unit tests to properly define the behavior of this new functionality. Thanks!

@ExcaliburZero (Contributor, Author)

I have made the changes you mentioned.

Right now I am just having an issue where it looks like numpy.isin does not work in the Python 2.7 build.
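
(A likely cause, though the thread does not say so explicitly: np.isin was only added in NumPy 1.13, so an older NumPy in the Python 2.7 CI environment would not have it. np.in1d is the long-standing equivalent for 1-D inputs, which is what the code switches to below.)

import numpy as np

# np.in1d predates np.isin and behaves the same for 1-D inputs.
print(np.in1d(["A", "B", "C"], ["B", "C", "Z"]))  # [False  True  True]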

@ExcaliburZero (Contributor, Author)

Okay, the tests seem to all work correctly in the Travis CI builds.

Do you want me to also write some tests for the new arguments for plot_confusion_matrix?

@reiinakano (Owner)

Yes please. Just one more test running through the new arguments.

@ExcaliburZero (Contributor, Author)

I have added a test for the new arguments, and also fixed an issue it had with non-string labels and added a test for that.

else:
    validate_labels(classes, true_labels, "true_labels")

    true_label_indexes = np.where(np.in1d(classes, true_labels))
@reiinakano (Owner)

Can't say for sure, but I did some tests and wouldn't
true_label_indexes = np.in1d(classes, true_labels)
work just as well?

else:
    validate_labels(classes, pred_labels, "pred_labels")

    pred_label_indexes = np.where(np.in1d(classes, pred_labels))
@reiinakano (Owner)

Same here,

pred_label_indexes = np.in1d(classes, pred_labels)

Although for this one, you'll want to change L167 to
cm = cm[:, pred_label_indexes]
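
A quick illustration, with made-up data, of why the boolean mask also removes the need for the [:, 0, :] workaround:

import numpy as np

cm = np.arange(16).reshape(4, 4)
classes = np.array(["A", "B", "C", "D"])
mask = np.in1d(classes, ["A", "C"])

print(cm[:, np.where(mask)].shape)  # (4, 1, 2): np.where wraps the index
                                    # array in a tuple, hence the [:, 0, :]
print(cm[:, mask].shape)            # (4, 2): the mask selects columns directly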

Commit message: Add options to plot only certain specified labels in confusion
matrices to allow for cases where some "true" labels are not in the
"predicted" label set or vice versa.

This can be useful in cases where a classifier with certain labels is
applied to a dataset with a disjoint or partially disjoint set of
related labels.

Also add tests for some of the new functionality.
@ExcaliburZero (Contributor, Author)

I have now made those changes.

@reiinakano (Owner)

LGTM!

Thanks a lot for this feature!

@reiinakano merged commit d5402fe into reiinakano:master on Aug 24, 2017