Receiver Operating Characteristics (ROC) are becoming increasingly commonly used in machine learning as they offer a valuable insight into how your model is performing that isn’t captured with just log-loss, facilitating diagnosis of any issues. I won’t go into much detail of what ROC actually is here, as this post is more intended to help navigate people looking for a MAUC Python implementation. If however you are looking for an overview of ROC then I’d recommend Fawcett’s tutorial here. However, in short, ROC is a way of viewing errors from a model of a dichotomous data set where the final prediction will be a hard class label (such as in medical diagnosis). All predictions are identified as being one of the following:
- True Positive
- False Positive
- True Negative
- False Positive
They can therefore identify on what type of samples your model is making its errors on, allowing you to take steps to improve it. To use ROC in machine learning, a quantifiable measure is needed to act as an indicator of a model’s ‘goodness’. The Area Under the ROC Curve (AUC) is used for this purpose.
However, you can’t plot a ROC curve for problems with more than two dimensions, which could be part of the reason why error rate is still used for multi-class classification datasets. One approach is to visualise the problem as a multi-dimensional plane, from which the area can be calculated. Ferri et al (2003) implemented such an approach called the Volume under the Surface in their paper Volume under the ROC surface for multi-class problems. An alternative approach uses the fact that the AUC is the probability that a randomly drawn instance from class 0 will have a greater probability of being predicted to class 0 than a randomly drawn point from class 1 to develop a probabilistic framework to use for higher dimensions.
Hand and Till (2001) adapted this technique in their widely cited paper A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. It involves an approximation of the AUC based on this probability; the AUC is normally estimated from the area of trapezoids below the curve. This measure (denoted as A) is then calculated for each pairwise class grouping and averaged to produce an overall measure (the MAUC) of a classifier’s ability to separate classes.
As this approach is ideal for situations where independent misclassification costs are required, such as medical diagnosis, I wanted to use it for a few experiments. However I was unable to find an existing implementation in Python, the language that my research code is in (if I could make the choice again I would choose R, it seems by far the most suitable language for machine learning).
I’ve shared my implementation as a Gist in the hope that it will come in handy for someone one day.