yet another note for myself

Visualizing Confusion Matrix using heatmap in R


A confusion matrix is one of many ways to analyze the accuracy of a classification model. As shown in the table below, it is a two-dimensional table: one axis holds the actual (target) categories and the other holds the predicted categories. Diagonal cells count the test cases that the model predicted correctly. For instance, in the table below (rows are actual categories, columns are predicted categories) the model correctly predicted 2 out of 11 (or 18%) actual A’s as A. Off-diagonal cells count misclassifications, i.e. the number of test cases that the model incorrectly predicted to belong to a different category.

Actual \ Predicted   A   B   C
A                    2   4   5
B                    4   5   6
C                    6   2   1
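
The arithmetic behind the "2 out of 11" figure can be checked with a few lines of base R. This is only a sketch: it assumes rows are actual categories and columns are predicted ones (which is what the 2-out-of-11 example implies), and the names `cm`, `recall`, `precision`, and `accuracy` are illustrative.

```r
# Example confusion matrix from the table above: rows = actual, columns = predicted
cm <- matrix(c(2, 4, 5,
               4, 5, 6,
               6, 2, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(Actual = c("A", "B", "C"),
                             Predicted = c("A", "B", "C")))

# Share of each actual category that was predicted correctly (per-class recall)
recall <- diag(cm) / rowSums(cm)

# Share of each predicted category that was actually correct (per-class precision)
precision <- diag(cm) / colSums(cm)

# Overall accuracy: correct predictions divided by all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall as percentages
print(round(100 * recall, 1))
```

The first value of `recall` is 2/11, the 18% quoted for category A.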

While the above confusion matrix is insightful, it only works when you have a limited number of categories. On one problem I was working on I had more than 20 categories, and making sense of a table full of numbers was an arduous task. So I started looking for a way to visualize the confusion matrix. After exploring possible visualization techniques, I came up with the idea of using a heatmap. Luckily, it was easy to produce a heatmap in R using the excellent ggplot2 package. (If you haven’t played with ggplot2, try it right now. It’s great!)

Below is the code that produces the final output. In the resulting figure, color indicates the percentage of test cases in each cell, computed relative to the number of actual cases in that category. The diagonal cells are highlighted with a thicker black border. Additionally, each cell displays the percentage value as text. If you like, it is possible to replace the percentage text with the actual frequency.

library(ggplot2)

#generate random data (R indexing starts at 1, so the first 20 letters are LETTERS[1:20])
data <- data.frame(sample(LETTERS[1:20], 100, replace=TRUE), sample(LETTERS[1:20], 100, replace=TRUE))
names(data) <- c("Actual","Predicted")

#compute frequency of actual categories
actual <- as.data.frame(table(data$Actual))
names(actual) <- c("Actual","ActualFreq")

#build confusion matrix
confusion <- as.data.frame(table(data$Actual, data$Predicted))
names(confusion) <- c("Actual","Predicted","Freq")

#calculate percentage of test cases based on actual frequency
confusion <- merge(confusion, actual, by=c("Actual"))
confusion$Percent <- confusion$Freq/confusion$ActualFreq*100

#render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
# (linewidth sets the border thickness; on ggplot2 older than 3.4.0 use size=0.1 instead)
tile <- ggplot() +
  geom_tile(aes(x=Actual, y=Predicted, fill=Percent), data=confusion, color="black", linewidth=0.1) +
  labs(x="Actual", y="Predicted")

# next we render text values. If you only want to show values greater than zero, use data=subset(confusion, Percent > 0)
tile <- tile +
  geom_text(aes(x=Actual, y=Predicted, label=sprintf("%.1f", Percent)), data=confusion, size=3, colour="black") +
  scale_fill_gradient(low="grey", high="red")

# lastly we outline the diagonal tiles. We use alpha=0 so the fill does not hide the previous layers,
# and a thicker border (linewidth=0.3; on ggplot2 older than 3.4.0 use size=0.3) to highlight them
tile <- tile +
  geom_tile(aes(x=Actual, y=Predicted), data=subset(confusion, as.character(Actual)==as.character(Predicted)), color="black", linewidth=0.3, fill="black", alpha=0)

#render
tile
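
As mentioned earlier, the percentage labels can be swapped for raw frequencies. The following is a minimal sketch of that variant, not a drop-in replacement for the code above. The toy vectors and the names `categories` and `actual_freq` are made up for illustration. It also assumes you want every known category to appear on both axes, which it arranges by giving both columns the same factor levels, and it guards against dividing by zero for a category that has no actual cases.

```r
library(ggplot2)

# Hypothetical toy data; "D" is a valid category that never occurs
categories <- c("A", "B", "C", "D")
actual    <- c("A", "A", "B", "B", "B", "C", "C", "A")
predicted <- c("A", "B", "B", "B", "C", "C", "A", "A")

# Give both columns the same factor levels so the table is square
actual    <- factor(actual, levels = categories)
predicted <- factor(predicted, levels = categories)

# One row per (Actual, Predicted) pair, including pairs with zero cases
confusion <- as.data.frame(table(Actual = actual, Predicted = predicted))

# Total number of cases per actual category, aligned with the rows of `confusion`
actual_freq <- ave(confusion$Freq, confusion$Actual, FUN = sum)

# Percentage relative to the actual category; 0 when the category has no cases
confusion$Percent <- ifelse(actual_freq > 0, confusion$Freq / actual_freq * 100, 0)

# Same heatmap idea, but each cell is labelled with its raw count
tile <- ggplot(confusion, aes(x = Actual, y = Predicted)) +
  geom_tile(aes(fill = Percent), color = "black") +
  geom_text(aes(label = Freq), size = 3) +
  scale_fill_gradient(low = "grey", high = "red")

tile
```

The fill color still encodes the percentage, so cells stay comparable across categories of different sizes while the text shows how many test cases each cell actually holds.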

