“confusionMatrix” function in R – The data contain levels not found in the data

The “confusionMatrix” function of the “caret” package threw the error below while validating prediction results in R.

Error message:
Error in confusionMatrix.default(loan$Defaulter, loan$Prediction) :
The data contain levels not found in the data.

Reason:
This error comes up because the two columns passed to the confusionMatrix function are factors with different levels.

The line of R code that raises the error is:

> confusionMatrix(loan$Defaulter, loan$Prediction)

Running “str” on the data frame reveals the following.

> str(loan)
'data.frame': 15 obs. of 3 variables:
 $ Customer : Factor w/ 15 levels "A","B","C",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Defaulter : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 2 2 1 ...
 $ Prediction: Factor w/ 2 levels "1","2": 2 1 1 2 2 1 1 2 1 1 ...

The Defaulter variable, which already exists in the loan dataset, has levels 0 and 1, where 0 denotes a non-defaulter and 1 denotes a defaulter.

The Prediction variable, which is created and populated by our R code, has levels 1 and 2, where 1 denotes a non-defaulter and 2 denotes a defaulter.

Due to this mismatch, a confusion matrix cannot be created.
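
A quick way to confirm such a mismatch before calling confusionMatrix is to compare the level sets directly. The sketch below uses only base R and a small stand-in data frame called loan_demo; both the name and the values are made up for illustration and are not the original loan data.

```r
# Hypothetical stand-in for the loan data; values are illustrative only
loan_demo <- data.frame(
  Defaulter  = factor(c("1", "1", "0", "1", "0"), levels = c("0", "1")),
  Prediction = factor(c("2", "1", "1", "2", "2"), levels = c("1", "2"))
)

# Inspect the level sets of the two factors
print(levels(loan_demo$Defaulter))
print(levels(loan_demo$Prediction))

# TRUE only when both factors have the same levels in the same order
same_levels <- identical(levels(loan_demo$Defaulter),
                         levels(loan_demo$Prediction))

# Levels present in Prediction but missing from Defaulter
extra_levels <- setdiff(levels(loan_demo$Prediction),
                        levels(loan_demo$Defaulter))

print(same_levels)
print(extra_levels)
```

If same_levels is FALSE, the levels need to be aligned before a confusion matrix can be built.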

Fix:
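The required information is all there; it is just encoded with mismatched labels.

The levels of the Prediction variable are changed to 0 and 1 using “levels”, as below. In the list, each name is the new level and each value is the old level it replaces.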

> levels(loan$Prediction) <- list("0" = "1", "1" = "2")

After this correction, “str” on the data frame shows matching levels in the Defaulter and Prediction variables.

> str(loan)
'data.frame': 15 obs. of 3 variables:
 $ Customer : Factor w/ 15 levels "A","B","C",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Defaulter : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 2 2 1 ...
 $ Prediction: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 1 1 ...
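
Since relabelling silently changes what each code means, it can be worth checking the mapping on a small example first. The following base R sketch uses a made-up vector (pred_old is a hypothetical name, not part of the original code). It applies the same list-style rename, rebuilds the factor a second way with factor(levels =, labels =), and cross-tabulates old against new values.

```r
# Made-up predictions coded 1 = non-defaulter, 2 = defaulter
pred_old <- factor(c("2", "1", "1", "2", "2"), levels = c("1", "2"))

# Option A: rename levels in place (new name = old name)
pred_a <- pred_old
levels(pred_a) <- list("0" = "1", "1" = "2")

# Option B: rebuild the factor, pairing old levels with new labels
pred_b <- factor(pred_old, levels = c("1", "2"), labels = c("0", "1"))

# Both options should carry the same levels and the same values
stopifnot(identical(levels(pred_a), levels(pred_b)))
stopifnot(identical(as.character(pred_a), as.character(pred_b)))

# Cross-tabulate old vs. new codes to verify the mapping
print(table(old = pred_old, new = pred_a))
```

Every old "1" should land in the new "0" column and every old "2" in the new "1" column; anything else means the mapping was written the wrong way around.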

The confusionMatrix function now produces the desired result.

> confusionMatrix(loan$Defaulter, loan$Prediction)
Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 5 2
         1 3 5
                                          
               Accuracy : 0.6667          
                 95% CI : (0.3838, 0.8818)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 0.2201          
                                          
                  Kappa : 0.3363          
 Mcnemar's Test P-Value : 1.0000          
                                          
            Sensitivity : 0.6250          
            Specificity : 0.7143          
         Pos Pred Value : 0.7143          
         Neg Pred Value : 0.6250          
             Prevalence : 0.5333          
         Detection Rate : 0.3333          
   Detection Prevalence : 0.4667          
      Balanced Accuracy : 0.6696          
                                          
       'Positive' Class : 0
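
One caveat when reading this output: caret documents the signature as confusionMatrix(data, reference), where data holds the predicted classes and reference holds the true classes. The call above passes Defaulter first, so the rows labelled Prediction actually hold the Defaulter values. Also, the 'Positive' class defaults to the first level, which here is 0 (non-defaulter). The sketch below shows the call with named arguments. It uses made-up values rather than the original loan data, assumes the caret package is installed (the call is skipped otherwise), and sets positive = "1" on the assumption that the defaulter class is the one of interest.

```r
# Made-up data for illustration; not the original loan dataset
truth <- factor(c("1", "1", "0", "1", "0", "0"), levels = c("0", "1"))
pred  <- factor(c("1", "0", "0", "1", "1", "0"), levels = c("0", "1"))

# Base R cross-tabulation: rows are predictions, columns are true classes
tab <- table(Prediction = pred, Reference = truth)
print(tab)

# With caret available, name the arguments so the roles are unambiguous
if (requireNamespace("caret", quietly = TRUE)) {
  cm <- caret::confusionMatrix(data = pred, reference = truth, positive = "1")
  print(cm$table)
}
```

Swapping the two arguments transposes the table, which in turn swaps sensitivity with positive predictive value and specificity with negative predictive value.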

2 thoughts on “confusionMatrix function in R – The data contain levels not found in the data”

  1. Hello Arun,

    I am getting an error in R as follows:
    > confusionMatrix(as.factor(predicted), as.factor(Netflix.train$type))
    Error in confusionMatrix.default(as.factor(predicted), as.factor(Netflix.train$type)) :
    The data must contain some levels that overlap the reference.

    I checked the levels and they are different. How to change it?
    > levels(as.factor(Netflix.train$type))
    “1” “2”
    > levels(as.factor(predicted))
    “Movie” “TVShow”

    Thanks,
    Mohamed Asfar


    1. Hi Mohamed,

      Do you have a line of code that converts predicted probabilities into labels? I believe you are passing the labels “Movie” and “TVShow” somewhere in your R code and assigning them to certain probability ranges.

      Let me assume 1 maps to Movie and 2 maps to TVShow. Please try the following commands; the first one makes sure that predicted is a factor. You can just swap the values if the mapping is the other way around.

      > predicted <- as.factor(predicted)
      > levels(predicted) <- list("1" = "Movie", "2" = "TVShow")

      Then try your confusion matrix again.

      > confusionMatrix(as.factor(predicted), as.factor(Netflix.train$type))
