Recap of Uncertainty in Discrete Classes
In my previous blog post, I described basic uncertainty handling and how some relatively simple analysis can help us understand the limits of model performance for regression problems. Knowing those limits allows for precise project scoping (if the fundamental performance limit is 80%, you aren’t going to do better). Today I am going to continue this thread, focusing instead on discrete classes.
As a data science consultant, I have been asked to build models that classify the attractiveness of photos, determine pathologies in medical images, and categorize descriptions of relationships between people. The problem with these tasks is that they are not concretely discrete; there is a huge amount of subjectivity in them. That subjectivity is not necessarily due to taste (as in the case of a photo’s attractiveness), but may come from the school of thought you were trained in (pathologies in medical images) or your cultural background. The made-up graph below, showing how the perception of temperature might vary between people, tries to illustrate this problem.

What I might want before I start a project is to know how well I could do when my classes overlap, i.e., when they are uncertain. The minimum achievable error of an ideal classifier is the Bayes error rate. If we knew analytic expressions for the uncertainty in our classes, we could calculate the overlap between them, and from that overlap the Bayes error rate, which sets the maximum possible performance on our problem. Unfortunately, we very rarely have the analytic distributions in practice. Instead, I will investigate how you might estimate this limit using simulated data.
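To make the analytic case concrete, here is a minimal sketch of computing the Bayes error rate when we do know the class distributions. All numbers are made up: I assume two classes ("cold" and "hot") with known 1-D Gaussian likelihoods over temperature and equal priors.

```python
import numpy as np
from scipy.stats import norm

# Two hypothetical classes with known 1-D Gaussian likelihoods over
# temperature (deg C) and equal priors. All parameters are made up.
xs = np.linspace(-10, 40, 10_000)          # temperature grid
p_cold = norm.pdf(xs, loc=15, scale=4)     # p(x | cold)
p_hot = norm.pdf(xs, loc=24, scale=4)      # p(x | hot)

# The Bayes-optimal classifier picks the class with the higher posterior;
# with equal priors, its error is half the overlap of the two likelihoods.
dx = xs[1] - xs[0]
bayes_error = 0.5 * np.minimum(p_cold, p_hot).sum() * dx
print(f"Bayes error rate: {bayes_error:.3f}")   # about 0.130 here
```

For these particular Gaussians, even a perfect classifier misclassifies roughly 13% of cases, because the two likelihoods genuinely overlap.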
How to Quantify Uncertainty in Discrete Classes
In the ProCogia office, we have a few people who come in by bike, some who take the bus, and some who walk. In winter, those who walk are happy with the office temperature, those who take the bus think it’s cold, and those who cycle think it’s too hot. The office, however, is the same temperature everywhere: depending on prior conditions, people report different feelings. How can we quantify those? We can estimate probability distributions for the different labels given an underlying temperature.
Now imagine we don’t know how people got to the office; in fact, we don’t know anything about the people labelling the data (this is normally the case). We would simply have a set of classes whose distributions depend on some underlying vector. This is very often the situation in machine learning, data science, and AI.
How can we quantify this? If I ask 3, 10, or 50 people whether the office is cold, chilly, normal, warm, or hot, I am going to get a distribution of responses. I could keep the cases where there is a clear winner and discard those where there is not, or I could collect everyone’s opinion and take a majority vote. The problem with those methods is that they discard a lot of useful information. We already know that how a room feels is a vague, uncertain idea; if we embrace that uncertainty, we can learn a lot. What we are really trying to learn is the distribution of our classes given an underlying state vector (temperature, mode of transport, etc.). For our case, we can simplify and say that it depends on temperature alone. Based on the responses we collect at each temperature, we can construct an approximate distribution for each class at each temperature. It might look something like this:
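Turning survey responses into per-temperature distributions is just normalising vote counts. A small sketch, with entirely made-up counts:

```python
import numpy as np

# Hypothetical survey: at each temperature we asked several people how
# the office feels. These vote counts are made up for illustration.
labels = ["cold", "normal", "hot"]
votes = {            # temperature (deg C) -> [cold, normal, hot] counts
    17: [8, 2, 0],
    20: [3, 6, 1],
    23: [0, 5, 5],
    26: [0, 1, 9],
}

# Normalising the counts gives an empirical class distribution per
# temperature: our estimate of p(label | temperature).
p_label = {t: np.array(c) / sum(c) for t, c in votes.items()}
for t, p in p_label.items():
    print(t, dict(zip(labels, p.round(2))))
```

Unlike a majority vote, this keeps the full distribution: at 23 °C we record a genuine 50/50 split between "normal" and "hot" rather than an arbitrary winner.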

Note: I made up this data
Simulating Uncertain Classes – Results
For our temperature example, we can now construct plausible probability distributions over our underlying vector. If we had a dataset that looked like this, how well could we do? We can see that there is no way we would ever get 100% correct: there are cases in this distribution where we would have to make an educated guess (or roll a die).
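We can put a number on that ceiling. Given class probabilities tabulated at a set of temperatures (all values below are made up, and I assume temperatures occur uniformly), the best any classifier can do is always pick the most likely label; the leftover probability mass is the Bayes error rate.

```python
import numpy as np

# Made-up p(label | temperature) on a coarse grid, mirroring the plot.
# Columns are "cold", "normal", "hot"; each row sums to one.
temps = np.array([15, 18, 21, 24, 27])
p = np.array([
    [0.90, 0.10, 0.00],
    [0.55, 0.40, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.40, 0.55],
    [0.00, 0.10, 0.90],
])

# An ideal classifier picks the most likely label at each temperature;
# averaging the leftover probability (assuming temperatures are equally
# likely) gives the Bayes error rate for this grid.
bayes_error = 1.0 - p.max(axis=1).mean()
print(f"Bayes error rate on this grid: {bayes_error:.2f}")  # 0.26
```

So even a perfect model of these particular distributions would be wrong about a quarter of the time: that is the limit the overlap imposes, not a flaw in the model.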

That’s okay – sometimes the real world is probabilistic! But we can construct synthetic datasets that mirror our constructed probability distributions to hypothesize how well we could do if we surveyed many more people. We choose an array of temperatures and, at each temperature, roll a die to decide which feeling (e.g., ‘cold,’ ‘normal,’ or ‘hot’) is reported, weighted by the likelihood of each class at that temperature. This replicates the real-world uncertainty in the classes, where the same temperature can be perceived differently by different individuals. We can do that for a hundred, a thousand, or a million different temperatures if we like. Hopefully, as we increase the size of the dataset, we approach a limit beyond which the performance of the model no longer improves.
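The "roll a die at each temperature" step can be sketched as follows. The class probability curves here are invented stand-ins (a softmax over hypothetical class centres), not the real survey distributions:

```python
import numpy as np

rng = np.random.default_rng(42)

def class_probs(t):
    # Made-up p(label | temperature) for "cold", "normal", "hot";
    # any smooth, overlapping curves would do for this sketch.
    centres = np.array([16.0, 21.0, 26.0])
    scores = -0.5 * ((np.atleast_1d(t)[:, None] - centres) / 2.5) ** 2
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def simulate(n):
    # Draw n temperatures, then "roll a die" at each one: sample a label
    # according to the class probabilities at that temperature.
    ts = rng.uniform(14, 28, size=n)
    probs = class_probs(ts)
    labels = np.array([rng.choice(3, p=p) for p in probs])
    return ts.reshape(-1, 1), labels

X, y = simulate(1_000)
print(X.shape, np.bincount(y))
```

Because labels are sampled rather than assigned deterministically, nearby temperatures can legitimately carry different labels, exactly the ambiguity we want the synthetic dataset to preserve.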

Rating image attractiveness or assessing medical images are subjective tasks that involve overlapping, uncertain classes. We want to understand the performance limitations that these overlapping classes create. The synthetic dataset we’ve generated allows us to explore how a classifier might perform given this uncertainty.
I generated thirteen different-sized datasets and trained multi-class logistic regression models on them. To train and assess the models I used five-fold cross-validation: I split each dataset into five parts and trained the model five times, each time holding out a different part for evaluation. I took the mean and standard deviation of the scores to understand how the performance of my models varied.
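A condensed version of that experiment (using the same kind of made-up class probabilities as before, and scikit-learn for the model and cross-validation) might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def simulate(n):
    # Synthetic "office temperature" data: sample a temperature, then a
    # label (0="cold", 1="normal", 2="hot") from made-up overlapping
    # class probabilities at that temperature.
    ts = rng.uniform(14, 28, size=n)
    centres = np.array([16.0, 21.0, 26.0])
    scores = -0.5 * ((ts[:, None] - centres) / 2.5) ** 2
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(3, p=p) for p in probs])
    return ts.reshape(-1, 1), labels

# Train multi-class logistic regression on growing datasets; five-fold
# cross-validation gives a mean accuracy and its spread at each size.
for n in [100, 1_000, 10_000]:
    X, y = simulate(n)
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(f"n={n:>6}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

As n grows, the mean accuracy should level off well below 100%, and the fold-to-fold spread should shrink: that plateau is the empirical estimate of the performance ceiling.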
Looking at the results, we see that the performance is a bit erratic before it settles down, and the spread in the final performance is negligible. This final performance is a good indication of how well we could do if we had an ideal dataset and a perfect state vector for our system.
Uncertain Classes – How to Handle in the Real World
Why would we want to do this? Imagine it is a picture we are trying to assess as hideous, ugly, neutral, pretty, or outstanding. We can imagine each image will get a range of responses. Some will get all of one, some will get some of two, and some will get a big mixture. The better we can understand the uncertainty of those classes the better we can appreciate limitations in the performance of our model before we start.
If we can imagine some underlying vector space in which each picture can be placed such that we can state the probability of each class being chosen, then that is exactly the representation our model is trying to learn. Constructing that space is not easy. What we are doing here, by looking at the distribution of classes, is estimating how well we could do in an idealized case where we are able to build a perfect vector representation of each of our images. Knowing that idealized performance informs how we carry out the project, the funding we allocate, and when we decide we have done a good enough job.
This exercise takes some time and thought. It requires gathering a reasonable amount of data to scope out the maximum possible performance of your project before you begin the work in earnest. If you’re spending large sums of money training an LLM or a deep learning model, or gathering very expensive medical imaging data, however, you may find the reassurance and the understanding of what is possible well worth it.
Conclusion
By simulating uncertain classes, we can approximate the Bayes error rate and start to understand the theoretical limits of model performance. Finding the Bayes error rate analytically requires knowing the actual probability distributions of the various classes over our underlying state vector, which we won’t know in practice (most of the time, anyway). Something like our approach is therefore necessary if we want an idea of the limits of model performance. Understanding these limits for real-world data helps set realistic expectations for our projects and supports more informed decisions about resource allocation. Such informed decisions are crucial for successful model development, be it for image recognition, text classification, or medical image analysis.
Interested in learning more about data science and its real-world applications? Check out our Data Science Blogs for more insights, expert tips, and success stories to help you stay ahead in the world of data and AI.