Understanding Model Performance in the Face of Uncertainty


Introduction

As a data scientist, I’ve tackled projects in diverse fields, from biomedical science to search engine analysis. These projects often involve complex tasks like ranking algorithms, image analysis, and natural language processing. A common challenge across these fields is defining what “done” means for each project.

What “done” means varies widely. Sometimes it’s straightforward, requiring only a basic model; other times it demands state-of-the-art performance, or it may simply be impossible. Knowing upfront what is achievable would be invaluable, especially in the more challenging cases. Fortunately, there are ways to estimate this.

 

Defining “Done” in Model Performance

One approach is to consult resources like paperswithcode.com, Google Scholar, or arXiv.org to gauge state-of-the-art performance for a given task. However, these benchmarks are often achieved under ideal conditions: extensive datasets, well-defined categories, or target values with low uncertainties. In real-world scenarios, especially when dealing with less distinct categories (like differentiating between dogs, wolves, coyotes, and jackals) or abstract constructs (like various dog breeds), the achievable performance may be significantly lower than for more straightforward tasks (e.g., distinguishing between dogs and cats).

[Images: a whippet (https://pixabay.com/photos/whippet-dog-canine-pet-portrait-100331/) and a greyhound (https://pixabay.com/photos/greyhound-greyhounds-4848945/)]

What are the breeds of these dogs? What do those categories mean when there are mixed breeds or other complications?

 

The Role of Uncertainty in Data

I have experienced the challenge of classifying abstract constructs on multiple client projects. One client had a world-class dataset of 3D image data labeled by experts. The data was expensive to collect and time-consuming to label, and, to compound this, the experts often disagreed with one another. Each image was labeled by three experts. If two or more of the experts agreed on the classification, that label was accepted; if all three disagreed, the most senior expert’s choice was taken. Does this mean that the minority, or the more junior labelers, were wrong? Or is the truth more complicated: that classes are often a continuum grouped into buckets for our convenience when talking to each other?
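The resolution rule described above is simple enough to sketch in code. The function below is a hypothetical illustration of that labelling protocol, not the client’s actual pipeline, and the class names are invented:

```python
from collections import Counter

def resolve_label(labels, seniority):
    """Resolve three expert labels: accept the majority choice if two or more
    agree, otherwise fall back to the most senior expert's label."""
    label, count = Counter(labels).most_common(1)[0]
    if count >= 2:
        return label
    # Three-way disagreement: defer to the most senior labeler.
    most_senior = max(range(len(labels)), key=lambda i: seniority[i])
    return labels[most_senior]

# Invented example: all three experts disagree, so the most senior (index 2) wins.
print(resolve_label(["class_a", "class_b", "class_c"], seniority=[1, 2, 3]))
# -> "class_c"
```

Writing the rule down like this makes the underlying question very visible: the function always returns an answer, even when the evidence says the “true” class is genuinely ambiguous.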

The challenge of defining “done” is compounded by the inherent uncertainty in data. When measuring quantities like weight, volume, or voltage, you never obtain a perfect value, because of the limitations of measurement tools. A ruler might read (16 ± 0.5) mm. That ±0.5 mm is the measurement’s uncertainty and describes the probability distribution of values the ruler could give. Uncertainty applies to all measurements, including quantities derived from data, such as the number of apples in a truck or the ages recorded in a spreadsheet. Measurement uncertainties have implications for model performance: if you have two models with performance scores of 85.324% and 85.329%, the difference is negligible if the uncertainty in those percentages exceeds 0.005% (it probably does!).
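As a rough sanity check, you can approximate the uncertainty in an accuracy score by treating each test prediction as an independent Bernoulli trial. The sketch below assumes a hypothetical test set of 10,000 examples and uses the two scores from the paragraph above:

```python
import numpy as np

def accuracy_uncertainty(accuracy, n_samples):
    """One-standard-error uncertainty on an accuracy estimate, treating each
    test prediction as an independent Bernoulli trial."""
    return np.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# Hypothetical test set of 10,000 examples and the two scores from the text.
n = 10_000
for acc in (0.85324, 0.85329):
    print(f"accuracy = {acc:.5f} ± {accuracy_uncertainty(acc, n):.5f}")
# Both come out around 0.853 ± 0.004, so the 0.00005 gap between the two
# models is far smaller than the statistical uncertainty on either score.
```

With a test set of this size, both scores are roughly 85.3% ± 0.4%, so the quoted gap of 0.005 percentage points carries no real information.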

 

The Challenge of Classifying Abstract Constructs

This may seem like a somewhat academic distinction, but imagine you have developed an exciting new deep-learning-based model, with features extracted from an LLM’s output, that will need constantly running GPUs for inference. Imagine your model outperforms the system it is intended to replace on your test set. Great, spin up the GPUs and retire the old system. Except, what if the uncertainty in the new model’s performance is such that it cannot be distinguished from the previous model’s? Is it worth spending that money on GPUs if you cannot be sure that your model is statistically significantly better than what came before?
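Before committing to the GPU bill, it is worth checking whether the improvement is statistically distinguishable at all. One simple way is a paired bootstrap over per-example correctness on the shared test set; the vectors, test-set size, and accuracy levels below are all invented for illustration (a McNemar test on the same paired predictions would be another standard option):

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy_gap_ci(correct_old, correct_new, n_boot=10_000):
    """Paired bootstrap: resample test examples with replacement and recompute
    the accuracy difference (new minus old) each time."""
    correct_old = np.asarray(correct_old, dtype=float)
    correct_new = np.asarray(correct_new, dtype=float)
    n = len(correct_old)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        gaps[b] = correct_new[idx].mean() - correct_old[idx].mean()
    return np.percentile(gaps, [2.5, 97.5])

# Invented per-example 0/1 correctness vectors for the old and new models.
old_correct = rng.random(5_000) < 0.85
new_correct = rng.random(5_000) < 0.86
low, high = accuracy_gap_ci(old_correct, new_correct)
print(f"95% CI for the accuracy gain: [{low:+.4f}, {high:+.4f}]")
# If this interval includes zero, the new model is not distinguishably better,
# and the extra GPU spend is hard to justify on this evidence alone.
```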

The length of the word ProCogia in this sign is (73±1) mm

 

Measuring Model Uncertainty

Estimating uncertainties in models derived from uncertain data is a crucial yet often overlooked step. For instance, if you’re training a model to predict current from voltage, you might perform a linear regression. The uncertainty in the fitted gradient can be estimated using error-propagation techniques. Going further, we can estimate an upper bound on model performance using the Cramér-Rao bound, which tells us how well even a perfect model could do when trained on data with a given uncertainty.

Where should the line of best fit go here? There is uncertainty in the best gradient and so in the model performance.
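For the voltage-to-current example, the gradient’s uncertainty can be read straight off the covariance matrix of a least-squares fit. A minimal sketch, with made-up measurements and noise level:

```python
import numpy as np

# Invented voltage sweep with noisy current measurements (Ohm's-law-like data).
rng = np.random.default_rng(1)
voltage = np.linspace(0.0, 5.0, 20)
current = 0.5 * voltage + rng.normal(0.0, 0.05, voltage.size)

# Least-squares line fit; cov=True also returns the parameter covariance matrix.
(gradient, intercept), cov = np.polyfit(voltage, current, deg=1, cov=True)
gradient_err = np.sqrt(cov[0, 0])   # one-sigma uncertainty on the gradient

print(f"gradient = {gradient:.3f} ± {gradient_err:.3f} A/V")
```

Reporting the gradient as a value with an uncertainty, rather than a bare number, is exactly the habit that carries over to reporting model performance.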

 

Practical Examples: Client Project Experiences

In another client project, we had just this problem. We were trying to train a machine learning model to predict what a gradient would be based on initial conditions at the first point in the series. The gradients we were using to train our model were calculated from measurements taken with machines that had a 10% measurement uncertainty. This led to uncertainty in the target variable (the gradient calculated from these uncertain measurements) which, according to the Cramér-Rao bound, limited the performance of the overall model to 35%. This allowed us to frame our results fairly.
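The client’s actual limit came from the Cramér-Rao calculation, but the underlying idea can be illustrated with a quick Monte Carlo sketch: add the measurement noise to a known relationship and see how well even a “perfect” model could score against the noisy labels. The functional form, noise level, and metric below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented ground truth: the target is fully determined by the feature.
x = rng.uniform(1.0, 10.0, 50_000)
true_target = 2.0 * x + 3.0

# Simulate a 10% relative measurement uncertainty on the recorded target.
measured_target = true_target * (1.0 + rng.normal(0.0, 0.10, x.size))

# A "perfect" model recovers true_target exactly, but it is scored against the
# noisy measurements, because those are the only labels we ever observe.
residual = measured_target - true_target
r2_ceiling = 1.0 - residual.var() / measured_target.var()
print(f"R^2 ceiling for a perfect model: {r2_ceiling:.2f}")
```

However good the model, its measured score cannot beat this ceiling on average, which is the kind of limit the Cramér-Rao argument formalises.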

 

Implications of Model Uncertainty for Decision-Making

Further, consider a scenario where you want to predict an adult’s height based on their weight as a baby and their daily milk consumption. The challenge is not just that you do not know the exact relationship between these variables, but that you do not know the roles played by genetics, microbiome composition, exercise, and exposure to chemicals. To estimate the model’s performance and its uncertainty, you can simulate different scenarios, varying the relationships and the precision of the known data. This is something I have done with clients before: generating synthetic data to simulate the uncertainty in the data we are training on, and then training models on that limited information. Presenting these studies helped frame the expectations of the project and defined what ‘done’ was in that case.
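A minimal version of such a simulation study might look like the sketch below, where the fraction of target variance carried by unobserved factors (genetics, microbiome, and so on) is a knob we turn; every number and variable name here is illustrative rather than taken from a real project:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

def best_achievable_r2(unobserved_fraction):
    """Simulate a target driven partly by observed features and partly by
    unobserved factors, then score the best possible model that only sees
    the observed part."""
    observed = rng.normal(size=n)     # e.g. baby weight, milk consumption
    unobserved = rng.normal(size=n)   # e.g. genetics, microbiome, exercise
    target = (np.sqrt(1.0 - unobserved_fraction) * observed
              + np.sqrt(unobserved_fraction) * unobserved)
    best_prediction = np.sqrt(1.0 - unobserved_fraction) * observed
    residual = target - best_prediction
    return 1.0 - residual.var() / target.var()

for frac in (0.1, 0.3, 0.5, 0.7):
    print(f"unobserved share {frac:.0%} -> best achievable R^2 ≈ "
          f"{best_achievable_r2(frac):.2f}")
```

Sweeping the unobserved share shows how quickly the achievable performance drops, and presenting that curve to stakeholders is a concrete way to set expectations before any real model is trained.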

 

Conclusion

This blog post touched on a few different things: uncertainty, what it means, how uncertainties combine, and what the implications are for model performance. I hope that in reading it you have seen that if you do not account for uncertainty, you may overspend on resources you don’t need, commit to projects that cannot deliver the desired performance when you could have known that from the start, or believe a claim that is a bit too good to be true. This highlights the need for thorough scoping in a project’s initial phases. Done well, this scoping clarifies what is possible and how improvement will be measured, preventing frustration for everyone involved and leading to more successful projects.

Interested in learning more about data science and its real-world applications? Check out our Data Science Blogs for more insights, expert tips, and success stories to help you stay ahead in the world of data and AI.

ProCogia would love to help you tackle the problems highlighted above. Let’s have a conversation! Fill in the form below or click here to schedule a meeting.