Evaluating Generative AI models
Evaluating the performance of Generative AI models, particularly in applications such as chatbots, presents a unique set of challenges for Data Scientists. Unlike traditional machine learning tasks where performance can be quantified through metrics such as mean squared error or F1 score, assessing the quality of generated text requires an understanding of language, context and often knowledge of the specific field of application.
In this article, we discuss the advantages and limitations of human evaluation, algorithmic metrics such as ROUGE and BLEU, and finally the emerging art of using language models to evaluate their own output.
The Role of Generative AI
Consider the example of a chatbot that has been developed using a Generative AI model, perhaps with prompt engineering, fine-tuning or retrieval-augmented generation (RAG). The purpose of the chatbot is to act as an assistant to employees at an enterprise, providing useful information such as onboarding instructions and company policies and procedures, and searching the company’s knowledge base.
Evaluating Chatbot Performance
During development, it’s important to have a method of evaluating the performance of the chatbot, so that its design can be iteratively improved through experimentation and fine-tuning. Equally, in production it is necessary to continuously monitor the chatbot to ensure that it is operating effectively in the real world, where usage patterns may differ from initial assumptions.
Defining Evaluation Objectives
As with all data science projects, it is important to define our objectives before considering Generative AI evaluation methods. Depending on the application, desirable qualities of a chatbot may include:
- Engagingness: Is the chatbot enjoyable and interesting to interact with, leading to a higher number of active users?
- Factual accuracy: Is everything the chatbot says correct, or does it hallucinate?
- Personalization: Can the chatbot tailor responses to individual users based on their preferences or past interactions?
- Privacy and Security: How well does the chatbot handle and protect user data and maintain confidentiality?
- Multilingual support: Can the chatbot communicate effectively in multiple languages, if required?
Algorithmic and Human Evaluation Methods
To perform consistent and repeatable evaluation, a list of questions or inputs may be produced. In some cases, reference responses to these inputs may be available. For the employee assistant chatbot, an evaluation example might be:
Input | What is the company’s policy for remote working? |
Chatbot response for evaluation | Employees are permitted to work remotely up to three days per week, subject to manager approval. Please refer to the company’s remote work policy document for further details. |
Reference response | Staff members are allowed to telecommute for a maximum of three days weekly, pending approval from their supervisors. For more information, please consult the official remote work policy. |
How might we assess the quality of the chatbot’s response against the reference? This task predates the recent advances in Generative AI, and several scoring methods have emerged from various fields, particularly machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) are two such metrics, designed to evaluate an automatically produced summary or translation against a human-produced reference. Both methods make use of n-gram overlap: the extent to which sequences of n words in two texts match each other.
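The idea of n-gram overlap can be sketched in a few lines of Python. This is a simplified illustration only: the tokenizer and the helper names `tokenize`, `ngram_counts` and `ngram_overlap` are assumptions made for this example, and real ROUGE and BLEU implementations add stemming, brevity penalties and other details, so the numbers printed here should not be expected to match any particular library.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: str, reference: str, n: int = 1) -> dict:
    """Return clipped n-gram matches with precision, recall and F1."""
    cand = ngram_counts(tokenize(candidate), n)
    ref = ngram_counts(tokenize(reference), n)
    # Counter intersection keeps the smaller count of each shared n-gram,
    # so a repeated word cannot be matched more often than it appears.
    matches = sum((cand & ref).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return {"matches": matches, "precision": precision, "recall": recall, "f1": f1}


response = (
    "Employees are permitted to work remotely up to three days per week, "
    "subject to manager approval. Please refer to the company's remote work "
    "policy document for further details."
)
reference = (
    "Staff members are allowed to telecommute for a maximum of three days "
    "weekly, pending approval from their supervisors. For more information, "
    "please consult the official remote work policy."
)

# Compare single words, word pairs and four-word sequences.
for n in (1, 2, 4):
    print(n, ngram_overlap(response, reference, n))
```

Precision is the quantity BLEU builds on, while recall is the basis of ROUGE, which is why both are returned here.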
These algorithmic methods have the obvious limitation that they will penalize answers that use synonyms or equivalent sentences that convey the same semantic meaning with different structures. Nonetheless, BLEU scores have been found to correlate highly with human judgements of translation quality, and have consequently been used widely in research. ROUGE is a similar metric designed to measure the quality of summarizations, another common task for large language models (LLMs). Regardless of their initially intended application, both of these metrics give useful information when comparing chatbot responses to a reference example, albeit with the limitations discussed due to their simplicity.
The chatbot’s response in the evaluation example achieves a BLEU score of 0.0 and a ROUGE-1 score of 0.3846 (scores above 0.5 are generally considered high quality). Comparing the response with the reference shows why: although the response is desirable and offers the correct information, few of the words chosen are exact matches. In its standard unsmoothed form, BLEU also combines n-gram precisions for sequences of up to four words, so a response that shares no four-word sequence with the reference scores zero.
In this situation, we might resort to human-provided evaluations against a scoring rubric. Human judgement of model output is also what OpenAI relied on when they employed Reinforcement Learning from Human Feedback (RLHF) to great success with ChatGPT. Whilst these human evaluations are evidently very powerful, they are expensive and slow to obtain compared with algorithmic methods. It may also be difficult to obtain access to human evaluators from the same population as the target audience of the product, with a risk of bias introduced if the two populations differ.
The Emergence of Self-Evaluation by Language Models
An emerging third option is to use LLMs themselves to perform the evaluation. Prompted with a scoring rubric and instructed to provide structured scoring output, LLMs can evaluate responses at a cost orders of magnitude lower than human evaluators, and far more quickly. Once an evaluation pipeline has been built, this allows for rapid iterative changes to the model with immediate quantitative evaluation. Whilst it is important to be aware of the limitations of these models and of the concerns raised about biased evaluations, used with caution they can be a useful and effective tool.
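To make the pattern concrete, the sketch below shows one way a rubric prompt and structured output might fit together. Everything in it is an assumption made for illustration: the shortened `RUBRIC`, the helper names, and the `call_llm` parameter, which stands in for whichever model client is used. A stub replaces the real model so the example runs offline.

```python
import json
from typing import Callable

# Illustrative, shortened rubric (hypothetical wording).
RUBRIC = (
    "Score 1: the response contradicts the reference.\n"
    "Score 3: the response is partly correct but has omissions.\n"
    "Score 5: the response conveys the same facts as the reference."
)


def build_judge_prompt(question: str, response: str, reference: str) -> str:
    """Assemble the grading prompt sent to the evaluating LLM."""
    return (
        "You are grading a chatbot response against a reference answer.\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        f"Reference: {reference}\n\n"
        'Reply with JSON only, e.g. {"score": 3, "justification": "..."}'
    )


def parse_judgement(raw: str) -> dict:
    """Validate the model's JSON reply and return the score and justification."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"score": score, "justification": str(data.get("justification", ""))}


def judge(question: str, response: str, reference: str,
          call_llm: Callable[[str], str]) -> dict:
    """Grade one response; call_llm is any function mapping a prompt to the model's text reply."""
    return parse_judgement(call_llm(build_judge_prompt(question, response, reference)))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without network access.
    return '{"score": 4, "justification": "Matches the reference."}'


print(judge("What is the remote working policy?", "Up to three days per week.",
            "A maximum of three days weekly.", fake_llm))
```

Validating the reply before using it matters in practice, since a model may occasionally return malformed JSON or a score outside the rubric.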
Evaluating a Corporate Chatbot with MLflow
MLflow recently extended its Evaluate API to support evaluation with LLMs. The API provides several built-in metrics, as well as a framework for designing custom metrics with definitions and grading rubrics to suit any task. A custom metric can be defined as follows:
import pandas as pd
import mlflow
from mlflow.metrics.genai import make_genai_metric

# Custom LLM-judged metric: the definition and rubric are passed to the judge model.
my_metric = make_genai_metric(
    name="answer_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth. Scores can be assigned based on "
        "the degree of semantic similarity and factual correctness of the provided output "
        "to the provided targets, where a higher score indicates higher degree of accuracy."
    ),
    # Adjacent string literals are concatenated, so each rubric entry ends with "\n"
    # to keep the scores on separate lines in the prompt.
    grading_prompt=(
        "Answer correctness: Below are the details for different scores:\n"
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets.\n"
        "- Score 2: The output demonstrates some degree of semantic similarity and "
        "includes partially correct information. However, the output still has significant "
        "discrepancies with the provided targets or inaccuracies.\n"
        "- Score 3: The output addresses a couple of aspects of the input accurately, "
        "aligning with the provided targets. However, there are still omissions or minor "
        "inaccuracies.\n"
        "- Score 4: The output is mostly correct. It provides mostly accurate information, "
        "but there may be one or more minor omissions or inaccuracies.\n"
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    version="v1",
    model="openai:/gpt-4-turbo-preview",
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},  # deterministic grading
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)
Once a metric has been defined, the chatbot’s responses should be placed in a pandas DataFrame alongside their associated inputs and reference answers. MLflow handles prompting of the LLM (OpenAI’s GPT-4 Turbo in this case), as well as parsing and collection of responses into a convenient structured format. The code below evaluates our single example, but it would be trivial to extend the evaluation dataset to a list of any number of test cases and reference responses.
# One row per test case: the question, the reference answer and the chatbot's response.
eval_data = pd.DataFrame({
    "inputs": ["What is the company’s policy for remote working?"],
    "context": ["Staff members are allowed to telecommute for a maximum of three days weekly, pending approval from their supervisors. For more information, please consult the official remote work policy."],
    "predictions": ["Employees are permitted to work remotely up to three days per week, subject to manager approval. Please refer to the company's remote work policy document for further details."],
})

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data,
        targets="context",          # column holding the reference answers
        predictions="predictions",  # column holding the chatbot's responses
        extra_metrics=[my_metric],
    )

# Per-row scores and justifications returned by the judge model.
eval_table = results.tables["eval_results_table"]
This yields the following output:
inputs | context | predictions | answer_correctness/v1/score | answer_correctness/v1/justification |
What is the company’s policy for remote working? | Staff members are allowed to telecommute for a maximum of three days weekly, pending approval from their supervisors. For more information, please consult the official remote work policy. | Employees are permitted to work remotely up to three days per week, subject to manager approval. Please refer to the company’s remote work policy document for further details. | 5 | The output accurately reflects the provided targets, maintaining semantic similarity and factual correctness without any noticeable discrepancies or inaccuracies. |
Along with an integer score, the LLM returns a freeform justification of its scoring choice, which can assist with refinement of the metric description and scoring system, as well as providing feedback that can be considered when iterating on the design of the chatbot itself. Where several evaluation examples are available, summary statistics can be used to compare the distribution of scores between iterations, providing a quantitative measure of improvement.
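As a rough illustration of comparing score distributions between iterations, the sketch below computes the same three aggregations requested in the metric definition (mean, variance and 90th percentile) for two sets of scores. The scores are invented purely for illustration and do not come from any real evaluation run; the helper names, the use of population variance and the interpolated percentile are also assumptions.

```python
import statistics


def summarize(scores: list[int]) -> dict:
    """Summarize a list of 1-5 judge scores."""
    return {
        "mean": statistics.mean(scores),
        # Population variance is an assumption; sample variance is equally valid.
        "variance": statistics.pvariance(scores),
        # 90th percentile with linear interpolation between data points.
        "p90": statistics.quantiles(scores, n=10, method="inclusive")[8],
    }


def compare(before: list[int], after: list[int]) -> dict:
    """Return the change in each summary statistic from one iteration to the next."""
    old, new = summarize(before), summarize(after)
    return {name: new[name] - old[name] for name in old}


# Hypothetical scores for the same test cases under two chatbot designs.
baseline_scores = [2, 3, 4, 4, 5]
candidate_scores = [3, 4, 4, 5, 5]

print("baseline: ", summarize(baseline_scores))
print("candidate:", summarize(candidate_scores))
print("change:   ", compare(baseline_scores, candidate_scores))
```

With only a handful of test cases, differences like these are easily explained by noise, so a larger evaluation set gives a more trustworthy comparison.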
Conclusion
In this article, three of the most common approaches to the performance evaluation of Generative AI models have been discussed. In the example considered, reference answers were available for comparison, which may not be the case in practice. Without reference answers, human or LLM-based approaches may be adapted, but the ROUGE and BLEU methods would be infeasible because they depend on a reference text. The “best” approach will always depend on the use case, nuances of the dataset and the resources available, and will continue to evolve as new methods are proposed and best practices emerge in this new field.
At ProCogia, we recognize that the landscape of Generative AI evaluation is perpetually evolving. Staying ahead of emerging methods and refining our approaches accordingly is essential to our mission. Through innovative and adaptable evaluation strategies, we aim to enhance the performance and reliability of Generative AI applications, ensuring they meet the ever-changing demands of the digital era.