Unlocking LLM Confidence Through Logprobs

Gautam Chutani
12 min read · Feb 18, 2024


In the realm of AI-driven responses, understanding the confidence level of generated text is crucial for evaluating model performance and enhancing user trust. The use of log probabilities (or logprobs) serves as a beacon of insight into the decision-making process of language models.


Introduction

Log probabilities represent the logarithm of the probability of each token in the output sequence generated by the language model given a specific context. They serve as quantitative measures of the model’s confidence in its token selection.

  • Log probabilities express token likelihoods on a logarithmic scale (-∞, 0] instead of the standard [0, 1] unit interval, ensuring numerical stability, particularly with probabilities near zero. This stability enhances precision and efficiency in computations, crucial for handling small probabilities and preventing underflow issues.
  • Higher log probabilities, closer to 0, indicate stronger confidence in token selection, typically leading to more coherent and contextually relevant responses. Conversely, lower log probabilities (approaching -∞) suggest less confidence in the token choice.
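
To see why working on the log scale matters numerically, here is a minimal sketch (the probability values are arbitrary illustrations) comparing a direct product of many small probabilities with a sum of their logs:

import numpy as np

# 500 tokens, each with an (arbitrary) probability of 0.01
probs = np.full(500, 0.01)

# Multiplying the raw probabilities underflows 64-bit floats to exactly 0.0
print(np.prod(probs))         # 0.0

# Summing the log probabilities keeps the same information without underflow
print(np.sum(np.log(probs)))  # about -2302.59

The product of the raw probabilities is 10^-1000, far below what a float64 can represent, while the equivalent sum of logs is a perfectly ordinary number.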

Leveraging Log Probabilities for Confidence Assessment

In this article, we will utilize the IBM WatsonX.ai API to harness log probabilities for confidence assessment.

  • Enabling the TOKEN_LOGPROBS parameter in API requests gives users access to the log probability associated with each token generated by the model.
  • To compute a confidence score for the entire response, the individual token logprobs can be aggregated (on either the logarithmic or the linear scale).

Ways to determine Confidence Level:

  1. Log Probability Averaging:
    - Compute the average of log probabilities for each token in the response.
    - Higher average (closer to zero) indicates greater confidence.
    - Pros: Simple calculation and direct reflection of the model’s output.
    - Cons: Logarithmic values are less intuitive to read, and the average is sensitive to outlier tokens.
  2. Linear Probability Transformation:
    - Convert log probabilities to linear probabilities using the exponential function (e^x).
    - Average the linear probabilities to determine confidence (on a 0 to 100 scale).
    - Pros: Linear probabilities are easier to interpret.
    - Cons: Requires additional computation.
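
As a quick worked example of the two methods (the logprob values here are made up for illustration):

import numpy as np

log_probs = [-0.1, -0.3, -0.05]            # illustrative per-token logprobs

avg_log_prob = np.mean(log_probs)          # -0.15 -> closer to 0 means more confident
avg_linear = np.mean(np.exp(log_probs))    # ~0.866 -> ~86.6% on a 0-100 scale

Note that the two scores are related but not interchangeable: exponentiating the mean logprob gives exp(-0.15) ≈ 86.1%, while the mean of the exponentiated values is ≈ 86.6%, because the exponential of an average is not the average of the exponentials.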

In the upcoming code examples, we’ll explore how these methods work in practice.

Best Practices and Considerations:

When utilizing log probabilities for confidence assessment, it is essential to consider the following:

  • Granularity: Log probabilities provide insights at the token level, necessitating aggregation for assessing the confidence of entire responses.
  • Interpretation: Converting log probabilities to linear probabilities facilitates easier interpretation and comparison across different models or scenarios.
  • Thresholds: Establishing confidence thresholds can aid in decision-making and filtering out responses with low confidence levels.
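
As a rough sketch of the thresholding idea (the 90% cutoff and the candidate values below are arbitrary examples, not recommendations):

CONFIDENCE_THRESHOLD = 90.0  # arbitrary cutoff on the 0-100 linear scale; tune per application

# hypothetical (response, confidence) pairs produced by the methods above
candidates = [("Drama", 97.4), ("Comedy", 62.1)]

accepted = [c for c in candidates if c[1] >= CONFIDENCE_THRESHOLD]
flagged = [c for c in candidates if c[1] < CONFIDENCE_THRESHOLD]  # route these to review or a fallback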

Ultimately, selecting the appropriate method depends on the specific requirements and nuances of the application, ensuring accurate and meaningful interpretation of confidence scores.

Code Implementation:

Below, we apply logprobs to classification, summarization, and Q&A tasks. Feel free to customize the prompts and adjust parameters to explore and analyze the models’ predictions.

Import the required packages and load the environment variables

# Import necessary modules from the IBM Watson Machine Learning SDK
from ibm_watson_machine_learning import APIClient
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.metanames import GenTextReturnOptMetaNames

# Import other required libraries
import os
import numpy as np
from IPython.display import display, HTML
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()
watsonx_api_key = os.getenv("API_KEY", None)
ibm_cloud_url = os.getenv("IBM_CLOUD_URL", None)
project_id = os.getenv("PROJECT_ID", None)

# Set up the credentials for accessing IBM WatsonX
creds = {
    'url': ibm_cloud_url,
    'apikey': watsonx_api_key
}

Initialize the WatsonX LLM Model

# Set up the language model with the specified parameters and credentials
def set_up_model_connection(credentials: dict, project_id: str, max_tokens: int, min_tokens: int, model_id = "meta-llama/llama-2-70b-chat") -> Model:

    generate_params = {
        GenParams.MAX_NEW_TOKENS: max_tokens,
        GenParams.MIN_NEW_TOKENS: min_tokens,
        GenParams.DECODING_METHOD: "greedy",
        GenParams.REPETITION_PENALTY: 1,
        GenParams.RETURN_OPTIONS: {
            GenTextReturnOptMetaNames.INPUT_TEXT: False,
            GenTextReturnOptMetaNames.INPUT_TOKENS: True,
            GenTextReturnOptMetaNames.GENERATED_TOKENS: True,
            GenTextReturnOptMetaNames.TOKEN_LOGPROBS: True,
            GenTextReturnOptMetaNames.TOKEN_RANKS: True,
            GenTextReturnOptMetaNames.TOP_N_TOKENS: 2
        }
    }

    # Use the credentials passed into the function (rather than the global `creds`)
    model = Model(
        model_id = model_id,
        credentials = credentials,
        params = generate_params,
        project_id = project_id
    )

    print(f"Model initialized: {model_id}")
    return model

The relevant request parameters are:

  • token_logprobs: Specifies if the log probability (natural log of probability) for each returned token should be included.
  • top_n_tokens: Specifies the number of top candidate tokens to include at the position of each returned token, with a maximum value of 5.
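
The examples that follow call model.generate(...) directly, so instantiate the model once with the helper above. A minimal sketch, where the token limits are placeholder values to tune per task:

# Instantiate the model once; the token limits below are illustrative placeholders
model = set_up_model_connection(creds, project_id, max_tokens=200, min_tokens=1)

Next, define the two helper functions that turn per-token logprobs into a single confidence score:
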
# Method 1: Average of Log Probabilities
def calculate_confidence_log_probs(log_probs):
    avg_log_prob = np.mean(log_probs)
    return -avg_log_prob  # closer to 0 indicates higher confidence


# Method 2: Converting to Linear Probabilities
def calculate_confidence_linear_probs(log_probs):
    linear_probs = np.round(np.exp(log_probs)*100, 2)
    confidence = np.mean(linear_probs)
    return confidence  # closer to 100 indicates higher confidence

Here, the first method averages the log probabilities of the tokens in the response and returns the negated average, so a score closer to 0 signifies higher confidence. The second method converts the log probabilities to linear probabilities and averages them, with values nearing 100 indicating higher confidence.

Classification Task

Here, the model’s objective is to classify movies based on their descriptions into one of the specified genres, returning only the primary genre.

# sample movies data - movie name, imdb link and description
movies_data = [
("Sam Bahadur", "https://www.imdb.com/title/tt10786774/", "Sam Bahadur is a Hindi film directed by Meghna Gulzar, that is based on the life of India's first field marshal, Sam Manekshaw — whose military genius and leadership shaped the 1971 Indo-Pak war."),
("Welcome", "https://www.imdb.com/title/tt0488798/", "Dubai-based criminal don Uday takes it upon himself to try and get his sister Sanjana married - in vain, as no one wants to be associated with a crime family. Uday's associate Sagar Pandey finds a young man, Rajiv, who lives with his maternal uncle and aunt - Dr. and Mrs. Ghunghroo. Through extortion he compels Ghunghroo to accept this matrimonial alliance. But Rajiv has already fallen in love with young woman in South Africa. When the time comes to get Rajiv formally engaged to this woman, he finds out that Sanjana and she are the very same. With no escape from this predicament, the wedding is planned, with hilarious consequences."),
("Spider-Man: Far From Home", "https://www.imdb.com/title/tt6320628/", "Peter Parker is beset with troubles in his failing personal life as he battles a brilliant businessman named Adrian Toomes and a magician named Quentin Beck."),
("The Godfather", "https://www.imdb.com/title/tt0068646/", "The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.")
]
# setup the prompt for movie genre classification
CLASSIFICATION_PROMPT = """[INST] <<SYS>>
Classify the genre of a movie based on its description. Choose from these categories: Action, Drama, Comedy, Biography, War, and Science Fiction. Only provide the name of the primary genre as output, without any additional information.
<</SYS>>
Movie Description:
{movie_desc}
[/INST]
"""
# code to make API call to model and return output with logprobs
html_content = ""
for movie in movies_data:
    prompt = CLASSIFICATION_PROMPT.format(movie_desc=movie[2])

    response = model.generate(prompt=[prompt])
    output = response[0]['results'][0]['generated_text']

    generated_tokens = response[0]['results'][0]['generated_tokens']
    log_probs = [token['top_tokens'][0]['logprob'] for token in generated_tokens]
    confidence_log_probs = calculate_confidence_log_probs(log_probs)
    confidence_linear_probs = calculate_confidence_linear_probs(log_probs)

    movie_name_link = f"<a href='{movie[1]}'>{movie[0]}</a>"
    html_content += (
        f"<h3>Movie Name: {movie_name_link}</h3>"
        f"<p>Predicted Genre: {output}</p>"
        f"<p>Confidence using average of log probabilities: {confidence_log_probs}</p>"
        f"<p>Confidence using linear probabilities: {round(confidence_linear_probs, 2)}%</p><br>"
    )

# display the predicted results
display(HTML(html_content))

Here we can see how confident the model is in predicting the movie genre. Logprobs aid in classification by quantifying the certainty of predictions made by models, allowing users to set custom thresholds for classification or confidence levels, thereby enhancing model interpretability.
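
Since TOP_N_TOKENS was set to 2, you can also inspect the competing candidates at each generated position. A hedged sketch, assuming each entry in top_tokens carries text and logprob fields as used in the extraction above:

# Sketch: compare the top-2 candidates at the first generated position
first_position = generated_tokens[0]
candidates = first_position.get('top_tokens', [])

for cand in candidates:
    print(f"{cand.get('text', '?')!r}: {np.round(np.exp(cand['logprob']) * 100, 2)}%")

# A large probability gap between the two candidates signals a decisive prediction
if len(candidates) >= 2:
    margin = np.exp(candidates[0]['logprob']) - np.exp(candidates[1]['logprob'])
    print(f"Top-2 probability margin: {round(margin * 100, 2)}%")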

Summarization Task

In this use case, we set up a new prompt that guides the model to generate a brief summary of a news article.

# sample news article data
news_articles = [
{
"title": "Challenges in Distributing COVID-19 Vaccines to Rural Areas",
"content": "In the race to vaccinate against COVID-19, rural areas face significant challenges. According to recent statistics, only 30% of rural populations have received their first vaccine dose. Dr. Sarah Johnson, a leading epidemiologist, highlights the disparities in vaccine distribution. Remote regions like Pine County, with a population of 50,000, struggle with vaccine accessibility. Transportation infrastructure limitations exacerbate the situation, leading to logistical hurdles. Health officials are exploring innovative solutions, such as mobile vaccination clinics. The rollout plan aims to prioritize high-risk demographics and essential workers. Funding shortages pose a major obstacle to the deployment of vaccination programs. Despite challenges, community outreach efforts have shown promising results in some areas. Collaboration between healthcare providers and local governments is key to overcoming barriers."
},
{
"title": "Quantum Computing Breakthrough: Advancements in Superconducting Qubits",
"content": "Scientists at QuantumTech Labs announce a groundbreaking achievement in quantum computing. The research team, led by Dr. Emily Chang, has made significant strides in superconducting qubit technology. QuantumTech's latest quantum processor prototype boasts a record-breaking qubit count of 1024. The increased qubit density marks a crucial milestone in the quest for scalable quantum computing. Dr. Chang explains that the breakthrough opens doors to solving complex computational problems. Superconducting qubits, with their low error rates, offer promising prospects for practical quantum applications. The development paves the way for quantum supremacy, surpassing classical computing capabilities. Industry experts anticipate transformative impacts across various fields, from cryptography to drug discovery. QuantumTech plans to collaborate with leading research institutions to further refine the technology. Despite challenges, the quantum computing community remains optimistic about the future."
}
]
# setup the prompt for news article summarization
SUMMARY_PROMPT = """
[INST] <<SYS>>
Summarize the key points of the given news article in 1-2 lines.
<</SYS>>
News:
{news_para}
[/INST]
"""
# code to make API call to model and return output with logprobs
html_content = ""
for article in news_articles:
    prompt = SUMMARY_PROMPT.format(news_para = article['content'])

    response = model.generate(prompt=[prompt])
    gen_summary = response[0]['results'][0]['generated_text']

    generated_tokens = response[0]['results'][0]['generated_tokens']
    # keep only positions whose top candidate actually carries a logprob
    log_probs = [token['top_tokens'][0]['logprob'] for token in generated_tokens if token.get('top_tokens') and "logprob" in token['top_tokens'][0]]
    confidence_log_probs = calculate_confidence_log_probs(log_probs)
    confidence_linear_probs = calculate_confidence_linear_probs(log_probs)

    html_content += (
        f"<h3>News Title: {article['title']}</h3>"
        f"<p>Generated Summary: {gen_summary}</p>"
        f"<p>Confidence using average of log probabilities: {confidence_log_probs}</p>"
        f"<p>Confidence using linear probabilities: {round(confidence_linear_probs, 2)}%</p><br>"
    )

display(HTML(html_content))

The calculate_confidence_linear_probs function provides a score that serves as an indicator of the model’s confidence in its response. Thus, logprobs can assist in selecting the best summary among candidate responses or models, ensuring the most coherent and accurate representation of the input data. Remember, log probabilities are provided for each token, so aggregating them meaningfully is crucial for understanding the overall confidence for a generated sentence or paragraph.
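
For instance, if several candidate summaries are generated (from different prompts, models, or sampling runs), the aggregated score can be used to pick the strongest one. A minimal sketch, assuming each candidate is stored as a (summary, log_probs) pair collected the same way as above:

# Hypothetical candidate pool: (generated_summary, per-token log_probs) pairs
candidate_summaries = [
    (gen_summary, log_probs),
    # ... candidates from other prompts, models, or sampling runs
]

# Keep the candidate with the highest mean linear-probability confidence
best_summary, _ = max(candidate_summaries, key=lambda c: calculate_confidence_linear_probs(c[1]))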

RAG-Based Q&A Task

Logprobs not only aid in assessing confidence levels but also serve as a valuable asset in identifying and mitigating hallucinations.

Here, we task the model with evaluating whether the question can be answered solely from the provided article context. The model outputs a boolean value indicating whether there is enough context to provide a complete answer. We then analyze the model’s confidence using log probabilities to gauge how certain it is that the answer is present in the article.

# Article retrieved (hard-coded)
cricket_article = """India’s 10-match winning run at the ICC Cricket World Cup 2023 ended in the final with a six-wicket loss against Australia at the Narendra Modi Stadium on Sunday. India started well with both bat and ball. However, a lack of boundaries and partnerships saw India get bowled out on 240 in 50 overs before composed knocks from Travis Head (137 off 120) and Marnus Labuschagne (58 not out in 110) guided Australia to their sixth ODI World Cup title in 43 overs. This was India’s second ODI World Cup final against Australia and third in ICC events. India lost the 2003 World Cup final at Johannesburg by 125 runs and suffered a 209-run defeat in the ICC World Test Championships final earlier this year. Put into bat first, India, champions in 1983 and 2011, started the match with attacking intent. Rohit Sharma’s 47 off 31 gave India the perfect launchpad in the powerplay. His innings, laced with four fours and three maximums, helped India score 80 runs in the first 10 overs. However, India’s scoring rate started to drop after Sharma fell to Glenn Maxwell’s spin in the 10th over. Travis Head took a brilliant diving catch to dismiss the Indian captain. Shreyas Iyer was caught behind in the next over bowled by Australian captain Pat Cummins and India were reduced to 81/3 in the 11th over. Virat Kohli (54 off 63) and KL Rahul (66 off 107) stitched together a 67-run partnership for the fourth wicket. However, the stand took its time and batted for 18 overs before Kohli lost his wicket with an inside edge on the bowling of Pat Cummins. Australia produced a top-notch fielding effort and India were unable to find boundaries consistently. India went 16 overs without a boundary after Shreyas Iyer’s wicket. In fact, Pat Cummins (34/2) completed his quota of 10 overs without conceding a boundary. Leg spinner Adam Zampa (44/1) was economical. Though Josh Hazlewood (60/2) and Mitchell Starc (55/3) went for runs, they picked up wickets consistently to restrict India to 240. David Warner and Travis Head hit Jasprit Bumrah for 15 runs in the first over of the Australian batting innings. Mohammed Shami dismissed David Warner in the next over but the required run-rate for Australia never went out of reach. Jasprit Bumrah sent Mitchell Marsh (15 off 15) and Steven Smith (4 off 9) back cheaply, reducing Australia to 47/3 inside seven overs. Travis Head and Marnus Labuschagne navigated the next few overs cautiously. The pair grew in confidence but once the dew came into play at around the 20th over, the ball started coming onto the ball nicely and Travis Head unleashed a flurry of shots to help Australia inch closer to the target. The pair took Australia within touching distance of the victory. India did get the breakthrough with the wicket of Travis Head but with Australia needing just one to win, Glenn Maxwell came in next and hit the winning runs as Australia cruised home to victory with seven overs to spare. India came into the final with 10 wins on the trot. Kohli ended as the top-scorer at the 2023 ODI World Cup with 765 runs while Mohammed Shami, with 24 wickets, was the top wicket-taker in the 2023 ODI World Cup."""

# Questions that can be easily answered given the article
easy_questions = [
    "Who were the top-scorer and top wicket-taker for India in the 2023 ODI World Cup?",
    "How many ODI World Cup titles has Australia won after the 2023 victory?",
]

# Questions that are not fully covered in the article
medium_questions = [
    "Who was India's top run-scorer in the final match?",
    "How did the weather conditions impact the gameplay during the final match?",
]
PROMPT = """[INST] <<SYS>>
Before even answering the question, evaluate whether the answer to the question can be determined fully based on the information provided in the article.
You must only output the word 'True' or the word 'False' and nothing else.
<</SYS>>
You retrieved this article: {article}. The question is: {question}
[/INST]
"""
html_output = ""
html_output += "Questions clearly answered in article"

for ques in easy_questions:
    prompt = PROMPT.format(article = cricket_article, question = ques)

    response = model.generate(prompt = [prompt])
    output = response[0]['results'][0]['generated_text']

    generated_tokens = response[0]['results'][0]['generated_tokens']
    log_probs = [token['top_tokens'][0]['logprob'] for token in generated_tokens]
    confidence_log_probs = calculate_confidence_log_probs(log_probs)
    confidence_linear_probs = calculate_confidence_linear_probs(log_probs)

    html_output += (
        f"<p>Question: {ques}</p>"
        f"<p>Has Sufficient Context For Answer: {output}</p>"
        f"<p>Confidence using average of log probabilities: {confidence_log_probs}</p>"
        f"<p>Confidence using linear probabilities: {round(confidence_linear_probs, 2)}%</p><br>"
    )

html_output += "<br>"
html_output += "Questions only partially covered in the article"

for ques in medium_questions:
    prompt = PROMPT.format(article = cricket_article, question = ques)

    response = model.generate(prompt = [prompt])
    output = response[0]['results'][0]['generated_text']

    generated_tokens = response[0]['results'][0]['generated_tokens']
    log_probs = [token['top_tokens'][0]['logprob'] for token in generated_tokens]
    confidence_log_probs = calculate_confidence_log_probs(log_probs)
    confidence_linear_probs = calculate_confidence_linear_probs(log_probs)

    html_output += (
        f"<p>Question: {ques}</p>"
        f"<p>Has Sufficient Context For Answer: {output}</p>"
        f"<p>Confidence using average of log probabilities: {confidence_log_probs}</p>"
        f"<p>Confidence using linear probabilities: {round(confidence_linear_probs, 2)}%</p><br>"
    )


display(HTML(html_output))

For the first two questions, the model is quite confident that the article contains adequate context to address the given queries.


For the last two questions, which are tricky and not clearly answered in the provided context, the model is either less confident or outputs 'False', signaling that the provided context is insufficient.

In RAG-based Q&A systems, lower log probabilities for certain terms may signal inaccuracies or hallucinations.

  • Visualizing these deviations, for example by plotting individual token scores against the mean log probability as a baseline (see the sketch after this list), helps pinpoint uncertain spans and thereby improve response accuracy.
  • This form of self-evaluation can help mitigate hallucinations by allowing the retrieval systems to adjust their responses or prompt users again when confidence in the context’s sufficiency is low.
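
A minimal sketch of such a visualization, assuming log_probs holds the per-token values extracted as in the examples above (matplotlib is an additional dependency not used elsewhere in this article):

import numpy as np
import matplotlib.pyplot as plt

mean_lp = np.mean(log_probs)

# Plot per-token log probabilities against the mean as a baseline
plt.figure(figsize=(10, 4))
plt.bar(range(len(log_probs)), log_probs)
plt.axhline(mean_lp, color="red", linestyle="--", label=f"mean logprob = {mean_lp:.2f}")
plt.xlabel("Token position")
plt.ylabel("Log probability")
plt.legend()
plt.show()

# Tokens far below the baseline are the spans to inspect for potential hallucination
suspect_positions = [i for i, lp in enumerate(log_probs) if lp < mean_lp - 1.0]  # 1.0 is an arbitrary margin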

Conclusion:

In conclusion, log probabilities are a fundamental tool in AI, offering valuable insight into a model’s decision-making process and thereby enhancing reliability and interpretability across applications. Their utility extends beyond classification, summarization, and RAG to perplexity evaluation, named entity recognition, auto-completion, speech recognition, language generation, machine translation, and more, making them indispensable across a wide array of AI tasks.
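
As one example of the perplexity connection mentioned above, perplexity is simply the exponential of the negative mean token log probability, so it can be computed directly from the same log_probs list used throughout this article:

# Lower perplexity means the model found the generated sequence more predictable
perplexity = np.exp(-np.mean(log_probs))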

References:

  1. https://chrispiech.github.io/probabilityForComputerScientists/en/part1/log_probabilities/
  2. https://cookbook.openai.com/examples/using_logprobs
  3. https://www.linkedin.com/pulse/harnessing-logprobs-gpt-confidence-score-mitigating-ashish-bhatia-1zgme
