Evaluating Multimodal Models with LLaVA-Critic
Refining Multimodal Models with Enhanced Evaluation Techniques
LLaVA-Critic is the first open-source large multimodal model (LMM) designed to act as a general-purpose evaluator, capable of assessing performance across a wide range of multimodal tasks. It introduces a new way to enhance model alignment and self-critique using AI-generated feedback, making it an essential tool in the development of modern multimodal systems. In this article, we explore its key features, data collection process, and use cases.
1. Overview of LLaVA-Critic
LLaVA-Critic aims to provide a scalable solution for evaluating multimodal models across a wide range of vision-language tasks. It was trained on a high-quality dataset designed specifically for evaluation, enabling it to assess model responses and generate reward signals for preference learning. It offers two main capabilities:
- LMM-as-a-Judge: LLaVA-Critic can deliver reliable evaluation scores for multimodal tasks, comparable to proprietary models like GPT-4V, making it a cost-effective alternative.
- Preference Learning: By generating AI-driven reward signals, it reduces reliance on costly human feedback for model alignment, enhancing preference-based training.
These features establish LLaVA-Critic as a valuable tool for both evaluation and learning in multimodal model development.
2. Significance of Learning to Evaluate
As large multimodal models mature on web-scale pre-training data, interest is growing in improving post-training with AI-enhanced synthetic data. Reliable AI evaluation is crucial for automating the assessment of complex tasks, which would otherwise be labor-intensive and expensive. In particular, accurate reward signals are essential for reinforcement learning and for guiding models at inference time.
While much of the field focuses on improving LMMs' performance on real-world vision tasks, their role in judging and evaluating other models has remained largely unexplored. LLaVA-Critic addresses this gap by providing evaluation scores along with written reasoning for a variety of tasks, such as visual chat.
3. Key Contributions of LLaVA-Critic
LLaVA-Critic introduces several key innovations:
- Critic Instruction-Following Data: It is trained on a carefully curated dataset, including over 46,000 images and 113,000 evaluation samples. This data incorporates both pointwise and pairwise evaluation criteria, allowing the model to perform evaluations with quantitative judgments and detailed reasoning.
- Multimodal Model as a Critic: LLaVA-Critic expands the capabilities of existing LMMs to serve as a critic, evaluating model outputs and offering feedback to enhance model performance.
- Open-Source: In an effort to support the broader AI community, the LLaVA-Critic team has released its instruction data, model checkpoints, codebase, and visual chat demo for public use.
4. Data Collection Process
The training data for LLaVA-Critic was generated using a GPT-assisted pipeline, covering two key evaluation settings:
- Pointwise Scoring: Here, the model assigns a score to an individual response, either by directly evaluating it or by comparing it to a reference answer. The dataset includes question-image pairs with associated responses, scores, and justifications.
- Pairwise Ranking: In this setting, the model compares two responses and determines which one is of higher quality. This approach was used to train LLaVA-Critic on multiple response pairs, allowing it to handle complex preference learning tasks.
The dataset was constructed from widely used multimodal benchmarks and off-the-shelf LMM responses, ensuring comprehensive evaluation coverage across diverse tasks.
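For readers who want to look at this data directly, the minimal sketch below streams a few records from the released dataset (linked in the next section) using the Hugging Face datasets library. This is an illustration only: the split name and field layout are assumptions, so consult the dataset card for the actual schema.

from datasets import load_dataset

# Stream the critic data so the full 113k-sample dataset is not downloaded up front.
# NOTE: the "train" split name is an assumption; see the dataset card for the real splits.
critic_data = load_dataset("lmms-lab/llava-critic-113k", split="train", streaming=True)

# Inspect the first few records to see how pointwise and pairwise samples are laid out.
for i, sample in enumerate(critic_data):
    print(sample.keys())  # print the field names rather than assuming a fixed schema
    if i == 2:
        break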
5. Model Training and Fine-Tuning
LLaVA-Critic was fine-tuned from a pre-trained LMM, specifically the LLaVA-OneVision (OV) 7B/72B checkpoints, to develop its “critic” capabilities. The model was trained on the LLaVA-Critic-113k dataset for one epoch with a standard cross-entropy loss, learning to predict scores, rankings, and the accompanying justifications. The resulting model is LLaVA-Critic (v1.0); a variant trained on a smaller subset of the data is referred to as LLaVA-Critic (v0.5).
Dataset: https://huggingface.co/datasets/lmms-lab/llava-critic-113k
Models: https://huggingface.co/collections/lmms-lab/llava-critic-66fe3ef8c6e586d8435b4af8
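To make the fine-tuning recipe above concrete, here is a minimal, illustrative sketch of standard cross-entropy training on a critic target (a score or ranking plus its justification): prompt tokens are masked with -100 so the loss is computed only on the critic's answer. This is not the authors' training code, and the helper name build_labels is hypothetical.

import torch

def build_labels(prompt_ids, target_ids, ignore_index=-100):
    # Concatenate prompt and target tokens; supervise only the target (critic answer) tokens.
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    labels = input_ids.clone()
    labels[..., : prompt_ids.shape[-1]] = ignore_index  # mask the prompt so the loss ignores it
    return input_ids, labels

# Tiny usage example with dummy token ids:
prompt_ids = torch.tensor([[101, 102, 103]])
target_ids = torch.tensor([[7, 8, 9, 2]])
input_ids, labels = build_labels(prompt_ids, target_ids)
print(labels)  # tensor([[-100, -100, -100, 7, 8, 9, 2]])

# With a causal LM (e.g. the LLaVA-OV language backbone in Hugging Face Transformers),
# passing `labels` to the forward call yields the standard cross-entropy loss:
#   loss = model(input_ids=input_ids, labels=labels).loss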
6. Scenarios and Use Cases
LLaVA-Critic is useful in the following scenarios:
- Scenario 1: LMM-as-a-Judge — The model provides consistent evaluation scores and justifications, automating the labor-intensive task of human feedback for multimodal benchmarks.
- Scenario 2: Preference Learning — LLaVA-Critic generates reward signals that optimize models through preference learning, reducing reliance on human-annotated data. In the paper's experiments, its AI-generated reward signals also outperform those derived from human feedback for preference alignment, making it well suited to scaling model development (see the sketch below).
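As a rough illustration of how critic judgments could be turned into preference-learning data, the sketch below converts a pairwise verdict into a (chosen, rejected) pair of the kind consumed by DPO-style trainers. The parsing heuristic and field names are hypothetical and are not the paper's actual implementation.

def build_preference_pair(question, response_a, response_b, critic_verdict):
    # Naive heuristic: pick the response the critic's free-form verdict calls better.
    # A real pipeline would rely on a structured output format rather than string matching.
    verdict = critic_verdict.lower()
    if "first response is better" in verdict:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": question, "chosen": chosen, "rejected": rejected}

# Example, reusing the two answers from the pairwise prompt in the code section below:
pair = build_preference_pair(
    "What this image presents?",
    "The image is a black and white sketch of a line that appears to be in the shape of a cross.",
    "This is a handwritten number seven.",
    critic_verdict="The second response is better because it correctly identifies the digit.",
)
print(pair["chosen"])  # -> This is a handwritten number seven.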
7. Code Usage
This section shows how a single code snippet can perform both pairwise and pointwise scoring of Large Multimodal Model (LMM) responses using the lmms-lab/llava-critic-7b model. Pairwise scoring compares two model responses and determines which is better, while pointwise scoring assigns a numerical score to a single response.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings
warnings.filterwarnings("ignore")
# Load the model and tokenizer
pretrained = "lmms-lab/llava-critic-7b"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()
# Download and process the image
url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
image_sizes = [image.size]
# Define a function to handle critic prompts
def evaluate_image(critic_prompt, conv_template="qwen_1_5"):
    # Generate the full prompt with the image token and critic prompt
    question = DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()

    # Tokenize the input
    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

    # Generate response
    cont = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )

    # Decode the generated text
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
    return text_outputs[0]
# Define the two critic prompts
pairwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). "
    "Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\n"
    "The second response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
pointwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answer provided by a Large Multimodal Model (LMM). "
    "Score the response out of 100 and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The LMM response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
# Run the evaluation for both pairwise and pointwise scoring
pairwise_result = evaluate_image(pairwise_prompt)
pointwise_result = evaluate_image(pointwise_prompt)
# Print both results
print("Pairwise Evaluation Result:")
print(pairwise_result)
print("\nPointwise Evaluation Result:")
print(pointwise_result)
8. Conclusion
LLaVA-Critic demonstrates the potential of open-source LMMs to act as general evaluators and provide scalable, AI-driven feedback. By releasing the model and its data to the public, LLaVA-Critic sets the stage for future research into superhuman alignment mechanisms for large multimodal models. Its ability to reduce the need for human feedback while offering high-quality evaluations and preference learning signals makes it a powerful tool for AI developers.
9. References
- Original Paper: https://arxiv.org/abs/2410.02712
- Blog Series: https://llava-vl.github.io/blog/2024-10-03-llava-critic/