Fine-Tuning Vision-Language Models using LoRA
Using Unsloth for fine-tuning with Weights & Biases integration for experiment tracking
Co-Author: Sohan M
Introduction
Vision-Language Models (VLMs) are becoming essential in the AI landscape, enabling systems to process and understand both images and text for a variety of tasks. Key applications include image captioning, visual question answering, document understanding, and OCR (reading text from images). These tasks are fundamental for industries that rely on high-quality visual and textual analysis, such as e-commerce, healthcare, and finance. There is now a wide range of powerful open-source multimodal models capable of handling these tasks, with leading examples including Llama 3.2 Vision, Qwen2-VL, Pixtral, and LLaVA.
These pre-trained multimodal models usually provide a strong base and are effective for general tasks like image captioning and document understanding, since they have already learned patterns from large, diverse datasets. Fine-tuning them on your own dataset, however, can further enhance performance for more specific applications. Some scenarios where fine-tuning becomes essential include:
- Domain Adaptation
- Task Specialization
- Resource Optimization
- Cultural and Regional Context
By adapting the model to better meet the unique needs of your task, fine-tuning can improve both accuracy and efficiency for targeted applications.
In this blog, we’ll explore how to fine-tune Meta AI’s Llama-3.2-11B-Vision model using a combination of powerful tools. We’ll utilize Unsloth for efficient model loading and training, leverage LoRA for optimized parameter updates, and integrate Weights & Biases (WandB) for seamless experiment tracking. After fine-tuning, we can make use of vLLM for model serving and inference, ensuring high-performance deployment.
Overview of Tools and Techniques
Unsloth:
- Optimized framework for fine-tuning vision-language models (VLMs) and large language models (LLMs), offering up to 30x faster training speeds with 60% reduced memory usage.
- Supports multiple hardware setups, including NVIDIA, AMD, and Intel GPUs, with intelligent weight optimization techniques for enhanced memory efficiency.
LoRA (Low-Rank Adaptation):
- A technique for efficient fine-tuning that avoids modifying all model parameters.
- Adds small trainable low-rank layers to the model for task-specific adaptations (see the sketch after this list).
- Reduces GPU memory requirements, enabling use on standard hardware.
- Ideal for balancing resource efficiency and fine-tuning performance.
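To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a LoRA-wrapped linear layer (this is not Unsloth’s or PEFT’s internal implementation): the pre-trained weight is frozen, and only two small low-rank matrices are trained, with their product scaled by alpha / r.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # trainable down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # trainable up-projection
        nn.init.zeros_(self.lora_B.weight)        # start as a no-op, so training begins from the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Example: wrapping a 4096x4096 projection trains only 2 * 4096 * 8 ≈ 65K parameters
layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
In practice, libraries like PEFT and Unsloth apply this pattern to the attention and MLP projections for you; we only choose the rank r, the scaling factor lora_alpha, and which layer groups to adapt.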
Weights & Biases (W&B):
- A tracking tool for monitoring training metrics, managing experiments, and visualizing performance.
- Ensures reproducibility and collaboration across teams.
Step-by-Step Guide to Fine-Tuning and Deployment
Install the Required Libraries
!pip install torch==2.5.1 transformers==4.46.2 datasets wandb huggingface_hub python-dotenv --no-cache-dir | tail -n 1
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes xformers==0.0.28.post3 --no-cache-dir | tail -n 1
Set Up Weights & Biases
To monitor the fine-tuning process and keep track of different experiments, we will use Weights & Biases (W&B). This allows you to automatically log training progress including loss curves, visualize metrics, compare model versions, and track model performance over time.
import os
import wandb
from dotenv import load_dotenv
load_dotenv()
def setup_wandb(project_name: str, run_name: str):
    # Set up your API key
    api_key = os.getenv("WANDB_API_KEY")
    if api_key is None:
        raise EnvironmentError("WANDB_API_KEY is not set in the environment variables.")
    try:
        wandb.login(key=api_key)
        print("Successfully logged into WandB.")
    except Exception as e:
        print(f"Error logging into WandB: {e}")

    # Optional: log model checkpoints and gradients
    os.environ["WANDB_LOG_MODEL"] = "checkpoint"
    os.environ["WANDB_WATCH"] = "all"
    os.environ["WANDB_SILENT"] = "true"

    # Initialize the WandB run
    try:
        wandb.init(project=project_name, name=run_name)
        print(f"WandB run initialized: Project - {project_name}, Run - {run_name}")
    except Exception as e:
        print(f"Error initializing WandB run: {e}")
# Setup Weights & Biases
setup_wandb(project_name="<project_name>", run_name="<run_name>")
HuggingFace Authentication
After fine-tuning our model, we will upload it to the Hugging Face Hub. To do this, we first need to authenticate by retrieving and verifying our Hugging Face token. This token grants access to upload models and interact with Hugging Face resources. We will go over the upload process later in this article.
from huggingface_hub import login
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token is None:
raise EnvironmentError("HUGGINGFACE_TOKEN is not set in the environment variables.")
login(hf_token)
Prepare the Dataset for Training
For our training, we will be using the HuggingFaceM4/the_cauldron dataset, specifically the geomverse subset, which is designed for multimodal tasks involving geometric problem-solving and mathematical reasoning with images and text. Each sample contains an image that illustrates the geometry problem, paired with a text description of the problem and the step-by-step solution. This makes it well-suited for multimodal models that need to integrate both image and text to solve problems.
To make the fine-tuning process more efficient, we’ll select a subset of 3,000 samples from the dataset instead of utilizing the full training split. This approach allows us to reduce computational overhead while quickly evaluating the model’s performance.
from datasets import load_dataset
from PIL import Image
# Loading the dataset
dataset_id = "HuggingFaceM4/the_cauldron"
subset = "geomverse"
dataset = load_dataset(dataset_id, subset, split="train")
# Selecting a subset of 3K samples for fine-tuning
dataset = dataset.select(range(3000))
print(f"Using a sample size of {len(dataset)} for fine-tuning.")
print(dataset)
Now, let’s take a look at the sample at index 5 to examine its text and image content.
dataset[5]
To better understand the image data in our dataset, we can check its properties. Here’s how we can retrieve the image’s mode, size, and type:
# Print the mode of the image in dataset[5]
print(f"Image Mode: {dataset[5]['images'][0].mode}")
# Print the size of the image in dataset[5]
print(f"Image Size: {dataset[5]['images'][0].size}")
# Print the type of the image in dataset[5]
print(f"Image Type: {type(dataset[5]['images'][0])}")
# Display the image - dataset[5]["images"][0].show()
print("Displaying the Image:")
small_image = dataset[5]["images"][0].copy() # Create a copy to avoid modifying the original
small_image.thumbnail((400, 400)) # Resize to fit within 400x400 pixels
small_image.show()
In this section, we define utility functions that pre-process the images and structure the dataset into the correct format for training:
convert_to_rgb:
- Ensures all images are in RGB format.
- Handles alpha channels by compositing over a white background.
reduce_image_size:
- Resizes images to a smaller scale, enhancing memory and computational efficiency.
format_data:
- Structures the dataset by combining text and image data.
- Organizes each sample into “user” and “assistant” roles.
- Prepares the dataset for fine-tuning a conversational model for multimodal tasks.
This structured approach ensures seamless training for models handling both text and image inputs.
def convert_to_rgb(image):
    """Convert image to RGB format if not already in RGB."""
    if image.mode == "RGB":
        return image
    image_rgba = image.convert("RGBA")
    background = Image.new("RGBA", image_rgba.size, (255, 255, 255))
    alpha_composite = Image.alpha_composite(background, image_rgba)
    return alpha_composite.convert("RGB")

def reduce_image_size(image, scale=0.5):
    """Reduce image size by a given scale."""
    original_width, original_height = image.size
    new_width = int(original_width * scale)
    new_height = int(original_height * scale)
    return image.resize((new_width, new_height))

def format_data(sample):
    """Format the dataset sample into structured messages."""
    image = sample["images"][0]
    image = convert_to_rgb(image)
    image = reduce_image_size(image)
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": sample["texts"][0]["user"]},
                    {"type": "image", "image": image},
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": sample["texts"][0]["assistant"]},
                ],
            },
        ],
    }
# Transform the dataset
converted_dataset = [format_data(sample) for sample in dataset]
Now, let’s see how our converted data looks after applying the above transformations.
converted_dataset[5]
Loading our Vision Model
This setup initializes the Llama-3.2-11B-Vision-Instruct model using Unsloth’s FastVisionModel, with the following parameters:
- Gradient Checkpointing (use_gradient_checkpointing="unsloth"): Reduces memory usage significantly, which is particularly useful for processing long-context sequences.
- Quantization (load_in_4bit=False): Keeps the default 16-bit precision (LoRA) for better accuracy, though 4-bit quantization (QLoRA) can be enabled to save memory.
import torch
from unsloth import FastVisionModel

model_name = "unsloth/Llama-3.2-11B-Vision-Instruct"

model, tokenizer = FastVisionModel.from_pretrained(
    model_name = model_name,
    load_in_4bit = False,                    # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for long context
)
You can use other models such as:
- unsloth/Qwen2-VL-7B-Instruct
- unsloth/Pixtral-12B-2409
- unsloth/llava-v1.6-mistral-7b-hf
Simply replace the model name in the code with one of these options based on your requirements. To see the full list of supported models, refer here.
Configuring LoRA for Parameter-Efficient Fine-Tuning
In this section, we configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning to optimize training and reduce memory usage by focusing only on key parts of the model, rather than fine-tuning all parameters. Here’s a breakdown of the key parameters:
- finetune_vision_layers=True: Enables fine-tuning of the vision layers, allowing for specialized adaptation to visual tasks.
- finetune_language_layers=True: Enables fine-tuning of the language layers, adjusting the model for language-related tasks.
- finetune_attention_modules=True: Enables fine-tuning of attention layers, which help the model focus on important parts of the input sequence.
- finetune_mlp_modules=True: Allows fine-tuning of the MLP layers, essential for transforming representations within the model.
- r=8: Sets the rank of LoRA matrices, which balances model performance with memory efficiency by controlling the low-rank approximation of the layers.
- lora_alpha=16: A scaling factor for LoRA to control how much the low-rank matrices influence the model’s final weights.
- lora_dropout=0: Sets the dropout rate to zero for consistent training without introducing randomness during training.
- bias="none": Specifies no additional bias terms are used during fine-tuning.
- random_state=3407: Ensures that the training is reproducible by fixing the random seed.
- use_rslora=False: Disables rank-stabilized LoRA (rsLoRA), opting for the standard LoRA configuration, which is simpler but may not perform as well at higher ranks.
- loftq_config=None: Disables LoftQ, which is an advanced initialization method that improves accuracy but increases memory usage at the start.
This configuration allows efficient fine-tuning of select layers, optimizing the model for tasks while minimizing computational overhead.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,      # False if not finetuning vision layers
    finetune_language_layers = True,    # False if not finetuning language layers
    finetune_attention_modules = True,  # False if not finetuning attention layers
    finetune_mlp_modules = True,        # False if not finetuning MLP layers
    r = 8,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
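As an optional sanity check (assuming the wrapped model behaves like a standard torch.nn.Module), you can count how many parameters are actually trainable after applying LoRA:
# Count trainable vs. total parameters to confirm the parameter-efficient setup
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} "
      f"({100 * trainable_params / total_params:.2f}% of {total_params:,} total)")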
Evaluate the Base Vision Model
Before we proceed with any fine-tuning, let’s first check the performance of the original model. We’ll utilize the TextStreamer class to stream the generated text output, enabling real-time response streaming.
FastVisionModel.for_inference(model)  # Enable for inference!

image = dataset[5]["images"][0]
instruction = dataset[5]["texts"][0]["user"]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
You’ll observe that although the model correctly interpreted the shapes, its mathematical and geometric reasoning was incorrect, resulting in an overly lengthy answer with an incorrect output and some signs of hallucination.
We are using min_p = 0.1 and temperature = 1.5, as mentioned in Unsloth’s Kaggle notebook. For more information, refer to this tweet.
Training with SFTTrainer and Unsloth
This code configures and initiates the training process for a vision model using the SFTTrainer from the trl library. It first sets hyperparameters through SFTConfig, including batch size, learning rate, and optimizer settings, while enabling mixed-precision training based on hardware support. The model is then prepared for training using FastVisionModel.for_training(). The trainer is initialized with the model, tokenizer, and a custom data collator (UnslothVisionDataCollator) for vision fine-tuning. This setup ensures efficient training, resource management, and logging for multimodal tasks, particularly for vision-based models.
from trl import SFTTrainer, SFTConfig
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator

args = SFTConfig(
    per_device_train_batch_size = 2,  # Controls the batch size per device
    gradient_accumulation_steps = 4,  # Accumulates gradients to simulate a larger batch
    warmup_steps = 5,
    num_train_epochs = 3,             # Number of training epochs
    learning_rate = 2e-4,             # Sets the learning rate for optimization
    fp16 = not is_bf16_supported(),
    bf16 = is_bf16_supported(),
    optim = "adamw_8bit",
    weight_decay = 0.01,              # Regularization term for preventing overfitting
    lr_scheduler_type = "linear",     # Chooses a linear learning rate decay
    seed = 3407,
    output_dir = "outputs",
    report_to = "wandb",              # Enables WandB logging
    logging_steps = 1,                # Sets frequency of logging
    logging_strategy = "steps",
    save_strategy = "no",
    load_best_model_at_end = True,
    save_only_model = False,

    # You MUST put the below items for vision finetuning:
    remove_unused_columns = False,
    dataset_text_field = "",
    dataset_kwargs = {"skip_prepare_dataset": True},
    dataset_num_proc = 4,
    max_seq_length = 2048,
)

FastVisionModel.for_training(model)  # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),  # Must use!
    train_dataset = converted_dataset,
    args = args,
)
This code captures the initial GPU memory stats at the start of training.
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Now that we’ve finalized the setup, let’s begin the training of our model.
trainer_stats = trainer.train()
print(trainer_stats)
wandb.finish()
After training, the code below checks final memory usage, capturing the memory used specifically for LoRA training and calculating the corresponding percentages of total GPU memory.
# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
We can visualize the training and system metrics, such as memory usage, training duration, and training loss, on WandB to gain better insights into our model’s performance over time.
Saving and Deploying the Model
After fine-tuning the vision-language model, the trained model is saved locally and uploaded to the Hugging Face Hub for easy access and future deployment. You can either use Hugging Face’s push_to_hub for online storage or save_pretrained for local saving.
However, this process saves only the LoRA adapters, not the full merged model. With LoRA, only the adapter weights are trained, so only those weights are stored when saving.
# Local saving
model.save_pretrained("<lora_model_name>")
tokenizer.save_pretrained("<lora_model_name>")
# Online saving
model.push_to_hub("<hf_username/lora_model_name>", token = hf_token)
tokenizer.push_to_hub("<hf_username/lora_model_name>", token = hf_token)
To merge the LoRA adapters with the original base model and save the model in 16-bit precision for optimized performance with vLLM, you can use the merged_16bit option. This allows you to save the fine-tuned model in float16.
# Merge to 16bit
model.save_pretrained_merged("<model_name>", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("<hf_username/model_name>", tokenizer, save_method = "merged_16bit", token = hf_token)
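Once the merged 16-bit model is on the Hub, it can be served with vLLM. The snippet below is a minimal sketch using vLLM’s offline LLM API; the exact engine arguments and prompt format depend on your vLLM version’s support for the Llama 3.2 Vision (Mllama) architecture, the image path is a placeholder, and <hf_username/model_name> is the placeholder used above.
from vllm import LLM, SamplingParams
from PIL import Image

# Minimal serving sketch - engine arguments may need tuning for your vLLM version and GPU
llm = LLM(
    model="<hf_username/model_name>",  # merged 16-bit model pushed above
    max_model_len=8192,
    max_num_seqs=2,
    enforce_eager=True,                # some vLLM versions require eager mode for Mllama
)

image = Image.open("geometry_problem.png")  # placeholder path to a problem image
prompt = "<|image|><|begin_of_text|>Solve the geometry problem shown in the image."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=512),
)
print(outputs[0].outputs[0].text)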
Model Evaluation
Once the LoRA fine-tuning process is complete, we test the model’s performance by loading a sample image and its corresponding mathematical problem statement from the dataset to evaluate how the fine-tuned model interprets and responds to it.
from unsloth import FastVisionModel
from transformers import TextStreamer

model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "<lora_model_name>",  # Trained model, either local or from the Hugging Face Hub
    load_in_4bit = False,
)
FastVisionModel.for_inference(model)  # Enable for inference!

image = dataset[-1]["images"][0]
instruction = dataset[-1]["texts"][0]["user"]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": instruction},
        ],
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
With just ~3k samples for fine-tuning, you’ll notice significant improvement. The model not only interprets the shapes correctly but also demonstrates accurate mathematical and geometric reasoning, producing concise and precise answers without any hallucinations.
Conclusion
Multimodal/Vision AI is transforming industries by enabling models to work seamlessly with both visual and textual data. Fine-tuning these models for specific applications is made simpler and more efficient with tools like Unsloth, which reduces training time and memory usage, and LoRA, which enables parameter-efficient fine-tuning, while integration with Weights & Biases helps you track and analyze your experiments effectively. Together, these tools empower researchers and businesses to unlock the full potential of multimodal AI for practical and impactful use cases.