Ten Ways to Serve Large Language Models: A Comprehensive Guide

Gautam Chutani
14 min read · Oct 24, 2024


Deploying large language models (LLMs) can be a challenging task, especially with the growing complexity of models and hardware requirements. However, several solutions have emerged that make serving LLMs more accessible and scalable. From lightweight local deployment tools to robust inference engines designed for high-performance production environments, there’s a wide variety of options available.

This comprehensive guide explores ten popular LLM serving engines and tools, each offering distinct advantages for different use cases, whether you’re a hobbyist running models on consumer-grade hardware or a developer deploying large-scale models in production.


1. WebLLM

WebLLM is a high-performance, in-browser LLM inference engine powered by WebGPU hardware acceleration, enabling AI models like Llama 3 to run directly in the browser without server dependencies. It supports real-time AI interactions with features like streaming responses, structured JSON generation, and logit-level control, offering full compatibility with the OpenAI API. WebLLM allows developers to easily integrate AI into web applications while ensuring privacy and efficiency through its modular design, making it ideal for building chatbots, assistants, and more.

Key Features:

  • In-browser model execution with WebGPU acceleration
  • OpenAI API compatibility for seamless integration
  • Real-time streaming and JSON-mode support
  • Wide model support (Llama, Phi, Gemma, etc.)
  • Custom model integration with MLC format
  • Web Worker & Service Worker integration for performance optimization
  • Chrome Extension Support to enhance browser functionality with WebLLM

Pros:

  • No server-side deployment required
  • Enhanced privacy as processing happens client-side
  • Cross-platform via web browsers

Cons:

  • Restricted to models that can run in browsers
  • Limited by client hardware capabilities

Helpful Resources:

WebLLM Chat (AI models running in browser)

2. LM Studio

LM Studio is a powerful desktop application that enables users to run large language models (LLMs) completely offline, directly on their local machine. It supports various hardware configurations and can be used to experiment with different models and configurations. LM Studio offers a user-friendly chat interface as well as an OpenAI-compatible local server, making it versatile for developers who want to integrate LLMs into their applications or experiment with various models.

LM Studio allows the execution of LLMs on Mac, Windows, and Linux through llama.cpp. Furthermore, on Mac devices with Apple Silicon, it also supports Apple's MLX framework for running LLMs.

Key Features of LM Studio

  • Offline LLM Execution: Run LLMs on your local machine without an internet connection.
  • Structured JSON Responses: Generate structured JSON outputs to enforce specific data formats.
  • Multi-Model Support: Run multiple models at once for parallel AI tasks.
  • Chat with Local Documents: Access and interact with local documents using the in-app chat UI (new in version 0.3).
  • OpenAI-Compatible Local Server: Simulate OpenAI-like endpoints for local development and testing.
  • Hugging Face Integration: Download and manage models directly from Hugging Face, simplifying the workflow.

Pros:

  • Fast on-device LLM inference, entirely offline, with a user-friendly GUI
  • Easy model management and download via Hugging Face
  • Chat UI and OpenAI-compatible local server
  • Run multiple models simultaneously

Cons:

  • Limited to desktop environments; not suitable for production deployments
  • Not all model architectures work out-of-the-box
  • Requires high system resources for larger models
  • Performance depends on local hardware
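
Usage (illustrative):

Because LM Studio exposes an OpenAI-compatible local server, you can query it with the standard openai Python client. The sketch below assumes the server is running on LM Studio's default port (1234) and that a model is already loaded in the app; the model name and API key values are placeholders.

# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server
# (port 1234 is LM Studio's default; adjust if you changed it in the app).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; use the model identifier shown in LM Studio
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what LM Studio does in one sentence."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)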

Helpful Resources:

LM Studio local server running on Mac M1

3. Ollama

Ollama is a powerful, open-source LLM serving engine that enables local inference, allowing users to run language models directly on their machines without relying on cloud services. This capability enhances privacy, reduces latency, and provides greater control over the models being used, making it an ideal solution for developers and organizations looking to leverage AI while maintaining data security.

Key Features:

  • Local Inference: Run language models on local machines for improved privacy and reduced latency.
  • Model Management: Easily load, unload, and switch between multiple language models.
  • API Integration: Simple API access for seamless integration into applications.
  • Cross-Platform Compatibility: Available for Windows, macOS, and Linux, offering cross-platform support for running LLMs locally on various operating systems.
  • Custom Model Configuration: Users can modify settings and parameters to tailor model behavior according to specific needs.

Pros:

  • Easy to set up and user-friendly
  • Great for simpler and personal projects
  • Supports various models out of the box
  • Provides a simple API for integration as well as command-line interface for quick interactions
  • Facilitates experimentation and customization

Cons:

  • Limited to models supported by Ollama
  • Performance dependent on local hardware capabilities
  • May not match the performance of specialized serving engines, limiting its suitability for large-scale deployments

Usage:

You can download and run the Ollama installer from the official Ollama website.

  • To start the Ollama inference server locally, run the following:
ollama serve
  • In a new terminal window, install the granite-code:8b model available in the Ollama library.
ollama pull granite-code:8b
  • To list models:
ollama list
  • To run the model:
ollama run granite-code:8b
Ollama CLI Usage
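
Beyond the CLI, the Ollama server exposes a simple REST API (on port 11434 by default) that you can call from any language. A minimal Python sketch using requests against the granite-code:8b model pulled above:

import requests

# Ollama's REST API listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "granite-code:8b",
        "prompt": "Write a Python function to check if a number is prime.",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])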

4. vLLM

vLLM (Virtual Large Language Model) is an advanced open-source library designed for high-performance inference and serving of Large Language Models (LLMs). It leverages innovative features such as PagedAttention for efficient memory management, continuous batching for optimal GPU utilization, and support for various quantization methods to enhance inference speed. vLLM is compatible with an OpenAI-like API and integrates seamlessly with the Hugging Face ecosystem, making it a versatile tool for AI practitioners.


Key Features:

  • PagedAttention for optimized memory usage
  • Continuous batching for dynamic request handling
  • Various quantization methods to speed up inference
  • OpenAI-compatible API for easy integration
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Rich support for most of the popular open-source models on HuggingFace, including — Transformer-like LLMs (e.g., Llama), Mixture-of-Expert LLMs (e.g., Mixtral), Embedding Models (e.g. E5-Mistral) and Multi-modal LLMs (e.g., Pixtral)

Pros:

  • Designed for production use with high performance and state-of-the-art throughput
  • Flexibility to support multiple model architectures
  • Open-source nature enables community contributions
  • Excellent concurrency handling for multiple simultaneous requests
  • Memory efficiency allows for larger models on limited hardware

Cons:

  • More complex setup compared to simpler solutions
  • Requires more technical expertise to optimize

Usage:

Here’s an example of how to use Pixtral-12B multimodal in offline mode with vLLM:

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Pixtral-12B-2409"
max_img_per_msg = 2

sampling_params = SamplingParams(max_tokens=2048)

llm = LLM(
    model=model_name,
    tokenizer_mode="mistral",
    load_format="mistral",
    config_format="mistral",
    dtype="bfloat16",
    max_model_len=8192,
    gpu_memory_utilization=0.95,
    limit_mm_per_prompt={"image": max_img_per_msg},
)

image_url = "https://storage.googleapis.com/lablab-static-eu/images/events/clyyg9s36000f357erj4ytsnm/clyyg9s36000f357erj4ytsnm_imageLink_143zw03r1.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the company name conducting hackathon and then generate a catchy social media caption for the image. Output in JSON format."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

res = llm.chat(messages=messages, sampling_params=sampling_params)
print(res[0].outputs[0].text)

vLLM also provides an HTTP server that implements OpenAI’s Completions and Chat API. This allows for easy integration with existing systems and workflows. For vision-language models like Pixtral, vLLM’s HTTP server is compatible with the OpenAI Vision API.

  • Spin up a server:
vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=2'
  • Make a request to the model:
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe the content of this image in detail please."},
          {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}}
        ]
      }
    ]
}'

5. LightLLM

LightLLM is a Python-based framework designed for fast and efficient inference of Large Language Models (LLMs). Renowned for its lightweight design, easy scalability, and high-speed performance, LightLLM leverages the strengths of several well-regarded open-source implementations, such as FasterTransformer, TGI, vLLM, and FlashAttention. The framework supports advanced features that optimize GPU utilization and memory management, making it ideal for both development and production environments.


Key Features:

  • Tri-process Asynchronous Collaboration: Tokenization, model inference, and detokenization occur asynchronously, significantly improving GPU utilization.
  • Nopad (Unpad) Support: Efficiently handles requests with large length disparities through nopad attention operations across multiple models.
  • Dynamic Batch Scheduling: Allows for dynamic scheduling of requests to optimize resource use.
  • FlashAttention Integration: Enhances speed and reduces GPU memory footprint during inference.
  • Token Attention: Implements a token-wise KV cache management mechanism, resulting in zero memory waste during inference.
  • High-performance Router: Works with Token Attention to manage GPU memory meticulously for each token, optimizing throughput.
  • Int8KV Cache: Doubles token capacity, specifically supporting the LLaMA model.

Pros:

  • Lightweight and fast, ideal for both small and large-scale deployments
  • Asynchronous processing improves efficiency and throughput
  • Open-source with a supportive community, allowing for continuous improvement and updates
  • Flexible architecture enables easy integration with existing systems

Cons:

  • May require more technical expertise for setup and optimization
  • Limited support for certain models, as some features are specifically tailored (e.g., Int8KV Cache for LLaMA only)
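
Usage (illustrative):

LightLLM is typically run as a standalone API server that is pointed at a local model directory and then queried over HTTP. The sketch below follows the pattern in the project's examples, but the launch flags, port, and request schema vary between releases, so treat them as assumptions and confirm against the LightLLM docs for your version.

# Launch the server first (flags are illustrative):
#   python -m lightllm.server.api_server --model_dir /path/to/llama-7b \
#       --host 0.0.0.0 --port 8080 --tp 1

import requests

# Query the server's generate endpoint (request schema assumed from the docs).
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is quantum computing?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
)
print(resp.json())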

6. OpenLLM

OpenLLM is a versatile platform that simplifies self-hosting of large language models (LLMs). It allows developers to run state-of-the-art open-source models like Llama, Qwen, Mistral, and more as OpenAI-compatible APIs. With built-in chat interfaces, optimized inference backends, and seamless integration with Docker, Kubernetes, and BentoCloud, OpenLLM streamlines the process of deploying, managing, and interacting with custom and popular LLMs.

Key Features

  • Single-command setup: Run popular open-source LLMs or custom models with a single command.
  • OpenAI-compatible APIs: Provides OpenAI-like APIs for easy integration across various tools and frameworks.
  • Enterprise-grade deployment: Simplified cloud deployment via Docker, Kubernetes, and BentoCloud for production use.
  • Custom repository support: Add and host custom models with a straightforward process using BentoML.
  • Built-in chat UI: Easy access to a chat interface for model interactions at the /chat endpoint.

Pros:

  • Minimal setup with a single command to serve models.
  • Direct compatibility with OpenAI API tools for seamless integration.
  • Ready-to-use cloud deployment with BentoCloud and Kubernetes.
  • Allows hosting of custom models via private repositories.

Cons:

  • Only public custom repositories are supported for now.
  • While it supports custom models, setting up and managing your own repository can be technically challenging for less experienced users.

Usage:

  • To install via PyPI
pip install openllm
  • Start OpenLLM server locally
openllm serve llama3.1:8b-4bit
  • LangChain Integration
from langchain_community.llms import OpenLLM
llm = OpenLLM(server_url='http://localhost:3000')
llm.invoke("Which is the largest mammal in the world?")
  • Chat UI

OpenLLM provides a chat UI at the /chat endpoint for the launched LLM server at http://localhost:3000/chat

OpenLLM Chat UI
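
In addition to the LangChain integration shown above, the server can be queried with the standard openai Python client, since the API it exposes is OpenAI-compatible. A minimal sketch, assuming the default local address from the example above; the /v1 path and the model id are assumptions, so use the model identifier your server actually reports.

from openai import OpenAI

# OpenLLM exposes an OpenAI-compatible API; the /v1 path is assumed here.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="llama3.1:8b-4bit",  # placeholder; use the model id your server reports
    messages=[{"role": "user", "content": "Which is the largest mammal in the world?"}],
)
print(response.choices[0].message.content)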

7. HuggingFace TGI

Hugging Face Text Generation Inference (TGI) is a powerful and scalable solution designed for serving large language models efficiently. Optimized for inference workloads, TGI supports various open-source models and custom ones, providing fast and scalable text generation services. It is particularly suited for high-performance environments where speed and resource efficiency are critical, and it integrates seamlessly with Hugging Face’s model hub.


Key Features

  • Optimized inference engine: Designed to handle large-scale text generation tasks with low latency.
  • Supports open-source models: Works with text-generation models from Hugging Face’s model hub, such as Llama, Mistral, Falcon, and StarCoder, as well as custom models.
  • Scalability: Capable of handling multiple requests simultaneously, ideal for production environments.
  • GPU acceleration: Leverages GPU resources for faster model inference and performance optimization.
  • Multi-model serving: Serve multiple models in parallel with the same infrastructure, simplifying deployment.
  • Scalable deployment: Scales out under an orchestrator (e.g., Kubernetes) to match workload demand.
  • Production-ready (distributed tracing with OpenTelemetry, Prometheus metrics)

Pros:

  • Engineered for low-latency, high-throughput text generation, making it ideal for production use.
  • Compatible with a wide range of models from Hugging Face’s model hub, including custom models.
  • Easily integrates with Hugging Face’s ecosystem, providing direct access to model hosting and deployment.

Cons:

  • While TGI is highly optimized, large-scale deployments can require substantial GPU and memory resources, which may be expensive for smaller teams.
  • Some benchmarks report slightly higher latency for TGI than for vLLM under heavy load.

Usage (running the model locally with the transformers pipeline, for comparison):

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a wise sage who answers all questions with ancient wisdom."},
    {"role": "user", "content": "What is the meaning of life?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
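
Note that the snippet above runs the model in-process with the transformers pipeline rather than through a TGI server. To actually serve a model with TGI, the usual pattern is to launch the official Docker image and then query the endpoint, for example with huggingface_hub's InferenceClient. The sketch below is illustrative: the image tag, port mapping, and model id are assumptions, so check the TGI documentation for current values.

# Launch TGI first (illustrative; adjust the model id, image tag, and GPU flags):
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id meta-llama/Meta-Llama-3.1-8B-Instruct

from huggingface_hub import InferenceClient

# Point the client at the locally running TGI server.
client = InferenceClient("http://localhost:8080")

output = client.text_generation(
    "What is the meaning of life?",
    max_new_tokens=128,
)
print(output)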

8. GPT4All

GPT4All by Nomic is both a series of models and an ecosystem for training and deploying models locally on your computer. The platform enables users to run large language models (LLMs) efficiently on desktops and laptops, emphasizing privacy by keeping data on the device. Inspired by OpenAI’s ChatGPT, the GPT4All desktop application offers a familiar interface for interaction. With the integration of Nomic’s embedding models, users can easily pull information from local documents into their chats, providing a streamlined experience. GPT4All supports popular model architectures such as LLaMA and Mistral, and utilizes the efficient llama.cpp and Nomic's C backend, making it accessible for users across various skill levels.

Key Features:

  • Local Execution: Once installed, run LLMs without internet directly on your hardware, including support for CPUs and GPUs.
  • Privacy-Focused: All data remains on your device.
  • Cross-Platform Support: Available for macOS, Windows, and Linux.
  • Document Integration: Seamlessly pull information from local files to allow your local LLM access to sensitive data in formats like .pdf and .txt, ensuring that your information remains on your device.
  • Python SDK: Program with LLMs using llama.cpp and Nomic's backend.

Pros:

  • High privacy and security for sensitive data.
  • User-friendly interface similar to ChatGPT.
  • Supports a wide range of models.
  • Efficient performance on consumer hardware.

Cons:

  • Limited support for advanced fine-tuning compared to cloud alternatives.
  • May require substantial local resources for optimal performance.

Usage:

  • Python SDK
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # downloads / loads a 4.66GB LLM
with model.chat_session():
    print(model.generate("How does transfer learning work in image classification?", max_tokens=512))
  • Desktop Application

9. llama.cpp

llama.cpp is a highly optimized, dependency-free C/C++ implementation designed for running large language models (LLMs) such as LLaMA and others locally. Closely linked to the GGML library, it serves as the default implementation for these models and provides a foundation for many tools and applications in the LLM ecosystem. This library includes various bindings (e.g., for Python) that extend its functionality, allowing users to interact with a wide array of models through different interfaces. llama.cpp is specifically engineered to deliver high performance on various hardware configurations, including Apple Silicon and x86 architectures. It supports advanced features such as multiple levels of integer quantization and custom CUDA kernels for NVIDIA GPUs, making it versatile for both local and cloud deployment.

Key Features

  • Dependency-Free Implementation: No external dependencies, ensuring simplicity and ease of setup.
  • Optimized for Performance: High efficiency on a wide range of hardware, including support for ARM and x86 architectures.
  • Extensive Model Support: Compatible with numerous models, including LLaMA, Mistral, Mixtral, DBRX, and many others.
  • Quantization Support: Offers multiple integer quantization options (from 1.5-bit to 8-bit) for faster inference and reduced memory usage.
  • Various Bindings Available: Extensive language bindings for Python, Go, Node.js, and more, facilitating integration into diverse development environments.

Pros

  • Optimized for running LLMs on consumer hardware, making it accessible for various users.
  • The library is open source, allowing for community contributions and transparency.
  • Simple installation and usage, making it easy for beginners and experienced users alike to leverage LLM capabilities.

Cons

  • The interface may not be as polished or user-friendly compared to some commercial offerings.
  • Lacks some of the advanced customization options available in cloud-based services.

Usage:

To use the llama.cpp Python bindings (llama-cpp-python):

# pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama(model_path="./path/model.gguf")

output = llm(
    "What is artificial intelligence?",
    max_tokens=100,
    stop=["\n"],
    echo=True,  # Echo the prompt back in the output
)

print(output["choices"][0]["text"].strip())

10. Triton Inference Server with TensorRT-LLM

NVIDIA’s Triton Inference Server is an enterprise-grade platform designed to accelerate the deployment of large language models (LLMs) in production environments. It enables organizations to serve models efficiently while maintaining high performance across various frameworks, such as TensorFlow, PyTorch, and ONNX. Paired with TensorRT-LLM, an open-source framework specifically engineered to optimize the performance of LLMs, this combination allows developers to compile and fine-tune models for maximum efficiency. TensorRT-LLM enhances inference speed by optimizing kernels and leveraging powerful techniques like paged attention and efficient key-value (KV) caching, making it well-suited for high-throughput applications.


Key Features

  • Model Optimization: TensorRT-LLM compiles models and optimizes them for inference, significantly boosting performance.
  • Paged Attention & Efficient KV Caching: These techniques enhance memory efficiency and allow for faster processing of long sequences in LLMs.
  • Dynamic Batching: Triton’s dynamic batching feature increases throughput by combining multiple inference requests, optimizing resource utilization.
  • Concurrent Model Execution: This allows multiple models to be executed simultaneously, further improving throughput.
  • Multi-Framework Support: Out-of-the-box support for various deep learning frameworks enables flexibility in deployment.
  • Performance Metrics: Provides detailed metrics for GPU utilization, server throughput, latency, and more, facilitating performance tuning.

Pros

  • Significantly enhances inference speeds, ensuring fast response times for high-demand applications.
  • Optimizes resource utilization, improving GPU efficiency and performance without wastage.
  • Easily scales across multiple GPUs and nodes, making it suitable for enterprise-level deployments.

Cons

  • Both Triton and TensorRT-LLM are built specifically for NVIDIA GPUs, limiting flexibility of deployment options to this hardware.
  • Can be complex for beginners, requiring setup and configuration knowledge; it may be overkill for simple projects
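
Usage (illustrative):

At a high level, the serving flow is: build a TensorRT-LLM engine for the target model, package it into a Triton model repository using the tensorrtllm_backend, launch Triton, and query it over HTTP or gRPC. The request below is a sketch only: the "ensemble" model name, the field names, and the generate endpoint follow the common tensorrtllm_backend examples, but your deployment may expose different names, so treat them as assumptions.

import requests

# Triton's HTTP generate endpoint for the "ensemble" model commonly used
# in the tensorrtllm_backend examples (model and field names are assumptions).
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "Explain KV caching in one paragraph.",
        "max_tokens": 128,
        "temperature": 0.7,
    },
)
print(resp.json()["text_output"])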

Conclusion

In summary, the landscape of Large Language Model (LLM) serving engines is vast, offering a range of tools catering to different needs. Whether you’re looking for lightweight, local deployment solutions like llama.cpp and GPT4All or high-performance engines like Triton Inference Server and vLLM, there’s an option for every use case. Each tool has its strengths and trade-offs, from ease of use and cross-platform compatibility to technical complexity and performance optimization. By understanding the capabilities and limitations of these engines, you can make an informed decision on the best solution to serve your LLMs effectively for your specific applications.
