Streaming LLM Responses: Importance and Implementation

Gautam Chutani
3 min read · Jun 11, 2024


In today’s fast-paced digital world, waiting for a response can be frustrating, especially when using generative AI applications. Long wait times can hinder user engagement, making it crucial to find ways to deliver responses more quickly. One effective solution is to stream responses from large language models (LLMs) in real time.

This article discusses why streaming responses is important and provides examples of how to implement streaming with various APIs such as OpenAI, IBM watsonx.ai and AWS Bedrock using Python.


Why Streaming Responses Matters

When using APIs like OpenAI, IBM watsonx.ai, or other generative AI services, it can take 20–30 seconds or more to generate a complete response. The wait time grows with the number of tokens the model generates, which in turn depends on the complexity of the prompt.

Long wait times can significantly impact user satisfaction. While loading animations can help, they often aren’t enough for applications that require real-time interactions. Instead, streaming responses in a typewriter-style format, similar to ChatGPT, can enhance the user experience by showing the generated text as it is created.

Implementing Streaming Responses with Different APIs

Let’s look at how to stream responses from different LLM providers using their Python SDKs.

OpenAI

By default, when you request a completion from OpenAI, the API waits until the entire response is generated and then returns it all at once. To stream completions instead, set stream=True in the request. Here’s a simple example:

import os
from openai import OpenAI
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(".env"))

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

# With stream=True, the API returns chunks as they are generated
streaming_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me about history of Cricket in 200 words."}],
    max_tokens=512,
    stream=True,
)

# Each chunk carries an incremental piece of text in choices[0].delta.content
for chunk in streaming_response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
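
The same pattern also works asynchronously, which is often what you want inside a backend service. Below is a minimal sketch using the SDK’s async client (assuming the openai Python package v1.x; the main coroutine is just an illustrative wrapper):

import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI reads OPENAI_API_KEY from the environment by default
    client = AsyncOpenAI()
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Tell me about history of Cricket in 200 words."}],
        max_tokens=512,
        stream=True,
    )
    # Chunks arrive as the model generates them, without blocking the event loop
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())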

AWS Bedrock

AWS also provides the ability to stream AI responses. The code block below invokes an Anthropic Claude model on Amazon Bedrock using the InvokeModel API with a response stream.

import os
import json
import boto3
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(".env"))

AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

modelId = "anthropic.claude-v2"

prompt = "Tell me about history of Cricket in 200 words."

# Request body for the Anthropic Messages API on Bedrock
payload = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "temperature": 0.5,
    "messages": [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}],
        }
    ],
}

request = json.dumps(payload)

streaming_response = client.invoke_model_with_response_stream(modelId=modelId, body=request)

# The response body is an event stream; generated text arrives in content_block_delta events
for event in streaming_response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    if chunk["type"] == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="", flush=True)

IBM watsonx.ai

IBM watsonx.ai also offers APIs for streaming responses. A convenient difference from the other SDKs is that generate_text_stream directly returns a generator of text chunks, which makes it easy to plug into backend applications. Here’s how to implement it:

import os
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(".env"))

API_KEY = os.getenv("API_KEY")
PROJECT_ID = os.getenv("PROJECT_ID")
IBM_CLOUD_URL = os.getenv("IBM_CLOUD_URL")

credentials = {
    "url": IBM_CLOUD_URL,
    "apikey": API_KEY,
}

model_id = "mistralai/mixtral-8x7b-instruct-v01"

# Generation parameters: greedy decoding with an upper bound on new tokens
parameters = {
    GenParams.DECODING_METHOD: "greedy",
    GenParams.MAX_NEW_TOKENS: 500,
}

system_prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
{question} [/INST]
"""

model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=PROJECT_ID,
)

input_prompt = system_prompt.format(question="Tell me about history of Cricket in 200 words.")

# generate_text_stream returns a generator that yields text chunks
streaming_response = model.generate_text_stream(prompt=input_prompt)

for chunk in streaming_response:
    print(chunk, end="", flush=True)
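
Since the SDK already hands you a generator, exposing it from a streaming backend endpoint takes only a few lines. Below is a minimal sketch using FastAPI’s StreamingResponse, assuming the model and system_prompt objects defined above are in scope (the app object and /stream route are illustrative):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
def stream_answer(question: str):
    # The generator's text chunks are forwarded to the client as they arrive
    generator = model.generate_text_stream(prompt=system_prompt.format(question=question))
    return StreamingResponse(generator, media_type="text/plain")

Running the app with a server such as uvicorn lets clients receive tokens as they are generated; rendering that stream in a front end is covered in the next part.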

Conclusion

Streaming responses from LLMs can significantly improve user experience by reducing wait times and providing real-time feedback. This article demonstrated how to implement streaming responses using APIs from OpenAI, AWS Bedrock, and IBM watsonx.ai.

In the next part, we will cover how to render these streamed responses in a front-end user interface using JavaScript and FastAPI.

By implementing these techniques, you can build more responsive and engaging AI applications that keep users satisfied.
