Overview
Streaming enables real-time delivery of AI responses as they’re generated, token by token, rather than waiting for the complete response. This creates a more responsive and engaging user experience, especially for longer responses.

How Streaming Works
When you enable streaming on a chat completion request, the model begins sending response chunks immediately as it generates them. Each chunk contains a small piece of the response (typically a few tokens), which your application can display progressively. The stream continues until the model completes the response or reaches a stopping condition. Instead of a single large payload at the end, you receive a continuous flow of data delivered via Server-Sent Events (SSE), allowing you to update your UI incrementally as tokens arrive.
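To make the idea concrete, here is a toy illustration (no API involved; the delta strings are made up) of how small chunks accumulate into the full response, which is how a client typically updates its display as tokens arrive:

```python
# Toy illustration: the deltas below are made-up stand-ins for the small
# pieces of text a streaming response delivers one chunk at a time.
deltas = ["Stream", "ing lets ", "you show ", "text as it ", "arrives."]

partial = ""
for delta in deltas:
    partial += delta                      # append the newly arrived piece
    print(f"UI now shows: {partial!r}")   # redraw the display incrementally
```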
Benefits of Streaming

- Improved Perceived Performance: Users see responses begin immediately, reducing perceived latency even for long completions.
- Better User Experience: Progressive display mimics natural conversation flow and keeps users engaged.
- Reduced Time to First Token: Start showing results within milliseconds rather than waiting for complete generation.
- Responsive Interfaces: Build chat interfaces that feel fluid and interactive, similar to popular AI assistants.
- Early Cancellation: Allow users to stop generation mid-stream if they’ve seen enough or want to rephrase.
Use Cases
- Conversational Interfaces: Chat applications where users expect immediate, flowing responses
- Long-Form Content: Essays, articles, or explanations where waiting for completion would feel sluggish
- Code Generation: Display code as it’s written, allowing developers to review while generation continues
- Real-Time Assistants: Customer support bots that need to feel responsive and human-like
- Interactive Storytelling: Narrative experiences where progressive reveal enhances engagement
Chat Streaming
Enable streaming for chat completions:
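This guide doesn’t pin itself to a particular SDK, so the following is a minimal sketch assuming an OpenAI-compatible chat completions API and the official openai Python package; the model name and prompt are placeholders:

```python
# Minimal streaming sketch, assuming an OpenAI-compatible API and the
# openai Python client (pip install openai). The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # expects an API key in the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute your model
    messages=[{"role": "user", "content": "Write two sentences about rivers."}],
    stream=True,  # the only change from a non-streaming request
)

for chunk in stream:
    # Each chunk carries a small delta of the assistant's reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Setting stream: true is the only change from a non-streaming request; the response then arrives as the sequence of chunks described under Technical Details below.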
Technical Details

Response Format
Streaming responses use the Server-Sent Events (SSE) protocol. Each event contains a JSON chunk with partial response data. Your client receives these chunks sequentially and can process them as they arrive.
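As a hedged illustration of that shape, the snippet below unwraps a single SSE event into its JSON chunk; the field names follow the widely used OpenAI-style convention and may differ between providers:

```python
# One SSE event as it might arrive on the wire (OpenAI-style chunk shape;
# the exact fields are an assumption and may vary by provider).
import json

raw_event = (
    'data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}'
)

# Strip the SSE "data: " framing, then parse the JSON payload.
chunk = json.loads(raw_event.removeprefix("data: "))
delta = chunk["choices"][0]["delta"].get("content", "")
print(delta)  # -> Hel
```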
Stream Lifecycle

- Initiation: Request begins with the stream: true parameter
- Token Generation: Model generates and sends tokens incrementally
- Chunk Delivery: Each chunk arrives via SSE as soon as it’s available
- Completion Signal: Final chunk includes a finish_reason indicating why generation stopped
- Connection Close: Stream ends and the connection is closed
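Putting the lifecycle together, here is a sketch (again assuming an OpenAI-compatible API via the openai client; user_requested_stop is a hypothetical hook for a stop button) that watches for the finish_reason on the final chunk and supports early cancellation:

```python
# Lifecycle sketch, assuming an OpenAI-compatible API and the openai client.
# user_requested_stop() is a hypothetical placeholder for your UI's stop signal.
from openai import OpenAI


def user_requested_stop() -> bool:
    """Hypothetical hook: return True when the user presses a stop button."""
    return False


client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "List ten facts about rivers."}],
    stream=True,
)

collected = []
for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    if choice.delta.content:
        collected.append(choice.delta.content)
    # Completion signal: the final chunk carries a finish_reason such as
    # "stop" (natural end) or "length" (token limit reached).
    if choice.finish_reason is not None:
        print(f"\nGeneration stopped: {choice.finish_reason}")
        break
    # Early cancellation: stop consuming chunks; consult your client's docs
    # for explicitly closing the underlying connection.
    if user_requested_stop():
        break

print("".join(collected))
```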