Overview
Streaming enables real-time delivery of AI responses as they’re generated, token by token, rather than waiting for the complete response. This creates a more responsive and engaging user experience, especially for longer responses.

How Streaming Works
When you enable streaming on a chat completion request, the model begins sending response chunks immediately as it generates them. Each chunk contains a small piece of the response (typically a few tokens), which your application can display progressively. The stream continues until the model completes the response or reaches a stopping condition. Instead of a single large payload at the end, you receive a continuous flow of data delivered via Server-Sent Events (SSE), allowing you to update your UI incrementally as tokens arrive.
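To make the idea concrete, here is a toy illustration (no API involved; the delta strings are made up) of how small chunks accumulate into the full response, which is how a client typically updates its display as tokens arrive:

```python
# Toy illustration: the deltas below are made-up stand-ins for the small
# pieces of text a streaming response delivers one chunk at a time.
deltas = ["Stream", "ing lets ", "you show ", "text as it ", "arrives."]

partial = ""
for delta in deltas:
    partial += delta                      # append the newly arrived piece
    print(f"UI now shows: {partial!r}")   # redraw the display incrementally
```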
Benefits of Streaming

- Improved Perceived Performance: Users see responses begin immediately, reducing perceived latency even for long completions.
- Better User Experience: Progressive display mimics natural conversation flow and keeps users engaged.
- Reduced Time to First Token: Start showing results within milliseconds rather than waiting for complete generation.
- Responsive Interfaces: Build chat interfaces that feel fluid and interactive, similar to popular AI assistants.
- Early Cancellation: Allow users to stop generation mid-stream if they’ve seen enough or want to rephrase.
Use Cases
- Conversational Interfaces: Chat applications where users expect immediate, flowing responses
- Long-Form Content: Essays, articles, or explanations where waiting for completion would feel sluggish
- Code Generation: Display code as it’s written, allowing developers to review while generation continues
- Real-Time Assistants: Customer support bots that need to feel responsive and human-like
- Interactive Storytelling: Narrative experiences where progressive reveal enhances engagement
Chat Streaming
Enable streaming for chat completions:
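This guide doesn’t pin itself to a particular SDK, so the following is a minimal sketch assuming an OpenAI-compatible chat completions API and the official openai Python package; the model name and prompt are placeholders:

```python
# Minimal streaming sketch, assuming an OpenAI-compatible API and the
# openai Python client (pip install openai). The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # expects an API key in the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute your model
    messages=[{"role": "user", "content": "Write two sentences about rivers."}],
    stream=True,  # the only change from a non-streaming request
)

for chunk in stream:
    # Each chunk carries a small delta of the assistant's reply.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Setting stream: true is the only change from a non-streaming request; the response then arrives as the sequence of chunks described under Technical Details below.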
Technical Details

Response Format
Streaming responses use the Server-Sent Events (SSE) protocol. Each event contains a JSON chunk with partial response data. Your client receives these chunks sequentially and can process them as they arrive.
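As a hedged illustration of that shape, the snippet below unwraps a single SSE event into its JSON chunk; the field names follow the widely used OpenAI-style convention and may differ between providers:

```python
# One SSE event as it might arrive on the wire (OpenAI-style chunk shape;
# the exact fields are an assumption and may vary by provider).
import json

raw_event = (
    'data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
    '"choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}'
)

# Strip the SSE "data: " framing, then parse the JSON payload.
chunk = json.loads(raw_event.removeprefix("data: "))
delta = chunk["choices"][0]["delta"].get("content", "")
print(delta)  # -> Hel
```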
Stream Lifecycle

- Initiation: Request begins with the stream: true parameter
- Token Generation: Model generates and sends tokens incrementally
- Chunk Delivery: Each chunk arrives via SSE as soon as it’s available
- Completion Signal: Final chunk includes a finish_reason indicating why generation stopped
- Connection Close: Stream ends and the connection is closed
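Putting the lifecycle together, here is a sketch (again assuming an OpenAI-compatible API via the openai client; user_requested_stop is a hypothetical hook for a stop button) that watches for the finish_reason on the final chunk and supports early cancellation:

```python
# Lifecycle sketch, assuming an OpenAI-compatible API and the openai client.
# user_requested_stop() is a hypothetical placeholder for your UI's stop signal.
from openai import OpenAI


def user_requested_stop() -> bool:
    """Hypothetical hook: return True when the user presses a stop button."""
    return False


client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "List ten facts about rivers."}],
    stream=True,
)

collected = []
for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    if choice.delta.content:
        collected.append(choice.delta.content)
    # Completion signal: the final chunk carries a finish_reason such as
    # "stop" (natural end) or "length" (token limit reached).
    if choice.finish_reason is not None:
        print(f"\nGeneration stopped: {choice.finish_reason}")
        break
    # Early cancellation: stop consuming chunks; consult your client's docs
    # for explicitly closing the underlying connection.
    if user_requested_stop():
        break

print("".join(collected))
```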