VLLM
Pass-through endpoints for VLLM: call provider-specific endpoints in their native format (no translation).
| Feature | Supported | Notes |
|---|---|---|
| Cost Tracking | ❌ | Not supported |
| Logging | ✅ | Works across all integrations |
| End-user Tracking | ❌ | Tell us if you need this |
| Streaming | ✅ | |
Just replace https://my-vllm-server.com with LITELLM_PROXY_BASE_URL/vllm 🚀
Example Usage
curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234'
Supports ALL VLLM Endpoints (including streaming).
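With streaming enabled, the response arrives as Server-Sent Events. The sketch below is a hypothetical `sse_data` shell function (not part of LiteLLM) that prints the JSON payload of each `data:` line and stops at `data: [DONE]`. It assumes the OpenAI-compatible SSE framing that VLLM's chat endpoint uses when `"stream": true` is set in the request body.

```shell
# Hypothetical helper (POSIX sh): print the payload of each SSE "data:" line,
# stopping at the "[DONE]" sentinel. Assumes OpenAI-compatible SSE framing.
sse_data() {
  while IFS= read -r line; do
    # Drop a trailing carriage return in case the stream uses CRLF line endings.
    line=$(printf '%s' "$line" | tr -d '\r')
    case "$line" in
      "data: [DONE]") break ;;                              # end of stream
      "data: "*) printf '%s\n' "${line#data: }" ;;          # emit JSON chunk
      *) ;;                                                 # skip blank lines and other SSE fields
    esac
  done
}

# Usage (needs a running proxy):
# curl -sN -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
#   -H 'Content-Type: application/json' \
#   -H "Authorization: Bearer $LITELLM_VIRTUAL_KEY" \
#   -d '{"model": "qwen2.5-7b-instruct", "stream": true, "messages": [{"role": "user", "content": "Hi"}]}' \
#   | sse_data
```

The `-N` flag in the usage comment disables curl's output buffering, so chunks are printed as they are generated rather than all at once.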
Quick Start
Let's call the VLLM /metrics endpoint
- Add HOSTED_VLLM_API_BASE to your environment
export HOSTED_VLLM_API_BASE="https://my-vllm-server.com"
- Start LiteLLM Proxy
litellm
# RUNNING on http://0.0.0.0:4000
- Test it!
curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234'
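Before running litellm, it can help to confirm that HOSTED_VLLM_API_BASE is actually set. A minimal sketch, assuming a POSIX shell; `check_vllm_api_base` is a hypothetical helper name, not a LiteLLM command:

```shell
# Hypothetical helper (POSIX sh): fail fast if HOSTED_VLLM_API_BASE is unset
# or does not look like an http(s) URL.
check_vllm_api_base() {
  case "${HOSTED_VLLM_API_BASE:-}" in
    http://?*|https://?*) return 0 ;;
    "")
      echo "HOSTED_VLLM_API_BASE is not set" >&2
      return 1 ;;
    *)
      echo "HOSTED_VLLM_API_BASE must start with http:// or https://" >&2
      return 1 ;;
  esac
}

# Usage: only start the proxy if the VLLM base URL looks valid.
# check_vllm_api_base && litellm
```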
Examples
Anything after http://0.0.0.0:4000/vllm is treated as a provider-specific route and forwarded to your VLLM server as-is.
Key Changes:
| Original Endpoint | Replace With |
|---|---|
| https://my-vllm-server.com | http://0.0.0.0:4000/vllm (LITELLM_PROXY_BASE_URL="http://0.0.0.0:4000") |
| bearer $VLLM_API_KEY | bearer anything (use bearer LITELLM_VIRTUAL_KEY if Virtual Keys are set up on the proxy) |
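The endpoint replacement in the table is a plain prefix swap: the path after the VLLM base URL is kept and appended to LITELLM_PROXY_BASE_URL/vllm. A small sketch of that mapping, assuming POSIX sh; `to_proxy_url` and its argument order are made up for illustration:

```shell
# Hypothetical helper (POSIX sh): map a direct VLLM URL to its LiteLLM pass-through URL.
# Args: $1 = full VLLM URL, $2 = VLLM base URL, $3 = LiteLLM proxy base URL.
to_proxy_url() {
  url=$1
  vllm_base=${2%/}             # drop a trailing slash so the path keeps its leading "/"
  proxy_base=${3%/}
  rest=${url#"$vllm_base"}     # path (and query) after the VLLM base
  if [ "$rest" = "$url" ]; then
    echo "URL does not start with $vllm_base" >&2
    return 1
  fi
  case "$rest" in
    ""|/*) printf '%s/vllm%s\n' "$proxy_base" "$rest" ;;
    *)     # e.g. https://my-vllm-server.com.other/... only shares a prefix
      echo "URL does not start with $vllm_base" >&2
      return 1 ;;
  esac
}
```

For example, `to_proxy_url https://my-vllm-server.com/metrics https://my-vllm-server.com http://0.0.0.0:4000` prints http://0.0.0.0:4000/vllm/metrics, the same URL used in the proxy calls on this page.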
Example 1: Metrics endpoint
LiteLLM Proxy Call
curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $LITELLM_VIRTUAL_KEY"
Direct VLLM API Call
curl -L -X GET 'https://my-vllm-server.com/metrics' \
-H 'Content-Type: application/json'
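The /metrics endpoint returns Prometheus text format, so individual values can be picked out with standard tools. A sketch using awk, assuming samples carry no trailing timestamp; `metric_value` is a hypothetical helper, and vllm:num_requests_running is used only as an example metric name:

```shell
# Hypothetical helper (POSIX sh + awk): read Prometheus text format on stdin and
# print the value of every sample of the named metric (with or without labels).
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                                          # skip HELP/TYPE comment lines
    $1 == name || index($1, name "{") == 1 { print $NF }   # value is the last field
  '
}

# Usage (needs a running proxy):
# curl -s 'http://0.0.0.0:4000/vllm/metrics' \
#   -H "Authorization: Bearer $LITELLM_VIRTUAL_KEY" | metric_value vllm:num_requests_running
```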
Example 2: Chat API
LiteLLM Proxy Call
curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer $LITELLM_VIRTUAL_KEY" \
-d '{
    "messages": [
        {
            "role": "user",
            "content": "I am going to Paris, what should I see?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.1,
    "model": "qwen2.5-7b-instruct"
}'
Direct VLLM API Call
curl -L -X POST 'https://my-vllm-server.com/chat/completions' \
-H 'Content-Type: application/json' \
-d '{
    "messages": [
        {
            "role": "user",
            "content": "I am going to Paris, what should I see?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.1,
    "model": "qwen2.5-7b-instruct"
}'
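The request body is identical for the proxy call and the direct call, so it can be generated once from shell variables. A sketch, assuming a single-line prompt; `chat_payload` and `json_escape` are hypothetical helpers, and the escaping covers only backslashes and double quotes (a JSON tool such as jq is safer for arbitrary input):

```shell
# Hypothetical helper (POSIX sh): escape backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build the Example 2 chat-completions body.
# Args: $1 = model name, $2 = single-line user prompt.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 2048, "temperature": 0.8, "top_p": 0.1}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (needs a running proxy):
# curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
#   -H 'Content-Type: application/json' \
#   -H "Authorization: Bearer $LITELLM_VIRTUAL_KEY" \
#   -d "$(chat_payload qwen2.5-7b-instruct 'I am going to Paris, what should I see?')"
```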
Advanced - Use with Virtual Keys
Pre-requisites
- Set up the proxy with a database (DATABASE_URL)
Use this to avoid giving developers the raw VLLM API key while still letting them use VLLM endpoints.
Usage
- Set up the environment
export DATABASE_URL=""
export LITELLM_MASTER_KEY=""
export HOSTED_VLLM_API_BASE=""
litellm
# RUNNING on http://0.0.0.0:4000
- Generate virtual key
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{}'
Expected Response
{
    ...
    "key": "sk-1234ewknldferwedojwojw"
}
- Test it!
curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234ewknldferwedojwojw' \
-d '{
    "messages": [
        {
            "role": "user",
            "content": "I am going to Paris, what should I see?"
        }
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.1,
    "model": "qwen2.5-7b-instruct"
}'
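To script the steps above, the generated key can be captured from the /key/generate response instead of being copied by hand. A sketch using sed so it works without jq; `extract_key` is a hypothetical helper, and it assumes the response contains a top-level "key" string field and no other field named exactly "key". If jq is installed, `jq -r .key` does the same job more robustly.

```shell
# Hypothetical helper (POSIX sh): print the value of the "key" field from a
# /key/generate JSON response read on stdin (pretty-printed or single-line).
extract_key() {
  sed -n 's/.*"key": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (needs a running proxy; LITELLM_MASTER_KEY is the proxy master key):
# LITELLM_VIRTUAL_KEY=$(curl -s -X POST 'http://0.0.0.0:4000/key/generate' \
#   -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
#   -H 'Content-Type: application/json' \
#   -d '{}' | extract_key)
```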