Overview
This guide covers deploying vLLM on SaladCloud's community cloud of consumer GPUs (such as the RTX 4090) to serve large language models efficiently. vLLM is a high-throughput, open-source inference engine for LLMs. It is widely adopted in production because it provides:
- Continuous batching to maximize GPU utilization.
- PagedAttention for optimized memory management.
- Streaming outputs and an OpenAI-compatible API.
- Quantization support (FP8, AWQ, GPTQ) to reduce memory usage.
Example Models for SaladCloud Community
You can deploy any Hugging Face model that vLLM supports. Popular examples:
- Llama 3.1 8B Instruct — 8B-parameter model tuned for chat.
- Qwen2.5 7B Instruct — General-purpose 7B parameter model.
- Mistral 7B Instruct v0.3 — Compact model optimized for efficiency.
- Gemma 2 9B — Open model for reasoning and instruction following.
- DeepSeek R1 Distill Llama 8B — Llama-based 8B model distilled from DeepSeek R1.
Models are not bundled with the container. They are downloaded at runtime, which can take several minutes depending on size. For production, consider preloading models into the image.
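As a hypothetical sketch of that preloading approach, a custom image could download the weights at build time so replicas skip the runtime download. The base image, model ID, and paths below are examples, not the recipe's actual configuration:

```dockerfile
# Sketch: bake model weights into the image at build time.
# Base image, model ID, and paths are illustrative examples.
FROM vllm/vllm-openai:latest

# Download the model during the image build instead of at container start.
RUN pip install -U "huggingface_hub[cli]" && \
    huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
        --local-dir /models/Qwen2.5-7B-Instruct

# Point the server at the local copy rather than the Hugging Face hub.
ENV MODEL=/models/Qwen2.5-7B-Instruct
```

The trade-off is a much larger image (and slower image pulls), but node startup no longer depends on Hugging Face download speed.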
Configuration Options
When deploying, you can set:
- Model — Hugging Face model ID to load (required).
- Hugging Face Token — Optional; required only for private or gated models.
- Max Model Length — Model context length (default: 4096).
Example Request
Submit chat completion requests to the /v1/chat/completions endpoint:
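For instance, a minimal Python request might look like the following. The base URL, API key, and model ID are placeholders to replace with your own values, and the Salad-Api-Key header is only needed when authentication is enabled:

```python
import json
import urllib.request

# Placeholders: use your container group's access domain name and your own key.
BASE_URL = "https://your-container-group.salad.cloud"
API_KEY = "your-salad-api-key"

# Standard OpenAI-style chat completion payload; the model field should match
# the model ID you deployed.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    "max_tokens": 256,
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Salad-Api-Key": API_KEY,  # only required if authentication is enabled
    },
)

# Uncomment to send the request against a live deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Setting "stream": True instead returns tokens incrementally as server-sent events, which is usually preferable for interactive chat UIs.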
How To Use This Recipe
Authentication
If authentication is enabled, requests must include your SaladCloud API key in the Salad-Api-Key header. See Sending Requests.
Replica Count
We recommend at least 2 replicas for development and 3–5 replicas for production.
Logging
Logs are available in the SaladCloud Portal. You can also connect an external logging provider such as Axiom.
Deploy & Wait
When you deploy, SaladCloud will provision nodes, pull the container image, and download the model weights. Large models may take 5–10 minutes or more to fully load. Once replicas show a green checkmark in the Ready column, the service is live.
Advanced Settings
All settings are pre-configured, but you can override them via environment variables for performance tuning:
- DTYPE — Compute precision (auto, float16, bfloat16, float32).
- MAX_NUM_BATCH_TOKENS — Max tokens processed per batch.
- MAX_NUM_SEQS — Max concurrent sequences per batch.
- GPU_MEM_UTIL — Fraction of GPU VRAM vLLM can use (default: 0.92).
- QUANTIZATION — Quantization mode (awq, gptq, fp8, etc.).
- KV_CACHE_DTYPE — Precision for the key/value cache (auto, fp8, fp16, bf16).
- DOWNLOAD_DIR — Directory for caching downloaded models.
- TOKENIZER — Custom tokenizer repo or path.
- TRUST_REMOTE_CODE — Enable if the model requires custom code from Hugging Face.
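To illustrate how such environment variables typically drive the server, here is a hypothetical entrypoint helper that maps them onto vLLM's standard CLI flags. The mapping function and MODEL variable are assumptions for this sketch; the flag names are standard vllm serve options, but the recipe's actual entrypoint may differ:

```python
import shlex

# Assumed mapping from this guide's environment variables to vLLM CLI flags.
ENV_TO_FLAG = {
    "DTYPE": "--dtype",
    "MAX_NUM_BATCH_TOKENS": "--max-num-batched-tokens",
    "MAX_NUM_SEQS": "--max-num-seqs",
    "GPU_MEM_UTIL": "--gpu-memory-utilization",
    "QUANTIZATION": "--quantization",
    "KV_CACHE_DTYPE": "--kv-cache-dtype",
    "DOWNLOAD_DIR": "--download-dir",
    "TOKENIZER": "--tokenizer",
}

def build_vllm_args(env: dict) -> list[str]:
    """Build a `vllm serve` command line from environment-style settings."""
    args = ["vllm", "serve", env["MODEL"]]
    for var, flag in ENV_TO_FLAG.items():
        if env.get(var):
            args += [flag, env[var]]
    # TRUST_REMOTE_CODE is a boolean switch rather than a valued flag.
    if env.get("TRUST_REMOTE_CODE", "").lower() in ("1", "true"):
        args.append("--trust-remote-code")
    return args

# Example: override precision and VRAM fraction for an RTX 4090 deployment.
print(shlex.join(build_vllm_args({
    "MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "DTYPE": "bfloat16",
    "GPU_MEM_UTIL": "0.92",
})))
```

Lowering GPU_MEM_UTIL leaves headroom for other processes on the node, while raising MAX_NUM_SEQS trades per-request latency for higher aggregate throughput.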