Overview
This recipe runs the official SGLang runtime on a Salad GPU and exposes an OpenAI-compatible API for chat completions and related clients. Model selection and the main runtime controls are all exposed in the deployment form. This recipe is intended for technical users who want to control the serving stack directly:
- choose any supported Hugging Face model ID
- set parser and backend options yourself
- connect any OpenAI-compatible tool or SDK after deployment
Quick Start
- Open the SaladCloud Portal.
- Deploy the SGLang recipe.
- Enter a Container Group Name.
- Enter a Model such as Qwen/Qwen3.5-9B.
- Adjust optional runtime fields only if your model needs them.
- Decide whether to keep Require Container Gateway Authentication enabled.
- Deploy and wait for the first startup to finish.
Models are downloaded from Hugging Face at startup, so initial startup can take several minutes depending on model
size and node network speed.
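Because the first startup can take several minutes, it is useful to check readiness before sending traffic. A minimal sketch using only the standard library; the access domain and API key are hypothetical placeholders, and the check assumes the OpenAI-compatible GET /v1/models endpoint returns 200 once the model is loaded:

```python
import urllib.request

# Hypothetical placeholders -- substitute your container gateway URL and API key.
ACCESS_DOMAIN = "https://your-gateway.salad.cloud"
API_KEY = "your-salad-api-key"

def build_models_request(domain: str, api_key: str) -> urllib.request.Request:
    """Build a GET /v1/models request; a 200 response means the server is up."""
    return urllib.request.Request(
        f"{domain}/v1/models",
        headers={"Salad-Api-Key": api_key},
    )

req = build_models_request(ACCESS_DOMAIN, API_KEY)
# urllib.request.urlopen(req)  # uncomment once the container group is running
```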
Current Defaults
The recipe currently defaults to:
- Model: Qwen/Qwen3.5-9B
- Served model name: blank, so SGLang will use the model ID
- Host bind: ::
- Context length: 32768
- Memory fraction: 0.8
- Load format: auto
- Sampling defaults source: model
- Attention backend: triton
- Reasoning parser: qwen3
- Tool call parser: qwen3_coder
- Trust remote code: false
- Disable CUDA graph: true
- Authentication: enabled by default
Example Models
Examples that fit this recipe well:
- Qwen/Qwen3.5-9B with Reasoning Parser = qwen3 and Tool Call Parser = qwen3_coder
- Qwen/Qwen2.5-7B-Instruct with Tool Call Parser = qwen25
- meta-llama/Llama-3.1-8B-Instruct with Tool Call Parser = llama3
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B with Reasoning Parser = deepseek-r1
- mistralai/Mistral-7B-Instruct-v0.3 with Tool Call Parser = mistral
Supported values for the dropdown fields:
- Load Format: auto, safetensors, gguf, bitsandbytes
- Sampling Defaults Source: model, openai
- Attention Backend: triton, trtllm_mha
- Reasoning Parser: qwen3, deepseek-r1, gpt-oss
- Tool Call Parser: qwen3_coder, qwen25, llama3, mistral
Authentication
Require Container Gateway Authentication is available in the deployment form and is enabled by default.
- Enabled: requests must include the Salad-Api-Key header.
- Disabled: anyone with the URL can call the API.
Example Request
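A minimal chat-completions call against the deployed gateway, using only the standard library. The access domain and API key are placeholders; the path follows the OpenAI-compatible API, and the Salad-Api-Key header matches the authentication behavior described above:

```python
import json
import urllib.request

ACCESS_DOMAIN = "https://your-gateway.salad.cloud"  # placeholder
API_KEY = "your-salad-api-key"                      # placeholder

payload = {
    "model": "Qwen/Qwen3.5-9B",  # or your Served Model Name if you set one
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    f"{ACCESS_DOMAIN}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Salad-Api-Key": API_KEY},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:  # uncomment once deployed
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

If you disabled gateway authentication, drop the Salad-Api-Key header.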
Notes
- If your model is private or gated on Hugging Face, provide HF_TOKEN.
- Parser settings are model-specific. Leave them blank if your model does not need them.
- On Blackwell GPUs, hybrid Qwen models may require Attention Backend = triton. Disable CUDA Graph = true is the safer startup default.
- If you want a direct response instead of a reasoning response for supported models, include chat_template_kwargs: {"enable_thinking": false} in the request body.
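The enable_thinking toggle from the last note looks like this in a request body. A sketch only: it applies to models whose chat template honors the flag, and the model name here is just the recipe default:

```python
import json

# Request body asking a reasoning-capable model for a direct answer.
# chat_template_kwargs is forwarded to the model's chat template.
body = {
    "model": "Qwen/Qwen3.5-9B",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(body))
```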