Overview
This recipe deploys a lightweight Uvicorn service that loads an Unsloth-optimized large language model and exposes two HTTP endpoints:
- `GET /health` for readiness probes
- `POST /v1/generate` for text generation (supports streaming)
The default model is `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`, but you may point to any compatible Hugging Face repo or local path.
Configure the Container
Set these environment variables when you deploy (via the Portal form or Salad API):
- `MODEL_ID` — Hugging Face repo or local path for the base model (default `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit`).
- `DTYPE` — Tensor dtype (`bfloat16`, `float16`, or `fp16`); defaults to `bfloat16`.
- `LOAD_IN_4BIT` — `"true"` to enable 4-bit loading (VRAM savings); set `"false"` to disable.
- `MAX_SEQ_LEN` — Maximum context length accepted by the model (default `8192` tokens).
- `MAX_NEW_CAP` — Upper bound on `max_new_tokens` per request (default `4096`).
- `STREAM_STDOUT` — Optional (`"true"` by default). Mirrors generated text to container logs.
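As a minimal sketch of how the service might read these variables, the snippet below applies the documented defaults; the function name and dict keys are illustrative, not part of the recipe:

```python
import os

def read_config(env=os.environ):
    """Illustrative: read the recipe's environment variables with their documented defaults."""
    return {
        "model_id": env.get("MODEL_ID", "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"),
        "dtype": env.get("DTYPE", "bfloat16"),
        # String flags are compared case-insensitively against "true".
        "load_in_4bit": env.get("LOAD_IN_4BIT", "true").lower() == "true",
        "max_seq_len": int(env.get("MAX_SEQ_LEN", "8192")),
        "max_new_cap": int(env.get("MAX_NEW_CAP", "4096")),
        "stream_stdout": env.get("STREAM_STDOUT", "true").lower() == "true",
    }
```

Passing an empty mapping shows the defaults; any variable you set at deploy time overrides the corresponding entry.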
Gateway authentication is enabled by default (`networking.auth = true`). Disable it if you need anonymous access.
Retrieve Your Container Group ID
Copy the `.id` field from the response to target this container group in your client integrations.
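Assuming the API returns a JSON object with a top-level `id` field (the field name comes from the step above; the sample body is illustrative and real responses contain many more fields), extracting it looks like:

```python
import json

def container_group_id(response_body: str) -> str:
    """Pull the .id field out of a container-group API response body."""
    data = json.loads(response_body)
    if "id" not in data:
        raise KeyError("response has no 'id' field")
    return data["id"]

# Hypothetical response body for illustration only.
sample = '{"id": "a1b2c3d4", "name": "my-unsloth-service"}'
```

Store the returned ID in your client configuration so later calls can address this container group.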
Call the API (Non-Streaming)
`max_new_tokens` is clamped between 1 and `MAX_NEW_CAP`. Empty prompts return an `"error": "empty prompt"` response.
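The validation described above can be sketched as plain logic. Only the clamping range and the error message come from the recipe; the function name and return shape are illustrative:

```python
def validate_request(prompt: str, max_new_tokens: int, max_new_cap: int = 4096):
    """Illustrative server-side validation: reject empty prompts, clamp the token budget."""
    if not prompt or not prompt.strip():
        return {"error": "empty prompt"}
    # Clamp between 1 and MAX_NEW_CAP, as documented.
    clamped = max(1, min(max_new_tokens, max_new_cap))
    return {"prompt": prompt, "max_new_tokens": clamped}
```

A request asking for more than `MAX_NEW_CAP` tokens is not rejected; it is silently capped, which is why an over-large `max_new_tokens` appears to be ignored.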
Health Checks
The built-in readiness probe hits `GET /health` on port 8000; you can also test it manually with any HTTP client. A `{"ok": true}` response indicates the model is loaded and ready.
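A small readiness-wait helper in this spirit, with the HTTP call injected as a callable so the polling logic stands alone (the helper itself is an assumption, not part of the recipe):

```python
import time

def wait_until_ready(fetch_health, attempts: int = 30, delay: float = 0.0) -> bool:
    """Poll a health-fetching callable until it reports {"ok": true} or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch_health().get("ok") is True:
                return True
        except Exception:
            pass  # service not up yet; treat errors as "not ready"
        time.sleep(delay)
    return False
```

In a real deployment, `fetch_health` would issue `GET /health` on port 8000 (for example with `urllib.request`) and parse the JSON body; swallowing connection errors lets the loop ride out the cold-start window.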
Troubleshooting
- 403 Forbidden — Gateway auth is enabled. Include the `Salad-Api-Key` header or disable auth during deployment.
- `error: empty prompt` returned — Provide a non-empty `prompt` string.
- `max_new_tokens` ignored — Value exceeded `MAX_NEW_CAP`; increase the cap or request fewer tokens.
- Slow warm-up — Large models can take time to load at startup. Expect longer cold-start latencies after scaling to zero.
- GPU out of memory — Disable `LOAD_IN_4BIT` only when you have sufficient VRAM. Otherwise, pick a smaller model or keep 4-bit loading enabled.