Overview
This recipe runs Qwen3.5-9B with the official llama.cpp CUDA server on a
Salad GPU. The model downloads automatically on first startup, the built-in llama.cpp web UI is available at your
deployment URL, and the container exposes an OpenAI-compatible API for tools such as OpenClaw, OpenCode, and other
compatible clients.
This recipe is designed to be easy for non-technical users:
- the model is already chosen for you
- it is public by default, so you can test it immediately after deployment
- thinking is enabled by default
- you can start with the built-in web UI, then connect other tools later
Quick Start
- Open the SaladCloud Portal.
- Deploy the Qwen3.5-9B (llama.cpp) recipe.
- Enter a Container Group Name.
- Decide whether to enable Require Container Gateway Authentication:
- Disabled: public access.
- Enabled: requests must include your SaladCloud API key.
- Choose whether to keep Enable Thinking / Reasoning turned on.
- Deploy and wait for the first startup to finish.
The model is downloaded from Hugging Face at startup, so it can take several minutes before the deployment becomes ready. Once it is ready, the OpenAI-compatible API is available at /v1/chat/completions.
Use With OpenClaw
If you want to connect this recipe to OpenClaw, follow this guide:
Defaults
The recipe comes preconfigured with these defaults:
- Model source: unsloth/Qwen3.5-9B-GGUF
- Model alias: qwen3.5-9b
- Context size: 262144
- Parallel slots: 1
- Thinking: enabled by default
- Sampling defaults: temperature 0.6, top_p 0.95, min_p 0.0, top_k 20
- Authentication: disabled by default
temperature, top_p, and min_p are startup defaults. You can still override them per request in your inference
payload.
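To illustrate, here is a minimal sketch of a per-request override, assuming a standard OpenAI-style chat completion payload. The `build_chat_payload` helper and the prompt text are illustrative, not part of the recipe:

```python
import json

def build_chat_payload(prompt, temperature=0.6, top_p=0.95, min_p=0.0, top_k=20):
    """Build an OpenAI-style chat completion payload. Any sampling value
    passed here overrides the recipe's startup default for this request only."""
    return {
        "model": "qwen3.5-9b",  # the model alias configured by the recipe
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "min_p": min_p,
        "top_k": top_k,
    }

# Override temperature for a single, more deterministic request.
payload = build_chat_payload("Summarize llama.cpp in one sentence.", temperature=0.2)
print(json.dumps(payload, indent=2))
```

Values you do not override fall back to the startup defaults listed above.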
Thinking Mode
When thinking is enabled, you can control it per request:
- Add /think to explicitly enable reasoning for that turn.
- Add /no_think to disable reasoning for that turn.
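The toggles above are plain text appended to the user message. A minimal sketch (the `with_thinking` helper is illustrative, not part of the recipe):

```python
def with_thinking(prompt, enabled):
    """Append the per-turn thinking toggle to a user prompt.
    /think enables reasoning for the turn; /no_think disables it."""
    return f"{prompt} {'/think' if enabled else '/no_think'}"

# Skip reasoning for a simple arithmetic question.
print(with_thinking("What is 17 * 23?", enabled=False))
# → What is 17 * 23? /no_think
```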
Authentication
Require Container Gateway Authentication is available in the deployment form and is unchecked by default.
- Disabled: anyone with the URL can call the API.
- Enabled: every request must include the Salad-Api-Key header.
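A small sketch of building request headers for both modes. The `auth_headers` helper and the `SALAD_API_KEY` environment variable name are assumptions for illustration:

```python
import os

def auth_headers(api_key=None):
    """Build HTTP headers for the recipe's API. Include the Salad-Api-Key
    header only when gateway authentication is enabled for the deployment."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Salad-Api-Key"] = api_key
    return headers

# With authentication enabled, pass your SaladCloud API key:
print(auth_headers(api_key=os.environ.get("SALAD_API_KEY", "example-key")))
```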
Example Request
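Here is a minimal sketch of a chat completion request using only the Python standard library. The BASE_URL is a placeholder; replace it with your deployment's Container Gateway URL:

```python
import json
import urllib.request

# Placeholder — replace with your deployment's URL from the SaladCloud Portal.
BASE_URL = "https://example.salad.cloud"

payload = {
    "model": "qwen3.5-9b",  # the model alias configured by the recipe
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    url=f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# To send the request and print the reply:
#   response = urllib.request.urlopen(req)
#   print(json.loads(response.read())["choices"][0]["message"]["content"])
print(req.full_url)
```

If you enabled gateway authentication, also add your key with a `Salad-Api-Key` header.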
For Technical Users
If you want to switch this recipe to a different model later, open the container group in the SaladCloud Portal and edit Advanced Configuration. Useful environment variables include:
- LLAMA_ARG_HF_REPO to use a different Hugging Face GGUF repo
- LLAMA_ARG_HF_FILE to select a specific file from that repo
- LLAMA_ARG_MODEL_URL to point directly to a .gguf file
- LLAMA_ARG_CTX_SIZE to change the context window
- LLAMA_ARG_N_GPU_LAYERS to control GPU offload
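As an illustrative config fragment, swapping to a different GGUF repo with a smaller context window might look like this (the repo name and values below are hypothetical, not recommendations):

```
LLAMA_ARG_HF_REPO=some-org/some-other-model-GGUF   # hypothetical repo
LLAMA_ARG_CTX_SIZE=131072                          # hypothetical context size
```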