Documentation Index
Fetch the complete documentation index at: https://docs.salad.com/llms.txt
Use this file to discover all available pages before exploring further.
Last Updated: February 28, 2025
Real-Time AI Inference with a Redis-Based Queue
Several customers have successfully implemented a Redis-based, flexible and platform-independent queue for real-time
applications on SaladCloud, showcasing the following advantages:
- The Redis cluster, client applications, and Salad nodes are all strategically deployed within the same region to
ensure local access and minimize latency.
- Supports multiple clients and servers, providing real-time request/response functionality in streaming and
non-streaming modes with support for synchronous and asynchronous processing.
- More resilient to burst traffic, node failures, and the variability in AI inference times, while allowing easy
customization for specific applications, such as using different timeout settings per request and adjusting streaming
granularity (tokens or chunks).
- The input and output data of a task can be embedded within the request and response. For large datasets, data exchange
can occur directly between client applications or SCE instances and cloud storage, with the request and response
containing only a reference to the data.
However, implementing this solution requires effort and comes with certain limitations:
- A self-hosted Redis cluster or a managed service from public cloud providers (considering cost factors).
- Integrate the Redis worker in both client applications and inference servers.
- IP whitelisting for access control is not applicable from Salad nodes to the Redis cluster; instead, application-level
authentication can be used.
The Redis-based solution is generally used for low-latency, real-time applications where responses or partial
responses must be returned immediately as soon as they are ready, or for node-to-node communication within the same
container group or across different groups. It may be applied to batch jobs or long-running tasks, but this requires
additional logic and effort. For these scenarios, we typically rely on Salad Kelpie and AWS SQS.
Key Concepts in Redis
Redis is single-threaded but handles high levels of concurrency efficiently using asynchronous I/O and event-driven
architecture, and it supports asynchronous concurrency at the client level.
A list in Redis is an ordered collection of elements where items are added in the order they are inserted. It
supports efficient insertion and removal of elements from both ends. A list will be automatically removed from Redis
once it contains no elements. It also supports blocking operations, such as blocking reads on a non-existent list, with
a specified timeout.
A zset (sorted set) in Redis is a data structure that contains unique elements, each assigned a score. The elements
are stored in order of their scores, allowing efficient retrieval, with the highest-scoring element being accessed and
removed first.
A hash in Redis is a collection of key-value pairs, where each key is unique and maps to a specific value. It is
ideal for representing objects with multiple fields.
The Python Redis client uses a connection pool to manage connections efficiently,
reducing overhead. Instead of establishing a new connection for each request, it initializes the connection lazily on
the first command and reuses it for subsequent requests. The socket_timeout setting defines the maximum time (in
seconds) the client will wait for a response from the Redis cluster before timing out. If the cluster does not respond
within this duration, the client raises a timeout error.
pydantic_redis simplifies working with Redis and Pydantic models
together by providing tools for serializing and deserializing Pydantic models, making it easier to manage data between a
Redis store and your application.
Reference Design: Non-streaming
Please refer to the example code
(client,
server and
common code) in this senario.
The solution functions as both a real-time queue and a storage system, keeping historical I/O data and information about
client applications and servers.
The interaction between client applications and servers can be further simplified. For example, servers can directly
save the result to ‘Temporary:Request_1’, eliminating the need for one write operation (S7) from servers and one read
operation (C9) by client applications from the Redis cluster.
The SCE instance can automatically retrieve and process new requests based on its current load and available resources.
During node failure, the SCE instance immediately stops fetching new requests, and any existing request IDs are lost, as
they have already been removed from the ‘REQUEST:PENDING’ in Redis.
Client applications may encounter timeout errors when performing a blocking read from ‘Temporary:Request_1’ due to the
following reasons:
When a timeout error occurs, applications can perform additional reads for an extended waiting time for specific
requests (flexible and customizable). Applications can also resend the request using a new ID and higher priority, while
disregarding the previous request (which may still be in the queue or being processed), or simply return an error to
users.
Rather than implementing a complex logic, applications can query the number of pending requests in the Redis cluster
before submitting a large quantity of requests and apply flow control as needed, such as rejecting new user requests
during periods of congestion. The system monitor should regularly track the pending requests in the cluster and
upscale/downscale the GPU resource pool accordingly.
To enhance I/O throughput and AI inference efficiency, your may run multiple Redis worker instances (processes or
coroutines) alongside the inference server (running as a separate process), which supports multiple threads or
asynchronous concurrency with batched inference.
Reference Design: Streaming
Please refer to the example code
(client,
server and
common code) in this senario.
The streaming solution is similar to the non-streaming solution, differing only in how results are generated by servers
and delivered to client applications.
In Redis, a list is used to achieve the streaming effect, with servers writing to the left and clients reading from
the right. Both writes and reads are performed in chunks (tokens or sentences) by clients and servers.
On the server side, specifically for LLMs, the Redis worker retrieves partial responses from the inference server, such
as Hugging Face’s TGI, and writes them into Redis. Instead of calling the TGI server directly, the Redis worker can also
run custom inference code built on
the Hugging Face’s library,
enabling streaming results that are then returned to Redis.
The socket_timeout can be set to a smaller value in this case, as partial inference results can be returned when
they are ready, enabling quicker error detection and response handling.
Client applications may encounter timeout errors when performing a blocking read from ‘Streaming:Request_1’. A similar
logic can be applied—if some chunks have already been received for a request, it is highly likely that a server
failure has occurred.
The streaming granularity can be customized and changing based on application needs, such as token-by-token,
chunk-by-chunk, chunk-by-sentence, or image-by-video, as long as the data can be converted to strings,
providing flexibility in how data is processed and delivered.
From the test, we can see that the real-time queue does not impose an intensive read/write workload on Redis, and its
throughput scales linearly with the number of clients and servers. However, a smaller chunk size may reduce transmission
efficiency and overall system throughput.
