Last Updated: February 25, 2025
Main Requirements of Real-Time AI Inference
- Typical use cases include image generation, large language models (LLMs), and transcription, where inference takes
  from a few seconds to several minutes and responses must be delivered in real time to ensure a seamless user
  experience.
- Unlike traditional web applications, which typically have more consistent response times, AI inference times can be
  significantly longer and vary widely, even for the same application. For instance, inference times for LLMs are
  closely tied to the number of generated tokens.
- The ability to return partial responses (such as progress, chunks and tokens) as soon as they are ready, rather than
  making users wait for the entire response to be generated, can significantly enhance the quality of the user
  experience.
- As request volumes may fluctuate over time, inference systems should be able to scale efficiently to handle varying
  demand.
- During system congestion or failures, rejecting new requests early is a better strategy than letting them wait for
  extended periods only to fail anyway. Fail fast, not failure-proof!
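As a rough illustration of returning partial responses, the sketch below wraps a token iterator in Server-Sent Events frames so a client can render tokens as they arrive. The function name `sse_events`, the `{"token": ...}` payload shape, and the `[DONE]` sentinel are assumptions made for this example; they are not part of any SaladCloud API.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per token as soon as it is available."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the SSE framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Hypothetical end-of-stream marker so the client knows generation finished.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in sse_events(["Hello", ",", " world"]):
        print(frame, end="")
```

Because the function is a generator, each frame can be written to the HTTP response as soon as the model produces the corresponding token, instead of after the full response is complete.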
Real-Time AI Inference with SaladCloud’s Container Gateway
Deploying SaladCloud’s Container Gateway is the fastest way to enable input/output to your applications, but additional
considerations are required to ensure their success.
In the following diagram, all function points highlighted in green must be thoroughly reviewed, correctly configured,
and properly implemented.
Key Considerations
Currently, all container instances, regardless of location, are centrally accessed through SaladCloud’s Container
Gateway in the U.S. Using the gateway for instances in other regions may introduce additional latency, typically in the
range of several hundred milliseconds. This is generally acceptable for most applications, as AI inference times are
typically much longer. However, for latency-sensitive applications that require local access, a Redis-based queue
should be considered.
With the gateway’s Round Robin algorithm, some instances may be overwhelmed while others remain idle, because AI
inference times vary widely. In most cases, the Least Number of Connections algorithm is more effective, as it sends
each new request to the instance with the fewest in-flight requests and therefore absorbs these disparities better.
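The difference between the two algorithms can be seen in a toy simulation. The sketch below is not the gateway's implementation; it assumes each instance serves one request at a time in arrival order, that requests arrive at a fixed interval, and all function names are hypothetical.

```python
from typing import Callable, List


def simulate(durations: List[float], n_instances: int,
             pick: Callable[[List[int], int], int], interval: float = 1.0) -> float:
    """Return the mean time requests spend queued before inference starts."""
    busy_until = [0.0] * n_instances              # when each instance's queue drains
    finish_times: List[List[float]] = [[] for _ in range(n_instances)]
    total_wait = 0.0
    for i, duration in enumerate(durations):
        now = i * interval
        # Connections still open on each instance at the moment of arrival.
        active = [sum(1 for f in fs if f > now) for fs in finish_times]
        target = pick(active, i)
        start = max(now, busy_until[target])
        busy_until[target] = start + duration
        finish_times[target].append(start + duration)
        total_wait += start - now
    return total_wait / len(durations)


def round_robin(active: List[int], i: int) -> int:
    return i % len(active)


def least_connections(active: List[int], i: int) -> int:
    return active.index(min(active))


if __name__ == "__main__":
    # One slow request followed by three fast ones, repeated three times.
    workload = [9.0, 1.0, 1.0, 1.0] * 3
    print("round robin mean wait:", simulate(workload, 4, round_robin))
    print("least connections mean wait:", simulate(workload, 4, least_connections))
```

In this workload Round Robin keeps sending every slow request to the same instance, so requests queue behind each other there, while Least Number of Connections steers new requests away from instances that are still busy.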
Real-time autoscaling to match GPU resources with AI system load at the minute level is not feasible. Instead, you will
need to adjust the GPU resource pool based on historical data, and slightly overprovision resources in anticipation of
spikes. Please check this link for more information.
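One way to turn historical data into a pool size is a back-of-the-envelope estimate based on Little's law (requests in flight = arrival rate × time per request). The sketch below is only an assumption-laden starting point: the function name, the 25% default headroom, and the idea of sizing for the historical peak are illustrative choices, not SaladCloud recommendations.

```python
import math
from typing import Sequence


def replicas_needed(requests_per_minute_history: Sequence[int],
                    avg_inference_seconds: float,
                    concurrency_per_instance: int = 1,
                    headroom: float = 0.25) -> int:
    """Estimate how many instances cover the historical peak plus some headroom."""
    peak_per_second = max(requests_per_minute_history) / 60.0
    # Little's law: average requests in flight = arrival rate * time in system.
    in_flight_at_peak = peak_per_second * avg_inference_seconds
    return math.ceil(in_flight_at_peak * (1.0 + headroom) / concurrency_per_instance)


if __name__ == "__main__":
    history = [30, 120, 90]  # hypothetical requests per minute observed in the past
    print(replicas_needed(history, avg_inference_seconds=10.0))
```

The estimate uses the average inference time, so workloads with highly variable inference times may need more headroom than this suggests.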
Some inference servers use a single thread, and while performing long-running inference tasks they may fail to respond
to liveness or readiness probes, leading to node reallocation or to servers no longer receiving requests from the
gateway. To address this, inference servers typically require a robust architecture supporting multiple threads or
asynchronous concurrency with batched inference.
Some implementations of liveness or readiness probes simply return “OK” without accurately reflecting the true status
of the inference server. Such probes should be improved to distinguish between different states, such as READY, BUSY
and FAILED.
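A minimal sketch of state-aware probes follows. It assumes HTTP-style probes where a non-2xx status counts as a failure, and it makes one particular design choice: BUSY fails readiness (so no new requests are routed) but passes liveness (so the node is not reallocated). All class and function names are hypothetical.

```python
import threading
from enum import Enum
from typing import Tuple


class State(Enum):
    READY = "READY"
    BUSY = "BUSY"
    FAILED = "FAILED"


class ServerStatus:
    """Thread-safe record of what the inference worker is doing (hypothetical helper)."""

    def __init__(self, max_in_flight: int = 1) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._failed = False
        self._max_in_flight = max_in_flight

    def begin_request(self) -> None:
        with self._lock:
            self._in_flight += 1

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def mark_failed(self) -> None:
        with self._lock:
            self._failed = True

    def state(self) -> State:
        with self._lock:
            if self._failed:
                return State.FAILED
            if self._in_flight >= self._max_in_flight:
                return State.BUSY
            return State.READY


def liveness(status: ServerStatus) -> Tuple[int, str]:
    # Alive unless the worker has failed; a busy server should not be reallocated.
    state = status.state()
    return (500 if state is State.FAILED else 200), state.value


def readiness(status: ServerStatus) -> Tuple[int, str]:
    # Only READY accepts new traffic; BUSY and FAILED ask the gateway to route elsewhere.
    state = status.state()
    return (200 if state is State.READY else 503), state.value
```

For the probes to be answered during a long inference, these functions need to be served from a thread or event loop that is separate from the one running the model.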
The gateway has a server response timeout of up to 100 seconds for all requests. Long-running inference, such as video
generation, may exceed this limit and fail. Furthermore, sending more requests than the system can handle will cause
them to queue up either in the gateway (when configured in single-connection mode) or on the instances, eventually
resulting in timeout errors.
To improve system robustness, inference servers can proactively reject excessive requests as a back-pressure mechanism,
while client applications can implement traffic control to stop accepting new requests from users during periods of
congestion.
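Server-side back-pressure can be sketched with a non-blocking semaphore: a request either gets a slot immediately or is rejected at once, rather than queueing until it times out. The names `AdmissionGate`, `Overloaded` and `handle`, and the choice of a 503 status for rejections, are assumptions made for this example.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, Tuple


class Overloaded(Exception):
    """Raised when no inference slot is free."""


class AdmissionGate:
    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def admit(self) -> Iterator[None]:
        # Non-blocking acquire: fail fast instead of waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("all inference slots are in use")
        try:
            yield
        finally:
            self._slots.release()


def handle(gate: AdmissionGate, run_inference: Callable[[str], str],
           payload: str) -> Tuple[int, str]:
    """Return (status, body); reject immediately when the server is saturated."""
    try:
        with gate.admit():
            return 200, run_inference(payload)
    except Overloaded:
        return 503, "server busy, retry later"
```

Setting `max_in_flight` to the number of requests the server can finish comfortably within the gateway timeout keeps accepted requests from expiring while they wait.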
There is always a delay of tens of seconds between a node failing and the point at which it is reallocated or stops
receiving requests from the gateway (via the Readiness Probe), and client applications may receive errors during this
window. This delay may pose challenges for real-time inference and needs to be handled in the client applications.
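One common client-side mitigation is to retry failed calls with exponential backoff and jitter, so that a request which hit a failing node gets another chance on a healthy one. The sketch below is a generic pattern rather than a SaladCloud client API; `send` is a hypothetical zero-argument callable that performs the request and raises `TimeoutError` or `ConnectionError` on failure.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(send: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.5, max_delay: float = 8.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send(), retrying transient failures with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return send()
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt == attempts - 1:
                break
            # Double the delay ceiling each attempt, then pick a random point below it.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0.0, ceiling))
    assert last_error is not None
    raise last_error
```

Retries are only safe when repeating the request has no unwanted side effects, and the total retry budget should stay within what a real-time user is willing to wait.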
Please review these links (LLM, image generation) thoroughly before building real-time applications using the Container
Gateway.