Serverless AI API
The nature of AI and LLM inferencing workloads on already-trained models lends itself very naturally to a serverless-style architecture. As a framework for building and deploying serverless applications, Spin provides an interface for you to perform AI inference within Spin applications.
Using Serverless AI From Applications
Configuration
By default, a given component of a Spin application will not have access to any Serverless AI models. Access must be granted explicitly via the Spin application’s manifest (the spin.toml file). For example, an individual component in a Spin application can be given access to the codellama-instruct model by adding the following ai_models configuration inside the specific [component.(name)] section:
# -- snip --
[component.please-send-the-codes]
ai_models = ["codellama-instruct"]
# -- snip --
Spin supports “llama2-chat” and “codellama-instruct” for inferencing and “all-minilm-l6-v2” for generating embeddings.
File Structure
By default, the Spin framework expects any already-trained model files (configured as per the previous section) to be downloaded by the user and made available inside a .spin/ai-models/ directory of a given application. For example:
code-generator-rs/.spin/ai-models/llama/codellama-instruct
See the Serverless AI Tutorial documentation for more concrete examples of implementing the Fermyon Serverless AI API in your favorite language.
Embeddings models are slightly more complicated; it is expected that both a tokenizer.json and a model.safetensors are located in the directory named after the model. For example, for the foo-bar-baz model, Spin will look in the .spin/ai-models/foo-bar-baz directory for tokenizer.json and model.safetensors.
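Concretely, for the hypothetical foo-bar-baz model, the expected layout inside the application directory would be:
.spin/ai-models/foo-bar-baz/tokenizer.json
.spin/ai-models/foo-bar-baz/model.safetensors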
Serverless AI Interface
The Spin SDK surfaces the Serverless AI interface to a variety of different languages. See the Language Support Overview to see if your specific language is supported.
The set of operations (inferencing and generating embeddings) is common across all supporting language SDKs. The exact details of calling these operations from your application depend on your language:
Want to go straight to the reference documentation? Find it here.
To use the Serverless AI functions, the llm module from the Spin SDK provides the necessary methods. The following snippet is from the Rust code generation example:
use spin_sdk::{
http::{IntoResponse, Request, Response},
llm,
};
// -- snip --
fn handle_code(req: Request) -> anyhow::Result<impl IntoResponse> {
// -- snip --
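    // Ask the Code Llama model to complete the prompt, passing explicit inferencing options.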
let result = llm::infer_with_options(
llm::InferencingModel::CodellamaInstruct,
&prompt,
llm::InferencingParams {
max_tokens: 400,
repeat_penalty: 1.1,
repeat_penalty_last_n_token_count: 64,
temperature: 0.8,
top_k: 40,
top_p: 0.9,
},
)?;
// -- snip --
}
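The interface also covers generating embeddings. Below is a minimal sketch, assuming the Rust SDK exposes a llm::generate_embeddings function that takes an llm::EmbeddingModel (here the AllMiniLmL6V2 variant) and a slice of strings, and that the result exposes one embedding vector per input string; check the reference documentation for the exact names in your SDK version:
use spin_sdk::llm;

// Sketch: generate embeddings for two strings with the all-minilm-l6-v2 model.
// The calling component must list "all-minilm-l6-v2" in its ai_models configuration.
fn embed_snippets() -> anyhow::Result<()> {
    let text = vec![
        "fn add(a: i32, b: i32) -> i32 { a + b }".to_string(),
        "fn subtract(a: i32, b: i32) -> i32 { a - b }".to_string(),
    ];
    let result = llm::generate_embeddings(llm::EmbeddingModel::AllMiniLmL6V2, &text)?;
    // One embedding vector (a Vec<f32>) is returned per input string.
    println!("generated {} embedding vectors", result.embeddings.len());
    Ok(())
}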
General Notes
The infer_with_options operation:
- The infer_with_options example above takes the model name llm::InferencingModel::CodellamaInstruct as input. From an interface point of view, the model name is technically an alias for a string (to maximize future compatibility as users want to support more and different types of models).
- The second parameter is a prompt (string) from whoever/whatever is making the request to the handle_code() function.
- The third, optional, parameter is an interface that allows you to specify parameters such as max_tokens, repeat_penalty, repeat_penalty_last_n_token_count, temperature, top_k, and top_p (see the sketch after this list for calling the model with default parameters instead).
- The return value (the inferencing-result record) contains a text field of type string. Ideally, this would be a stream that would allow streaming inferencing results back to the user, but streaming support is not yet ready for use, so we leave that as a possible future backward-incompatible change.
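Because the inferencing parameters are optional at the interface level, the Rust SDK also provides a plain llm::infer function that runs a model with default parameters. A minimal sketch, reusing the model from the example above and assuming llm::infer's two-argument form:
use spin_sdk::llm;

// Sketch: the same inferencing request, using the SDK's default parameters
// instead of an explicit InferencingParams value.
fn generate_code(prompt: &str) -> anyhow::Result<String> {
    let result = llm::infer(llm::InferencingModel::CodellamaInstruct, prompt)?;
    // The inferencing-result record's text field holds the generated output.
    Ok(result.text)
}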