kollama deploy
The kollama deploy command is used to deploy a Model to the Kubernetes cluster. It's basically a wrapper and utility CLI for interacting with the Ollama Operator by manipulating CRD resources.
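For reference, a deploy is just a Model resource under the hood. Here is a minimal sketch of what kollama deploy phi creates, assuming the CRD group/version implied by the operator's model.ollama.ayaka.io labels:
apiVersion: ollama.ayaka.io/v1 # assumption: group/version inferred from the operator's label names
kind: Model
metadata:
  name: phi
spec:
  image: phi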
Use cases
Deploy model that lives on registry.ollama.ai
kollama deploy phi
Deploy to a specific namespace
kollama deploy phi --namespace=production
Deploy Model that lives on a custom registry
kollama deploy phi --image=registry.example.com/library/phi:latest
Deploy Model with exposed NodePort service for external access
kollama deploy phi --expose
Deploy Model with exposed LoadBalancer service for external access
kollama deploy phi --expose --service-type=LoadBalancer
Deploy Model with resource limits
The following example deploys the phi model with the CPU limit set to 1 and the memory limit set to 1Gi.
kollama deploy phi --limit=cpu=1 --limit=memory=1Gi
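Under the hood, these flags render into an ordinary Kubernetes limits stanza on the Model's workload, along these lines (a sketch mirroring the GPU examples under --limit below):
resources:
  limits:
    cpu: "1"
    memory: 1Gi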
Flags
--namespace
If present, the namespace scope for this CLI request.
--image
Default: registry.ollama.ai/library/<model name>:latest
kollama deploy phi --image=registry.ollama.ai/library/phi:latest
Model image to deploy.
- If not specified, the Model name will be used as the image name (pulled from registry.ollama.ai/library/<model name> by default if no registry is specified). For example, if the Model name is phi, the image name will be registry.ollama.ai/library/phi:latest.
- If not specified, the tag will be latest.
--limit
(supports multiple flags)
Resource limits for the deployed Model. Multiple limits can be specified by using the flag multiple times. This is useful for clusters that don't have a large amount of resources, or if you want to deploy multiple Models in a cluster with limited resources.
For resource limits on NVIDIA, AMD GPUs...
In Kubernetes, any GPU resource follows this pattern for resource labels:
resources:
  limits:
    gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
Using nvidia.com/gpu allows you to limit the number of NVIDIA GPUs. Therefore, when using kollama deploy, you can pass --limit=nvidia.com/gpu=1 to request 1 NVIDIA GPU:
kollama deploy phi --limit=nvidia.com/gpu=1
This is what it may look like in the YAML configuration file:
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
Documentation on using resource labels with nvidia/k8s-device-plugin
Using amd.com/gpu allows you to limit the number of AMD GPUs. Therefore, when using kollama deploy, you can pass --limit=amd.com/gpu=1 to request 1 AMD GPU:
kollama deploy phi --limit=amd.com/gpu=1
This is what it may look like in the YAML configuration file:
resources:
  limits:
    amd.com/gpu: 1 # requesting a GPU
You can read more here: Schedule GPUs | Kubernetes
I have deployed a Model, but I want to change the resource limits...
Of course you can. With the kubectl set resources command, you can change the resource limits:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4
For memory limits:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits memory=8Gi
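Both limits can also be changed in a single invocation, since kubectl set resources accepts a comma-separated list:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4,memory=8Gi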
The format is <resource>=<quantity>. For example: --limit=cpu=1, --limit=memory=1Gi.
--storage-class
kollama deploy phi --storage-class=standard
StorageClass to use for the Model's associated PersistentVolumeClaim.
If not specified, the default StorageClass will be used.
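To list the StorageClasses available in your cluster (the default one is marked (default) in the output):
kubectl get storageclass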
--pv-access-mode
kollama deploy phi --pv-access-mode=ReadWriteMany
Access mode for the PersistentVolume associated with the image store StatefulSet that Ollama Operator creates to cache pulled images.
If not specified, the access mode will be ReadWriteOnce.
If you are deploying Models into default kind or k3s clusters, you should keep it as ReadWriteOnce. If you are deploying Models into a custom cluster, you can set it to ReadWriteMany if the StorageClass supports it.
--expose
Default: false
kollama deploy phi --expose
Whether to expose the Model through a Service for external access, making it easy to interact with the Model.
Actually, when creating a Model resource, a ClusterIP-type Service is always created.
Even when users don't supply the --expose flag, Ollama Operator will create an associated Service for the Model with the type ClusterIP, named after the corresponding Deployment by default. This Service is used for internal communication between the Model and other services in the cluster.
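You can inspect this default ClusterIP Service for a deployed Model, e.g. phi, by filtering on the operator's labels:
kubectl get svc --selector model.ollama.ayaka.io/name=phi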
By default, --expose will create a NodePort Service.
Use --expose --service-type=LoadBalancer to create a LoadBalancer Service.
--service-type
kollama deploy phi --expose --service-type=NodePort
Default: NodePort
Type of the Service to expose the Model. Only valid when --expose is specified.
If not specified, the Service will be exposed as NodePort.
To see how many Services are associated with Models...
kubectl get svc --selector ollama.ayaka.io/type=model
Use --service-type=LoadBalancer to expose the Service as a LoadBalancer.
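Once the cloud provider has provisioned the LoadBalancer, its external address can be read from the Service status, for example:
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'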
--service-name
kollama deploy phi --expose --service-name=phi-svc-nodeport
Default: ollama-model-<model name>-<service type>
Name of the Service to expose the Model.
If not specified, the Model name will be used as the Service name, with -nodeport as the suffix for NodePort.
--node-port
kollama deploy phi --expose --service-type=NodePort --node-port=30000
Default: Random port
To find out which NodePort is used for the Model...
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o json | jq ".spec.ports[0].nodePort"
You can't simply specify a port number!
There are several restrictions:
- By default, 30000-32767 is the NodePort port range in the Kubernetes cluster. If you want to use ports outside this range, you need to configure the --service-node-port-range parameter for the cluster.
- You can't use a port number already occupied by another service.
For more information about choosing your own port number, please refer to the nodePort chapter of the Kubernetes official documentation.
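As an illustration only, on a self-managed control plane the range can be widened with the API server flag below; how the flag is actually passed depends on how your cluster was installed:
kube-apiserver --service-node-port-range=20000-32767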
If not specified, a random port will be assigned. Only valid when --expose is specified and --service-type is set to NodePort.
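Once the Model is exposed on a fixed NodePort, a quick smoke test (assuming a node reachable at <node-ip> and that the Service fronts Ollama's standard HTTP API, which serves /api/tags):
curl http://<node-ip>:30000/api/tags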