kollama deploy
The kollama deploy command is used to deploy a Model to a Kubernetes cluster. It is essentially a wrapper and utility CLI that interacts with the Ollama Operator by manipulating CRD resources.
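Under the hood, a deployment like kollama deploy phi is roughly equivalent to applying a Model resource directly. A minimal sketch of such a manifest (field names follow the Ollama Operator's Model CRD; adjust to your setup) looks like this:

```yaml
# Minimal Model resource; the operator reconciles it into a Deployment,
# an image store, and a Service. Apply with: kubectl apply -f model.yaml
apiVersion: ollama.ayaka.io/v1
kind: Model
metadata:
  name: phi
spec:
  image: registry.ollama.ai/library/phi:latest
```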
Use cases
Deploy a model that lives on registry.ollama.ai

```shell
kollama deploy phi
```

Deploy to a specific namespace

```shell
kollama deploy phi --namespace=production
```

Deploy a Model that lives on a custom registry

```shell
kollama deploy phi --image=registry.example.com/library/phi:latest
```

Deploy a Model with an exposed NodePort service for external access

```shell
kollama deploy phi --expose
```

Deploy a Model with an exposed LoadBalancer service for external access

```shell
kollama deploy phi --expose --service-type=LoadBalancer
```

Deploy a Model with resource limits

The following example deploys the phi model with the CPU limit set to 1 and the memory limit set to 1Gi.

```shell
kollama deploy phi --limit=cpu=1 --limit=memory=1Gi
```

Flags
--namespace
If present, the namespace scope for this CLI request.
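For example, assuming the target namespace already exists (and that the Model CRD's plural name is models), you can verify where the Model landed:

```shell
kollama deploy phi --namespace=production
# Confirm the Model resource was created in that namespace.
kubectl get models --namespace=production
```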
--image
Default: registry.ollama.ai/library/<model name>:latest
```shell
kollama deploy phi --image=registry.ollama.ai/library/phi:latest
```

Model image to deploy.

- If not specified, the Model name will be used as the image name (pulled from registry.ollama.ai/library/<model name> by default if no registry is specified). For example, if the Model name is phi, the image name will be registry.ollama.ai/library/phi:latest.
- If not specified, the tag will be latest.
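Put together, these defaults mean the following two invocations should be equivalent:

```shell
# Image name defaults to the Model name, the registry defaults to
# registry.ollama.ai/library, and the tag defaults to latest.
kollama deploy phi
kollama deploy phi --image=registry.ollama.ai/library/phi:latest
```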
--limit (supports multiple flags)
Multiple limits can be specified by using the flag multiple times.
Resource limits for the deployed Model. This is useful for clusters that don't have enough resources, or if you want to deploy multiple Models in a cluster with limited resources.
For resource limits on NVIDIA and AMD GPUs...
In Kubernetes, any GPU resource follows this pattern for resource labels:
```yaml
resources:
  limits:
    gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```

Using nvidia.com/gpu allows you to limit the number of NVIDIA GPUs. Therefore, when using kollama deploy, you can use --limit=nvidia.com/gpu=1 to specify the number of NVIDIA GPUs as 1:

```shell
kollama deploy phi --limit=nvidia.com/gpu=1
```

This is what it may look like in the YAML configuration file:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
```

Documentation on using resource labels with nvidia/k8s-device-plugin
Using amd.com/gpu allows you to limit the number of AMD GPUs. Therefore, when using kollama deploy, you can use --limit=amd.com/gpu=1 to specify the number of AMD GPUs as 1:

```shell
kollama deploy phi --limit=amd.com/gpu=1
```

This is what it may look like in the YAML configuration file:
```yaml
resources:
  limits:
    amd.com/gpu: 1 # requesting 1 GPU
```

You can read more here: Schedule GPUs | Kubernetes
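Before setting a GPU limit, it can help to confirm that the nodes actually advertise the extended resource. One quick way to check (assumes jq is installed):

```shell
# List any GPU-flavored resources each node reports as allocatable.
# An empty result means no device plugin has registered a GPU resource.
kubectl get nodes -o json \
  | jq '.items[].status.allocatable | with_entries(select(.key | contains("gpu")))'
```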
I have deployed a Model, but I want to change the resource limits...
Of course you can. With the kubectl set resources command, you can change the resource limits:

```shell
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4
```

For memory limits:

```shell
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits memory=8Gi
```

The format is <resource>=<quantity>.
For example: --limit=cpu=1 --limit=memory=1Gi.
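To double-check what was actually applied, you can read the limits back from the generated Deployment (assumes jq is installed; the label follows the convention used above):

```shell
kollama deploy phi --limit=cpu=1 --limit=memory=1Gi
# Read back the container's resource limits from the generated Deployment.
kubectl get deployment -l model.ollama.ayaka.io/name=phi -o json \
  | jq '.items[0].spec.template.spec.containers[0].resources.limits'
```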
--storage-class
```shell
kollama deploy phi --storage-class=standard
```

StorageClass to use for the Model's associated PersistentVolumeClaim.
If not specified, the default StorageClass will be used.
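If you are unsure which StorageClasses exist in your cluster, or which one is the default, you can list them first:

```shell
# The class marked "(default)" is what the PVC falls back to
# when --storage-class is not specified.
kubectl get storageclass
```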
--pv-access-mode
```shell
kollama deploy phi --pv-access-mode=ReadWriteMany
```

Access mode for the PersistentVolume associated with the StatefulSet of the image store (created by the Ollama Operator to cache pulled images).
If not specified, the access mode will be ReadWriteOnce.
If you are deploying Models into kind or k3s clusters deployed with default settings, you should keep it as ReadWriteOnce. If you are deploying Models into a custom cluster, you can set it to ReadWriteMany if the StorageClass supports it.
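One way to see what access mode was actually provisioned is to inspect the PersistentVolumeClaims after deploying (a generic check; claim names vary by cluster):

```shell
# The ACCESS MODES column shows RWO (ReadWriteOnce) or RWX (ReadWriteMany).
kubectl get pvc
```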
--expose
Default: false
```shell
kollama deploy phi --expose
```

Whether to expose the Model through a Service for external access, making it easy to interact with the Model.
Actually, when creating a Model resource, a ClusterIP type Service will be created
In the case where the user didn't supply the --expose flag, the Ollama Operator will create an associated Service for the Model by default, with the type ClusterIP and the same name as the corresponding Deployment; this Service is used for internal communication between the Model and other services in the cluster.
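You can confirm this default Service exists even without --expose (using the label convention shown elsewhere on this page):

```shell
# Lists the Service(s) created for a given Model; expect TYPE to be
# ClusterIP when --expose was not supplied.
kubectl get svc --selector model.ollama.ayaka.io/name=<model name>
```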
By default, --expose will create a NodePort service.
Use --expose --service-type=LoadBalancer to create a LoadBalancer service instead.
--service-type
Default: NodePort

```shell
kollama deploy phi --expose --service-type=NodePort
```
Type of the Service to expose the Model. Only valid when --expose is specified.
If not specified, the service will be exposed as NodePort.
To understand how many Services are associated with a Model...

```shell
kubectl get svc --selector ollama.ayaka.io/type=model
```

Use --service-type=LoadBalancer to expose the service as a LoadBalancer.
--service-name
Default: ollama-model-<model name>-<service type>

```shell
kollama deploy phi --expose --service-name=phi-svc-nodeport
```
Name of the Service to expose the Model.
If not specified, the Model name will be used as the service name with -nodeport as the suffix for NodePort.
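After deploying with a custom name, a quick lookup confirms the Service was created as requested (reusing the example name above):

```shell
kubectl get svc phi-svc-nodeport
```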
--node-port
Default: Random port

```shell
kollama deploy phi --expose --service-type=NodePort --node-port=30000
```

To find out which NodePort is used for the Model...

```shell
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o json | jq ".items[0].spec.ports[0].nodePort"
```

You can't simply specify a port number!
There are several restrictions:
- By default, 30000-32767 is the NodePort port range in the Kubernetes cluster. If you want to use ports outside this range, you need to configure the --service-node-port-range parameter for the cluster.
- You can't use a port number already occupied by other services.
For more information about choosing your own port number, please refer to the nodePort section of the official Kubernetes documentation.
If not specified, a random port will be assigned. Only valid when --expose is specified, and --service-type is set to NodePort.
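Once the NodePort Service is up, the Model's Ollama-compatible API should be reachable on any node's address (the node IP placeholder and the /api/tags endpoint here assume a standard Ollama API surface):

```shell
# With --node-port=30000, list the models served behind the Service.
curl http://<node ip>:30000/api/tags
```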