kollama deploy

The kollama deploy command is used to deploy a Model to a Kubernetes cluster. It's basically a wrapper and utility CLI that interacts with the Ollama Operator by manipulating CRD resources.
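
Under the hood, each deployment creates a Model custom resource that you can inspect with kubectl. As a quick sketch (assuming the CRD's plural name is models under the ollama.ayaka.io API group, as the label selectors later on this page suggest):

shell
kubectl get models.ollama.ayaka.io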

Use cases

Deploy a Model that lives on registry.ollama.ai

shell
kollama deploy phi

Deploy to a specific namespace

shell
kollama deploy phi --namespace=production

Deploy a Model that lives on a custom registry

shell
kollama deploy phi --image=registry.example.com/library/phi:latest

Deploy a Model with an exposed NodePort service for external access

shell
kollama deploy phi --expose

Deploy a Model with an exposed LoadBalancer service for external access

shell
kollama deploy phi --expose --service-type=LoadBalancer
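
Once the LoadBalancer has been provisioned by your cloud provider or load balancer controller, you can look up its external address with the Model Service selector shown later under --service-type:

shell
kubectl get svc --selector ollama.ayaka.io/type=model -o wide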

Deploy a Model with resource limits

The following example deploys the phi Model with the CPU limit set to 1 and the memory limit set to 1Gi.

shell
kollama deploy phi --limit=cpu=1 --limit=memory=1Gi
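
To verify that the limits were applied, you can query the resulting Deployment through the label selector used elsewhere on this page (a sketch for the phi Model deployed above):

shell
kubectl get deployment -l model.ollama.ayaka.io/name=phi -o jsonpath='{.items[0].spec.template.spec.containers[0].resources.limits}'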

Flags

--namespace

If present, the namespace scope for this CLI request.

--image

Default: registry.ollama.ai/library/<model name>:latest

shell
kollama deploy phi --image=registry.ollama.ai/library/phi:latest

Model image to deploy.

  • If not specified, the Model name will be used as the image name, and the image will be pulled from registry.ollama.ai/library/<model name> by default if no registry is specified. For example, if the Model name is phi, the image will be registry.ollama.ai/library/phi:latest.
  • If no tag is specified, latest will be used.

--limit (supports multiple flags)

Resource limits for the deployed Model. Multiple limits can be specified by using the flag multiple times.

This is useful for clusters that don't have many resources to spare, or if you want to deploy multiple Models into a cluster with limited resources.

For resource limits on NVIDIA, AMD GPUs...

In Kubernetes, any GPU resource follows this pattern for resource labels:

yaml
resources:
  limits:
    gpu-vendor.example/example-gpu: 1 # requesting 1 GPU

Using nvidia.com/gpu allows you to limit the number of NVIDIA GPUs. With kollama deploy, you can pass --limit=nvidia.com/gpu=1 to request one NVIDIA GPU:

shell
kollama deploy phi --limit=nvidia.com/gpu=1

This is what it looks like in the resulting YAML manifest:

yaml
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU

Documentation on using resource labels with nvidia/k8s-device-plugin

Using amd.com/gpu allows you to limit the number of AMD GPUs. With kollama deploy, you can pass --limit=amd.com/gpu=1 to request one AMD GPU:

shell
kollama deploy phi --limit=amd.com/gpu=1

This is what it looks like in the resulting YAML manifest:

yaml
resources:
  limits:
    amd.com/gpu: 1 # requesting 1 GPU

Example YAML manifest of labels with ROCm/k8s-device-plugin

You can read more here: Schedule GPUs | Kubernetes

I have deployed a Model, but I want to change the resource limits...

Of course you can. With the kubectl set resources command, you can change the resource limits:

shell
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4

For memory limits:

shell
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits memory=8Gi
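
Since kubectl set resources accepts a comma-separated list, both limits can also be changed in a single invocation:

shell
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4,memory=8Gi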

The format is <resource>=<quantity>.

For example: --limit=cpu=1 --limit=memory=1Gi.

--storage-class

shell
kollama deploy phi --storage-class=standard

StorageClass to use for the Model's associated PersistentVolumeClaim.

If not specified, the default StorageClass will be used.
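
To see which StorageClasses are available in your cluster, and which one is marked as the default:

shell
kubectl get storageclass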

--pv-access-mode

shell
kollama deploy phi --pv-access-mode=ReadWriteMany

Access mode for the PersistentVolume associated with the image store StatefulSet that the Ollama Operator creates (the image store caches pulled images).

If not specified, the access mode will be ReadWriteOnce.

If you are deploying Models into default kind or k3s clusters, you should keep it as ReadWriteOnce. If you are deploying Models into a custom cluster, you can set it to ReadWriteMany if the StorageClass supports it.
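
You can verify which access mode the claim ended up with by listing the PersistentVolumeClaims in the Model's namespace (the exact PVC name depends on how the operator names the image store):

shell
kubectl get pvc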

--expose

Default: false

shell
kollama deploy phi --expose

Whether to expose the Model through a Service for external access, making it easy to interact with the Model.

Actually, when creating a Model resource, a ClusterIP type Service will be created...

In the case where the user didn't supply the --expose flag, the Ollama Operator will still create an associated Service for the Model, of type ClusterIP, with the same name as the corresponding Deployment. This Service is used for internal communication between the Model and other services in the cluster.

By default, --expose will create a NodePort Service.

Use --expose --service-type=LoadBalancer to create a LoadBalancer Service instead.
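
To verify what was created, you can look the Service up by the default naming scheme described under --service-name below (a sketch for the phi Model exposed as NodePort):

shell
kubectl get svc ollama-model-phi-nodeport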

--service-type

shell
kollama deploy phi --expose --service-type=NodePort

Default: NodePort

Type of the Service to expose the Model. Only valid when --expose is specified.

If not specified, the service will be exposed as NodePort.

To see which Services are associated with a Model...

shell
kubectl get svc --selector ollama.ayaka.io/type=model

Use --service-type=LoadBalancer to expose the Service as a LoadBalancer.

--service-name

shell
kollama deploy phi --expose --service-name=phi-svc-nodeport

Default: ollama-model-<model name>-<service type>

Name of the Service to expose the Model.

If not specified, the Model name will be used as the Service name, with -nodeport appended as the suffix for NodePort Services.

--node-port

shell
kollama deploy phi --expose --service-type=NodePort --node-port=30000

Default: Random port

To find out which NodePort was assigned to the Model...

shell
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o json | jq ".spec.ports[0].nodePort"

You can't simply specify a port number!

There are several restrictions:

  1. By default, 30000-32767 is the NodePort port range in the Kubernetes cluster. If you want to use ports outside this range, you need to configure the --service-node-port-range parameter for the cluster.
  2. You can't use the port number already occupied by other services.

For more information about choosing your own port number, please refer to the nodePort section of the official Kubernetes documentation.

The nodePort to expose the Model on.

If not specified, a random port will be assigned. Only valid when --expose is specified, and --service-type is set to NodePort.
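
Once the Model is exposed through a NodePort, the Ollama HTTP API should be reachable on any node's IP at that port. For example, listing the available models (assuming the deployment above with --node-port=30000; replace <node-ip> with the address of one of your nodes):

shell
curl http://<node-ip>:30000/api/tags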
