kollama deploy
The kollama deploy command is used to deploy a Model to the Kubernetes cluster. It's basically a wrapper and utility CLI for interacting with the Ollama Operator by manipulating CRD resources.
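For reference, a deploy is just a Model resource under the hood. Here is a minimal sketch of what kollama deploy phi creates, assuming the CRD group/version implied by the operator's model.ollama.ayaka.io labels:
apiVersion: ollama.ayaka.io/v1 # assumption: group/version inferred from the operator's label names
kind: Model
metadata:
  name: phi
spec:
  image: phi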
Use cases
Deploy model that lives on registry.ollama.ai
kollama deploy phi
Deploy to a specific namespace
kollama deploy phi --namespace=production
Deploy Model that lives on a custom registry
kollama deploy phi --image=registry.example.com/library/phi:latest
Deploy Model with exposed NodePort service for external access
kollama deploy phi --expose
Deploy Model with exposed LoadBalancer service for external access
kollama deploy phi --expose --service-type=LoadBalancer
Deploy Model with resource limits
The following example deploys the phi model with the CPU limit set to 1 and the memory limit set to 1Gi.
kollama deploy phi --limit=cpu=1 --limit=memory=1Gi
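Under the hood, these flags render into an ordinary Kubernetes limits stanza on the Model's workload, along these lines (a sketch mirroring the GPU examples under --limit below):
resources:
  limits:
    cpu: "1"
    memory: 1Gi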
Flags
--namespace
If present, the namespace scope for this CLI request.
--image
Default: registry.ollama.ai/library/<model name>:latest
kollama deploy phi --image=registry.ollama.ai/library/phi:latest
Model image to deploy.
- If not specified, the Model name will be used as the image name (pulled from registry.ollama.ai/library/<model name> by default if no registry is specified). For example, if the Model name is phi, the image name will be registry.ollama.ai/library/phi:latest.
- If not specified, the tag will be latest.
--limit
(supports multiple flags)
Resource limits for the deployed Model. Multiple limits can be specified by using the flag multiple times. This is useful for clusters that don't have a large amount of resources, or if you want to deploy multiple Models in a cluster with limited resources.
For resource limits on NVIDIA, AMD GPUs...
In Kubernetes, any GPU resource follows this pattern for resource labels:
resources:
  limits:
    gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
Using nvidia.com/gpu allows you to limit the number of NVIDIA GPUs. Therefore, when using kollama deploy, you can pass --limit=nvidia.com/gpu=1 to request 1 NVIDIA GPU:
kollama deploy phi --limit=nvidia.com/gpu=1
This is what it may look like in the YAML configuration file:
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
Documentation on using resource labels with nvidia/k8s-device-plugin
Using amd.com/gpu allows you to limit the number of AMD GPUs. Therefore, when using kollama deploy, you can pass --limit=amd.com/gpu=1 to request 1 AMD GPU:
kollama deploy phi --limit=amd.com/gpu=1
This is what it may look like in the YAML configuration file:
resources:
  limits:
    amd.com/gpu: 1 # requesting a GPU
You can read more here: Schedule GPUs | Kubernetes
I have deployed a Model, but I want to change the resource limits...
Of course you can. With the kubectl set resources command, you can change the resource limits:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4
For memory limits:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits memory=8Gi
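Both limits can also be changed in a single invocation, since kubectl set resources accepts a comma-separated list:
kubectl set resources deployment -l model.ollama.ayaka.io/name=<model name> --limits cpu=4,memory=8Gi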
The format is <resource>=<quantity>. For example: --limit=cpu=1, --limit=memory=1Gi.
--storage-class
kollama deploy phi --storage-class=standard
StorageClass to use for the Model's associated PersistentVolumeClaim.
If not specified, the default StorageClass will be used.
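To list the StorageClasses available in your cluster (the default one is marked (default) in the output):
kubectl get storageclass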
--pv-access-mode
kollama deploy phi --pv-access-mode=ReadWriteMany
Access mode for the PersistentVolume associated with the image store StatefulSet that Ollama Operator creates to cache pulled images.
If not specified, the access mode will be ReadWriteOnce.
If you are deploying Models into default kind or k3s clusters, you should keep it as ReadWriteOnce. If you are deploying Models into a custom cluster, you can set it to ReadWriteMany if the StorageClass supports it.
--expose
Default: false
kollama deploy phi --expose
Whether to expose the Model through a Service for external access, making it easy to interact with the Model.
Actually, when creating a Model resource, a ClusterIP-type Service is always created.
Even when users don't supply the --expose flag, Ollama Operator will create an associated Service for the Model with the type ClusterIP, named after the corresponding Deployment by default. This Service is used for internal communication between the Model and other services in the cluster.
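You can inspect this default ClusterIP Service for a deployed Model, e.g. phi, by filtering on the operator's labels:
kubectl get svc --selector model.ollama.ayaka.io/name=phi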
By default, --expose will create a NodePort Service.
Use --expose --service-type=LoadBalancer to create a LoadBalancer Service.
--service-type
kollama deploy phi --expose --service-type=NodePort
Default: NodePort
Type of the Service to expose the Model. Only valid when --expose is specified.
If not specified, the Service will be exposed as NodePort.
To see how many Services are associated with Models...
kubectl get svc --selector ollama.ayaka.io/type=model
Use --service-type=LoadBalancer to expose the Service as a LoadBalancer.
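Once the cloud provider has provisioned the LoadBalancer, its external address can be read from the Service status, for example:
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'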
--service-name
kollama deploy phi --expose --service-name=phi-svc-nodeport
Default: ollama-model-<model name>-<service type>
Name of the Service to expose the Model.
If not specified, the Model name will be used as the Service name, with -nodeport as the suffix for NodePort.
--node-port
kollama deploy phi --expose --service-type=NodePort --node-port=30000
Default: Random port
To find out which NodePort is used for the Model...
kubectl get svc --selector model.ollama.ayaka.io/name=<model name> -o json | jq ".spec.ports[0].nodePort"
You can't simply specify a port number!
There are several restrictions:
- By default, 30000-32767 is the NodePort port range in the Kubernetes cluster. If you want to use ports outside this range, you need to configure the --service-node-port-range parameter for the cluster.
- You can't use a port number already occupied by another service.
For more information about choosing your own port number, please refer to the nodePort chapter of the Kubernetes official documentation.
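As an illustration only, on a self-managed control plane the range can be widened with the API server flag below; how the flag is actually passed depends on how your cluster was installed:
kube-apiserver --service-node-port-range=20000-32767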
If not specified, a random port will be assigned. Only valid when --expose is specified and --service-type is set to NodePort.
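Once the Model is exposed on a fixed NodePort, a quick smoke test (assuming a node reachable at <node-ip> and that the Service fronts Ollama's standard HTTP API, which serves /api/tags):
curl http://<node-ip>:30000/api/tags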