NVIDIA GPU Operator – Simplifying AI/ML Deployments on the Canonical Platform
anaqvi
on 22 October 2019
Tags: GPU, Kubeflow, Kubernetes
Leveraging Kubernetes for AI deployments is becoming increasingly popular. Chances are, if your business runs AI/ML workloads on Kubernetes, you are already using tools like Kubeflow to reduce complexity, costs and deployment time. If not, you may be missing out!
With AI/ML among the most discussed topics in tech, GPUs play a critical role in the space. NVIDIA, a prominent player in the GPU space, is one of the top choices for most stakeholders in the field. NVIDIA takes its commitment to the space a step further with the launch of the GPU Operator open-source project at Mobile World Congress LA.
What is the GPU Operator
The GPU is a high-performance compute resource in the cluster, and several components must be installed before application workloads can be deployed onto it. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin, the container runtime, etc. With the GPU Operator, you can manage GPU resources in a Kubernetes cluster and automate the tasks involved in bootstrapping GPU nodes.
Supported Platforms
The NVIDIA GPU Operator currently supports and has been validated with the following:
● Pascal+ GPUs are supported (incl. Tesla V100 and T4)
● Kubernetes v1.13+
  - Canonical’s Charmed Kubernetes v1.15 has been tested with, and supports, the NVIDIA GPU Operator. The GPU Operator works out of the box with Canonical’s Charmed Kubernetes and is supported from day one.
  - Note: Helm may fail to initialize on Kubernetes v1.16. More details, including a workaround, can be found in the GitHub issue.
● Helm 2
● Ubuntu 18.04.3 LTS
The GPU Operator includes the following NVIDIA components:
● Docker CE 19.03.2
● NVIDIA Container Toolkit 1.0.5
● NVIDIA Kubernetes Device Plugin 1.0.0-beta4
● NVIDIA Tesla Driver 418.87.01
Set-Up
Prerequisites
The GPU Operator has a few prerequisites:
- It requires a fresh configuration of nodes – nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).
- i2c_core and ipmi_msghandler kernel modules need to be loaded
The following command ensures these modules are loaded:
$ sudo modprobe -a i2c_core ipmi_msghandler
Modules loaded this way do not persist across reboots. To load them at every boot, add the modules to a config file as shown:
$ echo -e "i2c_core\nipmi_msghandler" | sudo tee /etc/modules-load.d/driver.conf
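Before installing the operator, it can help to confirm that both modules are actually present. The following is a minimal sketch and not part of the GPU Operator tooling: `missing_modules` is a hypothetical helper that reads a module list in the format of /proc/modules and prints whichever of the two required modules are absent. It assumes the modules are loadable rather than compiled into the kernel; a built-in module does not appear in /proc/modules, so a name printed here is a prompt to investigate, not proof of a problem.

```shell
# Hypothetical helper: print the required modules missing from a module list.
# $1: file in /proc/modules format (module name is the first field).
#     Defaults to /proc/modules when no argument is given.
missing_modules() {
  mm_list_file="${1:-/proc/modules}"
  for mm_mod in i2c_core ipmi_msghandler; do
    # Compare the module name exactly against the first column of each line.
    if ! awk -v m="$mm_mod" '$1 == m { found = 1 } END { exit !found }' "$mm_list_file"; then
      echo "$mm_mod"
    fi
  done
}

# Usage on a node (prints nothing when both modules are loaded):
#   missing_modules /proc/modules
```

Taking the file as an argument keeps the check easy to try against a saved copy of another node's module list.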
- Node Feature Discovery (NFD) is required on each node. By default, the NFD master and worker are deployed automatically.
If NFD is already running in the cluster prior to the deployment of the operator, set the variable nfd.enabled=false at the helm install step:
$ helm install --devel --set nfd.enabled=false nvidia/gpu-operator -n test-operator
See notes on NFD setup
Install Helm
$ curl -L https://git.io/get_helm.sh | bash
Create a service account for Helm (Tiller)
$ kubectl create serviceaccount -n kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
Initialize Helm
$ helm init --service-account tiller --wait
Note that if Helm is already deployed in your cluster and you are adding a new node, run this instead:
$ helm init --client-only
Install the GPU Operator
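The nvidia/gpu-operator chart must come from a configured Helm repository. If the NVIDIA repository has not been added yet, add it first:
$ helm repo add nvidia https://nvidia.github.io/gpu-operator
$ helm repo update
Note that after running the install command, NFD will be automatically deployed.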
$ helm install --devel nvidia/gpu-operator -n test-operator --wait
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
To check the gpu-operator version
$ helm ls
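Once the operator is running, GPU nodes should advertise the `nvidia.com/gpu` resource exposed by the NVIDIA Kubernetes device plugin. As a rough sketch of how to confirm that scheduling works, the hypothetical helper below writes a one-off pod manifest that requests a single GPU and runs `nvidia-smi`. The pod name, the output path, and the container image in the usage comment are assumptions chosen for illustration and do not come from the GPU Operator documentation; substitute a CUDA image compatible with the installed driver.

```shell
# Hypothetical helper: write a pod manifest that requests one GPU.
# $1: output path for the manifest
# $2: container image (assumed to ship the nvidia-smi binary)
write_gpu_test_pod() {
  cat > "$1" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: $2
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (the image name is an assumption, not taken from the original post):
#   write_gpu_test_pod gpu-smoke-test.yaml nvidia/cuda:10.1-base
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test
```

If the pod stays in Pending, the node is most likely not yet advertising the GPU resource, which points back at the operator pods rather than at the workload.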
Running a Sample GPU Application
Create a TensorFlow notebook example
$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml
Grab the token from the pod once it is created
$ kubectl get pod tf-notebook
$ kubectl logs tf-notebook
The logs print a login URL of the following form, which contains the token:
http://localhost:8888/?token=MY_TOKEN
You can now access the notebook at http://localhost:30001/?token=MY_TOKEN, substituting the token from the logs for MY_TOKEN.
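Copying the token out of the logs by hand is easy to get wrong. A small sketch of automating that step follows: `extract_token` is a hypothetical helper that reads log text on standard input and prints the first token it finds. It assumes the token appears as `token=` followed by letters, digits or underscores, as in the URL above.

```shell
# Hypothetical helper: print the first notebook token found on stdin.
extract_token() {
  # -o prints only the matching part; keep the first match, then drop "token=".
  grep -o 'token=[A-Za-z0-9_]*' | head -n 1 | cut -d= -f2
}

# Usage against the sample pod:
#   token=$(kubectl logs tf-notebook | extract_token)
#   echo "http://localhost:30001/?token=${token}"
```

The helper prints nothing if the pod has not yet written its login URL, so an empty result usually means the notebook is still starting.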
What’s next
NVIDIA and Canonical will continue partnering to improve the AI/ML space and enable innovators. One area of interest is extending the GPU Operator to MicroK8s. MicroK8s takes Kubernetes simplification one step further: a lightweight Kubernetes distribution with Kubeflow, GPUs, Helm and the GPU Operator all in one package. Get started in seconds!
Contributing
If you find a bug, have technical issues, or would like to contribute to the NVIDIA GPU Operator, please visit the official GitHub page.
For issues with, or contributions to, Canonical’s Kubernetes, please visit the GitHub page. You can also reach out to us on Twitter at @canonical and @ubuntu.
Canonical and NVIDIA look forward to your valuable feedback!