Kubeflow Part 2: Using Notebooks to Create Pipelines—Elyra and RStudio


We are running a #Kubeflow series where we are sharing our experiences and thoughts on building a Kubeflow-based ML pipeline architecture for production use. This is the second post in the series.

The first step in the machine learning process is experimentation, both on the data and on the model. The next step is usually writing code to actually use the model, whether for training or inference. Experimentation is popularly performed in Jupyter Notebooks and similar IDEs. Kubeflow provides a way to use these Notebooks themselves to create Pipelines, such that each step in the Pipeline runs in its own Docker container. In the last article, we went over creating Pipelines in the default manner, using the JupyterLab server. Here, we will cover two alternative approaches.

In Elyra, we will see a simplified method, using a UI to stitch together work that a data scientist might already have done. RStudio will provide an IDE alternative to JupyterLab.

Elyra

Elyra is a project that aims to help data scientists, machine learning engineers, and AI developers navigate the complexities of the model development life cycle. Elyra integrates with JupyterLab, providing a visual Pipeline editor that enables low-code or no-code creation of Pipelines that can be executed in a Kubeflow environment.

Many of the steps for creating a Notebook for an Elyra Pipeline are the same as those for JupyterLab, with minor modifications. We start, once again, by going to the “Notebooks” screen from the Kubeflow console. (In the Kubeflow console, a “Notebook Server” is referred to simply as a “Notebook.”) Click the “Create Notebook” button to open the wizard screen for the creation process.

Figure 1. The Regular Options for Notebook Server–Creation

When creating a new Notebook Server, we select “Custom Image” and provide the following image name: elyra/kf-notebook.

Next, we can set the following parameters in the advanced options:

  • the amount of compute power, in terms of CPUs, and amount of RAM;
  • the amount of storage in the base volume and mounting additional volumes;
  • optional configurations for PodDefault resources in the profile namespace;
  • the number of GPU devices, if required;
  • affinity and tolerations; and
  • enabling shared memory—some libraries, like PyTorch, use shared memory for multiprocessing.

The image selected above, elyra/kf-notebook, is actually built on top of a JupyterLab image, so the interface will look nearly identical. One difference in the launcher is the availability of the “Elyra Pipeline Editor.”

There are two steps to creating a Pipeline in Elyra once the code is available: creating a runtime environment configuration, and creating the Pipeline in the Elyra UI.

Elyra works by stitching different IPYNB Notebooks together into a Pipeline, with each Notebook becoming a separate Component. So in this scenario, we write each Component into a Notebook as if it were normal code.

For example, one Notebook might load some data, another might cleanse it, and a third might perform some analysis. These can be seen as steps in the Pipeline (known as Components in Kubeflow parlance). Our goal is to convert these IPYNB Notebooks into Components, so the code in each Notebook should stand alone and serve only that Component’s purpose.
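For illustration, here is a minimal sketch of what a standalone cleansing Notebook might contain; the file names raw_data.csv and clean_data.csv are hypothetical, and would be declared as the node’s input and output files in the Elyra UI (see Figure 6 below).

```python
# A minimal sketch of a standalone "cleansing" Component.
# raw_data.csv is a hypothetical file produced by an upstream
# Component; Elyra transfers declared files between Components
# via the configured S3 bucket.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Drop rows with missing values and remove duplicates.
df = df.dropna().drop_duplicates()

# Write the declared output file for downstream Components.
df.to_csv("clean_data.csv", index=False)
```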

Next, we define the runtime environment for the containers that will run each Component.

Figure 2. A Sample Pipeline as Displayed in the Elyra Editor

Creating a Runtime Environment Configuration

Once our code is ready, there are several prerequisites to collect before creating an Elyra Pipeline. This connectivity information is stored in a runtime configuration. To create one, select the “Runtimes” tab from the JupyterLab sidebar. From there, create a “New Kubeflow Pipelines runtime” and give it a name and, optionally, a description.

Figure 3. The Runtimes Tab in the JupyterLab Sidebar

Then provide the following information in the configuration form. The first section contains the connectivity information for Kubeflow.

  • API endpoint, e.g., http://kubernetes-service.domain-name.com/pipeline.
  • Namespace, for a multi-user, auth-enabled Kubeflow installation, e.g., mynamespace.
  • Username, for a multi-user, auth-enabled Kubeflow installation, e.g., myusername.
  • Password, for a multi-user, auth-enabled Kubeflow installation, e.g., passw0rd.
  • Workflow engine type, either Argo or Tekton.

Figure 4. The Pipelines Connectivity Information Required

Elyra uses S3-compatible cloud storage to make data available to Notebooks and scripts while they execute. Any S3-compatible storage (such as Minio) should work, as long as it can be accessed from the machine where JupyterLab is running and from the Kubeflow Pipelines cluster. The second section of the form collects this storage information.

  • S3 compatible object storage endpoint, e.g., http://minio-service.kubernetes:9000.
  • S3 object storage username, e.g., minio.
  • S3 object storage password, e.g., minio123.
  • S3 object storage bucket, e.g., pipelines-artifacts.

Figure 5. The Cloud Storage Information Required

Then save the runtime configuration.
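Before saving, it can be worth verifying that the storage values actually work. Here is a minimal sketch using boto3, assuming the example endpoint, credentials, and bucket name from above (they are placeholders, not real defaults):

```python
# Minimal connectivity check for the S3-compatible storage used
# by the runtime configuration. All values below are the example
# placeholders from this section, not real credentials.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubernetes:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

# Raises a ClientError if the bucket is missing or unreachable.
s3.head_bucket(Bucket="pipelines-artifacts")
print("Bucket reachable; runtime configuration values look valid.")
```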

Creating a Kubeflow Pipeline through Elyra

Open a new “Elyra Pipeline Editor” from the launcher. Creating a Pipeline is as simple as dragging and dropping Notebook files into the editor. Once all Components are present, they can be connected: the output of one to the input of another (or several others), as seen in Figure 2.

In the Pipeline Properties, we indicate which base image we want each Component to run in by default. This can be modified on a Component-by-Component basis in each node’s properties. We also set the runtime configuration for the Pipeline here.

Data transfer between Components is also handled through the UI in Elyra. File dependencies and output files can be declared in the Node Properties of each node, or Component. Environment variables, if required, can also be defined in the properties.

Figure 6. Output Files Made Available to Subsequent Components

Now that all the set-up is complete, we just need to save our Pipeline and select “Export Pipeline” to get YAML similar to what we got from our JupyterLab flow!
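As an aside, the exported YAML can also be submitted programmatically instead of being uploaded through the Kubeflow UI. Here is a hedged sketch assuming the kfp v1 SDK, with pipeline.yaml and the host URL as hypothetical placeholders:

```python
# Submit an exported Elyra pipeline to Kubeflow Pipelines.
# Assumes the kfp v1 SDK; "pipeline.yaml" and the host URL are
# hypothetical placeholders for your exported file and endpoint.
import kfp

client = kfp.Client(host="http://kubernetes-service.domain-name.com/pipeline")

run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={},  # pipeline parameters, if any
    run_name="elyra-exported-run",
)
print(f"Started run: {run.run_id}")
```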

RStudio

RStudio is a popular IDE for R, commonly used for data science. Kubeflow Pipelines can also be created through RStudio, with the caveat that it can only be done through a Python file. Such a Python file can be written the same way as the single-cell Pipeline code we wrote in JupyterLab.

To create a Notebook Server for RStudio, select the option labeled “2” in the image-selection portion of the Notebook-creation wizard. This will create a Notebook Server with RStudio. The Kubeflow SDK only provides Python packages and has no support for R. Pipelines can therefore be created through Python files and Kubeflow CLI commands, as discussed in the JupyterLab section of the previous article; or by packaging R code in Docker containers.
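For reference, here is a minimal sketch of such a Python file, assuming the kfp v1 SDK is available in the Notebook image; the add Component is a hypothetical illustration, not part of Kubeflow’s examples:

```python
# A minimal single-file pipeline, written in RStudio's editor the
# same way as the single-cell JupyterLab example. Assumes the
# kfp v1 SDK; the Component below is a hypothetical illustration.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    """A trivial Component that adds two numbers."""
    return a + b

# Each Component runs in its own container with this base image.
add_op = create_component_from_func(add, base_image="python:3.9")

@dsl.pipeline(name="add-pipeline", description="A toy two-step pipeline.")
def add_pipeline(a: float = 1.0, b: float = 2.0):
    first = add_op(a, b)
    add_op(first.output, b)  # the second step consumes the first's output

# Compile to YAML, which can then be uploaded or run via the CLI.
kfp.compiler.Compiler().compile(add_pipeline, "add_pipeline.yaml")
```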

Conclusion

Kubeflow provides easy ways to create Pipelines, each Component of which runs independently and can be reused. Elyra builds on this by allowing Pipelines to be created with minimal extra or modified code. Pipelines can also be created through Python code in other IDEs, including RStudio.

At this point, we should be able to create a training pipeline and an inference pipeline. In the next article, we will discuss running the Pipelines that have been created in a myriad of ways, as well as how to monitor and debug them, including by viewing their logs. 

After that will follow articles on Auto ML, using a model registry, and monitoring our data and models over time.
