Kubeflow Part 4: AutoML Experimentation in Kubeflow Using Katib


We are running a #Kubeflow series where we are sharing our experiences and thoughts on building a Kubeflow-based ML pipeline architecture for production use. This is the fourth post in the series.

One of the harder parts of creating models, for data scientists, is hyperparameter tuning. Hyperparameters are core to how well the model performs on the data: they are not learned by the model from the data; rather, they are set beforehand by the data scientist.

In the last set of articles in this series, we learned how to create pipelines through JupyterLab—or visually through Elyra—and how to run them. In this article, we will look at one step that a data scientist might perform prior to running the training pipeline: hyperparameter tuning.

The same kind of machine learning model can require different constraints, weights, or learning rates to generalize different data patterns. These hyperparameters—so called because they are set by the data scientist to control the learning process, as opposed to a parameter being learned through the process—are set through a trial-and-error process.

One way would be for the data scientist to run the same training process over the data, manually using different values for each of the hyperparameters for each run. They would then compare the results using metrics such as loss or accuracy.

This involves coming up with strategies to modify the hyperparameters after each step, as well as manually changing the values and submitting the jobs. This has led to a host of tools created to solve this particular problem: AutoML tools.

Automated machine learning, or AutoML, is the process of optimizing the model’s architecture or hyperparameters before any learning is done by it. It optimizes the learning process itself, through a set of preprogrammed strategies. Given its importance in the journey to creating a properly working model, this article focuses on AutoML in Kubeflow, the tool used for it, and how to get it working.


Katib is a Kubernetes-native project for AutoML. The specific AutoML areas supported by Katib are hyperparameter tuning and neural architecture search (NAS), which is the process of learning the best possible architecture of a neural network—this includes the size of filters in a convolutional neural network (CNN), for instance. The scope of this article is limited to hyperparameter tuning, but the same steps apply for NAS in Katib.

Katib is agnostic to machine learning frameworks. It can tune hyperparameters of applications written in any language, and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.

Key to any AutoML tool is the selection of optimization algorithms available to the user, and Katib has an extensive list. In addition to the simple grid search and random search algorithms, more intricate ones currently supported for hyperparameter tuning include Bayesian optimization, Hyperband, Tree of Parzen Estimators (TPE), Multivariate TPE, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), Sobol’s Quasirandom Sequence, and Population Based Training (PBT). NAS algorithms include Efficient Neural Architecture Search (ENAS) and Differentiable Architecture Search (DARTS).

Finally, Katib comes out-of-the-box with Kubeflow. This means the integration work is taken care of for us, and data scientists can focus on configuring their experimentation to easily complete the hyperparameter tuning phase.

The Data and Model

The use case for this hyperparameter tuning experiment is the same as in our previous articles: a text use case. Each element of text in the dataset is a log statement. The problem is to find out which lines are anomalous, based on the order in which the log lines appear.

In the preceding articles, we created a simple artificial neural network (ANN) as a sample solution, to avoid the large models and long training times that a real-world solution would involve. We proceed with the same data and model here.

Before Creating a Katib Experiment

There are a few prerequisites for running a Katib experiment. The main one is packaging the training code as a container image and making it available in a registry. For the latter, the Docker documentation and the Kubernetes documentation are sufficient. We therefore focus on the former: specifically, converting the code we already have, written for creating a pipeline in JupyterLab, into code that can be run as a Katib experiment.

There are four steps to containerizing the training code for Katib: (1) consolidating all Component functions; (2) replacing Kubeflow input/output artifacts with simple arguments and return statements; (3) logging the objective metrics in their respective functions; and (4) combining them in a main function. Let’s look at each more closely.

Consolidating Component Functions

The first step is trivial. All of the code required for this step already exists in a Python file created using the Jupyter Magic for this purpose.

%%writefile ./single_layer_ann_training_pipeline.py

Code Segment 1. A Single-Line Jupyter Magic that Converts a Cell into a Python File

We can copy and paste the Component functions into a separate Python file for this purpose.

Replacing Kubeflow Artifacts with Python Alternatives

The second step of containerizing our training code is replacing Kubeflow artifacts with regular Python arguments and return statements. Here is an example of the same.

def preprocess_data(csv_path: comp.InputPath('CSV'), sequence_json: comp.OutputPath()):
    # Imports required for the Pipeline Component.
    import pandas as pd
    import numpy as np
    import json
    # Read from the artifact CSV.
    df = pd.read_csv(csv_path)

    # Preprocess the dataset.
    df['sequence'] = df['sequence'].replace('[]', np.nan).copy()
    mask = ~(df['sequence'].isna())
    sequences = df.loc[mask, 'sequence']
    df = None
    sequences = [eval(sequence) for sequence in sequences]

    # Write the preprocessed data into an artifact.
    with open(sequence_json, 'w') as f:
        json.dump(sequences, f)

Code Segment 2. The Original Code of a Component to Preprocess Data

def preprocess_data(df):
    # Preprocess the dataset.
    df['sequence'] = df['sequence'].replace('[]', np.nan).copy()
    mask = ~(df['sequence'].isna())
    sequences = df.loc[mask, 'sequence']
    df = None
    sequences = [eval(sequence) for sequence in sequences]

    return sequences

Code Segment 3. The New Function that can be Containerized

Note the differences between Code Segments 2 and 3. First, the function header no longer contains any parameter annotated with one of the output artifact types of kfp.components: OutputBinaryFile, OutputTextFile, or OutputPath. (In the full code containing the segments above, kfp.components was imported as comp: import kfp.components as comp.) The first two types can be converted into file-write statements.

def preprocess_data(data: str, write_path: comp.OutputTextFile()):
    with open(write_path, 'w') as f:
        # Write the data into the output artifact.
        f.write(data)

Code Segment 4. Sample Function with an Output Artifact of Text File Type

def preprocess_data(data):
    with open('/path/to/write/location', 'w') as f:
        # Write the data to a regular file inside the container.
        f.write(data)

Code Segment 5. Sample Function after being Converted for Containerization

Output artifacts of the Path type can be turned into return statements. Instead of writing the contents of an object into a file, and having to subsequently read from that file in functions that require that object, we can simply return the object itself. In Code Segment 2, we write the contents of the list sequences into a JSON artifact of the Path type. In Code Segment 3, we return the sequences list at the end of the function.

Input artifacts, too, are converted into corresponding Python objects. However, these remain in the function headers. As with output artifacts, input artifacts of the Binary File and Text File types can be converted into paths from which file-read statements can be made, and Path types can be converted into the relevant Python objects. For instance, the function reading the sequences list that is returned by preprocess_data will no longer expect the location of the JSON file containing its contents; instead, it will expect the list object itself.
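
As a minimal sketch of this change, consider a hypothetical downstream function that counts log events (count_events and count_events_from_artifact are illustrative names, not part of our pipeline). The pipeline-style version expects a path to the JSON artifact, while the containerized version expects the Python object itself.

```python
import json
import os
import tempfile


def count_events_from_artifact(sequence_json):
    # Pipeline style: receive the path of the JSON artifact and read it back.
    with open(sequence_json) as f:
        sequences = json.load(f)
    return sum(len(sequence) for sequence in sequences)


def count_events(sequences):
    # Containerized style: receive the Python object directly.
    return sum(len(sequence) for sequence in sequences)


# Made-up sample data standing in for the output of preprocess_data.
sequences = [[5, 5, 22], [11, 9]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sequences.json')
    with open(path, 'w') as f:
        json.dump(sequences, f)
    from_artifact = count_events_from_artifact(path)

direct = count_events(sequences)
```

Both versions compute the same result; the second simply skips the round trip through the file system.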

The last part of readying our functions for containerization is simply to move all import statements to the top of the Python script. Recall that each Component’s import statements needed to be inside the function, as each Component ran as its own container. Since all the code will be containerized together for Katib experimentation, they can all reside at one location at the top of the script. This is not necessary, but can make larger files easier to read and maintain.

Logging Objective Metrics

Katib requires one objective metric, and allows for optional additional metrics that are simply recorded. Katib attempts to optimize the hyperparameters by looking at the objective metric. Thus, this objective metric and the additional metrics, if any, must be logged.

There are five metrics collectors that may be defined in Katib, and the Katib Metrics Collector documentation details them along with examples. In this article, we will discuss the logging requirements for the StdOut metrics collector. This is Katib's default metrics collector, yet its documentation is somewhat sparse. At the time of writing, the following logging behavior worked for us.

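import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    # INFO-level messages are dropped unless the level is lowered to INFO.
    level=logging.INFO,
)

logging.info('accuracy=' + str(correct))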

Code Segment 6. The Logging Config and Statement in Segments

The base logging library of Python works perfectly for this task. We configure it to use the format above. This behavior was taken from the MXNet MNIST example in the Katib repository.

Each metric to be collected is then logged at INFO-level in the format ‘<metric_name>=<metric_value>’, with no spaces.
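
To make the expected shape of these log lines concrete, here is a small sketch. The helper names format_metric and parse_metrics, and the regular expression itself, are our own assumptions for illustration; the pattern is a simplified stand-in for name=value matching, not the filter Katib actually applies.

```python
import logging
import re

# Simplified, assumed pattern for '<metric_name>=<metric_value>' pairs.
METRIC_PATTERN = re.compile(r"([\w-]+)=([+-]?\d+(?:\.\d+)?)")


def format_metric(name, value):
    # Build the metric string with no spaces around the equals sign.
    return f"{name}={value}"


def parse_metrics(log_line):
    # Extract every name=value pair found in a single log line.
    return {name: float(value) for name, value in METRIC_PATTERN.findall(log_line)}


logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    level=logging.INFO,
)

message = format_metric('accuracy', 0.9075)
logging.info(message)

# A made-up example of what the emitted line may look like.
sample_line = '2021-09-01 10:00:00,123 INFO     ' + message
parsed = parse_metrics(sample_line)
```

Running a quick check like this locally can catch stray spaces or misspelled metric names before the image is built.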

Finalizing in Main

Once all changes have been made to the functions, we just need to bring it all together in a main function or segment. Here is the one we used for this use-case.

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Declare one command-line argument per hyperparameter to be tuned.
    parser.add_argument('--lr')
    parser.add_argument('--batch_size')
    parser.add_argument('--epochs')
    parser.set_defaults(lr=1e-3, batch_size=64, epochs=5)
    args = parser.parse_args()

    output_str = unzip_data('zipfile.zip')
    df = read_data('bucket-name', output_str, ',', '.', 'utf-8')
    sequences = preprocess_data(df)
    model = model_training(sequences, float(args.lr), int(args.batch_size), int(args.epochs))
    model_evaluating(sequences, model)

Code Segment 7. The Main Segment

The bottom half of the segment is the culmination of our efforts to change the Component functions into Python functions. Because the functions now return their results, each call's output is assigned to a variable and passed on to the next function.

At the top of the main segment, an argument-parser is created. These arguments must represent the hyperparameters that will be tuned during the experiment. If the hyperparameters change, the code must be changed and the container must be built once more. The effect of this can be seen on the call of the model_training function: model = model_training(sequences, float(args.lr), int(args.batch_size), int(args.epochs)).
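
One way to check the argument parsing before building the image is to pass an explicit argument list in the shape a trial would supply. The sketch below assumes a hypothetical helper, parse_hyperparameters, and lets argparse do the type conversion; the values passed are only examples.

```python
import argparse


def parse_hyperparameters(argv=None):
    # One argument per hyperparameter; the defaults apply when a flag is omitted.
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--epochs', type=int, default=5)
    return parser.parse_args(argv)


# No flags: fall back to the defaults.
defaults = parse_hyperparameters([])

# Flags in the same shape as the command of a trial container.
trial_args = parse_hyperparameters(
    ['--lr', '0.00252', '--batch_size', '160', '--epochs', '15']
)
```

With type set on each argument, the float() and int() casts at the call site become optional, which is a matter of preference rather than a requirement.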

With that complete, the container image can be built and pushed to a registry. The Dockerfile we used is shown next; once the image is available, we can move on to the Katib experiment.

# syntax=docker/dockerfile:1

FROM python:3.8-slim-buster

WORKDIR /training

COPY training .

RUN pip3 install -r requirements.txt

ENTRYPOINT ["python3", "./training_pipeline.py"]

Code Segment 8. The Dockerfile Used to Containerize the Code Seen Above

Creating a Katib Experiment

The Katib UI lands on a home page listing all experiments that have been run so far.

Figure 1. The Katib Home Page

We can create a new experiment by clicking the button provided for that purpose on this page. This opens a form with many options for hyperparameter tuning.

Figure 2. The Tuning Options Provided by Katib

Let’s look at each option in detail.

Trial Thresholds

There are three options that may be configured here.

  1. parallelTrialCount. Each hyperparameter tuning experiment runs as a set of trials, and each trial is run in its own container. This option sets the maximum number of trials, and thus the maximum number of containers, that may run in parallel.
  2. maxTrialCount. The maximum number of trials that should run. This sets a limit on how many trials will be run, and how many containers will be spun up, but it is optional. If omitted, the experiment will run until the objective is reached (discussed in the subsection Objective) or the experiment reaches the maximum number of failed trials (discussed below).
  3. maxFailedTrialCount. The maximum number of trials allowed to fail. Katib recognizes trials with a status of Failed or MetricsUnavailable as Failed trials, and if the number of failed trials reaches maxFailedTrialCount, Katib stops the experiment with a status of Failed.
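
The interplay of these three options can be sketched as a small planning function. This is a simplified model of the behavior described above, with a hypothetical function name and return values of our own; it is not Katib's controller logic.

```python
def plan_next_step(running, succeeded, failed, parallel_trial_count,
                   max_trial_count=None, max_failed_trial_count=None):
    # Stop with a failure once the failed-trial limit is reached.
    if max_failed_trial_count is not None and failed >= max_failed_trial_count:
        return ('stop_failed', 0)

    started = running + succeeded + failed
    free_slots = max(parallel_trial_count - running, 0)

    # Without maxTrialCount, only the parallelism limit applies.
    if max_trial_count is None:
        return ('launch', free_slots)

    remaining = max(max_trial_count - started, 0)
    if remaining == 0 and running == 0:
        return ('stop_max_trials', 0)

    # Launch as many trials as both limits allow.
    return ('launch', min(free_slots, remaining))


# Example values only: 3 parallel trials, 12 trials at most, 3 failures allowed.
step_a = plan_next_step(1, 4, 0, 3, max_trial_count=12, max_failed_trial_count=3)
step_b = plan_next_step(0, 2, 3, 3, max_trial_count=12, max_failed_trial_count=3)
step_c = plan_next_step(0, 12, 0, 3, max_trial_count=12, max_failed_trial_count=3)
```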


Objective

In this section, we define options that detail the metric to be optimized during the trials, other metrics that must be recorded, whether the objective metric must be maximized or minimized, and other such information.

  1. objectiveMetricName. The name of the metric to be optimized. It should be the same as one of the metrics being logged in the training code. In our case, this was accuracy, as seen in Code Segment 6.
  2. additionalMetricNames. The names of other metrics to be reported. These must also match names being logged in the training code, and are optional.
  3. type. The optimization strategy for the objective metric: maximize or minimize.
  4. goal. The value at which the trials may stop, even if maxTrialCount trials have not run. For instance, the data scientist may be happy with a 92% accuracy and not wish to spin up more containers than necessary.
  5. metricStrategies. The default way to calculate the experiment’s objective is to compare all maximum metric values when type is maximize, and all minimum metric values when type is minimize. To change the default behavior, this option may be set, such as the following.
objectiveMetricName: accuracy
type: maximize
metricStrategies:
  - name: accuracy
    value: latest

With this setting, the latest accuracy value reported by each trial is compared with the latest values reported by the other trials, and the trial with the highest of these is declared the most successful. If metricStrategies is omitted, the strategy defaults to max when type is maximize, and to min when type is minimize.
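
The effect of the strategy can be illustrated with a short sketch. The function names and the trial data are made up for illustration; only the strategy values min, max, and latest come from Katib.

```python
def apply_metric_strategy(values, strategy):
    # Reduce the sequence of values a trial reported to a single number.
    if strategy == 'max':
        return max(values)
    if strategy == 'min':
        return min(values)
    if strategy == 'latest':
        return values[-1]
    raise ValueError(f'unknown strategy: {strategy}')


def best_trial(trials, objective_type, strategy=None):
    # Default strategy follows the objective type, as described above.
    if strategy is None:
        strategy = 'max' if objective_type == 'maximize' else 'min'
    scores = {name: apply_metric_strategy(values, strategy)
              for name, values in trials.items()}
    pick = max if objective_type == 'maximize' else min
    winner = pick(scores, key=scores.get)
    return winner, scores[winner]


# Accuracy values reported over time by two hypothetical trials.
trials = {
    'trial-a': [0.80, 0.91, 0.85],
    'trial-b': [0.70, 0.82, 0.88],
}

default_winner = best_trial(trials, 'maximize')
latest_winner = best_trial(trials, 'maximize', strategy='latest')
```

Here the default strategy favors trial-a for its peak of 0.91, while latest favors trial-b, whose final value is higher.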

Search Algorithm

The search algorithm determines how the search space of hyperparameters is explored. The options are different for each algorithm; the Random Search and Grid Search algorithms only have the option random_state, whereas the Bayesian Optimization algorithm has base_estimator, n_initial_points, acq_func, acq_optimizer, and random_state. Full details for all algorithms available can be found in the Katib documentation.
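
To show what an option such as random_state does, here is a minimal random search sketch written with the Python standard library. It is not Katib's implementation, and the search space below is a made-up example; the point is only that a fixed seed makes the suggested hyperparameters reproducible.

```python
import random


def random_search(search_space, n_trials, random_state=None):
    # A dedicated generator keeps the suggestions reproducible for a given seed.
    rng = random.Random(random_state)
    suggestions = []
    for _ in range(n_trials):
        suggestion = {}
        for name, (low, high) in search_space.items():
            if isinstance(low, int) and isinstance(high, int):
                # Integer ranges are sampled inclusively.
                suggestion[name] = rng.randint(low, high)
            else:
                suggestion[name] = rng.uniform(low, high)
        suggestions.append(suggestion)
    return suggestions


# Hypothetical ranges for the three hyperparameters of our model.
search_space = {
    'lr': (1e-4, 1e-2),
    'batch_size': (32, 256),
    'epochs': (5, 20),
}

first_run = random_search(search_space, n_trials=3, random_state=42)
second_run = random_search(search_space, n_trials=3, random_state=42)
```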

Early Stopping

Early stopping allows data scientists to avoid overfitting while training models during Katib experiments. It also saves resources and execution times by stopping trials when the target metrics are not improving before the completion of the training process.

With Katib, there is no need to make any changes to the training container code. Just configuring it in this form will suffice.

Early stopping uses the same output metrics used by the metrics collector. Logging the metrics with the timestamp is vital to early stopping, as it requires the sequence of metrics reported.

Two options need to be set for early stopping.

  1. algorithmName. The name of the algorithm desired for early stopping.
  2. algorithmSettings. The settings for said algorithm.

Currently, Katib only supports one early stopping algorithm: Median Stopping Rule. It stops a pending trial at step S if the trial’s best objective value by step S is worse than the median value of the running averages of all completed trials’ objectives reported up to step S. The settings for Median Stopping are as follows.

  1. min_trials_required. The minimum number of successful trials to be completed before stopping is considered.
  2. start_step. The number of intermediate results a trial must report before stopping is considered.

Any trial that is stopped early will have the status EarlyStopped instead of Successful.
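
The rule described above can be sketched in a few lines. This is our own simplified reading of the Median Stopping Rule, with a hypothetical function name, made-up data, and illustrative values for min_trials_required and start_step; it is not the code Katib runs.

```python
from statistics import median


def should_stop_early(trial_values, completed_trials, objective_type='maximize',
                      min_trials_required=3, start_step=4):
    step = len(trial_values)
    # Do nothing until enough results and enough completed trials exist.
    if step < start_step or len(completed_trials) < min_trials_required:
        return False

    # Running average of each completed trial's objective up to this step.
    running_averages = [sum(values[:step]) / len(values[:step])
                        for values in completed_trials if values]
    if not running_averages:
        return False
    threshold = median(running_averages)

    # Compare the trial's best value so far against the median.
    if objective_type == 'maximize':
        return max(trial_values) < threshold
    return min(trial_values) > threshold


# Accuracy per step for three hypothetical completed trials.
completed = [
    [0.60, 0.70, 0.80, 0.90],
    [0.50, 0.60, 0.70, 0.80],
    [0.70, 0.80, 0.85, 0.90],
]

weak_trial = should_stop_early([0.30, 0.40, 0.45, 0.50], completed)
strong_trial = should_stop_early([0.60, 0.70, 0.80, 0.85], completed)
too_early = should_stop_early([0.30, 0.40], completed)
```

In this example the median of the running averages is about 0.75, so the first trial, whose best accuracy is 0.50, would be stopped, while the second would continue.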


Hyperparameters

This section defines the range of the hyperparameters, or other parameters, that you want to tune for your machine learning (ML) model. These parameters define the search space for tuning. Here we define the name and distribution (discrete or continuous) of every hyperparameter that was added as an argument in our program. A minimum and maximum value, or a list of allowed values, may be provided for each hyperparameter. Katib generates hyperparameter combinations within these ranges based on the hyperparameter tuning algorithm chosen.
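
As an illustration, the search space for our three arguments could be written as follows, expressed here as a Python structure that mirrors the name, parameterType, and feasibleSpace fields of the experiment spec. The ranges are hypothetical examples, not the ones from our experiment, and validate_parameter is a helper of our own.

```python
import json

# Hypothetical search space; the values are strings, as they are in the spec.
parameters = [
    {
        "name": "lr",
        "parameterType": "double",
        "feasibleSpace": {"min": "0.0001", "max": "0.01"},
    },
    {
        "name": "batch_size",
        "parameterType": "int",
        "feasibleSpace": {"min": "32", "max": "256"},
    },
    {
        "name": "epochs",
        "parameterType": "discrete",
        "feasibleSpace": {"list": ["5", "10", "15", "20"]},
    },
]


def validate_parameter(parameter):
    # Numeric ranges need min < max; list-based spaces need at least one value.
    space = parameter["feasibleSpace"]
    if parameter["parameterType"] in ("double", "int"):
        return float(space["min"]) < float(space["max"])
    return len(space.get("list", [])) > 0


all_valid = all(validate_parameter(parameter) for parameter in parameters)
names = [parameter["name"] for parameter in parameters]
spec_json = json.dumps(parameters, indent=2)
```

The names used here are the ones that the reference fields of the trial template point to.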

Metrics Collector

A specification of how to collect the metrics from each trial, such as the accuracy and loss metrics. As mentioned in the subsection Logging Objective Metrics, any of five different Metrics Collectors can be specified here, and the training code needs to be adjusted accordingly.

Trial Template

This is the YAML template that defines the trial. The spec is defined in the following terms.

  1. trialTemplate.trialSpec. The unstructured template with model parameters, which are substituted from trialTemplate.trialParameters. For example, the container may receive hyperparameters as command-line arguments, as we did, or as environment variables.
  2. trialTemplate.primaryContainerName. The name of the training container.

The following represents the trial template we used for hyperparameter tuning.

trialTemplate:
  trialSpec:
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          serviceAccountName: default-editor
          containers:
            - command:
                - 'python3'
                - 'training_pipeline.py'
                - '--lr'
                - '${trialParameters.learningRate}'
                - '--batch_size'
                - '${trialParameters.batch_size}'
                - '--epochs'
                - '${trialParameters.epochs}'
              image: <repo_name>/<container_name>
              name: <container_name>
          restartPolicy: Never
  trialParameters:
    - name: learningRate
      reference: lr
    - name: batch_size
      reference: batch_size
    - name: epochs
      reference: epochs
  primaryContainerName: <container_name>
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#

Code Segment 9. The Trial Template
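
The substitution of trialParameters into the command can be pictured with the following sketch. The render_command helper is our own illustration of the idea and not Katib code; the assigned values are the ones a single trial might receive.

```python
def render_command(command_template, trial_parameters, assignments):
    # Replace each ${trialParameters.<name>} placeholder with the value
    # assigned to the hyperparameter that the trial parameter references.
    rendered = []
    for token in command_template:
        for parameter in trial_parameters:
            placeholder = '${trialParameters.' + parameter['name'] + '}'
            value = str(assignments[parameter['reference']])
            token = token.replace(placeholder, value)
        rendered.append(token)
    return rendered


command_template = [
    'python3', 'training_pipeline.py',
    '--lr', '${trialParameters.learningRate}',
    '--batch_size', '${trialParameters.batch_size}',
    '--epochs', '${trialParameters.epochs}',
]

trial_parameters = [
    {'name': 'learningRate', 'reference': 'lr'},
    {'name': 'batch_size', 'reference': 'batch_size'},
    {'name': 'epochs', 'reference': 'epochs'},
]

# Example values for one trial.
assignments = {'lr': 0.00252, 'batch_size': 160, 'epochs': 15}

rendered = render_command(command_template, trial_parameters, assignments)
```

Note that the name of a trial parameter (learningRate) need not match the hyperparameter it references (lr); the reference field is what ties the two together.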

The Results

Once the Katib experiment is all set up, all we have to do is click “Create,” and Katib starts an experiment for us! This takes us back to the Home Page, as seen in Figure 1, with a new row added for our new experiment. The number of successful, running, and failed trials will continue to get updated over time, until the experiment status reaches the completed state (denoted by a white-colored, green-background check mark). At this point, the experiment can be explored in more detail.

Figure 3. The Graph of the Results

The results open on a graph that contains the objective metric on the far left, followed by each of the hyperparameters that was tuned. Each line in the graph, colored differently from the other lines, starts at the objective metric value that a trial reached, and traverses to the corresponding hyperparameter values that were used in that trial. An overview of the results can be found at the bottom of the page.

Figure 4. The Overview of the Results

The results overview shows information about some of the settings that were set by the user, such as the goal, and the number of running, failed, and successful trials. In addition, it details the most successful trial (if any), by stating the values of the hyperparameters that led to the best value of the objective metric, as well as that best value. In this case, the trial that performed best got an accuracy value of 90.75%, using a learning rate of 2.52×10⁻³, a batch size of 160, and 15 epochs to train.

Finally, we can see the details of all the trials that were run, to compare them in tabular form, under the “Trials” tab.

Figure 5. The Tabular Comparison of Trials

This table displays the details of every trial run, as well as the value it resulted in for the objective metric, and the values of all hyperparameters that led there. Data scientists can use this information to input hyperparameters into their training pipelines, whether they be environment variables, arguments, or embedded in the code (in which case their pipeline will have to be built again).


Kubeflow provides a Kubernetes-native way of performing AutoML, the process of tuning hyperparameters or neural architectures in an automated way, using certain algorithms to optimally select the correct set of values of hyperparameters or the settings of neural architectures. We can now run Katib on the training code that we modified for this purpose, and use the resulting values in the pipeline that we have already created in our previous set of articles.

Next up in this series will be an article on model registries, a centralized method of storing models, using APIs and a UI, to collaboratively manage the full lifecycle of an ML model. We will see how to set one up, how to connect it with Kubeflow, how to connect our pipelines to the registry, and the results of using one.
