We are running a #Kubeflow series where we are sharing our experiences and thoughts on building a Kubeflow-based ML pipeline architecture for production use. This is the sixth post in the series.
In the last article, we discussed what a model registry is, what functionality it provides, and how to bring MLflow’s model registry to the Kubeflow UI. This time, we will talk about maintaining a model’s performance over time, and observing the signs that may indicate retraining is needed.
When a data science team has a machine learning workload deployed and running, one point they should keep in mind is that models deteriorate in performance over time. This could be due to a variety of factors: the training data no longer reflects the data arriving during later inference periods; the model is retrained, but retains “learning” from information that is no longer relevant; or the relationship between the input features and the target variable is no longer the same.
These are just some of the ways in which a model’s performance can deteriorate, but they illustrate the need for data scientists to keep track of the model’s performance and of the distributions of the various datasets used (the training set and each inference set). These phenomena are generally referred to as “model drift” and “data drift”: the former concerns the performance of the model, the latter the distributions of the features.
Kubeflow does not come with in-built capabilities for model and data observability. In this article, we explore Evidently, an open-source Python library that helps evaluate, test, and monitor the performance of ML models from validation to production, and how to integrate it with Kubeflow. We also look at producing dashboards from Evidently’s analyses to allow data scientists to visualize this information.
Evidently
Evidently helps evaluate and test data and ML model quality throughout the model lifecycle. It has three components: Reports, Tests, and Monitors. These interfaces cover alternative usage scenarios: from visual analysis to automated pipeline testing and real-time monitoring.
To use Evidently, the data scientist provides the data, chooses what is to be evaluated, and decides on the format of the output. In this article, we focus on Evidently’s Monitors, as they are the interface that allows us to create dashboards that are customizable and meaningful.
Evidently has Monitors for five dashboards, and we discuss each of them here.
Data Drift
Data drift is the change in the distribution of the input features over time. The Evidently data drift Monitor does not consider the relationships between features; it only compares the distribution of each feature at different points in time.
To calculate data drift, the data scientist needs to provide two datasets. The reference dataset is the one Evidently uses as the benchmark. The current dataset is analyzed for changes with respect to the reference. A common choice is to use the training set as the reference set, and an inference set as the current set.
The schema for the two datasets must be identical, and must contain all of the features over which drift is to be calculated. If the data provided to Evidently contains a target variable, drift will be calculated over it as well.
Finally, the data scientist can provide column mappings. Evidently tries to identify numerical and categorical columns automatically, but they can also be specified by the data scientist. If a column contains text data, that must be specified explicitly.
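Although this article focuses on Monitors, the same reference/current split and column mapping can be sketched with Evidently’s Report interface, which may help make the inputs concrete. The following is a minimal sketch only; module paths and preset names vary across Evidently versions, and the file names and feature names are placeholders.
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Placeholder file names: the training set as the reference, an inference batch as the current set.
reference_df = pd.read_csv("training.csv")
current_df = pd.read_csv("inference_batch.csv")

# Placeholder feature names; Evidently can also infer column types automatically.
column_mapping = ColumnMapping(
    target="target",
    prediction="prediction",
    categorical_features=["feature_a", "feature_b"],
    numerical_features=["feature_c"],
)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df, column_mapping=column_mapping)
report.save_html("data_drift_report.html")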
Classification Performance
As the name suggests, this Monitor is only applicable to classification tasks. It monitors the performance of a classification model in production, which can be used to decide whether, and how often, to retrain.
The classification performance Monitor can generate a report for a single dataset, or it can compare performance on one dataset to that on another. It works for binary and multi-class, probabilistic and non-probabilistic classification problems.
The Monitor provides the following information.
- Model quality summary. Evidently can calculate accuracy, precision, recall, F1-score, ROC AUC, and LogLoss.
- Class representation. It displays the number of objects of each class, to provide an idea of the distribution of data in the dataset.
- Class-wise quality metrics. Evidently calculates the same metrics on a per-class basis as it calculates on the model’s results as a whole.
Regression Performance
This Monitor is the regression equivalent of the one above. It can generate a report on a single dataset, or compare two; it displays plots related to performance; and it helps explore areas of under- and overestimation.
It displays the following information.
- Model quality summary. Regression performance is measured in mean error, mean absolute error, and mean absolute percentage error.
- Feature-wise error bias. This displays the mean value of each feature in three groups: OVER (the top 5% of predictions with overestimation), UNDER (the top 5% of predictions with underestimation), and MAJORITY (the remaining 90%).
Categorical and Numerical Target Drift
The target drift Monitors measure drift in the target and prediction variables. Each Monitor is for a different kind of target: categorical or numerical. Both Monitors require two datasets, a reference dataset and a current dataset.
Despite the naming, drift is calculated for both the target and the prediction variables, if available. The numerical target drift Monitor additionally calculates the correlation between the target and the prediction variables.
Creating Dashboards and Connecting Evidently to Them
The Evidently GitHub repository provides, under its examples, an integration that creates the five dashboards discussed in the section above. The dashboards are defined for Grafana. The example connects Prometheus to Evidently, with Prometheus pulling data from Evidently for Grafana to use in its dashboards.

With some changes to the code there, we can adapt it for our own datasets and workloads.
To begin, we can fork the Evidently GitHub repo. We can then navigate to evidently/examples/integrations/grafana_monitoring_service, which is the working directory for this integration. Within this directory, we can see the dashboards defined under the dashboards/ directory, in JSON form for Grafana. Under the config/ directory is configuration for the way Grafana, Prometheus, and Evidently are to work in concert to bring the dashboards to life.
The repo also defines a Dockerfile, a docker-compose.yml, a requirements.txt, and a run_example.py. The latter runs the Docker image.
The only directory in which we need to work is the metrics_app/ directory. It contains its own requirements.txt, a config.yaml for Evidently, and an app.py that runs the Evidently service.
Let us walk through the changes we must make to add our log dataset use-case to the service.
Changing requirements.txt
Recall that some Evidently dashboards require a reference dataset to compare newer data to. For our use-case, the reference dataset is the training set. Since we have stored the training set in an Amazon S3 bucket, the Evidently service needs a way to access it. For this, we add boto3 to the requirements.
dataclasses==0.6
Flask~=2.0.1
pandas~=1.1.5
Werkzeug~=2.0.1
requests~=2.26.0
prometheus_client~=0.11.0
pyyaml~=5.4.1
boto3
Code Segment 1. The New requirements.txt for Evidently
Changing config.yaml
This file is used to tell Evidently what datasets to expect, what their schemas are, and what Monitors to use on the data. It also specifies the parameters under which the service is to run, which we will discuss in detail.
The first section of the config.yaml is the datasets section. Under this section, we define the name of the dataset, and then the column mapping, the data format, and the Monitors to use.
datasets:
  log_1_layer_ann:
    column_mapping:
      categorical_features: [ 'sequence', 'variables', 'danger' ]
      numerical_features: [ ]
      datetime: 'timestamp'
      target: 'target'
      prediction: 'prediction'
    data_format:
      header: true
      separator: ','
      date_column: 'timestamp'
    monitors:
      - data_drift
      - classification_performance
      - cat_target_drift
Code Segment 2. The datasets Section
In our dataset, we have three features: sequence, variables, and danger. This is a simplified form of an actual dataset that might be used with more complex models to solve this problem. sequence refers to the order of log templates that have arrived prior to and including the line in question. A log template is a representation of multiple logs, created by keeping the fixed text in the line and abstracting the variable text with a wildcard.
For instance, consider the following two log lines.
21984 2023-01-09T00:34:13. 168205-07 INFO: config_db: scan_md_log: took 1.892 seconds
21984 2023-01-09T00:35:15. 168205-07 INFO: config_db: scan_md_log: took 0.998 seconds
This results in the following log template: 21984 2023-01-09T00:*:*. 168205-07 INFO: config_db: scan_md_log: took * seconds. The variables column for each of the log lines is then [34, 13, 1.892] and [35, 15, 0.998] respectively.
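As a toy illustration of how the variables can be recovered once a template is known, the sketch below matches a line against its template. It is purely hypothetical; real log-template mining is considerably more involved than this regex-based helper.
import re

def extract_variables(line, template):
    # Turn the template into a regex: escape it, then let each '*' wildcard
    # capture the variable text at that position.
    pattern = re.escape(template).replace(r"\*", "(.+?)")
    match = re.fullmatch(pattern, line)
    return list(match.groups()) if match else None

template = "21984 2023-01-09T00:*:*. 168205-07 INFO: config_db: scan_md_log: took * seconds"
line = "21984 2023-01-09T00:34:13. 168205-07 INFO: config_db: scan_md_log: took 1.892 seconds"
print(extract_variables(line, template))  # ['34', '13', '1.892']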
The danger column is a categorical input that represents whether a danger word exists within the log line, and what level of danger the word represents. Common danger words are “error,” “exception,” and “terminated,” but the list can be extensive. Values range from 0 for no danger to 4 for critical-level danger.
With some pre-processing, we provide all this data in such a way that it all comes out to be categorical information, which is why we provide all three features as categorical features. In this data, we decided not to provide any numerical information. We have a timestamp column that represents the time at which the log line was encountered (and is thus of the datetime type), a target column with a binary value representing whether the line is anomalous or not, and a prediction column representing the model’s corresponding prediction.
The CSV file containing our training set has a header row and uses the comma (,) as a separator. We also reiterate that the timestamp column represents a date and not an input feature.
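To make the schema concrete, the following is a small, purely hypothetical sample of what rows of this CSV might look like once loaded; the values are illustrative only.
import pandas as pd

# Hypothetical rows matching the schema described above.
sample = pd.DataFrame([
    {
        "timestamp": "2023-01-09T00:34:13",  # datetime column
        "sequence": "[17, 3, 42]",           # categorical: order of log templates seen so far
        "variables": "[34, 13, 1.892]",      # categorical: variable text extracted from the template
        "danger": 2,                         # categorical: 0 (no danger word) to 4 (critical)
        "target": 0,                         # binary: is the line anomalous?
        "prediction": 0,                     # the model's corresponding prediction
    },
])
print(sample.head())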
Finally, for this dataset, we only require the data drift, classification performance, and categorical target drift dashboards, because this is a classification task with a categorical target. All of this can be seen in Code Segment 2.
We next define the parameters for the Evidently service.
service:
  calculation_period_sec: 2
  min_reference_size: 30
  moving_reference: false
  datasets_path: datasets
  use_reference: true
  window_size: 5
  source: s3
  bucket: <bucket_name>
Code Segment 3. The Service-Definition
Evidently provides the following service-definition parameters.
- reference_path. The path to the reference dataset. Since we are taking our reference dataset from S3, we leave this out of our definition.
- use_reference. Defines whether to use the provided reference dataset (true) or to collect the reference from the production data stream (false). As of writing this article, only the true option is available.
- min_reference_size. The minimal number of objects in the reference dataset needed to start calculating the monitoring metrics. If the reference dataset is provided externally but has fewer objects than defined, the required number of objects will be collected from the production data and added to the reference.
- moving_reference. Defines whether the reference must be moved in time during metrics calculation. As of writing this article, only the false option is available.
- window_size. The number of new objects in the current production data stream required to calculate the new monitoring metrics.
- calculation_period_sec. How often the monitoring service should check for new values of the monitoring metrics.
- monitors. The list of Monitors to use. This defines the default list; if the Monitors differ for each dataset, they must be provided under each dataset’s definition.
We added two new parameters because we are using S3 for our reference dataset. We will see where they come into the picture in the app.py.
- source. Optional; notifies the service that reference data is coming from S3.
- bucket. Optional; the name of the bucket from which reference data is to be downloaded.
The following is the full config.yaml file.
datasets:
  log_1_layer_ann:
    column_mapping:
      categorical_features: [ 'sequence', 'variables', 'danger' ]
      numerical_features: [ ]
      datetime: 'timestamp'
      target: 'target'
      prediction: 'prediction'
    data_format:
      header: true
      separator: ','
      date_column: 'timestamp'
    monitors:
      - data_drift
      - classification_performance
      - cat_target_drift
service:
  calculation_period_sec: 2
  min_reference_size: 30
  moving_reference: false
  datasets_path: datasets
  use_reference: true
  window_size: 5
  source: s3
  bucket: <bucket_name>
Code Segment 4. The Full config.yaml
Changing app.py
Most of the app.py file provided by Evidently will work as intended. In fact, there are only three places in which changes are required.
The Imports
We add two new import statements for the rest of the added code to work.
import pathlib
import boto3
Code Segment 5. The New Imports
The New Service Parameters
The existing service-definition parameters, the ones used under the service section in config.yaml, are defined in the MonitoringServiceOptions class. We therefore add our two new service parameters to the class.
@dataclasses.dataclass
class MonitoringServiceOptions:
    datasets_path: str
    min_reference_size: int
    use_reference: bool
    moving_reference: bool
    window_size: int
    calculation_period_sec: int
    source: str
    bucket: str
Code Segment 6. The New MonitoringServiceOptions Class
Adding the S3-Download Option for the Reference Dataset
Under the @app.before_first_request decorator, the function configure_service() performs all of the steps required for the Evidently service to run. This includes loading the configuration, loading the reference dataset, and instantiating the service.
After instantiating the MonitoringServiceOptions and DataLoader data-classes, it attempts to load the datasets defined in the configuration. Here, we utilize the new service-definition parameters created for this purpose.
if options.source.lower() == 's3':
    s3 = boto3.resource('s3')

datasets = {}
for dataset_name in config['datasets']:
    logging.info("Load reference data for dataset %s", dataset_name)
    reference_path = os.path.join(datasets_path, dataset_name, "training.csv")
    if dataset_name in config["datasets"]:
        dataset_config = config["datasets"][dataset_name]
        if options.source.lower() == 's3':
            logging.info('S3 reference set detected')
            bucket = options.bucket
            key = '/'.join(reference_path.split('/')[2:])
            logging.info(f'Downloading file s3://{bucket}/{key} from S3')
            pathlib.Path(reference_path).parent.mkdir(parents=True, exist_ok=True)
            s3.Bucket(bucket).download_file(key, reference_path)
            logging.info(f'File s3://{bucket}/{key} downloaded from S3 to {reference_path}')
Code Segment 7. Downloading the Reference Set from S3
Note that the subsequent lines of code assume that the data is present at reference_path, so when the source option is defined as 's3', we simply download the reference dataset to that location before proceeding.
That concludes the changes required to be made to the repository! We can now follow the steps to run the services.
Running the Services
- Install Docker in the cluster.
- Create a Python virtual environment and activate it.
pip install virtualenv
virtualenv venv
source venv/bin/activate
- Clone our forked repo of the Evidently example.
- Install the dependencies.
cd evidently/examples/integrations/grafana_monitoring_service/
pip install -r requirements.txt
- Run the Docker image.
./run_example.py
This will leave the following services running:
- Evidently monitoring service at port 8085;
- Prometheus at port 9090;
- Grafana at port 3000; and
- a script that sends the production data from the test datasets to the Evidently monitoring service, row by row.
- Explore the dashboards.
The Grafana dashboards are accessible at port 3000, which can be exposed. Once logged in, the dashboards can be found under the General tab.
Connecting a Pipeline to the Service
Now that the service is configured, up, and running, we can add a Component to our inference Pipelines to send the inference data and model prediction information to the Evidently service. To achieve this, we send the data as a CSV artifact and the predictions as a JSON artifact to the Component. Then, we preprocess them in the required way so as to have them in the right format for Evidently.
df = pd.read_csv(csv_path)
with open(preds_json) as f:
    preds = json.load(f)

# Preprocess the dataset.
df['sequence'] = df['sequence'].replace('[]', np.nan).copy()
mask = ~(df['sequence'].isna())
reqd_columns = ['timestamp', 'danger', 'variables', 'sequence', 'target']
df = df.loc[mask, reqd_columns].reset_index(drop=True)
df['prediction'] = preds
Code Segment 8. Using the Input Artifacts
Next, we define a numpy-encoder over the default JSON-encoder, so that all data can be JSON-ified properly.
class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.void):
            return None
        if isinstance(obj, (np.generic, np.bool_)):
            return obj.item()
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)
Code Segment 9. The numpy-Encoder
Then, we define a function, send_data, that performs the required POST request for us.
def send_data(data) -> None:
    try:
        response = requests.post(
            "<host>:8085/iterate/log_1_layer_ann",
            data=json.dumps(data, cls=NumpyEncoder),
            headers={"content-type": "application/json"},
        )
        if response.status_code == 200:
            print("Success.")
        else:
            print(
                f"Got an error code {response.status_code} for the data chunk. "
                f"Reason: {response.reason}, error text: {response.text}"
            )
    except requests.exceptions.ConnectionError as error:
        print(f"Cannot reach a metrics application, error: {error}, data: {data}")
Code Segment 10. The send_data Function
Note that the URL to which the POST request must be sent will change based on the name of the dataset defined in Evidently’s config.yaml.
Finally, we call the function with the data as an argument.
data = df.to_dict(orient='records')
send_data(data)
Code Segment 11. Calling the send_data Function
This Component will send the data in the correct format to the Evidently service, where the relevant monitors will act upon the data and have results ready for Prometheus to pull from it. Grafana pulls the results from Prometheus and displays them on the relevant dashboards.
def writing_monitoring_info(csv_path: comp.InputPath('CSV'), preds_json: comp.InputPath()):
    # Imports required for the Pipeline Component.
    import pandas as pd
    import numpy as np
    import requests
    import json

    # Read from the artifact CSV.
    df = pd.read_csv(csv_path)
    with open(preds_json) as f:
        preds = json.load(f)

    # Preprocess the dataset.
    df['sequence'] = df['sequence'].replace('[]', np.nan).copy()
    mask = ~(df['sequence'].isna())
    reqd_columns = ['timestamp', 'danger', 'variables', 'sequence', 'target']
    df = df.loc[mask, reqd_columns].reset_index(drop=True)
    df['prediction'] = preds

    class NumpyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.void):
                return None
            if isinstance(obj, (np.generic, np.bool_)):
                return obj.item()
            if isinstance(obj, np.ndarray):
                return obj.tolist()
            return json.JSONEncoder.default(self, obj)

    def send_data(data) -> None:
        try:
            response = requests.post(
                "<host>:8085/iterate/log_1_layer_ann",
                data=json.dumps(data, cls=NumpyEncoder),
                headers={"content-type": "application/json"},
            )
            if response.status_code == 200:
                print("Success.")
            else:
                print(
                    f"Got an error code {response.status_code} for the data chunk. "
                    f"Reason: {response.reason}, error text: {response.text}"
                )
        except requests.exceptions.ConnectionError as error:
            print(f"Cannot reach a metrics application, error: {error}, data: {data}")

    data = df.to_dict(orient='records')
    send_data(data)
Code Segment 12. The Entire Monitoring Component
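To wire this function into an inference Pipeline, it can be turned into a Kubeflow Pipelines Component. The following is a minimal sketch assuming the KFP v1 SDK; the base image and package list are placeholders, and the upstream inference step that produces the two artifacts is not shown.
import kfp.components as comp

# Build a reusable Component from the function above; pin image and package versions as appropriate.
monitoring_op = comp.create_component_from_func(
    writing_monitoring_info,
    base_image="python:3.8",
    packages_to_install=["pandas", "numpy", "requests"],
)
The resulting monitoring_op can then be appended as the last step of the inference Pipeline, with the CSV and predictions artifacts produced by the inference Component passed in as its inputs.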
Conclusion
Production observability is a key factor in any workload, and ML workloads are no different. Given that the models are so heavily dependent on the data they are fed, we find ways to measure drifts in both the data and the model. We also look at the performance of the model over time. All of this can be achieved with Evidently, which is easily linked with Kubeflow.
Through a Prometheus–Grafana pipeline, the observability is brought to dashboards for easy viewing, and the streaming nature of Evidently’s Monitors means that these dashboards can be used over time with ease.
This brings us to the end of the Kubeflow series of articles. Hopefully, they were interesting and informative, and helped bring your workload to Kubeflow in a seamless manner!