Hi, I am Guneet Singh, a recent Computer Science graduate. I participated in GSoC 2022 as part of the Geant4 fast simulation group, building an end-to-end Kubeflow Pipeline for training a machine-learning-based model for fast shower simulation. The project was completed over the summer, and its outcomes are highlighted in the next section.
The project’s objective is to use Kubeflow to develop a scalable ML pipeline for the ML FastSimulation in Geant4. Training produces an optimised, tuned generative model which is later used to perform inference in Geant4. The motivation behind using Kubeflow ML pipelines is as follows:
Kubeflow is a free, open-source machine learning platform that makes it possible for machine learning pipelines to orchestrate complicated workflows running on Kubernetes.
In Large Hadron Collider (LHC) experiments at CERN in Geneva, the calorimeter is a crucial detector technology to measure the energy of particles. These particles interact electromagnetically and/or hadronically with the material of the calorimeter, creating cascades of secondary particles or showers. Describing the showering process relies on simulation methods describing all particle interactions with matter. A detailed and accurate simulation is based on the Geant4 toolkit. Constrained by the need for precision, the simulation is inherently slow and constitutes a bottleneck for physics analysis. Furthermore, with the upcoming high luminosity upgrade of the LHC with more complex events and a much-increased trigger rate, the amount of required simulated events will increase. Machine Learning (ML) techniques such as generative modeling are used as fast simulation alternatives to learn to generate showers in a calorimeter, i.e., simulating the calorimeter response to certain particles. The pipeline of a fast simulation solution can be categorized into five components: data preprocessing, ML model design, validation, inference, and optimization. The preprocessing module allows us to derive a suitable representation of showers and to perform data cleaning, scaling, and encoding. The preprocessed data is then used by the generative model for training. To search for the best set of hyperparameters of the model, techniques such as Automatic Machine Learning (AutoML) are used. The validation component is based on comparing different ML metrics and physics quantities between the input and generated data. The aim of this project is to optimize the ML pipeline of the fast simulation approach using the open-source platform Kubeflow. You can check further details here.
The ML FastSim in Geant4 components
The ML FastSimulation (Training) in Geant4 can be broken down into the following functional components:
ML FastSim project
Before beginning the discussion of Kubeflow component creation, it is important to look at the Python code base, which will later be reformatted according to the Kubeflow pipeline generation requirements:
|---convert.py
|---generate.py
|---README.md
|---requirements.txt
|---setup.py
|---train.py
|---validate.py
|
+---core
|-------constants.py
|-------model.py
|
+---utils
|-------observables.py
|-------preprocess.py
The Full codebase can be found here.
Refactored ML FastSim project into Kubeflow Pipeline
|---main.py
|---configuration.py
|---README.md
|---generate_yaml.py
|---Katib_yaml
|
+---pipeline_components
|-------generate.py
|-------input_parameters.py
|-------preprocess.py
|-------validate.py
|-------model_parameters.py
|-------Katib_setup.py
|
+---training_docker
|-------Dockerfile
|-------krb5.conf
|-------main.py
To design any pipeline, the following steps are essential:
***
- /pipeline_components/input_parameters defines the variables that are going to be used throughout the pipeline. The input_parameters component is initialised using a configuration.py file, which you can edit to control your workflow.
- Katib_setup focuses on the integration of Katib hyperparameter tuning into our pipeline.
- generate was configured to load the saved model from EOS and produce the shower generation for the number of events specified by the user.

This section aims to showcase how the Kubeflow Pipeline is created by refactoring the plain Python code into the Kubeflow component format. All the examples demonstrate different use cases which are commonly required in any ML workflow. In reference to the discussion in Kubeflow Pipeline Preparation, the upcoming points will help in grasping those suggestions and understanding the blockers usually faced and how to solve them.
### Identifying the First Component of the Pipeline
The first component of the pipeline is Input Parameters, which was discussed in the previous section. The Preprocess Python function has been implemented here. The Preprocess Kubeflow function focuses on how the transfer of variables between components takes place; check here.
In Kubeflow we cannot transfer arrays, lists, dictionaries, dataframes, etc. the same way we pass str, int, bool or float values.
Each Kubeflow component lives and executes in a different container.
To establish the connection between components, we use persistent storage (EOS) to hold large data and pass its location path from one component to another.
For example, the model saved by the Model_Setup component has to be loaded in the Generate component, which is done by exchanging its EOS path.
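Here is a minimal sketch of this pattern; it is not the project's actual code, and the function, file and directory names are hypothetical. The large array is written to the mounted EOS area and only the path string is returned to the pipeline.

def preprocess(input_path: str, output_dir: str) -> str:
    """Hypothetical lightweight component: reads raw showers, scales them and
    writes the result back to EOS, returning only the output path."""
    import os
    import numpy as np

    raw = np.load(input_path)                   # large array stays on EOS, not in pipeline metadata
    scaled = raw / (raw.max() + 1e-12)          # placeholder for the real scaling/encoding
    out_path = os.path.join(output_dir, "preprocessed.npy")
    np.save(out_path, scaled)
    return out_path                             # only this small string is passed to the next component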
The Python functions formatted according to the Kubeflow requirements become components by using the kfp.components package, which contains built-in functions to convert Python functions into components and store them in YAML format. To generate the YAML files of all the components, check my repo here.
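As a sketch (assuming the preprocess function from the previous example and illustrative image and file names), a function is converted and written out as YAML like this:

import kfp.components as comp

# Turn the plain Python function into a reusable Kubeflow component and save it as YAML
preprocess_op = comp.create_component_from_func(
    preprocess,                                         # the Python function defined earlier
    output_component_file="preprocess_component.yaml",  # illustrative output file name
    base_image="python:3.8",                            # or your custom GitLab registry image
    packages_to_install=["numpy"],                      # extra packages installed at runtime
)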
The following file shows how the Kubeflow components are brought together and connected into a single pipeline. The kfp.dsl package provides the functions for component connections and pipeline formulation.
In the ml_pipeline_first function the components are stitched together logically.
Observe the passing of arguments from one component to another, which establishes the links among the components and defines the workflow.
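A minimal sketch of such a pipeline definition, assuming the YAML component files generated above and illustrative parameter names, could look as follows:

import kfp
from kfp import dsl
import kfp.components as comp

# Load the component definitions generated earlier (file names are illustrative)
preprocess_op = comp.load_component_from_file("preprocess_component.yaml")
generate_op = comp.load_component_from_file("generate_component.yaml")

@dsl.pipeline(name="ml-fastsim-pipeline", description="Sketch of wiring the components together")
def ml_pipeline_first(input_path: str, output_dir: str, num_events: int):
    # Passing preprocess_task.output into generate_op is what links the two components
    preprocess_task = preprocess_op(input_path=input_path, output_dir=output_dir)
    generate_task = generate_op(data_path=preprocess_task.output, num_events=num_events)

if __name__ == "__main__":
    # Compile into a definition that can be uploaded to or run on the Kubeflow cluster
    kfp.compiler.Compiler().compile(ml_pipeline_first, "ml_pipeline_first.yaml")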
A specific methodology needs to be followed while creating your Docker image. The following steps discuss its creation:
Step 1: $ docker login gitlab-registry.cern.ch
Step 2: Go to this link and download the folder. The Dockerfile and requirements.txt found there define the base image over which we will add our own additional requirements.
Step 3: If unable to log in in Step 1, run the following first and then enter the login credentials again:
$ sudo chmod 666 /var/run/docker.sock
Step 4: Update the requirements.txt file according to the needs of the project and list the libraries to be installed using pip.
Step 5: Create a custom Dockerfile with the following content:
# Select a base image from which to extend
FROM <SPECIFY YOUR BASE IMAGE>
# or: FROM custom_public_registry/username/image
USER root
# Install required packages
COPY custom_requirements.txt /requirements.txt
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys FEEA9169307EA071 8B57C5C2836F4BEB && apt-get -qq update && pip3 install -r /requirements.txt
USER jovyan
# The following line is mandatory:
CMD ["sh", "-c", \
"jupyter lab --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser \
--allow-root --port=8888 --LabApp.token='' --LabApp.password='' \
--LabApp.allow_origin='*' --LabApp.base_url=${NB_PREFIX}"]
Step 6: $ docker build . -f <Base_Dockerfile_Name> -t <your_alias>
Step 7: $ docker build . -f <Custom_Dockerfile_name> -t gitlab-registry.cern.ch/<repo_name>/<container_name>:<tag_name>
Step 8: $ docker push gitlab-registry.cern.ch/<repo_name>/<container_name>:<tag_name>
Step 9: Once you have pushed the image to the GitLab registry, it is easily accessible to the containers. My images can be found here.
Step 1: Open a Terminal and enter kinit <CERN-USER-ID>
Step 2: Delete the existing Kerberos secret:
kubectl delete secret krb-secret
Step 3: Create a new generic Kerberos secret:
kubectl create secret generic krb-secret --from-file=/tmp/krb5cc_1000
Step 4: Configure EOS in the pipeline code by mounting Kerberos and EOS into the Kubeflow environment:
from kubernetes import client as k8s_client

# EOS volume, mounted at /eos inside each component container
eos_host_path = k8s_client.V1HostPathVolumeSource(path='/var/eos')
eos_volume = k8s_client.V1Volume(name='eos', host_path=eos_host_path)
eos_volume_mount = k8s_client.V1VolumeMount(name=eos_volume.name, mount_path='/eos')

# Kerberos credential cache provided through the krb-secret created above
krb_secret = k8s_client.V1SecretVolumeSource(secret_name='krb-secret')
krb_secret_volume = k8s_client.V1Volume(name='krb-secret-vol', secret=krb_secret)
krb_secret_volume_mount = k8s_client.V1VolumeMount(name=krb_secret_volume.name, mount_path='/secret/krb-secret-vol')
Step 5: To add the volumes so that EOS is accessible from each component, we append the following to each of the function components created using the kfp SDK:
.add_volume(krb_secret_volume) \
.add_volume_mount(krb_secret_volume_mount) \
.add_volume(eos_volume) \
.add_volume_mount(eos_volume_mount)
Step 6: Once the above setup completes, we can access publicly visible files from EOS.
The original model was trained on a machine with 240 GB of RAM. It was therefore important to refactor the code to handle ~175 GB of data in 8 GB of RAM through batch loading, progressive training and appropriate cache management. A detailed discussion is available here.
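A minimal sketch of the batch loading idea is shown below; it is not the project's actual loader, and the per-file layout and the autoencoder-style target are assumptions. Only the shower files needed for the current batch are read from the mounted /eos path.

import math
import numpy as np
import tensorflow as tf

class ShowerSequence(tf.keras.utils.Sequence):
    """Loads one batch of shower files at a time instead of the full ~175 GB dataset."""
    def __init__(self, file_list, batch_size):
        self.file_list = file_list        # e.g. one .npy file per shower chunk on /eos (assumption)
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.file_list) / self.batch_size)

    def __getitem__(self, idx):
        files = self.file_list[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch = np.stack([np.load(f) for f in files])    # only this batch is held in memory
        return batch, batch                              # input == target for a (variational) autoencoder

# Usage sketch: model.fit(ShowerSequence(shower_files, batch_size=64), epochs=...)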
Katib is a hyperparameter tuning framework that comes with Kubeflow. It provides scalability through the Kubernetes environment, as it can run multiple trials in parallel. Katib hosts various powerful algorithms that can be added to our workflow, such as NAS, Bayesian optimisation, grid search, etc. The Katib experiments run in parallel on GPUs, and their throughput strongly depends on the resources allocated to your namespace.
For an in-depth understanding of the Katib YAML, visit the official documentation.
Dockerizing the components is an essential step in Kubeflow Pipeline construction. It helps in setting up different environments and resources for each component of a pipeline. It is also necessary if we want Katib to attach to our model training setup, since Katib runs this image multiple times in parallel. The steps to create a Docker image can be found in the Katib README here.
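As an illustration, the training entry point baked into that image can expose the tuned hyperparameters as command-line flags so Katib can pass different values to each trial (a sketch; everything beyond --lr and --batch_size is an assumption):

import argparse

# Parse the hyperparameters that each Katib trial passes on the command line
parser = argparse.ArgumentParser(description="Training entry point run by each Katib trial")
parser.add_argument("--lr", type=float, default=1e-3, help="learning rate suggested by Katib")
parser.add_argument("--batch_size", type=int, default=64, help="batch size suggested by Katib")
args = parser.parse_args()

# ... build and train the model with args.lr and args.batch_size, then print/log the
# objective metric in the format expected by the configured Katib metrics collector.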
--lr and --batch_size are being tuned through Katib. The Katib results look as follows:
The Kubeflow Dashboard also provides a Tabular presentation of experiment details:
The following steps provide a guided workflow through which you can import this project into your Kubeflow namespace and run the experiments.
STEP 1: Go to ml.cern.ch and log in to the Kubeflow Dashboard
STEP 2: Go to the Notebooks tab on the side panel and create a working space
STEP 3: Confirm the allocated resources and create the workspace with kf-14-tensorflow-jupyter:v1
STEP 4: Create a folder from the sidebar
STEP 5: Once inside the folder, open a Terminal. Before Step 6, create your Kerberos secret to access the EOS storage space from inside the pipeline. The commands are to be entered in the Terminal as follows:
1) kinit <CERN-USER-ID>
2) kubectl delete secret krb-secret
3) kubectl create secret generic krb-secret --from-file=/tmp/krb5cc_1000
STEP 6: Run !git clone <repo name> in a notebook cell
STEP 7: Change the parameter values in configuration.py to adjust them to your experiment setup
STEP 8: Run !python3 generate_yaml.py in the next notebook cell. This step creates a YAML file for each Python component that will be part of the Kubeflow Pipeline.
STEP 9: Run !python3 main.py --namespace <Specify your namespace name> --pipeline_name <Specify your pipeline name> in a notebook cell (a sketch of what such a submission script can do is given after this list)
STEP 10: To check the results, open the Runs tab to see the final pipeline graphs and the AutoML tab to access the Katib hyperparameter tuning.
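For reference, here is a minimal, hypothetical sketch of what a submission script such as main.py can do with the kfp SDK: start a run of the pipeline function in the given namespace (the actual arguments and behaviour of the project's main.py may differ).

import argparse
import kfp

from pipeline_definition import ml_pipeline_first   # hypothetical module containing the pipeline function

parser = argparse.ArgumentParser()
parser.add_argument("--namespace", required=True, help="your Kubeflow namespace")
parser.add_argument("--pipeline_name", required=True, help="name shown in the Runs tab")
args = parser.parse_args()

client = kfp.Client()  # inside an ml.cern.ch notebook the in-cluster configuration is picked up automatically
client.create_run_from_pipeline_func(
    ml_pipeline_first,
    arguments={},                  # fill with the values taken from configuration.py
    run_name=args.pipeline_name,
    namespace=args.namespace,
)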