Background: In my first few months of working in machine learning, the biggest challenge I faced was training a model to reach a required accuracy. To get there, I needed to train the model again and again, changing the number of epochs, adding layers, and so on. I felt that this would be a challenge for almost every machine learning / deep learning engineer, especially because training a model can take up all of your system resources. I believed that on a less capable system, diverting all resources to training could cause serious overheating issues and performance lag.
Goals: Simplify the tasks that machine learning engineers typically face while training bigger models.
Solution & Results:
The project that I built is an Accuracy Achiever. With the help of Jenkins, I built a setup in which I only need to commit my code to GitHub from my development environment, and Jenkins does everything else with just one click. The code is written so that it automatically modifies the code of my training model, including when that model runs inside Docker containers.
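The retrain-and-adjust cycle that this setup automates can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `train_until_target`, `train_fn`, and the `max_rounds` cap are hypothetical names, and `train_fn` stands in for whatever script builds and trains the model with a given number of layers and epochs.

```python
from typing import Callable, Tuple


def train_until_target(
    train_fn: Callable[[int, int], float],
    target: float,
    start_layers: int = 1,
    epochs: int = 5,
    max_rounds: int = 5,
) -> Tuple[int, float]:
    """Retrain with one more layer per round until accuracy reaches target.

    train_fn(layers, epochs) is an assumed callback that trains the model
    and returns its accuracy as a fraction between 0 and 1.
    """
    layers = start_layers
    accuracy = train_fn(layers, epochs)
    rounds = 1
    # Stop when the target is met or the round budget is used up,
    # so a model that never converges cannot loop forever.
    while accuracy < target and rounds < max_rounds:
        layers += 1
        accuracy = train_fn(layers, epochs)
        rounds += 1
    return layers, accuracy
```

In a Jenkins job, a script like this would be the step that runs after each commit; the cap on rounds is a design choice added here so that a failing model does not occupy the build agent indefinitely.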
Jenkins's ability to automate everything makes it possible to get my work done in just a few steps. Also, it opened a new door in the world of Data Science: machine learning operations, aka MLOps.
I tried to achieve "containerization within a container" by using some of the Docker commands, but before launching another container inside a container, I needed to install and configure Docker inside that container. I could do this manually by using docker exec to run the Docker installation commands, but I automated it with Jenkins.
I used Docker containers to train my model separately, so that the setup could be reused anywhere and training wouldn't exhaust my system resources. I used Jenkins to integrate different tools like Git, Docker, and my local development environment to create something that any developer can use, whether a company employee or a beginner in the ML domain.
Also with the help of Jenkins, I built a monitoring system in the same project that continuously checks whether the models training in containers are still running or have failed. If a container has failed, the system automatically launches another one and sends an email to notify the developer.
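As a rough sketch of the docker exec approach mentioned above, the installation steps could be driven from a small script that a Jenkins job calls. The function names and the step list are assumptions for illustration: the steps shown assume a Debian- or Ubuntu-based image, and the original setup may have used a different base image and different commands.

```python
import subprocess
from typing import List

# Assumed install steps for a Debian/Ubuntu-based container image.
DEBIAN_STEPS = ["apt-get update", "apt-get install -y docker.io"]


def exec_argv(container: str, shell_command: str) -> List[str]:
    """Build the argument list for running one shell command in a container."""
    return ["docker", "exec", container, "sh", "-c", shell_command]


def install_docker(container: str, steps: List[str]) -> None:
    """Run each install step inside the container, stopping on the first failure."""
    for step in steps:
        subprocess.run(exec_argv(container, step), check=True)
```

Calling `install_docker("trainer", DEBIAN_STEPS)` would run both steps inside a container named `trainer` (a hypothetical name). Note that a Docker daemon inside a container generally also needs the outer container to be started in privileged mode or with the host's Docker socket mounted.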
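One pass of such a monitoring check might look like the sketch below, which a Jenkins job could run on a schedule. It is an illustration under assumptions, not the project's actual script: the function names, the sender address, and the local SMTP host are all hypothetical, and the original used Jenkins's email plugin rather than Python's `smtplib`.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def is_running(name: str) -> bool:
    """Return True if Docker reports the named container as running."""
    result = subprocess.run(
        ["docker", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def relaunch(name: str, image: str) -> None:
    """Remove any stopped container with this name and start a fresh one."""
    subprocess.run(["docker", "rm", "-f", name], check=False)
    subprocess.run(["docker", "run", "-d", "--name", name, image], check=True)


def notify(to_addr: str, subject: str, body: str) -> None:
    """Send a plain-text email through an assumed local SMTP server."""
    msg = EmailMessage()
    msg["From"] = "jenkins@example.com"  # hypothetical sender address
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def check_once(name, image, to_addr,
               is_running_fn=is_running, relaunch_fn=relaunch, notify_fn=notify):
    """Relaunch the container and email the developer if it is not running.

    Returns True if a relaunch happened, False if the container was healthy.
    """
    if is_running_fn(name):
        return False
    relaunch_fn(name, image)
    notify_fn(to_addr, f"Training container {name} was relaunched",
              f"Container {name} was not running and was restarted from {image}.")
    return True
```

The three helpers are passed in as parameters so that the check itself can be exercised without Docker or a mail server.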
Jenkins is a vast solution with many features and plugins, but if you know the core concepts of a technology, you can do a lot in this domain with the basics alone. While I didn't use any extraneous plugins, I did use the Git plugin, the build pipeline plugin, and the email plugin. That's all it took, along with shell scripting and automatic job triggers, to make the setup completely automated.
The results? I first achieved an accuracy level of 84%, but my desired accuracy was above 85%. So the system automatically added more layers and trained my model until it achieved an accuracy of 89%. On top of that, I walked away with: