The conventional workflow for a machine learning or data science project starts with the data and moves through stages such as data preprocessing, model training, and prediction. Teams typically rely on several separate tools to organize the folder structure and build the pipeline.
Training a machine learning model is one thing, but managing and putting all those pieces together is a whole different story. The ML field is growing at a rapid pace. On top of that, a complete project requires an ETL pipeline (data engineering pipeline) and a machine learning pipeline, both with production-ready implementations. All of this needs to be combined with standard software engineering practices.
While one can combine ‘n’ different tools to get the work done, this approach is far more prone to bugs and increases both workload and time.
What is Kedro?
Kedro packages everything we need: ETL processes, visualization plugins, DAG extensions, project packaging, deployment options, CLI tools, and a machine learning workflow with out-of-the-box support for different data formats. Moreover, we can use Kedro either as a framework or as a library. Kedro is highly modular and flexible enough to hook in other requirements, so it can be adapted to a wide range of projects.
What makes Kedro Useful?
We have seen several benefits while building our projects. Kedro has a defined folder structure specialized for machine learning and data science work. The predefined structure helped our team collaborate and organize code into I/O operations, data engineering, and data science stages.
Kedro’s pipeline visualization saves time because the team can see how the pipeline is built without reading the code. Its structure and code organization make it easy to locate the required logic, so the team can collaborate with less effort. The defined project structure also makes it easy to revisit and recall past projects. With these issues addressed, a Kedro project is close to production-ready from the start of development. In our experience, it has been easy to scale and deploy alongside data versioning tools, which substantially reduces production effort.
Advantages of Using Kedro
- Versioning of datasets
- A framework for pipelining both data engineering and data science tasks
- A standardized project template, similar to Cookiecutter
- Modularity and flexibility
- Built-in support for different data file types
- Additional plugins, from Airflow to visualization
This is a sample project to show how easy it is to build a complete pipeline in Kedro. Let’s dive in!
Loading and saving different data files is the first thing every machine learning engineer encounters. There are various data formats and data sources. Out of the box, Kedro supports commonly used data sources, from CSV files and S3 to distributed databases. Check out the list of all supported data file formats. Because there will always be other formats, Kedro also allows us to load and process custom data formats. Loading data takes only a short entry in the catalog file.
With that entry in the catalog file, we ask Kedro to go to the URL, download the dataset, and make it available under the name “house.” In addition to this, we can version our data.
This is what the Kedro framework looks like:
To address the data pipeline workflow, Kedro divides it into two branches:
- The data engineering pipeline, where we extract data, transform it, and prepare it as input for the machine learning pipeline
- The data science pipeline, where we process and augment the data, then build and evaluate the model
Both branches have two sections: nodes and pipeline.
In nodes, we define the functions that make up our workflow, and in the pipeline, we define how data flows through those nodes. Let’s take a look at the code:
Data engineering pipeline
The data engineering pipeline is responsible for extracting the data, applying the necessary transformations, and passing training-ready data to the data science pipeline. In the above example, the process_data function loads the data and splits it into train and test subsets.
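Since a node is just a Python function, the splitting step can be sketched without any Kedro imports. The following is a minimal illustration, not the project's actual code: it assumes the "house" data arrives as a list of row dictionaries, and the parameter names `target`, `test_ratio`, and `seed`, as well as the column names, are hypothetical.

```python
import random


def process_data(rows, parameters):
    """Split rows into train/test features and targets (illustrative only)."""
    target = parameters["target"]          # hypothetical parameter name
    test_ratio = parameters["test_ratio"]  # hypothetical parameter name
    rng = random.Random(parameters.get("seed", 0))

    # Shuffle a copy so the caller's data is left untouched.
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

    def features(row):
        # Everything except the target column is treated as a feature.
        return {key: value for key, value in row.items() if key != target}

    # The output is a plain dict keyed by the dataset names used downstream.
    return {
        "train_x": [features(r) for r in train_rows],
        "train_y": [r[target] for r in train_rows],
        "test_x": [features(r) for r in test_rows],
        "test_y": [r[target] for r in test_rows],
    }


# Made-up rows standing in for the "house" dataset.
house = [
    {"rooms": i % 5 + 1, "area": 40 + 10 * i, "price": 100 + 25 * i}
    for i in range(10)
]
split = process_data(house, {"target": "price", "test_ratio": 0.2, "seed": 42})
print(len(split["train_x"]), len(split["test_x"]))
```

Returning a dictionary keyed by train_x, train_y, test_x, and test_y keeps the function free of any framework dependency, which also makes it easy to unit test on its own.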
The pipeline is one of the critical building blocks for maintaining the flow of a Kedro project. As shown in the example above, we create a node where process_data is the function defined in the node file, and house is the dataset name declared in the catalog file, as discussed earlier. We pass the parameters and data to process_data and save the output as a Python dict. We can combine as many nodes as required, but the pipeline should stay acyclic: no node should depend, directly or indirectly, on its own output.
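The advice to avoid cyclic designs can be checked mechanically. The sketch below does not use Kedro; it relies on Python's standard `graphlib` module (Python 3.9+) and describes each node as a hypothetical `(name, inputs, outputs)` tuple, purely to show why a cycle leaves no valid run order.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes):
    """Return node names in a valid run order, or raise ValueError on a cycle.

    `nodes` is a list of (name, inputs, outputs) tuples (an illustrative
    format, not Kedro's own node objects).
    """
    # Map every dataset name to the node that produces it.
    producer = {out: name for name, _, outputs in nodes for out in outputs}
    # A node depends on the producers of its inputs; inputs nobody produces
    # (such as catalog datasets) add no dependency.
    graph = {
        name: {producer[i] for i in inputs if i in producer}
        for name, inputs, _ in nodes
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"pipeline contains a cycle: {err.args[1]}") from err


acyclic = [
    ("process_data", ["house", "parameters"], ["train_x", "train_y", "test_x", "test_y"]),
    ("train_model", ["train_x", "train_y"], ["model"]),
    ("predict", ["model", "test_x"], ["predictions"]),
]
print(execution_order(acyclic))

# Two nodes that each need the other's output can never be scheduled.
cyclic = [("a", ["y"], ["x"]), ("b", ["x"], ["y"])]
try:
    execution_order(cyclic)
except ValueError as err:
    print(err)
```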
Data Science pipeline
The data science pipeline follows the same pattern as the data engineering pipeline. The only difference is that this is where we define model training, model evaluation, and every other machine learning task.
As shown in the figure above, the functions for training a model, making predictions, and calculating accuracy are defined in the node file. Each of these functions is itself a node.
The data engineering pipeline’s outputs are train_x, train_y, test_x, and test_y. In the data science pipeline, we only need to make sure we use the same names, because Kedro itself handles combining the two pipelines. As shown in the figure, we have three different nodes. Each node represents a function defined in nodes.py. One more thing to notice is that the output of the first node is the input of the second node, so the nodes form a chain.
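To make the name-based chaining concrete, here is a toy sketch in plain Python. It is not Kedro's API or the project's real model: the three functions, the majority-label "model", the sample data, and the small `run` helper are all assumptions made for illustration. The point is only that each node declares input and output names, and matching names connect one node to the next.

```python
from collections import Counter


def train_model(train_x, train_y):
    """'Train' a trivial model that remembers the most common label."""
    return {"majority": Counter(train_y).most_common(1)[0][0]}


def predict(model, test_x):
    """Predict the remembered label for every test row."""
    return [model["majority"] for _ in test_x]


def report_accuracy(predictions, test_y):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, test_y))
    return correct / len(test_y)


# Each entry: (function, input names, output name).
data_science_nodes = [
    (train_model, ["train_x", "train_y"], "model"),
    (predict, ["model", "test_x"], "predictions"),
    (report_accuracy, ["predictions", "test_y"], "accuracy"),
]


def run(nodes, datasets):
    """Run nodes in listed order, passing data by name (toy stand-in for a runner)."""
    store = dict(datasets)
    for func, inputs, output in nodes:
        store[output] = func(*(store[name] for name in inputs))
    return store


# Made-up outputs of the data engineering pipeline.
engineered = {
    "train_x": [[1], [2], [3], [4]],
    "train_y": ["cheap", "cheap", "cheap", "pricey"],
    "test_x": [[5], [6]],
    "test_y": ["cheap", "pricey"],
}
result = run(data_science_nodes, engineered)
print(result["predictions"], result["accuracy"])
```

Because the second node asks for "model" and the first node produces "model", the chain forms without either function knowing about the other. Renaming one side without the other would break the link, which is why consistent names matter.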
Visualization and packaging
Kedro has the flexibility to hook in any functionality you want. One of the problems we face in machine learning projects is explaining ML processes to our clients. With that in mind, Kedro has a visualization plugin that helps us visualize the data and model training pipeline. Using the visualization tool, we can see the overall pipeline of the system.
The figure above is the visualization of everything we defined above. This becomes extremely important as the project grows, because we can check at any point whether the data flows as we expect.
Kedro does not have an option for serving a model through an API. After the model is trained using a Kedro pipeline, it is versioned and served using tools such as MLflow or TensorFlow Serving.
Alternatives to Kedro
In our research, we haven’t found an exact alternative to Kedro. GoKart has many features in common with Kedro. A pipeline is built on top of Kedro. Here are some workflow pipelines that may have overlapping features, but they are not drop-in replacements for Kedro.
Alternative packages for workflow pipelines:
Find a more detailed comparison HERE.
Kedro packs most of the often-repeated work into one place and makes it easy to focus on the core business logic. Additionally, with various plugins and a standard project structure, working with a team is much more efficient. Loosely coupled units and the hooks implementation leave a lot of room for how we might want to use it.
Kedro gives us a high degree of flexibility to build different data-driven projects. It has replaced the “n” different tools we used to need for the overall pipeline and workflow of a data project with a single one. As a result, our projects have become easier to debug, more manageable, and more modular.