Standard Template for Machine Learning projects – deepsense.ai’s approach
Developing efficient ML projects may be a great challenge, as there are a lot of resources and decision points that need to be taken into consideration. Therefore, it is crucial to establish a framework that ensures both consistency of project delivery and high-quality code. In this post, we will share some insights about our standard template and present best practices to set up ML projects. Let’s get started!
Especially at companies like deepsense.ai, a well-defined approach to project delivery is crucial, as we develop multiple isolated projects on a regular basis and work for many different clients simultaneously.
From the very beginning, we have paid great attention to following best practices in each project. As the number of projects being developed increased, we decided to create a standard template to ensure uniformity. The ML template automates resource creation and bootstraps each new project. Such an approach provides multiple benefits:
- higher quality of developed projects,
- quick setup for new projects,
- easier onboarding of new project members,
- flexibility for special project needs,
- clear rules for team members without a programming background.
The cookiecutter template from deepsense.ai is open-sourced and available on our GitHub. You can find the documentation here.
Why the ML project template and code generators?
As the complexity of our projects has grown, along with the number of existing and prospective customers, we decided to establish a cookiecutter template for initializing new repositories.
The major advantage of using the standard template is that it provides a solid general starting point that accumulates our experience and can be easily customized to the specifics of a given project. However, it should be kept in mind that the template does not:
- provide perfect configuration for every project,
- cater to every subjective personal preference,
- recommend the newest, shiniest tools or approaches as soon as they appear.
Moreover, our intention was to propagate useful technologies that are more tied to certain types of projects. The proposed solution is to create custom code generators that can add, for example, basic boilerplate for a Streamlit demo application or a training loop based on PyTorch Lightning and Hydra to a new or ongoing project.
A structured approach to code generation solves many maintenance problems and allows solutions to be customized for every project without polluting them with unused files. Lessons learned can be integrated back into the template without accidentally breaking older projects, as can happen with framework or library updates. In short – it is safer and faster to iterate. Of course, it also means that fixes and new features need to be backported to each project, but this is a trade-off that suits our daily work best.
Although this approach is nothing new and web communities use code generators quite a lot (for React, Vue, and Elixir Phoenix, to name a few), it is not as widespread as a good practice in other contexts.
General rules for creating Machine Learning project template
It is worth underlining that there is no single, best template that will be suitable for implementation in all projects. Nevertheless, understanding the basic principles allows you to create a uniform, high-quality standard. Let’s start with understanding the design principles that guided us.
1. Harnessing experiences from different projects (both DS and SE)
In any company where you are not working on just one project (e.g., a specific product) there will be data silos. Especially in our case, we take security very seriously and limit project access very strictly. That is why it is always good to conduct internal interviews with teams implementing various projects and collect lessons learned by presenting different perspectives. While working on the template, we talked to almost every member of the deepsense.ai team to accumulate as many valuable insights as possible.
2. Adapting to software house project style
The wide spectrum of projects that we implement at our AI development company – from R&D to production – meant that we had to think carefully about the main assumptions of the template to make it as universal as possible.
3. Adding tools and configurations that are useful
Based on the collected insights, we have prepared a set of basic, recommended tools. Usually, it takes too much time to set up or read about each tool and how to configure it during an ongoing project – especially if you have not done it before. It is easier to edit existing files and adjust them for specific project needs.
4. Making onboarding as easy as possible
A shared structure, enforced code style, and common tools allow us to onboard new team members faster and switch people smoothly between projects.
5. Handing over ownership to ensure independence
The template is just a guideline and the project team takes ownership and can adapt it quickly to the specific needs of a project. Standardization and ensuring high consistency cannot kill our creativity! Especially in R&D projects – that we love so much at deepsense.ai – we can afford to take risks, relax the rules and quickly develop with the ability to adapt to higher requirements related to the production phase.
6. Being PEP and other best practices compliant
When implementing a project, we try to maintain certain established rules, but only where it makes sense and improves the effectiveness of work on the project. Examples of such sources of information include the official setuptools docs or PEP documents like this one.
7. Being cautious
Our main intention was to use battle-tested, stable methods with known drawbacks, issues, and workarounds, so they could be applied in a broad range of situations and across different projects.
deepsense.ai standard template
The created template can be used manually by all team members – they can point cookiecutter at our internal repository or use it as a zipped file. To use it in ongoing projects, cookiecutter is called with the `-f` option on a branch first, followed by a merge like any other feature.
As a good start, we highly recommend learning about cookiecutter from its original documentation.
It is super simple to use cookiecutter. Firstly, of course, we need to install it:
$ pip install cookiecutter
And then run it with our template:
$ cookiecutter <template>
cookiecutter.json file
This file is crucial, as it defines questions and key-value substitutions which are applied to template files.
Here is an example cookiecutter file:
{
  "project_name": "default",
  "__project_name_slug": "{{ cookiecutter.project_name | slugify }}",
  "ci": ["GitLab", "None"],
  "python_package_name": "{{ cookiecutter.__project_name_slug.replace('-', '_') }}"
}
Internally, the file uses the Jinja2 template system and allows the use of Python and some extensions (like slugify). For example, our default Python package name is constructed dynamically with slugify and a Python string method.
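To make the substitutions above concrete, here is a small pure-Python sketch of what the slugify step and the string replace do to a project name (a simplified stand-in for the Jinja slugify extension, not the library cookiecutter actually uses):

```python
import re

def slugify(value: str) -> str:
    # Simplified stand-in for the Jinja "slugify" filter:
    # lowercase, collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")

# The same chain of substitutions as in cookiecutter.json:
project_name = "My ML Project"
project_name_slug = slugify(project_name)                   # "my-ml-project"
python_package_name = project_name_slug.replace("-", "_")   # "my_ml_project"
```

This way the user only answers one question (the project name) and all derived names stay consistent.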
Hooks
There are two cookiecutter hooks that can be implemented in Python, run pre and post project generation:
- pre hook – validates variables’ content before cookiecutter generates files.
- e.g. checking that the Python package name provided by the user is valid; we use a regex to validate user input.
- post hook – cleans up unneeded files or runs additional code.
- e.g. removing the GitLab CI config file when the user does not need it, or fixing file permissions – we discovered that not all scripts have the correct execute permissions and need to be fixed.
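A minimal sketch of what the two hooks might contain is shown below; the file names follow cookiecutter’s convention (pre_gen_project.py and post_gen_project.py), but the regex and the cleanup logic here are illustrative rather than our exact implementation:

```python
import re
import sys
from pathlib import Path

# pre_gen_project.py (sketch): validate user input before files are generated.
PACKAGE_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_package_name(name: str) -> bool:
    """True for names like 'my_project'; False for 'My-Project' or '1bad'."""
    return PACKAGE_NAME_RE.match(name) is not None

def pre_gen_hook(package_name: str) -> None:
    if not validate_package_name(package_name):
        print(f"ERROR: {package_name!r} is not a valid Python package name")
        sys.exit(1)  # a non-zero exit makes cookiecutter abort generation

# post_gen_project.py (sketch): drop unneeded files, fix script permissions.
def post_gen_hook(use_gitlab_ci: bool) -> None:
    if not use_gitlab_ci:
        Path(".gitlab-ci.yml").unlink(missing_ok=True)
    for script in Path("scripts").glob("*.sh"):
        script.chmod(0o755)  # restore the execute bit
```

In a real hook file, cookiecutter renders the Jinja variables (e.g. `{{ cookiecutter.python_package_name }}`) into the script before running it, so the hook sees plain strings.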
Project template directory
All files that can be created by the template are put in the directory:
{{ cookiecutter.project_name }}
Cookiecutter asks the user for input, copies this directory, applies the Jinja substitutions, and calls the hooks.
Tests
We also have a few basic tests to ensure that cookiecutter projects can be generated correctly. For example, we test that all Jinja templates are resolved, that expected files are (or are not) present, etc.
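As a rough illustration, one of those checks can be as simple as scanning generated files for leftover Jinja markers (a simplified sketch, not our exact test code):

```python
import re

# Any "{{ ... }}" or "{% ... %}" left after generation means a template
# variable was not resolved by cookiecutter.
JINJA_MARKER_RE = re.compile(r"{{.*?}}|{%.*?%}")

def has_unresolved_jinja(text: str) -> bool:
    """Return True if the text still contains Jinja template markers."""
    return JINJA_MARKER_RE.search(text) is not None
```

A pytest test would then generate the project into a temporary directory and assert that no file triggers this check.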
Unfortunately, it is not possible to test everything, e.g. the GitLab CI config. For that we created a dedicated test repository which is manually synced and used to test changes before they are merged back. The repository is also sometimes used to prototype or test changes first; the feature is then backported to the template.
Documentation & CI
It is important to mention that we have CI which runs tests for the template – it has already helped to catch some errors early.
Our CI also integrates nicely with Sphinx documentation – it automatically builds the docs and uploads them to GitLab Pages. The initial documentation has recommended Sphinx extensions installed and configured, and provides a backbone a team can build upon. For example, we found Markdown nicer to use than the default RST format, so we made it possible to use it out of the box.
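For reference, Markdown support in Sphinx is commonly added with the MyST parser; a minimal conf.py fragment might look like this (the extension list is illustrative and assumes myst-parser is installed – our actual configuration differs):

```python
# conf.py fragment (sketch): accept both RST and Markdown sources.
extensions = [
    "myst_parser",          # Markdown (.md) support
    "sphinx.ext.autodoc",   # pull API docs from docstrings
    "sphinx.ext.napoleon",  # Google/NumPy-style docstring parsing
]
source_suffix = {".rst": "restructuredtext", ".md": "markdown"}
```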
Overview of our cookiecutter project
The standard cookiecutter project at deepsense.ai consists of:
- Basic python package structure:
- setup.py – legacy compatibility for pip install -e .
- setup.cfg – package metadata and dependencies.
- pyproject.toml – configuration for all tools (where supported)
- very minimal Python code + an example test
- pre-commit hooks
- black, flake8 – enforce code style
- pycln – cleans up unused imports
- mypy – checks type errors
- isort – sorts imports
- pylint – provides static code analysis and enforces coding standard
- pyupgrade – modernizes code for a given Python version
- bandit – checks for security issues
- Sphinx documentation
- basic preconfigured documentation template
- page with list of autogenerated third-party python packages list with licenses
- Basic script to setup developer environment
- Minimal README.md file with standard project setup description
- Preconfigured semantic versioning with bump2version
- Gitlab integration (default, optional):
- linter stage (pre-commit run --all-files)
- tests (pytest) + code coverage
- license checks of installed packages (e.g. fail build on GPL dependency)
- building and hosting documentation on GitLab Pages
- building package and uploading to private GitLab Package registry
- security: trivy scanning
- Other less important files (more configurations, .gitignore etc)
We run pre-commit hooks on every commit – this is important for us as it turns good practices into habits, and it is easier to fix linter issues at a small scale.
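For illustration, the skeleton of a .pre-commit-config.yaml wiring up a few of the hooks listed above might look like this (the pinned revisions are placeholders – pin your own versions):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 23.3.0          # placeholder revision
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0          # placeholder revision
    hooks:
      - id: isort
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0           # placeholder revision
    hooks:
      - id: flake8
```

Running `pre-commit install` once per clone registers these hooks so they run automatically on every commit.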
Many of these tools take time to set up – configuring CI pipelines with YAML files can be especially tiresome – but here everything just works and no tears need to be shed.
It is much easier to modify the provided example files than to search the web – adapting to project needs, such as custom client coding guidelines, takes less time too.
Of course, the template will evolve – for example, in the future we might migrate some linter checks into ruff after we battle-test it.
Cookiecutter template usage summary
The Machine Learning project template has gained great recognition within the deepsense.ai team, and we will keep developing it by analyzing new lessons learned and insights from new projects. We hope that the presented approach will inspire others to expand the idea of standardizing ML projects.