In the first part of our guide we focused on properly executing the entire process of building and implementing machine learning models with a focus on the main goal – solving the overarching business challenge. In the second part of our material we dig deeper into the topic of modeling.
MODEL. It’s never too late to learn.
Machine learning should be approached differently depending on the goal. If you’re taking part in a competition, you’ll want to focus mainly on feature engineering and experimentation – much of the other work will be done for you. In academia, researching previous approaches is vital. For commercial projects, you can likely skip a big portion of experimentation in favor of making doubly sure that you have a good idea of what is ultimately needed of you and how it fits into the big picture.
Understanding the goal and prioritizing the process accordingly is the key to success. The checklist presented below can help organize your efforts. While reading this checklist, you should decide which points are helpful means to your end and which would only slow you down.
Don’t shuffle your data (before reading this)
A lot of people set up their validation code to split the dataset randomly into train and test and call it a day. This is dangerous, as it is likely to introduce data leaks. Are we sure that time is not a factor? We probably shouldn’t use the future to predict the past. Is the data grouped by source? We should plan for any and all of those. A simple, yet very imperfect, sanity check is to validate twice – once on shuffled data, and once on a natural split (like the last lines of a file). If there’s a big discrepancy, you need to think twice.
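The double validation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the drifting `targets` series is made up, and the "model" is a deliberately naive predict-the-training-mean rule, so the numbers only show the kind of discrepancy to look for.

```python
import random

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def score_split(train, test):
    """Fit a naive model (predict the training mean) and score it on test."""
    prediction = sum(train) / len(train)
    return mean_absolute_error(test, [prediction] * len(test))

# Hypothetical target that drifts upwards over time, stored in file order.
targets = [0.1 * t for t in range(100)]
n_test = 20

# Validation 1: shuffled split.
shuffled = targets[:]
random.Random(0).shuffle(shuffled)
shuffled_mae = score_split(shuffled[n_test:], shuffled[:n_test])

# Validation 2: natural split, the last lines of the file form the test set.
natural_mae = score_split(targets[:-n_test], targets[-n_test:])

print(f"shuffled MAE: {shuffled_mae:.2f}, natural MAE: {natural_mae:.2f}")
```

On this toy series the natural split reports a clearly larger error than the shuffled one, which is the signal that time matters and that the shuffled score is too optimistic.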
Make the validation as real as possible
When testing your solution, try to imitate the exact scenario and environment the model will be used in as closely as possible. If it’s store sales prediction, you want a different model for new (or hypothetical) stores than for existing locations. You likely want different validations too. In addition to a time split, you need to add some lag to account for the model going stale between training and use, and you likely want to prevent the same store from appearing in both the training and validation set. A rule of thumb is that “the worse the results a validation gives, the better that validation likely is”.
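A minimal sketch of such a split for the "new store" scenario, assuming sales rows carry a store identifier and a day index. The `Sale` record, the function name and its parameters are hypothetical, introduced only for illustration.

```python
from collections import namedtuple

Sale = namedtuple("Sale", ["store_id", "day", "amount"])

def split_time_lag_group(rows, cutoff_day, lag_days, holdout_stores):
    """Validation: held-out stores, on or after cutoff_day.
    Training:   all other stores, strictly older than cutoff_day - lag_days."""
    validation = [
        r for r in rows
        if r.day >= cutoff_day and r.store_id in holdout_stores
    ]
    training = [
        r for r in rows
        if r.day < cutoff_day - lag_days and r.store_id not in holdout_stores
    ]
    return training, validation

# Made-up data: 4 stores, 10 days each.
rows = [Sale(s, d, 100 + s + d) for s in range(4) for d in range(10)]
training, validation = split_time_lag_group(
    rows, cutoff_day=8, lag_days=2, holdout_stores={3}
)
print(len(training), len(validation))
```

Rows that fall inside the lag window, or that belong to held-out stores before the cutoff, are dropped on purpose. For the "existing stores" scenario one would keep the time split and the lag but drop the store filter.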
Try to make the main metric “natural”
When making a model, look for metrics that make sense in the context of the problem. AUC is good internally, but I challenge anyone to explain it to a non-technical person. Precision and Recall convey information in a simpler way, but there are multiple ways to set the prediction threshold, and in the end they are still two numbers. F1 needs to die already. It takes two flawed numbers and combines them in a flawed way to arrive at an even more flawed number. Instead, you could ask “How much money is a true positive/negative worth? How much does a false positive/negative cost?” If there is one metric non-technical people understand, it’s $$$.
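One way to turn those questions into a metric is to attach a monetary value to each confusion-matrix cell and pick the threshold that maximizes the total. The sketch below assumes binary labels and made-up values; the figures in `value` are placeholders, not recommendations.

```python
def profit(labels, scores, threshold, value):
    """Total monetary outcome of acting on every score >= threshold."""
    total = 0.0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            total += value["tp"]
        elif predicted and not label:
            total += value["fp"]
        elif not predicted and label:
            total += value["fn"]
        else:
            total += value["tn"]
    return total

def best_threshold(labels, scores, value):
    """Try every observed score as a threshold and keep the most profitable."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: profit(labels, scores, t, value))

labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
# Hypothetical values: a hit earns 100, a false alarm costs 30, a miss costs 100.
value = {"tp": 100.0, "fp": -30.0, "fn": -100.0, "tn": 0.0}

threshold = best_threshold(labels, scores, value)
print(threshold, profit(labels, scores, threshold, value))
```

With misses this expensive relative to false alarms, the most profitable choice here is to flag everything, which is exactly the kind of conclusion that threshold-free metrics hide.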
Set up your pipeline in a way that prevents mistakes
Modeling is often a long process. Structure your project and code in a way that makes avoidable mistakes hard to commit. Obviously, the validation code should not care about the specifics of the model. The training code should not even have access to anything to do with validation.
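A small sketch of that separation, under the assumption that a model can be represented as "a function that takes training rows and returns a predict function". All names here are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def train_mean_model(train_rows):
    """Training code: receives training rows only and returns a predict function."""
    mean_target = sum(target for _, target in train_rows) / len(train_rows)
    return lambda feature: mean_target

def validate(train_fn, rows, n_test, metric):
    """Validation code: owns the split and the metric, knows nothing about the model."""
    train_rows, test_rows = rows[:-n_test], rows[-n_test:]
    predict = train_fn(train_rows)  # the test rows never reach the training code
    predictions = [predict(feature) for feature, _ in test_rows]
    return metric([target for _, target in test_rows], predictions)

# Made-up (feature, target) rows in their natural order.
rows = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 3.0), (5.0, 3.0)]
score = validate(train_mean_model, rows, n_test=2, metric=mean_absolute_error)
print(score)
```

Swapping in another model means passing a different `train_fn`; the split and the metric stay untouched, so a modeling change cannot silently alter how results are measured.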
Double check your framing
Establishing your validation pipeline is a nice opportunity to revisit how you framed the problem at hand and how this is being reflected in the pipeline. Plan a session or two with other stakeholders and domain experts as well to align on methodology, metrics, KPIs, initial results, and how it will all eventually fit.
MULTIPLE APPROACHES. No pain no gain.
Consistently developing good models is only possible when methodically testing hypotheses. You may believe you already know the best architecture/method for the job. What if your knowledge is outdated though? Or, worse still, you were wrong to begin with? The best way to verify is to carefully curate a gauntlet of different models, evaluate them and decide how they figure in the final solution.
Check random prediction & constant prediction
It might seem pointless to do so, but they provide some context to the problem at hand. For example, it’ll help you assess whether a model is failing to learn something at all, or just cannot break through some kind of glass ceiling.
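Both baselines take only a few lines. The sketch below assumes a classification task with accuracy as the metric; the imbalanced labels are made up.

```python
import random
from collections import Counter

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def constant_baseline(train_labels, n):
    """Always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

def random_baseline(train_labels, n, seed=0):
    """Predict labels at random, following the training label frequencies."""
    rng = random.Random(seed)
    return [rng.choice(train_labels) for _ in range(n)]

train_labels = [0] * 90 + [1] * 10
test_labels = [0] * 18 + [1] * 2

constant_acc = accuracy(test_labels, constant_baseline(train_labels, len(test_labels)))
random_acc = accuracy(test_labels, random_baseline(train_labels, len(test_labels)))
print(constant_acc, random_acc)
```

On data this imbalanced the constant prediction already reaches 90% accuracy, so a model reporting 91% has learned very little.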
Develop a simple rule-based model
It may be predicting the previous value for time series, a “group by XYZ, then use median” approach, a regex detecting common patterns… We expect that any model we build would at least be an improvement on that.
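Two of these rules, sketched with the standard library only. The row layout (dictionaries with a `city` and a `price` field) is an assumption made for the example.

```python
from collections import defaultdict
from statistics import median

def previous_value_forecast(series):
    """Predict each point, from the second one onwards, as the previous observation."""
    return series[:-1]

def group_median_model(train_rows, key, target):
    """'Group by key, then use median', with the global median as a fallback."""
    groups = defaultdict(list)
    for row in train_rows:
        groups[row[key]].append(row[target])
    medians = {group: median(values) for group, values in groups.items()}
    fallback = median(row[target] for row in train_rows)
    return lambda row: medians.get(row[key], fallback)

train_rows = [
    {"city": "A", "price": 10},
    {"city": "A", "price": 20},
    {"city": "A", "price": 60},
    {"city": "B", "price": 5},
]
predict = group_median_model(train_rows, key="city", target="price")
print(predict({"city": "A"}), predict({"city": "B"}), predict({"city": "C"}))
print(previous_value_forecast([3, 4, 6]))  # forecasts for the values 4 and 6
```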
Try a classical ML model
Before using the newest and most capable tools at your disposal, try linear regression, gradient boosting or something similarly easy to define, train & use. See how much each model can squeeze out of the data and try to understand why. Do not perform non-obvious feature extraction/selection, that’s not the point here.
Test the standard approach for your problem class
Are you doing NLP? There’s probably a version of BERT for you. Financial modeling? Factorization Machines are likely still king. This not only helps you understand the problem better, but it may also become the foundation of the overall solution.
Use any kind of benchmark available
If there is a solution or an approach that has previously been used, test it. It may be your client’s current ruleset, or something you found in a research paper. It is paramount that you don’t spend too much time attempting to be worse than what’s already out there. Also, if possible, use the benchmark to verify your validation pipeline – it should rank solutions in a similar order to the public benchmarks. If it doesn’t, it’s important to investigate why.
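One way to quantify "similar order" is a rank correlation between the scores published in a benchmark and the scores your own pipeline assigns to the same solutions. The sketch below uses Spearman's formula and assumes there are no tied scores; the solution names and numbers are invented.

```python
def ranks(values):
    """Rank positions (0 = lowest), assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two equally long lists without ties."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical scores for the same four solutions.
solutions = ["solution_a", "solution_b", "solution_c", "solution_d"]
benchmark_scores = [0.61, 0.74, 0.82, 0.79]
our_scores = [0.58, 0.71, 0.80, 0.81]

rho = spearman(benchmark_scores, our_scores)
print(f"rank agreement: {rho:.2f}")
```

A value near 1 means the pipeline orders solutions like the benchmark does; here the two top solutions swap places, which is the kind of disagreement worth investigating.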
Try ensembling, if you need it
Ensembling different models (or different instances of the same model!) can help you squeeze some more quality out of your data. It’s a powerful tool that regularizes your output. Please remember that the more distinct your models are, and so the less their errors overlap, the better the ensemble tends to perform. Just make sure not to overdo it. The time needed to build a big ensemble or use it for prediction adds up quickly.
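The simplest ensemble is a plain average of predictions. In the made-up example below, two models are equally wrong on their own but err in different directions on most rows, so averaging cancels much of the error.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def average_ensemble(*prediction_lists):
    """Average the predictions of several models, row by row."""
    return [sum(row) / len(row) for row in zip(*prediction_lists)]

actual = [10.0, 12.0, 14.0, 16.0]
model_a = [11.0, 11.0, 15.0, 15.0]  # errors: +1, -1, +1, -1
model_b = [9.0, 13.0, 15.0, 17.0]   # errors: -1, +1, +1, +1

ensemble = average_ensemble(model_a, model_b)
print(mean_absolute_error(actual, model_a),
      mean_absolute_error(actual, model_b),
      mean_absolute_error(actual, ensemble))
```

The only row where the ensemble does not improve is the third one, where both models make the same mistake. Had the two models been identical, the average would have gained nothing.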
After the above, you can get a little crazy
Now you know what the pre-existing approaches are and can make educated guesses about the quality of your own. It’s high time to test them. Leverage your validation/test/evaluation pipeline to make decisions for you. If you think an idea is silly, perhaps you can run it through the code anyway – there’s not much to lose.
FEATURE ENGINEERING. Learn to walk before you run.
The topic of feature engineering has fallen out of fashion in recent years due to the prevalence of deep learning and the focus on models doing “their own feature engineering”. This does not mean that it is obsolete. In particular, the smaller the dataset, the more important it is to help the model understand the data.
Be creative
Think about properties that can be (even remotely) useful for your algorithms. This is an area where some domain knowledge may help, but you should go beyond this. If experts use 5 specific variables to assess something, ask them about 20 more that may be related to the problem.
Copy from others (or from your past self)
It may be worthwhile to spend some time inspecting similar problems and features that worked well there. Treat this as an inspiration – even if you can’t translate them directly, you may be able to come up with a proxy or an analogy.
Be thorough
Some features can be extracted from the way data is collected. When predicting the click through rate for a commercial, looking at the user’s history is much more important than just the specific datapoint. A user who sees a lot of commercials in a short time frame is most likely a bot. We cannot infer this if we decide to treat each record as a separate entity.
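A history feature of this kind can be computed in one pass over time-ordered events. The sketch assumes each event is a `(user_id, timestamp_in_seconds)` pair sorted by timestamp; the event log and the 60-second window are made up.

```python
from collections import defaultdict, deque

def impressions_in_window(events, window_seconds):
    """For each event, count the same user's earlier impressions
    that happened within the preceding window_seconds."""
    recent = defaultdict(deque)
    counts = []
    for user_id, timestamp in events:
        history = recent[user_id]
        # Drop impressions that are too old to matter for this event.
        while history and timestamp - history[0] > window_seconds:
            history.popleft()
        counts.append(len(history))
        history.append(timestamp)
    return counts

events = [
    ("bot", 0), ("bot", 1), ("human", 2),
    ("bot", 2), ("bot", 3), ("human", 600),
]
print(impressions_in_window(events, window_seconds=60))
```

Each count only looks backwards in time, so the feature can be computed the same way at prediction time without leaking future information.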
Help your model
A lot of models are limited in terms of scope. Linear regression treats each feature separately to an extent – but we can make “new features” out of feature interactions (and feature interactions are not limited to cross-products). On the other hand, when using tree-based models, some categorical variables with a natural order will work better encoded as a single ordered integer than one-hot encoded.
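Both ideas in a short sketch. The feature names (`width`, `height`) and the `SIZE_ORDER` levels are hypothetical and stand in for whatever the domain provides.

```python
def add_interactions(row):
    """Return a copy of a numeric feature dict with hand-made interaction features."""
    out = dict(row)
    out["area"] = row["width"] * row["height"]          # a cross-product
    out["aspect_ratio"] = row["width"] / row["height"]  # an interaction that is not a product
    return out

# A categorical variable with a natural order.
SIZE_ORDER = ["small", "medium", "large"]

def ordinal_encode(value, order=SIZE_ORDER):
    """One integer column that preserves the order; a tree can split on 'size >= medium'."""
    return order.index(value)

def one_hot_encode(value, order=SIZE_ORDER):
    """One column per level; the order between levels is lost."""
    return [1 if value == level else 0 for level in order]

print(add_interactions({"width": 4.0, "height": 2.0}))
print(ordinal_encode("large"), one_hot_encode("medium"))
```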
Remove or change features that are too specific
A full address is probably not needed – as opposed to info about being located in a city or not, or the distance to the closest highway…
CONTINUOUS IMPROVEMENT. To err is a model, to understand is divine.
Your model will be wrong in the future and this may make someone unhappy. What if you can predict what mistakes can happen? What if you can somehow spot them? Fix them?
Understand errors, predictions, variables and factors
By understanding your model’s shortcomings very well, you might not only get some ideas on how to mend them but also design the whole thing to make them less severe.
Know your predictions
The same goes for predictions. What do they usually look like? Is there a bias? What are the typical results in typical scenarios?
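A first look at bias can be as simple as the mean signed error, overall and per scenario. The segments and numbers below are invented for illustration.

```python
from collections import defaultdict

def mean_signed_error(actual, predicted):
    """Positive means the model over-predicts on average, negative means it under-predicts."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

def bias_by_segment(segments, actual, predicted):
    """Mean signed error computed separately for each segment."""
    grouped = defaultdict(list)
    for segment, a, p in zip(segments, actual, predicted):
        grouped[segment].append(p - a)
    return {segment: sum(errors) / len(errors) for segment, errors in grouped.items()}

segments = ["weekday", "weekday", "weekend", "weekend"]
actual = [100, 120, 200, 220]
predicted = [102, 118, 180, 190]

print(mean_signed_error(actual, predicted))
print(bias_by_segment(segments, actual, predicted))
```

Here the overall bias is entirely a weekend effect: the model is unbiased on weekdays and systematically low on weekends, which a single aggregate number would not reveal.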
Know how to change your (model’s) mind
What variables make it work? Do you know how your model reacts to changes in hyperparameters? You may check whether increasing/decreasing them gives the desired result.
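For variables, a simple probe is to change one input at a time and watch how the prediction moves. The sketch assumes the model is available as a `predict` function over a feature dictionary; the linear `predict` below is a stand-in, not a real model.

```python
def sensitivity(predict, row, feature, delta):
    """Change in prediction when a single feature is shifted by delta."""
    changed = dict(row)
    changed[feature] += delta
    return predict(changed) - predict(row)

# Stand-in model, used only to make the example runnable.
def predict(row):
    return 3.0 * row["price"] - 2.0 * row["distance"]

row = {"price": 10.0, "distance": 5.0}
for feature in row:
    print(feature, sensitivity(predict, row, feature, delta=1.0))
```

If a feature that experts consider important barely moves the output, or moves it in the wrong direction, that is worth investigating. The same one-at-a-time approach applies to hyperparameters, with retraining in place of `predict`.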
Reassess your results and validation pipeline
Simple models might not be powerful and expressive enough to expose the shortcomings of the validation pipeline. Once your errors are fewer or smaller, revisit the validation method and make sure it stays solid throughout the whole process.
META
You are most likely overwhelmed with the number of dos and don’ts in this guide. It does tell you to make the distinction between necessary and unnecessary, but there are two more things that should be mentioned, especially if you’re working on the project with other people.
Periodically make an examination of conscience
Some of the best practices described here require planning, regular maintenance, good coding, and a careful approach. There are situations where those prerequisites deteriorate. Set aside time to go back and bring your approach back up to par.
Document when cutting corners
Sometimes you cannot follow all of the best practices, e.g. there is no way to make a believable validation pipeline. Those compromises should be noted and explained, so that in future iterations of the project you (or especially others) know what should be improved.