Much has been said about running machine learning projects effectively, yet the topic keeps coming up. Data scientists spend a lot of time discussing modeling methods while, in my opinion, the overarching goal of running machine learning projects in companies fades into the background. It is vital to remember that the purpose of ML projects is not modeling itself, but achieving defined business goals.
While for data scientists modeling is often the most exciting part of the job, the other steps of an ML project should not be neglected, as doing so may imperil the valuable business results we set out to achieve in the first place. Properly executing the entire process of building and implementing machine learning models is essential.
PROCESS. As you make your bed, so you must lie upon it.
The entire ML project process can be described as a five-point checklist.
- FRAMING – the main goal here is to determine the essence of a business problem and phrase it in Data Science lingo.
- DATA – fuels the whole solution, so we need to painstakingly examine and understand it.
- MODELING – building the model is the core activity and is often viewed as the most exciting part.
- PRESENTATION & CONTINUATION – for our efforts to be fully appreciated, both the results and the solution must be described in a way that is understandable and useful for business stakeholders.
- PRODUCTION & MAINTENANCE – because data science projects require so much experimentation, how they are to be brought to the production stage may end up being neglected. Is the code up to snuff? Do my timelines need to be adjusted? Will the solution we produce ultimately solve the problem stakeholders need solved? Make sure that you think about this at some point.
FRAMING. All that glitters is not gold.
Often, before we start a project, it seems that everyone has the same understanding of the basic concepts and purpose. It is worth double-checking to ensure they do. A project deepsense.ai did for a client in the banking sector may serve as a good example of why. Our job was to predict churn, which seemed straightforward enough at first glance. The problem was that everyone’s underlying interpretation of “churn” was slightly different. Only a series of detailed questions allowed us to agree on what exactly we define as “churn,” taking into account the time horizon, specific customer activities and eventual net profit from this customer. The more thoroughly we analyze and define a business problem, the more precisely we will be able to transfer it into metrics.
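To show what "agreeing on a definition" can look like once it is written down, here is a minimal sketch of a churn rule expressed as code. It is not the definition used in the banking project. The field names (`last_activity`, `net_profit`), the 90-day horizon and the profit threshold are all assumptions made up for illustration. The point is only that every ingredient mentioned above (time horizon, qualifying activity, net profit) becomes an explicit parameter that stakeholders can challenge.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Customer:
    customer_id: str
    last_activity: date  # hypothetical: date of the most recent qualifying activity
    net_profit: float    # hypothetical: net profit from this customer in the look-back period


def is_churned(customer: Customer, as_of: date,
               horizon_days: int = 90, min_profit: float = 0.0) -> bool:
    """Illustrative rule: a customer counts as churned if they were
    profitable enough to care about and showed no qualifying activity
    within the agreed horizon. All thresholds are assumptions."""
    inactive = (as_of - customer.last_activity) > timedelta(days=horizon_days)
    worth_keeping = customer.net_profit > min_profit
    return inactive and worth_keeping
```

Changing `horizon_days` or `min_profit` changes who is labeled as churned, which is exactly the kind of difference in interpretation that is worth surfacing before any modeling starts.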
These are the questions that are worth answering during this phase:
How will the solution be used?
Determine at the outset the context in which the results of our modeling will be used. Will they be input for business decisions, support for process automation, or just some improvements within the system?
How should performance be measured?
Discuss the KPIs and establish how project success will be measured. The importance of this point shouldn’t be underestimated.
How will the solution be tested and validated?
Consider what your validation pipeline should look like during development and what additional testing should be done during and after it. Remember that business stakeholders must be able to confirm that the solution works well.
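One possible shape for such a pipeline, sketched under the assumption that the model will predict future events from past data: hold out the most recent period for validation instead of splitting at random. The function name `time_split` and the `event_date` key are hypothetical, and a time-based split is only one option among several.

```python
from datetime import date


def time_split(rows, cutoff, date_key="event_date"):
    """Split records into a training set (strictly before the cutoff)
    and a validation set (on or after the cutoff)."""
    train = [r for r in rows if r[date_key] < cutoff]
    valid = [r for r in rows if r[date_key] >= cutoff]
    return train, valid


# Usage with made-up records
rows = [
    {"event_date": date(2023, 1, 15), "label": 0},
    {"event_date": date(2023, 3, 2), "label": 1},
    {"event_date": date(2023, 6, 9), "label": 0},
]
train, valid = time_split(rows, cutoff=date(2023, 6, 1))
```

A split like this mirrors how the solution will be used after deployment, which tends to make the validation results easier to explain to business stakeholders.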
Will additional requirements or limitations arise?
Last but not least, analyze the technical capabilities and possible limitations, for example those related to data extraction or tolerable latency.
DATA. A bad workman blames his tools.
The garbage in / garbage out principle is well known. I would encourage you to look at it not only from the perspective of data, but also from that of model development.
Key aspects at this stage include:
Understand what data can be available, and request it early
Due to the complexity of business processes, we almost always encounter difficulties in obtaining properly prepared data. Take this into account and plan more time to request and validate data – possibly even before we officially kick off the project.
Understand how the data was extracted and preprocessed
Surprises are rare here, but when they occur, they tend to be big ones. The person responsible for data extraction may make certain decisions or errors that can significantly distort the results – either by accident, miscommunication or simply out of a lack of knowledge of modelling practices. It is always a good idea to understand the data extraction process and later thoroughly review the data with the business and data owners to confirm that you are all on the same page.
Explore the data
First of all, double-check that you have all the data you requested. Then perform the critical step of pre-modelling Exploratory Data Analysis. It should help you understand the data and the problem itself, generate insights, discover patterns, spot anomalies and test hypotheses. You can approach the data exploration as if it were up to you, rather than the model, to generate predictions. Don’t forget the simple stuff at this stage: ask basic questions, compute statistics, and check the feature and label distributions.
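The "simple stuff" can be very simple indeed. Below is a minimal sketch that uses only the Python standard library to count missing values, compute basic statistics for one numeric feature and report the label distribution. The helper `summarize` and the column names `balance` and `churned` are made up for illustration. In practice you would likely reach for a dataframe library, but the questions being asked are the same.

```python
import statistics
from collections import Counter


def summarize(rows, feature, label):
    """Basic sanity statistics for one numeric feature and the label column."""
    values = [r[feature] for r in rows if r[feature] is not None]
    label_counts = Counter(r[label] for r in rows)
    return {
        "n_rows": len(rows),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "label_share": {k: v / len(rows) for k, v in label_counts.items()},
    }


# Usage with made-up records
rows = [
    {"balance": 10, "churned": 0},
    {"balance": None, "churned": 1},
    {"balance": 30, "churned": 0},
    {"balance": 20, "churned": 0},
]
summary = summarize(rows, feature="balance", label="churned")
```

Even a summary this small can raise useful questions for the data owners, such as why a value is missing or whether the label share matches what the business expects.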
Confront your findings with the data owners
Take all the findings from the EDA, along with your conclusions, insights, and new hypotheses, to the data owners and business stakeholders. Confronting all of these early with people who know the data and the problem (hopefully better and with more intuition than we have) will provide a better foundation for modelling.
MODELING. It is never too late to learn.
I hope I won’t disappoint you here, but given the importance of modelling and people’s eagerness to do it, I’m going to cover it in detail in the second part. For now, for the sake of order, let’s cover just a few key aspects.
Try a number of different approaches
Modeling is a process of constant experimentation and it is always worth trying a number of different approaches to models, features and hyperparameters. The key thing here is to do this in the proper order – starting from basic benchmarks and standard or off-the-shelf approaches before rolling up your sleeves and unleashing your creativity.
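As an illustration of "starting from basic benchmarks", here is a minimal sketch of about the simplest benchmark there is for a classification task: always predict the most frequent label seen in training. The function names are hypothetical and accuracy is used only for brevity. The metric that matters is whichever one was agreed on during framing.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that ignores its input and always outputs
    the most frequent training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Usage with made-up labels
predict = majority_baseline([0, 0, 0, 1])
y_true = [0, 1, 0, 0]
y_pred = [predict(None) for _ in y_true]
baseline_accuracy = accuracy(y_true, y_pred)
```

Any more elaborate model should beat this number by a clear margin. If it does not, that is worth understanding before investing in further creativity.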
Understand errors, important variables, predictions, …
Understanding those will give us ideas for improvement as well as allow us to catch potential issues or anomalies in time.
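One simple way to start understanding errors, sketched here as an assumption and not as a prescribed method, is to break the error rate down by a business-relevant segment. The record keys `y_true`, `y_pred` and `segment` are hypothetical.

```python
from collections import defaultdict


def error_rate_by_segment(records, segment_key):
    """Share of wrong predictions within each segment."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        seg = r[segment_key]
        totals[seg] += 1
        if r["y_true"] != r["y_pred"]:
            errors[seg] += 1
    return {seg: errors[seg] / totals[seg] for seg in totals}


# Usage with made-up predictions
records = [
    {"segment": "retail", "y_true": 1, "y_pred": 1},
    {"segment": "retail", "y_true": 0, "y_pred": 1},
    {"segment": "corporate", "y_true": 0, "y_pred": 0},
    {"segment": "corporate", "y_true": 1, "y_pred": 1},
]
rates = error_rate_by_segment(records, "segment")
```

A segment with a noticeably higher error rate is a natural starting point for a conversation with the data owners.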
Confront your findings and predictions with business stakeholders
Whatever your findings, make sure that they are either expected or that you can explain them. Another good educational exercise is asking business stakeholders what they expect the data to reveal and cross-checking this with what you discover.
Do at least a couple of iterations
Don’t forget the experimental nature of modeling. Developing a good solution usually takes time and requires multiple iterations. Don’t be afraid to revisit earlier steps if necessary – especially after discovering something new about the problem/data or stalling.
PRESENTATION & CONTINUATION. All’s well that ends well.
How and why does your solution achieve your business objective?
A well-thought-out presentation of the project’s results may determine its success or failure. That is why we need to pay particular attention to emphasizing the business value of the project and the impact of the modeling on specific business processes. Otherwise business stakeholders may not understand the solution we deliver or its actual value. Doing the work is one thing but solving the problem and convincing others that we have done so is quite another.
What steps are necessary for full deployment/productionization?
More often than not the PoC doesn’t fully reflect the project’s business value. Therefore, at this stage, ask yourself “what’s next?” in order to determine what further developments are needed to achieve the desired results.
Optional: Prepare for the handover
If you are aiming for a full handover, be sure that both sides are on the same page – no one likes surprises here.
PRODUCTION & MAINTENANCE. Don’t count your chickens before they hatch.
Make sure your code is production-ready
Because of the experimental nature of data science, code quality and general software engineering principles may take a back seat to modelling. Going “live” is the last call to account for this.
Monitor, measure, and retrain only if necessary
Make sure that you monitor both inputs (feature space) and outputs (your predictions and actual labels if possible). Detecting any data shifts late will surely make many people unhappy. As for retraining, understand how often it has to be done and figure out the right degree of automation.
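To make input monitoring a little more concrete, here is a minimal sketch of one commonly used drift measure, the Population Stability Index (PSI), computed for a single numeric feature. PSI is my choice for illustration and is not something the checklist prescribes. The number of bins and the small `eps` that guards against empty bins are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (expected)
    and a recent sample (actual) of one numeric feature.
    Bins are equal-width over the range of the reference sample;
    values outside that range fall into the first or last bin."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage with made-up data: the same sample, then a clearly shifted one
reference = list(range(100))
shifted = [v + 50 for v in reference]
no_drift = psi(reference, reference)
drift = psi(reference, shifted)
```

Identical samples give a PSI of zero, and the value grows as the recent data moves away from the reference. The threshold at which to raise an alert is a judgment call to make together with the people who own the process.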
THE DEEPSENSE.AI TAKEAWAY
Many businesses have already learned the value of putting machine learning to use. The role of Data Science or ML teams is growing and advanced data analysis is becoming a key factor to support strategic imperatives. It is therefore crucial that data scientists keep in mind the main goal of the project – solving or improving a specific business problem (as opposed to just playing with ML). The modelling itself may well be the key or core ML activity, yet if unaccompanied by all the other steps it won’t achieve the ultimate goal. Hopefully, the above checklist will improve your collaboration with business stakeholders and bring greater success to your projects.