For companies seeking ways to test AI-driven solutions in a safe environment, running a competition for data scientists is a great and affordable way to go – when it’s done properly.
According to a McKinsey report, only 20% of companies consider themselves adopters of AI technology while 41% remain uncertain about the benefits that AI provides. Considering the cost of implementing AI and the organizational challenges that come with it, it’s no surprise that smart companies seek ways to test the solutions before implementing them and get a sneak peek into the AI world without making a leap of faith.
That’s why more and more organizations are turning to data science competition platforms like Kaggle, CrowdAI and DrivenData. Making a data science-related challenge public and inviting the community to tackle it comes with many benefits:
- Low initial cost – the company needs only to provide data scientists with data, pay the entrance fee and fund the award. There are no further costs.
- Validating results – participants provide the company with verifiable, working solutions.
- Establishing contacts – A lot of companies and professionals take part in Kaggle competitions. The ones who tackled the challenge may be potential vendors for your company.
- Brainstorming the solution – data science is a creative field, and there’s often more than one way to solve a problem. Sponsoring a competition means you’re sponsoring a brainstorming session with thousands of professional and passionate data scientists, including the best of the best.
- No further investment or involvement – the company gets immediate feedback. If an AI solution is deemed efficacious, the company can move forward with it and otherwise end involvement in funding the award and avoid further costs.
While numerous organizations – big e-commerce websites and state administrations among them – sponsor competitions and leverage the power of data science community, running a comptetion is not at all simple. An excellent example is the competition the US National Oceanic and Atmospheric Administration sponsored when it needed a solution that would recognize and differentiate individual right whales from the herd. Ultimately, what proved the most efficacious was the principle of facial recognition, but applied to the topsides of the whales, which were obscured by weather, water and the distance between the photographer above and the whales far below. To check if this was even possible, and how accurate a solution may be, the organization ran a Kaggle competition, which deepsense.ai won.
Having won several such competitions, we have encountered both brilliant and not-so-brilliant ones. That’s why we decided to prepare a guide for every organization interested in testing potential AI solutions in Kaggle, CrowdAI or DrivenData competitions.
Recommendation 1. Deliver participants high-quality data
The quality of your data is crucial to attaining a meaningful outcome. Minus the data, even the best machine learning model is useless. This also applies to data science competitions: without quality training data, the participants will not be able to build a working model. This is a great challenge when it comes to medical data, where obtaining enough information is problematic for both legal and practical reasons.
- Scenario: A farming company wants to build a model to identify soil type from photos and probing results. Although there are six classes of farming soil, the company is able to deliver sample data for only four. Considering that, running the competition would make no sense – the machine learning model wouldn’t be able to recognize all the soil types.
Advice: Ensure your data is complete, clear and representative before launching the competition.
Recommendation 2. Build clear and descriptive rules
Competitions are put together to achieve goals, so the model has to produce a useful outcome. And “useful” is the point here. Because those participating in the competition are not professionals in the field they’re producing a solution for, the rules need to be based strictly on the case and the model’s further use. Including even basic guidelines will help them to address the challenge properly. Lacking these foundations, the outcome may be right but totally useless.
- Scenario: Mapping the distribution of children below the age of 7 in the city will be used to optimize social, educational and healthcare policies. To make the mapping work, it is crucial to include additional guidelines in the rules. The areas mapped need to be bordered by streets, rivers, rail lines, districts and other topographical obstacles in the city. Lacking these, many of the models may map the distribution by cutting the city into 10-meter widths and kilometer-long stripes, where segmentation is done but the outcome is totally useless due to the lack of proper guidelines in the competition rules.
Advice: Think about usage and include the respective guidelines within the rules of the competition to make it highly goal-oriented and common sense driven.
Recommendation 3. Make sure your competition is crack-proof
Kaggle competition winners take home fame and the award, so participants are motivated to win. The competition organizer needs to remember that there are dozens (sometimes thousands) of brainiacs looking for “unorthodox” ways to win the competition. Here are three examples
- Scenario 1: A city launches a competition in February 2018 to predict traffic patterns based on historical data (2010-2016). The prediction had to be done for the first half of 2017 and the real data from that time was the benchmark. Googling away, the participants found the data, so it was easy to fabricate a model that could predict with 100% accuracy. That’s why the city decided to provide an additional, non-public dataset to enrich the data and validate if the models are really doing the predictive work.
However, competitions are often cracked in more sophisticated ways. Sometimes data may ‘leak’: data scientists get access to data they shouldn’t see and use it to prepare their model to tailor a solution to spot the outcome, rather than actually predicting it.
- Scenario 2: Participants are challenged to predict users’ age from internet usage data. Before the competition, the large company running it noticed that there was a long aplha-numeric ID, with the age of users embedded, for every record. Running the competition without deleting the ID would allow participants to crack it instead of building a predictive model.
Benchmark data is often shared with participants to let them polish their models. By comparing the input data and the benchmark it is sometimes possible to reverse-engineer the outcome. The practice is called leaderboard probing and can be a serious problem.
- Scenario 3: The competition calls for a model to predict a person’s clothing size based on height and body mass. To get the benchmark, the participant has to submit 10 sample sizes. The benchmark then compares the outcome with the real size and returns an average error. By submitting properly selected numbers enough times, the participant cracks the benchmark. Anticipating the potential subterfuge, the company opts to provide a public test set and a separate dataset to run the final benchmark and test the model.
Advice: Look for every possible way your competition could be cracked and never underestimate your participants’ determination to win.
Recommendation 4. Spread the word about your competition
One of the benefits of running a competition is that you get access to thousands of data scientists, from beginners to superstars, who brainstorm various solutions to the challenge. Playing with data is fun and participating in competitions is a great way to validate and improve skills, show proficiency and look for customers. Spreading the word about your challenge is almost as important as designing the rules and preparing the data.
- Scenario: A state administration is in need of a predictive model. It has come up with some attractive prizes and published the upcoming challenge for data scientists on its website. As these steps may not yield the results it’s looking for, it decides to sponsor a Kaggle competition to draw thousands of data scientists to the problem.
Advice: Choose a popular platform and spread the word about the competition by sending invitations and promoting the competition on social media. Data scientists swarm to Kaggle competitions by the thousands. It stands to reason that choosing a platform to maximize promotion is in your best interest.
Running a competition on Kaggle or a similar platform can not only help you determine if an AI-based solution could benefit your company, but also potentially provide the solution, proof of concept and the crew to implement it at the same time. Could efficiency be better exemplified?
Just remember, run a competition that makes sense. Although most data scientists engage in competitions just to win or validate their skills, it is always better to invest time and energy in something meaningful. It is easier to spot if the processing data makes sense than a lot of companies running competitions realize.
Preparing a model that is able to recognize plastic waste in a pile of trash is relatively easy. Building an automated machine to sort the waste is a whole different story. Although there is nothing wrong with probing the technology, it is much better to run a competition that will give feedback that can be used to optimize the company’s short- and long-term future performance. Far too many competitions either don’t make sense or produce results that are never used. Even if the competition itself proves successful, who really has the time or resources to do fruitless work?