For companies seeking ways to test AI-driven solutions in a safe environment, running a competition for data scientists is a great and affordable way to go – when it’s done properly.
According to a McKinsey report, only 20% of companies consider themselves adopters of AI technology while 41% remain uncertain about the benefits that AI provides. Considering the cost of implementing AI and the organizational challenges that come with it, it’s no surprise that smart companies seek ways to test the solutions before implementing them and get a sneak peek into the AI world without making a leap of faith.
That’s why more and more organizations are turning to data science competition platforms like Kaggle, CrowdAI and DrivenData. Making a data science-related challenge public and inviting the community to tackle it comes with many benefits:
- Low initial cost – the company needs only to provide data scientists with data, pay the entrance fee and fund the award. There are no further costs.
- Validating results – participants provide the company with verifiable, working solutions.
- Establishing contacts – many companies and professionals take part in Kaggle competitions. Those who tackle your challenge may become potential vendors for your company.
- Brainstorming the solution – data science is a creative field, and there’s often more than one way to solve a problem. Sponsoring a competition means you’re sponsoring a brainstorming session with thousands of professional and passionate data scientists, including the best of the best.
- No further investment or involvement – the company gets immediate feedback. If an AI solution proves effective, the company can move forward with it. Otherwise, its involvement ends with funding the award, and it avoids further costs.
Recommendation 1. Deliver participants high-quality data
The quality of your data is crucial to attaining a meaningful outcome. Without good data, even the best machine learning model is useless. This also applies to data science competitions: without quality training data, the participants will not be able to build a working model. This is a particular challenge with medical data, where obtaining enough information is problematic for both legal and practical reasons.
- Scenario: A farming company wants to build a model to identify soil type from photos and probing results. Although there are six classes of farming soil, the company is able to deliver sample data for only four. Given that, running the competition would make no sense – the machine learning model wouldn’t be able to recognize all the soil types.
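The coverage gap described in the soil scenario is something an organizer can check before launching. Below is a minimal sketch in Python, assuming the training labels are available as a plain list. The function name `check_class_coverage`, the `min_per_class` threshold and the soil class names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def check_class_coverage(labels, expected_classes, min_per_class=1):
    """Return expected classes that have fewer than min_per_class samples.

    The result maps each under-represented class to its sample count.
    """
    counts = Counter(labels)
    return {
        cls: counts.get(cls, 0)
        for cls in expected_classes
        if counts.get(cls, 0) < min_per_class
    }


# Hypothetical example: six expected soil classes, only four present in the data.
expected = ["clay", "sandy", "silty", "peaty", "chalky", "loamy"]
labels = ["clay", "sandy", "clay", "silty", "peaty", "sandy"]

missing = check_class_coverage(labels, expected)
print(missing)  # {'chalky': 0, 'loamy': 0}
```

If the returned dictionary is not empty, the dataset cannot support a model that recognizes every class, and the data should be completed before the competition starts.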
Recommendation 2. Build clear and descriptive rules
Competitions are put together to achieve goals, so the model has to produce a useful outcome. And “useful” is the point here. Because participants are not professionals in the field they are producing a solution for, the rules need to be based strictly on the use case and the model’s intended use. Including even basic guidelines will help them address the challenge properly. Without these foundations, the outcome may be technically right but totally useless.
- Scenario: A city wants to map the distribution of children under the age of 7 in order to optimize social, educational and healthcare policies. To make the mapping work, it is crucial to include additional guidelines in the rules: the areas mapped need to be bordered by streets, rivers, rail lines, districts and other topographical obstacles in the city. Without these guidelines, many of the models may map the distribution by cutting the city into stripes 10 meters wide and a kilometer long. The segmentation is done, but the outcome is totally useless.
Recommendation 3. Make sure your competition is crack-proof
Kaggle competition winners take home fame and the award, so participants are motivated to win. The competition organizer needs to remember that there are dozens (sometimes thousands) of brainiacs looking for “unorthodox” ways to win the competition. Here are three examples:
- Scenario 1: A city launches a competition in February 2018 to predict traffic patterns based on historical data (2010-2016). The prediction is to be made for the first half of 2017, with the real data from that period serving as the benchmark. Because that data can be found with a simple search, participants can easily fabricate a model that “predicts” with 100% accuracy. That’s why the city decides to provide an additional, non-public dataset to enrich the data and validate whether the models are really doing predictive work.
- Scenario 2: Participants are challenged to predict users’ age from internet usage data. Before the competition, the large company running it notices that every record carries a long alphanumeric ID with the user’s age embedded in it. Running the competition without deleting the ID would allow participants to crack it instead of building a predictive model.
- Scenario 3: The competition calls for a model to predict a person’s clothing size based on height and body mass. To get the benchmark, the participant has to submit 10 sample sizes. The benchmark then compares the outcome with the real size and returns an average error. By submitting properly selected numbers enough times, the participant cracks the benchmark. Anticipating the potential subterfuge, the company opts to provide a public test set and a separate dataset to run the final benchmark and test the model.
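Two of the safeguards from the scenarios above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete leakage audit: the function names, the record layout (`id` and `age` keys) and the 30% public share are hypothetical. The substring check only catches a target value embedded verbatim in the ID; an encoded or hashed value would need a different test, and short numbers can also match by coincidence.

```python
import random


def id_leaks_target(record_id, target):
    """True if the target value appears verbatim inside the ID string."""
    return str(target) in str(record_id)


def leak_rate(records, id_key="id", target_key="age"):
    """Fraction of records whose ID contains the target value."""
    hits = sum(id_leaks_target(r[id_key], r[target_key]) for r in records)
    return hits / len(records)


def split_public_private(rows, public_fraction=0.3, seed=0):
    """Shuffle the test rows and split them into public and private parts.

    The public part gives feedback during the competition; the private
    part is scored only once, for the final ranking.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: two of the three IDs contain the user's age.
records = [
    {"id": "AX34F9", "age": 34},
    {"id": "QQ27Z1", "age": 27},
    {"id": "KT90B2", "age": 41},
]
print(round(leak_rate(records), 2))  # 0.67

public, private = split_public_private(list(range(10)))
print(len(public), len(private))  # 3 7
```

A leak rate well above what chance would explain suggests the ID column should be dropped or regenerated before release. Keeping the private part hidden until the end means that repeated submissions against the public benchmark reveal nothing about the data used for the final score.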
Recommendation 4. Spread the word about your competition
One of the benefits of running a competition is that you get access to thousands of data scientists, from beginners to superstars, who brainstorm various solutions to the challenge. Playing with data is fun and participating in competitions is a great way to validate and improve skills, show proficiency and look for customers. Spreading the word about your challenge is almost as important as designing the rules and preparing the data.
- Scenario: A state administration is in need of a predictive model. It has come up with some attractive prizes and published the upcoming challenge for data scientists on its website. As these steps may not yield the results it’s looking for, it decides to sponsor a Kaggle competition to draw thousands of data scientists to the problem.