It was far and away the most popular Kaggle competition to date, attracting more than 8,000 data scientists globally. The deepsense.ai team of Paweł Godula (team leader and deepsense.ai’s Director of Customer Analytics), Michał Bugaj and Aliaksandr Varashylau took fifth place overall and first place on the public leaderboard.
The goal of the competition, launched by Home Credit Group, was to build a model predicting the probability that a bank’s customer will repay a cash loan (90% of the training data) or an installment loan (10% of the training data). Combining an exciting, real-life challenge with a high-quality dataset, it became the most popular featured competition in Kaggle’s history.
The sandbox raiders
There can be no doubt that being a data scientist is fun: playing with various datasets, finding patterns and exploring the needles hidden in the depths of the digital haystack. This time, the dataset was a marvel to behold. Why?
- The bank behind the competition provided data on roughly 300,000 customers, including details on credit history, property, family status, earnings and geographic location.
- To enrich the dataset, the bank provided information about the customers’ credit history taken from external sources, mostly credit-rating institutions.
- The level of detail provided was astonishing. Participants could analyze the credit history of customers at the level of a single installment of a single loan.
- While the personal data was of course perfectly anonymized, the features were not. This enabled endless feature engineering, which is every data scientist’s dream.
1. Hand-crafting more than 10,000 features
Endless brainstorming, countless creative sessions and discussions gave us more than 10,000 features that could plausibly explain a default on a loan. As most of these features carried largely duplicate information, we used an algorithm for automatic feature selection based on feature importance, carefully keeping only the 2,000 strongest for the final model. This procedure eliminated ~8,000 features and reduced the training time significantly, while improving the cross-validation score at the same time. A heavily tuned, five-fold bagged LightGBM model built on these 2,000 features was our submission’s workhorse; a sketch of the selection step follows.
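A minimal sketch of importance-based selection, assuming a plain LightGBM gain ranking with placeholder hyperparameters (the post does not give the team’s exact procedure or tuned values):

```python
import lightgbm as lgb
import numpy as np

def select_by_importance(X, y, keep=2000, seed=0):
    """Rank features by LightGBM gain importance and keep the strongest.

    Hyperparameters are illustrative placeholders, not the tuned
    competition values.
    """
    booster = lgb.train(
        {"objective": "binary", "metric": "auc", "seed": seed, "verbosity": -1},
        lgb.Dataset(X, label=y),
        num_boost_round=200,
    )
    gain = booster.feature_importance(importance_type="gain")
    top = np.argsort(gain)[::-1][:keep]  # indices of the `keep` strongest features
    return top
```

In practice such a ranking is usually re-checked against the cross-validation score, since dropping correlated features can occasionally remove useful signal.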
2. Using deep learning to extract interactions among different data sources
We wondered how we could capture the interactions between signals coming from different data sources. For example, what if 20 months ago someone was rejected by an external credit bureau, had a late installment payment, and applied for a loan at our branch? These types of interactions are very hard for humans to capture because of the number of possible combinations. So we turned to deep learning and recast the problem as image classification. How? For every month of user history, going as far back as 96 months (8 years was the cutoff in most data sources), we created a single vector of user characteristics coming from the different data sources. We then stacked those vectors into a very sparse “user image” and, finally, fed this image into a neural network. The architecture, sketched in code after the list, was as follows:
- Normalization – division by the global max in every row
- Input (the “user image” of shape n_characteristics x 96 months – we looked 8 years into the past)
- 1-D convolution spanning 2 consecutive months (to see the change between periods)
- Bidirectional LSTM
- Output
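A minimal PyTorch sketch of such a network, assuming illustrative layer sizes and a sigmoid output head (the post does not give the team’s exact dimensions or framework):

```python
import torch
import torch.nn as nn

class UserImageNet(nn.Module):
    """Conv + bidirectional LSTM over the monthly "user image".

    Layer sizes are illustrative assumptions, not the team's exact values.
    """
    def __init__(self, n_characteristics: int, hidden: int = 64):
        super().__init__()
        # 1-D convolution over the time axis, spanning 2 consecutive months
        self.conv = nn.Conv1d(n_characteristics, hidden, kernel_size=2)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_characteristics, 96), each row already divided
        # by its global max (the normalization step above)
        h = torch.relu(self.conv(x))      # (batch, hidden, 95)
        h = h.transpose(1, 2)             # (batch, 95, hidden) for the LSTM
        out, _ = self.lstm(h)             # (batch, 95, 2 * hidden)
        return torch.sigmoid(self.head(out[:, -1]))  # default probability
```

The single output probability of this network later enters the final model as one extra feature.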
3. Using nested models
One of the things that bothered us throughout the competition was the somewhat arbitrary nature of the various group-bys we performed on the data while hand-crafting features. For example, we supposed that an overdue installment from five years ago would matter less than one from just a month ago. But what is the exact relationship? The traditional way is to test different thresholds against the cross-validation score, but there is a more elegant way to let the model figure it out. We built a set of “limited-power” models, each using only a single source of data (for example, only the credit card history); their predictions later entered the final model as features (see the sketch after the AUC list below). The purpose was to force each model to find all possible relationships in its data source, even at the cost of accuracy. Below are the AUC (area under the ROC curve) scores we got from models using only one data source:
- Previous application: 0.63
- Credit card balance: 0.58
- POS cash balance: 0.54
- Installment payments: 0.58
- Bureau: 0.61
- Bureau balance: 0.55
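A minimal sketch of one such nested model, assuming its out-of-fold predictions are used as the downstream feature (a standard way to avoid leakage; the post does not spell out the team’s exact scheme) and placeholder LightGBM parameters:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_model_feature(X_source, y, n_splits=5, seed=0):
    """Out-of-fold predictions of a model trained on a single data source
    (e.g. only credit card balance), later used as one feature of the
    main model. X_source is a numpy array; parameters are illustrative.
    """
    oof = np.zeros(len(y))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X_source, y):
        booster = lgb.train(
            {"objective": "binary", "metric": "auc", "seed": seed, "verbosity": -1},
            lgb.Dataset(X_source[train_idx], label=y[train_idx]),
            num_boost_round=200,
        )
        # Predict only on the held-out fold so the feature carries no leakage
        oof[valid_idx] = booster.predict(X_source[valid_idx])
    return oof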
The final model’s feature set thus combined three groups (assembled in the sketch below):
- More than 2,000 hand-crafted features, selected out of the 10,000 created during brainstorming and creative sessions
- One feature from the neural network described in Step 2
- Around 40-50 features coming from the “nested models” described in Step 3
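Putting it together, a toy sketch of assembling the three feature groups; the shapes and names here are purely illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # illustrative number of customers

# Hypothetical stand-ins for the three feature groups described above
X_handcrafted = rng.normal(size=(n, 2000))   # selected hand-crafted features
nn_feature = rng.uniform(size=(n, 1))        # single output of the Step 2 network
X_nested = rng.uniform(size=(n, 45))         # nested-model features from Step 3

# The final five-fold bagged LightGBM model trains on their concatenation
X_final = np.hstack([X_handcrafted, nn_feature, X_nested])
```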