
Crime forecasting – ‘Minority Report’ realized

September 28, 2017/in Data science, Machine learning /by Patryk Miziuła

Everybody who has watched ‘Minority Report’, Steven Spielberg’s movie based on Philip K. Dick’s short story, daydreams about crime forecasting in the real world. We have good news: machine learning algorithms can do just that!

In September 2016, the National Institute of Justice in the US announced the Real-Time Crime Forecasting Challenge. The goal was to predict future crimes in the city of Portland, OR. CodiLime, deepsense.ai’s parent company, took part in it, giving the job to our machine learning team. The results were revealed in August 2017: we did a great job and won eight out of 40 sub-competitions! In this post we describe the crime forecasting algorithms we used.

Competition rules

Fortunately, the NIJ didn’t ask contestants to carve names of forthcoming criminals and victims into wooden balls, as was the case in the movie. Instead, they wanted to know the hotspots – small areas with the greatest ‘intensity’ of future crimes.

Frame from ‘Minority Report’: a red ball means a murder of passion. What color should the ball that predicts a tax fraud of passion be?

Three different types of crimes were considered separately: burglary, car theft and street crimes (including assaults, robberies, shots fired). Additionally, all the crimes together were of interest as well.
The end of February 2017 was the deadline and five future timespans were involved:

The first week of March 2017,
The first two weeks of March 2017,
All of March 2017,
March and April 2017,
March, April and May 2017.

Thus, we had to make 4 x 5 = 20 individual crime forecasts for 20 type/time categories (e.g. ‘burglary, two weeks’).
Once May 2017 was over, our hotspot predictions in each of the 20 type/time categories were compared against the actual state of affairs in Portland using two independent metrics:

‘crime density’: the number of crimes that occurred in hotspots divided by the total area of the hotspots,
‘prediction efficiency’: the number of crimes that occurred in hotspots divided by the number of crimes in the actual worst regions with the same total area as our hotspots.
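
Both metrics can be sketched on a grid of equal-sized cells. This is a simplified illustration rather than the organizer's scoring code: the grid, the cell area and the function names are assumptions made for the example.

```python
import numpy as np

def crime_density(counts, hotspot_mask, cell_area):
    """Crimes inside the hotspots divided by the total hotspot area."""
    hits = counts[hotspot_mask].sum()
    return hits / (hotspot_mask.sum() * cell_area)

def prediction_efficiency(counts, hotspot_mask):
    """Crimes inside the hotspots divided by the crimes in the best possible
    selection of the same number of cells (the actual worst regions)."""
    hits = counts[hotspot_mask].sum()
    k = int(hotspot_mask.sum())
    best = np.sort(counts.ravel())[::-1][:k].sum()
    return hits / best

# Toy example: actual crime counts per cell and a forecast of two hotspots.
counts = np.array([[5, 0, 1],
                   [0, 3, 0],
                   [2, 0, 0]])
hotspots = np.array([[True, False, False],
                     [False, False, False],
                     [True, False, False]])
density = crime_density(counts, hotspots, cell_area=0.5)
efficiency = prediction_efficiency(counts, hotspots)
```

In this toy case the forecast catches 7 crimes, while the two truly worst cells hold 8, so the forecast is good but not perfect on the second metric.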

Hence, the competition consisted of 4 x 5 x 2 = 40 separate sub-competitions in total (e.g. ‘burglary, two weeks, crime density’). The winner took it all in each of them and the all was $15,000. So, there was $600,000 in the pot – a good motivation to work!
To be clear, three independent clones of the Real-Time Crime Forecasting Challenge were run simultaneously. The one we took part in was intended for large businesses. Of the remaining two, one was run for small businesses and the other for students. Every clone had the same rules and goals, but its own contestants, winners and prizes.

Our solution

Data

In ‘Minority Report’, the Precrime Police unit got their crime forecasts from Precogs, three mutated humans who could see into the future. At deepsense.ai, our Precrime unit created the predictions based on the past.

The organizer delivered historical data with all the crimes registered in Portland between March 2012 and February 2017. Almost 1,000,000 records were provided in total. Each of them contained daytime, place (with accuracy to one foot!) and the type of crime committed.
Our first question was: since we have no Precogs onboard, can we use anything other than historical data? What could affect future crimes, but hadn’t left a trace on those that had already been committed? Well, in our opinion these could only be future events. But are they easier to predict than crimes themselves? For instance, one can page through local newspapers seeking sentences like ‘A new gin mill is going to be opened in March 2017. The crime rate will certainly rise there.’ However, such research requires a lot of work and there is no guarantee it’ll actually help. So we decided to squeeze as much as we could out of the historical data alone.

Blind contest

No leaderboard was run during the contest. We didn’t know how many competitors we had and how honed their crime forecasting skills were. The only thing we could do to win was improve our own results over and over.

The first attempts showed us that in each of 20 type/time categories the ‘crime density’ metric was maximized by a lot of small hotspots whereas the ‘prediction efficiency’ performed best for a small number of large hotspots. Hence it was clear that we couldn’t satisfy both metrics simultaneously. Since each metric formed an independent sub-competition with a separate prize, it was better to have a good score for one metric than mediocre results for both. So, for each of the 20 type/time categories we had to decide which metric to focus on in our further work.
Which metric to choose when the metrics are incomparable, scores between categories are incomparable and you don’t know other competitors’ results? We checked that under some reasonable assumptions the best strategy is to just toss a coin; and this is what we did, 20 times – once per type/time category.

Bad neighborhoods remain bad

The major rule we followed while building our models was rather pessimistic: ‘if many crimes have occurred somewhere, more are likely to happen.’ This principle may strike some as naive, but the longer we explored the data, the more confident we were that it worked.

Not every past crime is equally important. We took advantage of the aging and seasonality of the data. We focused more on data from 2017 and 2016 than on older data. Also, we boosted the significance of crimes committed in the same season as the forecast period. For instance, to make predictions for March 2017 we paid special attention to data from preceding Marches.
Moreover, as we know, evil is prone to ‘radiate’. When a crime is committed, we can expect others to happen nearby. This is why we decided to ‘diffuse’ the data points. For those who like statistical jargon, we note that this technique is called kernel density estimation.
However, we didn’t set the ‘intensities’ of data aging, seasonality and diffusion by hand. They were adjusted by our algorithm automatically. How did it know how to do that, you ask? As always in machine learning, it just chose them to obtain the best results! For each of the 20 type/time categories we separated the last period of historical data as a validation dataset (e.g. February 2017 for a forecast of March 2017). The algorithm used all but the validation data to check which parameters best predicted the crimes from the validation set. Then, ultimately, it took all the available data to prepare the final crime forecast.
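
Aging, seasonality and diffusion can be combined in a weighted kernel density estimate. The sketch below is only an illustration under stated assumptions: the exponential age decay, the multiplicative seasonal boost, the Gaussian kernel and all parameter names (`half_life_days`, `season_boost`, `bandwidth`) are hypothetical and are not taken from the competition code.

```python
import numpy as np

def crime_weights(age_days, month, target_months, half_life_days, season_boost):
    """Weight of each past crime: newer crimes and crimes from the same
    season as the forecast period count more."""
    aging = 0.5 ** (age_days / half_life_days)
    seasonal = np.where(np.isin(month, target_months), season_boost, 1.0)
    return aging * seasonal

def intensity_grid(xy, weights, grid_x, grid_y, bandwidth):
    """Weighted Gaussian kernel density estimate evaluated at grid points."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    centres = np.stack([gx.ravel(), gy.ravel()], axis=1)  # (points, 2)
    sq_dist = ((centres[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return (kernel * weights[None, :]).sum(axis=1).reshape(gx.shape)

# Toy data: three past crimes (coordinates in feet), forecasting March.
xy = np.array([[100.0, 100.0], [120.0, 90.0], [900.0, 900.0]])
age_days = np.array([10.0, 370.0, 30.0])
month = np.array([2, 3, 1])
weights = crime_weights(age_days, month, target_months=[3],
                        half_life_days=365.0, season_boost=2.0)
grid = np.linspace(0.0, 1000.0, 11)
intensity = intensity_grid(xy, weights, grid, grid, bandwidth=100.0)
hottest = np.unravel_index(intensity.argmax(), intensity.shape)
```

Parameters of this kind would then be scored on the held-out validation period, and the best-scoring combination kept for the final forecast.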

Neptune

We must say that the Real-Time Crime Forecasting Challenge was also a logistical challenge. We had to manage and improve 40 models simultaneously. To do that we used our own machine learning lab called Neptune. We designed it for precisely this type of task: to easily store, compare and recreate a lot of experiments. To be honest, we can’t imagine how one would handle 40 models without using this tool.

Results

The results were announced in August 2017: in our large-business group we won 8 out of 40 sub-competitions, were the runner-up in 6 more and took third place in yet another 6. This is a big success, but there is something we are especially proud of. We compared crime forecasts from all three clones of the competition: large businesses, small businesses and students, and it turned out that our results would give us the top place in the total ranking! Our team finished with the best predictions in seven sub-competitions, three more than the runner-up managed.
Do you want to see one of our winning crime forecasts? Here it is:

The gray area is Portland, around 15 by 20 miles. 56,000 black dots are all the crimes committed between March and May 2017. The hotspots we chose are blue, but you probably can’t see them, so let’s zoom in on the Downtown:

Crime forecasting: winning forecast zoomed

We indicated 112 hotspots, 294 by 213 ft each. They appear to be placed randomly, but they are not: they lie optimally. This is why machine learning algorithms are so fun: it’s hard to deal with their outputs using common sense, but they work!

Needle in a haystack

The total number of crimes in Portland between March and May 2017 – 56,000 – is impressively big. Another category was on the opposite pole: during the first week of March 2017 only 20 (twenty) burglaries were committed in the investigated area!

If you think that it is hard to hit 20 random events in a 150 mi² area using hotspots with a total area of less than ¾ mi² (the organizer’s requirement), you are absolutely right. In our opinion it was a matter of luck. We did indeed hit one burglary, but it wasn’t enough to win this category.
But there was another way. The number of crimes, 20, is so small that hypothetically any cheater could simply change history and ensure victory by arranging a burglary or two in fixed places. Of course we didn’t do that, and we think that nobody did, since 20-25 is a typical number of weekly burglaries in Portland. Experienced data scientists wouldn’t try this hoax because they’d know that if they weren’t the only ones doing so, they wouldn’t benefit from this highly risky move. And, above all, they tend to spend their time doing data science rather than plotting fake crimes – being honest is usually the simpler way for us. However, in the ‘Minority Report’ universe a wooden ball would inform us about any bad intentions. In our world we just believe in people… or we can predict their behavior using machine learning algorithms!

Summary

If you’ve enjoyed our post or want to ask about anything related to crime forecasting (or maybe demand forecasting?), please leave us a reply!

Human log loss for image classification

September 12, 2017/in Data science, Deep learning /by Piotr Migdal

Deep learning vs human perception: creating a log loss benchmark for industrial & medical image classification problems (e.g. cancer screening).

In the last few years convolutional neural networks have outperformed humans in most visual classification tasks. But there is one caveat – usually they win by a small margin:


Chart from Measuring the Progress of AI Research by EFF (CC BY-SA).

There are a few exceptions to this rule, including the individual whale identification contest our deepsense.ai team won (85% accuracy for almost 500 different whales). However, one could argue that a computer’s pattern recognition skills were similar to a human’s, but it was easier for the computer to memorize 500 different whale specimens.
When we train our deep learning model for an image classification task, how do we know if it is performing well? One way to approach this problem is to compare it against a human performance benchmark. We expect human errors even for simple tasks – labels can be misclassified and the person performing the classification is likely to make mistakes from time to time: after all, “to err is human”.
Creating benchmarks for medical and industrial problems is even more challenging, because:

we don’t have a simple sanity check (unlike distinguishing dogs from cats),
it is not obvious if a single photo devoid of any context is enough for classification (very often even for the best specialists it is not).

Measuring human accuracy for medical images

In the Cervical Cancer Screening Kaggle competition (by Intel & MobileODT), the goal was to predict one of three classes of cervical openings for each patient. To see this image classification task, visit (warning: explicit medical images) Cervix type classification or this Short tutorial on how to (humanly) recognize cervix types.
Our networks in this competition were just a bit better than a random model. We wanted to quantify human performance to see if our networks were bad, or if it is simply impossible to do much better. To measure human accuracy, we sampled 50 cervix images. This number sounded like a reasonable trade-off: high enough for some estimations, but small enough not to exhaust us. Unlike Andrej Karpathy, who set the human benchmark for ImageNet, we avoided going through the whole dataset. We looked at them ourselves and also gave them to two medical doctors, one a gynecologist. The task was to predict the cervix opening class for each image. The accuracy was as follows:

At least most of us did better than using the majority class prediction – i.e. assigning each image to the most numerous class. The medical doctors didn’t outperform the rest of us. That may seem surprising, but there is a common phenomenon at work here: many visual tasks don’t require much knowledge, just good pattern recognition (even pigeons can detect cancer from photos). Even less unexpected is the wisdom of crowds – an ensemble model (which involves members of “the crowd” voting) significantly outperformed each individual prediction.

Translating categorical predictions into log loss

However, many machine learning tasks use another measure of error – log loss (also known as cross-entropy), which takes into account our uncertainty (e.g. it is better to predict the correct class with 90% certainty than with 51%). It is especially important for problems with imbalanced classes. If we want to use the same prediction for a group of items, to minimize log loss we need to use empirical probabilities for the sample. For the whole sample of cervixes, those probabilities would be (18%, 52%, 30%) for classes 1, 2 and 3, respectively, resulting in a log loss of 1.01.
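
The baseline figure above can be checked in a few lines. This is a minimal sketch that assumes the natural logarithm and uses only the class shares quoted in the text.

```python
import numpy as np

# Class shares in the 50-image sample, as reported above.
p = np.array([0.18, 0.52, 0.30])

# Predicting p for every image gives a log loss equal to the entropy of p
# (natural logarithm).
baseline_log_loss = -(p * np.log(p)).sum()
print(round(float(baseline_log_loss), 2))  # 1.01
```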
To measure human log loss we need to ask people to predict a probability distribution for each image, e.g. (20%, 70%, 10%). However, this task is time-consuming and can be difficult to explain to non-data scientists, in this case medical doctors. Humans are notoriously bad at assigning probabilities, so this approach would most likely need calibration anyway.
Fortunately, there is a very simple probability calibration technique, which takes discrete predictions as its input. Here’s the recipe:

predict a class for each image (here: 1, 2 or 3)
for all instances with the same predicted class, calculate the empirical distribution of the ground truth values
turn discrete predictions into the respective distributions
given the predictions, calculate the log loss

For example, for Michał (the project leader and top guesser), it is:

So, whenever he predicted class 1, we assign it to the (3/7, 4/7, 0/7) probability distribution, for class 2 – (5/26, 18/26, 3/26) and for class 3 – (1/17, 4/17, 12/17). His log loss on the same dataset is 0.78. This procedure is equivalent to calculating the conditional entropy of cervix classes, given our prediction, that is: H(ground_truth | our_prediction). See also the Wikipedia page on mutual information and its relation to conditional entropy.
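
The recipe can be traced step by step with the counts implied by these fractions. This is a minimal sketch: the matrix layout (rows for predicted classes, columns for ground truth) and the variable names are choices made for the example.

```python
import numpy as np

# Rows: predicted class 1, 2, 3; columns: ground-truth class 1, 2, 3.
# Counts are the numerators of the distributions quoted above.
counts = np.array([[3, 4, 0],
                   [5, 18, 3],
                   [1, 4, 12]], dtype=float)

# Each predicted class becomes the empirical distribution of the ground
# truth among the images that received that prediction.
calibrated = counts / counts.sum(axis=1, keepdims=True)

# Log loss: average negative log-probability assigned to the true class.
# Cells with zero count contribute nothing, so they are masked out.
mask = counts > 0
log_loss = -(counts[mask] * np.log(calibrated[mask])).sum() / counts.sum()
```

With these counts, `log_loss` comes out at roughly 0.78, matching the value reported above.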

Human predictions vs Kaggle results

Here we calculate the conditional entropy for each participant:

How does this compare to the final Kaggle results? Ultimately, the top winner had a log loss of 0.77, whereas our (artificial, not biological) neural network ensemble returned 0.84. Neither of those beat us humans, with our log loss of 0.73.
During the competition phase we had seen entries with a log loss as low as 0.4. We hadn’t known if the authors had used a clever approach, or if it had simply been an overfit. After our human test, however, we rightly assumed that it was indeed a large overfit. It seems that this technique may be useful for setting a reasonable benchmark for your next image classification problem – whether it’s a Kaggle challenge or a project for your customer.

Additional materials

Remarks

This technique may underestimate log loss (in general, there is no unbiased estimator for Shannon entropy). A more educated way (but one requiring more samples) would be to use cross-validation. To avoid ending up with zero probabilities, smoothing probabilities may be crucial.
We can use different classes for guessing than when we want to predict, for example, using the option “I don’t know”. The magic of methods related to entropy is that they are label-insensitive.
To learn more about entropy, read the first two chapters of Thomas M. Cover, Joy A. Thomas, Elements of Information Theory.

Code snippet

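import numpy as np
from sklearn.metrics import confusion_matrix

# label - ground truth labels (array-like, one entry per image)
# prediction - predicted labels (array-like, same length as label)

def entropy(x, epsilon=1e-6):
    # assumes x is normalized; epsilon avoids log(0) for empty cells
    return (- x * np.log(x + epsilon)).sum()

def conditional_entropy(mat):
    # mat[i, j] counts items with true class i and predicted class j
    mat = mat / mat.sum()
    # H(truth | prediction) = H(truth, prediction) - H(prediction)
    return entropy(mat) - entropy(mat.sum(axis=0))

print(conditional_entropy(confusion_matrix(label, prediction)))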

Project members: Michał Tadeusiak (leader), Grzegorz Łoś, Patryk Miziuła, Dorota Kowalska, Piotr Migdał.
Thanks also to Robert Bogucki, Paweł Subko and Agata Chęcińska for valuable remarks on the draft.

How to create a product recognition solution

August 22, 2017/in Data science, Deep learning, Machine learning, Neptune /by Krzysztof Dziedzic and Patryk Miziuła

Product recognition is a challenging area that offers great financial promise. Automatically detected product attributes in photos should be easy to monetize, e.g., as a basis for cross-selling and upselling.

However, product recognition is a tough task because the same product can be photographed from different angles, in different lighting, with varying levels of occlusion, etc. Also, different fine-grained product labels, such as ones in royal blue or turquoise, may prove difficult to distinguish visually. Fortunately, properly tuned convolutional neural networks can effectively resolve these problems.
In this post, we discuss our solution for the iMaterialist challenge announced by CVPR and Google and hosted on Kaggle in order to show our approach to product recognition.

The problem

Data and goal

The iMaterialist organizer provided us with hyperlinks to more than 50,000 pictures of shoes, dresses, pants and outerwear. Some tasks were attached to every picture and some labels were matched to every task. Here are some examples:

product recognition: exemplary picture of dress

task	labels
dress: occasion	wedding party, cocktail party, cocktail, party, formal, prom
dress: length	knee
dress: color	dark red, red

product recognition: exemplary picture of outerwear

task	labels
outerwear: age	adult
outerwear: type	blazers
outerwear: gender	men
pants: color	brown

product recognition: exemplary picture of pants

task	labels
pants: material	jeans, denim, denim jeans
pants: color	blue, blue jeans, denim blue, light blue, light, denim
pants: type	jeans
pants: age	adult
pants: decoration	men jeans
pants: gender	men

product recognition: exemplary picture of shoes

task	labels
shoe: color	dark brown
shoe: up height	kneehigh
pants: color	black

Our goal was to match a proper label to every task for every picture from the test set. From the machine learning perspective this was a multi-label classification problem.

There were 45 tasks in total (about a dozen per clothing type) and we had to predict a label for all of them for every picture. However, tasks not attached to a particular test image were skipped during the evaluation. In practice, usually only a few tasks were relevant to a picture.

Problems with data

There were two main problems with data:

We weren’t given the pictures themselves, but only the hyperlinks. Around 10% of them were expired, so our dataset was significantly smaller than the organizer had intended. Moreover, the hyperlinks were a potential source of a data leak. One could use text-classification techniques to take advantage of leaked features hidden in hyperlinks, though we opted not to do that.
Some labels with the same meaning were treated by the organizer as different, for example “gray” and “grey”, “camo” and “camouflage”. This introduced noise in the training data and distorted the training itself. Also, we had no choice but to guess if a particular picture from the test set was labeled by the organizer as either “camo” or “camouflage”.

Evaluation

The evaluation score function was the average error over all test pictures and relevant tasks. A score value of 0 meant that all the relevant tasks for all the test pictures were properly labeled, while a score of 1 implied that no relevant task for any picture was labeled correctly. A random sample submission provided by the organizer yielded a score greater than 0.99. Hence we knew that a good result couldn’t be achieved by accident and we would need a model that could actually learn how to solve the problem.

Our solution

A bunch of convolutional neural networks

Our solution consisted of about 20 convolutional neural networks. We used the following architectures in several variants:

All of them were initialized with weights pretrained on the ImageNet dataset. Our models also differed in terms of the data preprocessing (cropping, normalizing, resizing, switching of color channels) and augmentation applied (random flips, rotations, color perturbations from Krizhevsky’s AlexNet paper). All the neural networks were implemented using the PyTorch framework.

Choosing the training loss function

Which loss function to choose for the training stage was one of the major problems we faced. 576 unique pairs of task/label occurred in the training data so the outputs of our networks were 576-dimensional. On the other hand, typically only a few labels were matched to a picture’s tasks. Therefore the ground truth vector was very sparse – only a few of its 576 coordinates were nonzero – so we struggled to choose the right training loss function.
Assume that $(z_1,\ldots,z_{576})\in\mathbb{R}^{576}$ is a model output and
\[y_i=\left\{\begin{array}{ll}1, & \text{if task/label pair } i \text{ matches the picture,}\\0, & \text{elsewhere,}\end{array}\right.\quad\text{for } i=1,2,\ldots,576.\]

As this was a multi-label classification problem, choosing the popular crossentropy loss function:
\[\sum_{i=1}^{576}-y_i\log p_i,\quad \text{where } p_i=\frac{\exp(z_i)}{\sum_{j=1}^{576}\exp(z_j)},\]
wouldn’t be a good idea. This loss function tries to distinguish only one class from others.
Also, for the ‘element-wise binary crossentropy’ loss function:
\[\sum_{i=1}^{576}-y_i\log q_i-(1-y_i)\log(1-q_i),\quad \text{where } q_i=\frac{1}{1+\exp(-z_i)},\]
the sparsity caused the models to end up constantly predicting no labels for any picture.
In our solution, we used the ‘weighted element-wise crossentropy’ given by:
\[\sum_{i=1}^{576}-\bigg(\frac{576}{\sum_{j=1}^{576}y_j}\bigg)\cdot y_i\log q_i-(1-y_i)\log(1-q_i),\quad \text{where } q_i=\frac{1}{1+\exp(-z_i)}.\]
This loss function focused the optimization on positive cases.
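
The weighted loss can be written out for a single picture as follows. This is a NumPy sketch for illustration only (the networks themselves were implemented in PyTorch); the function name is hypothetical and the code assumes at least one positive label per picture.

```python
import numpy as np

def weighted_elementwise_crossentropy(z, y):
    """Weighted element-wise cross-entropy for one picture.

    z: model outputs (logits), y: 0/1 ground-truth vector of the same length.
    Positive terms are scaled by len(y) / (number of positive labels)."""
    # log(sigmoid(z)) and log(1 - sigmoid(z)), computed stably via logaddexp.
    log_q = -np.logaddexp(0.0, -z)
    log_one_minus_q = -np.logaddexp(0.0, z)
    positive_weight = y.size / y.sum()  # assumes at least one positive label
    return (-positive_weight * y * log_q - (1.0 - y) * log_one_minus_q).sum()

# Toy check: all-zero logits and two positive labels out of 576.
z = np.zeros(576)
y = np.zeros(576)
y[[3, 42]] = 1.0
loss = weighted_elementwise_crossentropy(z, y)
```

With only two positive labels, each positive term is multiplied by 288, so the positives weigh about as much in the sum as the 574 negatives combined.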

Ensembling

Predictions from particular networks were averaged, all with equal weights. Unfortunately, we didn’t have enough time to perform any more sophisticated ensembling techniques, like xgboost ensembling.

Other techniques tested

We also tested other approaches, though they proved less successful:

Training the triplet network and then training xgboost models on features extracted via embedding (different models for different tasks).
Mapping semantically equivalent labels like “gray” and “grey” to a common new label and remapping those to the original ones during postprocessing.

Neptune

We managed all of our experiments using Neptune, deepsense.ai’s Machine Learning Lab. Thanks to that, we were easily able to track the tuning of our models, compare them and recreate them.

Results

We achieved a score of 0.395, which means that we correctly predicted more than 60% of all the labels matched to relevant tasks.

We are pleased with this result, though we could have improved on it significantly if the competition had lasted longer than only one month.

Summary

Challenges like iMaterialist are a good opportunity to create product recognition models. The most important tools and tricks we used in this project were:

Playing with training loss functions. Choosing the proper training loss function was a real breakthrough as it boosted accuracy by over 20%.
A custom training-validation split. The organizer provided us with a ready-made training-validation split. However, we believed we could use more data for training so we prepared our own split with more training data while maintaining sufficient validation data.
Using the PyTorch framework instead of the more popular TensorFlow. TensorFlow doesn’t provide an official repository of pretrained models, whereas PyTorch does. Hence working in PyTorch was more time-efficient. Moreover, we determined empirically that, much to our surprise, the same architectures yielded better results when implemented in PyTorch than in TensorFlow.

We hope you have enjoyed this post and if you have any questions, please don’t hesitate to ask!

Running distributed TensorFlow on Slurm clusters

June 26, 2017/in Data science, Deep learning, Machine learning /by Tomasz Grel

In this post, we provide an example of how to run a TensorFlow experiment on a Slurm cluster. Since TensorFlow doesn’t yet officially support this task, we developed a simple Python module for automating the configuration. It parses the environment variables set by Slurm and creates a TensorFlow cluster configuration based on them. We’re sharing this code along with a simple image recognition example on CIFAR-10. You can find it in our github repo.

But first, why do we even need distributed machine learning?

Distributed TensorFlow

When machine learning models are developed, training time is an important factor. Some experiments can take weeks or even months on a single machine. Shortening this time enables us to try out more approaches, test many similar models and use the best one. That’s why it’s useful to use multiple machines for faster training.
One of TensorFlow’s strongest points is that it’s designed to support distributed computation. To use multiple nodes, you just have to create and start a tf.train.Server and use a tf.train.MonitoredTrainingSession.

Between Graph Replication

In our example we’re going to be using a concept called ‘Between Graph Replication’. If you’ve ever run MPI jobs or used the ‘fork’ system call, you’ll be familiar with it.
In Distributed TensorFlow, Between Graph Replication means that when several processes are being run on different machines, each process (worker) runs the same code and constructs the same TensorFlow computational graph. However, each worker uses a discriminator (the worker’s I.D., for example) to execute instructions differently from the rest (e.g. process different batches of the training data).
This information is also used to make processes on some machines work as ‘Parameter Servers’. These jobs don’t actually run any computations – they’re only responsible for storing the weights of the model and sending them over the network to other processes.

Connections between tasks in a distributed TensorFlow job with 3 workers and 2 parameter servers.

Apart from the worker I.D. and the job type (normal worker or parameter server), TensorFlow also needs to know the network addresses of other workers performing the computations. All this information should be passed as configuration for the tf.train.Server. However, keeping track of it all in addition to starting multiple processes on multiple machines with different parameters can be really tedious. That’s why we have cluster managers, such as Slurm.

Slurm

Slurm is a workload manager for Linux used by many of the world’s fastest supercomputers. It provides the means for running computational jobs on multiple nodes, queuing the jobs until sufficient resources are available and monitoring jobs that have been submitted. For more information about Slurm, you can read the official documentation here.
When running a Slurm job you can discover other nodes taking part by examining environment variables:

SLURMD_NODENAME – name of the current node
SLURM_JOB_NODELIST – list of all nodes allocated to the job
SLURM_JOB_NUM_NODES – number of nodes the job is using

Our Python module parses these variables to make using distributed TensorFlow easier. With the tf_config_from_slurm function you can automate this process. Let’s see how it can be used to train a simple CIFAR-10 model on a CPU Slurm cluster.
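
To give a feel for what such parsing involves, here is a simplified sketch. It is not the implementation of tf_config_from_slurm: the helper names `expand_nodelist` and `cluster_from_env`, the port 2222, and the rule that the first nodes become parameter servers are all assumptions, and the node-list parser only handles a plain comma-separated list or a single bracketed group.

```python
import re

def expand_nodelist(nodelist):
    """Expand a compact node list such as 'node[01-03,07]' into node names.
    Simplified: one bracketed group or a plain comma-separated list."""
    match = re.fullmatch(r"([^\[\],]+)\[([^\]]+)\]", nodelist)
    if match is None:
        return nodelist.split(",")
    prefix, body = match.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep zero padding, e.g. '01'
            for number in range(int(start), int(end) + 1):
                names.append(f"{prefix}{number:0{width}d}")
        else:
            names.append(prefix + part)
    return names

def cluster_from_env(env, ps_number=1, port=2222):
    """Return (cluster dict, job name, task index) for the current node."""
    nodes = expand_nodelist(env["SLURM_JOB_NODELIST"])
    if len(nodes) != int(env["SLURM_JOB_NUM_NODES"]):
        raise ValueError("node list does not match SLURM_JOB_NUM_NODES")
    hosts = [f"{name}:{port}" for name in nodes]
    cluster = {"ps": hosts[:ps_number], "worker": hosts[ps_number:]}
    my_index = nodes.index(env["SLURMD_NODENAME"])
    if my_index < ps_number:
        return cluster, "ps", my_index
    return cluster, "worker", my_index - ps_number

# Example with a hand-written environment instead of os.environ.
example_env = {"SLURM_JOB_NODELIST": "node[01-03]",
               "SLURM_JOB_NUM_NODES": "3",
               "SLURMD_NODENAME": "node02"}
cluster, job_name, task_index = cluster_from_env(example_env, ps_number=1)
```

Inside a real Slurm job, `os.environ` would be passed in place of the hand-written dictionary.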

Distributed TensorFlow on Slurm

In this section we’re going to show you how to run TensorFlow experiments on Slurm. A complete example of training a convolutional neural network on the CIFAR-10 dataset can be found in our github repo, so you might want to take a look at it. Here we’ll just examine the most interesting parts.
Most of the code responsible for training the model comes from this TensorFlow tutorial. The modifications allow the code to be run in a distributed setting on the CIFAR-10 dataset. Let’s examine the changes one by one.

Starting the Server

import sys

import tensorflow as tf
from tensorflow_on_slurm import tf_config_from_slurm

# Build the cluster description from the Slurm environment (1 parameter server)
cluster, my_job_name, my_task_index = tf_config_from_slurm(ps_number=1)
cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(server_or_cluster_def=cluster_spec,
                         job_name=my_job_name, task_index=my_task_index)

# Parameter servers only serve variables: block here and skip the training code
if my_job_name == 'ps':
    server.join()
    sys.exit(0)

Here we import our Slurm helper module and use it to create and start the tf.train.Server. The tf_config_from_slurm function returns the cluster spec necessary to create the server along with the task name and task index of the current job. The ‘ps_number’ parameter specifies how many parameter servers to set up (we use 1). All other nodes will be working as normal workers and everything gets passed to the tf.train.Server constructor.
Afterwards we immediately check whether the current job is a parameter server. Since all the work in a parameter server (ps) job is handled by the tf.train.Server (which is running in a separate thread), we can just call server.join() and not execute the rest of the script.

Placing the Variables on a parameter server

def weight_variable(shape):
    with tf.device("/job:ps/task:0"):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)
def bias_variable(shape):
    with tf.device("/job:ps/task:0"):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial)

These two functions are used when defining the model parameters. Note the with tf.device("/job:ps/task:0") statements telling TensorFlow that the variables should be placed on the parameter server, thus enabling them to be shared between the workers. The “0” index denotes the I.D. of the parameter server used to store the variable. Here we’re only using one server, so all the variables are placed on task “0”.

Optimizer

loss = tf.reduce_mean(cross_entropy)
opt = tf.train.AdamOptimizer(1e-3)
opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=len(cluster['worker']),
                                     total_num_replicas=len(cluster['worker']))
is_chief = my_task_index == 0
sync_replicas_hook = opt.make_session_run_hook(is_chief)
train_step = opt.minimize(loss, global_step)

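Instead of using the usual AdamOptimizer directly, we’re wrapping it with the SyncReplicasOptimizer. In distributed training, network delays mean that a worker may compute gradients from weights that other workers have already updated, and applying such stale gradients makes the model harder to train. The SyncReplicasOptimizer prevents this by aggregating the gradients from all the workers before applying a single update.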

Creating the session

sync_replicas_hook = opt.make_session_run_hook(is_chief)
sess = tf.train.MonitoredTrainingSession(master=server.target,
                                         is_chief=is_chief,
                                         hooks=[sync_replicas_hook])
batch_size = 64
max_epoch = 10000

In distributed settings we’re using the tf.train.MonitoredTrainingSession instead of the usual tf.Session. This ensures the variables are properly initialized. It also allows you to restore a previously saved model and control how the summaries and checkpoints are written to disk.

Training

During training, we split the batches between the workers so that each worker has its own subset of batches to train on:

for i in range(max_epoch):
    batch = mnist.train.next_batch(batch_size)
    if i % len(cluster['worker']) != my_task_index:
        continue
    _, train_accuracy, xentropy = sess.run([train_step, accuracy, cross_entropy],
                                           feed_dict={x: batch[0], y_: batch[1],
                                           keep_prob: 0.5})
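The round-robin rule used in the loop can be checked in isolation. The sketch below is only an illustration (steps_for_worker is a hypothetical helper, not part of our module). Note that the subsets are disjoint only if every worker draws its batches in the same order, which is an assumption here.

```python
def steps_for_worker(num_steps, num_workers, task_index):
    """Return the loop iterations on which the given worker actually trains."""
    return [i for i in range(num_steps) if i % num_workers == task_index]


num_workers = 3
# Map each worker index to the iterations it does not skip.
schedule = {w: steps_for_worker(9, num_workers, w) for w in range(num_workers)}
print(schedule)
```

Every iteration index is assigned to exactly one worker, so no batch position is trained on twice.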

Summary

We hope this example was helpful in your experiments with TensorFlow on Slurm clusters. If you’d like to reproduce it or use our Slurm helper module in your experiments, don’t hesitate to clone our GitHub repo.

Machine learning application in automated reasoning

May 16, 2017/in Data science, Deep learning, Machine learning /by Przemyslaw Chojecki

It all started with mathematics – rigorous thinking, science, technology. Today’s world is maths‑driven. Despite recent advances in deep learning, the way mathematics is done today is still much the same as it was 100 years ago. Isn’t it time for a change?

Introduction

Mathematics is at the core of science and technology. However, the growing amount of mathematical research makes it impossible for non‑experts to fully use the developments made in pure mathematics. Research has become more complicated and more interdependent.
Moreover, it is often impossible for non‑experts to verify correctness: a result is accepted as knowledge by a small group of experts (e.g. the problem with accepting Mochizuki’s proof of the abc conjecture, which other experts have been unable to follow).

Fig. 1: The graph on the left shows the growing number of submissions to arXiv – an Internet repository for scientific research. In 2012, mathematics accounted for approx. 20,000 submissions annually.

Automated reasoning

To address the issue mentioned above, researchers try to automate or semi‑automate:

Producing mathematics
Verifying existing mathematics

This domain of science is called automatic theorem proving and is a part of automated reasoning. The current approach to automation is:

Take a mathematical work (e.g. Feit‑Thompson theorem or proof of Kepler’s conjecture)
Rewrite it in Coq, Mizar or another Interactive Theorem Prover (language/program which understands logic behind mathematics and is able to check its correctness)
Verify

The downside to this approach is that it is purely manual work and quite a tedious process! One has to fill in the gaps, as the human way of writing mathematics is different from what Coq/Mizar accepts. Moreover, mathematical work is based on previous works, so one needs to lay down the foundations each time, at least to some extent (but have a look at e.g. the Mizar Math Library).
Once in Coq/Mizar, there is a growing number of methods to prove new theorems:

Hammers and tactics (methods for automatic reasoning over large libraries)
Machine learning and deep learning

Here we concentrate on the last of these methods, machine learning and deep learning. Firstly, in order to use their power, one needs more data. Moreover, to keep up with current mathematical research we need to translate LaTeX into Coq/Mizar much faster.

Building a dictionary

We need to automate translation of human‑written mathematical works in LaTeX to Coq/Mizar. We view it as an NLP problem of creating a dictionary between two languages. How can we build such a dictionary? We could build upon existing syntactic parsers (e.g. TensorFlow’s SyntaxNet) and enhance them with Types and variables, which we explain in an example:
Consider the sentence “Let $G$ be a group”. Then “G” is a variable of Type “group”.
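As a toy illustration of attaching Types to variables, the sketch below extracts such declarations with a regular expression. The pattern and the function name extract_typed_variables are assumptions made for this example only. A real dictionary would build on a syntactic parser rather than a regex.

```python
import re

# Matches "Let $X$ be a/an <type>"; a deliberately naive, assumed pattern.
DECLARATION = re.compile(r"Let \$(\w+)\$ be an? ([a-z][a-z -]*)")


def extract_typed_variables(text):
    """Return a {variable: Type} mapping for every simple declaration found."""
    return {name: type_name.strip() for name, type_name in DECLARATION.findall(text)}


typed = extract_typed_variables("Let $G$ be a group. Let $R$ be a ring.")
print(typed)
```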
Once we have such a dictionary with at least some basic accuracy, we can use it to translate LaTeX into Coq/Mizar sentence by sentence. Nevertheless, we still need a good source of mathematics! Here is what we propose in the DeepAlgebra program:
Algebraic geometry is one of the pillars of modern mathematical research, which is rapidly developing and has a solid foundation (Grothendieck’s EGA/SGA, The Stacks Project). It is “abstract”, hence easier for computers to verify than the analytical parts of mathematics.
The Stacks Project is an open multi‑collaboration on foundations of algebraic geometry starting from scratch (category theory and algebra) up to the current research. It has a well‑organized structure (an easy‑to‑manage dependency graph) and is verified thoroughly for correctness.
The Stacks Project now consists of:

547,156 lines of code
16,738 tags (57 inactive tags)
2,691 sections
99 chapters
5,712 pages
162 slogans

Moreover, it has an API that can be queried for:

Statements (also in LaTeX)
Data for graphs

Below we present a few screenshots.

Fig. 2: One of the lemmas in the Stacks Project. Each lemma has a unique tag (here 01WC), which never changes, even though the number of the lemma may change. Each lemma has a proof and we can access its dependency graphs:

Figs. 3 and 4: Two dependency graphs for Lemma 01WC, which show the structure of the proof together with all the lemmas, propositions and definitions which were used along the way.

Conclusion

Summing up, we propose to treat the Stacks Project as a source of data for NLP research and eventual translation into one of the Interactive Theorem Provers. The first step in the DeepAlgebra program is to build a dictionary (syntactic parser with Types/variables) and then test it on the Stacks Project. This way we would build an “ontology” of algebraic geometry. If that works out, we can verify, modify and test it on arXiv (Algebraic Geometry submissions). We will report on our progress in automated reasoning in future texts.

This text was based on https://arxiv.org/abs/1610.01044
The author presented this material at AITP conference – http://aitp-conference.org/2017/

Region of interest pooling in TensorFlow – example

April 25, 2017/in Data science, Deep learning, Machine learning, Neptune /by Krzysztof Dziedzic, Patryk Miziuła and Błażej Osiński

In the previous post we explained what region of interest pooling (RoI pooling for short) is. In this one, we present an example of applying RoI pooling in TensorFlow. We base it on our custom RoI pooling TensorFlow operation. We also use Neptune to track the performance of our experiment.

Example overview

Our goal is to detect cars in the images. We’d like to construct a network that is able to automatically draw a box around every car.
In our example we deal with car images from the Pascal VOC 2007 dataset. For simplicity we choose only cars not marked as truncated.

Exemplary images from Pascal VOC 2007 dataset

Neptune

We manage our experiment using Neptune. It’s a pretty handy tool:

We track the tuning in real time. In particular, we preview the currently estimated bounding boxes.
We can change model hyperparameters on the fly.
We can easily integrate Neptune with TensorFlow and get all the charts, graphs and summary objects from the TensorFlow graph.
We store the executed experiments in an aesthetic list.

Network architecture

In our example we use the Fast R-CNN architecture.
The network has two inputs:

Batch of images
Batch of potential bounding boxes – RoI proposals
In the Fast R-CNN model RoI proposals are generated via an external algorithm, for example selective search. In our example, we take ground truth bounding boxes from the Pascal annotations and generate more negative bounding boxes ourselves.

The network has two outputs:

Batch of RoI proposals not classified as background (with corrected coordinates)
Probabilities that each RoI proposal contains an object of each of the categories

The network consists of three main parts:

Deep convolutional neural network
- Input: images
- Output: feature map
We use the popular VGG16 network pretrained on the ImageNet dataset.
RoI pooling layer
- Input: feature map, RoI proposals rescaled to the feature map
- Output: max-pooled RoI proposals
Fully connected layer with RoI features
- Input: max-pooled RoI proposals
- Output: corrected RoI proposals, probabilities

RoI pooling in TensorFlow scheme — Fast R-CNN architecture
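To make the RoI pooling step concrete, here is a minimal pure-Python sketch for a single-channel feature map. It is an illustration only: the function roi_max_pool and the floor/ceil binning scheme are assumptions and may differ from the behavior of our custom TensorFlow operation.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1) of a 2D map to out_h x out_w.

    Coordinates are in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    height, width = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin; floor/ceil keeps every bin non-empty.
        r0 = y0 + math.floor(i * height / out_h)
        r1 = y0 + math.ceil((i + 1) * height / out_h)
        row = []
        for j in range(out_w):
            c0 = x0 + math.floor(j * width / out_w)
            c1 = x0 + math.ceil((j + 1) * width / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 4), out_h=2, out_w=2)
print(pooled)
```

Whatever the size of the RoI, the output always has the same fixed shape, which is what allows the fully connected layer to follow.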

We note that our detection task can also be solved with the Faster R-CNN architecture, which works significantly faster :). However, implementing Faster R-CNN requires much more code, so we chose the simpler Fast R-CNN.

Loss function

We tune the network to minimize the loss given by
$loss = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{k_i} \sum_{j=1}^{k_i} loss_{ij}$
where:

$n$ is the number of images in a batch,
$k_i$ is the number of RoI proposals for the image $i$,
$loss_{ij}$ is the loss for the RoI proposal $j$ for the image $i$.

For a single RoI proposal, $loss_{ij}$ is the sum of the classification and regression loss, where:

classification loss is the common cross entropy,

regression loss is a smooth L1 distance between the rescaled coordinates of a RoI proposal and the ground-truth box. The regression loss is computed if the ground-truth box is not categorized as background, otherwise it’s defined as 0.
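The loss described above can be sketched in plain Python as follows. This is an illustration, not the code from our repository: the function names are hypothetical, the background class is assumed to have index 0, and the smooth L1 threshold of 1 follows the Fast R-CNN paper.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 distance summed over the box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for larger differences.
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def roi_loss(class_probs, true_class, pred_box, true_box, background=0):
    """Cross entropy plus regression loss (the latter is 0 for background)."""
    classification = -math.log(class_probs[true_class])
    regression = 0.0 if true_class == background else smooth_l1(pred_box, true_box)
    return classification + regression


def batch_loss(batch):
    """batch holds one list of RoI losses per image: average per image, then over images."""
    per_image = [sum(losses) / len(losses) for losses in batch]
    return sum(per_image) / len(per_image)


example = batch_loss([[1.0, 3.0], [2.0]])
print(example)
```

Averaging within each image first means that an image with many RoI proposals does not dominate the batch loss.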

Implementation details

Prerequisites

To run the code we provide, you need the following software:

CUDA 8,
TensorFlow 1.0 with GPU support,
our custom RoI pooling TensorFlow operation,
OpenCV,
Neptune (version 1.5): apply for our Early Adopters Program or try it immediately with Neptune Go.

Repository

You can download our code from our GitHub repository. It consists of two folders with the following content:

File	Purpose
code
main.py	The script to execute.
fast_rcnn.py	Builds the TensorFlow graph.
trainer.py	Preprocesses data and trains the network.
neptune_handler.py	Contains Neptune utilities.
config.yaml	Neptune configuration file.
get_data.py	Downloads images from Pascal VOC 2007 dataset
data
vgg16-20160129.tfmodel.torrent	References to weights of the pretrained network.

Description

When we run main.py, the script trainer.py first restores the VGG16 network with the pretrained weights. Then it adds the RoI pooling layer and the fully connected layer. Finally, it begins tuning the entire network using the provided images and RoI proposals. It also sends information to Neptune, so we can track the tuning progress in real time.
After cloning the repository, please download the file vgg16-20160129.tfmodel referred to by the torrent file vgg16-20160129.tfmodel.torrent and save it in the data directory. Also, please run the script get_data.py to download needed images:

python get_data.py

Let’s test our RoI pooling in TensorFlow!

We run the script main.py from the code folder by typing:

neptune run -- \
            --im_folder $PWD/../data/images \
            --roidb $PWD/../data/roidb \
            --pretrained_path $PWD/../data/vgg16-20160129.tfmodel

If we want to also use a non-default learning rate value or the number of epochs, we can add:

--learning_rate 1e-03 --num_epochs 200

to the end of the command.
After a while, we can start observing the tuning progress in Neptune:

RoI pooling in TensorFlow - tuning — Tracking the network tuning in Neptune

Moreover, we can display the RoIs fitted to the cars by our network. We could simply send all the processed images, but this would consume a lot of resources. That’s why we decided to make this feature switchable with a simple Neptune action.
To do that, we can go to the Actions tab and click ‘RUN’ to start sending the images.

RoI pooling in TensorFlow - turning on the image sending in Neptune — Turning on the image sending

After that, we can go to the Channels tab and expand the channels ‘region proposals for RoI pooling’ and ‘network detections’ by clicking ‘+’ signs.

Roi pooling in TensorFlow - expanding image channels in Neptune — Expanding image channels

Now we can see the RoIs in real time!

RoI pooling in TensorFlow - RoI preview — RoI proposals preview in Neptune

We can click on the pictures to zoom in. If we want Neptune to stop sending new images, we go to the Actions tab and click ‘RUN’ again.
An exemplary NeptuneGo execution of our script can be found here.

Summary

We hope you enjoy our example of RoI pooling in TensorFlow and the experiment management features offered by Neptune. If you want to comment on our work, don’t hesitate to leave us feedback!

References

R. Girshick, Fast R-CNN, IEEE International Conference on Computer Vision (ICCV), 2015.
S. Ren, K. He, R. Girshick & J. Sun, Faster R-CNN: towards real-time object detection with Region Proposal Networks, Neural Information Processing Systems (NIPS), 2015.
deepsense.ai, Region of interest pooling explained, 2017.

Neptune 1.5 – Python 3 support, simplified CLI, compact view

April 21, 2017/in Data science, Deep learning, Machine learning, Neptune /by Rafał Hryciuk

At the end of April 2017, deepsense.ai released a new version of Neptune, the DevOps platform for data scientists. Neptune 1.5 introduces a range of new features and improvements, including support for Python 3, simplification of Neptune CLI, offline execution, compact view, improved channels and charts, and a number of improvements in the user experience.

Python 3.5 Support

One of the most upvoted requests on our feedback channel has been support for Python 3. We put it on our roadmap, and now, with version 1.5, you can run Neptune experiments using both Python 2.7 and 3.5.
We encourage you to stay active in our feedback forum and vote for features you need in your data science work. This is how you will influence where Neptune goes and how it develops – and ultimately make it more convenient for you.

Simplification of Neptune CLI

Until now, Neptune CLI’s commands were long and complex. With version 1.5, however, convenience has taken center stage as we’ve introduced a host of improvements and simplifications. Click over and have a look at the simplified CLI commands and configuration file in our documentation.
To see how this change could work for you, compare the commands for running our “Flower Species Prediction” example.
In version 1.4:

neptune run flower-species-prediction/main.py --config flower-species-prediction/config.yaml --storage-url /tmp/neptune-iris --paths-to-dump flower-species-prediction

In version 1.5:

neptune run

Offline Execution

Our users often run parts of their experiments using Jupyter Notebook, but the Neptune client library requires communication with our server. Thanks to offline execution, users can disable communication with the server and run their experiments without CLI. Read more about this convenient development here.

Compact View

To make comparing your experiments easier we have introduced a compact view of the experiments table. You can now display more experiment results on your screen, and draw conclusions even faster and more confidently.

Improved Channels and Charts

Neptune 1.5 comes with the new API for channels and charts. Thanks to the new API you will be able to send and display even more data points. Your charts will load faster and more seamlessly. We encourage you to give the improved channels and charts a go.

The Neptune Pipeline

We are already working on the next version of Neptune, which is slated for a May release and will focus on better displaying the experiments list.
We hope you will enjoy working with our DevOps platform for data scientists. Neptune 1.5 will help you manage and monitor your machine learning experiments even more conveniently.
Would you like to test drive Neptune? Visit NeptuneGo!, have a look around and run your first experiments.

Deep learning for satellite imagery via image segmentation

April 12, 2017/in Data science, Deep learning, Machine learning /by Arkadiusz Nowaczynski

In the recent Kaggle competition Dstl Satellite Imagery Feature Detection our deepsense.ai team won 4th place among 419 teams. We applied a modified U-Net – an artificial neural network for image segmentation. In this blog post we wish to present our deep learning solution and share the lessons that we have learnt in the process with you.

Competition

The challenge was organized on the Kaggle platform by the Defence Science and Technology Laboratory (Dstl), an Executive Agency of the United Kingdom’s Ministry of Defence. As a training set, they provided 25 high-resolution satellite images representing 1 km² areas. The task was to locate 10 different types of objects:

Buildings
Miscellaneous manmade structures
Roads
Tracks
Trees
Crops
Waterway
Standing water
Large vehicles
Small vehicles

Sample image from the training set with labels.

These objects were not completely disjoint – you can find examples with vehicles on roads or trees within crops. The distribution of classes was uneven: from very common ones, such as crops (28% of the total area) and trees (10%), to much rarer ones, such as roads (0.8%) or vehicles (0.02%). Moreover, most images contained only a subset of the classes.

Correctness of prediction was calculated using Intersection over Union (IoU, known also as Jaccard Index) between predictions and the ground truth. A score of 0 meant complete mismatch, whereas 1 – complete overlap. The score result was calculated for each class separately and then averaged. For our solution the average IoU was 0.46, whereas for the winning solution it was 0.49.
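For intuition, the metric can be sketched on pixel masks as below. This is a simplification: the competition evaluated polygons rather than pixel grids, the function names are hypothetical, and treating two empty masks as a perfect match is a convention assumed here.

```python
def jaccard(pred_mask, true_mask):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(per_class_masks):
    """Average the per-class IoU over (prediction, ground truth) pairs."""
    scores = [jaccard(pred, true) for pred, true in per_class_masks]
    return sum(scores) / len(scores)


pred = [[1, 1], [0, 0]]
true = [[1, 0], [1, 0]]
print(jaccard(pred, true))
```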

Preprocessing

For each image we were given three versions: grayscale, 3-band and 16-band. Details are presented in the table below:

Type	Wavebands	Pixel resolution	#channels	Size
grayscale	Panchromatic	0.31 m	1	3348 x 3392
3-band	RGB	0.31 m	3	3348 x 3392
16-band	Multispectral	1.24 m	8	837 x 848
16-band	Short-wave infrared	7.5 m	8	134 x 136

We resized and aligned the 16-band channels to match those from the 3-band images. Alignment was necessary to remove shifts between channels. Finally, all channels were concatenated into a single 20-channel input image.

Model

Our fully convolutional model was inspired by the family of U-Net architectures, where low-level feature maps are combined with higher-level ones, which enables precise localization. This type of network architecture was especially designed to effectively solve image segmentation problems. U-Net was the default choice for us and other competitors. If you would like more insights into architecture we suggest that you read the original paper. Our final architecture is depicted below:
For more details about specific modules click here.
Typical convolutional neural network (CNN) architecture involves increasing the number of feature maps (channels) with each max pooling operation. In our network we decided to keep a constant number of 64 feature maps throughout the network. This choice was motivated by two observations. Firstly, we can allow the network to lose some information after the downsampling layer because the model has access to low level features in the upsampling path. Secondly, in satellite images there is no concept of depth or high-level 3D objects to understand, so a large number of feature maps in higher layers may not be critical for good performance.
We developed separate models for each class, because it was easier to fine tune them individually for better performance and to overcome imbalanced data problems.

Training procedure

Models assign probability of belonging to a target class for each pixel from the input image. Although Jaccard was the evaluation metric, we used the per-pixel binary cross entropy objective for training.
We normalized images to have a zero mean and unit variance using precomputed statistics from the dataset. Depending on class we left preprocessed images unchanged or resized them together with corresponding label masks to 1024 x 1024 or 2048 x 2048 squares. During training we collected a batch of cropped 256 x 256 patches from different images where half of the images always contained some positive pixels (objects of target classes). We found this to be both the best and the simplest way to handle the imbalanced classes problem. Each image in a batch was augmented by randomly applying horizontal and vertical flips together with random rotation and color jittering.
Each model had approx. 1.7 million parameters. Its training (with batch size 4) from scratch took about two days on a single GTX 1070.
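The batch composition described above can be sketched as follows. This is a hedged illustration and not our training code: the helper names are hypothetical, and rejection sampling (retrying until a crop contains a positive pixel) is an assumption about how such a constraint could be enforced.

```python
import random


def sample_patch_corner(mask, patch, want_positive, rng, max_tries=1000):
    """Pick the top-left corner of a patch x patch crop from a 2D 0/1 mask.

    If want_positive is True, retry until the crop contains a positive pixel.
    """
    height, width = len(mask), len(mask[0])
    for _ in range(max_tries):
        top = rng.randrange(height - patch + 1)
        left = rng.randrange(width - patch + 1)
        has_positive = any(mask[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch))
        if has_positive or not want_positive:
            return top, left
    raise RuntimeError("no positive patch found; the mask may be empty")


def sample_batch(masks, patch, batch_size, rng):
    """Return (image index, top, left) triples; the first half contain positives."""
    batch = []
    for k in range(batch_size):
        index = rng.randrange(len(masks))
        top, left = sample_patch_corner(masks[index], patch,
                                        k < batch_size // 2, rng)
        batch.append((index, top, left))
    return batch


# One toy 8 x 8 mask with a single positive pixel at row 5, column 5.
masks = [[[1 if (r, c) == (5, 5) else 0 for c in range(8)] for r in range(8)]]
batch = sample_batch(masks, patch=4, batch_size=4, rng=random.Random(0))
print(batch)
```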

Predictions

We used a sliding window approach at test time with window size fixed to 256 x 256 and stride of 64. This allowed us to eliminate weaker predictions on image patch boundaries where objects may only be partially shown without context around them. To further improve prediction quality we averaged results for flipped and rotated versions of the input image, as well as for models trained on different scales. Overall we obtained well smoothed outputs.
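A one-dimensional sketch of the sliding window with averaging of overlapping predictions is shown below. The function names are hypothetical, the image is assumed to be at least as large as the window, and uniform averaging is an assumption. Other weighting schemes for patch borders are possible.

```python
def window_starts(size, window, stride):
    """Start offsets covering [0, size), with the last window flush to the edge."""
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts


def average_overlapping(size, window, stride, predict):
    """Average per-position predictions over all windows covering a position."""
    totals = [0.0] * size
    counts = [0] * size
    for start in window_starts(size, window, stride):
        for offset, value in enumerate(predict(start, window)):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]


starts = window_starts(1024, 256, 64)
print(starts[:4], len(starts))
```

With a 256-pixel window and a stride of 64, most positions are covered by four windows along each axis, so a pixel near the border of one patch lies well inside another.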

Post-processing

Ground truth labels were provided in WKT format, presenting objects as polygons (defined by their vertices). It was necessary for us to generate submissions where polygons are concise and can be processed quickly by the evaluation system to avoid timeout limits. We found that this can be accomplished with minimal loss on the evaluation metric by using parameterized operations on binarized outputs. In our post-processing stage we used morphology dilation/erosion and simply removed objects/holes smaller than a given threshold.
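The removal of small objects can be sketched with an iterative flood fill, as below. This is only an illustration: the function name, the use of 4-connectivity and the threshold value are assumptions, and the dilation/erosion step is not shown.

```python
def remove_small_objects(mask, min_size):
    """Zero out 4-connected components of 1s smaller than min_size pixels."""
    height, width = len(mask), len(mask[0])
    result = [row[:] for row in mask]
    seen = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Collect the whole component with an explicit stack.
            component, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(component) < min_size:
                for y, x in component:
                    result[y][x] = 0
    return result


mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
cleaned = remove_small_objects(mask, min_size=2)
print(cleaned)
```

Removing small holes works the same way on the inverted mask.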

Our solution

Buildings, Misc., Roads, Tracks, Trees, Crops, Standing Water

For these seven classes we were able to train convolutional networks (separately for each class) with binary cross entropy loss, as described above, on 20-channel inputs and at two different scales (1024 and 2048) with satisfactory results. Outputs of the models were simply averaged and then post-processed with hyperparameters depending on the particular class.

Waterway

The solution for the waterway class was a combination of linear regression and random forest, trained on per-pixel data from the 20 input channels. Such a simple setup works surprisingly well because of the characteristic spectral response of water.

Large and Small Vehicles

We observed high variation of the results on the local validation and the public leaderboard due to the small number of vehicles in the training set. To combat this, we trained models separately for large and small vehicles, as well as a single model for both of them (label masks were added together), on 20-channel inputs. Additionally, we repeated all experiments using 4-channel inputs (RGB + Panchromatic) to increase the diversity of the models in our ensemble. Outputs from models trained on both classes were averaged with the class-specific models to produce the final predictions for each type of vehicle.

Technologies

We implemented models in PyTorch and Keras (with TensorFlow backend), according to our team members’ preferences. Our strategy was to build separate models for each class, so this required careful management of our code. To run models and keep track of our experiments we used Neptune.

Final results

Below we present a small sample of the final results from our models:

Buildings

Roads

Tracks

Crops

Waterway

Small vehicles

Conclusions

Satellite imagery is a domain with a high volume of data, which makes it perfect for deep learning. We have shown that results from current state-of-the-art research can be applied to solve practical problems. Excited by our results, we look forward to more such challenges in the future.
Team members:
Arkadiusz Nowaczyński
Michał Romaniuk
Adam Jakubowski
Michał Tadeusiak
Konrad Czechowski
Maksymilian Sokołowski
Kamil Kaczmarek
Piotr Migdał

Training XGBoost with R and Neptune

March 27, 2017/in Machine learning, Neptune /by Jan Lasek

In this blogpost we present the R library for Neptune – the DevOps platform for data scientists. Neptune’s R extension is presented by demonstrating the powerful XGBoost library and a bank marketing dataset (available at the UCI Machine Learning Repository).

The goal is to build a model that predicts how likely a given customer is to subscribe to a bank deposit. Such a model can be used as a basis for a recommendation system or for more efficient allocation of resources in a call center. The model is built by using XGBoost: a state-of-the-art library for training predictive models. XGBoost has a long legacy of successful applications in data science – here you can find a list of use cases in which it was used to win open machine learning challenges. If you are interested in more details and other modeling approaches to the problem under consideration we refer to this publication.
Let’s start with describing the dataset!

Bank customer data

The data we are dealing with here is a set of over 41K customer records from a bank. They comprise various features describing each customer. Among others, the provided data is the customer’s age, marital status and some other features regarding previous purchase history. We are also given a set of macroeconomic indicators, for example, the consumer confidence index. Finally, we are given the binary information whether a given customer subscribed for a bank deposit – about 11.3% of all customers decided to do so. Our goal is to build a model that gives the probability of this event.
Inquiries on subscribing to a bank deposit were made in a phone call. Along with data described above, we are also given information about how long such a call with each customer lasted. In the analysis, this feature should be disregarded as it would be considered a data leak. We want to train the model that gives information about the probability of subscribing to a deposit prior to taking any action (in particular, making a phone call). At the end of the day, we want to save resources and time spent on calling customers in vain.
We will employ relatively few preprocessing steps before plugging the data into the model. We will use R’s model.matrix() function to encode categorical attributes. After this step, the data comprise 53 numeric attributes and a single target column. Loading the data and preprocessing is done using the code below. We also load all the libraries that we will use in this example: xgboost, neptune and ModelMetrics.

library(xgboost)
library(neptune)
library(ModelMetrics)
customer_data <- read.csv('https://s3-us-west-2.amazonaws.com/deepsense.neptune/data/bank-additional/bank-additional-full.csv', sep = ';')
customer_data$duration <- NULL
y <- customer_data$y == 'yes'
x <- model.matrix(y~.-1, data = customer_data)

Training XGBoost model

XGBoost is a powerful library for building ensemble machine learning models via the algorithm called gradient boosting. Training an XGBoost model is an iterative process. In each iteration, a new tree (or a forest) is built, which improves the accuracy of the current (ensemble) model.
In order to train and evaluate the model, we will split the data into three parts: a training set, a validation set and a test set. The training set will be used to build our model. With the validation set we will monitor the model’s performance on a different dataset than the training one. Finally, the test set will serve as a sanity check for the model’s final performance on a previously unseen holdout dataset. Here we decide to devote 60% of the data for training, 20% for validation and the remaining 20% for testing. For reproducibility we set a seed here.

set.seed(999)
train_valid_test <- sample(1:3, prob = c(0.6, 0.2, 0.2), replace = T, size = nrow(x))
train_idx <- train_valid_test == 1
valid_idx <- train_valid_test == 2
test_idx <- train_valid_test == 3
y_train <- y[train_idx]
y_valid <- y[valid_idx]
y_test <- y[test_idx]
x_train <- x[train_idx,]
x_valid <- x[valid_idx,]
x_test <- x[test_idx,]

At this point we should introduce an accuracy metric that we will employ. First, we note that there is some class imbalance in the response rate: as few as 1 out of 9 of all the responses are positive (this is typical in case of marketing data). In such applications, the area under the ROC curve (abbreviated as AUC) is often a metric of choice because it handles classification under imbalanced classes well. The rare class – customers subscribing to a deposit – is in this application of special interest for us.
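To show what the AUC measures, here is a small Python sketch that computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score. It is meant for intuition only: the labels and scores below are made-up toy values, and the quadratic pairwise loop is not how one would compute the AUC on the full dataset.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 2 positives out of 9 examples.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.3, 0.05, 0.5, 0.15, 0.25]
toy_auc = auc(labels, scores)
print(toy_auc)
```

A model that assigns the same score to everyone gets an AUC of 0.5 regardless of the class balance, whereas its plain accuracy would look deceptively high on imbalanced data.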
To monitor the training process in Neptune, we need to specify the appropriate Neptune channels. They are used for keeping track of metrics that are important to us. We will use two numeric channels for the training and validation AUC scores. Finally, we are going to record the test AUC score. We can also keep track of other things, like the execution time up to each iteration. On top of the channels we can create custom charts, which can be later viewed in Neptune’s Dashboard:

Below we set up the Neptune channels and Neptune charts:

createNumericChannel('train_auc')
createNumericChannel('valid_auc')
createNumericChannel('test_auc')
createChart(chartName = 'Train & validation auc', series = list('train_auc', 'valid_auc'))
createNumericChannel('execution_time')
createChart(chartName = 'Total execution time', series = list('execution_time'))

Neptune facilitates monitoring our computations that are organized in a process called a job. When executing the job, created channels will be visible for inspection in Neptune’s menu. So far we defined four numeric channels and two charts. After executing the job, you can view the channels in Neptune’s UI:

Yet another useful feature of XGBoost is the possibility of calling a custom callback function after each iteration of the boosting algorithm. Callbacks are useful for debugging and online performance monitoring of your models. We will write our own function to keep track of the training process. Here you can see some examples of callback functions. Specifically, we will override the function cb.print.evaluation() from the repository so that we can track the progress of learning in Neptune. This function is presented below. The variable start_time is a global variable that allows us to monitor the total execution time (we will create it in the next step).

cb.print.evaluation <- function (period = 1) {
  callback <- function(env = parent.frame()) {
    if (length(env$bst_evaluation) == 0 || period == 0)
      return()
    i <- env$iteration
    if ((i - 1)%%period == 0 || i == env$begin_iteration || i == env$end_iteration) {
      channelSend('train_auc', i, env$bst_evaluation[1])
      channelSend('valid_auc', i, env$bst_evaluation[2])
      channelSend('execution_time', i, as.numeric(difftime(Sys.time(), start_time, units = 'secs')))
    }
  }
  attr(callback, 'call') <- match.call()
  attr(callback, 'name') <- 'cb.print.evaluation'
  callback
}

Note that the function refers to the environment it is called from, obtained via the parent.frame() function. This allows us to access the objects created during the training process (in this case we choose to monitor both the training and validation AUC scores).
The conditional statements in the code above check whether any monitoring was set up and then report the scores every period iterations (this can be changed via the parameter print.every.n in the xgb.train() function), as well as for the first and last iterations.
Next, we train our model. To start, we create a start_time variable to monitor the execution time.

start_time <- Sys.time()
model <- xgb.train(
  params = list(
    objective = 'binary:logistic',
    eval_metric = 'auc',
    max_depth = 4),
  data = xgb.DMatrix(x_train, label = y_train),
  nrounds = 50,
  watchlist = list(
    train = xgb.DMatrix(x_train, label = y_train),
    validation = xgb.DMatrix(x_valid, label = y_valid)),
  callbacks = list(cb.print.evaluation()))

Let’s have a closer look at what is going on here. Above, the training and validation sets for performance monitoring are submitted to the model as the parameter named watchlist. Some extra configuration needs to be done. We set:

evaluation metric to be AUC (parameter eval_metric = 'auc')
learning task as a binary classification with logistic loss (objective = 'binary:logistic')
maximal depth of an individual tree to 4 – an arbitrary value (max_depth = 4)
number of boosting iterations to 50 (nrounds = 50).

To finalize the preparations, we need to specify a configuration file for Neptune with some metadata. For now it may be as simple as the example file below.

name: Bank Marketing
project: Predicting Deposit Subscription

We can run our code with Neptune from the command line:

$ neptune run bank_marketing.R --config xgb_config.yaml --dump-dir-url my_dump_dir

The file bank_marketing.R contains our model’s R code, xgb_config.yaml is the Neptune configuration file, and my_dump_dir is the directory where all the job’s output and source code will be stored. This is useful for the reproducibility of the experiment.
In Neptune’s dashboard we can see both the training and the validation AUC scores on a plot.

XGBoost has a useful parameter, early_stopping_rounds. It stops further training when the evaluation metric on the validation set has not improved for early_stopping_rounds consecutive iterations. In machine learning, this is a common way to prevent a model from overfitting. However, it is not known in advance what value this parameter should be set to. The idea here is to plot the training and validation scores and observe the moment when further training is no longer necessary. From the visualization of the training process above it appears that 10 is a sufficient number of iterations (trees) in our ensemble model. In this way, we also arrive at a less complex model with no loss of accuracy.
We also changed the random seed above to see whether we arrive at a stable solution for different data shuffles (see the train/validation/test split above). In general, based on our experimentation, an ensemble of 10 trees (that is, running the model for 10 iterations) appears to produce a decent and stable model overall. Here, Neptune helps us diagnose a proper early stopping time by presenting the AUC scores. Finally, it is convenient to set the number of iterations as a job parameter and extend the configuration file as discussed here. For example, in Neptune’s R library, the command line argument nrounds can be accessed in the job via the nrounds <- params('nrounds') command.
It’s time to inspect the model’s performance on the remaining set of records.
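The stopping rule itself is simple enough to sketch in plain R. The helper below, early_stop, is hypothetical and is not XGBoost's implementation; it assumes a metric where higher is better (such as AUC), and the validation scores are made up for illustration.

```r
# Hypothetical helper: given per-iteration validation scores (higher is better),
# return the best iteration and the iteration at which training would stop.
early_stop <- function(valid_scores, patience) {
  best_score <- -Inf
  best_iter <- 0
  for (i in seq_along(valid_scores)) {
    if (valid_scores[i] > best_score) {
      best_score <- valid_scores[i]
      best_iter <- i
    } else if (i - best_iter >= patience) {
      # No improvement for 'patience' consecutive iterations: stop here.
      return(list(best_iteration = best_iter, stopped_at = i))
    }
  }
  list(best_iteration = best_iter, stopped_at = length(valid_scores))
}

# Made-up validation AUC values, for illustration only.
valid_auc <- c(0.70, 0.74, 0.77, 0.79, 0.78, 0.79, 0.78, 0.77)
result <- early_stop(valid_auc, patience = 3)
print(result)
```

A larger patience tolerates longer plateaus at the cost of extra iterations, which is why inspecting the plotted scores helps when choosing it.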

Model evaluation and the lift curve

Finally, we make predictions and evaluate the model using the holdout test set. Using the ntreelimit parameter, we can specify that the model built after 10 iterations is used for predicting new data (as discussed above).

predictions_test <- predict(model, xgb.DMatrix(x_test), ntreelimit = 10)
auc_test <- auc(y_test, predictions_test)

Evaluation yields an AUC of 0.80. This score is close to the one obtained on the validation set, which is good: we created a stable model with no sign of overfitting, and it is ready to be used!
Let’s measure its effectiveness on the holdout test set using a lift chart. This is a tool for getting insight into the expected performance of our marketing campaign if we target it at the customers most likely to subscribe, as predicted by the model.
To produce the lift chart, we sort the true responses (y_test) by the predicted probability of deposit subscription in decreasing order and compute the cumulative fraction of responses. This fraction is then normalized against the baseline level, equal to the fraction of responses in the test data (11.7%). We create an extra numeric channel and a chart for the lift curve and plot it for the top 10%, 20%, …, 100% of customers. This is accomplished using the function below.

lift_chart <- function(responses, predictions) {
  baseline <- mean(responses)
  responses_ordered <- responses[order(predictions, decreasing = TRUE)]
  lift <- cumsum(responses_ordered) / 1:length(responses_ordered) / baseline
  createNumericChannel('lift')
  createChart(chartName = 'Lift chart', series = list('lift'))
  n <- length(lift)
  for(x in seq(0.1, 1, by = 0.1)) {
    # max(., 1) assures a proper index >= 1
    channelSend('lift', x, lift[max(round(x * n), 1)])
  }
}
lift_chart(y_test, predictions_test)

In Neptune’s dashboard this produces an interactive plot named “Lift chart” presented below.

Based on this chart, we may analyze our results in greater detail. For example, we see that contacting the top 20% of customers (as selected by our model) translates to an over 3-fold increase in the hit rate (that is, the fraction of subscriptions in the selected group) compared to the baseline level for the test data. This results in more efficient targeting of our campaign. In practice, the desired number of contacted customers depends on the resources available and the costs associated with contacting them.
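To make the lift value concrete, here is a small worked example on made-up data. The helper lift_at is hypothetical; it follows the same computation as the lift chart function but omits the Neptune calls so that it can be run on its own.

```r
# Hypothetical helper: lift for the top 'fraction' of customers ranked by prediction.
lift_at <- function(responses, predictions, fraction) {
  baseline <- mean(responses)
  ordered <- responses[order(predictions, decreasing = TRUE)]
  k <- max(round(fraction * length(ordered)), 1)  # at least one customer
  mean(ordered[1:k]) / baseline
}

# Made-up data: 10 customers, 2 of whom subscribed (baseline hit rate 0.2).
responses   <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
predictions <- c(0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.05, 0.15, 0.25, 0.35)

lift_top20 <- lift_at(responses, predictions, 0.2)
lift_all   <- lift_at(responses, predictions, 1.0)
print(c(lift_top20, lift_all))
```

In this toy case both subscribers are ranked first, so the top 20% has a hit rate of 1.0 against a baseline of 0.2. Contacting everyone always gives a lift of 1 by construction.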

The end

That’s all! We trained a model to predict how likely a customer is to order a given bank product. Using R and XGBoost with the help of Neptune, we trained a model and tracked its learning process. There is still room for improving the model’s accuracy. Playing with the parameters described above would be a good starting point here. You can give it a try and access the complete workflow at our GitHub repository (for the 1.4 version) or explore the job in NeptuneGo!.
This is our first attempt at integrating Neptune with R. We will be very grateful for your feedback here! We would like to make sure that it fits well with your experimentation pipeline. Moreover, we hope that it can improve your pipeline as dramatically as Neptune’s Python API changed our experience.

Neptune machine learning platform: grid search, R & Java support

March 6, 2017/in Data science, Deep learning, Machine learning, Neptune /by Rafał Hryciuk

In February we released a new version of Neptune, our machine learning platform for data scientists, which supports them in more efficient experiment management and monitoring. The latest 1.4 release introduces new features such as grid search, a hyperparameter optimization method, and support for the R and Java programming languages.

Grid Search

The first major feature introduced in Neptune 1.4 is support for grid search, which is one of the most popular hyperparameter optimization methods. You can read more about grid search here.
In version 1.4, you can pass a list or a range of values for a numeric parameter of your Neptune experiment instead of a single value. Neptune will create a grid search experiment and run a job for every combination of the parameters’ values. Neptune groups the results and helps you manage them within the grid search experiment. You can define custom metrics for evaluation. Neptune will automatically select the combination of hyperparameter values that gives the best value of the metric. Read an example.
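The idea behind grid search can be sketched in a few lines of plain R. This is not Neptune's API: the parameter values are arbitrary, and toy_metric is a hypothetical stand-in for a real training job that would return, for example, the validation AUC.

```r
# Every combination of the candidate values becomes one row (one job).
grid <- expand.grid(max_depth = c(2, 4, 6), nrounds = c(10, 50))

# Placeholder metric for illustration only; a real job would train and evaluate a model.
toy_metric <- function(max_depth, nrounds) {
  -(max_depth - 4)^2 - (nrounds - 10) / 100
}

# Evaluate every combination and pick the one with the best (highest) metric.
grid$score <- mapply(toy_metric, grid$max_depth, grid$nrounds)
best <- grid[which.max(grid$score), ]
print(best)
```

The number of jobs grows multiplicatively with the number of values per parameter, so grids are usually kept small.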

R Support and Java Support

Neptune exposes a REST API, so it is completely language and platform agnostic. deepsense.ai provides high-level client libraries for the programming languages most popular among data scientists (according to a poll taken in the community; see the results). Thanks to the client libraries, users don’t have to implement communication via the REST API themselves and can invoke high-level functions instead. Until version 1.4 we only provided a client library for Python. In version 1.4 we introduced support for R and Java (which also covers Scala users). Thanks to the new client libraries you can run, monitor and manage your experiments written in R or Java in the Neptune machine learning platform. You can get the client libraries for R and Java here.

Future Plans

We are already working on the next version of Neptune, which will be released at the beginning of April 2017. The next release will contain:

Architectural and API changes that will improve user experience.
New approach for handling snapshots of the experiments’ code.
Neptune Offline Context — the user will be able to run the code that uses Neptune API offline.

I hope you will enjoy working with our machine learning platform, now with grid search support and client libraries for R and Java. If you’d like to give us feedback, feel free to use our forum at https://community.neptune.ml.
Do you want to check out Neptune? Visit NeptuneGo!, look around and run your first experiments.
