Machine learning Archives - Page 7 of 7

Machine Learning for Greater Fire Scene Safety

October 22, 2015/in Data science, Machine learning /by Jan Lasek

The lives of brave firemen are threatened during dangerous emergency missions while they try to save other people and their property. In this post I would like to share my experiences and winning strategy for the AAIA’15 Data Mining Competition: Tagging Firefighter Activities at a Fire Scene, in which I took first place.

The competition was organized jointly by the University of Warsaw and the Main School of Fire Service, in Warsaw, Poland. It lasted over 3 months during which 79 contestants submitted a total of 1,840 proposals with solutions on the competition’s hosting platform Knowledge Pit.
I particularly enjoy competitions with a potentially big impact – when something more than only a high accuracy score is at stake. This competition definitely had a flavor of this – the participants were asked to contribute toward the safety of firefighters on the scene during an emergency mission.

The challenge

It is certainly helpful for decision making during an emergency when you know what particular activity the members of a rescue team are currently engaged in. This was the goal of the competition – develop a model that recognizes what activity a fireman is performing based on sensory data from his body movements and a collection of statistics monitoring his vital functions. Actually, we are facing two dependent multiclass classification problems. The first class is the main posture of the fireman and the second one is his particular action. Here is a sample of the data the contestants were given:

posture	action	avg-ecg1	…	ll-acc-x	ll-acc-y	…	torso-gyro-z
stooping	manipulating	-0.03	…	-6.98	10.41	…	28.49
standing	signal water first	-0.04	…	-9.41	0.11	…	63.84
moving	running	-0.04	…	-8.75	3.81	…	-52.92
crawling	searching	-0.03	…	-36.61	2.74	…	-134.26
stooping	manipulating	-0.04	…	-3.00	2.23	…	-7.21

The first two columns present the two class attributes: the posture and the main action of the fireman. Each activity is described by ca. 2 second-time series of sensory data from accelerometers and gyroscopes and certain statistics on fireman’s vital functions. In total, there are 42 such statistics as well as 42 different time series. Moreover, as usual, you are given two datasets: “train” and “test”. In the training data, you are given instances along with labels of activities, just as exemplified in the table above. In the test data, the labels are not present and you are asked to design a model for automatic tagging of those activities. To select the best performing approach from participants’ proposals, the performance of a given model on the test set was taken into account (in terms of an evaluation metric discussed below). You can find more information on the competition at its hosting platform.
The number of possible activities is restricted to the set of labels by the competition organizers. There are five labels in the first class, and 16 in the second one. Moreover, the labels are dependent. Let us see their joint distribution.

	crawling	crouching	moving	standing	stooping
ladder down	0	0	465	0	0
ladder up	0	0	476	0	0
manipulating	0	1764	331	2356	1898
no action	0	87	0	490	0
nozzle usage	0	492	0	443	0
running	0	0	4324	0	0
searching	459	0	0	0	0
signal hose pullback	0	0	0	98	0
signal water first	0	0	41	496	0
signal water main	0	46	0	405	0
signal water stop	0	0	0	277	0
stairs down	0	0	644	0	0
stairs up	0	0	1157	0	0
striking	0	0	0	1022	0
throwing hose	0	0	0	234	930
walking	0	0	1064	0	0

For example, there are 4,324 instances in the data where a fireman is moving and running, and 234 instances where a fireman is standing and throwing a hose. Surely, there are many other activities that someone from the rescue team can engage in, however, the dataset was restricted to this particular subset. It may come as a big disappointment, but there were no “saving cat” label. As such, the competition was set up as a standard supervised learning task: we are given a training set of activities along with their tags. In the test set, we are to tag activities based on what we’ve learned from the examples in the training set.
Another thing to note is the fact that the distribution of labels is fairly unbalanced. For instance, a fireman is about four times more likely to be running than throwing a hose. This should be carefully considered, especially in the context of the evaluation metric adopted in the competition.
The chosen metric was balanced accuracy. It is defined in the following way. First, for a given label we define accuracy of predictions

Next, the balanced accuracy score for class C with L labels is equal to the average accuracy among its labels

Finally, since we have two dependent class attributes, we compute a weighted average of balanced accuracy scores for posture and action classes:

A higher weight is attached to the accuracy of classification of the more granular class action.

Overview of the solution

The approach to the task boils down to an extensive feature engineering step for time series data, before learning a set of classifiers. Along the way, there are a couple of interesting details to discuss. Since the final solution consisted of three slightly different Random Forest models that do not differ too much, I’ll describe just one of them.

Classification with two dependent class attributes

One of the interesting aspects of the challenge is the fact that we need to predict two dependent classes. In my approach, I performed a stepwise classification. In the first step, I predict the main posture of a fireman. In the second step, the particular activity is predicted based on the training set and the predicted label from the first step. Thanks to this approach, you can capture the hierarchical dependency between labels. Naturally, there are a number of other ways to deal with the two-class tagging problem. For instance, one could train two independent classifiers or concatenate the two labels. However, the approach of chaining two classifiers yielded better results in my case.

Drift between training and test data distribution

Another issue that came with the data was the fact that the activities in training and test set were performed by different firemen. This posed a real challenge. An important part of successful participation in any data mining competition is that you are able to set-up a local evaluation framework that is in-line with the one employed in the contest. Here, a natural solution would be to perform a stratified cross-validation over different firemen. However, no identifier of a fireman for a particular activity was provided. Hence, regardless of whether I liked it or not, I had to rely predominantly on preliminary evaluation scores that were based on 10% of the data during the competition (the final evaluation was done on the other 90% of test data). Of course, this was a problem not only for me but also for all the other contestants. As I talked to them at a conference workshop following the competition, they also relied mainly on preliminary evaluation results, as the evaluation on the training data yielded far too optimistic scores.

Feature engineering

The main effort during the competition was devoted to the extraction of interesting features describing the underlying time series (called signals). There are a couple of basic statistics that you can derive from the signal: mean, standard deviation, skewness, kurtosis, quantiles. I derived quantiles on a relatively rich grid ranging from 0.01, 0.05, 0.1, …, 0.95, 0.99. Because some of the activities are periodic, I thought that it would be useful to utilize some tools dedicated to that task. I processed each signal by Fourier transform as well as computed periodograms. From these transformed signals I once again extracted basic summary statistics. Another feature which is quite simple and proved to be useful in classification is correlation between signals. Intuitively, when you are running, the recordings of corresponding devices attached to your legs should be negatively correlated. Finally, I made some effort to identify peaks in the data. The idea is that, in case of performing different activities, e.g., running or striking, we can observe a different number of “peaks” in the signal. Peaks identification is a problem that is easy to state but hard to define mathematically. At the end, I ended up with a simple method that was based on counting chunks of a time series where it exceeds its mean by one or two standard deviations.
To battle the drift between training and test data, one should try to design generic (not subject-specific) features. For instance, the quantiles of distribution of acceleration are heavily dependent on a given person’s running pace and his/her motoric abilities. Presumably, these statistics are going to differ much from person to person. On the other hand, if you derive a correlation between acceleration recordings on left and right leg, this correlation may turn out to vary less between different firemen! This is a desired property of a feature, as the activities in test data were performed by a different set of people than those in the training set.
Feature extraction was the most tedious part of the solution, but I believe a worthy one. I derived a set of almost 5,000 features describing each single activity. Now, the next step is to train a model based on these features that learns to distinguish between different activities.

Let’s vote

If a group of experts is to decide on an important matter, it is often the case, that collectively they can make a better decision. As each of them looks at the problem from a slightly different perspective, they can jointly arrive at a more refined judgment. This idea is brilliantly explored in the Random Forest algorithm, which is an ensemble of decision trees. A large number of trees are trained on diverse subsamples of data so that their joint prediction, made by majority voting, usually yields higher accuracy than each single individual model. I employed this model to solve the problem of activity recognition.
Another appealing property of Random Forest is that it has an inherent method of selecting relevant attributes. Having extracted a quite rich set of features, it is certainly the case that some of them are only mildly useful. I handed over the task of selecting the most relevant ones to the model itself.
As already mentioned in the introduction, the distribution of labels in the data was fairly unbalanced. Recall that our solutions are evaluated against a balanced accuracy evaluation metric. Doing a poor job predicting some label, yields the same penalty regardless of its distribution in the data. To account for this, each tree in the forest was trained on a stratified subsample of the data, where each label was present in an equal proportion. This preserved the forest from focusing too much on the most prevalent labels, and gave a major improvement in the score.

Summary

Summing up, the competition was a very exciting experience. I would like to thank all the participants, as they made the contest a great event. Also, I want to thank the organizing committee from the University of Warsaw and the Main School of Fire Service for providing such an interesting dataset and setting up the competition. The winning solution yielded a balanced accuracy of 84% which was enough to beat other contestants’ solutions. Certainly, there is still some room for improvement, yet we took a small step toward increasing the safety of firemen at a fire scene.

Jan Lasek
(deepsense.ai Machine Learning Team)

About the Author:

Jan Lasek, Data Scientist at deepsense.ai, is also pursuing his PhD at the Institute of Computer Science, a part of the Polish Academy of Sciences. He graduated from Warsaw University where he studied both at the Faculty of Mathematics and the Faculty of Economic Sciences.

Diagnosing diabetic retinopathy with deep learning

September 3, 2015/in Data science, Deep learning, Machine learning /by Robert Bogucki

What is the difference between these two images?

The one on the left has no signs of diabetic retinopathy, while the other one has severe signs of it.

If you are not a trained clinician, the chances are, you will find it quite hard to correctly identify the signs of this disease. So, how well can a computer program do it?
In July, we took part in a Kaggle competition, where the goal was to classify the severity of diabetic retinopathy in the supplied images of retinas.
As we’ve learned from the organizers, this is a very important task. Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
The contest started in February, and over 650 teams took part in it, fighting for the prize pool of $100,000.
The contestants were given over 35,000 images of retinas, each having a severity rating. There were 5 severity classes, and the distribution of classes was fairly imbalanced. Most of the images showed no signs of the disease. Only a few percent had the two most severe ratings.
The metric with which the predictions were rated was a quadratic weighted kappa, which we will describe later.
The contest lasted till the end of July. Our team scored 0.82854 in the private standing, which gave us 6th place. Not too bad, given our quite late entry.
You can see our progress on this plot:

Also, you can read more about the competition here.

Solution overview

What should be no surprise in an image recognition task, most of the top contestants used deep convolutional neural networks (CNNs), and so did we.
Our solution consisted of multiple steps:

image preprocessing
training multiple deep CNNs
eye blending
kappa score optimization

We briefly describe each of these steps below. Throughout the contest we used multiple methods for image preprocessing and trained many nets with different architectures. When ensembled together, the gain over the best preprocessing method and the best network architecture was little. We therefore limited ourselves to describing the single best model. If you are not familiar with convolutional networks, check out this great introduction by Andrej Karpathy: http://cs231n.github.io/convolutional-networks/.

Preprocessing

The input images, as provided by the organizers, were produced by very different equipment, had different sizes and very different colour spectrum. Most of them were also way too large to perform any non-trivial model fitting on them. A minimum preprocessing to make network training possible is to standardize the dimensions, but ideally one would want to normalize all other characteristics as well. Initially, we used the following simple preprocessing steps:

Crop the image to the rectangular bounding box containing all pixels above a certain threshold
Scale it to 256×256 while maintaining the aspect ratio and padding with black background (the raw images have black background as well, more or less)
For each RGB component separately, remap the colour intensities so that the CDF (cumulative distribution function) looks as close to linear as possible (this is called “histogram normalization”)

All these steps can be achieved in a single call of ImageMagick’s command line tool. In time, we realized that some of the input images contain regions of rather intensive noise. When using the simple bounding-box cropping described above, this leads to very bad quality crops, i.e. the actual eye occupying an arbitrary and rather small part of the image.

You can see gray noise at the top of the image. Using state of the art edge detectors, e.g. Canny, did not help much. Eventually, we developed a dedicated cropping procedure. This procedure chooses the threshold adaptively, exploiting two assumptions based on analysis of provided images:

There always exists a threshold level separating noise from the outline of the eye
The outline of the eye has an ellipsoidal shape, close to a circle, possibly truncated at the top and bottom. In particular it is a rather smooth curve, and one can use this smoothness to recognize the best values for the threshold

The resulting cropper produced almost ideal crops for all images, and is what we used for our final solutions. We also changed the target resolution to 512×512, as it seemed to significantly improve the performance of our neural networks compared to the smaller 256×256 resolution.
Here is how the preprocessed image looks like.

Just before passing the images to the next stage we transformed the images so the mean of each channel (R, G, B) over all images is approximately 0, and standard deviation approximately 1.

Convnet architecture

The core of our solution was a deep convolutional neural network. Although we started with fairly shallow models — 4 convolutional layers, we quickly discovered that adding more layers, and filters inside layers helps a lot. Our best single model consisted of 9 convolutional layers.
The detailed architecture is:

| Type    | nof filters | nof units |
|---------|-------------|-----------|
| Conv    | 16          |           |
| Conv    | 16          |           |
| Pool    |             |           |
| Conv    | 32          |           |
| Conv    | 32          |           |
| Pool    |             |           |
| Conv    | 64          |           |
| Conv    | 64          |           |
| Pool    |             |           |
| Conv    | 96          |           |
| Pool    |             |           |
| Conv    | 96          |           |
| Pool    |             |           |
| Conv    | 128         |           |
| Pool    |             |           |
| Dropout |             |           |
| FC1     |             | 96        |
| FC2     |             | 5         |
| Sofmax  |             |           |

All Conv layers have 3×3 kernel, stride 1 and padding 1. That way the size (height, width) of the output of the convolution is the same as the size of the input. In all our convolutional layers we follow the convolutional layer by batch normalization layer and ReLu activations. Batch normalization is a simple but powerful method to normalize the pre-activation values in the neural net, so that their distribution does not change too much during the training. One often standardizes the data to make zero mean and unit variance. Batch normalization takes it a step further. Check this paper by Google to learn more. Our Pool layers always use max pooling. The size of the pooling window is 3×3, and the stride is 2. That way the height and width of the image get halved by each pooling layer. In the FC (fully connected) layers we again use ReLu as activation function. The first fully connected layer, FC1 also employs batch normalization. For regularization we used Dropout layer before the first fully connected layer, and L2 regularization applied to some of the parameters.
Overall, the net has 925,013 parameters.
We trained the net using stochastic gradient descent with momentum and multiclass logloss as a loss function. Moreover, the learning rate has been adjusted manually a few times during the training. We have used our own implementation based on Theano and Nvidia cuDNN.
To further regularize the network, we augmented the data during the training by taking random 448×448 crops of images and flipping them horizontally and vertically, independently with probability 0.5. During the test time, we took few such random crops, flips for each eye and averaged our predictions over them. Predictions were also averaged over multiple epochs.
It took quite long to train and compute test predictions even for a single network. On a g2.2xlarge AWS instance (using Nvidia GRID K520) it took around 48 hours.

Eye blending

At some point we realized that the correlation between the scores of two eyes in a pair was quite high. For example, the percent of eye pairs for which the score for the left eye is the same as for the right one is 87.2%. For 95.7% of pairs the scores differ by at most 1, and for 99.8% by at most 2. There are two likely reasons for this kind of correlation.
The first is that the retinas of both eyes were exposed to the damaging effects of diabetes for the same amount of time, and are similar in structure, so the conjecture is that they should develop the retinopathy at similar rate. The less obvious reason is that the ground truth labels were produced by humans, and it is conceivable that a human expert is more likely to give the same image different scores, depending on the score of the other image of the pair.
Interestingly, one can exploit this correlation between the scores of a pair of eyes to produce a better predictor.
One simple way is to take the predicted distributions D_L and D_R for the left and right eye respectively and produce new distributions using linear blending, as follows. For the left eye, we predict c⋅D_L+(1-c)⋅D_R, similarly we predict c⋅D_R+(1-c)⋅D_L for the right eye, for some c in [0, 1]. We tried c = 0.7 and a few other values. Even this simple blending produced a significant increase in our kappa score. However, a much bigger improvement was gained when instead of an ad-hoc linear blend we trained a neural network. This network takes two distributions (i.e. 10 numbers) as inputs, and returns the new “blended” versions of the first 5 inputs. It can be trained using predictions on validation sets. As for the architecture, we decided to go with a very strongly regularized (by dropout) one with two inner layers of 500 rectified linear nodes each.
One obvious idea is to integrate the convolutional networks and the blending network into a single network. Intuitively, this could lead to stronger results, but such a network might also be significantly harder to train. Unfortunately, we did not manage to try this idea before the contest deadline.

Kappa optimization

Quadratic weighted kappa (QWK), the loss function proposed by the organizers, seems to be a standard one in the area of retinopathy diagnosis, but from the point of view of mainstream machine learning it is very unusual. The score of a submission is defined to be one minus the ratio between the total square error of the submission (TSE) and the expected squared error (ESE) of an estimator that answers randomly with the same distribution as the submission (look here for a more detailed description).
This is a rather hard loss function to directly optimize. Therefore, instead of trying to do that, we use a two-step procedure. We first optimize our models for multiclass logloss. This gives a probability distribution for each image. We then choose a label for each image by using a simulated annealing based optimizer. Of course we cannot really optimize QWK without knowing the actual labels. Instead, we define and optimize a proxy for QWK, in the following way. Recall that QWK = 1 – TSE/ESE. We estimate both TSE and ESE by assuming that the true labels are drawn from the distribution described by our prediction, and then plug these predictions into the QWK formula, instead of the true values. Note that both TSE and ESE are underestimated by the procedure above. These two effects cancel each other out to some extent, still our predictions QWK were off by quite a lot compared to the leaderboard scores.
That said, we found no better way of producing submissions. In particular, the optimizer described above outperforms all the ad-hoc methods we tried, such as: integer-rounded expectation, mode, etc.

We would like to thank California Healthcare Foundation for being a sponsor, EyePACS for providing the images, and Kaggle for setting up this competition. We learned a lot and were happy to take part in the development of tools that can potentially help diagnose diabetic retinopathy. We are looking forward to solving the next challenge.

deepsense.ai Machine Learning Team

Subscribe to our newsletter

Contact us

I consent to receive commercial information about deepsense.ai sp. z o.o., including details about its products and offer, through email communication.*

The administrator of the personal data provided by you in the registration form is deepsense.ai sp. z o.o., headquartered at al. Jerozolimskie 44, 00-024 Warsaw, Poland. Your personal data will be processed for the purpose of directing marketing content to you.

Detailed information about the processing of your personal data, including your rights, can be found in our privacy policy.

* This consent is required to receive email communication from deepsense.ai sp. z o.o. regarding the company and its offerings.

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Locations

United States of America

deepsense.ai, Inc.
2100 Geng Road, Suite 210
Palo Alto, CA 94303
United States of America

Poland

Let us know how we can help

Our service offerings
contact@deepsense.ai

Media relations
media@deepsense.ai