Data science Archives - Page 7 of 10

multidplyr: first impressions

November 13, 2015/in Data science /by Przemyslaw Biecek

Two days ago Hadley Wickham tweeted a link to an introduction to his new package, multidplyr. Basically, it is a tool for taking advantage of many cores in dplyr operations. Let’s see how to play with it.

What can you do with multidplyr?

As described on its GitHub page, multidplyr is a library for processing data that is distributed across many cores, using dplyr verbs.
The idea is somewhat similar to Spark. Similar solutions exist for R, and some of them have been available for years (like Rmpi, distribute, parallel or many others from the list https://cran.r-project.org/web/views/HighPerformanceComputing.html). The problem is that they are made mainly for hackers. It is not unusual to get an error with 20 lines of traceback and no warning.
Packages from the Hadleyverse come with a nicer design (as Apple products sometimes do) and explode less often. With slightly less functionality we get more fun.
multidplyr is still in the development phase and sometimes it can make you really angry. But there are a lot of things that you can do well with it, and often you can do them really fast.
In the multidplyr vignette you will find examples that play with the flights dataset. There are just 300k+ observations, so it turns out that the overhead of distributing the data is larger than the computation time. But for larger datasets or more complicated calculations you should expect some gain from heating up additional cores.

Use case

Right now, I am playing with log files from many bizarre devices. The point is that there are a lot of rows of data (a few hundred million) and the logs are stored in many relatively small files. So I use multidplyr to read the data in parallel and do some initial pre-processing. The cluster is built on 15 cores and everything is done in plain R. It turns out that I can reduce the processing time from one day to two hours. So it is an improvement, even if you count the time you need to spend learning multidplyr (not that much if you know dplyr and Spark).
Let’s see the example step by step.
First, initiate a cluster with 15 nodes (one node per core).

library(dplyr)
library(multidplyr)
cluster = create_cluster(15)
## Initializing 15 core cluster.
set_default_cluster(cluster)

Find all files with the extension ‘log’. The data is there.

lf = data.frame(files = list.files(pattern = '\\.log$', recursive = TRUE),
                stringsAsFactors = FALSE)

Now I need to define a function that reads a file and does some preprocessing. This function is then sent to all nodes in the cluster.

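readAndExtractTimepoints = function(x) {
  # read the whole log file
  tmp = readLines(as.character(x)[1])
  # keep only the lines that mention 'Entering scene'
  ftmp = grep(tmp, pattern = 'Entering scene', value = TRUE)
  # the first 15 characters hold the timestamp
  substr(ftmp, 1, 15)
}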

cluster_assign_value(cluster, 'readAndExtractTimepoints', readAndExtractTimepoints)

Time to start the calculations. The list of file names is partitioned across the nodes, and readAndExtractTimepoints is executed for each file. The result is an object of class party_df (again, one file per row).

lf_distr = lf %>%
  partition() %>%
  group_by(files) %>%
  do(timepoints = readAndExtractTimepoints(.$files))
lf_distr
## Source: party_df [897 x 3]
## Groups: PARTITION_ID, files
## Shards: 15 [59--60 rows]
##
##    PARTITION_ID                     files   timepoints
##           (dbl)                     (chr)        (chr)
## 1             1 2013/01/cnk02a/cnk02a.log
## 2             1 2013/01/cnk02b/cnk02b.log
## 3             1   2013/01/cnk06/cnk06.log
## 4             1   2013/01/cnk07/cnk07.log
## 5             1   2013/01/cnk09/cnk09.log
## 6             1   2013/01/cnk10/cnk10.log
## 7             1 2013/01/cnk100/cnk100.log
## 8             1   2013/01/cnk11/cnk11.log
## 9             1   2013/01/cnk15/cnk15.log
## 10            1   2013/01/cnk16/cnk16.log

Results are ready to be collected and transformed into a classical list.

timeP = collect(lf_distr)
str(timeP$timepoints)
## List of 897
##  $ : chr [1:144830] "Jan  1 08:15:57 " "Jan  1 18:04:37 " "Jan  1 18:05:44 " "Jan  2 08:15:57 " ...
##  $ : chr [1:123649] "Jan  1 08:16:05 " "Jan  2 08:16:05 " "Jan  2 09:46:08 " "Jan  2 09:46:13 " ...
##  $ : chr [1:137661] "Jan  1 08:15:57 " "Jan  2 08:15:57 " "Jan  2 09:34:47 " "Jan  2 09:35:45 " ...
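
For comparison, here is a minimal sketch of the same read-and-extract step written with the base parallel package instead of multidplyr. The toy log files, their contents and the two-worker cluster are assumptions made only for illustration; they are not the data or the setup from the post.

```r
library(parallel)

# Same extraction logic as in the post: keep lines that mention
# 'Entering scene' and return their first 15 characters (the timestamp).
readAndExtractTimepoints = function(x) {
  tmp = readLines(as.character(x)[1])
  ftmp = grep(tmp, pattern = 'Entering scene', value = TRUE)
  substr(ftmp, 1, 15)
}

# Hypothetical toy logs written to a temporary directory.
log_dir = file.path(tempdir(), "toy_logs")
dir.create(log_dir, showWarnings = FALSE)
for (i in 1:4) {
  writeLines(c("Jan  1 08:15:57 dev1 Entering scene 12",
               "Jan  1 08:16:30 dev1 Leaving scene 12",
               "Jan  2 09:46:08 dev1 Entering scene 7"),
             file.path(log_dir, paste0("dev", i, ".log")))
}
files = list.files(log_dir, pattern = "\\.log$", full.names = TRUE)

# Sequential baseline: one file after another.
seq_res = lapply(files, readAndExtractTimepoints)

# Parallel version: the files are spread over a small socket cluster.
cl = makeCluster(2)
par_res = parLapply(cl, files, readAndExtractTimepoints)
stopCluster(cl)

# Both approaches should return the same list of timestamps.
identical(seq_res, par_res)
```

With only a handful of tiny files the parallel version will likely be slower than the sequential one, for the same overhead reason mentioned for the flights dataset; the gain only shows up when there are many files or heavier processing.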

General impressions

I guess that one can speed up the whole process even further by using Python or Spark. But if the dataset is not huge, it is much easier to maintain a process that uses just a single technology/language.
Overall I like multidplyr, even if it still looks like a prototype. Sometimes things get nasty, for example when you try to chain a few different do() operations. But knowing the ‘Hadley’ effect, I expect that it will get better and better with every version.
Finally, we should soon expect a solution for parallel processing that can be used by normal people, not only by hackers.

Machine Learning for Greater Fire Scene Safety

October 22, 2015/in Data science, Machine learning /by Jan Lasek

The lives of brave firemen are threatened during dangerous emergency missions while they try to save other people and their property. In this post I would like to share my experiences and winning strategy for the AAIA’15 Data Mining Competition: Tagging Firefighter Activities at a Fire Scene, in which I took first place.

The competition was organized jointly by the University of Warsaw and the Main School of Fire Service, in Warsaw, Poland. It lasted over 3 months during which 79 contestants submitted a total of 1,840 proposals with solutions on the competition’s hosting platform Knowledge Pit.
I particularly enjoy competitions with a potentially big impact – when something more than only a high accuracy score is at stake. This competition definitely had a flavor of this – the participants were asked to contribute toward the safety of firefighters on the scene during an emergency mission.

The challenge

It is certainly helpful for decision making during an emergency when you know what particular activity the members of a rescue team are currently engaged in. This was the goal of the competition – develop a model that recognizes what activity a fireman is performing based on sensory data from his body movements and a collection of statistics monitoring his vital functions. Actually, we are facing two dependent multiclass classification problems. The first class is the main posture of the fireman and the second one is his particular action. Here is a sample of the data the contestants were given:

posture	action	avg-ecg1	…	ll-acc-x	ll-acc-y	…	torso-gyro-z
stooping	manipulating	-0.03	…	-6.98	10.41	…	28.49
standing	signal water first	-0.04	…	-9.41	0.11	…	63.84
moving	running	-0.04	…	-8.75	3.81	…	-52.92
crawling	searching	-0.03	…	-36.61	2.74	…	-134.26
stooping	manipulating	-0.04	…	-3.00	2.23	…	-7.21

The first two columns present the two class attributes: the posture and the main action of the fireman. Each activity is described by an approximately 2-second time series of sensory data from accelerometers and gyroscopes, and by certain statistics on the fireman’s vital functions. In total, there are 42 such statistics as well as 42 different time series. Moreover, as usual, you are given two datasets: “train” and “test”. In the training data, you are given instances along with labels of activities, just as exemplified in the table above. In the test data, the labels are not present and you are asked to design a model for automatic tagging of those activities. To select the best performing approach from participants’ proposals, the performance of a given model on the test set was taken into account (in terms of an evaluation metric discussed below). You can find more information on the competition at its hosting platform.
The competition organizers restricted the possible activities to a fixed set of labels. There are five labels in the first class, and 16 in the second one. Moreover, the labels are dependent. Let us see their joint distribution.

	crawling	crouching	moving	standing	stooping
ladder down	0	0	465	0	0
ladder up	0	0	476	0	0
manipulating	0	1764	331	2356	1898
no action	0	87	0	490	0
nozzle usage	0	492	0	443	0
running	0	0	4324	0	0
searching	459	0	0	0	0
signal hose pullback	0	0	0	98	0
signal water first	0	0	41	496	0
signal water main	0	46	0	405	0
signal water stop	0	0	0	277	0
stairs down	0	0	644	0	0
stairs up	0	0	1157	0	0
striking	0	0	0	1022	0
throwing hose	0	0	0	234	930
walking	0	0	1064	0	0

For example, there are 4,324 instances in the data where a fireman is moving and running, and 234 instances where a fireman is standing and throwing a hose. Surely, there are many other activities that someone from the rescue team can engage in; however, the dataset was restricted to this particular subset. It may come as a big disappointment, but there was no “saving cat” label. As such, the competition was set up as a standard supervised learning task: we are given a training set of activities along with their tags. In the test set, we are to tag activities based on what we’ve learned from the examples in the training set.
Another thing to note is that the distribution of labels is fairly unbalanced. For instance, a fireman is about four times more likely to be running than throwing a hose. This should be carefully considered, especially in the context of the evaluation metric adopted in the competition.
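The chosen metric was balanced accuracy. It is defined in the following way. First, for a given label $l$ we define the accuracy of predictions as the fraction of instances with true label $l$ that were tagged correctly, where $y_i$ is the true label of instance $i$ and $\hat{y}_i$ is the predicted one:

$$ACC_l = \frac{|\{i : \hat{y}_i = y_i = l\}|}{|\{i : y_i = l\}|}$$

Next, the balanced accuracy score for class $C$ with $L$ labels is equal to the average accuracy among its labels:

$$BAC_C = \frac{1}{L} \sum_{l=1}^{L} ACC_l$$

Finally, since we have two dependent class attributes, we compute a weighted average of balanced accuracy scores for posture and action classes, with weights $w_{posture}$ and $w_{action}$:

$$score = \frac{w_{posture} \cdot BAC_{posture} + w_{action} \cdot BAC_{action}}{w_{posture} + w_{action}}, \qquad w_{action} > w_{posture}$$

A higher weight is attached to the accuracy of classification of the more granular class action.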
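
The metric described above can be sketched in a few lines of R. This is only an illustration: the function names, the toy labels and the default weights (1 for posture, 2 for action) are assumptions, since the text only states that the action class gets the higher weight.

```r
# Per-label accuracy: share of instances of a label that were predicted correctly.
label_accuracy = function(pred, truth, label) {
  mean(pred[truth == label] == label)
}

# Balanced accuracy: plain average of the per-label accuracies.
balanced_accuracy = function(pred, truth) {
  labels = unique(truth)
  mean(sapply(labels, function(l) label_accuracy(pred, truth, l)))
}

# Weighted combination of the two class attributes.
# The default weights are hypothetical; only w_action > w_posture is known.
combined_score = function(bac_posture, bac_action, w_posture = 1, w_action = 2) {
  (w_posture * bac_posture + w_action * bac_action) / (w_posture + w_action)
}

# Toy example: the frequent label is always right, one rare label is always wrong.
truth = c("running", "running", "running", "running", "striking", "searching")
pred  = c("running", "running", "running", "running", "running",  "searching")

mean(pred == truth)             # plain accuracy: 5 of 6 correct
balanced_accuracy(pred, truth)  # (1 + 0 + 1) / 3
```

The toy example shows why the metric matters for unbalanced labels: plain accuracy rewards getting the frequent label right, while balanced accuracy is pulled down by the rare label that is always missed.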

Overview of the solution

The approach to the task boils down to an extensive feature engineering step for time series data, before learning a set of classifiers. Along the way, there are a couple of interesting details to discuss. Since the final solution consisted of three slightly different Random Forest models that do not differ too much, I’ll describe just one of them.

Classification with two dependent class attributes

One of the interesting aspects of the challenge is the fact that we need to predict two dependent classes. In my approach, I performed a stepwise classification. In the first step, I predict the main posture of a fireman. In the second step, the particular activity is predicted based on the training set and the predicted label from the first step. Thanks to this approach, you can capture the hierarchical dependency between labels. Naturally, there are a number of other ways to deal with the two-class tagging problem. For instance, one could train two independent classifiers or concatenate the two labels. However, the approach of chaining two classifiers yielded better results in my case.
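
A minimal sketch of such a chained (stepwise) classification is given below. It uses a single decision tree from the rpart package in place of the Random Forest used in the actual solution, and the data, feature names (f1, f2) and label sets are simulated purely for illustration.

```r
library(rpart)

set.seed(1)
n = 300
# Hypothetical toy data: a posture label and an action label that depends on it.
posture = factor(sample(c("moving", "standing"), n, replace = TRUE))
action = factor(ifelse(posture == "moving",
                       sample(c("running", "walking"), n, replace = TRUE),
                       sample(c("striking", "no action"), n, replace = TRUE)))
train = data.frame(
  f1 = rnorm(n, mean = ifelse(posture == "moving", 2, -2)),
  f2 = rnorm(n, mean = as.integer(action)),
  posture = posture,
  action = action
)

# Step 1: predict the posture from the features.
fit_posture = rpart(posture ~ f1 + f2, data = train, method = "class")
# Step 2: predict the action from the features plus the posture label.
fit_action = rpart(action ~ f1 + f2 + posture, data = train, method = "class")

# At prediction time the posture is unknown, so the predicted posture
# from step 1 is fed into the step 2 model.
predict_chain = function(newdata) {
  newdata$posture = predict(fit_posture, newdata, type = "class")
  newdata$action = predict(fit_action, newdata, type = "class")
  newdata[, c("posture", "action")]
}

new_obs = train[1:10, c("f1", "f2")]
res = predict_chain(new_obs)
res
```

The design choice here is that the second model sees the posture as an ordinary input feature, which lets it learn that, for example, running only occurs together with moving.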

Drift between training and test data distribution

Another issue that came with the data was the fact that the activities in the training and test sets were performed by different firemen. This posed a real challenge. An important part of successful participation in any data mining competition is being able to set up a local evaluation framework that is in line with the one employed in the contest. Here, a natural solution would be to perform a stratified cross-validation over different firemen. However, no identifier of the fireman performing a particular activity was provided. Hence, whether I liked it or not, I had to rely predominantly on preliminary evaluation scores that were based on 10% of the test data during the competition (the final evaluation was done on the other 90% of the test data). Of course, this was a problem not only for me but also for all the other contestants. When I talked to them at a conference workshop following the competition, they said they had also relied mainly on preliminary evaluation results, as evaluation on the training data yielded far too optimistic scores.

Feature engineering

The main effort during the competition was devoted to extracting interesting features describing the underlying time series (called signals). There are a couple of basic statistics that you can derive from a signal: mean, standard deviation, skewness, kurtosis and quantiles. I derived quantiles on a relatively rich grid: 0.01, 0.05, 0.1, …, 0.95, 0.99. Because some of the activities are periodic, I thought it would be useful to apply tools dedicated to periodic signals. I processed each signal with the Fourier transform and also computed periodograms. From these transformed signals I once again extracted basic summary statistics. Another feature which is quite simple and proved useful in classification is the correlation between signals. Intuitively, when you are running, the recordings of corresponding devices attached to your legs should be negatively correlated. Finally, I made some effort to identify peaks in the data. The idea is that for different activities, e.g., running or striking, we can observe a different number of “peaks” in the signal. Peak identification is a problem that is easy to state but hard to define mathematically. In the end, I settled on a simple method based on counting the chunks of a time series where it exceeds its mean by one or two standard deviations.
To battle the drift between training and test data, one should try to design generic (not subject-specific) features. For instance, the quantiles of the distribution of acceleration depend heavily on a given person’s running pace and motor abilities. Presumably, these statistics are going to differ a lot from person to person. On the other hand, if you derive a correlation between acceleration recordings on the left and right leg, this correlation may turn out to vary less between different firemen! This is a desirable property of a feature, as the activities in the test data were performed by a different set of people than those in the training set.
Feature extraction was the most tedious part of the solution, but I believe a worthwhile one. I derived a set of almost 5,000 features describing each single activity. The next step is to train a model based on these features that learns to distinguish between different activities.
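
To make the kinds of features listed above more concrete, here is a small base R sketch. It is not the author's code: the function names, the shortened quantile grid and the simulated sine-wave signals are assumptions chosen for illustration.

```r
# Count contiguous chunks where the signal exceeds mean + k * sd.
count_peaks = function(x, k = 1) {
  above = x > mean(x) + k * sd(x)
  runs = rle(above)
  sum(runs$values)   # number of runs of TRUE
}

# Summary features for one signal (a numeric vector).
signal_features = function(x) {
  q = quantile(x, probs = c(0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99),
               names = FALSE)
  z = (x - mean(x)) / sd(x)
  # Magnitudes of the Fourier coefficients; index i corresponds to
  # i full cycles over the length of the signal.
  spec = Mod(fft(x - mean(x)))[2:(length(x) %/% 2)]
  c(mean = mean(x), sd = sd(x),
    skewness = mean(z^3), kurtosis = mean(z^4),
    q = q,
    spec_mean = mean(spec), spec_max = max(spec),
    dominant_freq = which.max(spec),
    peaks_1sd = count_peaks(x, 1), peaks_2sd = count_peaks(x, 2))
}

# Simulated periodic signals for the left and right leg, in antiphase.
t = 1:200
left  = sin(2 * pi * 5 * t / 200)
right = sin(2 * pi * 5 * t / 200 + pi)

features = signal_features(left)
features[c("dominant_freq", "peaks_1sd")]
cor(left, right)   # strongly negative for legs moving in antiphase
```

The correlation between the two legs does not depend on the amplitude of the movement, which is the kind of subject-independent behaviour the paragraph on drift argues for.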

Let’s vote

If a group of experts is to decide on an important matter, it is often the case that collectively they can make a better decision. As each of them looks at the problem from a slightly different perspective, they can jointly arrive at a more refined judgment. This idea is brilliantly exploited in the Random Forest algorithm, which is an ensemble of decision trees. A large number of trees are trained on diverse subsamples of data so that their joint prediction, made by majority voting, usually yields higher accuracy than any single model. I employed this model to solve the problem of activity recognition.
Another appealing property of Random Forest is that it has an inherent method of selecting relevant attributes. With such a rich set of extracted features, some of them are certainly only mildly useful. I handed over the task of selecting the most relevant ones to the model itself.
As already mentioned in the introduction, the distribution of labels in the data was fairly unbalanced. Recall that our solutions are evaluated against a balanced accuracy evaluation metric. Doing a poor job predicting some label yields the same penalty regardless of how frequent that label is in the data. To account for this, each tree in the forest was trained on a stratified subsample of the data, where each label was present in an equal proportion. This prevented the forest from focusing too much on the most prevalent labels, and gave a major improvement in the score.
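
The stratified subsampling idea can be sketched in base R as follows. The helper name balanced_sample is hypothetical; the label counts are taken from the joint distribution table above (running, throwing hose and searching). The randomForest package also offers strata and sampsize arguments for drawing per-tree samples of this kind, but whether the author used them is not stated.

```r
# Draw row indices so that every label contributes the same number of rows.
# By default each label contributes as many rows as the rarest label has.
balanced_sample = function(labels, n_per_label = min(table(labels))) {
  idx_by_label = split(seq_along(labels), labels)
  unlist(lapply(idx_by_label, function(idx) {
    idx[sample.int(length(idx), n_per_label)]
  }), use.names = FALSE)
}

set.seed(42)
labels = rep(c("running", "throwing hose", "searching"),
             times = c(4324, 1164, 459))

# One balanced subsample, as could be drawn for a single tree.
idx = balanced_sample(labels)
table(labels[idx])
```

Each tree would get its own fresh draw, so across the whole forest the frequent labels are still well covered even though every single tree sees them in equal proportion to the rare ones.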

Summary

Summing up, the competition was a very exciting experience. I would like to thank all the participants, as they made the contest a great event. Also, I want to thank the organizing committee from the University of Warsaw and the Main School of Fire Service for providing such an interesting dataset and setting up the competition. The winning solution yielded a balanced accuracy of 84% which was enough to beat other contestants’ solutions. Certainly, there is still some room for improvement, yet we took a small step toward increasing the safety of firemen at a fire scene.

Jan Lasek
(deepsense.ai Machine Learning Team)

About the Author:

Jan Lasek, Data Scientist at deepsense.ai, is also pursuing his PhD at the Institute of Computer Science, a part of the Polish Academy of Sciences. He graduated from Warsaw University where he studied both at the Faculty of Mathematics and the Faculty of Economic Sciences.

Data mining of the votes of Members of Parliament

October 22, 2015/in Data science /by Przemyslaw Biecek

The 7th term of the Sejm has come to an end. It would be nice to see how the Members of the Polish Parliament voted over these last 4 years! In total they took part in over 6000 votings. Did the representatives of the same clubs vote more similarly to each other? Did the Members of Parliament who changed clubs vote differently from the members of their former clubs? Let’s see!

In order to display the similarity between the votes of the Members of Parliament we will use a technique known to geneticists, namely a phylogenetic tree. Such diagrams are employed to present similarities between sequences of DNA/proteins of various organisms/genes. In our analyses the Members of Parliament will stand for organisms and their votes will serve as their DNA. We will build the phylogenetic tree for the Members of Parliament.
Phylogenetic trees are based on similarities between the objects presented on them. In this case we will compare levels of similarity between votes cast by the Members of Parliament. First, let us see how that similarity is calculated. As an example let us take six Members of Parliament and a dozen or so votings (here we will discuss 14 votings concerning the Act on Infertility Treatment). Each Member of Parliament may choose between the following options: vote For, vote Against, Abstain from voting (which is a ‘delicate’ Against), or simply be absent during the voting. Let us encode these options with numbers: +2, -2, -1 and 0 respectively, or with colours: blue, red, yellow and light blue. The following graphic presents the votes of each of the six Members of Parliament during each of the votings under analysis. The left part of the diagram shows the similarity between the votes. During these votings J. Palikot and L. Miller voted in the same manner; E. Kopacz voted similarly; J. Piechocinski voted in a less similar but still fairly alike way. The other two Members of Parliament, B. Szydlo and J. Gowin, voted similarly to each other but quite differently from the remaining four. The distance between the tree branches corresponds to the differences between the voting profiles.

Now we increase the number of votings from 14 to 6000. The vector of votes is longer, but the similarity is calculated in the same way, using the Euclidean distance.
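
As a rough sketch of this calculation in base R: the votes below are made up, the MP names are placeholders, and average-linkage hierarchical clustering is an assumption (the text does not say which tree-building method was used).

```r
# Hypothetical votes of four MPs in five votings, encoded as in the text:
# For = +2, Against = -2, Abstain = -1, Absent = 0.
votes = rbind(
  MP_A = c( 2,  2, -2,  2,  0),
  MP_B = c( 2,  2, -2,  2,  2),
  MP_C = c(-2, -2,  2, -1, -2),
  MP_D = c(-2, -2,  2, -2, -2)
)

# Euclidean distance between every pair of voting profiles.
d = dist(votes, method = "euclidean")

# Build a tree from the distances and cut it into two groups.
tree = hclust(d, method = "average")
groups = cutree(tree, k = 2)
groups

# plot(tree) would draw the corresponding dendrogram.
```

MPs with nearly identical profiles end up on neighbouring branches, and an absence (0) sits halfway between For and Against, so it moves an MP away from both camps by the same amount.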
One diagram cannot present all 6000 votings clearly, so we do not draw them; the only things presented on the diagrams are the names of the Members of Parliament. To make the diagram more legible, next to each name I put the names of all the clubs that the given Member of Parliament belonged to during the 7th term of the Sejm. Colours represent the clubs in which politicians spent most of the 7th term. Below you may see a fragment of the tree. It shows that J. Zalek and J. Gowin voted in a rather similar way, but very differently from the rest of their club. The tree also allows us to notice that both of them cast most of their votes while they belonged to PO, but both of them also belonged to ZP and KPSP. Members of Parliament from PSL and PO usually voted quite alike and for that reason they belong to the same subtree.

We may also present the whole tree, although it has many leaves. When we include the MPs who left the Sejm and those who were elected to it, we have over 500 names. As we may see, the Members of Parliament from PO in most cases voted similarly. They form a separate subtree together with PSL. PiS, with a part of the right wing, also forms its own subtree. The remaining two subtrees represent SLD and Twoj Ruch/Ruch Palikota.

The same tree may be presented in various manners; above, for example, we may see a more packed version of it.

However, a little fan like the one above is far more comprehensible. There is more space for names of the Members of Parliament.
If we take into consideration all the votings, we will notice that the greatest differences exist between the government and the opposition. It turns out that there are two Members of Parliament on the side of the government (Jaroslaw Gowin and Jacek Zalek) who usually voted like MPs from PO (the colour on the diagram corresponds to the club that a given Member of Parliament belonged to during most votings), yet their profiles differ considerably from the profiles of the remaining MPs. Besides, that pair migrated from one party to another, which may explain their incompatibility with the stance of PO. As far as PiS is concerned, the least compliant voters were Artur Gorski and Jan Tomaszewski (who finally transferred to PO at the end of the year).
There are many more such interesting stories where an MP’s voting profile is incompatible with that of the ‘main’ club; they can be found in every club. Just look for them for a while.

When we look at votings on specific acts, the pictures tend to change. Below you may see an example diagram concerning the votings on the Personal Income Tax Act.

Here you may look at the results of the votings on the Act on Higher Education. PO and PSL voted so similarly that the Members of Parliament belonging to both clubs are mixed. SLD and TR create their own trees. There are also groups of MPs, both within PO and PiS, who in all 53 votings (that is the number of votings on that act) voted identically (their tree branches are very short).

Computer scientists sometimes joke that, unlike normal trees, their trees grow upside down.
As it turns out, the trees created by data analysts may grow in any direction. Or in every direction at the same time!
The R packages cluster, ggdendro and ape were used for the cluster analysis.
The source code can be found on GitHub: https://github.com/mi2-warsaw/JakOniGlosowali/tree/master/glosy.
Data on the votings may be found in the package called sejmRP: https://github.com/mi2-warsaw/sejmRP.

Do cats or dogs live longer?

October 15, 2015/in Data science /by Przemyslaw Biecek

Some time ago our herd grew by one guinea pig called Hugo. It turns out that having a pet at home is a great pretext for discussing with children the concepts of randomness, distribution functions and distributions in general.

And this is how it started:
— Dad, how long do guinea pigs live?
— On average from 5 to 7 years (google).
— And do other home pets live longer or shorter?
— Mice have the shortest lives: they live for one to three years, but then rodents are short-lived in general. Parrots, for example, may live up to twenty or even forty years. And tortoises may live as long as seventy years.

— What about dogs, how long do they live?
— It depends on the breed – bigger ones live shorter and smaller ones live longer, but usually it is between eight and sixteen years.
— Does it mean that all dogs die once they reach their sixteenth year of life?
— Well… (a longer pause at google+R+ggplot), no, that is only a typical lifespan. Some dogs live longer, some even up to twenty-four years.

— And do cats or dogs live longer?
— Well… (a longer pause at google+R+ggplot). Among cats there are many more individuals who live longer than 20 years.

— So who will live longer, Bromba (our friends’ dog) or Mufinka (our friends’ cat)?
— Nobody knows how long a particular cat or dog will live. It depends on the quality of care provided by their owners and on many other factors. We can say something more about larger groups of animals. For example, if one gathered 100 dogs, it could be expected that half of them would live longer than 12 years. But if one gathered 100 cats, it would turn out that half of them would live over 14 years. And every fifth cat would live over 18 years, which is longer than the lifespan of almost any dog.

The numbers presented here are based on the VetCompass database. It concerns animals that visit the vet, that is, animals that are ”better taken care of”.

Statistician like a shoemaker

October 9, 2015/in Data science /by Przemyslaw Biecek

Children bring strange homework assignments from school, like for example the question: What is your dad’s job similar to? After several attempts (a cosmonaut, a Formula 1 driver, a firefighter) it turns out that the work performed by a statistician is very similar to the work of a shoemaker. Why?

I don’t mean that a shoemaker drinks a lot and swears when something doesn’t work out (read: swears a lot). These are stereotypes. I mean that we can identify similar subgroups within both craft guilds.
A craftsman – seller. He gets a machine for making shoes (or ready-made shoes) and he sells them. He knows a bit about shoes, for example, that there is a right and a left shoe and that if the client puts them on the wrong way round, he will not like it. He can show the shoes from their best side, and when he lacks the right size, he claims that after some period of wear the shoes will become looser or will shrink, etc. Who cares that the method assumes normality while the data are as skewed as the Tower of Pisa? Surely, it is very robust and the impact of unmet assumptions is marginal. Too few observations, and the power of the test barely reaches the level of significance? So what? It rains very rarely in this area, so the shoe will last for a long time.
A craftsman – artist. He knows best what his client needs. The client brought a beautiful rabbit skin and wanted high winter boots, but the artist made a beautiful cap. No, please, believe me, such data can serve as a source for such analyses. No, you don’t need a test for two means. In this case PCA will serve ideally. You will be satisfied and the cap looks fantastic.
A craftsman – theoretician. He does not make shoes but he examines what would happen to the shoes in some untypical or impossible scenarios. For example, how would the shoe’s lifespan change if one ran at a speed approaching infinity, or if one jumped on the surface of Mars, or if one had not two but 314 feet?
A craftsman – handyman. Clients come to him with shoes with holes in the soles and he patches them up and sews them up. He makes up medians or adds some Wilcoxon test results. Maybe the final result is a little scary for people with taste, but the clients are able to go home and do not have to look for a skin for a new pair of shoes.
A craftsman – future man. He notices that people have got bored with shoes and he is looking for new and more attractive niches. He does shoe science and is interested only in large sizes of shoes, because such shoes entail technological challenges. The wall of his workshop is decorated with a black belt in six shoe sigma and he commutes riding an elephant.
In general, they make up a funny group which meets from time to time in order to exchange notes and decide which skin makes nicer, more comfortable, more stable or easier-to-make shoes.
Have I omitted some subgroups of this guild?

Multilevel classification, Cohen kappa and Krippendorff alpha

September 24, 2015/in Data science /by Przemyslaw Biecek

I was facing an interesting problem last week. Playing with data from The Cancer Genome Atlas (full genetic and clinical data for thousands of patients), I was building a classifier that predicts the type of cancer based on sets of genetic signatures.

In the PANCAN33 subset there are samples for 33 different types of cancer. The classifier should be able to assign a new sample to one of these 33 classes.
I’ve tried different methods like random forest, SVM, bgmm and a few others, and ended up with a collection of classifiers. How to choose the best one?
We need a method that computes the agreement between classifier predictions and true labels/cancer types. For binary classifiers there are a lot of commonly used metrics like precision, recall, accuracy etc. But here we have 33 classes. The confusion matrix has 33×33 cells, which is a lot of numbers to compare.
Of course there are some straightforward solutions, like the fraction of samples for which the classifier correctly guesses the true label. But such easy solutions suffer a lot if the distribution of classes is unequal (which is quite common). Such metrics may be high for a dummy classifier like: always vote for the most common class. It is better to avoid such metrics.
Are there other measures of agreement that we can use?
Actually, I used two interesting ones: Cohen’s kappa and Krippendorff’s alpha. They take into account the distribution of votes for each rater. Moreover, Krippendorff’s alpha takes into account missing data.
Both coefficients are widely used by psychometricians (e.g. to assess how well two psychiatrists agree on a diagnosis). We use them in order to estimate the performance of the classifier. Both coefficients are implemented in the irr package.
Below you will find an example application.

library(irr)
kappa2(cbind(predictions, trueLabels))
# Cohen's Kappa for 2 Raters (Weights: unweighted)
#
# Subjects = 3599
#   Raters = 2
#    Kappa = 0.941
#
#        z = 160
#  p-value = 0 
kripp.alpha(rbind(predictions, trueLabels))
# Krippendorff's alpha
#
# Subjects = 3599
# Raters = 2
# alpha = 0.941
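
To see why a chance-corrected coefficient is preferable to plain accuracy, unweighted Cohen’s kappa can also be computed by hand from the confusion matrix. The sketch below is an illustration only: the helper name cohen_kappa and the toy labels are made up and are not the cancer data from the post.

```r
# Unweighted Cohen's kappa for two vectors of labels.
cohen_kappa = function(pred, truth) {
  lv = union(pred, truth)
  tab = table(factor(pred, levels = lv), factor(truth, levels = lv))
  n = sum(tab)
  # observed agreement: share of samples on the diagonal
  p_observed = sum(diag(tab)) / n
  # agreement expected by chance, given the marginal label frequencies
  p_expected = sum(rowSums(tab) * colSums(tab)) / n^2
  (p_observed - p_expected) / (1 - p_expected)
}

# Unbalanced toy data: 90 samples of class A, 10 of class B.
truth = rep(c("A", "B"), times = c(90, 10))
# Dummy classifier that always votes for the most common class.
dummy = rep("A", 100)

mean(dummy == truth)       # accuracy looks good: 0.9
cohen_kappa(dummy, truth)  # kappa is 0: no agreement beyond chance
```

The dummy classifier reaches 90% accuracy simply because class A dominates, while kappa correctly reports that it carries no information about the true label.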

Biplots, correspondence analysis and ggplot2

September 18, 2015/in Data science /by Przemyslaw Biecek

I was looking for biplots created with the ggplot2 library (because they look good and are customisable). It turns out that there are some nice solutions for PCA (like sinhrks/ggfortify; kassambara/factoextra; vqv/ggbiplot; fawda123/ggord), but I could not find a suitable solution for correspondence analysis. So I created one. It’s available in the pbiecek/ggplotit package and works for both the CA{FactoMineR} and ca{ca} functions. You will find the source of this function below, but let’s start with an example.

I’m going to use data about car sale offers from the PogromcyDanych package.
Let’s see what the relation is between a brand and a type of fuel.
Can you guess for which brands oil (diesel) is more common than gas?

Let’s see.
Porsche, Mini and Smart – these brands are mostly gas only.
Daewoo, Dodge – here you will find LPG-fuelled cars.
LandRover, Audi, Volkswagen – here oil is most common.
This example was created with these lines:

library(PogromcyDanych)
library(ggplotit)
library(FactoMineR)
# contingency matrix for cars
tab = table(auta2012$Marka,  auta2012$Rodzaj.paliwa)
tab = tab[rowSums(tab) > 300, c(1,2,6)]
# correspondence analysis
obj = CA(tab)
ggplotit(obj, c(FALSE,TRUE), list(rownames(tab), c("Gas", "Gas+LPG", "Oil")))

And here is the full source of the function:

# requires ggplot2 (loaded together with the ggplotit package)
ggplotit = function(x, arrows = c(FALSE, FALSE), names = NULL, ...) {
  stopifnot(length(arrows) == 2)
  stopifnot(length(names) == 2 || is.null(names))
  # coordinates of rows and columns on the first two dimensions
  X = as.data.frame(x$row$coord[, 1:2])
  Y = as.data.frame(x$col$coord[, 1:2])
  if (!is.null(names)) {
    X$Names = names[[1]]
    Y$Names = names[[2]]
  } else {
    X$Names = rownames(x$row$coord)
    Y$Names = rownames(x$col$coord)
  }
  colnames(X) = c("x.Dim1", "x.Dim2", "x.Names")
  colnames(Y) = c("y.Dim1", "y.Dim2", "y.Names")
  # rows in blue, columns in red, reference lines through the origin
  pl = ggplot() +
    geom_text(data = X, aes(x.Dim1, x.Dim2, label = x.Names),
              color = "blue", size = 3) +
    geom_text(data = Y, aes(y.Dim1, y.Dim2, label = y.Names),
              color = "red", size = 3) +
    geom_hline(yintercept = 0, alpha = 0.5) +
    geom_vline(xintercept = 0, alpha = 0.5) +
    theme_bw()
  # optional arrows from the origin to row points
  if (arrows[1]) {
    pl = pl + geom_segment(data = X,
                           aes(x = 0, xend = x.Dim1, y = 0, yend = x.Dim2),
                           color = "blue", arrow = arrow(angle = 15))
  }
  # optional arrows from the origin to column points
  if (arrows[2]) {
    pl = pl + geom_segment(data = Y,
                           aes(x = 0, xend = y.Dim1, y = 0, yend = y.Dim2),
                           color = "red", arrow = arrow(angle = 15))
  }
  pl
}

Are you in favour of abolition of compulsory education for six-year-old children and return to compulsory education for seven-year-old children?

September 10, 2015/in Data science /by Przemyslaw Biecek

1st September was just a few days ago. After the reform ‘lowering the age at which children start their school education’, the second group of 6- and 7-year-old children started attending first-year classes. And since we are in ‘pre-election’ mode, there are some voices calling for a reform reestablishing the previous age for starting school education.

We are going to use R, Wikipedia and packages: htmltable, ggplot2 and archivist to juxtapose these ideas with demographic data. We will use easily accessible data, that is, birth statistics.
Let’s use the htmltable package to download from Wikipedia the birth statistics for Poland from the last 50 years. Then let’s use the ggplot2 package and bar charts to present the number of births (chart on the right-hand side).
There is a clear decreasing tendency, but some local ‘waves’ are also visible on the chart. The more numerous age cohorts of 1975-1985 were followed by another rise in births around 2008.

library(archivist); library(ggplot2); library(scales)
# access the dataset through archivist hook
births = archivist::aread("pbiecek/graphGallery/5a6c2a732c20d5a1bebe6507ebf09afa")
# rok = year, urodzenia = number of births
ggplot(births, aes(x=rok, ymin=0, ymax=urodzenia)) +
  geom_linerange(size=2) + theme_bw() +
  scale_y_continuous(labels=comma) + scale_x_continuous(limits=c(1965,2015)) +
  ggtitle("Number of births in last 50 years")

Since we are interested in the ‘present’, we will limit the period to the last 15 years (chart on the right-hand side).
We can easily see that more children were born between 2008 and 2010.
Let us look at this data through the prism of the education reform. If we want to lower the age at which children start school education, we must accumulate x+1 age groups in x years.
Which x would you choose and which age groups would you accumulate in that way?
Unfortunately, in reality the accumulation fell on the age groups 2007-2009, that is, the age groups from the slight demographic boom.

The chart on the right-hand side presents a theoretical result of such accumulation. Theoretical, because due to mortality and migration not all of the children born in Poland go to schools in Poland. What is more, for several years the parents of six-year-olds were allowed to decide if they wanted to send their child to school a year sooner.
A year before the compulsory lowering of the school age only 20% of parents decided to take that step. All of that means that the estimates presented here diverge to some extent from the actual number of children who will attend the first year. These numbers are not completely known at the moment, as some of them belong to the future.

All right then, what about reestablishing the previous age for starting school education?
Let us assume that in the referendum (we know that it will not happen soon, but let’s do a thought experiment) most of the voters decide that the school age should be increased, and the government follows the majority vote and quickly adopts a relevant act. It would mean that now one age group would have to be divided in two.
The chart on the right-hand side displays a theoretical result of such action (with the same reservation that the estimation is very rough). If we have another reform, it will unfortunately hit a demographic decline. After 6 years of schools almost coming apart at the seams, accommodating over 150% of the average number of students and working in shifts, suddenly the first year will have 3.3 times fewer pupils.

What does this look like in other countries?
Let’s download World Bank data on changes in the age of entering primary education and use the great networkD3 package to visualize how different countries behave. We see a general tendency: more and more countries decide to lower the school starting age to 6.

archivist::aread("pbiecek/graphGallery/af0b5130b2db42b23c61d3ab373d7946")

Diagnosing diabetic retinopathy with deep learning

September 3, 2015/in Data science, Deep learning, Machine learning /by Robert Bogucki

What is the difference between these two images?

The one on the left has no signs of diabetic retinopathy, while the other one has severe signs of it.

If you are not a trained clinician, chances are you will find it quite hard to correctly identify the signs of this disease. So, how well can a computer program do it?
In July, we took part in a Kaggle competition, where the goal was to classify the severity of diabetic retinopathy in the supplied images of retinas.
As we’ve learned from the organizers, this is a very important task. Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
The contest started in February, and over 650 teams took part in it, fighting for the prize pool of $100,000.
The contestants were given over 35,000 images of retinas, each having a severity rating. There were 5 severity classes, and the distribution of classes was fairly imbalanced. Most of the images showed no signs of the disease. Only a few percent had the two most severe ratings.
The metric with which the predictions were rated was a quadratic weighted kappa, which we will describe later.
The contest lasted till the end of July. Our team scored 0.82854 on the private leaderboard, which gave us 6th place. Not too bad, given our quite late entry.
You can see our progress on this plot:

Also, you can read more about the competition here.

Solution overview

As should be no surprise in an image recognition task, most of the top contestants used deep convolutional neural networks (CNNs), and so did we.
Our solution consisted of multiple steps:

image preprocessing
training multiple deep CNNs
eye blending
kappa score optimization

We briefly describe each of these steps below. Throughout the contest we used multiple methods for image preprocessing and trained many nets with different architectures. When ensembled together, they gave only a small gain over the best preprocessing method combined with the best network architecture. We therefore limit ourselves to describing the single best model. If you are not familiar with convolutional networks, check out this great introduction by Andrej Karpathy: http://cs231n.github.io/convolutional-networks/.

Preprocessing

The input images, as provided by the organizers, were produced by very different equipment, had different sizes and very different colour spectra. Most of them were also way too large to perform any non-trivial model fitting on them. The minimum preprocessing needed to make network training possible is to standardize the dimensions, but ideally one would want to normalize all other characteristics as well. Initially, we used the following simple preprocessing steps:

Crop the image to the rectangular bounding box containing all pixels above a certain threshold
Scale it to 256×256 while maintaining the aspect ratio and padding with black background (the raw images have black background as well, more or less)
For each RGB component separately, remap the colour intensities so that the CDF (cumulative distribution function) looks as close to linear as possible (this is commonly called “histogram equalization”)

All these steps can be achieved in a single call of ImageMagick’s command line tool. In time, we realized that some of the input images contain regions of rather intense noise. When using the simple bounding-box cropping described above, this leads to very bad quality crops, i.e. the actual eye occupying an arbitrary and rather small part of the image.
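The per-channel remapping from the third step can be illustrated in a few lines of plain Python. This is only a sketch of the idea, not the ImageMagick call used in the contest; the function name `equalize_channel` and the flat-list representation of a channel are assumptions made for the example.

```python
def equalize_channel(pixels, levels=256):
    """Remap integer intensities so that their CDF becomes roughly linear.

    `pixels` is a flat list of ints in [0, levels) for one colour channel.
    """
    # Histogram of intensities.
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    # Cumulative distribution (as counts).
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    n = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant channel: nothing to stretch.
        return list(pixels)

    # Lookup table mapping each old intensity to its rescaled CDF value.
    lut = [max(0, round((c - cdf_min) / (n - cdf_min) * (levels - 1)))
           for c in cdf]
    return [lut[value] for value in pixels]


channel = [10, 10, 20, 30]
print(equalize_channel(channel))
```

Applied to each of the R, G and B channels separately, the darkest value present is mapped to 0 and the brightest to 255, with the rest spread according to their rank in the distribution.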

You can see gray noise at the top of the image. Using state-of-the-art edge detectors, e.g. Canny, did not help much. Eventually, we developed a dedicated cropping procedure. This procedure chooses the threshold adaptively, exploiting two assumptions based on analysis of the provided images:

There always exists a threshold level separating noise from the outline of the eye
The outline of the eye has an ellipsoidal shape, close to a circle, possibly truncated at the top and bottom. In particular it is a rather smooth curve, and one can use this smoothness to recognize the best values for the threshold

The resulting cropper produced almost ideal crops for all images, and is what we used for our final solutions. We also changed the target resolution to 512×512, as it seemed to significantly improve the performance of our neural networks compared to the smaller 256×256 resolution.
Here is what the preprocessed image looks like.

Just before passing the images to the next stage we transformed the images so the mean of each channel (R, G, B) over all images is approximately 0, and standard deviation approximately 1.
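A minimal sketch of this per-channel standardization, in plain Python. The representation of an image as a list of `(r, g, b)` tuples and the helper names `channel_stats` and `standardize` are assumptions made for the illustration; a real pipeline would work on arrays.

```python
import math


def channel_stats(images):
    """Mean and standard deviation of each channel over all pixels of all images."""
    n = 0
    sums = [0.0, 0.0, 0.0]
    squares = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            n += 1
            for c in range(3):
                sums[c] += pixel[c]
                squares[c] += pixel[c] ** 2
    means = [s / n for s in sums]
    stds = [math.sqrt(max(q / n - m * m, 0.0)) for q, m in zip(squares, means)]
    return means, stds


def standardize(image, means, stds, eps=1e-8):
    """Shift and scale every channel to roughly zero mean and unit deviation."""
    return [tuple((pixel[c] - means[c]) / (stds[c] + eps) for c in range(3))
            for pixel in image]


images = [[(0, 0, 0), (2, 4, 6)]]
means, stds = channel_stats(images)
print(standardize(images[0], means, stds))
```

The statistics are computed once over the whole training set and then reused unchanged for the test images.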

Convnet architecture

The core of our solution was a deep convolutional neural network. Although we started with fairly shallow models (4 convolutional layers), we quickly discovered that adding more layers, and more filters inside layers, helps a lot. Our best single model consisted of 9 convolutional layers.
The detailed architecture is:

| Type    | Filters     | Units     |
|---------|-------------|-----------|
| Conv    | 16          |           |
| Conv    | 16          |           |
| Pool    |             |           |
| Conv    | 32          |           |
| Conv    | 32          |           |
| Pool    |             |           |
| Conv    | 64          |           |
| Conv    | 64          |           |
| Pool    |             |           |
| Conv    | 96          |           |
| Pool    |             |           |
| Conv    | 96          |           |
| Pool    |             |           |
| Conv    | 128         |           |
| Pool    |             |           |
| Dropout |             |           |
| FC1     |             | 96        |
| FC2     |             | 5         |
| Softmax |             |           |

All Conv layers have a 3×3 kernel, stride 1 and padding 1. That way the size (height, width) of the output of the convolution is the same as the size of the input. Every convolutional layer is followed by a batch normalization layer and a ReLU activation. Batch normalization is a simple but powerful method to normalize the pre-activation values in the neural net, so that their distribution does not change too much during training. One often standardizes the input data to have zero mean and unit variance; batch normalization takes this a step further. Check this paper by Google to learn more.
Our Pool layers always use max pooling. The size of the pooling window is 3×3, and the stride is 2. That way the height and width of the image get halved by each pooling layer.
In the FC (fully connected) layers we again use ReLU as the activation function. The first fully connected layer, FC1, also employs batch normalization. For regularization we used a Dropout layer before the first fully connected layer, and L2 regularization applied to some of the parameters.
Overall, the net has 925,013 parameters.
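The parameter count can be roughly checked from the table with a short script. This is a sketch under several assumptions that the text does not state: the input is a 448×448 RGB crop, pooling uses padding 1 (which is what makes a 3×3 window with stride 2 halve the size exactly), each conv layer carries three per-channel parameters (for instance a bias plus a batch normalization scale and shift), and each FC layer carries one extra parameter per unit. Other splits of the non-weight parameters would fit the same total.

```python
def out_size(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# Layer sequence read off the architecture table.
LAYERS = [
    ("conv", 16), ("conv", 16), ("pool",),
    ("conv", 32), ("conv", 32), ("pool",),
    ("conv", 64), ("conv", 64), ("pool",),
    ("conv", 96), ("pool",),
    ("conv", 96), ("pool",),
    ("conv", 128), ("pool",),
]


def count_parameters(crop=448, in_channels=3, fc1_units=96, n_classes=5):
    size, channels = crop, in_channels
    weights = 0
    extras = 0  # biases and batch-norm parameters (assumed split, see above)
    for layer in LAYERS:
        if layer[0] == "conv":
            filters = layer[1]
            weights += channels * filters * 3 * 3
            extras += 3 * filters              # assumption: 3 per channel
            size = out_size(size, kernel=3, stride=1, pad=1)
            channels = filters
        else:
            size = out_size(size, kernel=3, stride=2, pad=1)  # assumed pad
    fc1_inputs = size * size * channels
    weights += fc1_inputs * fc1_units + fc1_units * n_classes
    extras += fc1_units + n_classes            # assumption: 1 per FC unit
    return size, weights, weights + extras


final_size, n_weights, n_total = count_parameters()
print(final_size, n_weights, n_total)  # 7 923280 925013
```

Under these assumptions the feature map entering FC1 is 7×7×128, and the first fully connected layer alone accounts for about two thirds of all parameters.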
We trained the net using stochastic gradient descent with momentum and multiclass logloss as the loss function. The learning rate was adjusted manually a few times during the training. We used our own implementation based on Theano and Nvidia cuDNN.
To further regularize the network, we augmented the data during training by taking random 448×448 crops of the images and flipping them horizontally and vertically, independently with probability 0.5. At test time, we took a few such random crops and flips for each eye and averaged our predictions over them. Predictions were also averaged over multiple epochs.
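The augmentation described above can be sketched as follows. The nested-list image representation and the function names `random_crop_flip` and `average_predictions` are assumptions made for the example, not the original Theano code.

```python
import random


def random_crop_flip(image, crop=448, rng=random):
    """Take a random crop and flip it horizontally / vertically with p = 0.5 each.

    `image` is a list of rows, each row a list of pixels, at least crop x crop.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]   # horizontal flip
    if rng.random() < 0.5:
        patch = patch[::-1]                    # vertical flip
    return patch


def average_predictions(predictions):
    """Average several predicted class distributions for the same eye."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


# Toy 8x8 "image" whose pixels record their own coordinates.
image = [[(r, c) for c in range(8)] for r in range(8)]
patch = random_crop_flip(image, crop=6, rng=random.Random(0))
print(len(patch), len(patch[0]))
print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```

At test time the same function would be called a few times per eye, and the network outputs for the resulting patches passed to `average_predictions`.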
It took quite a long time to train and compute test predictions even for a single network. On a g2.2xlarge AWS instance (using Nvidia GRID K520) it took around 48 hours.

Eye blending

At some point we realized that the correlation between the scores of two eyes in a pair was quite high. For example, the percentage of eye pairs for which the score for the left eye is the same as for the right one is 87.2%. For 95.7% of pairs the scores differ by at most 1, and for 99.8% by at most 2. There are two likely reasons for this kind of correlation.
The first is that the retinas of both eyes were exposed to the damaging effects of diabetes for the same amount of time, and are similar in structure, so the conjecture is that they should develop retinopathy at a similar rate. The less obvious reason is that the ground truth labels were produced by humans, and it is conceivable that a human expert is more likely to give the same image different scores, depending on the score of the other image of the pair.
Interestingly, one can exploit this correlation between the scores of a pair of eyes to produce a better predictor.
One simple way is to take the predicted distributions D_L and D_R for the left and right eye respectively and produce new distributions using linear blending, as follows. For the left eye we predict c⋅D_L+(1-c)⋅D_R, and for the right eye we predict c⋅D_R+(1-c)⋅D_L, for some c in [0, 1]. We tried c = 0.7 and a few other values. Even this simple blending produced a significant increase in our kappa score.
However, a much bigger improvement was gained when, instead of an ad-hoc linear blend, we trained a neural network. This network takes two distributions (i.e. 10 numbers) as inputs, and returns the new “blended” version of the first 5 inputs. It can be trained using predictions on validation sets. As for the architecture, we decided to go with a very strongly regularized (by dropout) one with two inner layers of 500 rectified linear nodes each.
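The linear blend can be written down directly from the formulas above. A minimal sketch, with `blend_pair` as a hypothetical helper name and made-up example distributions:

```python
def blend_pair(d_left, d_right, c=0.7):
    """Linearly blend the predicted class distributions of a pair of eyes.

    Returns (new_left, new_right), where
    new_left  = c * D_L + (1 - c) * D_R
    new_right = c * D_R + (1 - c) * D_L
    """
    new_left = [c * l + (1 - c) * r for l, r in zip(d_left, d_right)]
    new_right = [c * r + (1 - c) * l for l, r in zip(d_left, d_right)]
    return new_left, new_right


# Example: the network is unsure about the left eye but confident about the right.
d_left = [0.40, 0.35, 0.15, 0.07, 0.03]
d_right = [0.05, 0.80, 0.10, 0.03, 0.02]
left, right = blend_pair(d_left, d_right, c=0.7)
print(left)
print(right)
```

Since both inputs sum to 1 and the weights c and 1-c sum to 1, the blended outputs are again valid probability distributions.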
One obvious idea is to integrate the convolutional networks and the blending network into a single network. Intuitively, this could lead to stronger results, but such a network might also be significantly harder to train. Unfortunately, we did not manage to try this idea before the contest deadline.

Kappa optimization

Quadratic weighted kappa (QWK), the evaluation metric chosen by the organizers, seems to be a standard one in the area of retinopathy diagnosis, but from the point of view of mainstream machine learning it is very unusual. The score of a submission is defined to be one minus the ratio between the total squared error of the submission (TSE) and the expected squared error (ESE) of an estimator that answers randomly with the same distribution as the submission (look here for a more detailed description).
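To make the definition concrete, here is a small plain-Python sketch of QWK computed as 1 – TSE/ESE. The function name and the handling of the degenerate case (all labels identical) are choices made for this example.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK = 1 - TSE / ESE for integer labels in [0, n_classes)."""
    n = len(y_true)

    # Total squared error of the submission.
    tse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

    # Label histograms of the truth and of the submission.
    hist_true = [0] * n_classes
    hist_pred = [0] * n_classes
    for t in y_true:
        hist_true[t] += 1
    for p in y_pred:
        hist_pred[p] += 1

    # Expected squared error if the submitted labels were assigned at random,
    # keeping both histograms fixed.
    ese = sum(hist_true[i] * hist_pred[j] * (i - j) ** 2
              for i in range(n_classes)
              for j in range(n_classes)) / n

    if ese == 0:
        # Only happens when truth and submission are the same constant label.
        return 1.0
    return 1.0 - tse / ese


print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
print(quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1]))        # 0.0
```

A perfect submission scores 1, a submission no better than random relabelling scores about 0, and because the errors are squared, confusing class 0 with class 4 costs 16 times as much as confusing class 0 with class 1.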
This is a rather hard function to optimize directly. Therefore, instead of trying to do that, we use a two-step procedure. We first optimize our models for multiclass logloss. This gives a probability distribution for each image. We then choose a label for each image by using a simulated annealing based optimizer. Of course we cannot really optimize QWK without knowing the actual labels. Instead, we define and optimize a proxy for QWK, in the following way. Recall that QWK = 1 – TSE/ESE. We estimate both TSE and ESE by assuming that the true labels are drawn from the distribution described by our prediction, and then plug these estimates into the QWK formula instead of the true values. Note that both TSE and ESE are underestimated by the procedure above. These two effects cancel each other out to some extent; still, our predicted QWK values were off by quite a lot compared to the leaderboard scores.
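The two-step idea can be illustrated with a toy version of the label optimizer. This is a sketch of one possible reading of the procedure, not the code used in the contest: the exact form of the proxy, the cooling schedule, the single-label proposal move and the names `proxy_qwk` and `anneal_labels` are all assumptions.

```python
import math
import random


def proxy_qwk(probs, labels):
    """Estimate QWK, treating the true labels as drawn from the predicted `probs`."""
    n, k = len(probs), len(probs[0])
    # Expected total squared error of the chosen labels.
    tse = sum(p[t] * (label - t) ** 2
              for p, label in zip(probs, labels) for t in range(k))
    # Expected histogram of true labels and actual histogram of chosen labels.
    hist_true = [sum(p[t] for p in probs) for t in range(k)]
    hist_pred = [0] * k
    for label in labels:
        hist_pred[label] += 1
    ese = sum(hist_true[i] * hist_pred[j] * (i - j) ** 2
              for i in range(k) for j in range(k)) / n
    if ese == 0:
        return 0.0  # degenerate case, kappa undefined; treated as 0 here
    return 1.0 - tse / ese


def anneal_labels(probs, steps=2000, t_start=0.1, seed=0):
    """Pick one label per image by simulated annealing on the proxy score."""
    rng = random.Random(seed)
    k = len(probs[0])
    # Start from the most probable class of every image.
    labels = [max(range(k), key=lambda c: p[c]) for p in probs]
    current = proxy_qwk(probs, labels)
    best_labels, best = list(labels), current
    for step in range(steps):
        temperature = t_start * (1 - step / steps) + 1e-9  # linear cooling
        i = rng.randrange(len(labels))
        old = labels[i]
        labels[i] = rng.randrange(k)                       # propose a change
        candidate = proxy_qwk(probs, labels)
        accept = (candidate >= current or
                  rng.random() < math.exp((candidate - current) / temperature))
        if accept:
            current = candidate
            if candidate > best:
                best, best_labels = candidate, list(labels)
        else:
            labels[i] = old                                # undo the change
    return best_labels, best


probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.4, 0.4, 0.2]]
labels, score = anneal_labels(probs)
print(labels, score)
```

Because the best labelling seen so far is tracked separately, the returned proxy score is never lower than that of the starting point, which here is the per-image mode.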
That said, we found no better way of producing submissions. In particular, the optimizer described above outperforms all the ad-hoc methods we tried, such as: integer-rounded expectation, mode, etc.

We would like to thank the California Healthcare Foundation for being a sponsor, EyePACS for providing the images, and Kaggle for setting up this competition. We learned a lot and were happy to take part in the development of tools that can potentially help diagnose diabetic retinopathy. We are looking forward to solving the next challenge.

deepsense.ai Machine Learning Team

You’re doing it wrong: surveys concerning the referendum which is to take place on 6th September

August 27, 2015/in Data science /by Przemyslaw Biecek

I recently came across the presentations of the results of the surveys concerning participation in the planned referenda published by the portal Gazeta.pl.

My attention was caught by the diagram presented below which displays a distribution of answers to the question: “Are you going to vote in the referendum scheduled for 6th September?”

[source]
I guess that most of us would like to check on such a diagram whether the group of eager voters is bigger than the group of reluctant voters or not.
Is it easy to read such information from this graph?
Absolutely not.
At least three mistakes should be corrected.
Mistake 1. Choice of colours.
The two most similar colours on the diagram are yellow and orange. They correspond to the two most extreme answers: ‘definitely not’ and ‘definitely yes’.
Remember: when you present ordered values, the colour scale should be ordered as well.
Mistake 2. Order of the elements in the legend and on the diagram.
The right-hand side of the graph is devoted to the answers ‘rather yes’ / ‘definitely yes’. The right-hand side of the legend concerns the answers ‘rather no’ / ‘definitely no’.
Remember: Make your legend easily understandable. It should be closely adjusted to the diagram.
Mistake 3. Position of the ‘don’t know’ element.
As a neutral element ‘don’t know’ should be placed between ‘rather yes’ and ‘rather no’.
Remember: Use symmetries. If ‘definitely yes’ and ‘definitely no’ began exactly at the top point of the diagram, it would be much easier to determine whether more people answered ‘yes’ or ‘no’.