Data science Archives - Page 7 of 10

eXtreme Gradient Boosting vs Random Forest [and the caret package for R]

November 27, 2015/in Data science /by Przemyslaw Biecek

Decision trees are cute. It is easy to visualize them, easy to explain, easy to apply and even easy to construct. Unfortunately they are quite unstable, particularly for large sets of correlated features.

Fortunately, there are some solutions that may help. One of the most popular solutions is to create a random forest, an ensemble of trees that vote independently, each tree is build on bootstrap sample of observations and subset of features. The other interesting approach is to use a gradient boosting method, to create a collection of trees that optimize the cases that are badly predicted by previous trees. Also one may use bagging instead of boosting so there are much more choices.

For me, the random forest if one of favorite tools when it comes to genetic data (because of OOB, proximity scores and feature importance scores). But recently here and there more and more discussions starts to point the eXtreme Gradient Boosting as a new sheriff in town.
So, let’s compare these two methods.
The literature shows that something is going on. For example Trevor Hastie said that
Boosting > Random Forest > Bagging > Single Tree
You will find more details on slides, and if you prefer videos rather than slides with math, you can watch this example.
I am going to use the caret package (a really really great package) to compare both methods. Random forest have tag “rf” while gradient boosting “xgbTree“.
I am going to use data from The Cancer Genome Atlas Project (next generation sequencing, expression of mRNA, 33 different tumors, 17000+ features, 300+ cases, 33 different classes) and the classifier should predict the type of cancer based on gene expression (Actually I am interested in genetic signatures, but classification is the first step).
Dataset is going to be divided into a testing and training data and the whole procedure will be replicated hundreds times to see what is the variability in model performance.
With the caret package, the training is so easy that I’ve added boosted logistic regression and SVM ,,just in case’’.

library(caret)
mat = lapply(c("LogitBoost", 'xgbTree', 'rf', 'svmRadial'),
          function (met) {
  train(subClasTrain~., method=met, data=smallSetTrain)
})

So, what are the results?
So, last week I’ve compared these two methods based on Walmart Recruiting Trip from Kaggle. There the goal was to classify a trip to one of 34 types of trips. It was easier to get good results with the use of random forest rather than boosting gradient. But let’s see what is happening with the cancer data.
Below you can see the distribution of accuracies (not a perfect measure, but here it is not a bad one, either) for random splits into the testing/training dataset.

So, it looks like for this dataset the random forest is doing better.
The ‘train’ function has a great argument ‘tuneGrid’, you can specify grid of parameters to be tested. Results may be different for different parameters and of course different datasets.

R vs SAS vs SPSS

November 19, 2015/in Data science /by Przemyslaw Biecek

Such titles, in many cases, are just introductions to flam wars. But not on this blog. Today we are going to illustrate some subtle differences among three statistical packages, R/SAS/SPSS. Small differences, but sometimes even a very small difference may have large consequences. So it is worth to know such things.

In statistics it is not that uncommon that different estimators may be used for the same parameter. A typical example is the standard deviation with two widely used estimators (biased/unbiased). But do you know, that for skewness and for kurtosis there are three common estimators? And for quantiles there is even more, namely 9 different estimators?
And the bizarre thing is that for different statistical packages different estimators are selected as the default ones?
Let’s have a more detailed look.
Skewness / kurtosis
To calculate these two statistics in R, one can use functions skewness and kurtosis from the package e1071. Both functions have additional parameter type to select which estimate of skewness / kurtosis should be calculated.
In R the default option is type=3, but in SAS and SPSS by default equivalents of type=2 are calculated.

x = runif(101)
sapply(1:3, skewness, x=x, na.rm=T)
# [1] 0.1245367 0.1264220 0.1226917
sapply(1:3, kurtosis, x=x, na.rm=T)
# [1] -1.116490 -1.111956 -1.153602

Quantiles
In R in order to calculate qunatiles one can use function quantile. It has an additional argument type, which takes values from 1 to 9. Each option is a different estimator for quantiles. In R by default the definition 7 is used. But for SAS you shall expect results equivalent to type=3, while for SPSS results equivalent with type=6.

sapply(1:9, function(q) quantile(x, 0.01, type=q))
1%         1%         1%         1%         1%         1%         1%         1%         1%
0.02272536 0.02272536 0.01426692 0.01435151 0.01858073 0.01443609 0.02272536 0.01719918 0.01754457

Contrasts
In R, to fit a linear model one usually uses the lm function. The argument contrasts specifies what contrasts are used for qualitative variables. The default contrasts in R are contr.treatment while in SAS you shall expect results equal to these obtained with contr.SAS.

 lm(Sepal.Width~Species, data=iris, contrasts = contr.SAS)$coef
#      (Intercept) Speciesversicolor  Speciesvirginica
#            3.428            -0.658            -0.454
 lm(Sepal.Width~Species, data=iris, contrasts = list(Species=contr.SAS))$coef
#(Intercept)    Species1    Species2
#      2.974       0.454      -0.204

Take home
Even basic statistics like skewness or kurtosis may be calculated in a different way in different statistical packages.
If we are building an analytical solution that is based on R/SAS/SPSS we shall be aware of the possibility that for the same statistic default settings for different packages may lead to different results.

multidplyr: first impressions

November 13, 2015/in Data science /by Przemyslaw Biecek

Two days ago Hadley Wickham tweeted a link with introduction to his new package multidplyr. Basically it’s a tool to take advantage of many cores for dplyr operations. Let’s see how to play with it.

What you can do with multidplyr?

As it was described on GitHub website, multidplyr is a library for processing data that is distributed across many cores, with the use of dplyr verbs.
The idea is kind of similar to spark. Similar solutions exists for R, and some of them are available for years (like RMPI, distribute, parallel or many other from the list https://cran.r-project.org/web/views/HighPerformanceComputing.html). But the problem with them is that they are made mainly for hackers. It is not that unusual to get an error with 20 lines of traceback without any warning.
Packages from Hadleyverse come with nicer design (as sometimes Apple products do), explode less often. With slightly smaller functionality we get more fun.
The multidplyr is still in the dev phase and sometimes it can make you really angry. But there are a lot of things that you can do well with it, and often also you can do them really fast.
In the multidplyr vignette you will find examples of playing with flights dataset. There are just 300k+ observations so it turns out that the overload related with the data distribution is larger than the computation time. But for larger datasets or for more complicated calculations you should expect some gain from heating of additional cores.

Use case

Right now, I am playing with log files from many bizarre devices. But the point is that there is a lot of rows of data (few hundreds of millions) and logs are stored in many many relatively small files. So I use multidplyr to read the data in parallel and do some initial pre-processing. The cluster is built on 15 cores and everything is done in plain R. It turns out that I can reduce the processing time from one day to two hours. So, it is an improvement. Even if you count the time that you need to spend learning the multidplyr (not that much if you know dplyr and spark).
Let’s see the example step by step.
First, initiate cluster with 15 nodes (one node per one core).

cluster = create_cluster(15)
## Initializing 15 core cluster.
set_default_cluster(cluster)

Find all files with the extension ‘log’. Data is there.

lf = data.frame(files=list.files(pattern = 'log', recursive = TRUE),
stringsAsFactors = FALSE)

Now, I need to define a function that reads a file and do some preprocessing. This function is then sent to all nodes in the cluster.

readAndExtractTimepoints = function(x) {
tmp = readLines(as.character(x)[1])
ftmp = grep(tmp, pattern='Entering scene', value=TRUE)
substr(ftmp,1,15)
}

cluster_assign_value(cluster, ‘readAndExtractTimepoints’, readAndExtractTimepoints)
Time to initiate some calculations. The list of file names is partitioned across nodes and for each file the readAndExtractTimepoints is executed. The result is an object of the class party_df (again it’s one file per row).

lf_distr = lf %>%
partition() %>%
group_by(files) %>%
do(timepoints =  readAndExtractTimepoints(.$files))
lf_distr
## Source: party_df [897 x 3]
## Groups: PARTITION_ID, files
## Shards: 15 [59--60 rows]
##
##    PARTITION_ID                     files   timepoints
##           (dbl)                     (chr)        (chr)
## 1             1 2013/01/cnk02a/cnk02a.log
## 2             1 2013/01/cnk02b/cnk02b.log
## 3             1   2013/01/cnk06/cnk06.log
## 4             1   2013/01/cnk07/cnk07.log
## 5             1   2013/01/cnk09/cnk09.log
## 6             1   2013/01/cnk10/cnk10.log
## 7             1 2013/01/cnk100/cnk100.log
## 8             1   2013/01/cnk11/cnk11.log
## 9             1   2013/01/cnk15/cnk15.log
## 10            1   2013/01/cnk16/cnk16.log

Results are ready to be collected and transformed into a classical list.

timeP = collect(lf_distr)
str(timeP$timepoints)
## List of 897
##  $ : chr [1:144830] "Jan  1 08:15:57 " "Jan  1 18:04:37 " "Jan  1 18:05:44 " "Jan  2 08:15:57 " ...
##  $ : chr [1:123649] "Jan  1 08:16:05 " "Jan  2 08:16:05 " "Jan  2 09:46:08 " "Jan  2 09:46:13 " ...
##  $ : chr [1:137661] "Jan  1 08:15:57 " "Jan  2 08:15:57 " "Jan  2 09:34:47 " "Jan  2 09:35:45 " ...

General impressions

I guess that one can speed up the whole process even further with the use of python or spark. But if the dataset is not huge then it is much easier to maintain a process that is using just a single technology/language.
Overall I like the multidplyr even if it still looks like a prototype. Sometimes things get nasty, like for example when you try to chain few different do() operations. But knowing the ‘Hadley’ effect I expect that it will be better and better with every version.
Finally, soon we should expect a solution for parallel processing that can be used by normal people not only by hackers.

Machine Learning for Greater Fire Scene Safety

October 22, 2015/in Data science, Machine learning /by Jan Lasek

The lives of brave firemen are threatened during dangerous emergency missions while they try to save other people and their property. In this post I would like to share my experiences and winning strategy for the AAIA’15 Data Mining Competition: Tagging Firefighter Activities at a Fire Scene, in which I took first place.

The competition was organized jointly by the University of Warsaw and the Main School of Fire Service, in Warsaw, Poland. It lasted over 3 months during which 79 contestants submitted a total of 1,840 proposals with solutions on the competition’s hosting platform Knowledge Pit.
I particularly enjoy competitions with a potentially big impact – when something more than only a high accuracy score is at stake. This competition definitely had a flavor of this – the participants were asked to contribute toward the safety of firefighters on the scene during an emergency mission.

The challenge

It is certainly helpful for decision making during an emergency when you know what particular activity the members of a rescue team are currently engaged in. This was the goal of the competition – develop a model that recognizes what activity a fireman is performing based on sensory data from his body movements and a collection of statistics monitoring his vital functions. Actually, we are facing two dependent multiclass classification problems. The first class is the main posture of the fireman and the second one is his particular action. Here is a sample of the data the contestants were given:

posture	action	avg-ecg1	…	ll-acc-x	ll-acc-y	…	torso-gyro-z
stooping	manipulating	-0.03	…	-6.98	10.41	…	28.49
standing	signal water first	-0.04	…	-9.41	0.11	…	63.84
moving	running	-0.04	…	-8.75	3.81	…	-52.92
crawling	searching	-0.03	…	-36.61	2.74	…	-134.26
stooping	manipulating	-0.04	…	-3.00	2.23	…	-7.21

The first two columns present the two class attributes: the posture and the main action of the fireman. Each activity is described by ca. 2 second-time series of sensory data from accelerometers and gyroscopes and certain statistics on fireman’s vital functions. In total, there are 42 such statistics as well as 42 different time series. Moreover, as usual, you are given two datasets: “train” and “test”. In the training data, you are given instances along with labels of activities, just as exemplified in the table above. In the test data, the labels are not present and you are asked to design a model for automatic tagging of those activities. To select the best performing approach from participants’ proposals, the performance of a given model on the test set was taken into account (in terms of an evaluation metric discussed below). You can find more information on the competition at its hosting platform.
The number of possible activities is restricted to the set of labels by the competition organizers. There are five labels in the first class, and 16 in the second one. Moreover, the labels are dependent. Let us see their joint distribution.

	crawling	crouching	moving	standing	stooping
ladder down	0	0	465	0	0
ladder up	0	0	476	0	0
manipulating	0	1764	331	2356	1898
no action	0	87	0	490	0
nozzle usage	0	492	0	443	0
running	0	0	4324	0	0
searching	459	0	0	0	0
signal hose pullback	0	0	0	98	0
signal water first	0	0	41	496	0
signal water main	0	46	0	405	0
signal water stop	0	0	0	277	0
stairs down	0	0	644	0	0
stairs up	0	0	1157	0	0
striking	0	0	0	1022	0
throwing hose	0	0	0	234	930
walking	0	0	1064	0	0

For example, there are 4,324 instances in the data where a fireman is moving and running, and 234 instances where a fireman is standing and throwing a hose. Surely, there are many other activities that someone from the rescue team can engage in, however, the dataset was restricted to this particular subset. It may come as a big disappointment, but there were no “saving cat” label. As such, the competition was set up as a standard supervised learning task: we are given a training set of activities along with their tags. In the test set, we are to tag activities based on what we’ve learned from the examples in the training set.
Another thing to note is the fact that the distribution of labels is fairly unbalanced. For instance, a fireman is about four times more likely to be running than throwing a hose. This should be carefully considered, especially in the context of the evaluation metric adopted in the competition.
The chosen metric was balanced accuracy. It is defined in the following way. First, for a given label we define accuracy of predictions

Next, the balanced accuracy score for class C with L labels is equal to the average accuracy among its labels

Finally, since we have two dependent class attributes, we compute a weighted average of balanced accuracy scores for posture and action classes:

A higher weight is attached to the accuracy of classification of the more granular class action.

Overview of the solution

The approach to the task boils down to an extensive feature engineering step for time series data, before learning a set of classifiers. Along the way, there are a couple of interesting details to discuss. Since the final solution consisted of three slightly different Random Forest models that do not differ too much, I’ll describe just one of them.

Classification with two dependent class attributes

One of the interesting aspects of the challenge is the fact that we need to predict two dependent classes. In my approach, I performed a stepwise classification. In the first step, I predict the main posture of a fireman. In the second step, the particular activity is predicted based on the training set and the predicted label from the first step. Thanks to this approach, you can capture the hierarchical dependency between labels. Naturally, there are a number of other ways to deal with the two-class tagging problem. For instance, one could train two independent classifiers or concatenate the two labels. However, the approach of chaining two classifiers yielded better results in my case.

Drift between training and test data distribution

Another issue that came with the data was the fact that the activities in training and test set were performed by different firemen. This posed a real challenge. An important part of successful participation in any data mining competition is that you are able to set-up a local evaluation framework that is in-line with the one employed in the contest. Here, a natural solution would be to perform a stratified cross-validation over different firemen. However, no identifier of a fireman for a particular activity was provided. Hence, regardless of whether I liked it or not, I had to rely predominantly on preliminary evaluation scores that were based on 10% of the data during the competition (the final evaluation was done on the other 90% of test data). Of course, this was a problem not only for me but also for all the other contestants. As I talked to them at a conference workshop following the competition, they also relied mainly on preliminary evaluation results, as the evaluation on the training data yielded far too optimistic scores.

Feature engineering

The main effort during the competition was devoted to the extraction of interesting features describing the underlying time series (called signals). There are a couple of basic statistics that you can derive from the signal: mean, standard deviation, skewness, kurtosis, quantiles. I derived quantiles on a relatively rich grid ranging from 0.01, 0.05, 0.1, …, 0.95, 0.99. Because some of the activities are periodic, I thought that it would be useful to utilize some tools dedicated to that task. I processed each signal by Fourier transform as well as computed periodograms. From these transformed signals I once again extracted basic summary statistics. Another feature which is quite simple and proved to be useful in classification is correlation between signals. Intuitively, when you are running, the recordings of corresponding devices attached to your legs should be negatively correlated. Finally, I made some effort to identify peaks in the data. The idea is that, in case of performing different activities, e.g., running or striking, we can observe a different number of “peaks” in the signal. Peaks identification is a problem that is easy to state but hard to define mathematically. At the end, I ended up with a simple method that was based on counting chunks of a time series where it exceeds its mean by one or two standard deviations.
To battle the drift between training and test data, one should try to design generic (not subject-specific) features. For instance, the quantiles of distribution of acceleration are heavily dependent on a given person’s running pace and his/her motoric abilities. Presumably, these statistics are going to differ much from person to person. On the other hand, if you derive a correlation between acceleration recordings on left and right leg, this correlation may turn out to vary less between different firemen! This is a desired property of a feature, as the activities in test data were performed by a different set of people than those in the training set.
Feature extraction was the most tedious part of the solution, but I believe a worthy one. I derived a set of almost 5,000 features describing each single activity. Now, the next step is to train a model based on these features that learns to distinguish between different activities.

Let’s vote

If a group of experts is to decide on an important matter, it is often the case, that collectively they can make a better decision. As each of them looks at the problem from a slightly different perspective, they can jointly arrive at a more refined judgment. This idea is brilliantly explored in the Random Forest algorithm, which is an ensemble of decision trees. A large number of trees are trained on diverse subsamples of data so that their joint prediction, made by majority voting, usually yields higher accuracy than each single individual model. I employed this model to solve the problem of activity recognition.
Another appealing property of Random Forest is that it has an inherent method of selecting relevant attributes. Having extracted a quite rich set of features, it is certainly the case that some of them are only mildly useful. I handed over the task of selecting the most relevant ones to the model itself.
As already mentioned in the introduction, the distribution of labels in the data was fairly unbalanced. Recall that our solutions are evaluated against a balanced accuracy evaluation metric. Doing a poor job predicting some label, yields the same penalty regardless of its distribution in the data. To account for this, each tree in the forest was trained on a stratified subsample of the data, where each label was present in an equal proportion. This preserved the forest from focusing too much on the most prevalent labels, and gave a major improvement in the score.

Summary

Summing up, the competition was a very exciting experience. I would like to thank all the participants, as they made the contest a great event. Also, I want to thank the organizing committee from the University of Warsaw and the Main School of Fire Service for providing such an interesting dataset and setting up the competition. The winning solution yielded a balanced accuracy of 84% which was enough to beat other contestants’ solutions. Certainly, there is still some room for improvement, yet we took a small step toward increasing the safety of firemen at a fire scene.

Jan Lasek
(deepsense.ai Machine Learning Team)

About the Author:

Jan Lasek, Data Scientist at deepsense.ai, is also pursuing his PhD at the Institute of Computer Science, a part of the Polish Academy of Sciences. He graduated from Warsaw University where he studied both at the Faculty of Mathematics and the Faculty of Economic Sciences.

Data mining of the votes of Members of Parliament

October 22, 2015/in Data science /by Przemyslaw Biecek

7th term of the Sejm has already come to its end. It would be nice to see how have the Members of Polish Parliament voted for these last 4 years! In total they took part in over 6000 votings. Did the representatives of the same clubs voted more similarly to each other? Did the Members of Polish Parliament who changed the clubs they belonged to voted in a different way than the Members of Parliament from their former clubs? Let’s see!

In order to display the similarity between the votes of the Members of Parliament we will use a technique known to geneticists –namely, a phylogenetic tree. Such diagrams are employed to present similarities between sequences of DNA/proteins of various organisms/genes. In our analyses the Members of Parliament will stand for organisms and their votes will serve as their DNA. We will build the phylogenetic tree for the Members of Parliament.
Phylogenetic trees are based on similarities between objects presented on these trees. In this case we will compare levels of similarity between votes cast by the Members of Parliament. Firstly, we will learn about the manner in which that similarity is calculated. As an example let us take six Members of Parliament and a dozen or so votings (here we will discuss 14 votings concerning the Act on Infertility Treatment). Each Member of Parliament might choose between the following options: to vote For, Against, and Abstain from voting (what is a ‘delicate’ against), or simply be absent during the voting. Let us encode these options with numbers: +2, -2, -1 and 0 respectively, or with colours: blue, red, yellow and light blue. The following graphic presents the votes of each of the six Members of Parliament during each of the votings under analysis. The left part of the diagram shows the similarity between the votes. During these votings J. Palikot and L. Miller voted in the same manner; E. Kopacz voted similarly; J. Piechocinski voted in less similar but still fairly alike way. Other two Members of Parliament, B. Szydlo and J. Gowin voted in a similar way but quite differently from the remaining four. The distance between the tree branches corresponds to the voting profiles.

Now we increase the number of votings from 14 to 6000. The vector with votes is longer but the similarity is calculated in the same way -the Euclidean distance.
One diagram cannot present all 6000 votings clearly. We do not draw them; the only thing presented at the diagrams are names of the Members of Parliament. To make the diagram more legible, next to the names I put the name of all the clubs that given Members of Parliament belonged to during the 7th term of the Sejm. Colours represent clubs in which politicians spent most of the period of the 7th term. Below you may see a fragment of the tree. It shows that J. Zalka and J. Gowin voted in a rather similar way but very different than rest of their club. The tree also allows us to notice that both of them cast most of the votes while they belonged to PO, but both of them also belonged to ZP and KPSP. Members of Parliament from PSL and PO usually voted quite alike and for that reason they belong to the same subtree.

We may also present the whole tree, although it has many leaves. When we include the MPs who left the Sejm and were elected for the Sejm, we have over 500 names. As we may see, the Members of Parliament from PO in most cases voted similarly. They create a separate subtree with PSL. PiS with a part of the right wing also creates its own subtree. The remaining two subtrees represent SLD and Twoj Ruch/Ruch Palikota.

The same tree may be presented in various manners; above, for example, we may see a more packed version of it.

However, a little fan like the one above is far more comprehensible. There is more space for names of the Members of Parliament.
If we take into consideration all the votings, we will notice that the greatest differences exist between the government and the opposition. It turns out that there are two Members of Parliament on the side of the government (Jaroslaw Gowin and Jacek Zalek) who usually voted as MPs from PO (the colour on the diagram corresponds to the club that a given Member of Parliament belonged to during most votings), yet their profiles differ considerably from the profiles of the remaining MPs. Besides, that pair migrated from one party to another, what may explain their incompatibility with the stance of PO. As far as PiS is concerned, the least compliant voters were Gorski Artur and Tomaszewski Jan (who finally transferred to PO at the end of the year).
There are many more such interesting stories where a typical voting profile is incompatible with the ‘main’ club; they can be found in every club. Just look for them for a while.

When we look at votings on specific acts, the pictures tend to change. Below you may see an example diagram concerning voting on Personal Income Tax Act.

Here you may look at the results of the votings on the Act on Higher Education. PO and PSL voted so similarly that the Members of Parliament belonging to both clubs are mixed. SLD and TR create their own trees. There are also groups of MPs, both within PO and PiS, who in all 53 votings (that is the number of votings on that act) voted identically (their tree branches are very short).

You may see an enlarged version of every diagram by clicking on it.
Computer specialist sometimes joke that, unlike the normal trees, their trees grow upside down.
As it turns out the trees created by data analysts may grow in every direction. Or in every direction at the same time!
R packets- cluster, ggdendro and ape– were used during analysis of the clusters.
The source codes can be found at github: https://github.com/mi2-warsaw/JakOniGlosowali/tree/master/glosy.
Data on votings may be found in the package called sejmRP: https://github.com/mi2-warsaw/sejmRP.

Do cats or dogs live longer?

October 15, 2015/in Data science /by Przemyslaw Biecek

Some time ago our herd has expanded by a guinea pig called Hugo. It turns out that the presence of a pet at home is a great pretext for discussing with children the concepts of randomness, distribution functions and distribution in general.

And this is how it started:
— Dad, how long do the guinea pigs live?
— On average from 5 to 7 years (google).
— And do other home pets live longer or shorter?
— Mice have the shortest lives, they live for a year up to three years but then, rodents are short-lived in general. But parrots, for example, they may live up to twenty years, even forty years. And tortoises live even until seventy years.

— What about dogs, how long do they live?
— It depends on the breed – bigger ones live shorter and smaller ones live longer, but usually it is between eight and sixteen years.
— Does it mean that all the dogs die after they reach sixteenth year of life?
— Well… (a longer pause at google+R+ggplot), no, that is only a typical lifespan. Some dogs live longer, some even up to twenty four years.

— And do cats or dogs live longer?
— Well… (a longer pause at google+R+ggplot). Among cats there are many more specimen who live longer than 20 years.

— So who will live longer, Bromba (our friends’ dog) or Mufinka (our friends’ cat)?
— Nobody knows how long will a particular cat or dog live – it depends from the quality of care of their owners and from many other factors. We can say something more about more numerous groups of animals. For example, it one gathered 100 dogs, it could be expected that half of them would live longer than 12 years. But if one gathered 100 cats, then it would turn out that half of them would live over 14 years. And every fifth cat would live over 18 years, what is longer than a lifespan of any dog really.

The presented numerical data is based on the data base VetCompass. It concerns animals visiting the vet, that is animals ”better taken care of”.

Statistician like a shoemaker

October 9, 2015/in Data science /by Przemyslaw Biecek

Children bring from school strange home assignments, like for example a question: What is your dad’s job similar to? After several hits (a cosmonaut, Formula 1 driver, firefighter) it turns out that the work performed by a statistician is very much similar to the work of a shoemaker. Why?

I don’t mean that a shoemaker drinks a lot and swears when something doesn’t work out (read: swears a lot). These are stereotypes. I mean that we can identify similar subgroups within both craft guilds.
A craftsman – seller. He gets a machine for making shoes (or ready-made shoes) and he sells them. He knows a bit about shoes, for example, that there is a right and left shoe and that if the client put them on wrongly, he will not like it. He can show the shoes from a better perspective and when he lacks the right size, he claims that after some period of wear the shoes will become looser or they will shrink etc. Who cares that the method assumes normality while the data are as skew as the Tower of Pisa? Surely, it is very robust and the impact of unrealized assumptions is marginal. Too little observation and the power of the test barely has the level of significance? And what of it? It rains very rarely in this area so the shoe will last for long.
A craftsman – artist. He knows best what his client needs. The client brought a beautiful rabbit skin and wanted high winter boots, but the artist made a beautiful cap. No, please, believe me, such data can serve as a source for such analyses. No, you don’t need a test for two means. In this case PCA will serve ideally. You will be satisfied and the cap looks fantastic.
A craftsman – theoretician. He does not make shoes but he examines what would happen to the shoes in some untypical or impossible scenarios. For example, how would the shoe’s lifespan change if one run with the speed coming close to infinity, or if one jumped on the surface of Mars, or if one had not two but 314 feet?
A craftsman – handyman. Clients come to him with shoes with holes in the soles and he patches them up, sews them up. He makes up medians or add some Wilcoxon test results. Maybe the final result is a little scary for people with taste but the clients are able to go home and do not have to look for a skin for a new pair of shoes.
A craftsman – future man. He notices that people’ve got bored with shoes and he is looking for new and more attractive niches. He does shoe science, is interested only in large sizes of shoes because such shoes entail technological challenges. The wall of his workshop is decorated with a black belt in six shoe sigma and he commutes riding an elephant.
In general, they make such a funny group which meets from time to time in order to exchange notes and decide which skin makes nicer, more comfortable, more stable or easier to make shoes.
Have I omitted some subgroups of this guild?

Multilevel classification, Cohen kappa and Krippendorff alpha

September 24, 2015/in Data science /by Przemyslaw Biecek

I was facing an interesting problem last week. Playing with data from The Genome Cancer Atlas (full genetic and clinical data for thousands of patients) I was building a classifier that predicts the type of cancer based on sets of genetic signatures.

In the PANCAN33 subset there are samples for 33 different types of cancer. And the classifier shall be able to classify a new sample to one of these 33 classes.
I’ve tried different methods like random forest, svm, bgmm and few others, and end up with collection of classifiers. How to choose the best one?
We need a method that computes an agreement between classifier predictions and true labels/cancer types. For binary classifiers there is a lot of commonly used metrics like precision, recall, accuracy etc. But here we have 33 classes. The confusion matrix is 33×33 cells large, a lot of number to compare.
Of course there are some straightforward solutions like fraction of samples on which classifier correctly guesses true labels. But such easy solutions suffer a lot if there is unequal distribution of classes (quite common). Such metrics may be high for dummy classifier like: always vote for most common class. It is better to avoid such metrics.
Are they other measures of agreement that we can use?
Actually I used two interesting ones – Cohen Kappa and Krippendorff Alpha. They take into account the distribution of votes for each rater. Moreover Krippendorff alpha takes into account missing data (find more information here).
Both coefficients are widely used by psychometricians (e.g. to asses how two psychiatrists agree on a diagnosis). We use them in order to estimate the performance of the classifier. Both coefficients are implemented in the irr package.
Below you will find an example application.

kappa2(cbind(predictions, trueLabels))
# Cohen's Kappa for 2 Raters (Weights: unweighted)
#
# Subjects = 3599
#   Raters = 2
#    Kappa = 0.941
#
#        z = 160
#  p-value = 0 
kripp.alpha(rbind(predictions, trueLabels))
# Krippendorff's alpha
#
# Subjects = 3599
# Raters = 2
# alpha = 0.941

Biplots, correspondence analysis and ggplot2

September 18, 2015/in Data science /by Przemyslaw Biecek

I was looking for biplots created with the use of ggplot2 library (because they look good and are customisable). It turns out that there are some nice solutions for PCA (like sinhrks/ggfortify; kassambara/factoextra; vqv/ggbiplot; fawda123/ggord) but I could not find suitable solution for correspondence analysis. So I create one. It’s available in pbiecek/ggplotit package and works for both CA{FactoMineR} and ca{ca} functions. You will find source of this function below, but let’s start with an example.

I’m going to use data about car sale offers from the PogromcyDanych package.
Let’s see what is the relation between a brand and a type of fuel.
Guess in which brands oil is more common than gas?

Let’s see.
Porsche, Mini and Smarts – these brands are mostly gas only.
Daewoo, Dodge – here you will find LPG fuelled cars.
LandRover, Audi, Volkswagen – here oil is most common.
This example was created by these lines:

library(PogromcyDanych)
library(ggplotit)
library(FactoMineR)
# contingency matrix for cars
tab = table(auta2012$Marka,  auta2012$Rodzaj.paliwa)
tab = tab[rowSums(tab) > 300, c(1,2,6)]
# correspondence analysis
obj = CA(tab)
ggplotit(obj, c(FALSE,TRUE), list(rownames(tab), c("Gas", "Gas+LPG", "Oil")))

And full source of the function

function (x, arrows = c(FALSE, FALSE), names = NULL, ...)
{
stopifnot(length(arrows) == 2)
stopifnot(length(names) == 2 | is.null(names))
X = as.data.frame(x$row$coord[, 1:2])
Y = as.data.frame(x$col$coord[, 1:2])
if (!is.null(names)) {
X$Names = names[[1]]
Y$Names = names[[2]]
}
else {
X$Names = rownames(x$row$coord)
Y$Names = rownames(x$col$coord)
}
colnames(X) = c("x.Dim1", "x.Dim2", "x.Names")
colnames(Y) = c("y.Dim1", "y.Dim2", "y.Names")
pl = ggplot() + geom_text(data = X, aes(x.Dim1, x.Dim2,
label = x.Names), color = "blue", size = 3) + geom_text(data = Y,
aes(y.Dim1, y.Dim2, label = y.Names), color = "red",
size = 3) + geom_hline(xintercept = 0, alpha = 0.5) +
geom_vline(yintercept = 0, alpha = 0.5) + theme_bw()
if (arrows[1]) {
pl = pl + geom_segment(data = X, aes(x = 0, xend = x.Dim1,
y = 0, yend = x.Dim2, label = x.Names), color = "blue",
arrow = arrow(angle = 15))
}
if (arrows[2]) {
pl = pl + geom_segment(data = Y, aes(x = 0, xend = y.Dim1,
y = 0, yend = y.Dim2, label = y.Names), color = "red",
arrow = arrow(angle = 15))
}
pl
}

Are you in favour of abolition of compulsory education for six-year-old children and return to compulsory education for seven-year-old children?

September 10, 2015/in Data science /by Przemyslaw Biecek

1st September was just few days ago. After the reform ‘lowering the age at which children start their school education’ the second group of 6 and 7-year-old children started attending the freshmen classes. And since we are in the ‘pre-election’ mode there are some votes about a reform reestablishing the previous age for starting school education.

We are going to use R, Wikipedia and packages: htmltable, ggplot2 and archivist to juxtapose these ideas with demographic data. We will use easily accessible data, that is, birth statistics.
Let’s use the htmltable package to download from Wikipedia birth statistics concerning Poland from the last 50 years. Then let’s use the ggplot package and bar charts to present the birth rate (chart on the right-hand side, click to enlarge).
There is a clear decreasing tendency, but some local ‘waves’ are also visible on the chart. After more numerous age cohorts from 1975-1985 more children were then born around 2008.

library(archivist); library(ggplot2); library(scales)
# access the dataset through archivist hook
births = archivist::aread("pbiecek/graphGallery/5a6c2a732c20d5a1bebe6507ebf09afa")
ggplot(births, aes(x=rok, ymin=0, ymax=urodzenia)) +
geom_linerange(size=2) + theme_bw() +
scale_y_continuous(label=comma) + scale_x_continuous(limits=c(1965,2015)) +
ggtitle("Number of births in last 50 years")

Since we are interested in the ‘present’, we will limit the period to the last 15 years (chart on the right-hand side, click to enlarge).
We can easily see that more children were born between 2008-2010.
Let us look at this data through the prism of the education reform. If we want to lower the age at which children start school education, we must accumulate x+1 age groups in x years.
Which x would you choose and which age groups would you accumulate in that way?
Unfortunately (in reality) the accumulation concerned the age groups 2007-2009 – the age groups from the slight demographic boom.

The chart on the right-hand side (click to enlarge) presents a theoretical result of such accumulation. Theoretical, because due to certain mortality rate and migration not all of the children born in Poland go to schools in Poland. What is more, for several years the parents of six-year-olds were allowed to decide if they wanted to send their child to school a year sooner.
A year before the compulsory lowering of the school age only 20% of parents decided to take that step. All of that means that the estimations presented here diverge to some extend from the actual number of children who will attend the first year. These numbers are not completely know at the moment as some of them belong to the future.

All right then, what about the reestablishing the previous age for starting school education?
Let us assume that in the referendum (we know that it will not happen soon, but let’s do a mind exercise) most of the voters will decide that the school age should be increased and the government will follow the majority vote and it will quickly adopt a relevant act. It will mean that now one age group would have to be divided in two.
The chart on the right-hand side (click to enlarge) displays a theoretical result of such action (still preserving the same reservation that the estimation is very rough). If we have another reform, it will unfortunately hit a demographic decline. After 6 years of schools almost coming apart at the seams accommodating over 150% of average number of students and working on shifts, suddenly the first year will have 3.3 times less pupils.

How this looks like in other countries?
Let’s download data from World Bank, data related to changes in the age of entering the primary education and let’s use a great package networkD3 to visualize how different countries behaves. We see the general tendency that more and more countries decide to lower the school starting age to 6.

archivist::aread("pbiecek/graphGallery/af0b5130b2db42b23c61d3ab373d7946")