Spark + R = SparkR
Spark is winning more and more hearts, and no wonder: reports from various sources describe speed-ups of an order of magnitude for the analysis of big datasets. Its well-developed system for caching objects in memory lets us avoid torturing hard disks during iterative operations on the same data.
Until recently, Spark applications had to be written in Java, Scala or Python, which I saw as a slight drawback of the platform. These are fine languages, but I use some advanced statistical algorithms available only in R, and I do not feel like rewriting them in Python or Java.
Fortunately, we now have a connector for R which allows us to integrate R and Spark seamlessly. The project started over a year ago on the initiative of Shivaram Venkataraman from Berkeley, who has been joined by many people contributing to the development of the SparkR package (see GitHub: http://amplab-extras.github.io/SparkR-pkg/).
Today I would like to share my first impressions after using that package.
Use-case
‘The problem’ seemed to be an ideal match for the processing profile of Spark. Briefly: I have MAAAAANY short time series (each consisting of several hundred observations) and I want to fit a regression model to each series. Then I would like to fit an ARIMA-type model to the residuals to obtain better predictions.
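For a single series the modelling itself is plain R; just to fix ideas, here is a minimal sketch with made-up column names (t for time, demand for the measured value) rather than my actual model:

# one short series: a linear trend plus an ARIMA-type model for the residuals
series <- data.frame(t = 1:200, demand = cumsum(rnorm(200)))

trend     <- lm(demand ~ t, data = series)                 # regression model for the series
res_arima <- arima(residuals(trend), order = c(1, 0, 1))   # ARIMA-type model for the residuals

# prediction: trend forecast corrected by the residual model
new_t <- data.frame(t = 201:210)
pred  <- predict(trend, newdata = new_t) + predict(res_arima, n.ahead = 10)$pred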
I started my tests with a local standalone installation consisting of only 1 master and 1 worker (a toy setup, really), each with 8GB of RAM. The Spark version I tested is a pre-built distribution for Hadoop 2.3; see https://spark.apache.org/downloads.html for details.
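For the record, connecting to such a setup boils down to passing the master URL to sparkR.init(); the snippet below is how I recall the API, with a placeholder host name:

library(SparkR)
# local mode, handy for quick experiments
sc <- sparkR.init(master = "local")
# for a standalone cluster (the host name is a placeholder):
# sc <- sparkR.init(master = "spark://master-host:7077")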
At first I tried to work with the MapR Sandbox (from http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop, where you can download a preinstalled Hadoop from MapR along with the whole zoo), but for some reason this version did not work satisfactorily. In the end I used plain Spark on a local file system; after all, my main goal was to test SparkR.
The beginnings were difficult. SparkR often threw exceptions, and the Java stack traces did not explain much either. Something was grinding somewhere in the machinery.
The documentation for the SparkR package is rather mediocre, so it took a while to figure out what the limitations of key-value pairs (the basic Spark structure) are and what happens when lists contain more than two slots.
Similar problems were caused by the fact that functions in the SparkR package are rather naughty and collide with functions from other packages, such as dplyr. If you load dplyr after SparkR, the collect function from dplyr (an S3 generic) masks the collect function from SparkR (an S4 generic), and you get a rather uninformative error.
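A simple workaround is to call the function with an explicit namespace prefix (mapRes below stands for any RDD created with SparkR):

library(SparkR)
library(dplyr)            # loaded after SparkR, so dplyr's collect masks SparkR's

# collect(mapRes)         # dispatches to dplyr::collect and fails with a cryptic error
SparkR::collect(mapRes)   # an explicit namespace prefix avoids the clash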
But after the phase of errors came the time when things actually started working. The basic structure you work with in Spark is the key-value pair. It is really convenient that both the key and the value may be an object of any type or structure, so you can pass a data.frame, a regression model or any other R object as the value.
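To illustrate, on the R side a key-value pair is just a two-element list, so nothing stops you from putting, say, a fitted model into the value slot (a toy example with simulated data):

# a key-value pair in SparkR is simply a two-element list: list(key, value)
df    <- data.frame(t = 1:100, y = rnorm(100))
model <- lm(y ~ t, data = df)
pair  <- list("series_42", model)   # key: an id, value: a whole fitted model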
It is also convenient that just a few lines of code are enough to launch R code on Spark. Most of the initialisation, serialisation and deserialisation is hidden inside the SparkR package.
I managed to implement the planned use case in a dozen or so lines of code, which I present below with the unnecessary details removed. Once you get through the initial difficulties, you discover that using SparkR is really very pleasant.
A few comments on the attached code:
The sparkR.init() function initiates a connection with the Spark backend; textFile() creates a connection with the data source, whether it lives in HDFS or in the local directory structure.
The flatMap() function converts the collection of lines into a collection of vectors of words. For each line it produces a vector of words (or, to be more precise, a list containing one vector, which is the format Spark expects).
You can then launch the lapply() function on this object; the supplied function is applied to each vector separately (remotely, on Spark). It should extract the most important elements from the vector of words and return them as a list of two elements: key and value. In my data each row is one measurement of the demand for property X of object Y at time T. As the key I chose the column with the X id, and as the value I saved a list with Y and T.
The groupByKey() function reshuffles the data so that pairs with the same key (the same X id) end up grouped together.
Next, map() processes the groups. The argument it receives is a list with the key as the first element and the list of values (here: a list of vectors) as the second element. It is fairly easy to transform such a list into a data table and run the predictions on it (a rough sketch of this step follows the code listing below).
Finally, you can use the collect() function to download the results back into R.
The whole procedure worked. Unfortunately, the standalone version of Spark accessed through SparkR is tragically, tragically, tragically slow.
Now I am going to test a bigger, real Spark cluster, and I hope that the change of backend will speed up the processing considerably.
# install_github("amplab-extras/SparkR-pkg", subdir="pkg")
library(SparkR)

# you need to have a properly configured Spark installation
sc <- sparkR.init()

# read data from local disk
linesRDD <- textFile(sc, "data.txt")

# preprocessing of the text file: split by tab, return a list of columns
words <- flatMap(linesRDD,
                 function(line) {
                   list(strsplit(line, split = "\t")[[1]])
                 })

# now extract keys and values
demandWithKey <- lapply(words,
                        function(word) {
                          list( ---here-key---, ---here-value--- )
                        })

# group by key
groups <- groupByKey(demandWithKey, 1L)

# do the mapping
mapRes <- map(groups,
              function(x) {
                # x[[1]] is the key
                # x[[2]] is the list of values
                # write your logic here
              })

# collect results, download them to R
collect(mapRes)
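To give a feel for the parts I stripped out, here is a rough sketch of how the elided key/value extraction and the per-group map step could look. The column positions (word[2] as the X id, word[3] and word[4] as Y and the demand measured at time T) and the trend-plus-ARIMA model are illustrative assumptions, not the exact code I ran:

# hypothetical key/value extraction: word[2] is the X id (key),
# word[3] and word[4] are Y and the demand measured at time T (value)
demandWithKey <- lapply(words, function(word) {
  list(word[2], c(y = word[3], demand = word[4]))
})

groups <- groupByKey(demandWithKey, 1L)

# hypothetical per-group step: rebuild a data.frame, fit a linear trend,
# then an ARIMA model for the residuals, and return both together with the key
mapRes <- map(groups, function(x) {
  vals <- x[[2]]                                   # list of c(y, demand) vectors
  df   <- data.frame(demand = as.numeric(sapply(vals, `[`, "demand")))
  df$t <- seq_len(nrow(df))
  trend     <- lm(demand ~ t, data = df)
  res_model <- arima(residuals(trend), order = c(1, 0, 0))
  list(x[[1]], list(trend = trend, residuals = res_model))
})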
Przemyslaw Biecek