Last week’s post showed how to download data on the ratings of over 200 television series. The ratings were broken down by the gender and age of the raters.

The application presented below lets you select any two age/gender groups of users and compare their ratings of particular television series. We can check which series were ranked highest and lowest by these groups and which series were rated most differently. There are some interesting differences between genders and especially between age groups. Click to open and have fun! Next week I will show you how to develop such an application. If you cannot start the application above, you can try opening it at https://smarterpoland.shinyapps.io/serialeIMDB_groups/.

Przemyslaw Biecek

You should not watch these movies with your wife / girl (Przemyslaw Biecek, 2015-03-19)

Data harvested from web pages is a source of interesting information. Pulling data used to require quite a lot of resilience and misshapen Perl scripts struggling with the messy sources of web pages. Today’s web pages comply with standards more and more often, and there are also more and more civilized tools for parsing them.

The rvest package has lately won my sympathy. Harvesting data from web pages with it is very easy. I would like to demonstrate it by scraping ratings of television series, broken down by the age group and gender of the reviewer. We will make use of this data next week; today I will only show how to download it. When you visit the Internet Movie Database (IMDb) and open the “user ratings” tab (see here for an example), you will find the ratings of a given movie broken down by age and gender groups. With rvest, downloading the data and parsing the HTML page into an HTML tree comes down to two lines.
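Those two lines might look as follows; this is a sketch using the rvest API as it was at the time (the html() function has since been renamed read_html()):

```r
library(rvest)

# download the ratings page and parse it into an HTML tree (here: Breaking Bad)
page <- html("http://www.imdb.com/title/tt0903747/ratings")
```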

We can then choose only the interesting elements/nodes from that tree. To select particular elements we will use an extraordinary tool, SelectorGadget (http://selectorgadget.com/). The only thing we need to do is click on the content we want to read and the content we want to ignore. SelectorGadget will suggest a CSS selector and highlight in yellow the elements that match it. The selector we need in the case of our films is “table:nth-child(11) td:nth-child(3)”.

Screenshot of the webpage with elements highlighted by SelectorGadget. All nodes that match the CSS selector are marked yellow.

In order to harvest the interesting data we need to: 1. select the nodes of the HTML tree that match the selector (using the function html_nodes), 2. extract the text from the nodes (using the function html_text). If the text contains something other than numbers, we can clean it with a regular expression. The three lines presented below reveal the ratings of a given movie (in this case Breaking Bad) across age groups and genders.
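A sketch of those three lines, reusing the selector found with SelectorGadget (the html() call re-downloads the page):

```r
library(rvest)

page   <- html("http://www.imdb.com/title/tt0903747/ratings")
nodes3 <- html_nodes(page, "table:nth-child(11) td:nth-child(3)")
# drop the header cell, then keep digits and dots only
ratings <- gsub(html_text(nodes3)[-1], pattern = "[^0-9.]", replacement = "")
```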

Seems so trivial! Let us repeat this for all the TV series. We will check which of them are preferred by women and which by men. We can also see which series gain recognition among younger audiences and which among older ones. Instead of series titles, the IMDb URLs use identification numbers; the identification number for Breaking Bad is tt0903747. Where can we find the identification numbers of the other series? We can use the serialeIMDB dataset from the PogromcyDanych package, which contains the identifiers of over 200 popular television series in the imdbId column. The following code downloads the ratings page of each series, extracts the ratings broken down by age and gender and saves them in the ratingsGroup matrix.

library(rvest)
library(PogromcyDanych)

serialsToParse = levels(serialeIMDB$imdbId)

# prepare a matrix for the results
ratingsGroup = matrix("", length(serialsToParse), 14)
rownames(ratingsGroup) = serialsToParse
colnames(ratingsGroup) = c("Males", "Females", "Aged under 18", "Males under 18",
                           "Females under 18", "Aged 18-29", "Males Aged 18-29", "Females Aged 18-29",
                           "Aged 30-44", "Males Aged 30-44", "Females Aged 30-44", "Aged 45+",
                           "Males Aged 45+", "Females Aged 45+")

# for every series: download the ratings page, select the nodes, clean the text
for (serial in serialsToParse) {
  page = html(paste0("http://www.imdb.com/title/", serial, "/ratings"))
  nodes3 = html_nodes(page, "table:nth-child(11) td:nth-child(3)")
  ratingsGroup[serial, ] = gsub(html_text(nodes3)[-1], pattern = "[^0-9.]", replacement = "")[1:14]
}

Below you will find the first 10 rows and four selected columns of the collected dataset.

Last week we wrote about multidimensional linear models. We discussed a case in which a k-dimensional vector of dependent variables is related to a grouping variable. We look at the matrices E and H to find out whether there is any relationship (see the previous post).

We still haven’t solved one problem, though. The dependent variable has k dimensions, so the effect matrices H and E have dimensions k×k. As a result these matrices can be viewed on a plot only after being projected onto some two-dimensional space. But which two-dimensional subspace should we choose? We can try various projections of the E and H matrices, but is any of them better than the others?

Let us remember that our goal is to see whether the subgroups of the independent variable really differentiate the multidimensional dependent variable. So we will reduce the dimension of the dependent variables so as to preserve the greatest between-group variation (with groups determined by the independent variable). Canonical discriminant analysis (CDA) is a very popular technique for such dimension reduction. It identifies orthogonal directions in the dependent-variable space that explain the greatest possible between-group variation. If we choose the first two coordinates, we get the subspace in which the analyzed groups are characterized by the highest between-group variation, and we can compare the matrices H and E in this particular subspace.

R obviously has (many) packages that make CDA easy to carry out; the one I used is called candisc. We need to load the data, select only the observations for Poland and build a multidimensional linear model (like the one we built a week ago).

library(PISA2012lite)
pol = student2012[student2012$CNT == "Poland",]
model = lm(cbind(PV1MATH, PV1READ, PV1SCIE)~ST28Q01+ST04Q01, pol)

Next, using the candisc function, we perform CDA for the chosen grouping variable (you can choose only one variable) and build an HE plot for it.
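A minimal sketch of that step, assuming the model object built in the snippet above and the candisc package:

```r
library(candisc)

# canonical discriminant analysis for one grouping variable of the model
cda <- candisc(model, term = "ST28Q01")

# HE plot in the space of the first two canonical dimensions
heplot(cda)
```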

On the plot, the matrices H and E were transformed: both were multiplied on the right by E^{-1}, so what we see is HE^{-1} against the identity matrix. Where does this transformation come from? It is much easier to compare the blue effect ellipse to a circle than to another ellipse. The diagram presented above suggests that the three dimensions of the dependent variable are strongly correlated along the axis differentiating the groups of the ST28Q01 variable (the code for the number of books at home). Notice that if a variable has two levels (such as gender), we need only one dimension to achieve maximal linear separation. This is why the canonical components of such variables are one-dimensional and the corresponding HE plots look as follows.

Canonical discriminant analyses and HE plots (Przemyslaw Biecek, 2015-02-26)

GPS helps drivers avoid traffic jams, yet in more advanced uses it enables fleet management or remote drone strikes. It is just the same with visualization. Bars and dots can present a set of several means, but there are also more advanced uses, such as presenting differences in covariance structures. That is today’s topic. But first things first.

I have recently faced the following problem (field of science: anthropology). We have skulls, both female and male, found in different places. Each skull is described by five numbers specifying distances between particular elements of the skull. What question do we want to answer? We want to test whether, and to what extent, these parameters differ between female and male skulls and between the locations where the skulls were found.

If the skulls were described by one number instead of five, we could solve the puzzle with a classic two-way analysis of variance. Having five parameters, we could approach each of them individually and perform five separate analyses of variance. However, our parameters are interdependent, and in this case we will get much better results with models that allow simultaneous modeling of multidimensional variables.

We face the problem of simultaneously modeling multidimensional measurements quite often, especially when the variables are interdependent. This is the case, for example, when modeling skills (e.g. in education, measured in a few subdomains such as math, reading and science), financial products (e.g. return vs. risk) or industrial processes, such as the amounts of milk, protein and fat in a milk yield. Is there scope for interesting plots? Let us see. We will start by taking a quick look at the one-dimensional problem. The standard linear model with a one-dimensional response is usually written as follows:
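In the usual notation, with $y$ the vector of $n$ observations and $X$ the design matrix:

```latex
y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
```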

The null hypothesis in a general form can be written in the following way (L is a contrast matrix):
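That is:

```latex
H_0\colon\; L\beta = 0
```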

The test of such a hypothesis is based on the ratio of two estimates. One of them is an estimate of the size of the random noise
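In standard notation ($\hat\beta$ is the least-squares estimate):

```latex
SSE = (y - X\hat\beta)^{T}(y - X\hat\beta)
```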

and the other one is the estimate of the size of effect
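For the hypothesis $L\beta = 0$ this is:

```latex
SSH = (L\hat\beta)^{T}\bigl(L(X^{T}X)^{-1}L^{T}\bigr)^{-1} L\hat\beta
```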

After scaling, this ratio has an F distribution. If the effect is considerably bigger than the noise, we consider the effect significant. That’s it. But what about multidimensional models? Firstly, the response Y is multidimensional; as a result the effects are described by a coefficient matrix, and the random noise is also multidimensional.
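In matrix notation, with $Y$ the $n \times k$ response matrix and $B$ the matrix of coefficients:

```latex
Y = XB + \varepsilon
```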

The hypothesis is constructed in a very similar way. The only difference is that the zero on the right-hand side of the equation is now a matrix.
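In matrix form:

```latex
H_0\colon\; LB = 0
```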

Now we have reached the testing phase. We proceed in the same way as in the one-dimensional model. First we calculate the size of the effect (sometimes called the hypothesis matrix, whence H for hypothesis) and the size of the random noise (E for error). Both are now symmetric matrices.
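In the standard multivariate notation these are:

```latex
E = (Y - X\hat B)^{T}(Y - X\hat B), \qquad
H = (L\hat B)^{T}\bigl(L(X^{T}X)^{-1}L^{T}\bigr)^{-1} L\hat B
```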

Once we have such a pair of matrices we have to face the following question: when is the matrix for the effect considerably bigger than the matrix for the random noise? There are several approaches to this question, but they usually come down to analyzing the distance between the effect matrix times the inverse of the error matrix, HE^{-1}, and the identity matrix (under the null hypothesis the distance should be negligible).
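Concretely, the classical test statistics are functions of the eigenvalues $\lambda_i$ of $HE^{-1}$; Wilks’ lambda, for example, is

```latex
\Lambda = \frac{\det E}{\det(H + E)} = \prod_{i} \frac{1}{1 + \lambda_i}
```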

One can do this using eigenvalues etc. However, one can also compare these two matrices visually, e.g. on a plot! This is what HE plots are for. In R you will find tools for creating HE plots in the heplots package. Let us see how to use them on an example. The skull data are not publicly available, so let us instead take data from the PISA study on skill levels in three fields: mathematics, science and reading. We will check the influence of two factors, gender and the number of books at home, on this three-dimensional vector. Let’s go to R! We load the package and select only the data for Poland. Then we build a three-dimensional linear model and draw an HE plot for it.

library(PISA2012lite)
library(heplots)

pol = student2012[student2012$CNT == "Poland", ]
model = lm(cbind(PV1MATH, PV1READ, PV1SCIE) ~ ST28Q01 + ST04Q01, pol)
heplot(model, size = "effect.size", las = 1, term.labels = FALSE)

Interestingly, the moment we look at this diagram we come to at least three conclusions: (i) the results in mathematics and reading comprehension are correlated in the residuals, (ii) the effect of books is rather strong and is correlated with the results in both mathematics and reading, (iii) the effect of gender is also strong but of a different nature: females’ results in reading comprehension are on average much better, while their average results in mathematics are a little worse. All these relationships are presented on one plot.

This diagram presents the matrices H and E for a pair of variables. We can display all the pairs using the pairs() function. But what should we do when the variables are very numerous? Instead of generating dozens of diagrams, we would like to present the most important information on a single diagram summing up all the variables. In such a case we might use canonical discriminant analysis, which will be the topic of next week’s post.

More information about HE plots is available in:
- Visual Hypothesis Tests in Multivariate Linear Models: The heplots Package for R, John Fox, Michael Friendly, Georges Monette
- HE Plots for Multivariate Linear Models, Michael Friendly
- HE Plots for Repeated Measures Designs, Michael Friendly

Spark is winning more and more hearts. And no wonder: reports from different sources tell of significant speed-ups (by an order of magnitude) in the analysis of big datasets. A well-developed system for caching objects in memory lets us avoid torturing hard discs during iterative operations performed on the same data.

Applications for Spark had to be written in Java, Scala or Python, and I saw this as a slight drawback of the platform. These languages are really nice, but I use some advanced statistical algorithms available only in R and I do not feel like rewriting them in Python or Java. Fortunately, we now have a connector for R which allows seamless integration of R and Spark. This connection sprang up over a year ago on the initiative of Shivaram Venkataraman from Berkeley. He received support from many people who joined his project of developing the SparkR package (see GitHub: http://amplab-extras.github.io/SparkR-pkg/). Today I would like to share my first impressions after using that package.

Use-case

‘The problem’ seemed to be an ideal match for Spark’s processing profile. Briefly: I have MAAAAANY short time series (each consisting of several hundred observations) and I want to create a regression model for each series. Then I would like to fit an ARIMA-type model to the residuals to obtain better predictions.

I started my test using a local standalone installation with only 1 master and 1 worker (which is a toy really), each with 8 GB RAM. The tested version of Spark is in fact a pre-built version for Hadoop 2.3; see more details here: https://spark.apache.org/downloads.html. At the beginning I tried to work with the MapR package (from http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop, where you can download preinstalled Hadoop from MapR along with the whole zoo), but for some reason this version did not work satisfactorily. As a result I used plain Spark on a local file system. After all, my main goal was to test SparkR.

The beginnings were difficult. SparkR often threw exceptions, and the stack traces from Java did not explain much either. Something rasped somewhere in that mechanism. The documentation for the SparkR package is rather mediocre, so it took a while to guess the limitations on key/value pairs (the basic Spark structure) and what happens when lists contain more than two slots. Similar problems were caused by the fact that functions in the SparkR package are very naughty and collide with functions from other packages, such as dplyr. If you load dplyr after SparkR, the collect function from dplyr (overloaded for an S3 class) will overwrite the collect function from SparkR (overloaded for an S4 class), and you will get a non-informative error.

But after the phase of errors came the time when things actually started working. The basic structure you work with in Spark is the key-value pair. It is really convenient that both the key and the value may be an object of any type or structure.
This allows you to pass a data.frame, a regression model or any other R object as a value. It is also convenient that just a few lines of code are enough to launch R code on Spark; most of the initialization, serialization and deserialization is hidden inside the SparkR package. I managed to implement the planned use case in a dozen or so lines of code, which I present below with the unnecessary details removed. Once you get through the initial difficulties, you discover that using SparkR is really very pleasant.

A few comments on the attached code. The sparkR.init() function initiates a connection with the Spark backend. textFile() creates a connection with the data source, whether it is in HDFS or in the local directory structure. The flatMap() function converts the collection of lines into a collection of vectors of words; as a result it produces a vector of words (or, to be more precise, a list with one vector, a kind of Spark-type format). You can call lapply on this object, and the supplied function will be applied to each vector separately (remotely on Spark). This function should extract the most important elements from the vectors of words and save them as a list of two elements: key and value. In my data each row is one measurement of the demand for property X of object Y at time T. As the key I chose the column with the X id, and as the value I saved a list with Y and T. The groupByKey() function reshuffles the data to group pairs with the same key (the same X id). Next, map() processes the groups; the received argument is a list with the key as the first element and the list of values (here: a list of vectors) as the second. It is fairly easy to transform such a list into a data table and perform predictions for it. Finally, you may use the collect() function to download the results into R.

The whole procedure worked out. Unfortunately, the standalone version of Spark linked through SparkR is tragically, tragically, tragically slow. Now I’m going to test a bigger, real Spark cluster and I hope that the change of backend will considerably speed up data processing.

# install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
library(SparkR)

# you need a properly configured Spark installation
sc <- sparkR.init()

# read data from the local disk
linesRDD <- textFile(sc, "data.txt")

# preprocess the text file: split by tabs, return a list of columns
words <- flatMap(linesRDD,
                 function(line) {
                   list(strsplit(line, split = "\t")[[1]])
                 })

# now extract keys and values
demandWithKey <- lapply(words, function(word) {
  list( ---here-key--- ,
        ---here-value--- )
})

# group by key
groups <- groupByKey(demandWithKey, 1L)

# do the mapping
mapRes <- map(groups, function(x) {
  # x[[1]] is the key
  # x[[2]] is the list of values
  # write your logic here
})

# collect the results, downloading them into R
collect(mapRes)

Przemyslaw Biecek

Spark + R = SparkR (2015-02-12)

Do you know where Kamil Stoch earned most of his points in the 2013/2014 season?

Some time ago I came across the pheatmap package (see here) for R, which generates much nicer heat maps than the standard heatmap() function. This is why the package is named p(retty)heatmap. Let us compare these maps through an example. The pheatmap package is usually used in genomics, but below we will use it to present the results of the ski jumping competitions of the 2013/2014 season. For each ski jumper who gained at least 500 points in the general classification we will count the number of points he won on each hill. Then we will visualize that matrix (ski jumper vs. hill) as a heat map. A heat map is a visualization of a numerical matrix in which color represents the numerical value of a given cell and dendrograms mark the similarity of rows and columns. We will import the ski jumping data from the SmarterPoland package for R. (Note that this dataset was used in the competition for the best visualization during the PAZUR conference 2014; see the winning submissions here.) First, we need some dplyr magic to count the number of points that individual jumpers won on given hills and added to the general classification.
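The aggregation might look like the sketch below; the data frame and its column names (jumper, hill, points) are hypothetical stand-ins for the actual SmarterPoland dataset:

```r
library(dplyr)

# toy stand-in for the ski-jumping data (hypothetical column names)
jumps <- data.frame(
  jumper = c("Stoch", "Stoch", "Prevc", "Prevc", "Stoch"),
  hill   = c("Oberstdorf", "Zakopane", "Oberstdorf", "Zakopane", "Oberstdorf"),
  points = c(100, 80, 60, 100, 45)
)

# total points per jumper and hill
pointsPerHill <- jumps %>%
  group_by(jumper, hill) %>%
  summarise(total = sum(points), .groups = "drop")

# reshape into the jumper x hill matrix used for the heat map
mat <- xtabs(total ~ jumper + hill, data = pointsPerHill)
```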

Then let us compare the ugly standard heat map with the much prettier one from the pheatmap package: a sensible default legend, a more legible grid, well-matched colors and more suitable margins. These are some of the nice advantages of the pheatmap package.
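A side-by-side sketch on a made-up matrix (the pheatmap call is guarded in case the package is not installed):

```r
# made-up points matrix: jumpers in rows, hills in columns
set.seed(1)
mat <- matrix(sample(0:120, 30, replace = TRUE), nrow = 6, ncol = 5,
              dimnames = list(paste("jumper", 1:6), paste("hill", 1:5)))

# the base R heat map
heatmap(mat)

# the prettier version, if pheatmap is available
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(mat)
}
```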

A friend of mine took part in a project in which he had to predict the future values of a characteristic Y. The problem was that Y showed an increasing trend over time. For the purposes of this post, let us assume that Y was energy demand, the milk yield of cows or any other characteristic with a positive trend over time.

So, we discussed possible approaches to this problem. As benchmarks we used techniques that heat up processors, like random forests and SVMs. However, it turns out (and in hindsight it matches intuition) that if we deal with a generally stable trend, the range of values observed in the future may differ from the range observed in the past. In that case techniques such as simple linear regression may give better results than the aforementioned SVM and RF (which, more or less, look for similar cases in the past and average them).

Let us consider the following example. We have N predictors at our disposal and we want to predict the variable Y, which in reality depends only on the first predictor. We will compare SVM, random forest, simple regression and lasso-type regularised regression. This example is purely simulation based. We start with a small random dataset: 100 observations and N = 25 predictors (results are similar for larger datasets). The testing set will lie beyond the domain of the training set, i.e. we increase all values by +1.

library(dplyr)
library(lasso2)
library(e1071)
library(randomForest)
library(ggplot2)

# will be useful in simulations
getData <- function(n = 100, N = 25) {
  x <- runif(N * n) %>%
    matrix(n, N)
  # artificial out-of-domain x
  x_test <- x + 1
  list(x = x,
       y = x[, 1] * 5 + rnorm(n),
       x_test = x_test,
       y_test = x_test[, 1] * 5 + rnorm(n))
}

# let's draw a dataset
gdata <- getData()
head(gdata$y)
# [1] -0.5331184 3.1140116 4.9557897 3.2433499 2.8986888 5.2478431
dim(gdata$x)
# [1] 100 25

In the simulated data there is a linear relationship between the first predictor and Y. We added a small random noise to avoid being too tendentious.
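To keep things self-contained, here is a base-R sketch of the comparison; instead of the SVM and random forest it uses a hand-rolled nearest-neighbour average, which mimics how those methods interpolate within the training domain:

```r
set.seed(1)
n <- 100; N <- 25
x <- matrix(runif(N * n), n, N)
y <- x[, 1] * 5 + rnorm(n)
x_test <- x + 1                       # out-of-domain shift
y_test <- x_test[, 1] * 5 + rnorm(n)

# linear regression follows the trend outside the training domain
fit <- lm(y ~ ., data = data.frame(y = y, x))
pred_lm <- predict(fit, newdata = data.frame(x_test))

# crude 5-nearest-neighbour average: stuck inside the training domain
pred_nn <- apply(x_test, 1, function(row) {
  d <- rowSums((x - matrix(row, n, N, byrow = TRUE))^2)
  mean(y[order(d)[1:5]])
})

mse <- function(p) mean((p - y_test)^2)
mse(pred_lm)   # small: the regression extrapolates the linear trend
mse(pred_nn)   # large: the neighbours come from the old range of X1
```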

The lower the MSE, the better. The boxplots present the results of the whole simulation. We did not perform variable selection, so the linear regression suffers from the random noise; lasso regularisation helps, as expected. Of course, methods such as SVM or random forest had to lose this competition, because in the ‘future’ the value of Y is X1*5, but the range of X1 is between 1 and 2 rather than between 0 and 1. The same happens for N = 5 variables (here regression has an advantage) and N = 50 variables (here it does not). What are the conclusions?
- SVM and random forest work within a ‘domain’. If a monotonic trend is observed and future values are likely to be far from those in the training set, the trend should be removed in advance.
- If there are many variables, it is advisable to do some variable selection before fitting a regression, or to choose a method that does it for us (like the lasso). RF deals with this problem on its own.
- It is much more difficult to predict the future than the past ;)

Przemyslaw Biecek

Is a simple linear regression able to knock spots off SVM and Random Forest? (2015-01-15)