Is a simple linear regression able to knock spots off SVM and Random Forest?

January 15, 2015/in Data science /by Przemyslaw Biecek

A friend of mine took part in a project in which he had to forecast future values of a characteristic Y. The problem was that Y showed an increasing trend over time. For the purposes of this post, let us assume that Y was energy demand, the milk yield of cows, or any other characteristic with a positive trend over time.

So we discussed possible approaches to this problem. As a benchmark we used the techniques that heat up processors, such as random forest and SVM. However, it turns out (and in hindsight it matches intuition) that if we deal with a generally stable trend, the range of values observed in the future may differ from the range observed in the past. In that case, techniques such as simple linear regression may give better results than the aforementioned SVM and RF (which, more or less, look for similar cases in the past and average them).
Let us consider the following example. We have N predictors at our disposal and we want to predict the evolution of Y. In reality, Y depends only on the first predictor. We will compare SVM, random forest, simple regression, and lasso-type regularised regression.
This example is purely simulation based. We start with a small random dataset of 100 observations and N = 25 predictors (results are similar for larger datasets). The testing set lies beyond the domain of the training set, i.e. we increase all predictor values by +1.
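The point about averaging similar past cases can be illustrated with a minimal sketch, here in plain Python rather than R (a hypothetical toy, not the experiment below): a nearest-neighbour "averager" can never predict outside the range of training targets, while a fitted line extrapolates the trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict_nn(xs, ys, x_new):
    """Predict with the training point closest to x_new."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return ys[i]

xs = [0.1 * i for i in range(11)]   # training domain: [0, 1]
ys = [5 * x for x in xs]            # y = 5x, no noise for clarity

a, b = fit_line(xs, ys)
print(a * 2 + b)                    # linear fit at x = 2: ~10, follows the trend
print(predict_nn(xs, ys, 2))        # nearest neighbour: 5.0, capped at the training range
```

Any method that averages training targets is similarly capped, which is exactly what hurts SVM and RF once the test domain shifts.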

library(dplyr)
library(lasso2)
library(e1071)
library(randomForest)
library(ggplot2)

# will be useful in simulations
getData <- function(n = 100, N = 25) {
  x <- runif(N * n) %>%
    matrix(n, N)
  # artificial out-of-domain x: shift every value by +1
  x_test <- x + 1
  list(x = x,
       y = x[, 1] * 5 + rnorm(n),
       x_test = x_test,
       y_test = x_test[, 1] * 5 + rnorm(n))
}

# let's draw a dataset
gdata <- getData()
head(gdata$y)
# [1] -0.5331184 3.1140116 4.9557897 3.2433499 2.8986888 5.2478431
dim(gdata$x)
# [1] 100 25

In the generated data there is a linear relationship between the first predictor and Y. We added a small random noise to avoid being too tendentious.

with(gdata,
     qplot(x[, 1], y) +
       geom_smooth(se = FALSE, method = "lm")
)

[Figure: scatterplot of x[,1] against y with a fitted linear trend]
Let us fit a model with each approach and calculate the MSE for each of them.

fitModels <- function(x, y) {
  ndata <- data.frame(y, x)
  list(
    model_lasso = l1ce(y ~ ., data = ndata),
    model_lm = lm(y ~ ., data = ndata),
    model_svm = svm(x, y),
    model_rf = randomForest(x, y))
}

testModels <- function(models, x_test, y_test) {
  predict_lasso <- predict(models$model_lasso, data.frame(x_test))
  predict_lm <- predict(models$model_lm, data.frame(x_test))
  predict_svm <- predict(models$model_svm, x_test)
  predict_rf <- predict(models$model_rf, x_test)
  c(
    lasso = mean((predict_lasso - y_test)^2),
    lm = mean((predict_lm - y_test)^2),
    rf = mean((predict_rf - y_test)^2),
    svm = mean((predict_svm - y_test)^2))
}

# time for fitting
models <- fitModels(gdata$x, gdata$y)
testModels(models, gdata$x_test, gdata$y_test)
#     lasso        lm         rf        svm
# 0.8425946 1.4672156 15.7713529 25.0271363

This time the lasso wins. Now let us repeat the random drawing and model fitting 100 times, and pipe the results directly into boxplots.

N <- 25
replicate(100, {
  gdata <- getData(N = N)
  models <- fitModels(gdata$x, gdata$y)
  testModels(models, gdata$x_test, gdata$y_test)
}) %>%
  t() %>%
  boxplot(main = paste("MSE for", N, "variables"))

[Figure: boxplots of MSE for 25 variables]
The lower the MSE, the better.
The boxplots summarise the whole simulation. We did not perform variable selection, so plain linear regression suffers from the random noise in the remaining predictors; lasso regularisation helps, as expected.
Of course, methods such as SVM and random forest had to lose this competition: in the 'future' the value of Y is still X1*5, but X1 now ranges over [1, 2] instead of [0, 1].
The picture is similar for N = 5 variables (where plain regression still has an advantage) and N = 50 variables (where it no longer does).
[Figure: boxplots of MSE for 5 variables]
[Figure: boxplots of MSE for 50 variables]
What are the conclusions?
– SVM and random forest work within a 'domain'. If a monotonic trend is present and future values are likely to lie far outside the training set, the trend should be removed in advance.
– With many variables, it is advisable to do some variable selection before fitting a regression, or to choose a method that does it for us (like the lasso, for example). Random forest deals with this problem on its own.
– It is much more difficult to predict the future than the past ;)
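The detrending advice in the first conclusion can be sketched as follows (a hypothetical minimal example in Python, not code from the experiment): fit and subtract a linear time trend, model the residuals with any domain-bound method, then add the extrapolated trend back when forecasting.

```python
def fit_line(ts, ys):
    """Ordinary least squares for y = a*t + b."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    a = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / \
        sum((t - mt) ** 2 for t in ts)
    return a, my - a * mt

ts = list(range(10))
ys = [3 * t + 1 for t in ts]        # trending series (no noise, for clarity)

a, b = fit_line(ts, ys)
residuals = [y - (a * t + b) for t, y in zip(ts, ys)]
# the residuals now occupy a stable range, so a domain-bound model
# (RF, SVM, ...) of the residuals is no longer asked to extrapolate;
# forecast = residual-model prediction + extrapolated trend
forecast_t = 15
print(a * forecast_t + b)           # trend component at t = 15: 46.0
```

On noise-free data the residuals are all zero, so the forecast reduces to the extrapolated trend; with real data the residual model would add its correction on top.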

