eXtreme Gradient Boosting vs Random Forest [and the caret package for R]

Decision trees are cute.
They are easy to visualize, explain, apply, and even construct.
Unfortunately, they are quite unstable, particularly for large sets of correlated features.
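
To make the instability claim concrete, here is a minimal sketch in plain Python (not the caret or xgboost code the post itself uses). It assumes synthetic data in which two features are noisy copies of the same latent signal, fits only the root split of a regression tree on bootstrap resamples, and counts which feature wins. The helper names `best_stump_feature`, `make_data`, and `root_split_counts`, as well as the noise levels and sample sizes, are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def best_stump_feature(rows):
    """Return the index of the feature whose best single split
    gives the lowest total squared error (a depth-1 regression tree)."""
    best_feature, best_sse = None, float("inf")
    n = len(rows)
    for j in range(len(rows[0][0])):
        ordered = sorted(rows, key=lambda r: r[0][j])
        ys = [y for _, y in ordered]
        total, total_sq = sum(ys), sum(y * y for y in ys)
        left_sum = left_sq = 0.0
        for i in range(1, n):
            left_sum += ys[i - 1]
            left_sq += ys[i - 1] ** 2
            # A split is only valid between two distinct feature values.
            if ordered[i - 1][0][j] == ordered[i][0][j]:
                continue
            right_sum, right_sq = total - left_sum, total_sq - left_sq
            # Sum of squared errors around each side's mean.
            sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
            if sse < best_sse:
                best_feature, best_sse = j, sse
    return best_feature


def make_data(n, rng):
    """Assumed toy data: x1 and x2 are both noisy copies of a latent z."""
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        features = (z + rng.gauss(0, 0.3), z + rng.gauss(0, 0.3))
        rows.append((features, z + rng.gauss(0, 0.3)))
    return rows


def root_split_counts(seed=0, n=100, n_resamples=50):
    """Count which feature is chosen at the root across bootstrap resamples."""
    rng = random.Random(seed)
    data = make_data(n, rng)
    counts = Counter()
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in range(n)]  # bootstrap resample
        counts[best_stump_feature(sample)] += 1
    return counts


print(root_split_counts())
```

On data like this the root split tends to flip between the two correlated features from one resample to the next, which is the kind of instability that ensembles such as random forests are designed to average out.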

R vs SAS vs SPSS

Today we are going to illustrate some subtle differences among three statistical packages: R, SAS, and SPSS. The differences are small, but even a very small difference can have large consequences, so they are worth knowing.

multidplyr: first impressions

Two days ago Hadley Wickham tweeted a link to an introduction to his new package, multidplyr. Basically, it’s a tool for taking advantage of multiple cores in dplyr operations. Let’s see how to play with it.

Understanding Apache Spark’s Execution Model Using SparkListeners

When you execute an action on an RDD, Apache Spark runs a job that in turn triggers tasks, using DAGScheduler and TaskScheduler, respectively. These are low-level details, but they are often useful to understand when a simple transformation is no longer simple performance-wise and takes ages to complete.