multidplyr: first impressions


November 13, 2015 / Data science / by Przemyslaw Biecek

Two days ago Hadley Wickham tweeted a link to an introduction to his new package, multidplyr. Basically, it is a tool for taking advantage of multiple cores in dplyr operations. Let’s see how to play with it.

What can you do with multidplyr?

As described on its GitHub page, multidplyr is a library for processing data that is distributed across many cores, using dplyr verbs.
The idea is somewhat similar to Spark. Similar solutions exist for R, and some of them have been available for years (like Rmpi, distribute, parallel or many others from the list https://cran.r-project.org/web/views/HighPerformanceComputing.html). The problem with them is that they are made mainly for hackers. It is not that unusual to get an error with 20 lines of traceback and no warning.
Packages from the Hadleyverse come with a nicer design (as Apple products sometimes do) and explode less often. With slightly smaller functionality we get more fun.
multidplyr is still in the development phase and sometimes it can make you really angry. But there are a lot of things that you can do well with it, and often you can do them really fast.
In the multidplyr vignette you will find examples based on the flights dataset. With just 300k+ observations, it turns out that the overhead of distributing the data is larger than the computation time itself. But for larger datasets, or for more complicated calculations, you should expect some gain from heating up additional cores.
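To get a feel for this trade-off outside multidplyr, here is a minimal sketch with base R’s parallel package (not multidplyr itself): for a cheap computation the parallel and sequential results are identical, but the time is dominated by starting workers and shipping data to them.

```r
library(parallel)

# A toy illustration of distribution overhead (base R, not multidplyr):
# for a task this cheap, spinning up workers and moving data costs
# more than the computation saves.
cl <- makeCluster(2)
x <- 1:1000
res_par <- parLapply(cl, x, function(i) i^2)
stopCluster(cl)

res_seq <- lapply(x, function(i) i^2)
identical(res_par, res_seq)
## [1] TRUE
```

Same result, very different cost profile; the gain only appears once each unit of work is expensive enough to amortize the overhead.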

Use case

Right now, I am playing with log files from many bizarre devices. The point is that there are a lot of rows of data (a few hundred million) and the logs are stored in many, many relatively small files. So I use multidplyr to read the data in parallel and do some initial pre-processing. The cluster is built on 15 cores and everything is done in plain R. It turns out that I can reduce the processing time from one day to two hours. So, it is an improvement, even if you count the time you need to spend learning multidplyr (not that much if you know dplyr and Spark).
Let’s see the example step by step.
First, initiate a cluster with 15 nodes (one node per core).

library(dplyr)
library(multidplyr)

cluster = create_cluster(15)
## Initializing 15 core cluster.
set_default_cluster(cluster)

Find all files with the extension 'log'; that is where the data lives.

lf = data.frame(files = list.files(pattern = 'log', recursive = TRUE),
                stringsAsFactors = FALSE)

Now, I need to define a function that reads a file and does some preprocessing. This function is then sent to all nodes in the cluster.

readAndExtractTimepoints = function(x) {
  tmp = readLines(as.character(x)[1])
  ftmp = grep(tmp, pattern = 'Entering scene', value = TRUE)
  substr(ftmp, 1, 15)
}

cluster_assign_value(cluster, 'readAndExtractTimepoints', readAndExtractTimepoints)
Time to initiate some calculations. The list of file names is partitioned across the nodes, and readAndExtractTimepoints is executed for each file. The result is an object of class party_df (again, one file per row).

lf_distr = lf %>%
  partition() %>%
  group_by(files) %>%
  do(timepoints = readAndExtractTimepoints(.$files))
lf_distr
## Source: party_df [897 x 3]
## Groups: PARTITION_ID, files
## Shards: 15 [59--60 rows]
##
##    PARTITION_ID                     files   timepoints
##           (dbl)                     (chr)        (chr)
## 1             1 2013/01/cnk02a/cnk02a.log
## 2             1 2013/01/cnk02b/cnk02b.log
## 3             1   2013/01/cnk06/cnk06.log
## 4             1   2013/01/cnk07/cnk07.log
## 5             1   2013/01/cnk09/cnk09.log
## 6             1   2013/01/cnk10/cnk10.log
## 7             1 2013/01/cnk100/cnk100.log
## 8             1   2013/01/cnk11/cnk11.log
## 9             1   2013/01/cnk15/cnk15.log
## 10            1   2013/01/cnk16/cnk16.log

The results are ready to be collected and transformed into an ordinary list.

timeP = collect(lf_distr)
str(timeP$timepoints)
## List of 897
##  $ : chr [1:144830] "Jan  1 08:15:57 " "Jan  1 18:04:37 " "Jan  1 18:05:44 " "Jan  2 08:15:57 " ...
##  $ : chr [1:123649] "Jan  1 08:16:05 " "Jan  2 08:16:05 " "Jan  2 09:46:08 " "Jan  2 09:46:13 " ...
##  $ : chr [1:137661] "Jan  1 08:15:57 " "Jan  2 08:15:57 " "Jan  2 09:34:47 " "Jan  2 09:35:45 " ...
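A natural next step (not shown in the original post) is to parse these strings into proper timestamps. The logs drop the year, so we have to supply one; the “2013” below is taken from the file paths in the output above. Parsing month abbreviations with %b assumes an English LC_TIME locale.

```r
# Example strings as returned by readAndExtractTimepoints()
tp <- c("Jan  1 08:15:57 ", "Jan  2 08:16:05 ")

# The year is absent from the logs; prepend it ("2013" matches the
# file paths above). %b parsing assumes an English LC_TIME locale.
parsed <- as.POSIXct(paste("2013", trimws(tp)),
                     format = "%Y %b %d %H:%M:%S", tz = "UTC")
parsed
## [1] "2013-01-01 08:15:57 UTC" "2013-01-02 08:16:05 UTC"
```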

General impressions

I guess one could speed up the whole process even further with Python or Spark. But if the dataset is not huge, it is much easier to maintain a process that uses just a single technology/language.
Overall I like multidplyr, even if it still looks like a prototype. Sometimes things get nasty, for example when you try to chain a few different do() operations. But knowing the ‘Hadley effect’, I expect it to get better and better with every version.
Finally, it seems we can soon expect a solution for parallel processing that can be used by normal people, not only by hackers.
