Cooperative data exploration

January 27, 2016 / in Big data & Spark / by Piotr Łusakowski

Living in a world of big data comes with a certain challenge: how to extract value from the ever-growing flow of information that comes our way. There are plenty of great tools that can help, but they are all resource-hungry. So how do we ease this demand for CPU and RAM? One way is to share the data we are working on, and the results of our computations, with others.

Imagine a team of data scientists working on a problem. Each of them attacks the problem from a different angle and tries various techniques to tackle it. What they all have in common, though, is the data underlying the problem. So why should this data be instantiated multiple times in the cluster's memory? Why shouldn't it be shared between all team members? Furthermore, when someone on the team makes an interesting discovery about the data, they should be able to share their results with the others instantly, so that everyone can benefit from the new insight.

The Solution

Technologies

The IPython notebook is a well-known tool in the data science community, serving both as a great experimentation environment and as a research documentation device. Apache Spark, on the other hand, is a rising star of the Big Data world that already has a Python API. In this post we'll show how combining these two great pieces of software can improve the data exploration process.

Simple data sharing

Let’s run a simple Spark application (in our heads, for now – it’s far easier to debug and the compile times are awesome) that reads a CSV file into a DataFrame (Spark’s distributed data collection). This DataFrame is large – several gigabytes, say, or any size you feel is noticeable across the cluster. How can we share this DataFrame between multiple team members? Let’s think of our application as a server. A server can handle multiple clients and act as a “hub” where shared state is stored.
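
For reference, loading such a file could look roughly like this in the Spark 1.x Python API (a minimal sketch: it assumes the spark-csv package is on the classpath, an existing sqlContext, and a made-up file path; the equivalent Scala call is nearly identical).

Python code

# sketch only: spark-csv package and sqlContext are assumed to exist
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///data/our_big_file.csv"))  # hypothetical path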

Luckily for us, Spark is well prepared for such a use case. Its Python API uses a library called Py4J that acts as a proxy between Python code and the JVM-based core. Internally, Py4J works like this: the JVM process sets up a Gateway Server which listens for incoming connections and allows calling methods on JVM objects via reflection. On the other side of the communication pipe, Py4J provides the JavaGateway, a Python object that handles “talking” to the Gateway Server and exposes objects imitating those in the JVM process.
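
To illustrate the mechanism, here is a minimal sketch (assuming a Gateway Server is already listening on localhost:8080): the jvm member of a JavaGateway lets Python code instantiate and call arbitrary JVM classes via reflection.

Python code

from py4j.java_gateway import JavaGateway, GatewayClient

# connect to a JVM process whose GatewayServer listens on localhost:8080
gateway = JavaGateway(GatewayClient(address='localhost', port=8080))

# standard library classes are reachable through the gateway
millis = gateway.jvm.java.lang.System.currentTimeMillis()
random = gateway.jvm.java.util.Random()
print(millis, random.nextInt(10))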

Let’s quickly add this functionality to our imaginary application. By now it has read our big DataFrame and exposed a server to which we can connect to perform operations on JVM objects. But on which objects exactly? We can’t just say “let’s do a .select() on that DataFrame”. That’s why Py4J allows us to define an Entry Point: an object proxied to the Python side by default. Let’s put our DataFrame in this object and define an access method that returns it. Now anyone who connects to our Gateway Server can use the DataFrame without it being duplicated in the cluster’s memory.

Scala code

import org.apache.spark.sql.DataFrame

class MyEntryPoint() {
    // returns the shared DataFrame to any connected Py4J client
    def getDataFrame(): DataFrame = // ...
    // ...
}

Python code

from py4j.java_gateway import JavaGateway, GatewayClient

java_gateway = JavaGateway(GatewayClient(address='localhost', port=8080),
                           auto_convert=True)
java_gateway.entry_point.getDataFrame().select("my_column").toDF().show()

Sharing results

We’ve managed to convince our Spark application to serve a DataFrame. But when we build another DataFrame, we’d like other users to be able to access it too. Let’s add a collection of DataFrames to our Entry Point; for simplicity, let it be a String → DataFrame map. Whenever we want to share our results with other team members, we just append a new mapping to the collection: “my new result” → DataFrame. Now anyone can access the map and see what we’ve done. Just remember to stay thread-safe :).

Scala code

class MyEntryPoint() {
    // back the shared results with a thread-safe map, e.g.
    // java.util.concurrent.ConcurrentHashMap[String, DataFrame]
    def getDataFrame(id: String): DataFrame = // ...
    def saveDataFrame(id: String, dataFrame: DataFrame): Unit = // ...
    // ...
}

Extensions

Using this core functionality, it’s now easy to think of possible improvements, such as:

  • user management: each user has their own namespace and can share what they want with only selected colleagues
  • notifications: “Look, John published a new result!” (a minimal sketch of this idea follows the list)
  • sharing more than just data: any JVM object can be shared this way, provided it’s immutable (mutable objects can be shared as well, of course, but for obvious reasons it’s best to avoid that when possible)
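
For example, the notification idea can be approximated with simple polling. The sketch below is hypothetical: it assumes the Entry Point additionally exposes a listDataFrames() method that returns the keys of the shared map (no such method appears in the code above).

Python code

import time

seen = set()
while True:
    # listDataFrames() is an assumed helper returning the map's keys
    ids = set(java_gateway.entry_point.listDataFrames())
    for new_id in ids - seen:
        print("Look, someone published a new result: %s" % new_id)
    seen |= ids
    time.sleep(5)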

Of course, this is only a taste of what can be achieved using this technique. Please let us know if you come up with an interesting application for it.

Putting it all together

Below is an outline of how to set up both the JVM (Scala) and the Python side of the above solution. Your Spark application should essentially run in an infinite loop. To control it better, you could add custom methods to the Entry Point, such as “shutdown application”, but that’s not mandatory.

Scala code

import py4j.GatewayServer

class MyEntryPoint() {
    def getDataFrame(id: String): DataFrame = // ...
    def saveDataFrame(id: String, dataFrame: DataFrame): Unit = // ...
    // ...
}

// start() returns immediately, so keep the application alive afterwards
val gatewayServer = new GatewayServer(new MyEntryPoint(), 8080)
gatewayServer.start()

Python code

from py4j.java_gateway import JavaGateway, GatewayClient

java_gateway = JavaGateway(GatewayClient(address='localhost', port=8080),
                           auto_convert=True)
selected = java_gateway.entry_point.getDataFrame("my DataFrame") \
                       .select("my_column").toDF()
java_gateway.entry_point.saveDataFrame("my new shiny DataFrame", selected._jdf)

The last line references the protected _jdf member of the DataFrame object. While this is a bit unorthodox, it’s used here for simplicity. The _jdf object is Py4J’s proxy to the corresponding JVM DataFrame object. An alternative implementation requires sharing Spark’s SQLContext over the Py4J connection and registering our output DataFrame as a temporary table.
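
That alternative could look roughly as follows. This is a sketch only: it assumes the Entry Point additionally exposes a getSQLContext() method (not shown above) and uses the Spark 1.x registerTempTable API.

Python code

# getSQLContext() is an assumed extra method on the Entry Point
j_sql_context = java_gateway.entry_point.getSQLContext()

# make the shared DataFrame visible to SQL queries under a name
java_gateway.entry_point.getDataFrame("my DataFrame") \
    .registerTempTable("my_table")

# clients can now derive new DataFrames without touching _jdf
selected = j_sql_context.sql("SELECT my_column FROM my_table")
selected.show()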

Slides: “Cooperative data exploration” by Piotr Łusakowski, Hadoop Summit Dublin.
