Scheduling Spark jobs in Seahorse

January 30, 2017 / in Big data & Spark, Seahorse / by Michal Szostek

In the latest Seahorse release we introduced scheduling of Spark jobs. We will show you how to use it to collect data regularly and to send email reports generated from that data.

Use case

Let’s say we have a local weather station, and the data from this station is uploaded automatically to a Google Sheet. The sheet holds no history: besides the header row, it contains a single data row (row 2), which always holds the current weather readings.
[Screenshot: the spreadsheet, with one header row and one data row: temperature, humidity, atmospheric pressure, wind speed and time.]
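In CSV form, the sheet might look like this (the values below are illustrative, not real measurements):

temperature,humidity,atmospheric pressure,wind speed,time
2.5,87,1013.2,4.1,2017-01-30 14:35:00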
We would like to generate some graphs from this data every now and then and do it without any repeated effort.

Solution

Collecting data

The first workflow that we create collects a snapshot of live data and appends it to a historical data set. It will be run at regular intervals.
Let’s start by creating two data sources in Seahorse. The first data source represents the Google Spreadsheet with live-updated weather data:
[Screenshot: Seahorse's Google Spreadsheet data source options.]
For the historical weather data, we create a data source representing a file stored locally within Seahorse. Initially it contains only a header and one row of current data.
[Screenshot: a data source named "Historical weather data", backed by a CSV file in the Seahorse library.]
Now we can create our first Spark job which reads both data sources, concatenates them and writes back to the “Historical weather data” data source. In this way, there will be a new data row in “Historical weather data” every time the workflow is run.
[Screenshot: a four-node workflow. Two "Read DataFrame" operations, "Historical weather data" and "Live weather data", feed a "Union" operation named "Append live data snapshot", whose output goes to a "Write DataFrame" operation writing back to "Historical weather data".]
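For readers who prefer code, here is a minimal PySpark sketch of what this scheduled job does. The file paths and session setup are illustrative assumptions; in Seahorse the reads and the write are configured through data sources, not explicit paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-weather-snapshot").getOrCreate()

# Illustrative paths; in Seahorse these are configured as data sources.
historical = spark.read.csv("historical_weather.csv", header=True, inferSchema=True)
live = spark.read.csv("live_weather.csv", header=True, inferSchema=True)

# Append the single-row live snapshot to the accumulated history
# (both inputs must have the same schema).
combined = historical.union(live)

# Write the extended history out. We use a separate output path here because
# plain Spark cannot safely overwrite a file it is still reading from;
# Seahorse's data sources handle the write-back for us.
combined.write.mode("overwrite").csv("historical_weather_out", header=True)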
There is one more step for us to take: this Spark job needs to be scheduled. We schedule it to run every half hour, and after every scheduled execution an email report is sent.
[Screenshot: scheduling options form. Cluster preset: default. Run workflow every hour at minutes 05 and 35. Send email reports to: execution-reports@example.com.]
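For reference, this schedule (minutes 05 and 35 of every hour) corresponds to the following expression in standard cron syntax. Seahorse configures it through the form shown above, so the cron line is only an illustration:

5,35 * * * *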
Each report email contains a link to the executed workflow, where we can inspect node reports from that execution.

Sending a weather report

The next workflow is very simple: it consists of a Read DataFrame operation and a Python Notebook. We keep it separate from the data collection because we want to send reports much less frequently.
[Screenshot: a two-node workflow: a "Read DataFrame" operation reading from "Historical weather data", connected to a "Python Notebook".]
In the notebook, we first transform the data a little:

import pandas
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# dataframe() is provided by Seahorse and returns the notebook's input DataFrame.
data = dataframe().toPandas()
# Parse the time column and sort the rows chronologically.
data['time'] = pandas.to_datetime(data['time'])
data = data.sort_values('time')

Next, we prepare a function that will be used to line-plot a time series of weather characteristics like temperature, air pressure and humidity:

def plot(col_to_plot):
    # Hide the top and right spines and keep ticks on the bottom/left only.
    ax = plt.axes()
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    # Show plain y values instead of an offset such as "+1.013e3".
    y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
    ax.yaxis.set_major_formatter(y_formatter)
    # Limit each axis to roughly five tick marks.
    plt.locator_params(axis='x', nbins=5)
    plt.locator_params(axis='y', nbins=5)
    plt.yticks(fontsize=14)
    plt.xticks(fontsize=14)
    plt.plot(data['time'], data[col_to_plot])
    # Rotate the date labels so they do not overlap.
    plt.gcf().autofmt_xdate()
    plt.show()
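Calling the function with a column name renders the corresponding chart inline, for example:

plot('temperature')
plot('atmospheric pressure')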

After editing the notebook, we set up its parameters so that reports are sent to our email address after each run.
[Screenshot: Python Notebook options. Execute notebook: true. Send e-mail report: true. Email address: reports@example.com.]
Finally, we can set up a schedule for our report. It is similar to the previous one, only less frequent: we send a report every day in the evening. After each execution two emails are sent: one, as before, with a link to the executed workflow, and a second containing the notebook execution result.

Result

After some time, we have enough data to plot it. Here is a sample from the “Historical weather data” data source…
[Screenshot: data report for the "Historical weather data" Read DataFrame. A table with columns "temperature", "humidity", "atmospheric pressure", "wind speed" and "time", and rows of collected sample data.]
… and graphs that are in the executed notebook:
[Screenshot: two Python Notebook cells with inputs plot('temperature') and plot('atmospheric pressure'); both outputs are line graphs.]

Summary

We created a data processing and email reporting pipeline using two simple workflows and some Python code. Seahorse works well not only for big data, but also for scenarios where you need to integrate data from different sources and periodically generate reports. Seahorse data sources and Spark job scheduling are the right tools for the job.
