GeoJson Operations in Apache Spark with Seahorse SDK

February 7, 2017/in Big data & Spark, Seahorse /by Adam Jakubowski

A few days ago we released Seahorse 1.4, an enhanced version of our machine learning, Big Data manipulation and data visualization product.

This release also comes with an SDK – a Scala toolkit for creating custom operations to be used in Seahorse.
As a showcase, we will create a custom Geospatial operation with GeoJson and build a simple Seahorse workflow using it. This use case is inspired by Example 8 from the book Advanced Analytics with Spark.

Geospatial data in GeoJson format

GeoJson can encode many geographic data types, for example:

  • a location point
  • a line
  • a region described by a polygon

GeoJson encodes geospatial data as JSON. For example, a location point looks like this:

{
    "type": "Point",
    "coordinates": [125.6, 10.1]
}
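A polygon is encoded the same way, as one or more rings of [longitude, latitude] vertices, with the first vertex repeated at the end to close the ring. The coordinates below are illustrative only, not real borough boundaries:

```json
{
    "type": "Polygon",
    "coordinates": [[
        [-74.02, 40.70],
        [-73.93, 40.70],
        [-73.93, 40.88],
        [-74.02, 40.88],
        [-74.02, 40.70]
    ]]
}
```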

New York City Taxi Trips

In our workflow we will use a New York Taxi Trips dataset with pickup and drop-off location points.
Let’s say that we want to know how many taxi trips started in Manhattan and ended in the Bronx. To achieve this, we can use an operation that filters the dataset by whether a location is contained inside some geographic region. Let’s call this operation ‘Filter Inside Polygon’.

Manhattan GeoJson polygon Visualization

Manhattan GeoJson Polygon visualization using geojson.io

Bronx GeoJson polygon Visualization

Bronx GeoJson Polygon visualization using geojson.io

You can download GeoJson data with New York boroughs from https://github.com/dwillis/nyc-maps.

Implementing a custom operation with the Seahorse SDK takes the following steps:

  1. Clone Seahorse SDK Example Git Repository.
  2. Implement custom Geospatial operation.
  3. Assemble a JAR with the custom Geospatial operation.
  4. Add assembled jar to Seahorse instance.
  5. Use custom Geospatial operation in a Workflow.

Seahorse SDK Example Git Repository

The fastest way to start developing custom Seahorse operations is to clone our SDK Example Git Repository and write your code from there.

git clone --branch 1.4.0 https://github.com/deepsense-io/seahorse-sdk-example

The Seahorse SDK Example Repository has all Seahorse and Apache Spark dependencies already defined in an SBT build file definition.
Let’s add the geospatial data library to our dependencies in the build.sbt file:

libraryDependencies += "com.esri.geometry" % "esri-geometry-api" % "1.0"

Now we can implement the FilterInsidePolygon operation:

@Register // 1
final class FilterInsidePolygon
 extends DOperation1To1[DataFrame, DataFrame] { // 2
 override val id: Id = "48fa3638-bc8d-4430-909f-85d4ece824a3" // 3
 override val name: String = "Filter Location Inside Polygon"
 override val description: String = "Filter out rows " +
   "for which location is contained within a specified GeoJson polygon"
 lazy val geoJsonPointColumnSelector = SingleColumnSelectorParam( // 4
  name = "GeoJson Point - Column Name",
  description = Some("Column name containing " +
    "location written as Point object in GeoJson"),
  portIndex = 0
 )
 lazy val geoJsonPolygon = StringParam( // 5
   name = "GeoJson Polygon",
   description = Some("Polygon written in GeoJson format used for filtering out the rows")
 )
 override def params = Array(geoJsonPointColumnSelector, geoJsonPolygon)
  1. We need to annotate the operation with @Register so that Seahorse knows it should be registered in the operation catalogue.
  2. We extend DOperation1To1[DataFrame, DataFrame] because our operation takes one DataFrame as input and returns one DataFrame as output.
  3. The UUID uniquely identifies this operation. After making changes to the operation (e.g. renaming it), the id should not be changed – Seahorse uses it to recognize that it is the same operation as before.
  4. The geoJsonPointColumnSelector parameter will be used for selecting the column with point location data.
  5. The geoJsonPolygon parameter represents the polygon we will check location points against.

Now we can implement the execute method with the actual Geospatial and Apache Spark DataFrame code.

override protected def execute(input: DataFrame)(context: ExecutionContext): DataFrame = {
 // 1 - Parse polygon
 val polygon = GeometryEngine.geometryFromGeoJson(
   $(geoJsonPolygon),
   GeoJsonImportFlags.geoJsonImportDefaults,
   Geometry.Type.Polygon
 )
 val columnName = DataFrameColumnsGetter.getColumnName(
   input.schema.get,
   $(geoJsonPointColumnSelector)
 )
 val filtered = input.sparkDataFrame.filter(row => {
   try {
     val pointGeoJson = row.getAs[String](columnName)
     // 2 - Parse location point
     val point = GeometryEngine.geometryFromGeoJson(
       pointGeoJson,
       GeoJsonImportFlags.geoJsonImportDefaults,
       Geometry.Type.Point
     )
     // EPSG:4326 Spatial Reference
     val standardCoordinateFrameForEarth = SpatialReference.create(4326)
     // 3 - Test if polygon contains point
     GeometryEngine.contains(
       polygon.getGeometry,
       point.getGeometry,
       standardCoordinateFrameForEarth
     )
   } catch {
      case scala.util.control.NonFatal(_) => false // ignore invalid rows for now
   }
 })
 DataFrame.fromSparkDataFrame(filtered)
}
  1. We parse our polygon data using the GeometryEngine class from the esri-geometry-api library.
  2. Inside the Apache Spark DataFrame filter we use the GeometryEngine class again to parse the location point in each row.
  3. We use the GeometryEngine class to test whether the point is contained inside the specified polygon.
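To make the containment test in step 3 less of a black box: a point-in-polygon check like GeometryEngine.contains boils down, geometrically, to something like the classic ray-casting algorithm. The sketch below is a minimal, library-free illustration of the idea – it is not the esri implementation, which additionally handles spatial references, ring orientation, and edge cases:

```scala
object PointInPolygon {
  // Classic ray-casting test on (x, y) vertex pairs: cast a horizontal
  // ray from the point; an odd number of polygon-edge crossings means
  // the point lies inside the polygon.
  def contains(polygon: Seq[(Double, Double)], point: (Double, Double)): Boolean = {
    val (px, py) = point
    var inside = false
    var j = polygon.length - 1
    for (i <- polygon.indices) {
      val (xi, yi) = polygon(i)
      val (xj, yj) = polygon(j)
      // Does the edge (j -> i) straddle the ray's y coordinate,
      // and does the crossing happen to the right of the point?
      val crossesRay = (yi > py) != (yj > py)
      if (crossesRay && px < (xj - xi) * (py - yi) / (yj - yi) + xi)
        inside = !inside
      j = i
    }
    inside
  }
}
```

In real code you should of course keep using the library call, which is robust against degenerate geometries; this sketch only shows why a single polygon-vs-point test is cheap enough to run per row inside a Spark filter.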

Now that the implementation is finished, you can build a JAR and add it to Seahorse:

  1. Run sbt assembly. This produces a JAR in the target/scala-2.11 directory.
  2. Put this JAR in $SEAHORSE/jars, where $SEAHORSE is the directory containing docker-compose.yml or Vagrantfile (depending on whether you run Seahorse with Docker or Vagrant).
  3. Restart Seahorse (either docker-compose stop and docker-compose start, or vagrant halt and vagrant up).
  4. The operation is now visible and ready to use in the Seahorse Workflow Editor.

And now that we have Filter Inside Polygon, we can start implementing our workflow:
NYC Taxi Trips Seahorse Workflow
First we need to specify our datasource. You can download the whole 26GB dataset from http://www.andresmh.com/nyctaxitrips.
Since we are using a local Apache Spark master, we sample 100k rows from the whole dataset.
Additionally, we transformed the latitude and longitude columns into one GeoJson column for both the pickup and drop-off locations.
You can download the final preprocessed dataset used for this experiment from here.
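The lat/lon-to-GeoJson preprocessing step mentioned above can be sketched as a simple string builder. This is a hypothetical helper, not part of the Seahorse SDK; note that GeoJson stores coordinates in [longitude, latitude] order:

```scala
object GeoJsonPoint {
  // Build a GeoJson Point string from a longitude/latitude pair.
  // GeoJson coordinate order is [longitude, latitude], not [lat, lon].
  def fromLonLat(lon: Double, lat: Double): String =
    s"""{"type": "Point", "coordinates": [$lon, $lat]}"""
}
```

Applied to each row's pickup and drop-off coordinates, this yields exactly the kind of GeoJson Point strings that Filter Inside Polygon parses in its execute method.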

Building our Workflow

Let’s start by defining our Datasource in Seahorse:
NYC Taxi Trips Datasource Parameters
Next, we use NYC borough GeoJson polygon values from https://github.com/dwillis/nyc-maps as attributes in our Filter Inside Polygon operations:
Filter Inside Polygon operation parameters
Finally, let’s run our workflow and see how many taxi trips started in Manhattan and ended in the Bronx:
NYC Taxi Trips Workflow Node Report
As we can see, in our 100k-row data sample only 383 trips started in Manhattan and ended in the Bronx.

Summary

We cloned the Seahorse SDK Example repository and, starting from it, implemented a custom Filter Inside Polygon operation in Scala. Then we built a JAR file with our operation and added it to Seahorse. Finally, we built a workflow that uses our custom Filter Inside Polygon operation on Apache Spark DataFrames with GeoJson data.

Links

  • Check Filter Inside Polygon source code in our Seahorse SDK Example repository.
  • Download Workflow which can be imported in Seahorse.