AIOps Network Traffic Analysis (NTA) – a business guide

March 27, 2020/in Big data & Spark, Machine learning /by Konrad Budek

Network Traffic Analysis (NTA) is a key component of modern cybersecurity in companies. With machine learning and artificial intelligence solutions, the sheer amount of data to analyze is an asset to be used rather than, as was once the case, a challenge to overcome.

This post looks at:

  • What is network traffic analysis
  • The benefits of network traffic analysis
  • How AI and machine learning can support network traffic analysis

According to MarketsandMarkets data, the global network traffic monitoring software and traffic analysis tool market is projected to grow from $1.9 billion in 2019 to $3.2 billion by 2024. The growth is driven mostly by the increasing demand for sophisticated network monitoring tools and advanced network management systems that can handle the growing traffic and increasing flow of information.

The growth in internal traffic is a direct reflection of global trends. According to Cisco data, nearly 66% of the global population will be online by 2023. The increase in traffic is driven not only by users but also by the myriad of connected devices that form the IoT cloud around us.

The share of Machine-to-Machine (M2M) connections is estimated to grow from 33% in 2018 to 50% in 2023, with the consumer segment accounting for 74% of that share and the business segment for 26%.

What is network traffic analysis

In its most basic form, Network Traffic Analysis (NTA) is the process of recording and analyzing network traffic patterns in search of suspicious elements and security threats. The term was originally coined by Gartner to describe a growing industry in the computer security ecosystem.

The foundation of NTA is the assumption that there is a “normal” situation in the system that reflects daily operations. Due to seasonal or general trends, operations fluctuate naturally, but overall the system remains stable and thus internal network monitoring can be done with a traffic analyzer. Knowing the “normal” situation is the first step in spotting signs of malicious activities within the system.

In addition to spotting security threats, NTA is also used to optimize the system, spotting inefficiencies and identifying when additional components are needed.

Network Traffic Analysis software tools analyze a system’s communication flow, including:

  • TCP/UDP packets
  • “Virtual” network traffic carried over virtual private networks
  • Traffic to and from cloud environments (storage, computing power, etc.)
  • API calls to cloud-based apps or SaaS solutions.

This means that nearly all traffic and information flow can be tracked and analyzed by smart network traffic analysis solutions. Modern solutions often use sophisticated techniques like reinforcement learning.

A key component of network analytics tools is the dashboard used to interface with the team, which receives clear information about the network. The dashboard enables easier network performance monitoring and diagnostics and is a convenient way to convey technical knowledge to those who lack it. Reducing complexity to simplicity, the dashboard will also play its part in convincing your financial director to spring for a powerful new server or another essential component.

NTA solutions are clearly sophisticated and powerful tools. But what are the direct benefits of network traffic analysis?

The benefits of network traffic analysis

There are several benefits:

  • Avoiding bandwidth and server performance bottlenecks – armed with knowledge about how information flows through the system, one can analyze the network, pinpoint problems and start looking for solutions.
  • Discovering apps that gobble up bandwidth – tweaking the system can deliver significant savings when API calls are reduced or information is reused.
  • Proactively reacting to a changing environment – a key feature when it comes to delivering high-quality services for clients and customers. The company can react to increasing demand or spot signs of an approaching peak and harden the network against it. Advanced network traffic analysis tools are often armed with mechanisms that respond to network changes in real time, much faster than any administrator could.
  • Managing devices in groups – modern network monitoring applications let companies group devices and network components and manage them together, effectively putting the earlier network performance analytics to work.
  • Optimizing resource usage – with all apps, devices, components and traffic pinpointed on a dashboard, the company can make more informed decisions about the system’s resources and costs.

The key challenge in computer network management is processing and analyzing the gargantuan amounts of data networks produce. Looking for the proverbial needle in the haystack is an apt metaphor for searching for insights among the data mined from a network.

Using ML tools is the only way to effectively monitor network traffic.

How machine learning can support traffic analysis

The key breakthrough that comes from using machine learning-powered tools in NTA is automation. The lion’s share of the dull and repetitive yet necessary work is done by machines. In real-time network analysis, reaction time is another factor that only machines can handle. Machines and neural networks can spot and analyze hidden patterns in data to deliver a range of advantages for companies. To name just a few:

Intrusion detection

The first and sometimes the only sign of intrusion into a system is suspicious traffic that can be easily overlooked. Intrusions are often detected only after 14 days.

AI-based solutions are tireless, analyzing traffic in real-time. Armed with the knowledge of infrastructure operations, the system can spot any sign of malicious activity.
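
To make this concrete, below is a minimal sketch of one technique such a system could build on – clustering “normal” traffic and flagging flows that sit far from every cluster centre. It uses Spark MLlib; the input path, feature columns and choice of k are assumptions made up for the example, not deepsense.ai’s production setup.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("nta-anomaly-sketch").getOrCreate()
import spark.implicits._

// Hypothetical flow-level features extracted from network traffic.
val flows = spark.read.parquet("flows.parquet") // columns: bytesSent, bytesReceived, duration, packets

val featurized = new VectorAssembler()
  .setInputCols(Array("bytesSent", "bytesReceived", "duration", "packets"))
  .setOutputCol("features")
  .transform(flows)

// Learn what "normal" traffic looks like as a handful of clusters.
val model = new KMeans().setK(8).setFeaturesCol("features").fit(featurized)
val centers = model.clusterCenters

// Score each flow by its squared distance to the nearest cluster centre;
// unusually distant flows are candidates for investigation.
val distanceToNearest = udf { (v: Vector) => centers.map(c => Vectors.sqdist(c, v)).min }
val scored = model.transform(featurized).withColumn("anomalyScore", distanceToNearest($"features"))

scored.orderBy($"anomalyScore".desc).show(20) // most suspicious flows first

In a real deployment the anomaly score would feed alerting thresholds tuned against the baseline of normal operations described earlier.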

Reducing false positives

AI-based solutions are less prone to the false positives that can turn the life of a system administrator into a living hell. ML-supported NTA enriched with false-positive detection and reduction enables the team to focus on real challenges rather than on verifying every alert.

Workload prediction

With data about ongoing system performance, the solution can deliver information about predicted traffic peaks or troughs to optimize spending.

Thus the benefits are twofold. First, the company can manage the costs of infrastructure, be it cloud or on-prem, to handle the real traffic and avoid overpaying. Second, there is much more predictability in the estimated need for resources, so they can be booked in advance or the costs can be optimized in other ways.

Spotting early signs of attack (DDoS)

A Distributed Denial of Service (DDoS) attack attempts to suddenly overload a company’s resources in an effort to take down the website or other online service. The losses are hard to predict – from reputational damage when the company proves unable to defend itself against cybercrime, to the staggering and quickly accruing losses of being unavailable to customers.

With early information about an incoming attack, the company can set up defenses such as blocking certain traffic, ports or locations to keep services available in other markets. Network traffic reports can also be used by the agencies that fight cybercrime and hunt down those responsible for the attack.

Malicious packet detection

Sometimes there is no intrusion at all and the malicious activity is not aimed directly at the company. A user could have downloaded malware onto a private device connected to the enterprise network via a VPN. From there the infection can spread, or the malware can leverage the company’s resources, such as computing power, for its own purposes, like mining cryptocurrency without the owner’s consent.

Summary

Network traffic monitoring and analysis is one of the key components of modern enterprise-focused cybersecurity. The gargantuan amounts of data to process also make it a perfect foundation for ML-based solutions, which thrive on data.

That’s why deepsense.ai delivers a comprehensive AIOps architecture-based platform for network data analytics.

If you have any questions about the AIOps solutions we provide, don’t hesitate to contact Andy Thurai, our Head of US Operations, via the contact form or at aiops@deepsense.ai.


Five hottest big data trends 2018 for the techies

May 10, 2018/in Big data & Spark /by Konrad Budek

Currently, the world is producing 16.3 zettabytes of data a year. According to IDC, by 2025 that amount will rise tenfold, to 163 zettabytes a year. But how big, exactly, is a zetta?

To imagine how much data there is for data scientists and managers to handle every single day, compare it to something familiar – Earth’s atmosphere, the Solar System or the Milky Way. According to NASA, Earth’s atmosphere has a mass of approximately five zettagrams. So for every gram of gas around our planet, we produce here on Earth a bit more than 3 bytes of data each year. In 2025 there will be 30 bytes of data generated every year for each gram of air around the globe.
Usually, the distance between the stars or planets is measured in Astronomical Units, which are equal to the distance between the Sun and the Earth. One AU is about 150 million kilometers, or 150 gigameters. So if there were one byte of information for every meter between the Sun and Earth, there would be enough space for Windows Vista and a few useful apps (not many of them though). The distance between the Sun and Pluto is equal to 6 terameters, so if there were one byte for every meter between the Sun and Pluto there would be six terabytes of storage. That equals a bit more than Wikipedia’s SQL dataset in January 2010.
According to NASA, the entire Milky Way is 1000 zettameters wide. So assuming every meter could hold a byte, it would take about six years from 2025 to fill up all the Milky Way’s diameter with data.
That being the magnitude of the world’s data, is it any surprise that data scientists and businesses are seeking ways to manage the amount of data they’re dealing with?


1. Spark firing up the big data in business

The people who manage and harvest big data say Apache Spark is their software of choice. According to Microstrategy’s data, Spark is considered “important” for 77% of the world’s enterprises, and critical for 30%.
Spark’s importance and popularity are growing throughout the industry. In 2017, it surpassed MapReduce as the most popular infrastructure solution for handling big data. Considering that, learning how to leverage Spark to boost big data management is profitable for both engineers and data scientists.


2. Real-time data processing – challenge in a batch

Modern data science is not only about gaining insight, but doing so fast. All industries benefit from getting information in real time, both to optimize existing processes and to develop new ones. The ability to react during an event is crucial to maintenance (preventing breakdowns), marketing (knowing when to reach out to someone) and quality control (getting things right on the production line).
Currently, internet marketing is the best playground for data streaming. Real-time data is a key tool in augmenting marketing for 40% of marketers. In Real-Time Bidding (RTB), digital-ad impressions are sold at automated auctions. Both the buyer and the seller need a platform that provides delay-free, up-to-the-second data. What’s more, internet analytics relies on processing real-time data to build heatmaps, map digital customers’ journeys and gather customers’ behavioral data.
Real-time processing is unachievable with traditional, batch-based data processing. Spark makes it easy by unifying batch and streaming, enabling seamless conversion between the two modes of processing.
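
As a rough illustration of that unification (a sketch only: the Kafka topic and message format are made-up assumptions, and the spark-sql-kafka connector has to be on the classpath), the same DataFrame-style code that would run as a batch job can be pointed at a stream:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()
import spark.implicits._

// Read a hypothetical clickstream topic as an unbounded DataFrame.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "clickstream")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

// The same groupBy/count API used in batch jobs, applied per one-minute window.
val counts = events.groupBy(window($"timestamp", "1 minute")).count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()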

3. From academia to business – productizing the models

AI and machine learning were once nothing more than academic playthings, as the models were too unstable and unreliable to handle business challenges. Integrating them in the enterprise environment was also tricky. Machine learning models, commonly trained using Python or R, often prove hard to integrate with an existing application built with, say, Java. But the Spark framework makes this integration easy, as it provides support for Scala, Java, Python, and R. It enables you to run your machine learning model right inside the data management solution to harvest insight in a faster, automated way.
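
As a rough sketch of what such a productized model can look like in practice (the column names, paths and choice of algorithm below are assumptions made up for the example, not a prescribed setup), a whole preprocessing-plus-model pipeline can be trained, saved and later reloaded from a JVM application:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("productized-model-sketch").getOrCreate()

// Hypothetical training data with a binary label column.
val training = spark.read.parquet("training.parquet") // columns: age, income, clicks, label

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "clicks"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// Preprocessing and model form a single Pipeline object...
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// ...which can be persisted and served from Scala or Java code.
model.write.overwrite().save("/models/churn-lr")
val scored = model.transform(spark.read.parquet("new_customers.parquet"))
scored.select("prediction", "probability").show()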
With productized models, AI is set to increase labor productivity by 40%. Thus, it’s no surprise that 72% of US business leaders consider AI a “business advantage”.
Five hottest big data trends 2018 for the techies - From academia to business

4. Unstructured data – cleaning up the mess

Companies gather numerous types of data, including video, images, and text. Most of it is unstructured, coded with various exotic formats or, sometimes, with no format at all.
In fact, data scientists can spend as much as 90% of their time making data useful by structuring and cleaning it up. Applying data processing technologies such as Spark to integrate and manage data from heterogeneous sources makes both harvesting insights and building machine learning models much easier.
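
A small sketch of what that integration can look like in Spark (the file paths, formats and join key below are assumptions made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("heterogeneous-sources-sketch").getOrCreate()

// A legacy CSV export and a semi-structured JSON log, pulled into one table.
val orders  = spark.read.option("header", "true").option("inferSchema", "true").csv("exports/orders.csv")
val reviews = spark.read.json("logs/reviews.json")
val joined  = orders.join(reviews, Seq("product_id"), "left_outer")

// One structured dataset, ready for analytics or as machine learning input.
joined.write.mode("overwrite").parquet("warehouse/orders_with_reviews")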


5. Edge computing – process data faster and cheaper

As the amount of data produced skyrockets, computing it becomes a considerable challenge. According to General Electric data, every 8 hours of driving an autonomous vehicle generates 40 terabytes of data. Streaming all of it would be neither efficient nor safe. Imagine a child running down the street. Such information must be processed immediately in real time, as any delay could endanger the child’s welfare.
That’s why edge computing, or managing data near its source (at the edge of the network), maximizes the efficiency of data management and reduces the cost of internet transfer.
Due to the growing amount of data (just imagine the Earth’s atmosphere with a few bytes for every gram of air, as mentioned above), edge computing will keep on growing.
Big data has been called the new oil. But unlike oil, the amount of data available is not only growing – its growth is accelerating. The problem is not with gathering it, but with managing it, as data, unlike oil, is most valuable when shared and combined with other resources, not just sold.


The hottest guys for the hottest trends

Considering the trends above, it is no surprise that Data Scientist has been called the hottest job in the US. As Glassdoor states, there were 4,524 job openings for data scientists with a median base salary of $110,000.
But being a machine learning specialist requires a unique skill set, one that includes analytical skills, technical proficiency and a data-oriented mindset. According to LinkedIn, the number of data scientists in the US has risen nearly tenfold since 2012.
Becoming a data scientist is currently one of the most profitable career paths for IT engineers. On the other hand, while data scientist may be among the world’s best-paid careers, companies are struggling to find the right people. That’s why some companies choose to train them in-house with the assistance of an experienced partner.


Five trends for business to surf the big data wave

May 2, 2018/in Big data & Spark /by Konrad Budek

When data gets really big, it is sometimes faster to load it onto a truck and ship it by highway than to transfer it online. That’s how the 100 petabytes of satellite images of Earth that DigitalGlobe collected over 17 years are being moved.

Data being transported in Amazon’s “Snowmobile” truck says a lot about just how big data has grown. And just how much is that? Well, global IP traffic has skyrocketed from 96,054 petabytes a month in 2016 to 150,910 petabytes in 2018 and is estimated to hit 278,108 petabytes a month in 2021.
According to an IDC analysis, the world created 16.3 zettabytes of data a year in 2017. By 2025, the amount will be tenfold higher. That’s why businesses cannot afford to overlook critical trends in data management and big data analytics. So how should you go about riding this wave?


1. Growing demand for data science and analytics talents

According to a recent Glassdoor report, data scientist is “the best job in America in 2018”. At the beginning of the year, there were 4,524 job openings with a median base salary of $110,000. Candidates follow the market closely, and it should come as no surprise that new ways to become a data scientist are emerging. As LinkedIn’s 2017 U.S. Emerging Jobs Report states, there are 9.8 times more Machine Learning Engineers working today than in 2012.
But there is one challenge – being a data scientist takes more than just knowing how to code in Python. The job requires both specific tech skills and a “data-oriented” mindset. Although new ways of becoming a data scientist are emerging, there is a growing gap between the demand for data scientists and the actual supply of this rare species – and that doesn’t appear likely to change anytime soon.
Any company seeking to boost its business processes with AI and machine learning should bear in mind how challenging hiring a data scientist can be. Enlisting the support of an experienced mentor to build a team internally, from the ground up, may be more cost-effective and much faster than acquiring specialists.

2. Augmented analysis on the rise

According to Gartner, augmented analytics is “an approach that automates insights using machine learning and natural language generation” that “marks the next wave of disruption in the data and analytics market”. With augmented analytics, companies can use new technologies to efficiently harvest the insights from data, both obvious and less obvious ones.
According to Forbes Insights, 69% of leading-edge companies believe that augmented intelligence will improve customer loyalty and 50% of all companies surveyed think that these technologies will improve customer experience. What’s more, 60% of companies believe that augmented analytics will be crucial in helping them obtain new customers. Clearly, this is not a technology to overlook or underestimate in future investments. Given the volume of data being produced today and the amount of possible correlations it gives rise to, the days of inspecting it manually are numbered.


3. Edge computing speeds up data transfer

Just as the number of connected devices is growing, so too is the need for computing power required to analyze data. There will be no fewer than 30 billion data-producing devices in 2020. With the decentralization of data comes a plethora of challenges, including delays in data transfer and huge pressure exerted on central infrastructure.
To optimize the amount of data to be transferred via the Internet, companies perform more and more computing near the edge of the network, close to the source of the data. Consider autonomous vehicles, each of which, according to GE, generates 40 terabytes of data for every eight hours of driving. It would be neither practical nor cheap – and for sure not safe – to send all that data to an external processing facility.

4. Convergence, or merging the past, present and future world in one dashboard

Merging data from various sources is nothing new – companies have been using weather forecasts or combining historical data with sales predictions for a long time. New technologies, however, are bringing data merging into the mainstream, and there’s a growing number of use cases to exemplify just why this is happening.
One inspiring example comes from Stella Artois and its apple and pear cider, Cidre. Analysis of its historical data showed the company that its sales grow when the temperature rises above a certain threshold.
So it decided to run an outdoor campaign on digital billboards on a cost-per-minute basis, triggered only by specific conditions – the right temperature, sunny weather and no clouds. The company reported a year-over-year sales increase north of 65% during the period when its weather-responsive campaign ran, efficiently harnessing the power of its data.
We can thank Herradura Tequila, a premium liquor brand, for another fascinating example. The company partnered with Foursquare to gain information about potential customers from their check-ins. What’s more, Herradura’s producer was able to obtain information about other places those customers liked to visit.
With the information it gathered, the company was able to send targeted ads to the people representing a similar profile. Combining the historical data with the profile and geolocation resulted in a 23% incremental rise in visits to places selling Herradura among people who had been exposed to the ads, when compared to the control group.


5. New data types – Excel is not enough (nor even close)

With voice-activated personal assistants or image recognition technology going mainstream, companies will gain access to new, unstructured types of data.
Given the sheer amount and nature of such data, information flowing from diverse, unstructured sources needs to be processed with machine learning algorithms. Manual processing would simply be too time-consuming and ineffective.
deepsense.ai’s cooperation with the global research firm Nielsen provides a fine example of such algorithms at work. With the power of deep learning, deepsense.ai built an app that swiftly recognizes the ingredient fields on various FMCG products and then uploads the information into a structured, centralized database. With an accuracy of 90%, the solution gives researchers a new tool to deliver faster analysis of the retail market.
The possibilities are countless – be it visual quality control based on IP cameras or logo detection and brand visibility analytics tools – you name it. As automating quality testing with machine learning may increase defect detection rates by up to 90%, the possible savings are impressive.
Big data is quite a common term in modern business. But harvesting the power of the information becomes more problematic as the amount collected continues to grow.


GeoJson Operations in Apache Spark with Seahorse SDK

February 7, 2017/in Big data & Spark, Seahorse /by Adam Jakubowski

A few days ago we released Seahorse 1.4, an enhanced version of our machine learning, Big Data manipulation and data visualization product.

This release also comes with an SDK – a Scala toolkit for creating new custom operations to be used in Seahorse.
As a showcase, we will create a custom Geospatial operation with GeoJson and make a simple Seahorse workflow using it. This use-case is inspired by Example 8 from the book Advanced Analytics with Spark.

Geospatial data in GeoJson format

GeoJson can encode many geographic data types. For example:

  • Location Point
  • Line
  • Region described by a polygon

GeoJson encodes geospatial data using Json. A location point, for example, looks like this:

{
    "type": "Point",
    "coordinates": [125.6, 10.1]
}
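
A region such as a borough is encoded in a similar way as a Polygon, i.e. a closed ring of [longitude, latitude] positions. The coordinates below are a tiny made-up example, not an actual borough boundary:

{
    "type": "Polygon",
    "coordinates": [
        [[125.0, 10.0], [126.0, 10.0], [126.0, 11.0], [125.0, 11.0], [125.0, 10.0]]
    ]
}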

New York City Taxi Trips

In our workflow we will use a New York Taxi Trip dataset with pickup and drop-off location points.
Let’s say that we want to know how many taxi trips started in Manhattan and ended up in the Bronx. To achieve this, we could use an operation that filters the dataset by location contained inside some geographic region. Let’s call this operation ‘Filter Inside Polygon’.

Manhattan GeoJson Polygon visualization using geojson.io

Bronx GeoJson Polygon visualization using geojson.io

You can download GeoJson data with New York boroughs from https://github.com/dwillis/nyc-maps.

Implementing a custom operation using the Seahorse SDK involves the following steps:

  1. Clone the Seahorse SDK Example Git Repository.
  2. Implement the custom Geospatial operation.
  3. Assemble a JAR with the custom Geospatial operation.
  4. Add the assembled JAR to a Seahorse instance.
  5. Use the custom Geospatial operation in a Workflow.

Seahorse SDK Example Git Repository

The fastest way to start developing custom Seahorse operations is to clone our SDK Example Git Repository and write your code from there.

git clone --branch 1.4.0 https://github.com/deepsense-io/seahorse-sdk-example

The Seahorse SDK Example Repository has all Seahorse and Apache Spark dependencies already defined in an SBT build file definition.
Let’s add the geospatial data library to our dependencies in the build.sbt file:

libraryDependencies += "com.esri.geometry" % "esri-geometry-api" % "1.0"

Now we can implement the FilterInsidePolygon operation:

@Register // 1
final class FilterInsidePolygon
 extends DOperation1To1[DataFrame, DataFrame] { // 2
 override val id: Id = "48fa3638-bc8d-4430-909f-85d4ece824a3" // 3
 override val name: String = "Filter Location Inside Polygon"
 override val description: String = "Filter out rows " +
   "for which location is contained within a specified GeoJson polygon"
 lazy val geoJsonPointColumnSelector = SingleColumnSelectorParam( // 4
  name = "GeoJson Point - Column Name",
  description = Some("Column name containing " +
    "location written as Point object in GeoJson"),
  portIndex = 0
 )
 lazy val geoJsonPolygon = StringParam( // 5
   name = "GeoJson Polygon",
   description = Some("Polygon written in GeoJson format used for filtering out the rows")
 )
 override def params = Array(geoJsonPointColumnSelector, geoJsonPolygon)
  1. We need to annotate the operation with @Register so that Seahorse knows that it should be registered in the operation catalogue.
  2. We extend DOperation1To1[DataFrame, DataFrame] because our operation takes one DataFrame as an input and returns one DataFrame as the output.
  3. UUID is used to uniquely identify this operation.
    After making changes in the operation (e.g. when its name is changed), id should not be changed – it is used by Seahorse to recognize that it’s the same operation as before.
  4. Parameter geoJsonPointColumnSelector will be used for selecting a column with Point Location data.
  5. Parameter geoJsonPolygon represents a Polygon we will be checking location points against.

Now we can implement an execute method with the actual GeoSpatial and Apache Spark DataFrame code.

override protected def execute(input: DataFrame)(context: ExecutionContext): DataFrame = {
 // 1 - Parse polygon
 val polygon = GeometryEngine.geometryFromGeoJson(
   $(geoJsonPolygon),
   GeoJsonImportFlags.geoJsonImportDefaults,
   Geometry.Type.Polygon
 )
 val columnName = DataFrameColumnsGetter.getColumnName(
   input.schema.get,
   $(geoJsonPointColumnSelector)
 )
 val filtered = input.sparkDataFrame.filter(row => {
   try {
     val pointGeoJson = row.getAs[String](columnName)
     // 2 - Parse location point
     val point = GeometryEngine.geometryFromGeoJson(
       pointGeoJson,
       GeoJsonImportFlags.geoJsonImportDefaults,
       Geometry.Type.Point
     )
     // EPSG:4326 Spatial Reference
     val standardCoordinateFrameForEarth = SpatialReference.create(4326)
     // 3 - Test if polygon contains point
     GeometryEngine.contains(
       polygon.getGeometry,
       point.getGeometry,
       standardCoordinateFrameForEarth
     )
   } catch {
     case _ => false // ignore invalid rows for now
   }
 })
 DataFrame.fromSparkDataFrame(filtered)
}
  1. We parse our polygon data using GeometryEngine class from esri-geometry-api library.
  2. Inside the Apache Spark dataFrame filter we use GeometryEngine class again to parse location points in each row.
  3. We use GeometryEngine class to test whether the point is contained inside the specified polygon.

Now that the implementation is finished, you can build a JAR and add it to Seahorse:

  1. Run sbt assembly. This produces a JAR in the target/scala-2.11 directory.
  2. Put this JAR in $SEAHORSE/jars, where $SEAHORSE is the directory with docker-compose.yml or Vagrantfile (depending whether you run Docker or Vagrant).
  3. Restart Seahorse (either by running docker-compose stop and docker-compose start, or vagrant halt and vagrant up, depending on whether you run Docker or Vagrant).
  4. Operations are now visible and ready to use in Seahorse Workflow Editor.

And now that we have Filter Inside Polygon, we can start implementing our workflow:
NYC Taxi Trips Seahorse Workflow
First we need to specify our datasource. You can download the whole 26GB dataset from http://www.andresmh.com/nyctaxitrips.
Since we are using a local Apache Spark master, we sample 100k rows from the whole dataset.
Additionally, we transformed the latitude and longitude columns into one GeoJson column for both pickup and drop-off locations.
You can download the final preprocessed dataset used for this experiment from here.

Building our Workflow

Let’s start by defining our Datasource in Seahorse:
NYC Taxi Trips Datasource Parameters
Next, we use NYC borough GeoJson polygon values from https://github.com/dwillis/nyc-maps as attributes in our Filter Inside Polygon operations:
Filter Inside Polygon operation parameters
Finally, let’s run our workflow and see how many taxi trips started in Manhattan and ended in the Bronx:
NYC Taxi Trips Workflow Node Report
As we can see from our 100k data sample, only 383 trips started in Manhattan and ended in the Bronx.

Summary

We cloned the Seahorse SDK Example repository and, starting from it, implemented a custom Filter Inside Polygon operation in Scala. Then we built a JAR file with our operation and added it to Seahorse. After that we built our workflow and used the custom Filter Inside Polygon operation on Apache Spark DataFrames with GeoJson data.

Links

  • Check Filter Inside Polygon source code in our Seahorse SDK Example repository.
  • Download Workflow which can be imported in Seahorse.

Scheduling Spark jobs in Seahorse

January 30, 2017/in Big data & Spark, Seahorse /by Michal Szostek

In the latest Seahorse release we introduced the scheduling of Spark jobs. We will show you how to use it to regularly collect data and send reports generated from that data via email.

Use case

Let’s say that we have a local weather station and the data from this station is uploaded automatically to Google Sheets. The spreadsheet contains one header row and a single data row (row number 2) which always holds the current weather readings: temperature, humidity, atmospheric pressure, wind speed and time.
We would like to generate some graphs from this data every now and then and do it without any repeated effort.

Solution

Collecting data

The first workflow that we create collects a snapshot of live data and appends it to a historical data set. It will be run at regular intervals.
Let’s start by creating two data sources in Seahorse. The first data source represents the Google Spreadsheet with live-updated weather data:
Seahorse's Google Spreadsheet data source options.
For the historical weather data, we create a data source representing a file stored locally within Seahorse. Initially, it only contains a header and one row – current data.
Data source named "Historical weather data", which has its source in a library CSV file.
Now we can create our first Spark job which reads both data sources, concatenates them and writes back to the “Historical weather data” data source. In this way, there will be a new data row in “Historical weather data” every time the workflow is run.
Screenshot shows a workflow with four nodes, which will run as a scheduled job. At the top, there are two "Read DataFrame" operations named "Historical weather data" and "Live weather data". Their output is connected to a "Union" operation named "Append live data snapshot". Its output is fed to a "Write DataFrame" operation, writing to "Historical weather data". This is one of the workflows to be scheduled using Spark job scheduling.
There is one more step for us to do – this Spark job needs to be scheduled. It is run every half an hour and after every scheduled job execution an email is sent.
Form for selecting scheduling options: "Select cluster preset: default Run workflow every hour at 05,35 minute(s) past the hour Send email reports to: execution-reports@example.com"
In the email there is a link to that workflow, where we are able to see node reports from the execution.

Sending a weather report

The next workflow is very simple – it consists of a Read DataFrame operation and a Python Notebook. It is done separately from collecting the data because we want to send reports at a much lower rate.
Workflow consisting of two operations - "Read DataFrame", reading from "Historical weather data" and connected to it "Python Notebook".
In a notebook, first we transform the data a little:

import pandas
import numpy
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# dataframe() is provided by Seahorse and returns the DataFrame connected to the notebook's input
data = dataframe().toPandas()
data['time'] = pandas.to_datetime(data['time'])
# DataFrame.sort was the pandas API at the time; on newer pandas use sort_values('time')
data = data.sort('time')

Next, we prepare a function that will be used to line-plot a time series of weather characteristics like temperature, air pressure and humidity:

def plot(col_to_plot):
   ax = plt.axes()
   ax.spines["top"].set_visible(False)
   ax.spines["right"].set_visible(False)
   ax.get_xaxis().tick_bottom()
   ax.get_yaxis().tick_left()
   y_formatter = matplotlib.ticker.ScalarFormatter(useOffset=False)
   ax.yaxis.set_major_formatter(y_formatter)
   plt.locator_params(nbins=5, axis='x')
   plt.locator_params(nbins=5, axis='y')
   plt.yticks(fontsize=14)
   plt.xticks(fontsize=14)
   plt.plot(data['time'], data[col_to_plot])
   plt.gcf().autofmt_xdate()
   plt.show()

After editing the notebook, we set up its parameters so that reports are sent to our email address after each run.
Options of Python Notebook. Execute notebook: true. Send E-mail report: true. Email Address: reports@example.com.
Finally, we can set up a schedule for our report. It is similar to the previous one, only sent less often – we send a report every day in the evening. After each execution two emails are sent – one, as previously, with a link to the executed workflow and the second containing the notebook execution result.

Result

After some time, we have enough data to plot it. Here is a sample from the “Historical weather data” data source…
Data report for "Historical weather data" Read DataFrame collected using Spark job scheduling. There is a table: columns are "temperature", "humidity", "atmospheric pressure", "wind speed" and "time"; rows contain some sample data.
… and graphs that are in the executed notebook:
Two cells in Python Notebook, with inputs: "plot('temperature')" and "plot('atmospheric pressure')". Both outputs are images with line graphs.

Summary

We created a data processing and email reporting pipeline using two simple workflows and some Python code. Seahorse works well not only for big data, but also for scenarios where you need to integrate data from different sources and periodically generate reports – Seahorse data sources and Spark job scheduling are the right tools for the job.


Optimize Spark with DISTRIBUTE BY & CLUSTER BY

May 18, 2016/in Big data & Spark /by Witold Jędrzejewski

Distribute by and cluster by clauses are really cool features in SparkSQL. Unfortunately, this subject remains relatively unknown to most users – this post aims to change that.
In order to gain the most from this post, you should have a basic understanding of how Spark works. In particular, you should know how it divides jobs into stages and tasks, and how it stores data on partitions. If you’re not familiar with these subjects, this article may be a good starting point (besides the Spark documentation, of course).
Please note that this post was written with Spark 1.6 in mind.

Cluster by/Distribute by/Sort by

Spark lets you write queries in a SQL-like language – HiveQL. HiveQL offers special clauses that let you control the partitioning of data. This article explains how this works in Hive. But what happens if you use them in your SparkSQL queries? How does their behavior map to Spark concepts?


Distribute By

Repartitions a DataFrame by the given expressions. The number of partitions is equal to spark.sql.shuffle.partitions. Note that in Spark, when a DataFrame is partitioned by some expression, all the rows for which this expression is equal are on the same partition (but not necessarily vice-versa)!
This is how it looks in practice. Let’s say we have a DataFrame with two columns: key and value.

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key

Equivalent in DataFrame API:

df.repartition($"key", 2)

Example of how it could work:
spark cluster by distribute by partitions 1

Sort By

Sorts data within partitions by the given expressions. Note that this operation does not cause any shuffle.
In SQL:

SELECT * FROM df SORT BY key

Equivalent in DataFrame API:

df.sortWithinPartitions($"key")

Example of how it could work:
spark cluster by distribute by partitions 2

Cluster By

This is just a shortcut for using distribute by and sort by together on the same set of expressions.
In SQL:

SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key

Equivalent in DataFrame API:

df.repartition($"key", 2).sortWithinPartitions()

Example of how it could work:
spark cluster by distribute by partitions 3

When Are They Useful?

Why would you ever want to repartition your DataFrame? Well, there are multiple situations where you really should.

Skewed Data

Your DataFrame is skewed if most of its rows are located on a small number of partitions, while the majority of the partitions remain empty. You really should avoid such a situation. Why? It makes your application effectively non-parallel – most of the time you will be waiting for a single task to finish. Even worse, in some cases you can run out of memory on some executors or cause an excessive spill of data to disk. All of this can happen if your data is not evenly distributed.
To deal with the skew, you can repartition your data using distribute by. For the expression to partition by, choose something that you know will evenly distribute the data. You can even use the primary key of the DataFrame!
For example:

SET spark.sql.shuffle.partitions = 5
SELECT * FROM df DISTRIBUTE BY key, value

could work like this:
spark cluster by distribute by partitions 4
Note that distribute by does not guarantee that data will be distributed evenly between partitions! It all depends on the hash of the expression by which we distribute. In the example above, one can imagine that the hash of (1,b) was equal to the hash of (3,a). And even when hashes for two rows differ, they can still end up on the same partition, when there are fewer partitions than unique hashes! But in most cases, with bigger data samples, this trick can mitigate the skew. Of course, you have to make sure that the partitioning expression is not skewed itself (meaning that the expression takes the same value for most of the rows).
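
If you want to verify how a given distribute by (or repartition) actually spread the rows, one quick check is to count the rows sitting on each partition. Below is a rough spark-shell sketch using a small made-up key/value DataFrame:

// Build a small key/value DataFrame and repartition it by key (spark-shell).
val df = sc.parallelize(1 to 100).map(i => (i, i % 10)).toDF("key", "value").repartition(5, $"key")

// Count the rows sitting on each partition.
val partitionSizes = df.rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()
partitionSizes.sortBy(_._1).foreach { case (idx, size) =>
  println(s"partition $idx: $size rows")
}

Heavily unbalanced counts are a sign that the chosen expression does not distribute the data well.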


Multiple Joins

When you join two DataFrames, Spark will repartition them both by the join expressions. This means that if you are joining to the same DataFrame many times (by the same expressions each time), Spark will be doing the repartitioning of this DataFrame each time.
Let’s see it in an example.
Let’s open spark-shell and execute the following code.
First, let’s create some DataFrames to play with:

val data = for (key <- 1 to 1000000) yield (key, 1)
sc.parallelize(data).toDF("key", "value").registerTempTable("df")
import scala.util.Random
sc.parallelize(Random.shuffle(data)).toDF("key", "value").registerTempTable("df1")
sc.parallelize(Random.shuffle(data)).toDF("key", "value").registerTempTable("df2")

While performing the join, if one of the DataFrames is small enough, Spark will perform a broadcast join. This is actually a pretty cool feature, but it is a subject for another blog post. Right now, we are interested in Spark’s behavior during a standard join. That’s why – for the sake of the experiment – we’ll turn off the autobroadcasting feature by the following line:

sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")

Ok, now we are ready to run some joins!

sqlContext.sql("CACHE TABLE df")
sqlContext.sql("SELECT * FROM df JOIN df1 ON df.a = df1.a").show
sqlContext.sql("SELECT * FROM df JOIN df2 ON df.a = df2.a").show

Let’s see how it looks in SparkUI (for spark-shell it usually starts on localhost:4040). Three jobs were executed. Their DAGs look like this:

DAG for Job 1
DAG for Job 2
DAG for Job 3

The first job just creates df and caches it. The second one creates df1, loads df from the cache (this is indicated by the green dot) and then repartitions both of them by key. The third DAG is really similar to the second one, but uses df2 instead of df1. So it is clear that we repartitioned df by key two times.
How can this be optimised? The answer is, you can repartition the DataFrame yourself, only once, at the very beginning.

val dfDist = sqlContext.sql("SELECT * FROM df DISTRIBUTE BY a")
dfDist.registerTempTable("df_dist")
sqlContext.sql("CACHE TABLE df_dist")
sqlContext.sql("SELECT * FROM df_dist JOIN df1 ON df_dist.a = df1.a").show
sqlContext.sql("SELECT * FROM df_dist JOIN df2 ON df_dist.a = df2.a").show
DAG for Job 1
DAG for Job 2
DAG for Job 3

This time the first job has an additional stage – we perform repartitioning by key there. But in both of the following jobs, one stage is skipped and the repartitioned DataFrame is taken from the cache – note that green dot is in a different place now.


Sorting in Join

There is one thing I haven’t told you about yet. Starting from version 1.2, Spark uses sort-based shuffle by default (as opposed to hash-based shuffle). So actually, when you join two DataFrames, Spark will repartition them both by the join expressions and sort them within the partitions! That means the code above can be further optimised by adding sort by to it:

SELECT * FROM df DISTRIBUTE BY key SORT BY key

But as you now know, distribute by + sort by = cluster by, so the query can get even simpler!

SELECT * FROM df CLUSTER BY key

Multiple Join on Already Partitioned DataFrame

Ok, but what if the DataFrame that you will be joining to is already partitioned correctly? For example, if it is the result of grouping by the expressions that will be used in the join? Well, in that case you don’t have to repartition it once again – a mere sort by will suffice.

val dfDist = sqlContext.sql("SELECT a, count(*) FROM some_other_df GROUP BY a SORT BY a")
dfDist.registerTempTable("df_dist")
sqlContext.sql("CACHE TABLE df_dist")
sqlContext.sql("SELECT * FROM df_dist JOIN df1 ON df_dist.a = df1.a").show
sqlContext.sql("SELECT * FROM df_dist JOIN df2 ON df_dist.a = df2.a").show

In fact, adding an unnecessary distribute by can actually harm your program! In some cases, Spark won’t be able to see that the data is already partitioned and will repartition it twice. Of course, there is a possibility that this behaviour will change in future releases.

Final Thoughts

Writing Spark applications is easy, but making them optimal can be hard. Sometimes you have to understand what is going on underneath to be able to make your Spark application as fast as it can be. I hope that this post will help you achieve that, at least when it comes to distribute by and cluster by.


US Baby Names – Data Visualization

April 22, 2016/in Big data & Spark, Seahorse /by Rafał Hryciuk

A few days ago we released Seahorse 1.1, an enhanced version of our machine learning, Big Data manipulation and data visualization product. Today, we will analyze statistical data about births in the USA over the last century and show you how easy data visualization is with the new version of Seahorse.

The input dataset consists of 5,647,426 rows representing newborn children’s names in the U.S., aggregated by name, gender, year and state. For privacy reasons, names given to fewer than 5 babies born in the same year and state were excluded from the dataset. Each row consists of 6 cells: Id (numeric), Name (string), Year (numeric), Gender (string), State (string), Count (numeric). Please note that we will be using only the demo version of Seahorse (in contrast to a full-scale Spark cluster deployment, this version works on a single machine with limited resource consumption), so the experiment execution may take more than 15 minutes.
We will create a workflow for data analysis and explore the data using Seahorse. We will show you how to create the experiment step by step; however, the final version can be downloaded here. To follow along with this post and fully understand it, download and install Seahorse 1.1 using these instructions.

Step-by-Step Workflow Creation

Reading the Data

The first thing we need to do is load the data into Seahorse. This is done by placing a Read DataFrame operation on the canvas and modifying its parameters:
SOURCE: https://s3.amazonaws.com/workflowexecutor/examples/data/StateNames.csv
Run the workflow by clicking the RUN button in the top panel. Depending on your internet connection speed, it will take up to a few minutes to download the data. When the execution completes, we can see what the data looks like. Display the DataFrame’s report by clicking on the output port of the Read DataFrame operation.

The read DataFrame report. It contains a data sample of only 20 out of the 5,647,426 rows.

By clicking on the icons in the report header, you can see more information about the whole data set (not just the data sample). Click on the Year column icon to see its values’ distribution:
As we can see in the histogram, we have data for the years 1910 to 2014. Another interesting observation is that we have a lot more data from recent years.
Now, let’s choose the Gender column in the report:
The pie chart for the Gender column illustrates that there is more data for females than males. Since this is official government data, we can assume that there are more female newborns than male, or that unusual names are more popular for males and thus are not included in the dataset.

Filtering the Data

Let’s take a closer look at the female names (an analysis of male names would be analogous; if you feel confident, you can explore male names while following this tutorial). We can use a Filter Rows operation to select only female names.
The Filter Rows operation is parameterized by an expression which will be used to select the desired rows. To select female names, we can use the expression below:
CONDITION: Gender = ‘F’

Finding the Most Popular Names

The first question we will explore is: what names were the most popular? We need to sum the Count column for every name, disregarding the year and the state in which the child was born. To achieve this, we will use a SQL Transformation operation. Let’s leave the dataframe id set to `df` and just modify the expression parameter:
EXPRESSION: select Name, sum(Count) as Count from df group by Name order by Count desc

After running the operation we can explore its report and see that the three most popular names are Mary, Patricia and Elizabeth.
We have found the most popular names in the U.S., but we can also narrow our results down to find the most popular names in particular states.

Most Popular Names in States using Custom Transformer

To find the most popular names in a certain state we need to execute a sequence of operations. It is a good practice to enclose such a sequence within a “procedure” which can be reused later.
To do that we need to use a Create Custom Transformer operation. Please drag and drop Create Custom Transformer to the canvas and click the Edit workflow button on the operation parameters panel. This takes us to a new canvas where we can create a custom Transformer. A Transformer is a function which takes a DataFrame as an input and returns another DataFrame as the output. We want to create a Transformer that calculates the most popular names in a given state.
To do that, we need to:
1. Calculate how many times each name was given in every state.
We can achieve this by executing a SQL Transformation operation:
EXPRESSION: select df.Name, df.State, sum(df.Count) as AllTimeCount from df group by df.State, df.Name
2. For every state, calculate the number of occurrences of the most popular name in that state.
In order to do so we need to execute a SQL Transformation operation:
EXPRESSION: select State, max(AllTimeCount) as MaxCount from df group by State
3. Now, we will use a Join operation to filter out only the most popular names for each state. We will join name occurrences (from step 1) with counts of the most popular names in every state (step 2).
We need to add a Join operation to the canvas and set two equality constraints as the joining conditions: the left DataFrame’s column MaxCount should be equal to the right DataFrame’s column AllTimeCount, and the left DataFrame’s column State should be equal to the right DataFrame’s column State. This can be done by setting two appropriate groups in the join columns parameter.
Close the inner workflow by clicking the CLOSE INNER WORKFLOW button at top of the screen.
Add a Transform operation to the canvas and connect it with the other nodes:
After we execute the current workflow and open the report for the Transform operation, we can see which names were the most popular in which state. By clicking on the Name column, we can see its distribution:
It turns out that Mary was the most popular name in Washington D.C. and 48 states, with Jennifer topping the list in just two states.

Traditional vs Modern Names

Some parents like traditional names while others prefer modern ones. We will now explore which names were popular in the past, and which names are popular now. To do that we split our data into two DataFrames:

  1. Data from before January 1, 1980
  2. Data since January 1, 1980

We can do such splits by executing two SQL Transformations.
SQL Transformation (select traditional names):
EXPRESSION: select * from df where Year < 1980
SQL Transformation (select modern names):
EXPRESSION: select * from df where Year >= 1980
Now let’s find out what were the most popular names before and after 1980. We do not need to write another SQL Expression, as we already created a Transformer that does it for us. All we need to do is apply the Transformer to different DataFrames. The SQL Transformation operation, besides returning a DataFrame, also returns a SqlTransformer which can be executed using a Transform operation.
We can also use the previously defined Create Custom Transformer to calculate popular names in each state and just apply it to another DataFrame.
Seahorse workflow of U.S. Baby names - Data Visualization
The execution of the current workflow might take a few minutes. By viewing reports from the lowest operations, we can see that the most popular names before 1980 were Mary, Patricia and Linda. Mary was the most popular name in every state and in Washington D.C. The most popular names since 1980 are Jessica, Ashley and Jennifer.
Now, let’s take a closer look at the report for the most popular modern names:
When we click on the Name column in the report for the most popular modern names in states we see:
Pie chart of number of states in which the name was the most popular
We can see that Jessica was the most popular name in 31 states, Ashley in Washington D.C. and 16 states, with Emily leading the way in two states and Sarah in one (New Hampshire).

Name Popularity Plot – Data Visualization

As we discovered, the most popular names were:

  • all time: Mary, Patricia, Elizabeth;
  • before 1980: Mary, Patricia, Linda;
  • after 1980: Jessica, Ashley, Jennifer.

Let’s filter only the data regarding these names and then take a closer look.
As a first step, we have to filter the data using a SQL Transformation operation with an appropriate expression:
EXPRESSION:

select Name, Year, sum(Count) as Count
   from df
   where
       Name = 'Mary'
       or Name = 'Patricia'
       or Name = 'Elizabeth'
       or Name = 'Linda'
       or Name = 'Jessica'
       or Name = 'Ashley'
       or Name = 'Jennifer'
   group by Name, Year
   order by Year

Now, let’s connect a Notebook operation to the newly created and filtered DataFrame:
Notebook gives us the ability to interactively explore the data. We will use it to draw a plot of the names’ popularity. To do so, we need to execute the following Python code in the Notebook:

import matplotlib
%matplotlib inline
df = dataframe().toPandas()
# Reshape data (produce a “pivot” table) based on column values
names_over_year = df.pivot("Year", "Name", "Count")
names_over_year.plot()          # Make plot of DataFrame using matplotlib
fig = matplotlib.pyplot.gcf()   # Get a reference to the current figure
fig.set_size_inches(18.5, 10.5) # Set the figure size

The code above prepares the data and draws a plot of the names’ popularity. After executing it, a plot will be displayed:
Data visualization of the most popular female names over the last century in the U.S.
The graph shows that Mary was a very popular name until around 1960, when its popularity dropped. Meanwhile the popularity of Elizabeth has been the most stable throughout the time-frame.
The popularity of the name Linda is an interesting example as it became hugely popular during the 1950s and then rapidly dropped back to its previous low levels. Linda may owe its brief popularity to Buddy Clark’s 1946 hit song “Linda”.

Conclusion

We have briefly explored the U.S. government’s statistical dataset. It turns out that a name’s popularity can be influenced by many factors. For example, the most popular name of the last century was Mary, which may owe its popularity to Christian culture, yet it has been in decline since the 1960s, when the counterculture revolution took place. Another interesting observation is the spikes in popularity driven by popular culture (e.g. Linda). It is also worth noting that the faster a name becomes popular, the faster its popularity drops after reaching the peak (both slopes are similar). On the other hand, names with stable popularity, such as Elizabeth, appear in every generation.


Improve Apache Spark aggregate performance with batching

March 21, 2016/in Big data & Spark, Seahorse /by Adam Jakubowski

Seahorse provides users with reports on their data at every step of the workflow, so they can review intermediate results after each operation. Reports include column distributions: a histogram for continuous data and a pie chart for categorical data. Calculating these reports requires Seahorse to execute many independent aggregate operations, one per column. We significantly improved aggregate performance by batching those operations together so that they are executed in a single data pass.

Categorical distribution

 


DataFrame report. Users are able to view distributions of their data after clicking on the chart icon.

 

Data Reports in Seahorse

Reports are generated for each DataFrame after every Operation. This means that performance is crucial here.
In our first approach, we used Apache Spark’s built-in `histogram` method on RDD[Double]. We mapped our RDD[Row] to an RDD[Double] for each Double column and called `histogram` on each one.
Unfortunately, this approach is very time-intensive: calculating histograms for each column independently meant performing an expensive, full-data pass per column.
An alternative would be to introduce a special multi-histogram operation that works on multiple columns at once. Apache Spark already does something similar for column statistics: `Statistics.colStats` computes statistics for every column in a single data pass and returns a `MultivariateStatisticalSummary`.
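As a point of reference, here is how that single-pass column-statistics API is used in MLlib (a minimal, self-contained sketch with toy data):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

object ColStatsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("colstats-demo"))

    // Each row is a Vector; colStats aggregates statistics for all columns in one data pass
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    val summary: MultivariateStatisticalSummary = Statistics.colStats(rows)
    println(summary.mean)     // per-column means
    println(summary.max)      // per-column maxima
    println(summary.variance) // per-column variances

    sc.stop()
  }
}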
This approach was not generic enough for us: we wanted a way to combine arbitrary aggregations together instead of writing custom multi-operations.

Abstracting over aggregator

An operation producing one value out of a whole dataset is called aggregation. In order to perform an aggregation, the user must provide three components:

  • `mergeValue` function – aggregates results from a single partition
  • `mergeCombiners` function – merges aggregated results from the partitions
  • `initialElement` – the initial aggregator value.

With these three pieces of data, Apache Spark is able to aggregate data within each partition and then combine the partial results to produce the final value.
We introduced an abstract `Aggregator` encapsulating the `mergeValue`, `mergeCombiners` and `initialElement` properties. Thanks to this abstraction, we can create wrappers that add extra behavior to our aggregators.
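The `Aggregator` abstraction itself is not listed in this post. Based on how it is used below, a minimal version could look like the following sketch (the real Seahorse trait also exposes helpers such as `mapInput`, omitted here; `execute` simply delegates to `RDD.aggregate`):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// U - type of the (partial and final) aggregation result, T - type of the aggregated elements
trait Aggregator[U, T] extends Serializable {
  def initialElement: U
  def mergeValue(acc: U, elem: T): U
  def mergeCombiners(left: U, right: U): U

  // Executing a single aggregator maps directly onto RDD.aggregate
  def execute(rdd: RDD[T])(implicit ct: ClassTag[U]): U =
    rdd.aggregate(initialElement)(mergeValue, mergeCombiners)
}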
MultiAggregator is one such wrapper. It wraps a sequence of aggregators and executes them in a batch – in a single data-pass. Its implementation involves:

  • `mergeValue` – aggregates results from a single partition for each aggregator
  • `mergeCombiners` – merges aggregated results from the partitions for each aggregator
  • `initialElement` – initial aggregator values for each aggregator
case class MultiAggregator[U, T](aggregators: Seq[Aggregator[U, T]])
  extends Aggregator[Seq[U], T] {

  override def initialElement: Seq[U] = aggregators.map(_.initialElement)

  override def mergeValue(accSeq: Seq[U], elem: T): Seq[U] = {
    (accSeq, aggregators).zipped.map { (acc, aggregator) =>
      aggregator.mergeValue(acc, elem)
    }
  }

  override def mergeCombiners(leftSeq: Seq[U], rightSeq: Seq[U]): Seq[U] = {
    (leftSeq, rightSeq, aggregators).zipped.map { (left, right, aggregator) =>
      aggregator.mergeCombiners(left, right)
    }
  }
}

This special aggregator wraps a sequence of aggregators and then uses their `mergeValue`, `mergeCombiners` and `initialElement` functions to perform aggregation in one batch.

Reading Results in a Type-safe Manner

Each aggregator can return an arbitrary value type. To maintain type safety and genericity, we added a special class encapsulating the results of all input aggregators. To read a specific aggregator’s result, we pass the original aggregator object to the batched result; that way the result type is correctly inferred from the aggregator’s result type.

// Usage
// Aggregators to execute in a batch
val aggregators: Seq[Aggregator[_,_]] = ???
// Batched result executed in one spark call
val batchedResult: BatchedResult = AggregatorBatch.executeInBatch(rows, aggregators)
// Accessing result is done by passing origin aggregator to the result object
val firstAggregator = aggregators(0)
// Type of `firstResult` is derived from output type of `firstAggregator`
val firstResult = batchedResult.forAggregator(firstAggregator)

As we can see, each value type is derived from the aggregator’s return type.

Measuring and comparing aggregate performance

Let’s measure the time needed to calculate histograms for data rows made up of 5 numeric columns.
In this experiment, we first execute an empty foreach statement to force Spark to cache the data in the cluster (job 0). Otherwise, the first job would have the additional overhead of reading the data.

// job 0 - forces Spark to cache the data in cluster. It will take away data-loading overhead
// from future jobs and make future measurements reliable.
rows.foreach((_) => ())

Then we run Spark’s `histogram` operation for each column (jobs 1-5).

private def sequentialProcessing(rows: RDD[Row], buckets: Array[Double]): Seq[Array[Long]] =
 for (i <- 0 until 5) yield {
   val column = rows.map(extractDoubleColumn(i))
   column.histogram(buckets, true)
 }

And use our batched aggregator (job 6).

private def batchedProcessing(rows: RDD[Row], buckets: Array[Double]): Seq[Array[Long]] = {
 val aggregators = for (i <- 0 until 5) yield
   HistogramAggregator(buckets, true).mapInput(extractDoubleColumn(i))
 val batchedResult = AggregatorBatch.executeInBatch(rows, aggregators)
 for (aggregator <- aggregators) yield batchedResult.forAggregator(aggregator)
}

The results: we went from a number of `histogram` calls that grows linearly with the number of columns to a single batched aggregate call. For this specific experiment, the batched method is about 5 times faster.

Summary

Using only Spark’s built-in `histogram` method, we would be stuck calling it once per numeric column. With our approach we achieve the same results in a single call, which yields a large aggregate performance boost – roughly N times faster, where N is the number of columns.
We are also able to batch other arbitrary aggregators along with it. In our case they would be:

  • Calculating distributions of categorical columns
  • Counting missing values
  • Calculating min, max and mean values for numerical columns

In the future we could add more operations as long as they comply with the aggregator interface.
Here is our aggregator batching code along with the above experiment: GitHub repository


Should I eat this mushroom?

February 29, 2016/in Big data & Spark, Seahorse /by Grzegorz Chilkiewicz

A few days ago we released Seahorse 1.0, a visual platform for machine learning and Big Data manipulation, available to everyone for free! Today, we will show you how to use Seahorse to solve a simple classification problem.

We will try to distinguish edible mushrooms from poisonous ones based on their appearance. Mushroom picking requires a lot of experience and knowledge of mushroom species, and misclassifying a mushroom (mistaking a poisonous one for an edible one) might result in very serious health problems. We will show that Seahorse can aid mushroom pickers in this responsible classification task.
We will use a publicly available mushroom dataset[1], in which specimen descriptions are classified into two classes: edible and poisonous. Using this dataset, we will create and verify a prediction model for classifying gilled mushrooms in the Agaricus and Lepiota families (the most common representative is Agaricus bisporus, also known as the table mushroom).
The dataset consists of 8124 instances (4208 edible and 3916 poisonous). Each instance is described by 22 features (phenotypic trait codes, e.g. cap color, odor, veil type, habitat) and 1 label column (edible or poisonous).
We will create a Seahorse workflow: a graphical machine learning experiment representation.

Workflow Overview


Mushroom classification – Seahorse workflow

This is a complete experiment that generates a mushroom classification model. It can be downloaded from here, but we will also show you how to create the experiment step by step. To follow along with this article and fully understand it, you should download and install Seahorse 1.0 using these instructions.

Step-by-Step Workflow Creation

Reading the Data

The data is provided in the form of a 23-column, comma-separated CSV-like file with column names in the first line. To work with the dataset, it has to be loaded into Seahorse, which can be done with a Read DataFrame operation. Let’s drag it from the operations palette and drop it onto the canvas. To load the data, we need to provide the correct path to the file.


Read DataFrame operation parameters panel

Just click on the Read DataFrame operation on the canvas. In the panel on the right you will see its parameters. The Read DataFrame operation needs to have its parameters modified:
SOURCE: https://s3.amazonaws.com/workflowexecutor/examples/data/mushrooms.csv
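For reference, loading the same file directly in Spark (outside Seahorse) would look roughly like this; the local path is hypothetical and assumes the CSV has been downloaded first:

import org.apache.spark.sql.SparkSession

object ReadMushrooms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("read-mushrooms").getOrCreate()

    // The header line holds the 23 column names; every column is read as a string
    val mushrooms = spark.read
      .option("header", "true")
      .csv("data/mushrooms.csv") // hypothetical local copy of the file from the SOURCE URL above

    mushrooms.printSchema()
    spark.stop()
  }
}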
After setting the Read DataFrame’s parameters to the correct values, the operation is ready to be executed – simply click the RUN button in the top Seahorse toolbar. If you have many operations on the canvas and are only interested in the results of one of them, you can use partial execution of the workflow: simply select that operation before clicking RUN.


RUN button

When the execution ends, a report of the operation will be available. Let’s click on the operation output port to see its result.


Operation output port

At the bottom of the screen you will see a simple DataFrame report. It contains information about data loaded by the Read DataFrame operation.

The DataFrame’s report

The DataFrame is too wide (more than 20 columns) to allow viewing the data sample in the report, so we can only explore the column types and names here. If you want to explore the data a bit more, you can use a Notebook operation, which allows interactive data exploration. Just place it on the canvas (using drag-and-drop) and connect its input port to the Read DataFrame output port (click on the output port and drag it to an input port of the other operation). Now you can open the Notebook by selecting it on the canvas and clicking the “Open notebook” button in the right panel.


”Open notebook” button

In the newly opened window, enter: dataframe().take(10) and click on the “run cell” button (you can use a shortcut for it: Ctrl+Enter). Now you can closely investigate the data sample of the first 10 rows returned by the Read DataFrame operation.

Data Transformation

In the first column we have labels stating to which class a specific sample belongs (possible values: edible or poisonous). We have discovered that all columns contain string values. To perform classification, Seahorse needs a numeric label column and a vector of numerics as the features column, so we need to map string values to numbers and assemble the features into a single vector column. This can be done by combining multiple operations – String Indexer with One Hot Encoder and Assemble Vector – as shown in the workflow overview image (an equivalent Spark ML sketch follows the parameter listings below).
The String Indexer operation translates string values to numeric ordinal values. It needs to have its parameters modified, as follows:


String Indexer operation parameters panel

OPERATE ON: multiple columns
INPUT COLUMNS: Including index range 0-22 (all columns)
OUTPUT: append new columns
COLUMN NAME PREFIX: indexed

One Hot Encoder operation parameters panel

The One Hot Encoder operation translates ordinal values into a vector that has a “1” only at the position given by the input numeric value. It needs to have its parameters modified, as follows:

OPERATE ON: multiple columns
INPUT COLUMNS: Including index range 24-38 and 40-45

We have to exclude the column at index 39 (indexed_veil-type) because all mushroom specimens had a partial veil. One Hot Encoder does not allow operating on columns with only one value (unless the user wants to drop the last category using the DROP LAST parameter).

Assemble Vector operation parameters panel

Assemble Vector merges columns with numerics and vectors of numerics into a single vector of numerics. It needs to have its parameters modified, as follows:
INPUT COLUMNS: Including index range 24-45 (columns generated by the String Indexer, excluding generated column containing edibility label)
OUTPUT COLUMN: features
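Outside of Seahorse, the three operations above map onto standard Spark ML stages. The sketch below uses only two example feature columns and the Spark 3.x multi-column API; the column names and stage wiring are illustrative, not the workflow’s exact configuration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Index the label and two example feature columns, one-hot encode the features,
// then assemble them into a single "features" vector
val indexer = new StringIndexer()
  .setInputCols(Array("edible", "odor", "cap-color"))
  .setOutputCols(Array("indexed_edible", "indexed_odor", "indexed_cap-color"))

val encoder = new OneHotEncoder()
  .setInputCols(Array("indexed_odor", "indexed_cap-color"))
  .setOutputCols(Array("vec_odor", "vec_cap-color"))

val assembler = new VectorAssembler()
  .setInputCols(Array("vec_odor", "vec_cap-color"))
  .setOutputCol("features")

val preprocessing = new Pipeline().setStages(Array(indexer, encoder, assembler))
// val prepared = preprocessing.fit(mushrooms).transform(mushrooms)  // mushrooms: the loaded DataFrame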

Remove Unnecessary Columns

We will use the Execute SQL Expression operation to remove unnecessary columns from the dataset and give more meaningful names to the columns that are essential for our experiment. This will make the dataset reports smaller and the data easier to explore.
Execute SQL Expression needs to have its parameters modified, as follows:

DATAFRAME ID: df
EXPRESSION:
SELECT
    edible AS edibility_string,
    indexed_edible AS edibility_label,
    features
FROM df

Splitting Into Training and Test Set

Split operation parameters panel

To perform a fair evaluation of our model, we need to split our data into two parts: a test dataset and a training dataset. That task can be accomplished using a Split operation. To divide the dataset in a 1:3 ratio, we need to modify its default parameters:
SPLIT RATIO: 0.25 (percentage of rows that should end up in the first output DataFrame – the test set)
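The same split expressed directly on a Spark DataFrame could look like this (a sketch continuing the preprocessing code above; `prepared` stands for the preprocessed DataFrame):

// 25% test / 75% training, matching the SPLIT RATIO above; the seed is fixed for reproducibility
val Array(testSet, trainingSet) = prepared.randomSplit(Array(0.25, 0.75), seed = 42L)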

Model Training

Logistic Regression operation parameters panel

Logistic Regression operation LABEL COLUMN parameter

To train a model, we use the Fit operation, which fits an Estimator to a DataFrame. We want to use logistic regression classification, so we put the Logistic Regression operation on the canvas and connect it to the Fit operation.
We will leave almost all of the Logistic Regression parameters at their default values; we only need to set the label column:

LABEL COLUMN: edibility_label
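In plain Spark ML, this step is an Estimator whose fit method returns a trained model (a sketch continuing the earlier code, with trainingSet coming from the split above):

import org.apache.spark.ml.classification.LogisticRegression

// Defaults kept; only the label and features columns are set, mirroring the Seahorse parameters.
// "edibility_label" is the renamed indexed_edible column produced by the SQL step above.
val lr = new LogisticRegression()
  .setLabelCol("edibility_label")
  .setFeaturesCol("features")

val model = lr.fit(trainingSet)  // the Fit operation: Estimator + training DataFrame -> Model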

Verifying Model Effectiveness On Test Data

Using the Split operation, we generated a training dataset and a test dataset from the input data, and we trained our model on the training dataset. Now it is time to use the test dataset to verify the effectiveness of our classification model. To generate predictions with the trained model, we use a Transform operation. To assess the effectiveness of the model, we will count the “Falsely Poisonous” (a waste of edible mushrooms) and “Falsely Edible” (very dangerous!) entries in the test dataset. To perform these calculations, we will use the Execute SQL Expression operation.
After executing the Transform operation, we can investigate the resulting report:

Transform operation report
Thanks to projecting only the necessary columns, the resulting dataset fits within the column-count limit and we are able to view the data sample. We can see that the test dataset has 2043 entries, and that the String Indexer operation assigned 0 to the “e” label (edible class) and 1 to the “p” label (poisonous class). We can also notice that the prediction column has almost the same values as the edibility column, so we can suspect that our model performs well. Let’s measure its performance:
The Execute SQL Expression (Falsely Edible) needs to have its parameters modified, as follows:

DATAFRAME ID: df
EXPRESSION: SELECT * FROM df WHERE prediction=0 AND edibility_label=1
The Execute SQL Expression (Falsely Poisonous) needs to have its parameters modified, as follows:
DATAFRAME ID: df
EXPRESSION: SELECT * FROM df WHERE prediction=1 AND edibility_label=0
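The same two checks expressed directly on the prediction DataFrame would be simple filtered counts (a sketch continuing the earlier Spark ML code):

// Generate predictions on the held-out test set and count both error types
val predictions = model.transform(testSet)

val falselyEdible    = predictions.filter("prediction = 0 AND edibility_label = 1").count()
val falselyPoisonous = predictions.filter("prediction = 1 AND edibility_label = 0").count()

println(s"Falsely edible: $falselyEdible, falsely poisonous: $falselyPoisonous")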
Now we can explore reports in these two operations and compare the number of rows in each of them:
Falsely Poisonous: 0
Falsely Edible: 0

Conclusion

This means that we have trained a surprisingly accurate prediction model for classifying mushrooms. However, we have to remember that some dangers remain, such as:

  1. The mushroom picker examining a specimen can make a mistake when assessing its traits or entering the data into the computer.
  2. The data sample might be too small to create a comprehensive model for predicting the edibility of all mushrooms that users will want to classify.
  3. Two different mushroom species can have identical traits while one of them is edible and the other is poisonous.

Due to these problems, this model should only be used to support expert judgment on mushroom edibility. Experts should always have the last word on issues that are potentially dangerous to other people, such as classifying poisonous mushrooms – not only because of limited confidence in the newly created prediction model (experts may make mistakes even more often than the model does), but above all because of the legal responsibility for those decisions.

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


Fast and accurate categorical distribution without reshuffling in Apache Spark

February 10, 2016/in Big data & Spark, Seahorse /by Adam Jakubowski

In Seahorse we want to provide our users with accurate distributions of their categorical data. Categorical data can be thought of as the results of an observation that can take one of K possible outcomes. Some examples: nationality, marital status, gender, type of education.

In the described scenario we have no prior knowledge of whether a given feature is categorical or not. We would like to treat features as categorical as long as their value set is small enough; if there are too many distinct values, we are no longer interested in the discrete distribution.
Seahorse is built on top of Apache Spark. A naive and easy approach to this problem would be to simply use Spark’s reduceByKey or groupByKey methods. The reduceByKey method operates on key-value pairs and accepts a function that reduces two values into one; if there are many values for one key, they are reduced to a single value with the provided function.

// limit - max. number of distinct categories to treat a feature as categorical
if(rdd.distinct().count() < limit) {
  // causes partition reshuffling
  val discreteDistribution = rdd.map(v => v -> 1).reduceByKey(_ + _).collect()
  Some(discreteDistribution)
} else {
  None
}

Unfortunately, using these methods results in poor performance due to Spark’s partition reshuffling.
Counting occurrences of specific feature values can easily be done without reshuffling. We can think of calculating a category distribution as counting the occurrences of each category, which can be done with the RDD’s aggregate function.
The aggregate function reduces a dataset to one value using a seq function, a comb function and an initial aggregator value. The seq function adds one value to an aggregator; it is used to combine all values from a partition into one aggregated value. The comb function adds two aggregators together; it is used to combine results from different partitions.
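As a toy illustration of the API (not the Seahorse code; it assumes an existing SparkContext `sc`), here is aggregate computing a sum and a count in a single pass:

// Toy use of RDD.aggregate: compute (sum, count) of an RDD[Int] in one pass
val numbers = sc.parallelize(1 to 100)
val (sum, count) = numbers.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),    // seq: fold one element into a partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))    // comb: merge accumulators from two partitions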
In our case, the aggregator value will be of type Option[Map[T, Long]], representing occurrence counts for values of type T, or None if there are too many distinct categorical values.
The comb function (mergeCombiners in the code) adds two aggregator maps together, merging their results. If the resulting map has more distinct categorical values than the specified limit, no distribution is returned (a None).
The seq function (mergeValue in the code) increments the occurrence count of a value in the aggregator map.
Let the code speak for itself:

import scalaz._
import scalaz.Scalaz._
(...) // other imports
object FunctionalDemo {
  // limit - represents the maximum number of categories.
  // if there are more distinct values than `limit`, no categorical distribution will be returned
  case class CategoryDistributionCalculator[T](limit: Int) {
    def calculate(rdd: RDD[T]): Option[Map[T, Long]] =
        rdd.aggregate(zero)(mergeValue _, mergeCombiners _)
    // initial aggregator value
    private val zero = Option(Map.empty[T, Long])
   def mergeCombiners[T](
      leftOpt: Option[Map[T, Long]],
      rightOpt: Option[Map[T, Long]]): Option[Map[T, Long]] = {
      for (left <- leftOpt; right <- rightOpt) yield {
        // Adds maps in such a way that numbers from same keys are summed.
        val sumMap = left |+| right
        // yields Some when boolean condition is met. None otherwise
        (sumMap.keySet.size <= limit).option(sumMap)
      }
    }.flatten
    def mergeValue[T](acc: Option[Map[T, Long]], next: T): Option[Map[T, Long]] =
      mergeCombiners(acc, Some(Map(next -> 1)))
  }
  def main(args: Array[String]): Unit = {
    (...)
    val valueSlow = rdd.map(v => v -> 1).reduceByKey(_ + _).collect()
    val valueFast = CategoryDistributionCalculator(30).calculate(rdd).get
    (...)
  }
}

The code above is clean, functional and takes advantage of Scalaz. Unfortunately, it allocates a new object for each record in the dataset. With Apache Spark, it is sometimes better to favor mutable objects over immutable ones for performance’s sake. According to the documentation, both the seq and comb functions are allowed to modify and return their first argument instead of creating a new object, in order to avoid memory allocation. In the next approach we take advantage of that. The code below is not as pretty, but it performs much better.

(...) // imports
object MutableDemo {
  // limit - represents maximum size of categorical type
  // if there are more distinct values than `limit`, no categorical distribution will be returned
  case class CategoryDistributionCalculator[T](limit: Int) {
    // None is returned if there are too many distinct values (>limit) to calculate categorical distribution
    def calculate(rdd: RDD[T]): Option[Map[T, Long]] = {
      val mutableMapOpt = rdd.aggregate(zero)(mergeValue _, mergeCombiners _)
      val immutableMapOpt = mutableMapOpt.map(_.toMap)
      immutableMapOpt
    }
    private val zero = Option(mutable.Map.empty[T, Long])
    private def mergeValue(accOpt: Option[mutable.Map[T, Long]],
      next: T): Option[mutable.Map[T, Long]] = {
      accOpt.foreach { acc =>
        addOccurrencesToMap(acc, next, 1)
      }
      replacedWithNoneIfLimitExceeded(accOpt)
    }
    private def mergeCombiners(leftOpt: Option[mutable.Map[T, Long]],
      rightOpt: Option[mutable.Map[T, Long]]): Option[mutable.Map[T, Long]] = {
      for (left <- leftOpt; rightMap <- rightOpt) {
        rightMap.foreach { case (element, count) =>
          addOccurrencesToMap(left, element, count)
        }
      }
      replacedWithNoneIfLimitExceeded(leftOpt)
    }
    private def addOccurrencesToMap(
      occurrences: mutable.Map[T, Long],
      element: T,
      count: Long): Unit = {
      occurrences(element) = occurrences.getOrElse(element, 0L) + count
    }
    private def replacedWithNoneIfLimitExceeded(
      mapOpt: Option[mutable.Map[T, Long]]): Option[mutable.Map[T, Long]] = {
      mapOpt.flatMap { map =>
        if (map.size <= limit) mapOpt else None
      }
    }
  }
  def main(args: Array[String]): Unit = {
    (...)
    val valueSlow = rdd.map(v => v -> 1).reduceByKey(_ + _).collect()
    val valueFast = CategoryDistributionCalculator(30).calculate(rdd).get
    (...)
  }
}

 

Measurements

Our hypothesis was that the last approach should be at least two times faster: the `group by`-style operation works in two stages with partition reshuffling between them, while our solution runs in a single stage.
A simple experiment with Apache Spark seems to confirm that. For the experiment we used:

  • a local Apache Spark cluster,
  • a dataset containing 10,000,000 rows,
  • with 33 labels in the category column (uniformly distributed).

In the experiment we first execute an empty foreach statement to force Spark to cache the data in the cluster (job 0); otherwise the first job would carry the additional overhead of reading the data.
Jobs 1 and 2 are parts of the first, naive solution: job 1 counts the distinct categories and job 2 calculates the distribution.
Jobs 3 and 4 represent the two optimised solutions: job 3 is the functional one, and job 4 is the one that takes advantage of mutable objects.

// job 0 - forces Spark to cache data in cluster. It will take away data-loading overhead
// from future jobs and make future measurements reliable.
rdd.foreach((_) => ())
// job 1 - checks if there are too many distinct values to form category
if(rdd.distinct().count() < limit) {
 // job 2 - calculates categorical distribution
 val discreteDistribution = rdd.map(v => v -> 1).reduceByKey(_ + _).collect()
 Some(discreteDistribution)
} else {
 None
}
// job 3 - functional approach
FunctionalDemo.CategoryDistributionCalculator(limit).calculate(rdd)
// job 4 - optimised mutable approach
MutableDemo.CategoryDistributionCalculator(limit).calculate(rdd)

Spark jobs
Job 1 (distinct count) and job 2 (reduceByKey) have 400 tasks and use memory to perform shuffling.
Jobs 3 and 4 have only 200 tasks, run in one stage and don’t use additional memory for shuffling. They also check the category count dynamically.
An interesting observation is that the functional job 3 is slower than the reduceByKey job even though it has only 200 tasks – the memory allocation overhead is that severe!

Summary

Categorical features are very common in data science. With the proposed approach, we are able to calculate accurate categorical distributions quickly and without prior knowledge of whether a feature is categorical or not. It is guaranteed to execute in only one Apache Spark job and without any partition reshuffling.
