Five hottest big data trends 2018 for the techies

Currently, the world is producing 16.3 zettabytes of data a year. According to IDC, by 2025 that amount will rise tenfold, to 163 zettabytes a year. But how big, exactly, is a zetta?
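For a rough sense of that scale, here is a back-of-the-envelope sketch. It assumes the decimal (SI) definition of a zettabyte, 10^21 bytes, and uses a hypothetical 1 TB drive as the unit of comparison; the function name is illustrative only.

```python
from fractions import Fraction

# Assumption: SI (decimal) units, so 1 ZB = 10**21 bytes and 1 TB = 10**12 bytes.
ZETTABYTE = 10**21
TERABYTE = 10**12

def drives_needed(zettabytes: str, drive_bytes: int = TERABYTE) -> int:
    """Return how many drives of `drive_bytes` are needed to hold the volume.

    The volume is passed as a string (e.g. "16.3") so it can be converted
    exactly, without floating-point rounding.
    """
    total_bytes = Fraction(zettabytes) * ZETTABYTE
    # Ceiling division: round up to a whole number of drives.
    return -(-total_bytes // drive_bytes)

if __name__ == "__main__":
    print(drives_needed("16.3"))  # yearly volume quoted above
    print(drives_needed("163"))   # IDC's 2025 projection quoted above
```

Under these assumptions, 16.3 zettabytes corresponds to 16.3 billion one-terabyte drives.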

Five trends for business to surf the big data wave

According to IDC analysis, 16.3 zettabytes of data were created worldwide in 2017. That volume shows why keeping an eye on big data business trends is so crucial.

GeoJson Operations in Apache Spark with Seahorse SDK

A few days ago we released Seahorse 1.4, an enhanced version of our machine learning, Big Data manipulation and data visualization product. This release also comes with an SDK – a Scala toolkit for creating new custom operations to be used in Seahorse. As a showcase, we will create a custom Geospatial operation with GeoJson […]

Scheduling Spark jobs in Seahorse

In the latest Seahorse release we introduced the scheduling of Spark jobs. We will show you how to use it to regularly collect data and send reports generated from that data via email. Use case: Let’s say that we have a local weather station and the data from this station is uploaded automatically to Google […]

Optimize Spark with DISTRIBUTE BY & CLUSTER BY

Distribute by and cluster by clauses are really cool features in SparkSQL. Unfortunately, this subject remains relatively unknown to most users – this post aims to change that.
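To make the two clauses concrete: in Spark SQL, DISTRIBUTE BY repartitions rows so that rows with the same key land in the same partition, and CLUSTER BY does the same and additionally sorts each partition by that key. The sketch below is a plain-Python simulation of those semantics, not Spark code. The function names, the row layout, and the use of Python's built-in hash are assumptions made for illustration; Spark uses its own hash function and partition count.

```python
from collections import defaultdict

def distribute_by(rows, key, num_partitions):
    """Simulate DISTRIBUTE BY: hash each row's key to choose a partition.

    All rows sharing a key value end up in the same partition; no ordering
    inside a partition is implied.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return [partitions[i] for i in range(num_partitions)]

def cluster_by(rows, key, num_partitions):
    """Simulate CLUSTER BY: DISTRIBUTE BY, then sort each partition by the key."""
    return [
        sorted(partition, key=lambda row: row[key])
        for partition in distribute_by(rows, key, num_partitions)
    ]

if __name__ == "__main__":
    # Hypothetical rows: 20 events spread over 5 user ids.
    events = [{"user_id": i % 5, "value": i} for i in range(20)]
    for index, partition in enumerate(cluster_by(events, "user_id", 3)):
        print(index, [row["user_id"] for row in partition])
```

The design point the simulation illustrates is that, once rows are co-located by key, later per-key work on the same key can run inside each partition without moving data again.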

US Baby Names – Data Visualization

A few days ago we released Seahorse 1.1, an enhanced version of our machine learning, Big Data manipulation and visualization product. Today, we will show you how the new version of Seahorse can be used for data mining and data visualization.

Improve Apache Spark aggregate performance with batching

Seahorse provides users with reports on their data at every step in the workflow. A user can view reports after each operation to review the intermediate results. In our reports we provide users with distributions for columns in the form of a histogram for continuous data, and a pie chart for categorical data.

Should I eat this mushroom?

A few days ago we released Seahorse 1.0, a visual platform for machine learning and Big Data manipulation available for all, for free! Today, we show you how to use Seahorse to solve a simple classification problem.

Fast and accurate categorical distribution without reshuffling in Apache Spark

In Seahorse we want to provide our users with accurate distributions for their categorical data. Categorical data can be thought of as possible results of an observation that can take one of K possible outcomes. Some examples: Nationality, Marital Status, Gender, Type of Education.
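As a minimal sketch of what a categorical distribution is, the snippet below counts each category inside every partition separately and then merges the small per-partition summaries. This is an illustration only: it is plain Python, not Spark, the function name and sample data are hypothetical, and it is not necessarily the method the post describes.

```python
from collections import Counter

def categorical_distribution(partitions):
    """Return the relative frequency of each category across all partitions.

    Each partition is counted locally first; only the compact Counter
    summaries are merged, never the raw rows.
    """
    total = Counter()
    for partition in partitions:
        total.update(Counter(partition))
    n = sum(total.values())
    if n == 0:
        return {}
    return {category: count / n for category, count in total.items()}

if __name__ == "__main__":
    # Hypothetical "Marital Status" column split across two partitions.
    data = [["single", "married"], ["married", "married"]]
    print(categorical_distribution(data))
```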

Cooperative data exploration

Living in a world of big data comes with a certain challenge: how to extract value from the ever-growing flow of information that comes our way. There are a lot of great tools that can help us, but they all require a lot of resources. So how do we ease the CPU and RAM burden? One way is to share the data we are working on, and the results of our computations, with others.