Optimizing Computational Resources for Machine Learning and Data Science Projects: A Practical Approach
Every computation requires computing resources. Sure, sometimes a regular calculator, a piece of paper, and a pencil are sufficient. However, in machine learning, powerful computing resources are necessary:
- The model needs to be fed with a massive amount of data.
- Appropriate calculations must be performed for each data point to process it into a pattern.
- Some parameters must be adjusted to teach the model the correct mappings, necessitating further recalculations and computational resources.
Your teammates also need to train models. Ultimately, the amount of computational resources is always insufficient. Nevertheless, there are ways to
- Reduce this deficit.
- Increase the utilization of available resources.
- Gain more freedom in developing your research or building a startup.
This article may provide useful insights if you have encountered a similar situation.
Challenges of Computational Resource Allocation
At deepsense.ai, we specialize in addressing machine learning and data science challenges with custom solutions tailored to our clients’ specific needs. With clients spanning various industries, we encounter diverse problems that demand adaptability and versatility. We use our in-house computational resources to develop and test these solutions. While cloud computing is popular, it is not always practical due to cost, limited availability of on-demand GPUs, or data confidentiality concerns.
When working on client projects, we often encounter a situation where multiple Machine Learning Engineers or Data Scientists need to share limited computational resources for their training sessions. Allocating these resources efficiently is challenging, especially during the solution design stage when we need to experiment with various parameters and models. Establishing a fixed schedule for resource usage is not feasible as we can’t always estimate the time required for computations.
Existing Solutions and Their Implications
It quickly became apparent that this issue is not unique to us; others have encountered and addressed it before. Existing solutions may not eliminate the problem entirely, but they can mitigate it significantly. The same challenge drove the development of supercomputers and computing clusters, and we can draw on the experience gathered there. Furthermore, we can apply this knowledge to cloud resources whenever the need and the opportunity arise.
Adopting SLURM for Efficient Resource Management
We have established our cluster using SLURM (formerly known as Simple Linux Utility for Resource Management). Our decision is based on several factors. First, SLURM supports all required resources, including CPU, RAM, and GPU. Second, it is compatible with the Linux operating system, Python, and other AI tools and models we commonly use. Third, SLURM is a stable and widely used solution. It is estimated that approximately 60% of the supercomputers on the Top500 list run on SLURM, and some of our staff have encountered it during their academic work.
How SLURM Enhances Resource Efficiency
This way, we have a tool that lets us “request” the resources needed to carry out a planned task. If the required resources are not available, the task will be queued and launched as soon as the resources become available. The workload manager will handle this without requiring the special involvement of an engineer at this stage. The engineer who submitted the task will receive email notifications when the task starts and ends. Since computing resources operate 24/7, this queuing method allows for more efficient use of resources, as tasks can be carried out outside of regular working hours. Additionally, with different types of resources available, one can “request” specific resource models or just a “type” of resource, which enhances the flexibility of this solution.
User Interaction with the SLURM Cluster
From the user’s perspective, we communicate with the cluster through a login node. This node is where tasks are prepared and configured before queuing and running. The controller, which knows the states of all compute nodes, then allocates the appropriate resources by assigning them to subsequent tasks in the queue.
Optimizing SLURM Implementation for Machine Learning Workloads
When setting up our SLURM implementation, we carefully considered the available computing hardware, storage options, and data access performance and security requirements. The key conclusions we reached during this phase, and upon revisiting it after implementation, were as follows:
- Network: The cluster should have a fast internal network connection of at least 10 Gbit/s. AI model training often involves large datasets that must reach the compute node(s) where the job will be run. Even if a node can handle the task quickly, it won’t be efficient if downloading the data for computations significantly slows it down. Ideally, such a network should be redundant to avoid a single point of failure that could take the entire cluster out of service.
- Data storage: Data should be easily accessible to each compute node, fast to read and write (since computation results need to be saved), and secure. A possible node failure should not affect the data on the node. NAS servers and distributed network file systems like Ceph can fulfill these requirements.
- Operating system unification: All compute nodes should run under the control of the same operating system version. The available system and programming libraries should also be identical on each node. We cannot allow a situation where the programmed task code cannot run because a node lacks a required library. We base our systems on Ubuntu, which allows us to create code on Ubuntu desktops and then run it on an Ubuntu cluster. This consistency makes it easier and faster for us to design and develop solutions.
- Computational libraries: The development of external libraries and models progresses every day. We need to work with different versions of libraries due to varying requirements. Given the number and variety of projects our engineers work on, each Machine Learning Engineer requires considerable disk space. Adding up the needs of all engineers only increases the complexity. We implemented a general library store using a network file system and LMOD to address this. This allows us to maintain only one copy of a given library (in a given version), available at any time on any node. Engineers can enable the libraries they need, disable the ones they don’t, or experiment with different library versions. This approach reduces the need to clean up environments of excess libraries.
- Temporary data: In this cluster architecture, users can only submit jobs through the access node and cannot select a specific compute node on which the job will run. Therefore, all necessary elements, such as code or temporary data, must be available on each compute node. Users upload the required files through the access node, and the compute nodes have immediate access to them. Distributed network file systems, such as Ceph and Lustre, are ideal for this purpose. For data downloaded from NFS, we use FS-Cache, which works very well for frequently used files (see the sketch below).
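As a rough illustration of the FS-Cache setup mentioned above, here is a minimal sketch for an Ubuntu compute node; the server name, export path, and mount point are placeholders rather than our actual configuration.

```
# Install the cache daemon that backs FS-Cache with local disk storage
sudo apt install -y cachefilesd

# Enable it (Debian/Ubuntu keep the switch in /etc/default/cachefilesd)
echo "RUN=yes" | sudo tee -a /etc/default/cachefilesd
sudo systemctl enable --now cachefilesd

# Mount the NFS share with the "fsc" option so repeated reads are cached locally;
# nas.example.com and both paths are hypothetical placeholders
sudo mount -t nfs -o fsc nas.example.com:/export/datasets /mnt/datasets
```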
Example of Cluster Configuration
Let’s use the example of our test cluster, consisting of one access node, two compute nodes without GPUs, and two compute nodes with four GPUs each. If we develop the solution well, we can scale it quite freely (by adding more nodes) and implement it (or help with implementation) on the client’s infrastructure or in the cloud.
The backbone is a 2 x 10Gbps Ethernet network based on redundant switches, enabling speeds up to 20Gbps. This technology is reasonable in our case as it does not expose us to additional costs and offers good performance.
Data Management and Efficiency
For data storage, we will use a NAS array and a network FS (file system) built with the help of Ceph. In this context, “data” should be understood as everything needed for work that must be saved on disk and available on each compute node: input data for calculations (mostly customer data), libraries, models, repositories, code, temporary data, etc. This setup also allows us to react flexibly and distribute data properly depending on the quantity and type of projects we are working on. Additionally, Ceph allows us to manage engineers’ temporary data, such as models, in a sensible way. If one engineer needs to use model X, it is downloaded and saved in a temporary cache. If another engineer wants to use model X, they do not have to download it again because the model is already available in the cache. Over time, this workflow improvement saves disk space and enhances efficiency.
*A small off-topic here – we omit security issues such as data encryption, communication encryption, and permission management of who can see what, etc. These are beyond the scope of this article. However, they are considered, implemented, and practiced following industry standards.*
Efficient Library Management with LMOD
The next question was what to do with the excess of libraries needed by our engineers, downloaded and saved here and there. Perhaps some of them (as it turned out – a significant portion) could be shared by installing them once, in one place, and making them available “in one go” to all nodes. Drawing on the experience of others, we used LMOD, an environment modules system. It allows you to make both libraries and programs available as loadable modules. This, in turn, gives the ability to quickly and easily switch between different libraries or programs and between their versions. At the same time, we save space and time because one copy of the software is installed and made available for general use. From our point of view, it is important to switch easily between different versions of Python or CUDA and to test different versions of AI-related libraries such as PyTorch, NumPy, TensorFlow, and many others.
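To give a feel for the resulting day-to-day workflow, here is a short sketch of browsing and switching modules with LMOD; the module names and versions are illustrative and mirror the examples used later in this article.

```
# List the modules published in the shared module tree
module avail

# Search for every available version of a given library
module spider torch

# Load a specific toolchain (names/versions are illustrative)
module load apps/python/3.10 apps/cuda/11.7
module list

# Swap one CUDA version for another without touching the rest
module swap apps/cuda/11.7 apps/cuda/11.8

# Unload everything when done
module purge
```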
Practical Configuration of a SLURM Cluster
Access Node and Cluster Controller
Our access node will also be the cluster controller, running two main cluster services: slurmctld and slurmdbd. If you have the option, it is worth separating these services and the access node onto three independent machines. This solution is more sensible in large environments (actual supercomputers, university labs, etc.). However, we are not building a supercomputer; we are effectively utilizing available hardware resources. Moreover, we have checked our needs and capabilities, and tests have shown that this solution fully meets our needs.
Node Authentication with Munge
SLURM uses Munge to authenticate communication between nodes. The service must be installed and running with the same (secret) key on all nodes so that they can communicate with each other. It’s important to have Munge working before starting SLURM.
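A minimal sketch of such a Munge setup on Ubuntu could look as follows; the host name is a placeholder, and the key is generated directly from /dev/urandom to keep the example distribution-agnostic.

```
# On every node: install Munge
sudo apt install -y munge libmunge2

# On the controller: generate a shared secret key
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key

# Copy the same key to every compute node (lab-gpu-01 is an example host)
sudo scp /etc/munge/munge.key root@lab-gpu-01:/etc/munge/munge.key

# Start Munge everywhere and verify that credentials are accepted remotely
sudo systemctl enable --now munge
munge -n | ssh lab-gpu-01 unmunge
```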
Configuring slurmdbd Service
The slurmdbd service saves information about executed tasks to the database. This is useful for statistical and accounting purposes – we can verify how many resources a given project or type of task consumed, how many tasks individual users launched, etc. Its configuration is simple and comes down to specifying database connection parameters (MySQL or MariaDB). Example configuration files are well documented. The most important options to configure are:
```
AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2
DbdHost=slurm-ctrl.example.com
StorageType=accounting_storage/mysql
StorageHost=db-host.example.com
StorageLoc=slurm_DB
StorageUser=slurm
StoragePass=ThePasswordThatShouldBeProtected
```
These cover the previously mentioned Munge authentication and the access credentials for the database.
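Once slurmdbd is running, the cluster, projects (accounts), and users can be registered with sacctmgr. A minimal sketch, reusing the cluster and account names that appear later in this article:

```
# Register the cluster in the accounting database
sudo sacctmgr add cluster slurm-lab

# Create an account (project) and attach a user to it
sudo sacctmgr add account LAB-test Description="Lab experiments" Organization=deepsense
sudo sacctmgr add user username Account=LAB-test

# Later: report resource usage per account and user (start date is a placeholder)
sreport cluster AccountUtilizationByUser start=2024-01-01
```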
Configuring slurmctld Service
The main controller is the slurmctld service. It is responsible for queueing tasks, allocating resources, and monitoring the states of nodes. Its configuration can be divided into two parts: the configuration of the workload and resource manager itself, and the definition of the cluster’s actual resources. The manager configuration offers a wealth of options, and it is worth studying the documentation to choose the ones appropriate for your setup. Let’s define our cluster:
```
AuthType=auth/munge
ClusterName=slurm-lab
SlurmctldHost=slurm-ctrl.example.com
GresTypes=gpu
```
As we already know, we use Munge for authentication. We name our cluster, specify the controller’s host, and define GPU (Gres stands for Generic RESource) as a resource type that the controller will manage.
Scheduling and Resource Selection
```
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
```
Backfill is the default plugin for managing the schedule in SLURM. It results in better utilization of the entire cluster, e.g., lower-priority tasks will be run if they do not interfere with the (predicted) execution of higher-priority tasks. The select/cons_tres plugin (in conjunction with the OverSubscribe parameter set for partitions) determines the ability to share unused resources (as opposed to the select/linear plugin, which operates at the level of entire nodes). Finally, the CR_CPU_Memory parameter sets the “logical” processor and memory as the “units” that we operate on when allocating resources (as opposed to, for example, CR_Core or CR_Socket). We can also use the CR_CPU parameter – then RAM will not be tracked when allocating resources to tasks.
Many other parameters allow us to adjust the cluster’s operation to our needs. For example, we can specify programs that will be run before and/or after the actual computational tasks, parameters for accounting statistics (duration, allocated resources, projects, etc.), or power-saving management. The flexibility of configuration is high.
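For illustration, a few of these options in slurm.conf could look like the sketch below; the script paths, hosts, and threshold values are assumptions rather than our production settings.

```
# Run site-specific scripts before/after every job (paths are hypothetical)
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh

# Send accounting records to slurmdbd and track GPU usage as well
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-ctrl.example.com
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/cgroup

# Backfill scheduler tuning (values are examples only)
SchedulerParameters=bf_continue,bf_window=2880,bf_max_job_test=500

# Power saving: suspend idle nodes after 30 minutes
SuspendProgram=/etc/slurm/suspend.sh
ResumeProgram=/etc/slurm/resume.sh
SuspendTime=1800
```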
Defining Cluster Resources
It’s time to define our resources. In this part, we define the nodes. They can differ in parameters such as whether they have GPUs and of which type, but also simply in the number of CPUs and the amount of RAM. Therefore, it is necessary to describe well what they have so that the controller knows how it can allocate tasks depending on the requested resources, e.g.:
```
NodeName=lab-gpu-01 RealMemory=122880 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Feature="rtx4080" Gres=gpu:4
```
The lab-gpu-01 node has 120GB of RAM (the amount of memory we want to make available for tasks executed in the cluster), 2 processors with 12 cores each, where each core can execute two threads, along with 4 GPUs. By specifying the “Feature” parameter, you can introduce an additional level of detail for the requested resources. For example, if there are different GPUs or CPUs in the nodes, you can request that a task runs on a specific type of the desired resource. If we don’t specify anything, the task will be launched on the first available resource (in the requested quantity). Additionally, in the gres.conf file, we define these generic resources (in our case, GPUs) more precisely, for example:
```
NodeName=lab-gpu-01 Name=gpu File=/dev/nvidia[0-3]
```
In this way, SLURM will know that the devices /dev/nvidia[0-3] are responsible for GPU-type resources, allowing it to allocate them (and block other tasks from using them).
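To confirm that the GPU devices are correctly tied to this gres configuration, we can request a single GPU and check what the allocated job actually sees. A minimal sketch, assuming the ml-gpu partition defined in the next section:

```
# Ask for one GPU on the ml-gpu partition and run nvidia-smi on the allocated node
srun -p ml-gpu --gres=gpu:1 --pty nvidia-smi

# SLURM also restricts the job to the granted devices via CUDA_VISIBLE_DEVICES
srun -p ml-gpu --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
```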
Organizing Nodes into Partitions
Finally, we organize nodes into partitions, which are logical groups of nodes. A node can belong to more than one partition, but this should be carefully considered and probably tested in practice. The minimum is to create one partition and put all nodes into it, and the controller will handle organizing tasks according to the requested resources. However, we can (in our example) divide partitions based on resources. For example, one partition has nodes with GPUs, and another has nodes without GPUs. Example:
```
PartitionName=ml-cpu Nodes=lab-cpu-[01,02] Default=YES MaxTime=5-00:00:00 DefaultTime=04:00:00 State=UP AllowGroups=ml-users
PartitionName=ml-gpu Nodes=lab-gpu-[01,02] MaxTime=5-00:00:00 DefaultTime=04:00:00 State=UP AllowGroups=ml-users
```
We have two partitions with two nodes each: one partition only with CPU nodes (and this is the default partition) and the other with GPU nodes. We also set the maximum job duration to 5 days and the default to 4 hours, while AllowGroups specifies that only users in the ml-users group can use these partitions.
Please note that most of the above features/parameters are highly individual. Available resources are an obvious determinant, but the specifics (needs) of users and, finally, the type of projects (i.e., tasks being run) will significantly influence the final configuration.
Verifying Cluster Configuration
We can check the status of the configured cluster by executing the sinfo command on the controller, and the result should look as follows:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ml-cpu*      up 5-00:00:00      2   idle lab-cpu-[01-02]
ml-gpu       up 5-00:00:00      2   idle lab-gpu-[01-02]
```
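Beyond sinfo, a few standard commands help confirm that the configuration matches the actual hardware; for example:

```
# Detailed view of a single node: CPUs, memory, features, and gres
scontrol show node lab-gpu-01

# Full partition definitions as the controller sees them
scontrol show partition ml-gpu

# Per-node listing with state, CPUs, and memory in one table
sinfo -N -l
```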
Practical Experiences with Using SLURM
Accessing the Cluster
Time to start calculating our models. We gain access to the cluster via SSH through the login node, which in this case is the controller. It is on this node that we configure, prepare, and later run computational tasks. Therefore, we can upload the necessary input data for computations or code, or use SSH to download the code from a repository and prepare a virtual environment. Whatever we need. We should remember that because we are using network file systems shared between nodes, all our preparations will be immediately available on every compute node.
Tasks can be run in two modes:
- Interactive Mode: Use the srun or salloc commands. Remember that such tasks are run “here and now” (depending on resource availability). If our SSH session is interrupted, the running task will also be interrupted, and resources will be released.
- Batch Mode: Use the sbatch command along with an sbatch script. Such a task is put into the task schedule and will run (from our point of view) in the background. After starting it, you can immediately disconnect from the access node and, for example, wait for an email notification about its completion.
Interactive Task Example
Interactive tasks are useful for verifying if our computation preparations are working correctly, for example, if we have included all the necessary libraries, if the correct paths to data have been provided, and finally, if the code itself is invoked correctly. For example:
```
srun -p ml-gpu \
     -N 1 \
     --ntasks-per-node=1 \
     --mem-per-cpu=10GB \
     --gres=gpu:1 \
     --constraint="rtx4080" \
     --time=2:00:00 \
     -A LAB-test \
     --pty /bin/bash -l
```
We are launching a task on the (-p) ml-gpu partition, allocating one node (-N), one task per node, which by default gets one CPU core (--ntasks-per-node), 10GB of memory per core (--mem-per-cpu), and one GPU card (--gres) of type rtx4080 (--constraint), for a duration of 2 hours (--time) within the LAB-test project (-A). Finally, the task itself is to run a bash shell there.
This way, we get access to a compute node with the requested resources allocated. Then we load the necessary LMOD modules, e.g.:
```
module load apps/python/3.10
module load apps/cuda/11.7
module load libs/python/torch/1.13.1-cuda-11.7-python-3.10
module load libs/python/neptune-client/1.9.1-python-3.10
```
Thanks to LMOD, we immediately get access to the required Python version, CUDA libraries, and PyTorch, as well as the appropriate Neptune library, which allows us to save computation results for easy and efficient analysis later on. Assuming we have several versions of such programs or libraries available, switching between them is simple, which makes developing our code easier. Finally, we run the code itself, e.g.:
```
source ~/venv/bin/activate
python ~/code/program.py --data=/path/to/source/data
```
If necessary, we make appropriate corrections – add libraries, fix the code, and upload additional data. When everything is working correctly, we end the interactive session with:
logout
We return to the access node and, based on the above (tested) experience, prepare an sbatch script with the following content:
```
#!/bin/bash -l
## Project name
#SBATCH -A LAB-test
## Task name
#SBATCH -J Experiment-01
## E-mail notifications about job progress
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=username@example.com
## Number of nodes
#SBATCH -N 2
## Tasks per node (each task gets one CPU core by default)
#SBATCH --ntasks-per-node=20
## RAM amount per computing core
#SBATCH --mem-per-cpu=5GB
## Maximum task duration (format D-HH:MM:SS)
#SBATCH --time=1-00:00:00
## Partition selection (the default is ml-cpu, according to the SLURM config)
#SBATCH -p ml-gpu
## Number of GPUs per node
#SBATCH --gres=gpu:4
## Save output logs to file
#SBATCH --output="/home/username/slurm-logs/experiment-01.%j-%N.out.log"
## Save error logs to file
#SBATCH --error="/home/username/slurm-logs/experiment-01.%j-%N.err.log"
## We may request a specific feature - rtx4080 in this case
#SBATCH --constraint="rtx4080"

## Now we repeat the steps checked during the interactive task,
## which is: load the necessary modules
module load apps/python/3.10
module load apps/cuda/11.7
module load libs/python/torch/1.13.1-cuda-11.7-python-3.10
module load libs/python/neptune-client/1.9.1-python-3.10

## activate the virtual environment and launch the job
source ~/venv/bin/activate
python ~/code/program.py --data=/path/to/source/data
```
Please note that in the sbatch script, all comments start with ##. On the other hand, #SBATCH indicates parameters passed to SLURM – these must be understandable and acceptable to it.
Running the Batch Job
This time, we want to run the task on 2 nodes from the ml-gpu partition. On each of them, we allocate 20 cores with 5GB of memory for each core (i.e., 100GB per node) and 4 GPUs. We additionally specify that these should be rtx4080 cards. The task should not take longer than 1 day.
The above parameters are an example. In practice, some tasks take several days to compute. But there are also (rare) situations where we can use most (or even all) of our resources for one task; in that case, the task’s duration is significantly shortened. Of course, knowing how to select parameters and estimate time comes with experience, but we already have some, and after all, we want to use our resources more efficiently.
Finally, we specify in which case (BEGIN, END, FAIL) and to which address notifications about the job status should be sent.
From now on, we can run as many such tasks as we need. If the requested resources are not currently available, the tasks will be queued and launched as resources free up. So, after a whole day of preparing code and calculations, we can safely queue them and go for a well-deserved rest, returning the next day to find the calculations in progress or even the results ready.
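For completeness, a few commands we can use to keep an eye on queued and finished jobs; the job ID 1234 is just a placeholder:

```
# Show our own jobs: state PD means pending (queued), R means running
squeue -u $USER

# Detailed information about a specific job, including why it is still pending
scontrol show job 1234

# Cancel a job that is no longer needed
scancel 1234

# After completion: accounting data such as elapsed time and allocated resources
sacct -j 1234 --format=JobID,JobName,Elapsed,AllocTRES,State
```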
Conclusion
We wanted to use our computing resources effectively. We have a variety of hardware, both old and new, and we can now use each resource to the limits of its capabilities. Ultimately, this translates into calm, efficient work and better solutions for our customers. SLURM itself is flexible and offers enormous possibilities. Combined with network file systems, LMOD, and thoughtfully designed resource partitioning, it gives us a tool that spares us from (im)patiently waiting for computing resources. Of course, these are always in short supply, but a cluster working 24/7 makes access to them much easier.
Let us know if you need professional consultation, help with your SLURM configuration, or have questions about the above article!