Paramount factors in successful machine learning projects. Part 2/2

January 9, 2023, in Machine learning, by Robert Bogucki and Jan Kanty Milczek

In the first part of our guide we focused on properly executing the entire process of building and implementing machine learning models, with a focus on the main goal – solving the overarching business challenge. In the second part we dig deeper into the topic of modeling.

MODEL. It’s never too late to learn.

Machine learning should be approached differently depending on the goal. If you’re taking part in a competition, you’ll want to focus mainly on feature engineering and experimentation – much of the other work will be done for you. In academia, researching previous approaches is vital. For commercial projects, you can likely skip a big portion of experimentation in favor of making doubly sure that you have a good idea of what is ultimately needed of you and how it fits into the big picture.

Understanding the goal and prioritizing the process accordingly is the key to success. The checklist presented below can help organize your efforts. While reading this checklist, you should decide which points are helpful means to your end and which would only slow you down.

Don’t shuffle your data (before reading this)
A lot of people set up their validation code to split the dataset into train and test sets and call it a day. This is very dangerous, as it is likely to introduce data leaks. Are we sure that time is not a factor? We probably shouldn’t use the future to predict the past. Was the data grouped by source? We should plan for any and all of these. A simple, yet very imperfect, sanity check is to validate twice – once on shuffled data, and once on a natural split (such as the last lines of a file). If there’s a big discrepancy, you need to think twice.
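
As a rough illustration, here is a minimal sketch of that double validation. It assumes a pandas DataFrame `df` ordered as in the source file, with a binary `target` column – both are assumptions, not part of the original text.

```python
# Minimal double-validation sanity check: shuffled split vs. natural (tail) split.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = df.drop(columns="target"), df["target"]

# Validation 1: shuffled split.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=0)
auc_shuffled = roc_auc_score(
    y_va, GradientBoostingClassifier().fit(X_tr, y_tr).predict_proba(X_va)[:, 1]
)

# Validation 2: natural split -- the last 20% of rows, in file order.
cut = int(len(df) * 0.8)
model = GradientBoostingClassifier().fit(X.iloc[:cut], y.iloc[:cut])
auc_natural = roc_auc_score(y.iloc[cut:], model.predict_proba(X.iloc[cut:])[:, 1])

# A large gap between the two scores suggests leakage or temporal drift.
print(f"shuffled AUC: {auc_shuffled:.3f}  natural AUC: {auc_natural:.3f}")
```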

Make the validation as real as possible
When testing your solution, try to imitate the exact scenario and environment the model will be used in as closely as possible. If it’s store sales prediction, you want a different model for new (or hypothetical) stores than for existing locations. You likely want different validations too. In addition to a time split, you need to add some lag to account for model deprecation, and you likely want to prevent the same store from appearing in both the training and validation sets. A rule of thumb is that “the worse the results the validation gives, the better it likely is”.
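
A hedged sketch of such a split for the store-sales example, assuming a DataFrame `df` with `date` and `store_id` columns and a hypothetical two-week deployment lag (all of these names and numbers are assumptions):

```python
import pandas as pd

LAG = pd.Timedelta(days=14)  # placeholder for the model-deprecation / deployment lag

# Time-based cutoff: the most recent 20% of dates go to validation.
cutoff = df["date"].sort_values().iloc[int(len(df) * 0.8)]
train = df[df["date"] <= cutoff]
valid = df[df["date"] > cutoff + LAG]   # leave a gap to mimic deployment delay

# For the "new stores" scenario, also keep validation stores out of training.
valid_new_stores = valid[~valid["store_id"].isin(train["store_id"].unique())]
```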

Try to make the main metric “natural”
When making a model, look for metrics that make sense in the context of the problem. AUC is good internally, but I challenge anyone to explain it to a non-technical person. Precision and recall convey information in a simpler way, but there are multiple ways to set the prediction threshold, and in the end they are two numbers. F1 needs to die already: it takes two flawed numbers and combines them in a flawed way to arrive at an even more flawed number. Instead, you could ask “How much money is a true positive/negative worth? How much does a false positive/negative cost?” If there is one metric non-technical people understand, it’s $$$.
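
To make the idea concrete, here is a hypothetical dollar-valued metric. The unit values below are placeholders that a real project would get from the business side, not figures from the original text.

```python
from sklearn.metrics import confusion_matrix

VALUE_TP, VALUE_TN = 100.0, 0.0    # e.g., profit from a correctly flagged case
COST_FP, COST_FN = -20.0, -150.0   # e.g., cost of a false alarm / a missed case

def dollar_score(y_true, y_pred):
    # Translate the confusion matrix into money earned or lost.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp * VALUE_TP + tn * VALUE_TN + fp * COST_FP + fn * COST_FN
```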

Set up your pipeline in a way that prevents mistakes
Modeling is often a long process. Structure your project and code in a way that ensures that avoidable mistakes are avoided. Obviously, the validation code should not care about the specifics of the model. The training code should not even have access to anything to do with validation.

Double check your framing
Establishing your validation pipeline is a nice opportunity to revisit how you framed the problem at hand and how this is being reflected in the pipeline. Plan a session or two with other stakeholders and domain experts as well to align on methodology, metrics, KPIs, initial results, and how it will all eventually fit.

MULTIPLE APPROACHES. No pain no gain.

Consistently developing good models is only possible when methodically testing hypotheses. You may believe you already know the best architecture/method for the job. What if your knowledge is outdated though? Or, worse still, you were wrong to begin with? The best way to verify is to carefully curate a gauntlet of different models, evaluate them and decide how they figure in the final solution.

Check random prediction & constant prediction
It might seem pointless to do so, but these baselines provide some context to the problem at hand. For example, they will help you assess whether a model is failing to learn anything at all, or just cannot break through some kind of glass ceiling.
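
A quick way to get those reference points is scikit-learn’s dummy estimators; `X_train`, `y_train`, `X_valid` and `y_valid` below are assumed to come from your validation pipeline.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Constant, random and class-frequency baselines in a few lines.
for strategy in ("most_frequent", "uniform", "stratified"):
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    print(strategy, accuracy_score(y_valid, dummy.predict(X_valid)))
```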

Develop a simple rule-based model
It may be predicting the previous value for time series, a “group by XYZ, then use median” approach, a regex detecting common patterns… We expect that any model we build would at least be an improvement on that.
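
For instance, a “group by, then use median” baseline for a sales problem might look like the sketch below; the `train`/`valid` DataFrames and the `store_id`, `item_id`, `sales` column names are assumptions.

```python
# Rule-based baseline: median sales per (store, item), with a global fallback.
medians = (train.groupby(["store_id", "item_id"], as_index=False)["sales"]
                .median()
                .rename(columns={"sales": "pred"}))

preds = valid.merge(medians, on=["store_id", "item_id"], how="left")["pred"]
preds = preds.fillna(train["sales"].median())   # unseen (store, item) pairs
```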

Try a classical ML model
Before using the newest and most capable tools at your disposal, try linear regression, gradient boosting, or something similarly easy to define, train & use. See how much each model can squeeze out of the data and try to understand why. Do not perform non-obvious feature extraction/selection; that’s not the point here.
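
A minimal sketch of such a first pass for a regression problem, assuming a feature matrix `X` and target `y` (and a recent scikit-learn version):

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Two cheap baselines on raw features -- no clever feature engineering yet.
for model in (Ridge(alpha=1.0), HistGradientBoostingRegressor()):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(type(model).__name__, -scores.mean())
```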

Test the standard approach for your problem class
Are you doing NLP? There’s probably a version of BERT for you. Financial modeling? Factorization Machines are likely still king. This not only helps you understand the problem better, but it may also become the foundation of the overall solution.

Use any kind of benchmark available
If there is a solution or an approach that has previously been used, test it. It may be your client’s current ruleset, or something you found in a research paper. It is paramount that you don’t spend too much time attempting to be worse than what’s already out there. Also, if possible, verify your validation pipeline – it should rank solutions in similar order as the public benchmarks. If it doesn’t, it’s important to investigate why.

Try ensembling, if you need it
Ensembling of different models (or different instances of the same model!) can help you squeeze out some more quality information from your data. It’s a powerful tool that regularizes your output. Please remember that the more distinct your models are, the better the ensemble performs. Just make sure not to overdo it. The time needed to build a big ensemble or use it for prediction quickly racks up.
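
A minimal averaging ensemble might look like this; the three fitted classifiers and `X_valid` are assumptions standing in for whatever models your pipeline produced.

```python
import numpy as np

# Plain average of predicted probabilities from three distinct models.
preds = np.mean(
    [model_a.predict_proba(X_valid)[:, 1],
     model_b.predict_proba(X_valid)[:, 1],
     model_c.predict_proba(X_valid)[:, 1]],
    axis=0,
)
# Rank-averaging or validation-weighted blending are common next steps,
# but every extra member adds training and inference cost.
```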

After the above, you can get a little crazy
Now you know what the pre-existing approaches are and can make educated guesses about the quality of your own. It’s high time to test them. Leverage your validation/test/evaluation pipeline to make decisions for you. If you think an idea is silly, perhaps you can run it through the code anyway – there’s not much to lose.

FEATURE ENGINEERING. Learn to walk before you run.

The topic of feature engineering has fallen out of fashion in recent years due to the prevalence of deep learning and the focus on models doing “their own feature engineering”. This does not mean that it is obsolete. In particular, the smaller the dataset, the more important it is to help the model understand the data.

Be creative
Think about properties that can be (even remotely) useful for your algorithms. This is an area where some domain knowledge may help, but you should go beyond this. If experts use 5 specific variables to assess something, ask them about 20 more that may be related to the problem.

Copy from others (or from your past self)
It may be worthwhile to spend some time inspecting similar problems and features that worked well there. Treat this as an inspiration – even if you can’t translate them directly, you may be able to come up with a proxy or an analogy.

Be thorough
Some features can be extracted from the way data is collected. When predicting the click-through rate for a commercial, looking at the user’s history is much more important than just the specific data point. A user who sees a lot of commercials in a short time frame is most likely a bot. We cannot infer this if we decide to treat each record as a separate entity.
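
As an illustration, history-based features for click-through-rate data could be derived roughly as below; the `user_id`, `timestamp` and `clicked` column names are hypothetical.

```python
# Per-user history features computed in chronological order.
df = df.sort_values(["user_id", "timestamp"])

df["user_impressions_so_far"] = df.groupby("user_id").cumcount()
df["user_clicks_so_far"] = (
    df.groupby("user_id")["clicked"].cumsum() - df["clicked"]   # exclude current row
)
df["seconds_since_prev_impression"] = (
    df.groupby("user_id")["timestamp"].diff().dt.total_seconds()
)
```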

Help your model
A lot of models are limited in terms of scope. Linear regression treats each feature separately to an extent – but we can make “new features” out of feature interactions (and feature interactions are not limited to cross-products). On the other hand, when using tree-based models, some (order-definable) categorical variables will work better as discrete values than one-hot encoded ones.
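
A small sketch of both ideas, with purely illustrative column names:

```python
# Explicit interactions for a linear model.
df["price_per_sqm"] = df["price"] / df["area_sqm"]   # a ratio interaction
df["rooms_x_floor"] = df["rooms"] * df["floor"]      # a cross-product

# Ordinal (instead of one-hot) encoding of an order-definable category for trees.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
df["size_code"] = df["size"].map(size_order)
```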

Remove or change features that are too specific
A full address is probably not needed – as opposed to info about being located in a city or not, or the distance to the closest highway…

CONTINUOUS IMPROVEMENT. To err is a model, to understand is divine.
Your model will be wrong in the future and this may make someone unhappy. What if you can predict what mistakes can happen? What if you can somehow spot them? Fix them?

Understand errors, predictions, variables and factors
By understanding your model’s shortcomings very well, you might not only get some ideas on how to mend them but also design the whole thing to make them less severe.

Know your predictions
The same goes for predictions. What do they usually look like? Is there a bias? What are the typical results in typical scenarios?

Know how to change your (model’s) mind
What variables make it work? Do you know how your model reacts to changes in hyperparameters? You may check whether increasing/decreasing them gives the desired result.

Reassess your results and validation pipeline
Simple models might not be powerful and expressive enough to expose the shortcomings of the validation pipeline. Once your errors are fewer or smaller, revisit the validation method and make sure it stays solid throughout the whole process.

META

You are most likely overwhelmed with the number of dos and don’ts in this guide. It does tell you to distinguish between the necessary and the unnecessary, but there are two more things that should be mentioned, especially if you’re working on the project with other people.

Periodically make an examination of conscience
Some of the best practices described here require planning, regular maintenance, good coding, and a careful approach. There are situations where those prerequisites deteriorate. Use your time to go back and bring your approach back up to par.

Document when cutting corners
Sometimes you cannot follow all of the best practices, e.g. there is no way to make a believable validation pipeline. Those compromises should be noted and explained, so that in future iterations of the project you (or especially others) know what should be improved.

How to perform self-supervised learning on high-dimensional data

October 28, 2022, in Computer vision, Artificial Intelligence, Machine learning, by Łukasz Kuśmierz


Introduction

As data sets grow and tasks become more complicated, supervised learning and reinforcement learning approaches turn out to be harder to apply efficiently. The reason is that the feedback signal needed during training is becoming increasingly hard to obtain.

In this post, I present another learning paradigm which is free of this problem – self-supervised learning. This method of training machine learning models is emerging nowadays, especially for high-dimensional data.

In order to focus the attention of this article, we will only work on examples from the computer vision area. However, the methods presented are general and may be successfully used for problems from other domains as well.

The challenge of labeling in supervised learning and reinforcement learning

Current deep learning techniques work great in tasks where there is plenty of data that comes with a feedback signal. In supervised tasks, the feedback comes in the form of error signals computed based on the labels attached to each data point. Thus, we have to specify the correct output for a given input.

In reinforcement learning, a scalar feedback signal is provided from the environment. In this case we do not have to state how the agent should behave, but we should be able to assess how the agent has performed. This is usually done in the context of tasks that have a temporal aspect — the AI agent acting in its environment over multiple time steps must adjust its behavior based on possibly delayed and sparse reward signals.

The problem with the supervised approach is that it does not scale very well. It may be relatively easy to label 100 images, but usually we need thousands or millions of labeled examples before the model can learn the nuances of the task. All practitioners can probably confidently say that there are hardly ever enough labeled data points.


Source: SOLO paper. If we want to train a good model to segment sheep using supervised learning, we have to color every single sheep manually in hundreds of pictures.

In the context of reinforcement learning, it may be relatively easy to obtain many samples (episodes) from virtual environments, like with chess, Go, or Atari games. However, this is becoming much more difficult in the “wild” with an actual physical agent interacting with the real world. Not only is the environment richer and noisier, but it may not be feasible to obtain many episodes with bad actions (think about self-driving cars or AI controlling nuclear power plants).

This is one of the reasons why we almost always use transfer learning. Here, we take a model that was pre-trained on another dataset (e.g., in computer vision the standard practice is to use ImageNet) and use it as a starting point for training on our dataset. Note that ImageNet is in itself a fairly large labeled dataset. But there is so much more digital data available!

Could we somehow benefit from that data without laborious and time-consuming labeling? I will try to answer this question later in this article.

Pretext tasks

In the absence of labels, we do not have any clear task that can be written in terms of a cost function. How can we then learn from data itself?

The general strategy is to define an auxiliary, pretext task that gives rise to a self-supervised learning (SSL) signal. To do so, in general we must ask the model to predict some aspects of the input data, possibly given some corrupted version of that data. Perhaps the most straightforward idea is to work in the input space and ask the model to generate part of the input tensor given an input with that part masked or replaced. Such SSL methods are known as generative methods and have been a big hit in the context of natural language processing, where the masked words (or tokens) are predicted given the context defined by the surrounding text.

Similar ideas have been developed in the context of computer vision and other modalities.

For example, deep belief networks are generative models that jointly learn to map inputs to latent representations and to generate the same inputs given the latent representation, whereas masked autoencoders are tasked with reconstructing patches (pixels) that have been randomly masked from the input image, given the context (the non-masked pixels). Although such methods can be effective, generating images, sounds, videos, and other high-dimensional objects that feature a lot of variability is a rather difficult task. It would be great if we could come up with a simpler pretext task to achieve the same goal…

Useful representations

Wait, but what do we want to achieve anyway? As mentioned before, we want to pretrain our model on unlabeled data and then use it in other (“downstream”) tasks where available labeled data is limited. Therefore, the model should use the unlabeled data to learn something useful about that data: something that is transferable to other similar datasets.

One way of looking at this is to say that we want the model to take an input \(x\) (e.g., an image or a video clip) and output a vector \(z\) that represents the input in some useful way. Following the literature, we will refer to this vector as representation or embedding. Of course, the crucial thing here is to specify what “useful” means; in general, this will depend on the downstream task.

Fortunately, many tasks share some common features that can be utilized to assess whether a given representation is useful. Indeed, classification, detection, and segmentation tasks in computer vision or speech recognition are characterized by important invariances (e.g., I can recognize an object regardless of its position in the image) and equivariances (e.g., a detection box should encompass the entire object regardless of its position in the image). Notice that I did not specify whether the object is a dog or a pineapple — in this sense these are very general features that do not depend on many other details of the task.

Augmentations

Ok, but how can we translate invariances into good representations? The basic idea is to ensure that the representation features the desired invariances. To do so, we should first state the invariances – the easiest way of doing it is to list, in the form of a procedural definition, transformations that our desired representation should be invariant to.

These transformations can be implemented in the form of a function that takes an input \(x\) and outputs a new (possibly random) view \(x’\). Note that the same procedure is used as part of a standard supervised learning pipeline where it is referred to as data augmentation. In practice, when working with images one could for example use albumentations or the torchvision.transforms module of PyTorch.
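
For example, a two-view augmentation pipeline built with torchvision.transforms could look roughly like the sketch below; the particular list of transformations is illustrative rather than taken from any specific paper.

```python
import torchvision.transforms as T

# One stochastic augmentation pipeline; sampling it twice gives two views.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def two_views(pil_image):
    # Two independent samples of the same augmentation: view A and view B.
    return augment(pil_image), augment(pil_image)
```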

In this article I focus on computer vision where input is a static image, but the same methods can be adapted to other modalities including cross-modal self-supervision with, say, videos that contain both sound and a series of images. The crucial difference in how to deal with these other inputs lies in defining a good set of augmentations.

Invariance term: squeeze representations together

The next step is to formalize our intuition described above in the form of a cost function. As a reminder, our goal is to ensure that the representation (output of the model) is invariant under our chosen set of augmentations. To this end, let us take two views of the same input image \(x^A\) and \(x^B\) and pass them through the model obtaining a pair of joint embeddings \(z^A\) and \(z^B\). Next, we calculate the cosine similarity between these two representations $$\mathrm{sim}(z^A, z^B) \equiv \cos{(\phi)} =\frac{z^A \cdot z^B}{\Vert z^A\Vert \Vert z^B \Vert}.$$ Ideally, we should have \(\phi=0\) (hence, \(\cos{\phi}=1\)), so we want to minimize the cost function of the form \(l_{sim} = -\mathrm{sim}(z^A, z^B)\) which we will call invariance or similarity cost. As usual, this cost should be averaged over all images in the batch, leading to $$\mathcal{L}_{sim} = -\frac{1}{N}\sum_{i=1}^N \mathrm{sim}(z_i^A, z_i^B)$$ As the similarity cost decreases, representations of different views of the same image are pressed together, ultimately leading to a model that produces representations that are invariant under the set of transformations used to augment our dataset. However, this alone will not work as a good representation extractor.
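
In PyTorch, the invariance term alone is essentially a one-liner; `z_a` and `z_b` below are assumed to be the (N, D) batches of embeddings of the two views.

```python
import torch
import torch.nn.functional as F

def similarity_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    # Negative cosine similarity per image, averaged over the batch.
    return -F.cosine_similarity(z_a, z_b, dim=1).mean()
```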

Collapse

It is easy to understand why the similarity cost by itself is not enough. Take the following rather boring model

$$
f(x) = z_0,
$$

which ignores the input and always outputs the same representation (say, \(z_0 = [1,1,…,1]\)). Since this is the simplest solution to the optimization problem defined by \(\mathcal{L}_{sim}\) and “everything which is not forbidden is allowed”, we can expect such undesirable solutions to appear frequently as we optimize \(\mathcal{L}_{sim}\).

This is indeed what happens and this phenomenon is called a (representation) collapse. It is useful to think about the current self-supervised learning techniques in terms of how they avoid collapse. From this perspective, there are two main categories of SSL methods: contrastive and regularization-based. Below we describe in detail two popular examples, each relatively simple but still representative of its category.

Projection head

There is an additional detail: it is not actually \(z\) that is being used in downstream tasks. Instead, as it turns out, it is more beneficial to use an intermediate representation \(h\), see Fig. 1. In other words, the projection head \(g\) that is used to calculate \(z = g(h)\) is thrown away after the training. The intuition behind this trick is that the full invariance is actually detrimental in some tasks. For example, it is great if our model can report “dog” even if only the dog’s tail is visible in the image, but the same model should also be able to output “tail” or “dog tail” if it is asked to do so.

Fig. 1: A schematic diagram of a self-supervised training pipeline based on augmentations and joint embeddings. Although this image is taken from the SimCLR paper, the same overall strategy is employed in both contrastive and non-contrastive SSL methods.

Contrastive learning (SimCLR)

Contrastive learning methods can be thought of as generating supervision signals from a pretext discriminative task. In the past few years there has been an explosion of interest in contrastive learning and many similar methods have been developed. Here, let us focus on a famous example, SimCLR, which stands for “a simple framework for contrastive learning of visual representations”.

Indeed, the algorithm is pretty straightforward.

  • First, take a batch of images \((x_i)_{i\in\{1,..,N\}}\) where batch size \(N\) should be large.
  • Second, for a given input image \(x_k\) generate (sample) two views, \(\tilde{x}_i\) and \(\tilde{x}_j\). Note that this gives us a new, extended batch of augmented images of size \(2 N\).
  • Third, apply the same base encoder \(f\) and projection head \(g\) to each sample in the extended batch obtaining “useful” representations \(h_i = f(\tilde{x}_i)\) and “invariant” representations \(z_i = g(h_i)\).
  • Fourth, optimize \(f\) and \(g\) jointly by minimizing the contrastive loss \(\mathcal{L}_{InfoNCE}\).
  • Last, throw away \(g\) and use \(f\) in the downstream task(s).

But what is \(\mathcal{L}_{InfoNCE}\)? In the original paper this loss function was termed NT-Xent for the “normalized temperature-scaled cross entropy loss”, but it is basically a version of InfoNCE loss introduced in the Contrastive Predictive Coding paper, which in itself is a special case of noise-contrastive estimation. The main idea here is to split the batch into positive and negative pairs. Positive pairs are two different views of the same image and, as discussed above, their representations should be close to each other. The crucial idea is that all the other (“negative”) pairs are treated as non-matching pairs whose representations should be pulled apart. Note that this approximation makes sense only if the dataset is rich enough and contains many categories. In this case the likelihood that two randomly chosen images represent the same object (or two very similar objects) is small.

How to pull negative pairs apart? In SimCLR this is achieved by the following loss

$$
\mathcal{L}_{InfoNCE} = -\frac{1}{N}\sum_{i,\, j=P(i)} \log\frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
{\sum_{k\neq i}\exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
$$

where \(P(i)\) returns the index of the other view of the same image (positive “partner”) and \(\tau\) is a “temperature” hyperparameter that is introduced to adjust how strongly hard negative examples are weighed. One can think of this loss as a cross-entropy loss for multi-class classification with a softmax layer or, in other words, as a multinomial logistic regression.
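
A compact PyTorch sketch of this loss, following the formula above rather than any particular codebase; `z_a` and `z_b` are the (N, D) projections of the two views, so row i of `z_a` and row i of `z_b` form a positive pair.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / tau                                   # pairwise cosine / tau
    n = z_a.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))              # exclude k == i
    # The positive partner of i is i + n (and of i + n is i).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Cross-entropy over 2N-1 candidates per row implements the InfoNCE objective.
    return F.cross_entropy(sim, targets)
```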

The pretext task can then be summarized as follows: given a view of an image \(\tilde{x}_i\), find the other view of the same image among the set containing all the other \(2N - 1\) views of images in the extended batch. It is also easy to see that this loss function can be decomposed as

$$
\mathcal{L}_{InfoNCE} = \mathcal{L}_{sim} + \mathcal{L}_{con},
$$

where \(\mathcal{L}_{sim}\) is the familiar similarity term discussed above

and \(\mathcal{L}_{con}\) is a contrastive term that pulls all representations in the batch apart.

Additional details:

  • The base encoder can be any differentiable model. The authors of the original paper have opted for variants of ResNet-50 as this neural network has emerged as the standard architecture used to compare different methods.
  • In SimCLR the projection head is a simple multilayer perceptron with a single hidden layer. The dimensionality of \(z\) does not have to be very large but it is important for the projection head to be nonlinear.
  • In the original paper the authors have presented the results of systematic experiments aiming to find the best set of augmentations: see Fig. 2 that shows the augmentations studied in the paper. They found that no single transformation is enough to learn good representations. The best results among pairs of transformations were obtained by combining random cropping with random color distortion. Interestingly, the authors have also included random Gaussian blur in their standard pipeline.
  • This method strongly benefits from relatively large batch sizes and long training sessions.
  • Some interesting limitations of this and other contrastive methods are discussed in this paper. If many objects are present in images, the dominant object may suppress the learning of statistics of smaller objects. Similarly, easy-to-learn shared features may suppress the learning of other features.
Fig. 2: Augmentations studied in the SimCLR paper.

Noncontrastive methods (Barlow Twins)

Noncontrastive methods avoid collapse without relying on negative pairs. This class is quite diverse and includes methods such as BYOL and SimSiam, which break the symmetry between two branches that generate two views (and their representations) of input images, as well as methods based on clustering like ClusterFit or SwAV.

Another idea is to minimize the redundancy between components of \(z\). The reduction of redundancy is the cornerstone of the efficient coding hypothesis, a theory of sensory coding in the brain proposed by Horace Barlow, hence the name Barlow twins. Here, the cost function is based upon the cross-correlation matrix \(\mathcal{C}\) of size \(M\times M\), where \(M\) is the number of representation neurons (dimensionality of \(z\)). As before, two views of each image in the batch are passed through the network leading to two representations per image, \(z^A\) and \(z^B\).

Previously we have used the notation \(z_i\) to denote the representation of an image \(i\). To better understand Barlow Twins, we have to extend our notation. Let \(z_{i,\alpha}\) denote the \(\alpha\)-th component (neuron) of vector \(z_{i}\) and \(\overline{z_{\alpha}}=(1/N)\sum_{i}z_{i,\alpha}\) the batch average of that component. Each component can be z-scored (normalized) over the batch

$$
u_{i,\alpha} = \frac{z_{i,\alpha} - \overline{z_{\alpha}}}{\sqrt{\overline{z_\alpha^2}-\overline{z_{\alpha}}^2}}.
$$

The cross-correlation matrix is defined as

$$
\mathcal{C}_{\alpha\beta}
=
\overline{u^A_{\alpha} u^B_{\beta}}.
$$

Note that only positive pairs are averaged over the batch here.

The loss is then defined as

$$
\mathcal{L}_{BT}
=
\sum_{\alpha} \left( (1 - \mathcal{C}_{\alpha\alpha})^2
+
\lambda \sum_{\beta\neq\alpha} \left(\mathcal{C}_{\alpha\beta}\right)^2
\right).
$$

The contrastive term is absent from this loss and collapse is avoided due to a different mechanism, which can be understood by analyzing two terms in the Barlow Twins loss. The first, invariance term is trying to push all the diagonal terms of the cross-correlation matrix towards \(1\) (perfect correlation).

Components of \(z^A\) and \(z^B\) are perfectly correlated when \(z^A\) and \(z^B\) are identical, as desired from the invariance principle. The second, redundancy reduction term is trying to decorrelate different neurons (components of \(z\)). This pushes the output neurons to contain non-redundant information about the inputs, leading to non-trivial representations.
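
A PyTorch sketch of the Barlow Twins loss that follows the equations above; `z_a` and `z_b` have shape (N, M) and `lambda_` is the off-diagonal weight.

```python
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambda_: float = 5e-3):
    n = z_a.shape[0]
    # z-score each component over the batch.
    u_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    u_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)
    c = (u_a.t() @ u_b) / n                                   # (M, M) cross-correlation
    on_diag = (1 - torch.diagonal(c)).pow(2).sum()            # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lambda_ * off_diag
```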

Additional details:

  • Barlow Twins do not need nearly as large batch sizes as SimCLR.
  • Unlike SimCLR, Barlow Twins benefit very strongly from a high-dimensional output (invariant) representation \(z\).

Summary and additional reading

Self-supervised learning is here to stay to complement supervised learning and reinforcement learning whenever getting enough labels or feedback signals from the environment becomes troublesome. As we saw, the key to beneficial training in a self-supervised manner is to smartly define the pretext task and to set the loss function carefully.

For those who would like to deepen their knowledge of this topic, I recommend the blog article on self-supervised learning written by Yann LeCun and Ishan Misra.

The recent rise of diffusion-based models

September 5, 2022, in Machine learning, by Maciej Domagała

Every fan of generative modeling has been living an absolute dream for the last year and a half (at least!). The past few months have brought several developments and papers on text-to-image generation, each one arguably better than the last. We have observed a social media surge of spectacular, purely AI-generated images, such as this golden retriever answering tough questions on the campaign trail or a brain riding a rocketship to the moon.


Sources: https://openai.com/dall-e-2/ and https://imagen.research.google/

In this post, we will sum up the very recent history of solving the text-to-image generation problem and explain the latest developments regarding diffusion models, which are playing a huge role in the new, state-of-the-art architectures.


A short timeline of image generation and text-to-image solutions.

It all starts with DALL·E

In 2020 the OpenAI team [1] published the GPT-3 model – a do-it-all huge language model, capable of machine translation, text generation, semantic analysis, etc. The model swiftly became regarded as the state-of-the-art for language modeling solutions, and DALL·E [7] can be viewed as a natural expansion of the transformer capabilities into the computer vision domain.

Autoregressive approach

The authors proposed an elegant two-stage approach:

  • train a discrete VAE model to compress images into image tokens,
  • concatenate the encoded text snippet with the image tokens and train the autoregressive transformer to learn the joint distribution over text and images.

The final version was trained on 250 million text-image pairs obtained from the Internet.

CLIP

During inference, the model is able to output a whole batch of generated images. But how can we estimate which images are best? Simultaneously with the publication of DALL·E, the OpenAI team presented a solution for image and text linking called CLIP [9]. In a nutshell, CLIP offers a reliable way of pairing a text snippet with its image representation. Putting aside all of the technical aspects, the idea of training this type of model is fairly simple – take the text snippet and encode it, take an image and encode it. Do that for a lot of examples (400 million (image, text) pairs) and train the model in a contrastive fashion.


Visualisation of CLIP contrastive pre-training, source: [9]

This kind of mapping allows us to estimate which of the generated images are the best match considering the text input. For anyone who would like to see the power of CLIP – feel free to check out my previous post on combining CLIP and evolutionary algorithms to generate images [deepsense.ai’s blogpost].

DALL·E attracted major attention from people both inside and outside the AI world; it gained lots of publicity and stirred a great deal of conversation. Even so, it only gets an honorable mention here, as the trends shifted quite quickly towards novel ideas.

All you need is diffusion

Sohl-Dickstein et al. [2] proposed a fresh idea on the subject of image generation – diffusion models.


Generative models, source: [13]

The idea is inspired by non-equilibrium thermodynamics, although underneath it is packed with some interesting mathematical concepts. We can notice the already known concept of encoder-decoder structure here, but the underlying idea is a bit different than what we can observe in traditional variational autoencoders. To understand the basics of this model, we need to describe forward and reverse diffusion processes.

Forward image diffusion

This process can be described as gradually applying Gaussian noise to the image until it becomes entirely unrecognizable. This process is fixed in a stochastic sense – the noise application procedure can be formulated as the Markov chain of sequential diffusion steps. To untangle the difficult wording a little bit, we can neatly describe it with a few formulas. Assume that images have a certain starting distribution \(q\left(\bf{x}_{0}\right)\). We can sample just one image from this distribution – \(\bf{x}_{0}\). We want to perform a chain of diffusion steps \(\bf{x}_{0} \rightarrow \bf{x}_{1} \rightarrow … \rightarrow \bf{x}_{\it{T}}\), each step disintegrating the image more and more.

How exactly is the noise applied? It is formally defined by a noising schedule \(\{\beta_{t}\}^{T}_{t=1}\), where for every \(t = 1,…,T\) we have \(\beta_{t} \in (0,1)\). With such a schedule we can formally define the forward process as

$$
q\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right)=\mathcal{N}\left(\sqrt{1-\beta_{t}} \mathbf{x}_{t-1}, \beta_{t} \mathbf{I}\right)
$$

There are just two more things worth mentioning:

  • As the number of noising steps increases \((T \to \infty)\), the final distribution \(q(\mathbf{x}_{T})\) approaches a very handy isotropic Gaussian distribution. This makes any future sampling from noised distribution efficient and easy.
  • Noising with a Gaussian kernel provides another benefit – there is no need to go step-by-step through the noising process to achieve any intermediate latent state. We can sample any latent state directly thanks to reparametrization$$
    q\left(\mathbf{x}_{t} \mid \mathbf{x}_{0}\right)=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0},\left(1-\bar{\alpha}_{t}\right) \mathbf{I}\right) = \sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \cdot \epsilon,
    $$where \(\alpha_{t} := 1-\beta_{t}\), \(\bar{\alpha}_{t} := \prod_{k=0}^{t}\alpha_{k}\) and \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\). Here \(\epsilon\) represents Gaussian noise – this formulation will be essential for training.

Reverse image diffusion

We have a nicely defined forward process. One might ask – so what? Why can’t we just define a reverse process \(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)\) and trace back from the noise to the image? First of all, that would fail conceptually, as we want to have a neural network that learns how to deal with a problem – we shouldn’t provide it with a clear solution. And second of all, we cannot quite do that, as it would require marginalization over the entire data distribution. To get back to the starting distribution \(q(\bf{x}_{0})\) from the noised sample we would have to marginalize over all of the ways we could arise at \(\mathbf{x}_{0}\) from the noise, including all of the latent states. That means calculating \(\int q(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\), which is intractable. So, if we cannot calculate it, surely we can… approximate it!

The core idea is to develop a reliable solution – in the form of a learnable network – that successfully approximates the reverse diffusion process. The first way to achieve that is by estimating the mean and covariance for denoising steps

$$
p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)=\mathcal{N}(\mu_{\theta}(\mathbf{x}_{t}, t), \Sigma_{\theta}(\mathbf{x}_{t}, t) ).
$$

In a practical sense, \(\mu_{\theta}(\mathbf{x}_{t}, t)\) can be estimated via the neural network and \(\Sigma_{\theta}(\mathbf{x}_{t}, t)\) can be fixed to a certain constant related to the noising schedule, such as \(\beta_{t}\mathbf{I}\).


Forward and reverse diffusion processes, source: [14]

Estimating \(\mu_{\theta}(\mathbf{x}_{t}, t)\) this way is possible, but Ho et al. [3] came up with a different way of training – a neural network \(\epsilon_{\theta}(\mathbf{x}_{t}, t)\) can be trained to predict the noise \(\epsilon\) from the earlier formulation of \(q\left(\mathbf{x}_{t} \mid \mathbf{x}_{0}\right)\).

As in Ho et al. [3], the training process consists of the following steps:

  1. Sample image \(\mathbf{x}_{0}\sim q(\bf{x}_{0})\),
  2. Choose a certain step in the diffusion process \(t \sim U(\{1,2,…,T\})\),
  3. Apply the noising \(\epsilon \sim \mathcal{N}(0,\mathbf{I})\),
  4. Try to estimate the noise \(\epsilon_{\theta}(\mathbf{x}_{t}, t)= \epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \cdot \epsilon, t)\),
  5. Learn the network by gradient descent on loss \(\nabla_{\theta} \|\epsilon - \epsilon_{\theta}(\mathbf{x}_{t}, t)\|^{2}\).

In general, loss can be nicely presented as

$$
L_{\text{diffusion}}=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{x}_{t}, t\right)\right\|^{2}\right],
$$

where \(t, \mathbf{x}_0\) and \(\epsilon\) are described as in the steps above.
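
Putting the five steps together, a minimal training-step sketch could look as follows. Here `model` stands for the noise-prediction network \(\epsilon_{\theta}(\mathbf{x}_{t}, t)\), the linear schedule is just one possible choice, and image tensors of shape (B, C, H, W) are assumed.

```python
import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)        # noising schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t

def training_step(model, x0):
    t = torch.randint(0, T_STEPS, (x0.shape[0],), device=x0.device)   # step 2
    eps = torch.randn_like(x0)                                        # step 3
    a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # closed-form q(x_t | x_0)
    eps_hat = model(x_t, t)                              # step 4: predict the noise
    return F.mse_loss(eps_hat, eps)                      # step 5: diffusion loss
```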

All of the formulations, reparametrizations and derivations are a bit math-heavy, but there are already some great resources available for anyone who wants to gain a deeper understanding of the subject. Most notably, Lilian Weng [13], Angus Turner [14] and Ayan Das [15] went through some deep derivations while maintaining an understandable tone – I highly recommend checking these posts.

Guiding the diffusion

The above part itself explains how we can perceive the diffusion model as generative. Once the model \(\epsilon_{\theta}(\mathbf{x}_{t}, t)\) is trained, we can use it to run the noise \(\mathbf{x}_{t}\) back to \(\mathbf{x}_{0}\). Given that it is straightforward to sample the noise from an isotropic Gaussian distribution, we can obtain limitless image variations. We can also guide the image generation by feeding additional information to the network during the training process. Assuming that the images are labeled, the information about class \(y\) can be fed into a class-conditional diffusion model \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\). One way of introducing the guidance into the training process is to train a separate model which acts as a classifier of noisy images. At each step of denoising, the classifier checks whether the image is being denoised in the right direction and contributes its own gradient of the loss function to the overall loss of the diffusion model.

Ho & Salimans [5] proposed an idea on how to feed the class information into the model without the need to train an additional classifier. During training, the model \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) is sometimes (with fixed probability) not shown the actual class \(y\); instead, the class label is replaced with the null label \(\emptyset\), so the model learns to perform diffusion both with and without the guidance. At inference, the model performs two predictions, once given the class label \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) and once not \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid \emptyset)\). The final prediction of the model is moved away from \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid \emptyset)\) and towards \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) by scaling with the guidance scale \(s \geq 1\).

$$
\hat{\epsilon}_{\theta}\left(\mathbf{x}_{t}, t \mid y\right)=\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid \emptyset\right)+s \cdot\left(\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid y\right)-\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid \emptyset\right)\right)
$$

This kind of classifier-free guidance uses only the main model’s comprehension – an additional classifier is not needed – which yields better results according to Nichol et al. [6].
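
A hedged sketch of classifier-free guidance at inference time: `model`, its `(x_t, t, y)` call signature and `null_label` are assumptions standing in for the trained class-conditional network and the null label \(\emptyset\).

```python
import torch

def guided_eps(model, x_t, t, y, null_label, s: float = 3.0):
    eps_cond = model(x_t, t, y)              # epsilon_theta(x_t, t | y)
    eps_uncond = model(x_t, t, null_label)   # epsilon_theta(x_t, t | empty label)
    # Move away from the unconditional prediction, towards the conditional one.
    return eps_uncond + s * (eps_cond - eps_uncond)
```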

Text-guided diffusion with GLIDE

Even though the paper describing GLIDE [6] architecture received the least publicity out of all the publications discussed in this post, it arguably presents the most novel and interesting ideas. It combines all of the concepts presented in the previous chapter nicely. We already know how diffusion models work and that we can use them to generate images. The two questions we would now like to answer are:

  • How can we use the textual information to guide the diffusion model?
  • How can we make sure that the quality of the model is good enough?

Architecture choice

Architecture can be boiled down to three main components:

  1. A UNet-based model responsible for the visual part of the diffusion learning,
  2. A transformer-based model responsible for creating text embedding from a snippet of text,
  3. An upsampling diffusion model responsible for enhancing the output image resolution.

The first two work together in order to create a text-guided image output, while the last one is used to enlarge the image while preserving the quality.

The core of the model is the well-known UNet architecture, used for the diffusion in Dhariwal & Nichol [8]. The model, just like in its early versions, stacks residual layers with downsampling and upsampling convolutions. It also consists of attention layers which are crucial for simultaneous text processing. The model proposed by the authors has around 2.3 billion parameters and was trained on the same dataset as DALL·E.

The text used for guidance is encoded in tokens and fed into the Transformer model. The model used in GLIDE had roughly 1.2 billion parameters and was built from 24 residual blocks of width 2048. The output of the transformer has two purposes:

  • the final embedding token is used as class embedding \(y\) in \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\),
  • the final layer of token embeddings is added to every attention layer of the model.

It is clear that a great deal of focus was put into making sure that the model receives enough text-related context in order to generate accurate images. The model is conditioned on the text snippet embedding, the encoded text is concatenated with the attention context and during training, the classifier-free guidance is used.

As for the final component, the authors used the diffusion model to go from a low-resolution to a high-resolution image using an ImageNet upsampler.


GLIDE interpretation of ‘a corgi in a field’, source: [6]

GLIDE incorporates a few notable achievements developed in recent years and sheds new light on the concept of text-guided image generation. Given that the DALL·E model was based on different structures, it is fair to say that the publication of GLIDE represents the dawn of the diffusion-based text-to-image generation era.

The next version – DALL·E 2

The OpenAI team doesn’t seem to get much rest, as in April they took the Internet by storm with DALL·E 2 [11]. It takes elements from both predecessors: it relies heavily on CLIP [9], but a large part of the solution revolves around the GLIDE [6] architecture. DALL·E 2 has two main underlying components, called the prior and the decoder, which are able to produce image output when stacked together. The entire mechanism was named unCLIP, which may already spoil the mystery of what exactly is going on under the hood.


Visualization of DALL·E 2 two-stage mechanism. Source: [11]

The prior

The first stage is meant to convert the caption – a text snippet such as a “corgi playing a flame-throwing trumpet” – into text embedding. We obtain it using a frozen CLIP model.

After text embedding comes the fun part – we now want to obtain an image embedding, similar to the one obtained via the CLIP model. We want it to encapsulate all the important information from the text embedding, as it will be used for image generation through diffusion. Well, isn’t that exactly what CLIP is for? If we want to find a respective image embedding for our input phrase, we can just look at what is close to our text embedding in the CLIP encoded space. One of the authors of DALL·E 2 [Aditya Ramesh, 2022] posted a nice explanation of why that solution fails and why the prior is needed: “An infinite number of images could be consistent with a given caption, so the outputs of the two encoders will not perfectly coincide. Hence, a separate prior model is needed to ‘translate’ the text embedding into an image embedding that could plausibly match it”.

On top of that, the authors empirically checked the importance of the prior in the network. Passing both the image embedding produced by the prior and the text vastly outperforms generation using only the caption or caption with CLIP text embedding.


Samples generated conditioned on: caption, text embedding, and image embedding. Source: https://arxiv.org/pdf/2204.06125.pdf

The authors tested two model classes for the prior: the autoregressive model and the diffusion model. This post will cover only the diffusion prior, as it was deemed to perform better than the autoregressive one, especially from a computational point of view. For the training of the prior, a decoder-only Transformer model was chosen. It was trained using a sequence of several inputs:

  • encoded text,
  • CLIP text embedding,
  • embedding for the diffusion timestep,
  • noised image embedding,

with the goal of outputting an unnoised image embedding \(z_{i}\). As opposed to the way of training proposed by Ho et al. [3] covered in the previous sections, predicting the unnoised image embedding directly was a better fit than predicting the noise. So, remembering the previous formula for the diffusion loss in a guided model

$$
L_{\text{diffusion}}=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{x}_{t}, t\mid y\right)\right\|^{2}\right],
$$

we can present the prior diffusion loss as

$$
L_{\text{prior:diffusion}}=\mathbb{E}_{t}\left[\left\|z_{i}-f_{\theta}\left({z}_{i}^{t}, t \mid y\right)\right\|^{2}\right],
$$

where \(f_{\theta}\) stands for the prior model, \({z}_{i}^{t}\) is the noised image embedding, \(t\) is the timestamp and \(y\) is the caption used for guidance.

The decoder

We covered the prior part of the unCLIP, which was meant to produce a model that is able to encapsulate all of the important information from the text into a CLIP-like image embedding. Now we want to use that image embedding to generate an actual visual output. This is when the name unCLIP unfolds itself – we are walking back from the image embedding to the image, the reverse of what happens when the CLIP image encoder is trained.

As the saying goes: “After one diffusion model it is time for another diffusion model!”. And this one we already know – it is GLIDE, although slightly modified. Only slightly, since the single major change is adding the additional CLIP image embedding (produced by the prior) to the vanilla GLIDE text encoder. After all, this is exactly what the prior was trained for – to provide information for the decoder. Guidance is used just as in regular GLIDE. To improve it, CLIP embeddings are set to \(\emptyset\) in 10% of cases and text captions \(y\) in 50% of cases.

Another thing that did not change is the idea of upsampling after the image generation. The output is tossed into additional diffusion-based models. This time two upsampling models are used (instead of one in the original GLIDE), one taking the image from 64×64 to 256×256 and the other further enhancing resolution up to 1024×1024.

Imagen that we can do it better

The Google Brain team decided not to be late to the party, as less than two months after the publication of DALL·E 2 they presented the fruits of their own labor – Imagen (Saharia et al. [12]).


Overview of Imagen architecture. Source: [12]

Imagen architecture seems to be oddly simple in its structure. A pretrained textual model is used to create the embeddings that are diffused into an image. Next, the resolution is increased via super-resolution diffusion models – the steps we already know from DALL·E 2. A lot of novelties are scattered in different bits of the architecture – a few in the model itself and several in the training process. Together, they offer a slight upgrade when compared to other solutions. Given the large portion of knowledge already served, we can explain this model via differences with previously described models:

Use a pretrained transformer instead of training it from scratch. This is viewed as the core improvement compared to OpenAI’s work. For everything regarding text embeddings, the GLIDE authors used a new, specifically trained transformer model.
The Imagen authors used a pretrained, frozen T5-XXL model [4]. The idea is that this model has vastly more context regarding language processing than a model trained only on the image captions, and so is able to produce more valuable embeddings without the need to additionally fine-tune it.

Make the underlying neural network more efficient. An upgraded version of the neural network called Efficient U-Net was used as the backbone of the super-resolution diffusion models. It is said to be more memory-efficient and simpler than the previous version, and it converges faster as well. The changes were introduced mainly in the residual blocks and via additional scaling of the values inside the network. For anyone who enjoys digging deep into the details – the changes are well documented in Saharia et al. [12].

Use conditioning augmentation to enhance image fidelity. Since the solution can be viewed as a sequence of diffusion models, there is an argument to be made about enhancements in the areas where the models are linked. Ho et al. [10] presented a solution called conditioning augmentation. In simple terms, it is equivalent to applying various data augmentation techniques, such as a Gaussian blur, to a low-resolution image before it is fed into the super-resolution models.

There are a few other techniques deemed crucial to a low FID score and high image fidelity (such as dynamic thresholding) – these are explained in detail in the source paper [12]. The core of the approach is already covered in the previous chapters.


Some Imagen generations with captions. Source: [12]

Is it the best yet?

As of writing this text, Google’s Imagen is considered to be state-of-the-art as far as text-to-image generation is concerned. But why exactly is that? How can we evaluate the models and compare them to each other?

The authors of Imagen opted for two means of evaluation. One is considered to be the current standard for text-to-image modeling, namely establishing a Fréchet inception distance score on a COCO validation dataset. The authors report (unsurprisingly) that Imagen shows a state-of-the-art performance, its zero-shot FID outperforming all other models, even those specifically trained on COCO.


Comparison of several models. Source: https://arxiv.org/pdf/2205.11487.pdf

A far more intriguing means of evaluation is a brand new proposal from the authors called DrawBench – a comprehensive and challenging set of prompts that support the evaluation and comparison of text-to-image models (source). It consists of 200 prompts divided into 11 categories, collected from e.g. DALL·E or Reddit. A list of the prompts with categories can be found in [17]. The evaluation was performed by 275 unbiased (sic!) raters, 25 for each category.
Each rater was shown two non-cherry picked and random sets of images generated by two different models (e.g. Imagen and DALL·E 2) and had to respond to two questions:

  1. Which set of images is of higher quality?
  2. Which set of images better represents the text caption?

These two questions are meant to address the two most important characteristics of a good text-to-image model: the quality of the images produced (fidelity) and how well it reflects the input text prompt (alignment). Each rater had three choices – to claim that one of the models performs better, or to call it a tie. Once again, there can be only one winner. Interestingly, the GLIDE model seems to perform slightly better than DALL·E 2, at least based on this curated dataset.

Imagen vs. other models. Source: [12]

As expected, a large portion of the publication is devoted to the comparison between the images produced by Imagen and GLIDE/DALL·E – more can be found in Appendix E of [12].

The fun is far from over

As usual, with new architecture gaining recognition there is a surge of interesting publications and solutions emerging from the void. The pace of developments makes it nearly impossible to track every interesting publication. There are also a lot of interesting characteristics of the models to discover other than raw generative power, such as image inpainting, style transfer, and image editing.

Apart from the understandable excitement over a new era of generative models, there are some shortcomings embedded into the diffusion process structure, such as slow sampling speed compared to previous models [16].


Models comparison. Source: [16]

For anyone who likes to go deep into the minutiae of implementation, I highly recommend going through Phil Wang’s (@lucidrains on GitHub) repositories [20], which are a collaborative effort by many people to recreate the unpublished models in PyTorch.

For anyone who would like to admire some more examples of DALL·E 2’s generative power, I recommend checking the newly created subreddit with DALL·E 2 creations in [18]. It is moderated by people with OpenAI’s Lab access – feel free to join the waitlist [19] and have the opportunity to play with models yourself.

References

  1. Language Models are Few-Shot Learners Tom B. Brown et al. 2020
  2. Deep Unsupervised Learning using Nonequilibrium Thermodynamics Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli. 2015
  3. Denoising Diffusion Probabilistic Models Jonathan Ho, Ajay Jain, Pieter Abbeel. 2020
  4. How Much Knowledge Can You Pack Into the Parameters of a Language Model? Adam Roberts, Colin Raffel, Noam Shazeer. 2020
  5. Classifier-Free Diffusion Guidance Jonathan Ho, Tim Salimans. 2021
  6. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models Alex Nichol et al. 2021
  7. Zero-Shot Text-to-Image Generation Aditya Ramesh et al. 2021
  8. Diffusion Models Beat GANs on Image Synthesis Prafulla Dhariwal, Alex Nichol. 2021
  9. Learning Transferable Visual Models From Natural Language Supervision Alec Radford et al. 2021
  10. Cascaded Diffusion Models for High Fidelity Image Generation Jonathan Ho et al. 2021
  11. Hierarchical Text-Conditional Image Generation with CLIP Latents Aditya Ramesh et al. 2022
  12. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding Chitwan Saharia et al. 2022
  13. What are Diffusion Models? Lilian Weng. 2021
  14. Diffusion Models as a kind of VAE Angus Turner. 2021
  15. An introduction to Diffusion Probabilistic Models Ayan Das. 2021
  16. Improving Diffusion Models as an Alternative To GANs, Part 1 Arash Vahdat, Karsten Kreis. 2022
  17. DrawBench prompts Google Brain team. 2022
  18. DALL·E 2 subreddit Reddit. 2022
  19. OpenAI’s waitlist OpenAI team. 2022
  20. Phil Wang’s repositories Phil Wang. 2022

Data Science with Graphs – using knowledge graphs on the data before it reaches the ML phase

June 11, 2022/in Machine learning /by Grzegorz Rybak

Graph usage in AI recently became quite evident, with an increased number of research papers and some impressive examples in industry. This article aims to answer the question: Are there ways to improve a project’s delivery by using graphs even before reaching GraphML?

Introduction

“We can do a lot more with data to make systems that appear intelligent before we reach for the ML pipeline” – Dr. Jim Webber, Chief Scientist at Neo4j [“Graphs for AI and ML” conference talk]

Graph usage in AI recently became quite evident, with an increased number of research papers [figure 1] and some impressive examples in industry of Graph Neural Network-based architectures – such as AlphaFold 2 – turning the community’s attention to GraphML.

But perhaps focusing solely on this area overshadows the wider view of graph applications in data science and some of the other advantages they can bring in general?

This article aims to answer the question: Are there ways to improve a project’s delivery by using graphs even before reaching GraphML?

Spoiler: yes – it can happen.

Figure 1 - increased popularity of graph neural networks as indicated by the keyword usage in ICLR'21 paper submissions
Figure 1: increased popularity of graph neural networks as indicated by the keyword usage in ICLR’21 paper submissions. [Source: State of AI report 2021]

Knowledge graphs & graph databases

Starting with the basics, graphs are an abstract data type existing in various forms and shapes of implementation. Here are a couple of examples of graph entities for context:

  • Graph algorithms (notably, the famous Dijkstra for shortest path-finding or PageRank, the original algorithm for Google’s search engine)
  • Data Ontologies
  • Tim Berners-Lee’s Semantic Web concept
  • GraphQL (graph-resembling API design)
  • 3D mesh structures

For this article’s needs, I will focus on one particular type of graph application: graph databases.

Graph databases are an implementation of graphs enhanced with efficient persistence of the data, a robust graph query language to ask for data and its relationships, and (depending on the DB vendor) an in-DB interactive graph data visualisation for easier data exploration and analysis.

Graph DBs introduce the terminology of “nodes” for data subjects/objects and “edges” for relationships between the data.

They can also extend the graph data structure itself with new features – in particular, Labelled Property Graph databases allow for giving labels to different data nodes (which can work as a type or even a “tag” system), and for adding properties on both nodes and relationships.

Whilst one can model a database so that the relationships have any meaning (or even no meaning at all, their purpose being only to connect the data, for example in a specific geometric pattern), a particularly interesting aspect of graph databases for data science purposes is that they support forming knowledge graphs – a graph data representation in which connections between the objects have a defined, semantic meaning, allowing for enhanced reasoning while traversing the graph.

Figure 2 - example of a knowledge graph
Figure 2: example of a knowledge graph. [Source: Zhou, Zhixuan & Huankang, Guan & Bhat, Meghana & Hsu, Justin. (2019). Fake News Detection via NLP is Vulnerable to Adversarial Attacks.]

Knowledge graphs require a “knowledge base” – a resource (or many resources combined) defining the semantics of the relationships between the objects in the graph. In a strict sense (commonly accepted as the “proper” way by the knowledge graphs community), such a knowledge base should be a formal description of each possible relationship in the domain – for example, an automotive industry-focused knowledge graph could be built on top of the schema.org’s taxonomy for a “vehicle” concept.

However, I will relax this strict knowledge base definition and consider any kind of data model describing entities and the relations between them as a knowledge base too, as this will simplify introducing the forthcoming concepts while still being factual. In other words, as long as your graph data reflects the meanings you defined through any kind of ontology, taxonomy, or a data model created specifically for your project, it will be considered a knowledge graph in this article.

Figure 3 - A simplified view on the differences between a Graph Database & a Knowledge Graph
Figure 3: A simplified view on the differences between a Graph Database & a Knowledge Graph [Sources: data model, taxonomy, ontology]. *Data models are considered as a knowledge base for the purpose of this article.

There are two main graph DB technologies on the market and both, in principle, support creating knowledge graphs:

  1. RDFs (Resource Description Framework)
  2. LPGs (Labelled Property Graphs)

Whilst directly comparing the two technologies is out of the scope of this article, here is a useful article showing the main differences between them.

Having defined the common understanding of the knowledge graph concept in the context of the graph databases, let’s now proceed with exploring how they can be leveraged.

Graphs in a data science project – what else is there apart from GraphML?

To answer this question, I will make use of a great presentation by Dr. Victor Lee (VP of ML at Tigergraph – one of the LPG DB vendors) at the latest edition of the Connected Data World conference:

“Graph Algorithms & Graph Machine Learning: Making Sense of Today’s Choices”

During the talk, Dr. Lee breaks down a typical AI project into five main stages:

  1. Data Acquisition
  2. Data Cleansing
  3. Feature Extraction/Selection
  4. Model Training
  5. Model Deployment

The first three stages form a great basis for the list of benefits that I would like to expand on based on personal findings from past graph projects – they are as follows:

  1. Intuitive data modeling.
  2. Improved data exploration & data discovery.
  3. Faster data model iterations & enhanced flexibility when changing the data model.
  4. Enhanced feature engineering/selection abilities, specifically:
    1. Easier querying of inter-connected (or indirectly connected) data than in the tabular data – improved features selection.
    2. Data-enriching graph transformations (above all: data paths creation, computed properties, in-graph data restructuring).
  5. Graph data science algorithms (depending on the DB vendor), including:
    1. Centrality & Community detection, Link prediction, graph similarity, and other useful algorithms adding extra information dimensions to your data.

Let’s now analyse these benefits one by one in context of the highlighted AI project stages from Dr. Lee’s presentation.

Graph benefits at different project stages

Figure 4 - AI project stages breakdown – in this article, I will focus on the first 3 stages prior to the actual ML phase
Figure 4: AI project stages breakdown – in this article, I will focus on the first 3 stages prior to the actual ML phase. [Source: “Dr. Victor Lee – Graph Algorithms & Graph Machine Learning: Making Sense of Today’s Choices” – Connected Data World 2021 conference]

Stage 1: Data Acquisition

Figure 5 - high-level visualisation of constructing data in a graph database format – in this example, mapping Twitter data
Figure 5: high-level visualisation of constructing data in a graph database format – in this example, mapping Twitter data.

Benefit 1: Intuitive data modeling

Graphs are simple in data modeling design (in other words, they are “whiteboard friendly”). They are basically mind maps, which makes it easy to intuitively plan out the data stream and design the graph’s data model, as there is no need to worry about primary/foreign keys or structure rules that your data needs to follow.

Additionally, mind maps help explain the data modeling ideas to the business stakeholders of the project because they are usually more comprehensible than e.g. SQL relations mappings.
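
As a flavour of how little upfront structure this requires, here is a minimal sketch of loading the Twitter-style data from Figure 5 into a Labelled Property Graph using the Neo4j Python driver. The labels (User, Tweet), relationship types (POSTED, MENTIONS) and connection details are illustrative assumptions, not the article’s actual data model.

```python
# Whiteboard-style modelling: nodes and relationships are created on the fly,
# with no table schema, primary keys or foreign keys declared up front.
# The User/Tweet/POSTED/MENTIONS schema and connection details are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MERGE (u:User {handle: $author})
MERGE (t:Tweet {id: $tweet_id}) SET t.text = $text
MERGE (u)-[:POSTED]->(t)
WITH t
UNWIND $mentioned AS handle
MERGE (m:User {handle: handle})
MERGE (t)-[:MENTIONS]->(m)
"""

with driver.session() as session:
    session.run(cypher, author="alice", tweet_id=1,
                text="Hi @bob!", mentioned=["bob"])
driver.close()
```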

Extra benefit: Leverage open data standards via data ontologies (RDF DBs-specific feature)

In what can be described as perhaps the closest thing to transfer learning in the database world, RDF databases support building knowledge graphs on top of open-sourced data ontologies, i.e. publicly available knowledge graph data models of specific or general domains. This, in turn, can make the process of building your data model much faster. As described by one of the RDF DB vendors, Ontotext, in one of their articles:

“Now you may be thinking that’s all well and good but creating a realistic map of all of these relationships sounds like a herculean task to begin with. You wouldn’t be wrong, except that you don’t have to build out a knowledge graph from scratch.”

“Graphs on the Ground Part I: The Power of Knowledge Graphs within the Financial Industry” article, section: “Taking Advantage of External Knowledge” [Source]

For example, if you need to map the financial data in your project, you could use the entirety or fragments of the Financial Industry Business Ontology as part of your own data model.

Figure 6 - excerpt of the Financial Industry Business Ontology
Figure 6: excerpt of the Financial Industry Business Ontology (FIBO) [Source].

Note: some LPG DB vendors also provide ways to leverage external ontologies in LPGs via database extensions, for example Neo4j with their neosemantics plugin (example: FIBO in a Neo4j LPG DB).

Stage 2: Data cleansing

Figure 7 - An overview of the data refactoring workflow
Figure 7: An overview of the data refactoring workflow.

Benefit 2: Improved data exploration & data discovery

Visualise, explore, and interact with your data more easily

One of the key benefits of bringing data into a graph format is that many graph databases provide interactive visualisation of your data. This means the developer/data scientist can often get more insights and “see” the data better than when it’s in a tabular format. This also means it’s easier to find discrepancies and bad data patterns in your data and fix them at this stage, rather than performing a post-mortem analysis of a badly performing ML model.

Figure 8 - In-DB graph visualisation for a given graph data query in Stardog, an RDF-based graph database
Figure 8: In-DB graph visualisation for a given graph data query in Stardog, an RDF-based graph database (in other words: features selection). [Source]

More efficient human-driven entity resolution

Oftentimes, the data sources used on a project have a varying degree of quality with regard to duplicates. In situations where there are multiple data sources in your project – especially when the data is mutually inclusive – the problem can increase significantly: bringing the data together results in conflicts or hidden duplicate entities lurking all over your data.

Figure 9 - inter-connected data
Figure 9: inter-connected data

The dataset below, from the “What’s cooking?” series of articles on creating a food metrics knowledge graph from a collection of recipes (https://medium.com/neo4j/whats-cooking-part-5-dealing-with-duplicates-a6cdf525842a), shows a glimpse of this on real-life data (and that’s from one data source only):

Figure 10 - “What’s cooking ?” neo4j blog posts series. The article also introduces automated approaches to entity-resolution via graph algorithms
Figure 10: “What’s cooking ?” neo4j blog posts series. The article also introduces automated approaches to entity-resolution via graph algorithms [Source]

Therefore, being able to visualise the data allows for easier spot-checking of such phenomena compared to standard, tabular data. Additionally, in applicable scenarios, it allows more efficient cooperation with your data analysts or domain experts to identify and remove such cases without them knowing the technical details of your database (or query language).

Benefit 3: Easier data model iterations

Flexibility and ease of iterations

Graph databases are “schema-less” and they don’t require defining any structure/constraints/rules in advance when you start building/expanding your graph’s structure (a.k.a. “data model”).

Additionally, practically any problem can be shaped into a graph format, because graphs are conceptually simple – they’re just a bunch of vertices and the edges between them.

These two characteristics make graph DBs highly flexible and agile. And because they’re unconstrained by any design rules, they are highly iteration-friendly – changing a graph’s data model is much less hassle than e.g. de-normalising an SQL schema.

All this unlocks an extra degree of freedom on the project – you can focus on discovering your data “as it unfolds” rather than setting up your data’s schema early (perhaps when you don’t really know the data yet) and then fighting through any changes, or worse, sticking to a schema that won’t let you answer project-critical questions.

The 10:40–13:14 fragment of this talk tells more about this: https://youtu.be/GekQqFZm7mA?t=640

Stage 3: Feature Extraction/Selection

Figure 11 - Example of a non-trivial query
Figure 11: Example of a non-trivial query: find all users mentioned in tweets related to the tweet that user X commented on

Benefit 4: Enhanced feature engineering/selection abilities via the graph query language

Ease of querying indirectly linked data

One of the main advantages of graphs is how “cheap” it is to traverse through them, compared to costly joins in relational databases.

The figure above presents how to perform a rather non-trivial ask: find all people mentioned in any tweet related to the tweet our “main user” commented on. Such data requests require multiple “hops” through our data, yet the query was roughly 4 lines long.

Here’s an example of how a similarly inter-connected data query can, in extreme cases, compare to data in a relational database:

Figure 12 - Comparing SQL vs Graph queries when asking deeply-connected data
Figure 12: Comparing SQL vs Graph queries when asking deeply-connected data. Credits.

If your project often relies on questions like: “what are the next Nth connections to element X and how do they change if I change X?” Or if you have to query paths between your data on a daily basis – storing data in a graph can be a big advantage due to how much time you will save when querying that data – both in terms of developing and maintaining the queries, as well as from the DB performance perspective.

Note: this section used Cypher – a graph query language implemented across a couple of LPG DBs including Neo4j, RedisGraph & Memgraph – as the example to compare against SQL. An RDF graph DB query (using the SPARQL query language) would look a bit different; here’s a comparison.
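
The article does not reproduce the exact query from Figure 11, so below is a hedged reconstruction of a similar multi-hop ask, run through the Neo4j Python driver; the COMMENTED_ON, RELATED_TO and MENTIONS relationship types are assumptions made for illustration, not the article’s actual schema.

```python
# Hypothetical multi-hop query in the spirit of Figure 11: "all users mentioned
# in tweets related to the tweet that user X commented on". Relationship types
# and connection details are illustrative assumptions.
from neo4j import GraphDatabase

cypher = """
MATCH (x:User {handle: $handle})-[:COMMENTED_ON]->(t:Tweet)
MATCH (t)-[:RELATED_TO]-(related:Tweet)-[:MENTIONS]->(u:User)
RETURN DISTINCT u.handle AS mentioned_user
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(cypher, handle="user_x"):
        print(record["mentioned_user"])
driver.close()
```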

Data-enriching graph transformations

Convenient graph query languages capable of data transformations bring another advantage to the table – you can use them to perform calculations on your graph data and persist them.

Notable examples of enhancing your data:

  • Identify and create new paths/data relationships between given data points.
  • Generate computed properties based on close/distant relationships
  • Restructure/re-shape your data directly in-graph.

This very benefit of graph databases is leveraged by trase.finance – a graph DB-based platform for tracking down, monitoring, and calculating the biggest contributors to deforestation in the Brazilian and Indonesian tropical forest regions.

As shown in the figure below, trase.finance uses data-enriching graph transformations (specifically, computed properties generation) to dynamically propagate the deforestation risk and deforestation volume parameters across the company’s legal hierarchy and shareholding chains.

Figure 13 - Trase.finance platform using graphs to dynamically propagate calculations on interconnected data
Figure 13: Trase.finance platform using graphs to dynamically propagate calculations on interconnected data (figure taken from the methodology page)

Benefit 5: Graph data science

Finally, as the last piece of the superpowers coming with data stored in graph databases, let’s briefly mention the various useful graph algorithms, often built directly into the graph DBs. As this is the area where the line between what is “just” data science and what is GraphML is the blurriest, I will give only a high-level overview here, so that it can be explored further along with other ML solutions in a future article dedicated to GraphML.

This aspect tends to differ the most between RDFs & LPGs and, within them, their various DB vendors and implementations, but the main concept remains the same – leverage how the data is structured to run an algorithm on it and come up with meaningful analytics.

A common scenario for graph DBs, especially seen across Labelled Property Graphs’ DB vendors, is to contain a separate “add-on” library that implements a set of algorithms “out of the box”.

One example of that is the TigerGraph database with its Graph Data Science Library:

Sample group of graph algorithms implemented in Tigergraph’s Graph Data Science Library

Another example is Neo4j DB with its GDS library, here’s a helpful infographic of their sample graph algorithms suite:

Graph Algorithms & Functions in Neo4j
Source. Note: this is a historical infographic; there are many more algorithms in the current version, which you can inspect here.
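
To get a feel for what algorithms of this kind compute, here is a small sketch using networkx on an in-memory toy graph (not a graph database – the point is only to show the sort of outputs centrality and community detection produce):

```python
# Quick feel for centrality and community detection on an in-memory graph
# using networkx - graph DB vendors ship comparable algorithms in-database.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.karate_club_graph()  # classic toy social network

pagerank = nx.pagerank(g)                       # centrality: who is "important"
communities = greedy_modularity_communities(g)  # community detection

top = sorted(pagerank, key=pagerank.get, reverse=True)[:3]
print("Most central nodes:", top)
print("Number of communities found:", len(communities))
```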

Turning to the RDF world, an extra strength of theirs is that, apart from any built-in graph algorithms, RDF DB vendors – thanks to the integration with ontologies at the core level – implement a so-called reasoning engine, which allows the DB engine to draw inferences from the connections.

For example, imagine having a knowledge graph for an automotive industry business use case. Thanks to reasoning, you could query the data for anything that is a vehicle and the results would give you exactly that, even if the graph didn’t have any kind of “is a type of vehicle” connection for the nodes. This is because, provided the data is correctly built on ontologies and thus the connections have true semantics, the DB engine can infer that e.g. any motorcycle is also a vehicle, because the motorcycle entity is defined via an ontology (such as this one: https://schema.org/Motorcycle). Therefore, you could consider this feature of RDF DBs a form of machine learning since, as long as you keep the semantics of the connections in the data model coherent, the DB engine has an “understanding” of what the objects in the database are.

Figure 14 - Simple example of a basic reasoning
Figure 14: Simple example of a basic reasoning – DB engine will infer to give you both “The Beatles” and “John Lennon” results when asked about artists since either “Band” or “SoloArtist” have a semantic meaning [Source]
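
To make the idea concrete, here is a minimal sketch of this kind of inference using rdflib together with the owlrl reasoner; the two-triple “ontology” is a toy stand-in for a real vocabulary such as schema.org.

```python
# Minimal sketch of RDFS-style reasoning with rdflib + the owlrl reasoner:
# Motorcycle is declared a subclass of Vehicle, so after materialising the
# inferences a query for vehicles also returns the motorcycle instance.
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Motorcycle, RDFS.subClassOf, EX.Vehicle))   # tiny hand-made ontology
g.add((EX.my_bike, RDF.type, EX.Motorcycle))          # instance data

owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)  # add inferred triples

for subject in g.subjects(RDF.type, EX.Vehicle):
    print(subject)  # -> http://example.org/my_bike, inferred rather than stated
```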

In terms of graph algorithm libraries similar to those of the highlighted LPG vendors, the offering is less prominent than in the LPGs – for example, Ontotext, one of the leading RDF DB vendors, provides a handful of graph analytics features via their plugins. The closest example of a DB vendor offering a comparable library of out-of-the-box graph algorithms is Stardog with their Graph Algorithms module.

While the showcased graph analytics can be considered largely representing unsupervised ML, there are examples of Supervised ML in both graph DB types as well.

Stardog DB expands the idea of reasoning and allows creating classification, regression, and similarity models via using the semantically-defined objects in the graph as labels. More details can be found in their documentation.

For LPGs, Neo4j’s GDS library recently began supporting supervised ML algorithms too – as of writing this article, these are: logistic regression and random forest methods, along with two built-in ML pipelines: node classification and link prediction, which rely on the methods mentioned.

Finally, some DB vendors also provide interfaces for users to build their own custom graph algorithms and models if needed. For example, Memgraph DB exposes a Python-expandable query module object, along with docs explaining how to extend it [source].

It is likely that for certain use cases – e.g. community detection or certain constrained path-finding business requirements – the out-of-the-box graph algorithms will be sufficient for a project’s AI needs and become the sole reason to try using graph databases.

Conclusions and what’s next

The five benefits mentioned above answer the original question posed in the introduction – it is possible to improve a project’s delivery by using graphs without even reaching the GraphML stage.

Putting your project’s data into a knowledge graph structure through one of the graph databases can bring several advantages to the data workflow and, in certain cases, thanks to the built-in graph data science features, reduce the need for implementing the commonly perceived GraphML approach altogether.

The next article in the series will look into the ways of integrating the actual ML pipelines with graphs (with or without prior usage of graph databases) as well as review potential data challenges coming with using them in the project. Stay tuned for that!

Further learning resources 

  • Introduction to Graph Theory: A Computer Science Perspective – Useful introduction (or a refresher) to the topic of graphs as an abstract data type.
  • Graph Data Science 1.6 – What’s New – great presentation of both a general overview of AI in graphs from a graph DB vendor’s perspective and an overview of the features in Neo4j’s graph data science library (note that a 2.0 version of the library has recently been released; more information here).
  • Dr. Victor Lee – Graph Algorithms & Graph Machine Learning: Making Sense of Today’s Choices | CDW21 – an overview of the AI landscape in graphs.
  • https://levelup.gitconnected.com/knowledge-graph-app-in-15min-c76b94bb53b3 – an interesting, graph DB-agnostic example of how to quickly prototype a knowledge graph app to test graph’s usefulness on the project without committing to either of the two main graph DB technologies.

Logo detection in sports sponsorship

August 3, 2021/in Machine learning /by Michał Tadeusiak and Krzysztof Dziedzic

Consumers love brands that bring them closer to sporting events. This has compelled the largest brands to jump headlong into sports sponsorship. While the benefits of sponsorship are undeniable, measuring ROI precisely remains a challenge.

With machine learning-powered tools, brands can evaluate a campaign’s ROI by analyzing video coverage of sponsored events. Thanks to computer vision solutions, it is possible to precisely determine the time and place a brand was positioned during a sporting event. Image analysis also allows companies to see the branding activities of competitors and compare their brands’ visibility. Having precise information on brand positioning also enables advertising equivalent value to be calculated, the most impactful events to be determined and the activities of competitors to be monitored. Such analyses would be extremely time consuming and far less accurate if performed manually. Automated analysis based on advanced machine learning algorithms allows brands to quickly gain valuable new insights and boost the effectiveness of marketing campaigns. To address these needs, deepsense.ai developed an automated tool for logo detection and visibility analysis that provides both raw detections and a rich set of statistics.

Solution overview

deepsense.ai’s solution is based on a combination of two approaches – supervised object detection and one-shot learning.

Supervised object detection approach

In this approach, models are trained on a labelled data set using one of the many well-tested architectures. These include fully convolutional YOLO / SSD or the R-CNN family (region-based convolutional neural networks). Since video streaming is essential to logo detection during sports broadcasts, a fully convolutional model is the best choice. In this case, the model does not have to process each proposal region separately and has a constant inference time, independent of the number of objects detected. This enables the model to run in real time.

The advantages of this approach include:

  • Simplicity of operation, well-developed use cases (many open, tested and ready-to-use implementations).

But there are also disadvantages:

  • It’s impossible to quickly add a new version of a logo without obtaining a large amount of training data;
  • The system is very sensitive to changes in the appearance of the logo, and updating a model trained as described above would require a large amount of new data.

Therefore, in order to increase the efficiency of the system, the approaches were combined (supervised object detection + one-shot learning).

The One-shot learning approach

This approach effectively solves the problem of dynamic logos and allows us to add new logos to the database without the need to collect large amounts of data. All we need is a reference set of template vectors for each supported logo and a model that detects logo region proposals without performing classification on them. The model is trained with a triplet loss.

During the training process, each mini-batch consists of a three-element tuple:

  1. Example of company A logo (anchor),
  2. Photo with regions with the company’s logo (positive),
  3. Image with regions with a different brand’s logo (negative).

For architecture, we use the fully convolutional YOLOv3 model, which will both embed the template set of logos into a certain relatively low-dimensional vector space and detect objects in photos (but without assigning them specific classes).

During training, the “template logo” (anchor) is encoded by the same model that is used for object detection. The one difference is that on the feature map extracted for the anchor, we apply average pooling to obtain a single feature vector – the anchor vector.

An approximate diagram of this process is presented in the figure below.
Logo detection in sports sponsorship 1

The optimized target function in this case is the triplet margin loss – a differentiable function that yields small values if the vector representing the positive region is close to the anchor pattern vector (the logos are similar) and the negative region vector is far away from it (they are not similar).
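
Below is a simplified PyTorch sketch of this setup, assuming a generic backbone feature extractor in place of the real YOLOv3 network and embedding the positive/negative regions the same way as the anchor; it is meant only to illustrate the average-pooled anchor vector and the triplet margin loss, not the production training loop.

```python
# Simplified sketch of the triplet setup: the anchor logo's feature map is
# average-pooled into a single vector and pulled towards the embedding of the
# positive region while being pushed away from the negative one.
import torch
import torch.nn.functional as F

triplet_loss = torch.nn.TripletMarginLoss(margin=0.2)

def embed(backbone, image):
    feature_map = backbone(image)                            # (N, C, H, W)
    return F.adaptive_avg_pool2d(feature_map, 1).flatten(1)  # (N, C) vectors

def training_step(backbone, anchor_logo, positive_region, negative_region):
    anchor = embed(backbone, anchor_logo)        # template logo of brand A
    positive = embed(backbone, positive_region)  # region with brand A's logo
    negative = embed(backbone, negative_region)  # region with another brand's logo
    return triplet_loss(anchor, positive, negative)

# Toy usage with a stand-in backbone, just to show the shapes involved.
backbone = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
loss = training_step(backbone, torch.randn(8, 3, 64, 64),
                     torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64))
print(loss.item())
```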

After training, the model processes the available template logos to create a template vector database for each of the supported classes. At inference time, after a logo region is detected, we extract the vector representing this region and compare it with the templates. The label of the most similar template is selected as the class of the given detection.

Updating a model trained in this way only requires adding new elements to the reference vector database, or replacing an “old” logo with a “new” one, without having to retrain the model.
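
A minimal sketch of that inference step, assuming the embeddings are compared with cosine similarity (the actual similarity measure and threshold are not specified in the text):

```python
# Sketch of classifying a detected logo region by comparing its embedding
# against a database of template (anchor) vectors - supporting a new or
# redesigned logo only means adding/replacing rows in this database.
import torch
import torch.nn.functional as F

template_db = {             # brand name -> template vector (illustrative shapes)
    "brand_a": torch.randn(256),
    "brand_b": torch.randn(256),
}

def classify_region(region_embedding, templates, min_similarity=0.5):
    best_brand, best_sim = None, -1.0
    for brand, template in templates.items():
        sim = F.cosine_similarity(region_embedding, template, dim=0).item()
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    return best_brand if best_sim >= min_similarity else None  # None = unknown logo

print(classify_region(torch.randn(256), template_db))
```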

Parallelization of the stream

To speed up the system’s performance, we parallelized the stream rather than processing it “frame by frame”. Because the streaming data flows in gradually over time, we opted not to use “batch” inference with a single instance of the model. In this context, it is also important to synchronize the processes in order to return the processed stream elements in chronological order.

  1. We initialize n workers (parameter, natural number). Each of them is simply a pair of YOLOv3 detection networks (one trained with a supervised method and the other with one-shot).
  2. We create a FIFO queue to which we throw the incoming data from the stream and from which workers collect frames for processing.
  3. To ensure chronological order, workers push the processed frames onto a heap.
  4. A separate, looped process checks whether the heap is empty. If it isn’t, the element with the smallest id is taken from it; if that id is 1 greater than the id of the last processed frame, we update this variable and return the processed frame; otherwise the frame and its id are pushed back onto the heap.

The diagram below presents an approximate scheme of the system.
Logo detection in sports sponsorship 2

This approach significantly improves performance, enabling live processing.
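
A minimal, thread-based sketch of the queue-and-heap reordering logic described above (simplified for illustration; the production worker setup is not detailed in the text):

```python
# Sketch of the ordering scheme: workers pull frames from a FIFO queue, push
# (frame_id, result) pairs onto a min-heap, and one emitter loop releases them
# strictly in chronological order.
import heapq
import queue
import threading
import time

frames_in = queue.Queue()      # (frame_id, frame) arriving from the stream
processed = []                 # min-heap of (frame_id, result)
heap_lock = threading.Lock()

def worker(process_fn):
    while True:
        frame_id, frame = frames_in.get()
        result = process_fn(frame)          # run both detection models here
        with heap_lock:
            heapq.heappush(processed, (frame_id, result))

def emitter(emit_fn):
    next_id = 0
    while True:
        with heap_lock:
            ready = bool(processed) and processed[0][0] == next_id
            if ready:
                _, result = heapq.heappop(processed)
        if ready:
            emit_fn(result)                 # frames leave in order 0, 1, 2, ...
            next_id += 1
        else:
            time.sleep(0.001)

# Usage sketch: n worker threads plus one emitter thread.
for _ in range(4):
    threading.Thread(target=worker, args=(lambda f: f,), daemon=True).start()
threading.Thread(target=emitter, args=(print,), daemon=True).start()
for i in range(10):
    frames_in.put((i, f"frame-{i}"))
```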

Logo detection analytics

Automated logo detection analytics helps advertisers evaluate the results of sponsorships by providing a series of statistics, charts, rankings and visualizations that can be assembled into a concise report. The statistics can be calculated globally and per brand. Some features include brand exposure size in time, heatmaps of a logo’s position on the screen and bar charts to allow you to easily compare various statistics across the brands. Last but not least, we have a module for creating highlights – visualizations of the bounding boxes detected by the model. This module serves a double purpose: in addition to making the analysis easy to track, such visualizations are also a source of valuable information for data scientists tweaking the model.


Paramount factors in successful machine learning projects. Part 1/2.

May 21, 2022/in Machine learning /by Robert Bogucki

Much has been said about the effective running of machine learning projects. However, the topic keeps coming up. Data Scientists spend a lot of time discussing modeling methods, while – in my opinion – the overarching goal of running machine learning projects in companies fades into the background. It is vital to remember that the purpose of ML projects is not modeling itself, but achieving defined business goals.

While for data scientists modeling is often the most exciting part of the job, the other steps of an ML project should not be neglected, as doing so may imperil the valuable business results we set out to achieve in the first place. Properly executing the entire process of building and implementing machine learning models is essential.

PROCESS. As you make your bed, so you must lie upon it.

The entire ML project process can be described as a five-point checklist.

  1. FRAMING – the main goal here is to determine the essence of a business problem and phrase it in Data Science lingo.
  2. DATA – fuels the whole solution, so we need to painstakingly examine and understand it.
  3. MODELING – building the model is the core activity and often is viewed as the most exciting part.
  4. PRESENTATION & CONTINUATION – for our efforts to be fully appreciated, both the results and the solution must be described in a way that is understandable and useful for business stakeholders.
  5. PRODUCTION & MAINTENANCE – because data science projects require so much experimentation, how they are to be brought to the production stage may end up being neglected. Is the code up to snuff? Do my timelines need to be adjusted? Will the solution we produce ultimately solve the problem stakeholders need solved? Make sure that you think about this at some point.
FRAMING. All that glitters is not gold.

Often, before we start a project, it seems that everyone has the same understanding of the basic concepts and purpose. It is worth double-checking to ensure they do. A project deepsense.ai did for a client in the banking sector may serve as a good example of why. Our job was to predict churn, which seemed straightforward enough at first glance. The problem was that everyone’s underlying interpretation of “churn” was slightly different. Only a series of detailed questions allowed us to agree on what exactly we define as “churn,” taking into account the time horizon, specific customer activities and eventual net profit from this customer. The more thoroughly we analyze and define a business problem, the more precisely we will be able to transfer it into metrics.

These are the questions that are worth answering during this phase:

How will the solution be used?

Determine at the outset in what context the results of our modeling will be used. Will it be input for business decisions, support for process automation, or just some improvements within the system?

How should performance be measured?

Discuss the KPIs and establish how project success will be measured. The importance of this point shouldn’t be underestimated.

How will the solution be tested and validated?

Consider what your validation pipeline should look like during development and what additional testing should be done during and after it. Remember that business stakeholders must be able to confirm that the solution works well.

Will additional requirements or limitations arise?

Last but not least, analyze technical capabilities and possible limitations related to technical constraints, data extraction or tolerable latency.

DATA. A bad workman blames his tools.

The garbage in / garbage out principle is well known. I would encourage you to look at it not only from the perspective of data, but also of the model development.

Key aspects at this stage include:

Understand what data can be available, and request it early

Due to the complexity of business processes, we almost always encounter difficulties in obtaining properly prepared data. Take this into account and plan more time to request and validate data – possibly even before we officially kick off the project.

Understand how the data was extracted and preprocessed

Surprises are rare here, but when they occur, they tend to be big ones. The person responsible for data extraction may make certain decisions or errors that can significantly distort the results – either by accident, miscommunication or simply out of a lack of knowledge of modelling practices. It is always a good idea to understand the data extraction process and later thoroughly review the data with the business and data owners to confirm that you are all on the same page.

Explore the data

First of all, double-check that you have all the data you requested. Then perform the critical step of pre-modelling Exploratory Data Analysis. It should help you understand the data and the problem itself, generate insights, discover patterns, spot anomalies and test hypotheses. You can approach the data exploration as if it were up to you, rather than the model, to generate predictions. Don’t forget the simple stuff at this stage: ask basic questions, compute statistics, check feature and label distributions.
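
As a reminder of what the “simple stuff” can look like in practice, here is a minimal pandas sketch; the file name and the “label” column are hypothetical.

```python
# A few of the basic checks mentioned above, assuming a pandas DataFrame with
# a binary "label" column. The CSV path is a hypothetical client extract.
import pandas as pd

df = pd.read_csv("training_data.csv")

print(df.shape)                            # did we get all rows/columns we asked for?
print(df.describe(include="all"))          # basic statistics per feature
print(df.isna().mean().sort_values(ascending=False).head(10))  # worst missing-value offenders
print(df["label"].value_counts(normalize=True))                # class balance
print(df.duplicated().mean())              # share of exact duplicate rows
```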

Confront your findings with the data owners

With all the findings from the EDA, conclusions, insights, and new hypotheses, talk to the data owners and business stakeholders. Confronting all of these early and with people that know the data and problem (hopefully better and with more intuition than we have) will provide a better foundation for modelling.

MODEL. It is never too late to learn.

I hope I won’t disappoint you here, but given the importance of modelling and people’s eagerness to do it, I’m going to cover it in detail in the second part. As for now, just for the sake of order, let’s just cover a few key aspects.

Try a number of different approaches

Modeling is a process of constant experimentation and it is always worth trying a number of different approaches to models, features and hyperparameters. The key thing here is to do this in the proper order – starting from basic benchmarks and standard or off-the-shelf approaches before rolling up your sleeves and unleashing your creativity.

Understand errors, important variables, predictions, …

Understanding those will give us ideas for improvement as well as allow us to catch potential issues or anomalies in time.

Confront your findings and predictions with business stakeholders

Whatever your findings, make sure they are either expected or you work on explaining them. Another good educational exercise is asking business stakeholders what they expect the data to reveal and cross-checking this with what you discover.

Do at least a couple of iterations

Don’t forget the experimental nature of modeling. Developing a good solution usually takes time and requires multiple iterations. Don’t be afraid to revisit earlier steps if necessary – especially after discovering something new about the problem/data or stalling.

PRESENTATION & CONTINUATION. All’s well that ends well.

How and why does your solution achieve your business objective?

A well-thought-out presentation of the project’s results may determine its success or failure. That is why we need to pay particular attention to emphasizing the business value of the project and the impact of the modeling on specific business processes. Otherwise business stakeholders may not understand the solution we deliver or its actual value. Doing the work is one thing but solving the problem and convincing others that we have done so is quite another.

What steps are necessary towards full-deployment/productionization?

More often than not the PoC doesn’t fully reflect the project’s business value. Therefore, at this stage, ask yourself “what’s next?” in order to determine what further developments are needed to achieve the desired results.

Optional: Prepare for the handover

If you are aiming for a full handover, be sure that both sides are on the same page – no one likes surprises here.

PRODUCTION & MAINTENANCE. Don’t count your chickens before they hatch.

Make sure your code is production-ready

Because of the experimental nature of data science, code quality and general software engineering principles may take a back seat to modelling. Going “live” is the last call to account for this.

Monitor, measure, and retrain only if necessary

Make sure that you monitor both inputs (feature space) and outputs (your predictions and actual labels if possible). Detecting any data shifts late will surely make many people unhappy. As for retraining, understand how often it has to be done and figure out the right degree of automation.

THE DEEPSENSE.AI TAKEAWAY

Many businesses have already learned the value of putting machine learning to use. The role of Data Science or ML teams is growing and advanced data analysis is becoming a key factor to support strategic imperatives. It is therefore crucial that data scientists keep in mind the main goal of the project – solving or improving a specific business problem (as opposed to just playing with ML). The modelling itself may well be the key or core ML activity, yet if unaccompanied by all the other steps it won’t achieve the ultimate goal. Hopefully, the above checklist will improve your collaboration with business stakeholders and bring greater success to your projects.


Machine learning for applications in retail and manufacturing

February 25, 2021/in Machine learning /by deepsense.ai

Machine learning is still perceived as an innovative approach in business. The technological progress and the use of Big Data in business make ML-based solutions increasingly important. As Forbes magazine indicates, 76% of enterprises today prioritize artificial intelligence and machine learning over other IT initiatives [1].

Machine learning grew out of advanced data analysis focused on recognizing patterns and dependencies. The core assumption is that when models analyze new data, they can adapt independently, using this new knowledge to improve by learning from previous experience. Machine learning models can enhance nearly every aspect of a business, from marketing and sales to maintenance.

Predictive maintenance

Machine learning enables predictive monitoring, with algorithms anticipating equipment breakdowns before they occur and scheduling timely maintenance. With the work it did on predictive maintenance in medical devices, deepsense.ai reduced one client’s downtime by 15%.

But it isn’t just in straightforward failure prediction where machine learning supports maintenance. In another recent application, our team delivered a system that automates industrial documentation digitization, effectively reducing workflow time by up to 90%. We developed a model that recognizes and adds descriptions for all symbols used in the installation documentation. The schematics, including the technical descriptions of all components, are fully digitalized. The model reduces the work to a 30-minute review by a specialist. It also handles the most tedious tasks, thus reducing the effort required of human specialists and the number of errors they make in performing them.

An international manufacturer of medical devices was looking for a solution that would reduce device downtime. Our experts built a predictive maintenance model that pores over historical data, searching for anomalies and signs of a breakdown before one occurs. The model reduced breakdown-related downtime by more than 15%. Such a solution can be applied in machine-reliant industries, where breakdowns bring operations to a halt and hamper overall company performance.

Quality control

Machine learning is also being adopted for product inspection and quality control. ML-based computer vision algorithms can learn from a set of samples to distinguish the “good” from the flawed. In particular, semi-supervised anomaly detection algorithms require only “good” samples in their training set, making a library of possible defects unnecessary. Alternatively, a solution can be developed that compares samples to typical cases of defects.
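
A hedged sketch of this semi-supervised idea, using scikit-learn’s IsolationForest on placeholder feature vectors; in a real inspection system the features would come from an image model, and the contamination setting would be tuned to the line.

```python
# Train only on feature vectors of "good" products, then flag anything that
# looks unlike them. X_good / X_new are placeholder arrays standing in for
# per-product feature vectors extracted from images.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_good = rng.normal(0, 1, size=(500, 32))   # features of defect-free samples
X_new = rng.normal(0, 1, size=(10, 32))     # features of incoming products

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_good)
flags = detector.predict(X_new)             # +1 = looks normal, -1 = possible defect
print(flags)
```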

One of our clients asked us to tackle two visual inspection problems on a food production line – detecting sauce smears on the product’s inner packaging, and identifying the correct positioning of the product’s topping. The system deepsense.ai delivered was able to identify 99% of faulty products with topping defects, while raising the alarm with a 99% accuracy rate for sauce smears. The model significantly reduced the need for manual quality control, hence lowering costs.

ML-based computer vision solutions can also be an essential component in the monitoring of hazardous areas in factories, tracking whether every worker is following safety requirements (including wearing helmets, glasses, vests, earmuffs, etc.) and, if not, sending an instant alert to the supervisor with a detailed description of the event that has occurred.

Demand forecasting

In the field of predictive analytics demand forecasting can be used to predict consumer demand. Such forecasting is done by analyzing statistical data and looking for patterns and correlations. With machine learning taking the practice to a higher level, modern demand forecasting techniques go far beyond simple historical data analysis.

More recent techniques combine intuition with historical data. Modern merchants can dig into their data in search of trends and patterns. At the pinnacle of these techniques sit demand forecasting machine learning models, including gradient boosting and neural networks, which are currently the most popular types and outperform classic statistics-based methods. Historical transaction data forms the basis of these techniques – data that sellers already collect and store for fiscal and legal reasons and which, being searchable, is the easiest to use.

This modern approach is extremely effective. One of our clients from the retail industry was losing millions of euros a year due to out-of-stocks. There was a daily cap on how many new items its warehouse could receive. Our team built a demand forecasting model for products that were new to market. It enables the company to use the cap more efficiently by ordering more hot products and fewer of those that are less in demand. We used Gradient Boosting, Random Forest and Neural Networks to build the model, and the trifecta reduced out-of-stocks by 30%.
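
For illustration only, here is a toy sketch of a gradient boosting demand model in scikit-learn; the feature names and numbers are made up and do not reflect the client project described above.

```python
# Toy gradient boosting demand model. Features and targets are invented,
# purely to show the shape of such a forecasting setup.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

history = pd.DataFrame({
    "price": [9.99, 4.99, 19.99, 7.49],
    "category_avg_weekly_sales": [120, 340, 60, 210],
    "promo": [0, 1, 0, 1],
    "weekly_units_sold": [80, 410, 35, 260],   # target
})

X = history.drop(columns="weekly_units_sold")
y = history["weekly_units_sold"]

model = GradientBoostingRegressor().fit(X, y)
new_product = pd.DataFrame([{"price": 5.49, "category_avg_weekly_sales": 300, "promo": 1}])
print(model.predict(new_product))   # forecast used to allocate the daily warehouse cap
```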

Marketing optimization

Companies can maximize ROI on their marketing activities by implementing machine learning into their customer analysis. Sophisticated data analysis helps identify customers with the highest ROI on ads to make the most of marketing campaigns. It also optimizes channel mix with advanced attribution models.

deepsense.ai designed a model for a leading mobile advertising platform that predicts the click-through rate of internet advertisements. The model analyzes historical data on site user behavior to spot patterns and uncover anomalies. It enabled clients to identify an abnormal pattern among users, which turned out to be bots engaging in fraudulent clicking. The solution effectively identified internet bots that click ads, significantly boosting CTR predictions – up to 90% of bots were spotted and CTR predictions were improved by up to 35% over existing heuristics.

Summary

Analyzing large amounts of data has become a crucial part of the retail and manufacturing business landscape. It has traditionally been done by experts, based on know-how honed through experience. With the power of machine learning, however, it is now possible to combine the astonishing scale of big data with the precision and intelligence of a machine-learning model. While the business community must remain aware of the multiple pitfalls it will face when employing machine learning, that it endows business processes with awesome power and flexibility is now beyond question.

[1] https://www.forbes.com/sites/louiscolumbus/2021/01/17/76-of-enterprises-prioritize-ai–machine-learning-in-2021-it-budgets/?sh=6d24c5e3618a


AI solutions are boosting value in banking

August 10, 2020/in Machine learning /by Oleh Plakhtiy and Dawid Nguyen

According to research done by Business Insider Intelligence, banks will make some $450 billion by 2023 by applying Artificial Intelligence. No wonder, then, that AI is playing an increasingly important role in financial institutions’ roadmap for the coming years – 75% of banks with over $100 billion in assets already have AI strategies in place.

Banks don’t just create AI strategies, but are increasingly using AI and Machine Learning in their day-to-day business. We often work with them on ideation workshops, PoC and solution implementation. As an example, we recently had the opportunity to work with Santander Consumer Bank, running workshops and researching how to use ML to boost the sustainability of loan portfolios. We were able to significantly reduce risk while maintaining the same acceptance rate for extending loans.

Apart from credit risk modelling, there is already an impressive range of use cases for AI in banking, covering everything from customer service to back-office operations. Here is a list of the most common AI solutions in the banking sector:

Customer service automation

  • Chatbots – applying chatbots to automate customer service increases customer satisfaction. In fact, most simple issues can be solved entirely without human interference. Behind the scenes, automation significantly reduces customer service workloads.
  • Biometric identification enables explicit or unnoticed identity verification within remote channels. This can include voice identity verification in call centers or typing manner verification in online banking.

Customer insights:

  • Customer 360 view – applying deep learning to customer analytics makes it easier to combine insights from various data sources (e.g. transactions, online banking logs, call center interactions). This helps us better understand a bank’s customers and build personalized recommendations and Intelligent customer assistants, making the business more responsive and efficient.
  • Churn prediction – Thanks to accurate AI algorithms, churn probability predictions improve customer retention. This is important as customers often churn without obvious warning signs. Thus, it is difficult to run precisely targeted anti-churn campaigns. On the other hand, retention activities can be expensive, sometimes much more so than the value a potential customer may bring.
  • Customer lifetime value is often used to understand how valuable a particular relationship is and to optimize other activities – for example, by integrating Customer Lifetime Value with a probability-of-churn function to focus retention activities on the most valuable clients, as sketched below.
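
A toy sketch of that prioritization logic; the numbers, campaign cost and save-rate assumption are purely illustrative.

```python
# Combine churn probability with customer lifetime value to rank retention
# targets. All figures and assumptions below are illustrative only.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn_probability": [0.60, 0.10, 0.35],   # from a churn model
    "clv": [200.0, 5000.0, 1500.0],            # expected lifetime value
})

CAMPAIGN_COST = 20.0   # cost of a retention offer per customer (assumed)
SAVE_RATE = 0.30       # assumed share of would-be churners the offer retains

customers["expected_gain"] = (
    customers["churn_probability"] * SAVE_RATE * customers["clv"] - CAMPAIGN_COST
)
targets = customers.sort_values("expected_gain", ascending=False)
print(targets[targets["expected_gain"] > 0])   # contact only where it pays off
```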

Boosting Sales

  • New client acquisition – Deep learning is particularly suited to improving remarketing. As with the customer 360 view, it promotes the use of all possible information about a prospective customer. This includes cookies and how the individual has interacted with a website – from time spent to what they hovered over and how far into the site they went. Understanding customer behaviour enables a bank to focus marketing activities on potential customers and show them personalized ads, translating into up to a 2.5x uplift from advertising activities.
  • X-sell – ML techniques can be used to improve the selection of customers targeted for outbound CRM campaigns. They combine the benefits of both the Customer 360 view and advanced probability-of-purchase predictions. This allows a bank to choose the right customer and the right product to cross-sell. As an example, ML has been shown to improve credit card x-sell by 12.5%.

Credit risk management

  • Loan application assessment – Machine Learning can analyze unstructured data (e.g. transaction descriptions) more thoroughly than other techniques and find non-obvious dependencies. ML techniques can also be combined with traditional scoring models to get even better results.
  • Fraud detection – ML enables nearly fully automated fraud detection, adapting to individual patterns and changing behaviors. It can be applied in areas where a high volume of events needs to be analyzed in real time, e.g. in card payments. AI can find complex correlations, so even the wildest purchase will make sense to AI. Those it can’t wrap its algorithms around will lead to the detection of fraud.
  • Debt collection strategies – AI algorithms can generate a customized communication strategy for each customer, adjusting the contact channel, recommending a script for the call center, or proposing a contact schedule.
  • Continuous portfolio evaluation – detecting SME clients whose risk of default has risen. This enables banks to react rapidly and start the recovery process before other creditors do.

Back-office optimization

  • Workflow documentation – classifying incoming emails to go to the appropriate department (sales, complaints, support) and customer segmentation (individual, SME, Corporate) reduces the manual work involved with organizing customer service departments.
  • Process automation – including for cash operations, trade finance, credit application processing, accounting processes.

Summary

A wide range of ML and AI applications is increasingly being used to solve real business problems in banking. As AI becomes more popular, those applications will become the market standard.

 

This article was prepared in cooperation with Santander Consumer Bank

Santander Consumer Bank - Logo


Using machine learning in credit risk modelling

May 5, 2021/in Data science, Machine learning /by deepsense.ai

Cost of risk is one of the biggest components of banks’ cost structure. Thus, even a slight improvement in credit risk modelling can translate into huge savings. That’s why machine learning is often implemented in this area.

We would like to share with you some insights from one of our projects, where we applied machine learning to increase credit scoring performance. To illustrate our insights, we selected a random pool of 10 000 applications.

How the regular process works

Loan applications are usually assessed through a credit score model, which is most often based on a logistic regression (LR). It is trained on historical data, such as credit history. The model assesses the importance of every attribute provided and translates them into a prediction.

The main limitation of such a model is that it can take into account only linear dependencies between the input variables and the predicted variable. On the other hand, it is this very property that makes logistic regression so interpretable. LR is in widespread use in credit risk modelling.

Credit scoring from a logistic regression model

What machine learning brings to the table

Machine learning enables the use of more advanced modeling techniques, such as decision trees and neural networks. This introduces non-linearities to the model and makes it possible to detect more complex dependencies between the attributes. We decided to use an XGBoost model fed with features selected using a method called permutation importance.
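
A hedged sketch of that setup on synthetic data; the real project used historical credit applications, and the hyperparameters and selection rule here are illustrative.

```python
# Fit an XGBoost classifier, then keep only the features that permutation
# importance marks as useful. Data is synthetic; thresholds are illustrative.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)

result = permutation_importance(model, X_valid, y_valid, scoring="roc_auc",
                                n_repeats=10, random_state=0)
selected = [i for i, imp in enumerate(result.importances_mean) if imp > 0]
print("Selected feature indices:", selected)   # retrain on these before deployment
```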

Credit scoring from tree-based model

However, ML models are usually so sophisticated that they are hard to interpret. Since a lack of interpretability would be a serious issue in such a highly regulated field as credit risk assessment, we opted to combine XGBoost and logistic regression.

Combining the models

We used both scoring engines – logistic regression and the ML-based one – to assess all of the loan applications.

There is a clear correlation between the two assessment approaches: a high score in one model usually means a high score in the other.

Loan applications assessed by the two models

In the original approach, logistic regression was used to assess applications. The acceptance level was set at around 60% and the resulting portfolio risk was 1%.

Initial credit application split (acceptance to portfolio risk)

If we lower the threshold by a couple of points, the acceptance level hits 70% while the risk jumps to 1.5%.

Credit applications’ split after lowering the threshold

We then applied a threshold to the ML model, which brought the acceptance rate back to the original level (60%) while cutting the risk to 0.75%, i.e. 25% lower than the risk level resulting from the traditional approach alone.
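The decision rule itself is simple to express in code. The sketch below uses synthetic scores and arbitrary cut-offs, chosen only to illustrate how acceptance and portfolio risk are measured under a combined LR + ML rule; it does not reproduce the project’s numbers:

```python
import numpy as np

def portfolio_stats(accepted, defaulted):
    """Acceptance rate over all applications and risk (default rate) among accepted ones."""
    acceptance = accepted.mean()
    risk = defaulted[accepted].mean() if accepted.any() else 0.0
    return acceptance, risk

# Synthetic scores for 10,000 applications: the ML score is correlated with the
# LR score, and low-scoring applications are more likely to default.
rng = np.random.default_rng(0)
n = 10_000
lr_score = rng.uniform(size=n)
ml_score = np.clip(lr_score + rng.normal(scale=0.15, size=n), 0.0, 1.0)
defaulted = rng.uniform(size=n) < 0.05 * (1.0 - ml_score)

# Original rule: accept based on the LR score alone.
lr_only = lr_score > 0.40
# Combined rule: loosen the LR cut-off, then require the ML model to agree.
combined = (lr_score > 0.30) & (ml_score > 0.45)

print("LR only:  acceptance %.2f, risk %.4f" % portfolio_stats(lr_only, defaulted))
print("Combined: acceptance %.2f, risk %.4f" % portfolio_stats(combined, defaulted))
```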

Credit applications’ split after applying Machine Learning

Summary

Machine learning is often seen as difficult to apply in banking due to the sheer amount of regulation the industry faces. The facts don’t necessarily back this up: ML is successfully used in numerous heavily regulated industries, and the case above is just one more illustration. Thanks to this innovative approach it is possible to increase the sustainability of the lending sector and make loans even more affordable for bank customers. There’s nothing artificial about that kind of intelligence.

3D meets AI – an unexplored world of new business opportunities

May 22, 2020/in Data science, Deep learning, Machine learning /by Krzysztof Palczewski, Jarosław Kochanowicz and Michał Tadeusiak

AI has become a powerful force in computer vision and it has unleashed tangible business opportunities for 2D visual data such as images and videos. Applying AI can bring tremendous results in a number of fields. To learn more about this exciting area, read our overview of 2D computer vision algorithms and applications.

Despite its popularity, there is nothing inherent to 2D imagery that makes it uniquely suitable for AI applications. In fact, artificial intelligence systems can analyze various forms of information, including volumetric data. Yet even though a growing number of companies already gather 3D data with lidars or 3D cameras, AI applications are not yet mainstream in their industries.

In this post, we describe how to leverage 3D data across multiple industries with the use of AI. Later in the article we’ll take a closer look at the nuts and bolts of the technology and we’ll also show what it takes to apply AI to 3D data. At the end of the post, you’ll also find an interactive demo to play with.

In the 3D world, there is no Swiss Army Knife

3D data is what we call volumetric information. The most common types include:

  • 2.5D data, which includes information on depth or the distance to visible objects, but no volumetric information about what’s hidden behind them. Lidar data is an example.
  • 3D data, with full volumetric information. Examples include MRI scans or objects rendered with computer graphics.
  • 4D data, where volumetric information is captured as a sequence, and the outcome is a recording where one can go back and forth in time to see the changes occurring in the volume. We refer to this as 3D + time, which we can treat as the 4th dimension. Such representation enables us to visualize and model dynamic 3D processes, which is especially useful in medical applications such as respiratory or cardiac monitoring.

There are also multiple data representations. These include a stack of 2D images along the normal axis, a sparse point-cloud representation and a voxelized representation. Such data can also have additional channels, like reflectance at every point of a lidar’s view.
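As a quick illustration of one of these representations, a raw point cloud can be turned into a voxel occupancy grid with a few lines of NumPy (the points below are random stand-in data):

```python
import numpy as np

# Stand-in point cloud: N points with (x, y, z) coordinates plus a reflectance channel.
points = np.random.rand(100_000, 4) * [80.0, 80.0, 4.0, 1.0]

voxel_size = 0.25  # meters per voxel edge
voxel_coords = np.floor(points[:, :3] / voxel_size).astype(np.int32)

# Occupancy grid: mark every voxel that contains at least one point.
grid_shape = voxel_coords.max(axis=0) + 1
occupancy = np.zeros(grid_shape, dtype=bool)
occupancy[voxel_coords[:, 0], voxel_coords[:, 1], voxel_coords[:, 2]] = True

# The same integer coordinates can also be used to aggregate per-voxel features,
# e.g. the mean reflectance of the points falling into each voxel.
print(occupancy.shape, int(occupancy.sum()), "occupied voxels")
```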

Depending on the business need, there can be different objectives for using AI: object detection and classification, semantic segmentation, instance segmentation and movement parameterization, to name a few. Moreover, every setup has its own characteristics and limitations that should be addressed with a dedicated approach (or, in the case of artificial neural networks, with a sophisticated and thoroughly designed architecture). These are the main reasons our clients come to us and take advantage of our experience in the field: we deliver the AI part of specific projects, even when the majority of our clients’ competencies are built in-house.

Let us have a closer look at a few examples

1. Autonomous driving

  • Task: 3D object detection and classification
  • Data: 2.5D point clouds captured with a lidar; sparse data, large distances between points

Autonomous driving data are very sparse because:

  • the distances between objects in outdoor environments are significant,
  • in the majority of cases, lidar rays from the front and rear of the car don’t return to the lidar, since there are no objects to reflect them,
  • the resolution of objects gets worse the further they are from the laser scanner; due to the angular expansion of the beam, it’s impossible to determine the precise shape of distant objects.

For autonomous driving, we needed a system that can take advantage of data sparsity to infer 3D bounding boxes around objects. One such network is the part-aware and part-aggregation neural network, i.e. Part-A2 net (https://arxiv.org/abs/1907.03670). This is a two-stage network that exploits the high separability of 3D objects, which effectively provides segmentation information for free.

In the first stage, the network estimates the position of foreground points of objects inside bounding boxes generated by an anchor-based or anchor-free scheme. Then, in the second stage, the network aggregates local information for box refinement and class estimation. The network output is shown below, with the colors of points in bounding boxes showing their relative location as perceived by the Part-A2 net.
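Part-A2 itself is a full two-stage detection network, but the basic idea of turning a cluster of foreground points into a 3D box can be illustrated with a much simpler, axis-aligned sketch (synthetic points, not the network’s actual refinement step):

```python
import numpy as np

def points_to_aabb(points):
    """Axis-aligned 3D box (center and size) enclosing a cluster of foreground points."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    return (mins + maxs) / 2.0, maxs - mins

# Synthetic cluster of lidar points segmented as belonging to a single car.
car_points = np.random.normal(loc=[10.0, 2.0, 0.8], scale=[2.0, 0.8, 0.7], size=(500, 3))
center, size = points_to_aabb(car_points)
print("center:", center, "size:", size)
```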

Source of image: From Points to Parts: 3D Object Detection from Point Cloud with Part-aware and Part-aggregation Network

2. Indoor scene mapping

  • Task: Object instance segmentation
  • Data: Point clouds, sparse data, relatively small distances between points

A different setup is called for in mapping indoor environments, such as instance segmentation of objects in office spaces or shops (see the S3DIS dataset for better intuition). Here we employ a relatively high-density point cloud representation and the BoNet architecture.

In this case the space is divided into a 1- x 1- x 1-meter cubic grid. In each cube, a few thousand points are sampled for further processing. In an autonomous driving scenario, such a grid division would make little sense given the sheer number of cubes produced, many of which are empty and only a few of which contain any relevant information.
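A simplified sketch of this grid division and per-cube sampling, assuming a plain NumPy point cloud (the scene and sample size below are made up):

```python
import numpy as np

points = np.random.rand(200_000, 3) * [30.0, 20.0, 3.0]  # stand-in indoor scan
cube_size = 1.0                                           # 1 x 1 x 1 m grid
n_samples = 4096                                          # points sampled per cube

# Assign every point to a cube, then draw a fixed-size sample from each cube.
cube_ids = np.floor(points / cube_size).astype(np.int32)
_, inverse = np.unique(cube_ids, axis=0, return_inverse=True)

batches = []
for cube in range(inverse.max() + 1):
    members = np.where(inverse == cube)[0]
    # Sampling with replacement keeps sparsely populated cubes at the same batch size.
    chosen = np.random.choice(members, size=n_samples, replace=True)
    batches.append(points[chosen])
```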

The network produces semantic segmentation masks as well as bounding boxes. The inference is a two-stage process. The first stage produces a global feature vector used to predict a fixed number of bounding boxes, along with scores indicating whether any of the predicted classes fall inside those boxes. The point-level and global features derived in the first stage are then used to predict a point-level binary mask with the class assignment. The pictures below show a typical scene with the segmentation masks.

An example from the S3DIS dataset. From left: input image, semantic segmentation labels, instance segmentation labels

3. Medical diagnosis

  • Task: 3D Semantic segmentation
  • Data: Stacked 2D images, dense data, small distance between images

This is a highly controlled setup, where all 2D images are carefully and densely stacked together. Such a representation can be treated as a natural extension of a 2D setup. In such cases, modifying existing 2D approaches will deliver satisfactory results.

An example of a modified 2D approach is the 3D U-Net (https://arxiv.org/abs/1606.06650), where all 2D operations for a classical U-Net are replaced by their 3D counterparts. If you want to know more about AI in medicine, check out how it can be used to help with COVID-19 diagnosis and other challenges.
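The “replace 2D operations with their 3D counterparts” idea maps directly to code. Below is a minimal PyTorch sketch of a single encoder stage, not the full 3D U-Net from the paper:

```python
import torch
import torch.nn as nn

class EncoderBlock3D(nn.Module):
    """One 3D U-Net encoder stage: Conv2d/MaxPool2d simply become Conv3d/MaxPool3d."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool3d(kernel_size=2)

    def forward(self, x):
        features = self.convs(x)          # kept for the decoder's skip connection
        return self.pool(features), features

# A (batch, channels, depth, height, width) volume, e.g. a cropped CT scan.
volume = torch.randn(1, 1, 32, 64, 64)
downsampled, skip = EncoderBlock3D(1, 32)(volume)
print(downsampled.shape, skip.shape)
```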

Source: Head CT scan

4. A 3D-enhanced 2D approach

There is also another case where, luckily, it can be relatively straightforward to apply expertise and technology developed for 2D cases to 3D applications. One such scenario is when 2D labels are available, but the data and the inference products are in 3D. Another is when 3D information can play a supportive role.

In such a case, a depth map produced by a 3D camera can be treated as an additional image channel beyond the regular RGB colors. Such additional information increases the network’s sensitivity to edges and thus yields better object boundaries.
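In code this simply means feeding the network a 4-channel RGB-D tensor instead of a 3-channel RGB one; a minimal PyTorch sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

# Illustrative batch: an RGB image plus an aligned, normalized depth map from a 3D camera.
rgb = torch.rand(1, 3, 480, 640)
depth = torch.rand(1, 1, 480, 640)

rgbd = torch.cat([rgb, depth], dim=1)  # (1, 4, H, W): depth becomes a 4th channel

# The only architectural change needed: the first convolution accepts 4 input channels.
first_conv = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=3, padding=1)
features = first_conv(rgbd)
print(features.shape)  # torch.Size([1, 64, 480, 640])
```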

Source: Azure Kinect DK depth camera

Examples of the projects we have delivered in such a setup include:

  • Defect detection based on 2D and 3D images.

We developed an AI system for a tire manufacturer to detect diverse types of defects. 3D data played a crucial role as it allowed for ultra-precise detection of submillimeter-size bubbles and scratches.

  • Object detection in a factory

We designed a system to detect and segment industrial assets in a chemical facility that had been thoroughly scanned with high resolution laser scanners. Combining 2D and 3D information allowed us to digitize the topology of the installation and its pipe system.

3D data needs a mix of competencies

At deepsense.ai, we have a team of data scientists and software engineers handling the algorithmic, visualization, and integration capabilities. Our teams are set up to flexibly adapt to specific business cases and provide tailor-made AI solutions. The solutions they produce are an alternative to pre-made, off-the-shelf products, which often prove too rigid and constrained; they fail once user expectations deviate from the assumptions of their designers.

Processing and visualizing data in near real time with an appropriate user experience is no piece of cake. Doing so requires a tough balancing act: combining specific business needs, technical limitations resulting from huge data loads, and the need to support multiple platforms.

It is always easier to discuss things based on an example, so the next section shows what it takes to develop an object detection system for autonomous vehicles with outputs accessible from a web browser. The goal is to predict bounding boxes for 3 different classes (car, pedestrian and cyclist) 360 degrees around the car. Such a project can be divided into 4 interconnected components: data processing, algorithms, visualization and deployment.

Data preprocessing

In our example, we use the KITTI and A2D2 datasets, two common datasets for autonomous driving and ones our R&D hub relies on heavily. In both datasets, we use data from spinning lidars for inference and cameras for visualization purposes.

Lidars and cameras work independently, capturing data at different rates. To obtain a full picture, all data have to be mapped to a common coordinate system and adjusted for time. This is no easy task. As lidars are constantly spinning, each point is captured at a different time, while the position and rotation of the car in relation to world coordinates keep changing. Meanwhile, the precise location and angle of the car are never known perfectly due to the limitations of geolocation systems such as GPS. These factors make it extremely difficult to determine the absolute positions of surrounding objects precisely and stably (SLAM can be used to tackle some of these problems).
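Once the calibration between sensors is known, mapping points between frames boils down to chains of rigid transforms. A minimal NumPy sketch, with a made-up rotation and translation standing in for real calibration data:

```python
import numpy as np

# Made-up calibration: a rotation matrix and translation vector mapping points
# from the lidar frame to the camera frame. Real values come from the dataset's
# calibration files.
R = np.array([[0.0, -1.0,  0.0],
              [0.0,  0.0, -1.0],
              [1.0,  0.0,  0.0]])
t = np.array([0.05, -0.30, 0.60])

def lidar_to_camera(points_lidar, R, t):
    """Rigid transform x_cam = R @ x_lidar + t applied to an (N, 3) point array."""
    return points_lidar @ R.T + t

points_lidar = np.random.rand(1000, 3) * [60.0, 20.0, 3.0]
points_camera = lidar_to_camera(points_lidar, R, t)
```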

Fortunately, absolute positioning of objects around the vehicle is not always required.

Algorithms

There are many possible approaches when it comes to 3D data. However, factors such as the distance to and between objects and the high sparsity of the data play an essential role in which algorithm we ultimately settle on. As in the first example above, we used Part-A2 net.

Deployment

We relied on a complete, in-house solution for visualization, data handling, and UI, and used our expertise in the Unity engine to develop a cross-platform, graphically rich and fully flexible solution. In terms of platform, we opted for maximum availability, which is satisfied by a popular web browser such as Chrome or Firefox, with WebGL as Unity’s compilation target.

Visualization/UI

WebGL, while very comfortable for the user, disables drive access and advanced GPU features, limits available RAM to 2GB and processing to a single thread. Additionally, while standalone solutions in Unity may rely on existing libraries for point cloud visualization, making it possible to visualize hundreds of millions of points (thanks to advanced GPU features), this is not the case in WebGL.

Therefore, we have developed an in-house visualization solution enabling real-time, in-browser visualization of up to 70 million points. Give it a try!

Such visualization can be tailored to a company’s specific needs. In a recent project, we took a different approach: we used AR glasses to visualize a factory in all its complexity. This enabled our client to reach a next-level user experience and see the factory in a whole new light.

Summary

We hope that this post has shed some light on how AI can be used with 3D data. If you have a particular 3D use case in mind or you are just curious about the potential for AI solutions in your field, please reach out to us. We’ll be happy to share our experience and discuss potential ways we can help you apply the power of artificial intelligence in your business. Please drop us an email at contact@deepsense.ai.
