Blog posts Archives - deepsense.ai

Optimizing Computational Resources for Machine Learning and Data Science Projects: A Practical Approach

June 5, 2024/in Data science, Machine learning /by Łukasz Gębala

Every computation requires computing resources. Sure, sometimes a regular calculator, a piece of paper, and a pencil are sufficient. However, in machine learning, powerful computing resources are necessary:

The model needs to be fed with a massive amount of data.
Appropriate calculations must be performed for each data point to process it into a pattern.
Some parameters must be adjusted to teach the model the correct mappings, necessitating further recalculations and computational resources.

Your teammates also need to train models. Ultimately, the amount of computational resources is always insufficient. Nevertheless, there are ways to:

Reduce this deficit.
Increase the utilization of available resources.
Gain more freedom in developing your research or building a startup.

This article may provide useful insights if you have encountered a similar situation.

Challenges of Computational Resource Allocation

At deepsense.ai, we specialize in addressing machine learning and data science challenges with custom solutions tailored to our clients’ specific needs. With clients spanning various industries, we encounter diverse problems that demand adaptability and versatility. Our in-house computational resources are utilized for developing and testing these solutions. While cloud computing is trendy, it may not always be practical due to cost, availability of on-demand GPUs, or data confidentiality concerns.

When working on client projects, we often encounter a situation where multiple Machine Learning Engineers or Data Scientists need to share limited computational resources for their training sessions. Allocating these resources efficiently is challenging, especially during the solution design stage when we need to experiment with various parameters and models. Establishing a fixed schedule for resource usage is not feasible as we can’t always estimate the time required for computations.

Existing Solutions and Their Implications

It has become apparent that this issue is not unique to us, and that others have also encountered and addressed it. Solutions exist that may not fully resolve the problem, but can certainly mitigate it significantly. This challenge prompted the development of supercomputers and computing clusters, and we can draw upon the experience gathered there. Furthermore, we can apply this knowledge to cloud resources, should the need and the opportunity arise.

Adopting SLURM for Efficient Resource Management

We have established our cluster using SLURM (formerly known as Simple Linux Utility for Resource Management). Our decision is based on several factors. First, SLURM supports all required resources, including CPU, RAM, and GPU. Second, it is compatible with the Linux operating system, Python, and other AI tools and models we commonly use. Third, SLURM is a stable and widely used solution. It is estimated that approximately 60% of the supercomputers on the Top500 list run on SLURM, and some of our staff have encountered it during their academic work.

How SLURM Enhances Resource Efficiency

This way, we have a tool that lets us “request” the resources needed to carry out a planned task. If the required resources are not available, the task will be queued and launched as soon as the resources become available. The workload manager will handle this without requiring the special involvement of an engineer at this stage. The engineer who submitted the task will receive email notifications when the task starts and ends. Since computing resources operate 24/7, this queuing method allows for more efficient use of resources, as tasks can be carried out outside of regular working hours. Additionally, with different types of resources available, one can “request” specific resource models or just a “type” of resource, which enhances the flexibility of this solution.

User Interaction with the SLURM Cluster

From the user’s perspective, we communicate with the cluster through a login node. This node is where tasks are prepared and configured before queuing and running. The controller, which knows the states of all compute nodes, then allocates the appropriate resources by assigning them to subsequent tasks in the queue.

Simplified schema of SLURM components. Icons source: draw.io

Optimizing SLURM Implementation for Machine Learning Workloads

When setting up our SLURM implementation, we carefully considered the available computing hardware, storage options, and data access performance and security requirements. The key conclusions we reached during this phase, and upon revisiting it after implementation, were as follows:

Network: The cluster should have a fast internal network connection of at least 10 Gbit/s. AI model training often involves large datasets that must reach the compute node(s) where the job will run. Even if the node can handle the task quickly, the job won’t be efficient if fetching the data significantly slows it down. Ideally, such a network should be redundant to avoid a single point of failure that could take the entire cluster out of service.
Data storage: Data should be easily accessible to each compute node, fast to read and write (since computation results need to be saved), and secure. A node failure should not affect the data. NAS servers and distributed network file systems like Ceph can fulfill these requirements.
Operating system unification: All compute nodes should run under the control of the same operating system version. The available system and programming libraries should also be identical on each node. We cannot allow a situation where the programmed task code cannot run because a node lacks a required library. We base our systems on Ubuntu, which allows us to create code on Ubuntu desktops and then run it on an Ubuntu cluster. This consistency makes it easier and faster for us to design and develop solutions.
Computational libraries: The development of external libraries and models progresses every day. We need to work with different versions of libraries due to varying requirements. Given the number and variety of projects our engineers work on, each Machine Learning Engineer requires considerable disk space. Adding up the needs of all engineers only increases the complexity. We implemented a general library store using a network file system and LMOD to address this. This allows us to maintain only one copy of a given library (in a given version), available at any time on any node. Engineers can enable the libraries they need, disable the ones they don’t, or experiment with different library versions. This approach reduces the need to clean up environments of excess libraries.
Temporary data: According to the architecture of such a cluster, users can only submit jobs through the access node and cannot select a specific compute node on which the job will run. Therefore, all necessary elements, such as code or temporary data, must be available on each compute node. Users upload the required files through the access node, and the compute nodes have immediate access to them. Distributed network file systems are ideal for this purpose. Examples include Ceph and Lustre. For data downloaded from NFS, we use FS-Cache, which works very well for frequently used files.

Example of Cluster Configuration

Let’s use the example of our test cluster, consisting of one access node, two compute nodes without GPUs, and two compute nodes with four GPUs each. If we develop the solution well, we can scale it quite freely (by adding more nodes) and implement it (or help with implementation) on the client’s infrastructure or in the cloud.

The backbone is a 2 x 10Gbps Ethernet network based on redundant switches, enabling speeds up to 20Gbps. This technology is reasonable in our case as it does not expose us to additional costs and offers good performance.

Data Management and Efficiency

For data storage, we use a NAS array and a network file system (FS) built with Ceph. In this context, “data” should be understood as everything needed for work that must be saved on disk and available on each compute node: input data for calculations (mostly customer data), libraries, models, repositories, code, temporary data, etc. This setup also allows us to react flexibly and distribute data properly depending on the quantity and type of projects we are working on. Additionally, Ceph allows us to manage engineers’ temporary data, such as models, sensibly. If one engineer needs to use model X, this model is downloaded and saved in the temporary cache. If another engineer wants to use model X, they do not have to download it again because the model is already available in the cache. Over time, this workflow improvement saves disk space and enhances efficiency.

*A small off-topic here – we omit security issues such as data encryption, communication encryption, and permission management of who can see what, etc. These are beyond the scope of this article. However, they are considered, implemented, and practiced following industry standards.*

Efficient Library Management with LMOD

The next question was what we could do with the excess libraries needed by our engineers, downloaded and saved here and there. Perhaps some part (a significant one, as it turned out) could be shared by installing them once, in one place, and making them available “in one go” to all nodes. Drawing on the experience of others, we used LMOD, an environment modules system. It allows you to make both libraries and programs available as loadable modules. This, in turn, makes it quick and easy to switch between different libraries or programs and between their versions. At the same time, we save space and time because one copy of the software is installed and made available for general use. From our point of view, it is important to switch easily between different versions of Python or CUDA and to test different versions of AI-related libraries such as PyTorch, NumPy, TensorFlow, and many others.

Practical Configuration of a SLURM Cluster

Access Node and Cluster Controller

Our access node will also be the cluster controller, running two main cluster services: slurmctld and slurmdbd. If you have the option, it is worth separating these services and the access node into three independent machines. This solution is more sensible in large environments (actual supercomputers, university labs, etc.). However, we are not building a supercomputer; we are making effective use of the hardware resources we have. Moreover, we have checked our needs and capabilities, and tests have shown that this solution fully meets our needs.

Node Authentication with Munge

SLURM uses Munge to authenticate communication between nodes. This service should be installed and launched with the same (secret) key on all nodes so that they can communicate with each other. It’s important to have Munge working before starting SLURM.

Configuring slurmdbd Service

The slurmdbd service saves information about executed tasks to the database. This is useful for statistical and accounting purposes – we can verify how many resources a given project or type of task consumed, how many tasks individual users launched, etc. Its configuration is simple and comes down to specifying database connection parameters (MySQL or MariaDB). Example configuration files are well documented. The most important options to configure are:

AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2

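StorageType=accounting_storage/mysql
# Host on which slurmdbd itself runs (here: the cluster controller)
DbdHost=slurm-ctrl.example.com
# Host and credentials of the MySQL/MariaDB database
StorageHost=db-host.example.com
StorageLoc=slurm_DB
StorageUser=slurm
StoragePass=ThePasswordThatShouldBeProtected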

In other words, these options cover the previously mentioned Munge authentication and the access data for the database.

Configuring slurmctld Service

The main controller is the slurmctld service. It is responsible for queueing tasks, allocating resources, and monitoring the states of nodes. Its configuration can be divided into two parts: the configuration of the load and resource manager and the actual resources of the cluster. The manager configuration is very rich in options, and it is worth studying the documentation to set the appropriate options for yourself. Let’s define our cluster:

AuthType=auth/munge
ClusterName=slurm-lab
SlurmctldHost=slurm-ctrl.example.com
GresTypes=gpu

As we already know, we use Munge for authentication. We name our cluster, specify the controller’s host, and declare GPUs as a generic resource type (Gres stands for Generic RESource) that the controller will manage.

Scheduling and Resource Selection

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory

Backfill is the default plugin for managing the schedule in SLURM. It results in better utilization of the entire cluster, e.g., lower-priority tasks will be run if they do not interfere with the (predicted) execution of higher-priority tasks. The select/cons_tres plugin (in conjunction with the OverSubscribe parameter set for partitions) determines the ability to share unused resources (as opposed to the select/linear plugin, which operates at the level of entire nodes). Finally, the CR_CPU_Memory parameter sets the “logical” processor and memory as the “units” that we operate on when allocating resources (as opposed to, for example, CR_Core or CR_Socket). We can also use the CR_CPU parameter – then RAM will not be tracked when allocating resources to tasks.
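To make the idea of backfilling more concrete, here is a minimal sketch of a single scheduling pass. It is a deliberately simplified model and not SLURM's actual algorithm: it assumes a single pool of CPUs, protects only the first blocked job with a start-time reservation, and treats the requested time limit as the real runtime. The names `Job` and `backfill_pass` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpus: int
    runtime: int  # requested time limit, in hours

def backfill_pass(total_cpus, running, queue):
    """One scheduling pass at time 0.

    running: list of (cpus, hours_remaining) for jobs already executing.
    queue:   pending jobs, highest priority first.
    Returns the names of the jobs that can start right now.
    """
    free = total_cpus - sum(cpus for cpus, _ in running)
    started = []
    reservation_start = None  # expected start time of the first blocked job
    for job in queue:
        if reservation_start is None:
            if job.cpus <= free:
                # Enough free CPUs: start the job in priority order.
                free -= job.cpus
                started.append(job.name)
                running = running + [(job.cpus, job.runtime)]
            else:
                # Blocked: find the earliest time at which enough CPUs are released.
                reservation_start = float("inf")
                available = free
                for cpus, end in sorted(running, key=lambda r: r[1]):
                    available += cpus
                    if available >= job.cpus:
                        reservation_start = end
                        break
        elif job.cpus <= free and job.runtime <= reservation_start:
            # Backfill: the job fits now and finishes before the reservation.
            free -= job.cpus
            started.append(job.name)
    return started

queue = [Job("A", cpus=8, runtime=10), Job("B", cpus=1, runtime=3), Job("C", cpus=1, runtime=6)]
print(backfill_pass(8, running=[(6, 4)], queue=queue))  # ['B']
```

In this example, job A needs the whole (hypothetical) 8-CPU machine and must wait 4 hours. Job B is short enough to finish before A is due to start, so it runs immediately; job C would overlap with A's reservation and keeps waiting.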

Many other parameters allow us to adjust the cluster’s operation to our needs. For example, we can specify programs that will be run before and/or after the actual computational tasks, parameters for accounting statistics (duration, allocated resources, projects, etc.), or power-saving management. The flexibility of configuration is high.

Defining Cluster Resources

It’s time to define our resources. In this part, we define nodes. They can differ in parameters such as the presence or absence of GPUs and the type of GPUs, but also simply in the number of CPUs and the amount of RAM. Therefore, it is necessary to describe precisely what each node has so that the controller knows how to allocate tasks depending on the requested resources, e.g.:

NodeName=lab-gpu-01 RealMemory=122880 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Feature="rtx4080" Gres=gpu:4

The lab-gpu-01 node has 120GB of RAM (the amount of memory we want to allocate for executing tasks assigned in the cluster), 2 processors with 12 cores each, and each core can execute two threads, along with 4 GPUs. By specifying the “Feature” parameter, you can introduce an additional level for the requested resources. For example, if there are different GPUs or CPUs in the nodes, you can request the execution of a task on a specific type of desired resource. If we don’t specify anything, the task will be launched on the first available resource (in the requested quantity). Additionally, in the gres.conf file, we define these generic resources (in our case, GPUs) more precisely, for example:

NodeName=lab-gpu-01 Name=gpu File=/dev/nvidia[0-3]

In this way, SLURM will know that the devices /dev/nvidia[0-3] are responsible for GPU-type resources, allowing it to allocate them (and block other tasks from using them).
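The bracket notation used in these definitions (`/dev/nvidia[0-3]`, `lab-gpu-[01,02]`) is shorthand for a list of names. The sketch below shows roughly how such an expression expands. It is an illustration only: `expand_brackets` is a hypothetical helper that handles a single bracket group, whereas on a real cluster `scontrol show hostnames` performs this expansion for node lists.

```python
import re

def expand_brackets(expr: str) -> list[str]:
    """Expand one [..] group containing comma-separated values and ranges."""
    match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", expr)
    if match is None:
        return [expr]  # nothing to expand
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # keep zero padding, e.g. 01..02
            for number in range(int(start), int(end) + 1):
                names.append(f"{prefix}{str(number).zfill(width)}{suffix}")
        else:
            names.append(f"{prefix}{part}{suffix}")
    return names

print(expand_brackets("/dev/nvidia[0-3]"))
# ['/dev/nvidia0', '/dev/nvidia1', '/dev/nvidia2', '/dev/nvidia3']
print(expand_brackets("lab-gpu-[01,02]"))
# ['lab-gpu-01', 'lab-gpu-02']
```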

Organizing Nodes into Partitions

Finally, we organize nodes into partitions, which are logical groups of nodes. A node can belong to more than one partition, but this should be carefully considered and probably tested in practice. The minimum is to create one partition and put all nodes into it, and the controller will handle organizing tasks according to the requested resources. However, we can (in our example) divide partitions based on resources. For example, one partition has nodes with GPUs, and another has nodes without GPUs. Example:

PartitionName=ml-cpu Nodes=lab-cpu-[01,02] Default=YES MaxTime=5-00:00:00 DefaultTime=04:00:00 State=UP AllowGroups=ml-users
PartitionName=ml-gpu Nodes=lab-gpu-[01,02] MaxTime=5-00:00:00 DefaultTime=04:00:00 State=UP AllowGroups=ml-users

We have two partitions with two nodes each: one containing only CPU nodes (this is the default partition) and the other containing GPU nodes. We also set the maximum job duration to 5 days and the default to 4 hours, while AllowGroups specifies that only users in the ml-users group can use these partitions.
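The time limits above use the `D-HH:MM:SS` and `HH:MM:SS` notations. As a small illustration of how to read them, the sketch below converts both forms to seconds. SLURM accepts further time formats that this hypothetical helper, `slurm_time_to_seconds`, does not handle.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert 'D-HH:MM:SS' or 'HH:MM:SS' to a number of seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(slurm_time_to_seconds("5-00:00:00"))  # 432000 (MaxTime: 5 days)
print(slurm_time_to_seconds("04:00:00"))    # 14400  (DefaultTime: 4 hours)
```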

Please note that most of the above features/parameters are highly individual. Available resources are an obvious determinant, but the specifics (needs) of users and, finally, the type of projects (i.e., tasks being run) will significantly influence the final configuration.

Verifying Cluster Configuration

We can check the status of the configured cluster by executing the sinfo command on the controller, and the result should look as follows:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ml-cpu*      up 5-00:00:00      2   idle lab-cpu-[01-02]
ml-gpu       up 5-00:00:00      2   idle lab-gpu-[01-02]

Practical Experiences with Using SLURM

Accessing the Cluster

Time to start calculating our models. We gain access to the cluster via SSH through the login node, which in this case is the controller. It is on this node that we configure, prepare, and later run computational tasks. Therefore, we can upload the necessary input data for computations or code, or use SSH to download the code from a repository and prepare a virtual environment. Whatever we need. We should remember that because we are using network file systems shared between nodes, all our preparations will be immediately available on every compute node.

Tasks can be run in two modes:

Interactive Mode: Use the srun or salloc commands. Remember that such tasks are run “here and now” (depending on resource availability). If our SSH session is interrupted, the running task will also be interrupted, and resources will be released.
Batch Mode: Use the sbatch command along with an sbatch script. Such a task is put into the task schedule and will run (from our point of view) in the background. After submitting it, you can immediately disconnect from the access node and, for example, wait for an email notification about its completion.

Interactive Task Example

Interactive tasks are useful for verifying if our computation preparations are working correctly, for example, if we have included all the necessary libraries, if the correct paths to data have been provided, and finally, if the code itself is invoked correctly. For example:

srun -p ml-gpu \
     -N 1 \
     --ntasks-per-node=1 \
     --mem-per-cpu=10GB \
     --gres=gpu:1 \
     --constraint="rtx4080" \
     --time=2:00:00 \
     -A LAB-test \
     --pty /bin/bash -l

We are launching a task on the (-p) ml-gpu partition, allocating one node (-N), one CPU core (--ntasks-per-node), 10GB of memory (--mem-per-cpu), and one GPU card (--gres) of type rtx4080 (--constraint), for a duration of 2 hours (--time) within the LAB-test project (-A). Finally, the task itself is to run a bash shell there.

This way, we get access to a compute node with the requested resources allocated. Then we load the necessary LMOD modules, e.g.:

module load apps/python/3.10
module load apps/cuda/11.7
module load libs/python/torch/1.13.1-cuda-11.7-python-3.10
module load libs/python/neptune-client/1.9.1-python-3.10

Thanks to LMOD, we immediately get access to the required Python version, CUDA libraries, and PyTorch, as well as the appropriate Neptune library, which allows us to save computation results for easy and efficient analysis later on. Assuming we have several available versions of such programs or libraries, switching between them is easy, making it easier to develop our code. Finally, we run the code itself, e.g.:

source ~/venv/bin/activate
python ~/code/program.py --data=/path/to/source/data

If necessary, we make appropriate corrections – add libraries, fix the code, and upload additional data. When everything is working correctly, we end the interactive session with:

logout

Back on the access node, and based on the steps tested above, we prepare an sbatch script with the following content:

#!/bin/bash -l
## Project name
#SBATCH -A LAB-test
## Task name
#SBATCH -J Experiment-01
## E-mail notifications about job progress
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=username@example.com
## Number of nodes
#SBATCH -N 2
## Tasks per node (each task gets one CPU core unless requested otherwise)
#SBATCH --ntasks-per-node=20
## RAM amount per computing core
#SBATCH --mem-per-cpu=5GB
## Maximum task duration (format D-HH:MM:SS)
#SBATCH --time=1-00:00:00
## Partition select (by default ml-cpu, according to SLURM config)
#SBATCH -p ml-gpu
## Amount of GPU per node
#SBATCH --gres=gpu:4
## Save output logs to file
#SBATCH --output="/home/username/slurm-logs/experiment-01.%j-%N.out.log"
## Save error logs to file
#SBATCH --error="/home/username/slurm-logs/experiment-01.%j-%N.err.log"
## We may request specific feature - rtx4080 in this case
#SBATCH --constraint="rtx4080"
## Now we repeat steps checked during interactive task
## which is: load necessary modules
module load apps/python/3.10
module load apps/cuda/11.7
module load libs/python/torch/1.13.1-cuda-11.7-python-3.10
module load libs/python/neptune-client/1.9.1-python-3.10
## activate v-env and launch job
source ~/venv/bin/activate
python ~/code/program.py --data=/path/to/source/data

Please note that in the sbatch script, all comments start with ##. On the other hand, #SBATCH indicates parameters passed to SLURM – these must be understandable and acceptable to it.

Running the Batch Job

This time, we want to run the task on 2 nodes from the ml-gpu partition. On each of them, we allocate 20 cores with 5GB of memory for each core (i.e., 100GB per node) and 4 GPUs. We additionally specify that these should be rtx4080 cards. The task should not take longer than 1 day.
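Before submitting such a job, it can help to check the request against what a node actually offers. The sketch below repeats the arithmetic using the `lab-gpu-01` definition shown earlier. It assumes one CPU per task and whole gigabytes, and the helper names (`node_capacity`, `per_node_request`, `fits`) are hypothetical; SLURM performs its own, more detailed validation at submission time.

```python
import shlex

NODE_LINE = ('NodeName=lab-gpu-01 RealMemory=122880 Sockets=2 CoresPerSocket=12 '
             'ThreadsPerCore=2 Feature="rtx4080" Gres=gpu:4')

def node_capacity(line: str) -> dict:
    """Derive logical CPUs, memory in GB, and GPU count from a NodeName line."""
    fields = dict(token.split("=", 1) for token in shlex.split(line))
    return {
        "cpus": int(fields["Sockets"]) * int(fields["CoresPerSocket"]) * int(fields["ThreadsPerCore"]),
        "mem_gb": int(fields["RealMemory"]) // 1024,  # RealMemory is given in megabytes
        "gpus": int(fields["Gres"].split(":")[-1]),
    }

def per_node_request(tasks_per_node: int, mem_per_cpu_gb: int, gpus: int) -> dict:
    """Resources one node must provide, assuming one CPU per task."""
    return {"cpus": tasks_per_node, "mem_gb": tasks_per_node * mem_per_cpu_gb, "gpus": gpus}

def fits(request: dict, capacity: dict) -> bool:
    """True if every requested quantity is within the node's capacity."""
    return all(request[key] <= capacity[key] for key in request)

capacity = node_capacity(NODE_LINE)
request = per_node_request(tasks_per_node=20, mem_per_cpu_gb=5, gpus=4)
print(capacity)                         # {'cpus': 48, 'mem_gb': 120, 'gpus': 4}
print(request, fits(request, capacity)) # {'cpus': 20, 'mem_gb': 100, 'gpus': 4} True
```

With two such nodes, the job occupies 40 logical CPUs, 200GB of RAM, and all 8 GPUs of the partition in total.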

The parameters in this script are only an example. In practice, some tasks take several days to compute. There are also (rare) situations where we can use most or even all of our resources for one task; the task’s duration is then significantly shortened. Knowing how to select parameters and estimate run times comes with experience, but we already have some, and after all, we want to use our resources more efficiently.

Finally, we specify in which case (BEGIN, END, FAIL) and to which address notifications about the job status should be sent.

From now on, we can run as many such tasks as we need. If the requested resources are not available at the moment, the tasks will be queued and launched as resources become available. So, after a whole day of preparing code and calculations, we can safely queue them and go for a well-deserved rest, returning the next day to find the calculations in progress or even the results ready.

Conclusion

We wanted to use our computing resources effectively. We have a variety of hardware, both old and new, and we can now use each resource to the limits of its capabilities. Ultimately, this translates into calm and efficient work for us and into solutions for our customers. SLURM itself is flexible and offers enormous possibilities. Combined with a network FS, LMOD, and thoughtfully designed resource partitioning, it gives us a tool that spares us from waiting (im)patiently for computing resources. Of course, these are always in short supply, but a cluster working 24/7 makes access to them easier for us.

Let us know if you need professional consultation, help with your SLURM configuration, or have questions about the above article!


AI Copilot’s Impact on Productivity in Revolutionizing Ada Language Development

July 15, 2024/in Data science, Machine learning /by deepsense.ai

How can we boost Ada software developers’ productivity? We teamed up with AdaCore to create a proof-of-concept copilot solution for the Ada programming language. This is the story of how we approached the challenge.

Project Overview

The Copilot for Ada programming language project aimed to research and develop a proof-of-concept code completion tool and evaluate its performance on the Ada code generation task. Its idea is to boost Ada software developers’ productivity by providing intelligent code completions and suggestions, improving the pace of task automation, and saving significant amounts of time on repetitive and boilerplate code.

Services provided

AI Research
AI Implementation

Key objectives

The solution aimed to prepare the groundwork for significantly boosting the effectiveness of Ada developers in the future by developing an intelligent code completion and suggestion solution.

Key outcomes

A demo application for testing the LLM-based code completion solution on any Ada code samples,
Fine-tuned checkpoints of the selected coding LLMs, such as StarCoder and CodeGen, and their comparison with the baseline pre-trained models,
Recommendations on further steps for fine-tuning and enhancing Ada-specific code completion models.

Accelerating AI Integration with a Proof of Concept

M. Anthony Aiello, Head of Product & Innovation at AdaCore

“deepsense.ai quickly delivered a Proof of Concept for a code completion tool, using a state-of-the-art technological stack, including the newest available LLMs and libraries. They also led an excellent LLM discovery workshop that jump-started AdaCore’s integration of LLM solutions into our business processes and products. Their technical knowledge and commitment to delivering tailored, top-notch services were evident throughout our collaboration. Partnering with deepsense.ai has helped us accelerate our understanding of AI, implement AI solutions, and gain a strategic edge in today’s competitive landscape.”

Client background

AdaCore specializes in software development tools and services, primarily focused on Ada language programming for high-integrity systems to meet rigorous requirements for reliability, safety, security, and maintainability. With headquarters in New York and Paris, AdaCore provides its expertise to leading global defense, healthcare, automotive, aerospace, and railway enterprises.

Challenges and solutions

This project encountered several challenges while developing the copilot solution for the Ada programming language.

Challenges faced

LLMs Evaluation
Evaluating LLMs is difficult: without a manually created set of programming challenges with unit tests, no metric can be calculated automatically that also correlates well with the model’s performance in the program synthesis task. To circumvent this, we used text comparison metrics such as BLEU and chrF to compare the model’s generation with the ground truth code.
Training Objective
Standard autoregressive training would not suffice, because correct code completion requires information from both before and after the cursor. As a result, we needed to understand how LLMs are trained to perform fill-in-the-middle (FIM) tasks.

Resources
Despite the availability of memory-efficient training methods, memory requirements continue to be challenging when fine-tuning large models, particularly those with extended context. We had to rely on small batch sizes, which meant slow iteration speed (one training run could take several days to fine-tune on the largest dataset for more than one epoch).
The fast pace of the developments in the field of coding LLMs
During the project, new versions of the CodeGen and StarCoder models were released. Because our training pipeline was designed with modularity in mind, we could seamlessly incorporate these models.
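To illustrate the fill-in-the-middle training objective mentioned above, here is a minimal sketch of how a source file can be turned into a FIM training example. It is not the project's actual preprocessing: the sentinel strings are assumed to follow the StarCoder-style convention, the split points are chosen uniformly at random at character level, and `to_fim_example` is a hypothetical name.

```python
import random

# Sentinel strings are an assumption of this sketch; a real pipeline must use
# the special tokens defined by the tokenizer of the chosen model.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def to_fim_example(code: str, rng: random.Random) -> str:
    """Cut the code at two random points and emit prefix, suffix, then middle."""
    start, end = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # The model sees the code before and after the "cursor" and learns to
    # generate the missing middle part at the end of the sequence.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

ada_code = "procedure Hello is\nbegin\n   null;\nend Hello;\n"
print(to_fim_example(ada_code, random.Random(0)))
```

Because the middle part is moved to the end, the model can still be trained with ordinary left-to-right next-token prediction while learning to use context from both sides of the gap.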

Solution approach

Based on our experience in delivering AI solutions, we decided to take the following steps:

Formulation of the problem in Data Science terms
Challenge decomposition into smaller epics
Setting development order and priorities
Choice of metrics and methods for evaluating the performance of the code generation model
Weekly meetings with AdaCore to present progress and gather feedback

Development process

The project’s goal in Data Science terms was to fine-tune the LLM, which is already pre-trained on code, for Ada code synthesis. The overall strategy included three crucial components as follows:

Training dataset preparation
We took The Stack, an existing, cleaned corpus of GitHub repositories with permissive licenses. After keeping only files with valid Ada code and applying some additional preprocessing, we transformed it into a form that allows us to train the models on the FIM task.
Evaluation
Due to the lack of better evaluation methods, we utilized the chrF metric, which measures the similarity between the predicted text and the ground truth, while remaining aware of its shortcomings. We evaluated the checkpoints on the held-out Ada corpus and the dataset of short Ada programming challenges found on the AdaCore website.
Fine-tuning Runs
We ran and compared the performance of pre-trained and fine-tuned StarCoder and CodeGen models in different configurations, varying the number of model parameters, precision, context length, and memory-efficient fine-tuning methods such as LoRA and QLoRA.
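For readers unfamiliar with chrF, the sketch below shows the core idea: an F-score over overlapping character n-grams of the prediction and the ground truth. It is a simplified illustration and not the implementation used in the project. It assumes n-gram orders 1 to 6 and a recall weight of beta = 2, and it skips details such as whitespace handling, so its scores will not match reference implementations exactly. The name `simple_chrf` is hypothetical.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count all character n-grams of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score in [0, 1]; recall is weighted beta times more than precision."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((hyp & ref).values())  # n-grams present in both, with multiplicity
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(simple_chrf("X := 1;", "X := 1;"))  # 1.0
print(simple_chrf("X := 1;", "X := 2;"))  # between 0 and 1
```

The example also hints at the metric's main weakness for code: `X := 1;` and `X := 2;` score as very similar text even though the programs behave differently.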

Data

We utilized three data sources throughout the project:

The Stack is a collection of code repositories for 358 programming languages available under permissive licenses. We retained only file extensions associated with Ada scripts (.ads, .adb, .ada), validated their correctness with the libadalang module, and filtered out files lacking Ada keywords in their contents. In the end, 30,528 files remained, representing 2.4% of the original dataset.
Ada Course Labs, which consisted of short Ada exercises with descriptions. We utilized this dataset as a supplementary test set.
Ada code from GitHub that was not in The Stack dataset, used to improve the model’s performance with a more extensive training corpus and to generate a test set of Ada files unseen by pre-trained models during their training phase.
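The extension and keyword filtering described for The Stack can be sketched as follows. This is an illustration under assumptions: the keyword list is a small, hypothetical subset of Ada reserved words, `looks_like_ada` is a hypothetical name, and the syntax validation with libadalang is omitted entirely.

```python
import re
from pathlib import PurePosixPath

ADA_EXTENSIONS = {".ads", ".adb", ".ada"}
# Hypothetical subset of Ada reserved words; a real filter would use a fuller list.
ADA_KEYWORD_PATTERN = re.compile(
    r"\b(procedure|function|package|begin|end)\b", re.IGNORECASE  # Ada is case-insensitive
)

def looks_like_ada(path: str, content: str) -> bool:
    """Keep a file only if it has an Ada extension and contains an Ada keyword."""
    if PurePosixPath(path).suffix.lower() not in ADA_EXTENSIONS:
        return False
    return ADA_KEYWORD_PATTERN.search(content) is not None

files = {
    "src/hello.adb": "procedure Hello is\nbegin\n   null;\nend Hello;\n",
    "README.md": "This package contains a procedure.",
    "data/table.ada": "1 2 3 4 5",
}
kept = [path for path, content in files.items() if looks_like_ada(path, content)]
print(kept)  # ['src/hello.adb']
```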

Key contributors

Since it was a scoped project with clearly defined deliverables, the deepsense.ai team of experienced consultants, managers, and developers took ownership of the whole development process:

Project Manager
Technical Leader
Data Scientists
Principal ML Engineer (Technical Consultant)
ML Engineers

In addition to the team from deepsense.ai, the client-side team was actively engaged throughout the project, ensuring collaboration and alignment with the project goals.

Outcomes and benefits

The fine-tuning improved the model’s performance on the Ada code synthesis tasks compared to the pre-trained version. According to AdaCore’s feedback, the project is a significant step forward. The generations from the delivered fine-tuned model outperformed the ground truth and GitHub Copilot even though Copilot uses a much larger model with a longer context and a more complex prompting method.

Lessons learned

The field of coding LLMs is constantly evolving, with more and more capable models being released at a fast pace. Even the foundation models pre-trained on publicly available code repositories have a decent level of understanding of how to program in the Ada language. The model’s capabilities can be further improved by fine-tuning on an additional Ada-specific corpus of code repositories.
For coding models, context length is more important than the model’s size – even a 1B model with a context of 8k tokens can outperform a 15.5B model with only 2k tokens context.
While useful as a proxy measure for the model’s output quality, existing metrics and benchmarks for coding LLMs are unreliable for differentiating models. The end-user subjective evaluation is still required to determine the model’s usefulness.

Summary

The copilot for the Ada programming language project aimed to enhance Ada developer productivity with an intelligent code completion tool. deepsense.ai overcame challenges in LLM evaluation, training, and resource constraints through problem decomposition and the choice of a suitable evaluation scheme. The project marked significant progress for AdaCore in leveraging ML efforts to improve Ada developers’ productivity.

Implementing Small Language Models (SLMs) with RAG on Embedded Devices Leading to Cost Reduction, Data Privacy, and Offline Use

April 25, 2024/in Generative AI /by Kamil Czerski

In today’s rapidly evolving generative AI world, keeping pace requires more than embracing cutting-edge technology. At deepsense.ai, we don’t merely follow trends; we aspire to establish new solutions. Our latest achievement combines Advanced Retrieval-Augmented Generation (RAG) with Small Language Models (SLMs), aiming to enhance the capabilities of embedded devices beyond traditional cloud solutions. Yet, it’s not solely about the technology – it’s about the business opportunities it presents: cost reduction, improved data privacy, and seamless offline functionality.

What are Small Language Models?

Inherently, Small Language Models (SLMs) are smaller counterparts of Large Language Models. They have fewer parameters and are more lightweight and faster in inference time. We can consider models with more than 7 billion parameters as LLMs (the largest could have even more than 1 trillion parameters), demanding resource-heavy training and inference. The definition of a Small Language Model may vary among authors, but we consider models lightweight enough to run on edge devices, typically with 3 billion parameters or less. Please note that this division is conventional and does not provide full depth.

SLMs are compact versions of Language Models, and they excel in two main areas:

SLMs are suitable for Edge Devices, offering businesses benefits such as cost reduction, offline usage, or enhanced data privacy.
For research groups, SLMs facilitate speeding up R&D progress, swiftly testing new ideas, benchmarking at scale, and iterating relatively fast. Retraining SLMs, even from scratch, is feasible for small groups with access to home-grade GPUs.

This article focuses on the first area: applying SLMs to Edge Devices for practical purposes.

Benefits of SLMs on Edge Devices

In this section, we present three compelling reasons why companies may find Small Language Model (SLM) applications preferable to their cloud-heavy Large Language Model (LLM) counterparts:

Cost Reduction

The expense of cloud inference for Large Language Models can be prohibitive. Transitioning LLM-based solutions directly to edge devices eliminates the need for cloud inference, resulting in significant cost savings at scale. This cost reduction may be a primary incentive for companies already employing cloud-based LLM inference on mobile phones or edge devices. Additionally, for specific applications, the quality offered by smaller models may already meet requirements. Moreover, companies seeking to cut LLM costs can benefit from shifting inference to local PC hardware.

We encourage readers interested in cost reduction topics to read our article on Reducing the cost of LLMs with quantization and efficient fine-tuning.

Offline Functionality

Deploying SLMs directly on edge devices eliminates the requirement for internet access, making SLM-based solutions suitable for scenarios where internet connectivity is limited. For instance, consider a drone application leveraging a Small Vision Language Model; it must operate seamlessly even in environments lacking internet connectivity. Another example can be a smartphone application as an RAG pipeline, utilizing the company’s documents and providing a question-answer mechanism. This application utilizes SLM and can reduce the costs of hosting larger LLM in a cloud.

Data Privacy

Sometimes, apprehensions arise regarding cloud services due to data protection regulations. All processing occurs locally by running on the Edge, offering the opportunity to adopt Language Model-based solutions while adhering to stringent data protection protocols.

Developing a Complete RAG Pipeline with SLMs on a Mobile Phone

To gain hands-on experience with Small Language Models, we ran an internal project in which we explored SLMs and their usage.
The main goal of this internal project was to develop a complete Retrieval-Augmented Generation (RAG) pipeline, encompassing the embedding model, retrieval of relevant document chunks, and the question-answering model, ready for deployment on resource-constrained Android devices. The primary objective was to explore the capabilities of Small Language Models (SLMs) in terms of overall response quality and generation speed on mobile hardware.

We have also published the code related to this project; check it out at: https://github.com/deepsense-ai/edge-slm

What did we do?

We constructed a prototype pipeline for RAG using the llama.cpp framework and successfully deployed it on Android devices.
We experimented with SLMs, including Phi-2, Gemma, and TinyLlama, with parameter counts ranging from 1B to 3B.
Using the Ragas library, we evaluated their question-answering quality by combining human assessment with automated LLM-based metrics.
We gauged the impact of different quantization levels and prompt engineering on response quality.
We assessed the pipeline’s latency and memory consumption, gaining insights into the current possibilities for deploying language models on the edge.
In addition, we conducted experiments on the pipeline’s retrieval component, which involved embedding model selection and hyperparameter optimization.
Parallel to developing the main pipeline, we explored other frameworks such as ExecuTorch and MLC and alternatives to Transformer, like selective state space models (Mamba).

Demo of the RAG pipeline Phi-2 Q8 model with thenlper/gte-large embeddings model running on Samsung S24 Ultra.

What does the RAG pipeline look like?

It is a technique for injecting specific knowledge (consider your company documents and text data) into a system where users ask questions, and the Language Model answers those questions, incorporating knowledge from the mentioned documents. In other words, it is a zero-shot prompt technique for the Language Model, requiring no fine-tuning or training of the model. The main flow of the designed RAG pipeline is depicted in the diagram below, and this is precisely what we have fully implemented on the smartphone as our demo project.

For the offline component, documents are chunked, and embeddings are calculated. For the company’s records, this is a one-time offline operation. When the mobile application is started, the embeddings (indexed pointers to document chunks) are loaded into RAM on the device, while the documents are kept in the smartphone’s storage. Subsequently, when a user poses a question (user query), context is retrieved from this vector index. With appropriate prompt engineering, the Small Language Model takes the user question and the retrieved contexts and generates a response.

Offline processing:

The production-ready system can also include an offline component. The “Knowledge base” needs to be distributed to the edge devices, implying that the distribution may involve precalculated embeddings.

The document chunking step is conducted offline using Python scripts. This approach allows for the utilization of existing libraries and tools, such as LangChain.
Embedding vectors are computed offline to reduce loading time. This is achieved by developing simple applications using the developed library, ensuring consistent runtime implementation for the embeddings.
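
To make the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not the project's script (which could rely on a library such as LangChain); the function name `chunk_tokens`, the whitespace "tokenizer", and the sizes are assumptions made only for illustration.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption).
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk would then be passed to the embedding model once, offline, and the resulting vectors shipped to the device together with the chunk text.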

Online on edge part:

Knowledge base loading must occur during application startup, and indexed chunks must be loaded at runtime. The current solution stores all document chunks directly in memory. If the desired knowledge base was too large to fit within a reasonable amount of RAM, enhancing the solution by implementing mechanisms to store the actual documents outside the application would be necessary.
The context retrieval component takes the user query and the knowledge base. It extracts the K nearest elements retrieved from the knowledge base based on the cosine similarity score between the user query and document chunks.
The response generation step is the final stage in the pipeline. For this project’s scope, we did not implement the chatting functionality. It may be added in subsequent steps. This component generates the SLM model response based on the specific model prompt template, retrieved contexts, and the user query. The output from the LLM interface is capable of token streaming.
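
The context retrieval step described above can be sketched as a brute-force top-K search by cosine similarity. This is only an illustration in plain Python: the project uses FAISS in C++ for this, and the toy three-dimensional vectors and the function names below are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, k):
    """Return (chunk_index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real ones come from the embedding model.
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.9, 0.1, 0.0]
top = retrieve_top_k(query_vec, chunk_vecs, k=2)
print([index for index, _ in top])
```

A common way to get the same ranking from a vector index is to L2-normalize all vectors once and search by inner product, which avoids recomputing the norms for every query.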

Tech Stack

Below, we provide a quick overview of the project, divided into the research and inference sides. For the tech stack used in inference, we chose llama.cpp as the inference engine for SLMs, bert.cpp as the framework for the embedding model (now fully integrated into llama.cpp), Faiss as the library for k-NN search between embeddings of user queries and embeddings of document chunks, OpenBLAS because Faiss requires a BLAS implementation, and Conan to manage C++ dependencies in the Android environment. On the research side, we evaluated models using both human and automated metrics (Ragas) and benchmarked the application.

Methods and Tools

Let’s start with the inference engine for the Small Language Model. We have tested and evaluated four LLM frameworks:

llama.cpp – This framework emerged as the best choice for runtime in mobile environments. It boasts a large community and is actively developed. It is compatible with mobile devices. In addition to supporting numerous Transformer models, there is a pending PR for incorporating Mamba models, which we also tested. While the community is rapidly growing, the PRs and codebase can be disorganized. Mobile optimizations are not the highest priority.
MLC LLM—Built on Apache TVM, our tests revealed this project’s limited applicability. It exhibited slower performance with GPU computations compared to llama.cpp. Additionally, it offers a narrower range of supported models.
Mamba.c – A runtime for Mamba written in C. It only supports CPU instructions on Android. A basic CUDA implementation is available, but it functions solely on PCs.
ExecuTorch – This is a new framework designed for mobile and edge devices. While still in the development phase, it shows promise. However, support for Llama models is currently limited and buggy (we have opened an issue that needs to be solved). It may become a rising star due to its better organization, but it has yet to be production-ready.

Additionally, we researched and briefly examined language model support in gemma.cpp and NCNN.

Google recently released its gemma.cpp, another development worth noting. While we haven’t tested it as an inference engine, it could interest those looking to utilize Gemma models. This framework appears to exclusively support Gemma models.
NCNN, on the other hand, is a popular ML framework for Android devices. However, it has limited support for small language models, with only the 7B llama model reported to run, and needs quantizations lower than int8.

Embedding model runtimes:

Bert.cpp – This repository utilizes the GGML runtime to execute the embedding models.

RAG:

FAISS – We built the library for Android devices. FAISS is responsible for indexing the embedding vectors and enabling efficient search based on cosine similarity.

C++ libraries built:

OpenBLAS—FAISS requires a library that implements the BLAS and LAPACK interfaces. We incorporated OpenBLAS into a Conan recipe (the official recipe lacks Android support). However, an issue arises as OpenBLAS requires a Fortran compiler that is no longer supported with the Android NDK.

C++ Package management:

Conan – Responsible for managing all project dependencies. It integrates third-party repositories into our solution and enhances the manageability of Android and x86 builds.

Challenges with Implementing SLM with RAG on a Mobile Device

The key challenges faced during this project were:

1. Memory Limitations

The models’ size is crucial in their applicability on mobile hardware, which typically has 4-12GB of RAM. As a Small Language Model, we consider models ranging from 1B-7B parameters. To match the mobile’s memory constraint, we need to utilize quantization techniques like int8 (Q8) or lower (e.g., 4-bit or 5-bit representations). It is also important to mention that we can’t use all the memory; depending on the OS, we need to reserve 2-3.5GB for Android and other application components. Operations such as sparse kernel multiplication for pruned models have yet to be widespread in mobile frameworks like llama.cpp. Still, this field is progressing rapidly and will soon allow for the execution of larger models and/or faster inference.
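
As a back-of-the-envelope check of these constraints, the weight memory can be approximated as the parameter count times the bits per weight. The sketch below is an estimate under stated assumptions: it ignores the KV cache, activations, and the per-block metadata that real quantization formats add, and the helper names are hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate RAM for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_device(num_params, bits_per_weight, device_ram_gb, reserved_gb):
    """True if the weights fit in what is left after the OS reservation."""
    available_gb = device_ram_gb - reserved_gb
    return weight_memory_gb(num_params, bits_per_weight) <= available_gb


# A 7B model at three quantization levels (weights only).
for bits in (8, 5, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7e9, bits):.2f} GB")

# A 2.78B model at 8 bits on 6 GB and 12 GB devices, reserving 3.5 GB.
print(fits_on_device(2.78e9, 8, device_ram_gb=6, reserved_gb=3.5))
print(fits_on_device(2.78e9, 8, device_ram_gb=12, reserved_gb=3.5))
```

Measured memory use will be higher than this estimate, so it is best treated as a lower bound when shortlisting models for a device.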

2. Platform Independence

The Android platform has its own set of requirements for building applications. All the necessary components of the developed solution were designed so that the codebase should only be rebuilt or require minor tweaks specific to the target platform. Consequently, we implemented a library with a terminal application that can be deployed on Android devices and a regular x86 computer. Keeping the core functionality as a native-built shared object will allow the library to be used in regular Android apps (written in Kotlin) and Flutter (with Dart). Both require only wrappers on the public interface for the core RAG library.

3. Not Mature Enough Inference Engines

SLM inference engines are evolving rapidly and still need to mature. Currently, llama.cpp is the best choice and supports more models than any other framework, but as a rapidly growing repository, it is somewhat disorganized. Additionally, it does not target Android and mobile performance optimization as its primary goal. Android GPU support via CLBlast does not produce correct results and is slower than the CPU. There is currently no support for pruning and sparse kernel operations. On the other hand, ExecuTorch seems to be well-organized and offers Qualcomm’s kernels for massive inference speed-up, but it is not yet mature enough to run Language Models. We expect the situation to change dramatically in the upcoming months.

4. Missing Features in Runtime Technologies and LLM Libraries for C++

Most products designed for deploying language model-based applications and systems target cloud-native deployments and the Python environment. In some cases, features ready to use in Python (e.g., more advanced retrieval techniques like hierarchical search) must be implemented from scratch in C++.

5. Android Constraints – a Single Process

Android deployment imposes constraints, as the entire application must be contained within a single process. This results in a much narrower choice of technologies like vector databases, not to mention that there is no clean way to build and include Fortran dependencies.

Performed evaluations

Here, we would like to discuss key findings from performance benchmarking.

Retrieval

The retrieval part was evaluated on a sample dataset containing a few PDF documents available on the public Internet. The metric measured was mAP (mean average precision), which assesses how much relevant information was retrieved correctly.

RAG – evaluation datasets:

Source documents found in public resources containing the standard operating procedures for areas:
- Construction workplace safety
- Warehouse procedures
- Grocery store worker instructions
- COVID-19 guidelines
Queries and expected vital points to be retrieved were created manually.

As the best performing models, we chose the gte-base family. Depending on the memory available on the device, our recommendation is as follows:

- gte-base/fp32
  - mAP 0.65 at 3 chunks and 600 tokens
  - ~0.5GB

- gte-large/fp32
  - mAP 0.65 at 2 chunks and 200 tokens
  - ~1.5GB

The bigger model is better because returning the top 2 chunks is sufficient to achieve an mAP of 0.65 on our dataset, with an average of 200 tokens for the SLM to parse in the next step. However, we need to allocate 1.5GB of RAM for this bigger model. The smaller model is 3x more lightweight at only 0.5GB, but it requires the top 3 chunks to achieve an mAP of 0.65, resulting in 600 tokens for the SLM to process in the next step. This means that the SLM will have more work to do (as it needs to process 600 tokens compared to 200 tokens) and potentially more challenging work (summarizing/reasoning with more non-relevant contexts). In other words, using a better embedding model can reduce the SLM input size. A smaller input prompt shortens the period during which the model produces no output (the prompt decoding step in llama.cpp).

The plots show top-k (Number of retrieved chunks) on the X-axis and mAP of the retrieval on the Y-axis. The number in the plot, close to the line, is how many tokens the entire retrieved-context is built from. Each plot shows a different model under a few configurations, where c_xxx means context size and o_XX means overlap between contexts when calculating embedding. For each query, a certain number of ground truth contexts are expected to be retrieved. For each query, the precision is calculated by correctly_retrieved_chunks_num / all_gt_chunks.
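
The per-query score described above can be written down in a few lines. This sketch follows the formula exactly as stated in the text (correctly retrieved chunks divided by all ground-truth chunks, averaged over queries); textbook definitions of mAP differ in the details, and the chunk ids below are hypothetical.

```python
def query_score(retrieved_ids, gt_ids):
    """Share of ground-truth chunks that appear among the retrieved chunks."""
    gt = set(gt_ids)
    if not gt:
        raise ValueError("each query needs at least one ground-truth chunk")
    return len(gt & set(retrieved_ids)) / len(gt)


def mean_score(results):
    """Average the per-query score over (retrieved_ids, gt_ids) pairs."""
    return sum(query_score(r, g) for r, g in results) / len(results)


# Hypothetical chunk ids for two queries at top-k = 3.
results = [
    ([4, 7, 1], [4, 1]),  # both expected chunks were retrieved
    ([2, 9, 5], [5, 8]),  # one of the two expected chunks was retrieved
]
print(mean_score(results))
```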

Embedder speed

The retrieval is fast, as user queries are often short.

The base model is 3x faster and occupies 3x less memory. But if memory is not a constraint, we suggest going with the larger model, as it achieves better mAP and needs fewer chunks and tokens to achieve a comparable mAP, making the next step, SLM inference, much faster and easier as a task.

Retrieval indexing performance

For retrieval, we used the Faiss library. Even though indexing is needed only offline in our case, the benchmark shows that it is fast enough to be considered on-device as well. CPU index search is very fast. The benchmark of the RAM needed for the embeddings also shows that you can index thousands of pages without worrying about device memory. Num vectors is the number of embeddings, each corresponding to one text chunk; in our case, one page yielded roughly 6-10 chunks.
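
To see why thousands of pages fit comfortably, the RAM needed for an uncompressed index can be estimated as number of vectors times embedding dimension times bytes per value. The sketch below assumes float32 storage and a 1024-dimensional embedding; both values, and the helper name, are assumptions for illustration rather than numbers from the benchmark.

```python
def flat_index_ram_mb(num_pages, chunks_per_page, dim, bytes_per_value=4):
    """RAM for uncompressed embeddings, in MB (10**6 bytes)."""
    num_vectors = num_pages * chunks_per_page
    return num_vectors * dim * bytes_per_value / 1e6


# 1,000 pages at 10 chunks per page with 1024-dimensional float32 vectors.
print(flat_index_ram_mb(num_pages=1000, chunks_per_page=10, dim=1024))
```

Under these assumptions the vectors take tens of megabytes, which is small next to the gigabytes taken by the language model weights.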

SLMs Benchmark

Here, we were evaluating models in the 1-3B range. It is possible to push 7B models with lower quantization levels, but they are slower and require memory that is available only on high-end smartphones.

Speaking about the memory, here are the RAM results needed per model and quantization level.

Lower quantization levels mean that weights are stored with less precision, which significantly reduces the RAM needed.

Plots show the generation speed for SLM on two mobile devices:

Galaxy S24 Ultra: 12 GB Mem, Snapdragon 8 Gen 3
S20FE: 6 GB Mem, Snapdragon 865

and 3 quantization levels (the lower-end device has 6GB memory and could not run some Q8 models).

A generation speed of 5-10 tokens per second might be considered fast enough for a good user experience. More important here is the prompt eval time (how fast the model reads the contexts and the user query), as contexts might be really long. Even though prompt eval is usually as fast as generation, or up to 2x faster, in tokens per second, the time to first token (the input lag before the first token is generated) influences the experience negatively.

Here, we can see that in the worst-case scenario of 1000 tokens (query + contexts), users must wait as long as 50 seconds after prompting the system.
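
The arithmetic behind this worst case is simple: the time to first token is the prompt length divided by the prompt eval speed. The sketch below uses 20 prompt tokens per second, the speed quoted in the conclusions for this scenario; the answer length and generation speed in the second call are hypothetical values.

```python
def time_to_first_token_s(prompt_tokens, prompt_eval_tok_per_s):
    """Seconds spent reading the prompt before any output appears."""
    return prompt_tokens / prompt_eval_tok_per_s


def total_response_time_s(prompt_tokens, prompt_eval_tok_per_s,
                          output_tokens, generation_tok_per_s):
    """Prompt reading time plus the time to generate the whole answer."""
    reading = time_to_first_token_s(prompt_tokens, prompt_eval_tok_per_s)
    return reading + output_tokens / generation_tok_per_s


# Worst case from the text: 1000 prompt tokens at 20 tokens per second.
print(time_to_first_token_s(1000, 20))
# Hypothetical answer of 100 tokens generated at 5 tokens per second.
print(total_response_time_s(1000, 20, 100, 5))
```

This is why shrinking the retrieved context, for example from 600 to 200 tokens, has a direct and proportional effect on the perceived lag.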

SLM model response quality evaluation

But how well did the SLM generate the answer given the retrieved contexts (not always correct) and the user query as input? Two approaches were used: Ragas (an automated tool for RAG evaluation with an LLM-as-a-judge approach based on OpenAI models) and human-based manual evaluation. For Ragas, three metrics were calculated:

Answer Correctness – the accuracy of the generated answer when compared to the ground truth,
Faithfulness – consistency of the answer given the context,
Answer Relevance—how pertinent the generated answer is to the given prompt. A lower score is assigned to incomplete or redundant answers.

For Human manual evaluation, two metrics were calculated:

Averaged Score – scores in the float range 0-1 that combine how correct the answer is and how well the context is utilized, while penalizing models for excessive off-topic content,
Averaged Hard_EQ_1 Score – as above, but thresholded so that only answers that scored a full 1.0 are accepted as correct; all others are assigned a score of 0.0.
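
The relation between the two human metrics can be sketched as follows. The per-answer scores are hypothetical placeholders, not results from the evaluation, and the function names are assumptions.

```python
def averaged_score(scores):
    """Mean of the per-answer human scores in the range 0-1."""
    return sum(scores) / len(scores)


def averaged_hard_eq_1(scores):
    """Mean after thresholding: only a full 1.0 counts as correct."""
    return sum(1.0 if s == 1.0 else 0.0 for s in scores) / len(scores)


# Hypothetical human scores for four answers.
scores = [1.0, 0.5, 1.0, 0.0]
print(averaged_score(scores))
print(averaged_hard_eq_1(scores))
```

The hard variant is never higher than the plain average, so the gap between the two indicates how many answers were only partially correct.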

Three models in the range 1B-3B were evaluated:

phi2 – 2.78B Microsoft SLM released under MIT,
gemma – instruction tuned 2.51B Google SLM with gemma-terms-of-use license,
tinyLLama_v1.0 – 1.1B LLama model under apache-2.0.

Dataset:

Firefighters dataset—a tiny, handcrafted dataset using online-available PDFs to measure SLM quality given the pre-defined contexts. It consists of 24 human-crafted questions and answers and context grabbed from firefighters’ documents.
Phi-2/TinyLlama/Gemma evaluated for Ragas and Humans Score.

In the first stage, we decided to look for a quickly handcrafted prompt with a proper template. Using an adequate template proved crucial—without it, models tend not to halt quickly, talk about unrelated things, and overall response quality is worse.

phi2

example prompt:

Instruct: Generate answer to the Question, using provided context.\nContext: {space_separated_content}\nQuestion: {user_question}\nOutput:

At this stage, the impression was positive and, as expected, showed the superiority of less aggressive quantization. On the positive side, the model did well overall, showing signs of correct reasoning on less straightforward questions. Sometimes, despite missing contexts, the model was able to answer correctly. This might be attributed to a general ability to improvise and to the model relying on knowledge embedded during pre-training. On the negative side, the model produced overwhelmingly long answers at this stage. Sometimes they were correct and relevant, but short, concise responses would be preferred.

The big issue with phi2 at this stage (for this templated prompt) was that, even after responding correctly, it kept producing additional questions and answers (starting with another Question: token) or changed the topic and continued the monolog. We observed that some of these topic changes start from repetitive patterns (like {correct_answer} Output: unrelated content). One idea would be to apply post-processing and cut off these patterns; however, some of the ways the model moves to the next topic are creative and impossible to catch with simple pattern matching. Also, brute-force post-processing like that could harm some correct responses. The other idea would be to use prompt engineering to better instruct the model to give shorter answers and to improve quality, which we tried in the following steps.
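
For illustration, the pattern-based post-processing idea could look like the sketch below: cut the generated text at the first marker that signals a new turn. The marker list, the function name, and the sample output are hypothetical, and the limitation noted above applies, since a correct answer that legitimately contains one of the markers would be truncated too.

```python
STOP_MARKERS = ("Question:", "Output:", "Instruct:")


def truncate_at_markers(generated, markers=STOP_MARKERS):
    """Keep only the text before the earliest marker, if any marker occurs."""
    cut = len(generated)
    for marker in markers:
        position = generated.find(marker)
        if position != -1:
            cut = min(cut, position)
    return generated[:cut].strip()


# Hypothetical raw model output that drifts into a new question.
raw = " Wear a helmet.\nQuestion: What is the capital of France?\nOutput: Paris"
print(truncate_at_markers(raw))
```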

Eval with Ragas

As a next step, we have constructed a small dataset for our needs (firefighters dataset) to check Humans vs automated Ragas eval.

One question was whether we can rely on Ragas’ Evaluation and how it relates to our human scores. This tiny test cannot be treated as a conclusive answer, but using Ragas might help automate evaluation tests. It would be too much to say that Ragas correlates well with human answers (in general), but at least for this small dataset, that looks good. Nevertheless, human checking on top of that (for smaller datasets) should be standard practice. It is worth noting that by keeping the same seed (seed_id for generating the answer with LLM), we can see an increase in the results for less aggressive quantizations on Ragas metrics.

Here are the impressions:

human evaluation scores and Ragas correlate (with few exceptions) on this small dataset,
a good practice would be to run Ragas on large datasets and then validate with human evaluation,
scrolling through the answers during human evaluation reveals model weaknesses; some of them can be addressed with prompt engineering,
it’s also worth noting that Ragas can now help with synthetic RAG dataset generation; we didn’t test this functionality, but readers might find it interesting.

Better Prompt Engineering

In the next step, we checked the influence of prompt engineering on the answer.
As a general note, we achieved better results by:

telling the models what the contexts are: instead of space-separated contexts, listing the contexts separately and providing an id for each, like this:
context_id: {context}\n,
directing the model to produce short, concise answers that rely on the contexts,
and applying prompt engineering to counter each model’s flaws individually.

We learned that each model has its problems and needs separate prompt engineering. Here, we show some issues with the Gemma model and how we can address them by modifying the original prompt.

Prompt engineering (without prompt template) after applying the above:

Generate a concise answer to the User Question using the provided Contexts. Read carefully, as the answer is always in one or more contexts. Be aware that not all contexts must be relevant. Contexts:\n{join_contexts}\nUser Question: {prompt}

where join_contexts is the list of contexts, one per line, each prefixed with a hardcoded context_id:
context_0: some text here
context_1: some other text here
…
context_n:…

This makes it easier for models to cite directly by context number.
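
Assembling this prompt is straightforward; the sketch below shows one way to do it in Python. The prompt text and the `context_<n>:` prefix come from the example above, while the function names and the sample contexts are assumptions for illustration.

```python
PROMPT = (
    "Generate a concise answer to the User Question using the provided Contexts. "
    "Read carefully, as the answer is always in one or more contexts. "
    "Be aware that not all contexts must be relevant. "
    "Contexts:\n{join_contexts}\nUser Question: {prompt}"
)


def join_contexts(contexts):
    """Put each context on its own line, prefixed with its context id."""
    return "\n".join(f"context_{i}: {text}" for i, text in enumerate(contexts))


def build_prompt(contexts, user_question):
    """Fill the prompt with the numbered contexts and the user question."""
    return PROMPT.format(join_contexts=join_contexts(contexts), prompt=user_question)


# Hypothetical contexts and question.
print(build_prompt(["some text here", "some other text here"], "What should I do?"))
```

The model-specific prompt template (such as the Instruct/Output wrapper shown earlier for phi2) would still be applied around this text.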

Injecting these into the prompt reduced (and in some cases wholly removed) the issues we had seen when manually scrolling through the answers. With the above examples and some techniques like ‘use step-by-step reasoning / read carefully,’ we were able to improve on the naive prompts.

Conclusions

The most important takeaways from implementing SLMs with RAG on mobile devices:

While it is indeed possible to run SLM on edge devices and have satisfactory results for applications such as RAG, both in terms of speed and quality, some important caveats need to be mentioned:
- Heavy memory constraints. Recent phones with 12-16GB+ RAM can run models in the range 1-3B and even those in the range 7B (here, quantizations Q5, Q4, or lower). For devices with 6GB RAM, we could run Q5-quantized Gemma and smaller models. 4GB is challenging, but models like 1.1B TinyLLAMA can still be run there.
- Speed Constraints. Newer smartphones are noticeably faster. Runtime becomes slow for the longest contexts; for models with a prompt eval speed of 20 tokens/sec, a context of 1000 tokens means it will take 50 seconds before the model starts responding (worst-case scenario). Check more details in the report.
- Early insights are that for models like Gemma or Phi2, quality can be satisfactory for RAG purposes.
- The rapid growth of supported models and better inference/memory-efficient models are expected shortly.
llama.cpp is growing in popularity. It might also be used for cloud model serving. It adopted the Gemma model rapidly (within 2 days).
ExecuTorch is a rising star. Qualcomm kernels have very impressive runtimes for example ML models such as image segmentation. However, as of today, LLM support is poor, and the framework is not yet production-ready.
The vector database search could be better (compared to Python/cloud services). There is no hybrid search feature and no sparse search, and there is no easy, out-of-the-box way to bring those to the inference/C++ side.
The latency for the SLM to start generating a response may be quite long. This delay depends on the length of the model’s input prompt, so it is essential to get the most out of the retrieval part and keep the retrieved context compact. However, latency is improving with recent, more powerful phones and more sophisticated Android-targeted inference engine speed-ups.
Evaluating SLMs requires more work with bigger datasets than our tiny handcrafted firefighter dataset. Automating the generation like that would be helpful, and it seems the Ragas library has introduced such a feature, which might be an interesting approach for continuing the project.
Given the limited lifespan of these projects, not all findings, including better prompt engineering techniques from the evaluation side, were migrated to the inference side.

With engineering effort put into inference engines, a growing community for SLMs, and tons of research put into it, we expect that in the coming months (sometimes even weeks), the situation will change drastically. Not only will models in this 1-3B range become more powerful, but the inference side will also improve in terms of speed and memory consumption.

Ongoing Research

Let’s briefly mention ongoing research efforts that aim to break the current limits of SLMs (or LLMs in general).

Better hardware utilization by dedicated kernel ops (e.g., Executorch + Qualcomm) [7]
Here, it’s purely an engineering effort, but it is worth mentioning, as proper GPU kernels with hardware-aware optimizations can significantly boost Android runtimes.
`1-bit` LLMs [1]
The idea is to go beyond classic quantization and train a model from scratch that keeps weights in merely 1.58 bits, replacing some multiplications with additions. The authors showed that such models can be on par with fp16 models. If this holds for the SLMs we have tested, someone would need to retrain the models and add support, e.g., in llama.cpp, but the memory and inference speed benefits could be huge. This technique would allow us to bring bigger models to mobile devices, which is impossible for now due to RAM constraints.
Mixture of Experts (MoE), as in Mixtral of Experts [2] (coarse-grained sparsity)
The idea of coarse-grained sparsity is that, during inference, only some path(s) are active, and not every weight/layer needs to be calculated. This family of techniques does not lower the memory needed but speeds up inference (as only a sub-part of the model needs to be active during inference). There is also a positive inductive bias in making model parts sparse (like paths of layers) with separate specialized modules (a bunch of layers), resulting in more powerful models that train faster. In the future, this technique could be combined with sharding – loading some parts of the model from storage to RAM – while minimizing the delays.
Mamba[3], MoE Mamba[4]
Here, there is a trend to mitigate quadratic attention mechanisms (the more significant the context window, the more computationally heavy it becomes) in favor of ‘linear’ attention with RNN/LSTM. Authors claim that Mamba models became significantly stronger for a given capacity than their transformer counterparts. llama.cpp has a PR that brings Mamba support—it’s not yet finished.
MoE Mamba is a combination of both techniques, a mixture of experts and Mamba. Here, the authors also show that both techniques are complementary, increasing model quality.
Sparse kernels+pruning (fine grain sparsity) [5][6]
It is about sparsity at the micro-level, where the model is pruned, and some weights are removed locally. llama.cpp currently lacks support for sparse kernel operations and sparse weight storage. With minimal quality loss, there is an opportunity here to save more RAM and maybe even speed up inference. This technique, when mature, can also bring bigger models to mobile environments.
Draft + Verify [8]
Last but not least, a technique for speeding up inference by having a lightweight (weak) drafter and a (strong) verifier NN on top of the drafter. The drafter proposes many tokens at once, and the verifier checks them in parallel; the drafted tokens are then accepted in order up to the first one the verifier rejects. Language model training is parallelized (tokens+masking trained in parallel), but inference is calculated token by token. The big deal is that the verifier can work in parallel, verifying many tokens at once, in contrast to the sequential, step-by-step token generation of the current models we have tested. This technique might be brought to mobile someday, again increasing inference speed.
Symbolic Knowledge Distillation [14]
Rather than training SLMs from scratch, distilling skills and abilities like reasoning from the influential teacher (big LLM) via proxy NN critic can result in more powerful models. This is another paradigm shift for removing unrelated knowledge from small models by distilling essential skills and abilities and relying on a RAG-like pipeline for knowledge retrieval, which can result in lighter (and more powerful) models.
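
To make one of the techniques above more tangible, here is a toy sketch of the Draft + Verify idea with greedy decoding. The "models" are plain Python functions over integer tokens, and all names are hypothetical; a real verifier would score all drafted positions in a single batched forward pass, which the inner loop below only mimics.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: the strong model generates one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy draft-and-verify decoding with a cheap drafter and a strong verifier."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap drafter proposes draft_len tokens sequentially.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. The verifier checks each drafted position in order.
        for i in range(draft_len + 1):
            expected = target_next(tokens + draft[:i])
            if i == draft_len or draft[i] != expected:
                # 3. Keep the accepted prefix plus the verifier's own token.
                tokens.extend(draft[:i] + [expected])
                break
    return tokens[:limit]


def target_next(tokens):
    """Toy strong model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy drafter: agrees with the strong model except when the answer is 7."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


fast = speculative_decode(target_next, draft_next, [3], max_new_tokens=12)
slow = greedy_decode(target_next, [3], max_new_tokens=12)
print(fast == slow)
```

In this greedy variant the output is identical to what the strong model would produce alone; the potential gain is that several tokens can be accepted per verifier call whenever the drafter guesses correctly.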

We have mentioned just a few techniques, but the research area in this domain is much larger, including dynamic neural network structures, distillation of reasoning capabilities, and memory-augmented neural networks.

I was a lead engineer on the project and the author of this article. However, the entire team contributed to its success, so I would like to give a big shoutout to Marcin Ochman (Senior Engineer), Paweł Kaczmarczyk (Senior Engineer), and Artur Zygadło (Project Manager) for making it possible.

Tech Stack:
[7] https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html
[9] https://github.com/ggerganov/llama.cpp
[10] https://github.com/skeskinen/bert.cpp
[11] https://github.com/facebookresearch/faiss
[12] https://conan.io/
[13] https://docs.ragas.io/en/stable/

Active Research:
[1] https://arxiv.org/pdf/2402.17764.pdf
[2] https://arxiv.org/abs/2401.04088
[3] https://arxiv.org/abs/2312.00752
[4] https://arxiv.org/pdf/2401.04081.pdf
[5] https://www.youtube.com/watch?v=0PAiQ1jTN5k
[6] https://huggingface.co/neuralmagic
[8] https://arxiv.org/html/2401.07851v2
[14] https://www.youtube.com/watch?v=H_IfCbpS6G0
[15] https://arxiv.org/abs/2404.01744

From LLMs to RAG. Elevating Chatbot Performance. What is the Retrieval-Augmented Generation System and How to Implement It Correctly?

March 28, 2024/in Generative AI /by Patryk Kowalski

Chances are you’ve already heard about RAG – Retrieval-Augmented Generation. This technology has taken the industry by storm, and for good reason. The emergence of RAG systems is a natural consequence of the popularity of Large Language Models. They make it easier than ever before to create a chatbot – one deeply entrenched in the domain of your company data. It can provide a natural language interface for all the company information that a user would normally have to dig through heaps of internal documents to get.

This saves so much time! Let’s just consider the possibilities:

A factory worker could ask what an error code means and how to proceed with it, instead of hopelessly skimming through bulky instruction manuals.
An office worker could check on any policy without pestering HR.
A retail worker could see whether specific promotions stack together.

And the list goes on.

Why can’t we just use GPT though? Is this ‘RAG’ necessary? Well, there are issues with using LLMs directly in such cases:

Hallucinations – while LLMs are great at creating plausible sentences, they may not always be factually correct.
Lack of confidence – LLM by itself won’t be able to confidently declare how it knows what it says, or how the user can confirm it.
Domain adaptation – Large Language Models are large. Training them in the specifics of what you want them to know is not a task that comes easily or cheaply!
Domain drift – Let’s say you managed to train a GPT-like model to know everything about your particular use case. What if the underlying data have changed? Do we have to do everything over again?

There are a lot of risks involved in creating a chatbot using LLMs – thankfully, RAG is here to support us.

This article focuses on the retrieval component of retrieval-augmented generation – making sure the correct context is fetched from the company documents and passed onto the answer generation stage. It is based on our hands-on experience building multiple commercial RAG systems. We have read a ton of papers, and learned what works well on actual client data and what doesn’t – and we’ve compiled it all for you here in this article!

What is RAG? Retrieval-Augmented Generation explained

I assume I have managed to get your attention by now. You know you can use RAG to anchor a generative model in your company data. Who wouldn’t want a seemingly flawless solution like that? You’re probably still a tad suspicious though. It sounds too good to be true, and you’re not sure how it works. Let’s take care of that!

RAG workflow

Figure 1. RAG workflow, source: https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2

A typical RAG workflow will look like this:

The user asks a question.
The question is converted to a numerical representation for convenient processing.
Pieces of company knowledge similar to the question asked – either semantically, or in terms of keywords – are picked up.
The relevant text gets packed into the LLM context.
The LLM is fed the relevant context and user question, and uses it to come up with an accurate answer.
An exact source and citation are provided for the user, so the truthfulness of the answer can be verified.

After the workflow finishes, the user is equipped with an exact answer to the question and a relevant passage from the internal documents, validating this information.

What are the benefits of the Retrieval-Augmented Generation?

There are multiple benefits of using Retrieval-Augmented Generation compared to alternative methods of creating chatbots anchored in a specific domain. Amongst the most important ones, we can highlight the following:

No training necessary

Before RAG, trying to teach an LLM domain-specific information required fine-tuning. While the Parameter-Efficient Fine-Tuning branch of Machine Learning is growing strongly, training still requires:

know-how,
computational resources,
a lot of data.

Except for some very specific use cases, it’s best avoided altogether.

The RAG system does not require any training of the base Generative Model.

Fewer hallucinations

Even assuming someone has managed to fine-tune an LLM correctly, unfortunately, it is still prone to hallucinations. The model can use knowledge built-in during the pretraining to formulate an answer, or it can come up with a plausible-sounding false explanation when lacking data.

The RAG system handles hallucinations by providing the model with the exact context it needs to give a truthful answer. The model can be further instructed not to rely on any built-in knowledge if it’s not present in the retrieved context, thus reducing the probability of hallucinations. This is not possible with a fine-tuned model, as built-in knowledge is all it has.

Dynamic knowledge base

With the RAG system, you can change the knowledge base whenever you feel like it. No repeated training is necessary, nor are any additional steps for that matter. All you need to do to make new knowledge available to the model, or deprecate some previous documentation, is to swap the documents uploaded to the vector database.

Citations

The RAG system structure makes it possible to return sources and citations for the information returned by the model. It allows the user to validate any answer received, and check the wider context in the linked company documents.

Building retrieval for your RAG system

Now that we know how retrieval-augmented generation is supposed to work, and what it’s good at, let’s see how to build one.

Note that everything we talk about here has been field-tested – this is knowledge based on actual commercial projects delivered by deepsense.ai!

Embeddings & Vector databases

The first thing you need to do when building an RAG system is to convert your documents to their vector representations and store them somewhere.

Embeddings

Embedding encapsulates the meaning of a sentence inside a numeric vector. It allows for further operations, like a similarity search. To create embedding vectors, we use sentence-embedding models. There are multiple models available, and a good place to start selection is a leaderboard. When making your choice, be sure to check the following model parameters:

Does it support the language you’re interested in?
Is the context size suitable for your needs? How large of a chunk do you need to embed at once?
Will you need to threshold retrieval results based on similarity scores? Some models are not suitable for thresholding, because of the resulting tendency of embeddings to always score highly on similarity.

Text splitters

A whole document will rarely fit inside an embedding model. The text needs to be split into digestible chunks, no larger than the embedding model’s maximum supported context. A text splitter will help separate the text into smaller, hopefully semantically homogenous chunks.

Splitting by character

Figure 2. Character Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

The simplest text splitter is the character-based one. You need to define a separator (typically an empty string will work just fine), and the chunk is populated character by character until it reaches the maximum chunk size. You can opt for chunk overlap, where adjacent chunks share a few characters. It’s typically a good idea to do so, because otherwise you risk splitting a sentence in an awkward place, losing the semantic meaning in either chunk.
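A minimal sketch of such a splitter, assuming an empty-string separator (so the text is cut purely by position). The function name and parameters are illustrative, not taken from any particular library.

```python
from typing import List


def split_by_character(text: str, chunk_size: int,
                       chunk_overlap: int = 0) -> List[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text,
        # so we never emit a tail that is fully covered by the overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


overlapping = split_by_character("abcdefghij", chunk_size=4, chunk_overlap=2)
```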

This solution leaves much to be desired – a perfect chunk encapsulates a singular idea to make retrieval easier and the content leaner. We should find a way to maximize the probability of getting correct splits. That’s where recursive text splitters enter the stage.

Splitting recursively

Figure 3. Recursive Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

Recursive text splitters let you define a number of separators in a specific order. This way, you are able to express an intended hierarchy. For example, in the langchain implementation (available here) the default separators are ["\n\n", "\n", " ", ""] – splitting by paragraphs, newlines, spaces, and, with no alternatives left, on an empty string. Prioritizing splitting into paragraphs makes it less likely to break a coherent thought into separate chunks.
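The idea can be sketched in a few lines. This is a simplified illustration of the recursive strategy, not the langchain code itself (which, among other things, also handles overlap and separator retention); the function name is hypothetical.

```python
from typing import List, Sequence


def split_recursively(text: str, chunk_size: int,
                      separators: Sequence[str] = ("\n\n", "\n", " ", "")
                      ) -> List[str]:
    """Split on the first separator; recurse with the next one if needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut purely by position.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: fall back to finer separators.
            chunks.extend(split_recursively(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


recursive_chunks = split_recursively("aaa bbb\n\nccc ddd eee", chunk_size=8)
```

Note how the paragraph break is honored first, and only the second paragraph, which is too long on its own, gets split further on spaces.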

If the documents you’re working with have some specific structure like markdown or html – even better! You can use the knowledge about the format to come up with a better hierarchy of separators – for example, using the html tags.

Splitting semantically

Figure 4. Semantic Text Splitter, source: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

A rather new and exciting idea is to try to split the text into chunks based on its meaning rather than particular predefined separators. This can be achieved by embedding the text sentence by sentence, and measuring the distance between consecutive embeddings. Wherever there is a peak in distance, it’s likely the topic has changed, and it’s a natural place for a break. Keep in mind that this makes text splitting dependent on the embedding model used.
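Below is a minimal sketch of that procedure. It assumes the sentence embeddings have already been computed by some model (here they are hand-written toy vectors), and it uses a fixed distance threshold as a stand-in for peak detection; in practice the threshold is often derived from the distribution of distances.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def split_semantically(sentences: List[str],
                       embeddings: List[Sequence[float]],
                       threshold: float) -> List[str]:
    """Start a new chunk wherever consecutive sentences are far apart."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append([])  # distance peak: assume the topic changed
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]


# Toy data: two sentences about pumps, then two about holidays.
toy_sentences = ["Pumps need oil.", "Check the pump seal.",
                 "Holidays are booked online.", "Ask HR about leave."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic_chunks = split_semantically(toy_sentences, toy_embeddings, 0.5)
```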

Structure

You can keep the resulting chunks in a flat structure, one next to the other, but a much better way is to create a hierarchy. We’ll get to why and how in a further section of this article, but for now, let’s keep in mind it’s useful to perform nested text splitting – choosing parent chunks with big chunk sizes first, and then splitting each one of them further into child chunks.

Tunable parameters

After this stage, you will get a number of parameters you want to sweep through to find the configuration that works best for your use case. These will be:

chunk size,
chunk overlap,
parent chunk size.

Vector databases

Now that we have our embeddings, we need to store them somewhere. Fortunately there is no need to build this storage from the ground up, as there are many refined implementations of vector databases specializing in storing, indexing, serving and performing searches on vectors. Some are even open-source!

Figure 5. Vector Database Workflow, source: https://www.pinecone.io/learn/vector-database/

Select your vector database

There are many options when it comes to selecting a vector database. Some of the characteristics you may want to pay attention to when making your decision are as follows:

License – is it open source? Do you need it to be?
Supported search – you will typically want some kind of sparse, dense and hybrid search. We’ll get to what that means in a minute.
Managed – are you going to host the vector database yourself, or is a managed instance more up your alley?
Framework integrations – do you care about any specific framework integrations? If your whole app is in langchain or llamaindex, you will probably require your vector database to play nicely.
Indexing – depending on what scale of data you’re working with, you will want a different kind of index. An index is what allows you to perform an efficient vector search.

To make an informed choice you can use a comparison tool, like this one:

Figure 6. Vector DB Comparison, source: https://vdbs.superlinked.com/

There are rumors of OpenAI and Anthropic using Qdrant internally – maybe there’s a slight edge there?

Configuration based on the example of Weaviate

Let’s see how to set up a vector database based on the example of Weaviate, which is one of the most popular providers.

Figure 7. Weaviate Vector DB Configurator, source: https://weaviate.io/developers/weaviate/installation

You can use a configurator to put together a docker-compose or a kubernetes-helm file that gets you started. You need to click through some options there, and if you are a data scientist, you will mostly care about the following:

Vectorizer

A vectorizer is the model that will be used to embed your data and the user queries. You can have it self-hosted. Weaviate offers a large selection of pre-built images ready to roll out. If you are into a niche model that hasn’t made its way there yet, you can build your own image, and it will work as long as it’s compatible with Hugging Face’s AutoModel. In any other case, you can’t go wrong with an external API, e.g., from OpenAI and their sturdy text-embedding-ada-002.

Reranker

A reranking model helps you keep the retrieval results sorted correctly – with the most important one at the top. It can be important for the quality of the generated answer, and the accuracy of the returned citation, so you may want to pay for the extra processing time and GPU usage to have it handy. The good news is that the reranking cross-encoder model will only run on the prefiltered set of retrieved chunks.

Indexes

Indexing makes it easier to calculate the semantic similarity between high-dimensional vectors, like the ones we get from text embeddings. Check out this page for a simple explanation of how different indexes work. Weaviate offers only two of them:

Flat indexing

The right choice when perfect accuracy is required and speed is not a consideration. If the dataset we are searching is small, flat indexing may also be a good choice as the search speed can still be reasonable.

HNSW

Figure 8. Hierarchical Navigable Small Worlds, source: modified https://arxiv.org/pdf/1603.09320.pdf

The HNSW index is a more complex index that is slower to build, but it scales well to large datasets as queries have a logarithmic time complexity.

It uses a multi-layered graph approach to indexing data. The lowest layer contains all the vectors, while each higher layer keeps a progressively smaller subset of them, linked over longer distances. The user query starts its journey at the top layer, traversing its way to the bottom, getting closer to the most similar chunk with each step.

Search methods

In a vector database, you will typically encounter the following search methods:

Full text

Used for metadata filtering. You can use some additional metadata with your chunks, such as tags, that can be used to perform initial filtering. In this case, a full-text match is necessary.

Keyword (sparse vector) search

Word frequency-based search, good for catching keywords. Let’s imagine your use case involves a number of technical user manuals. Those tend to be semantically similar, with the crucial difference of describing different equipment. In such a case, the name of the part of interest – a keyword – needs to be a significant part of the search. Weaviate uses BM25 to perform this kind of search.
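To show why a rare keyword dominates this kind of search, here is a small, self-contained sketch of the classic Okapi BM25 formula. It is not Weaviate's implementation; the whitespace tokenizer and the `k1`/`b` values are simplifying assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized) / n_docs or 1.0
    # In how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a high inverse document frequency.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


manuals = ["replace the filter on pump X200",
           "replace the filter on pump Z900",
           "company holiday policy"]
keyword_scores = bm25_scores("pump X200", manuals)
```

Both manuals mention the pump, but only the first contains the rare part name, so it scores highest.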

Vector search

The bread and butter of similarity search methods. You can further decide on a distance metric, but the default cosine similarity tends to get the job done.
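As a reference point, the sketch below performs an exhaustive (flat) cosine-similarity search over a handful of toy vectors. A vector database does the same thing conceptually, only with an index that avoids comparing the query against every stored vector. All names and data here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def flat_search(query: Sequence[float], vectors: List[Sequence[float]],
                top_k: int) -> List[Tuple[int, float]]:
    """Return (index, similarity) of the top_k most similar vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


stored_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
nearest = flat_search([1.0, 0.1], stored_vectors, top_k=2)
```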

Hybrid search

Figure 9. Weaviate Hybrid Search, source: https://weaviate.io/blog/hybrid-search-fusion-algorithms

Why choose one, if you can have it all? A hybrid search runs both dense and sparse vector searches and merges their results. You can use the alpha parameter to regulate dense/sparse search importance. There are caveats though! When using a hybrid search, the confidence score is calculated as a combination of the dense/sparse confidence scores, making it harder to interpret or threshold.
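The sketch below illustrates score-based fusion: each result set is min-max normalized and then blended, with alpha = 1 meaning pure dense search and alpha = 0 pure sparse search. It is a simplified illustration of the idea and not Weaviate's code; the document ids and raw scores are made up.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(dense: Dict[str, float], sparse: Dict[str, float],
                  alpha: float) -> Dict[str, float]:
    """Blend normalized dense and sparse scores; missing results count as 0."""
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }


dense_results = {"a": 0.9, "b": 0.5, "c": 0.1}      # e.g. cosine similarity
sparse_results = {"b": 12.0, "c": 3.0, "d": 6.0}    # e.g. BM25
fused = hybrid_scores(dense_results, sparse_results, alpha=0.5)
best_doc = max(fused, key=fused.get)
```

Document "b" wins because it does reasonably well in both searches, which also shows why the fused score is harder to interpret than either input score.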

Get only relevant results

At this point we should be able to retrieve chunks of information relevant to the user query – but how many of them? We need to set a limit. Not enough chunks can make us miss important information, while too many of them can bloat the LLM context and confuse it.

Weaviate proposes an autocut feature – “Autocut aims to approximate where a user would cut the results intuitively after observing N jumps in the distance from the query.” It can get a bit tricky if you’re using the hybrid search method with it – the rescaled confidence score can trip autocut up. In this case, make sure you’re using RelativeScoreFusion, and not RankedFusion. It is supposed to work better, because it often results in natural clusters that autocut can detect.

Reranking

The first result is the best result. Or it should be. For a limited result set you can afford a more computationally expensive approach, so it’s the reranker’s time to shine. The cross-encoder model ingests a pair consisting of the user query and one of the retrieved chunks, and computes a relevance score between them. This number will be used to reorder the chunks and make sure the most relevant ones will be served first. An added benefit is that this score tends to be suitable for thresholding – for example, if you want to be able to decide that a similarity smaller than X should cut the RAG workflow short, and skip the answer generation step.

Tunable parameters

After this stage, the group of tunable parameters should welcome new contenders:

hybrid search alpha parameter,
hybrid search score fusion method,
chunk limit,
autocut limit,
switching reranker on or off.

Measuring results

We have said a great deal about tuning parameters – but for tuning we need a metric to optimize. For retrieval, Mean Average Precision is a great candidate to optimize, because it only scores highly if the relevant documents are first on the list.

This is a go-to metric for us in commercial projects and it has proven very reliable.

Let’s go through how it works together:

Figure 10. Mean Average Precision calculation, source: own study, inspired by https://www.educative.io/answers/what-is-the-mean-average-precision-in-information-retrieval

The user query is fed into the search algorithm and results in 5 documents being retrieved.
Only three-fifths of the documents are relevant.
For each of the documents, we calculate the precision. Precision@k is the fraction of relevant documents among the top k results.
For each of the documents, we multiply its relevance and its precision.
We sum the resulting numbers, and divide the sum by the total number of relevant documents that it’s possible to retrieve – thus arriving at the Average Precision.
We repeat the process for more queries, and average the results to get the Mean Average Precision.
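The steps above can be sketched as follows. The relevance list is a hypothetical ordering in which three of the five retrieved documents are relevant; the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def average_precision(relevance: Sequence[int], total_relevant: int) -> float:
    """relevance[k-1] is 1 if the k-th retrieved document is relevant."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k, counted only on hits
    return precision_sum / total_relevant if total_relevant else 0.0


def mean_average_precision(
        queries: List[Tuple[Sequence[int], int]]) -> float:
    """Average the per-query Average Precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# Hits at ranks 1, 3 and 4: (1/1 + 2/3 + 3/4) / 3
ap_example = average_precision([1, 0, 1, 1, 0], total_relevant=3)
map_example = mean_average_precision([([1, 0, 1, 1, 0], 3), ([0, 1], 1)])
```

Moving a relevant document up the list raises the score, which is exactly the behavior we want from a retrieval metric.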

As you can see, the calculations aren’t very complex, but you need to prepare a dataset beforehand for it to make sense. In particular, you need to be aware how many relevant documents there are for any given query.

You also need to decide when a retrieved chunk is considered relevant. It can be deemed relevant if it comes from a relevant document – but if your documents are large, you may need a more precise method. You can also define a set of ground-truth sentences you think should be retrieved for a given query. It introduces a new sort of trickery, because depending on how you split your text into chunks, a given ground-truth sentence can be split into multiple chunks, or one chunk can contain more than one ground-truth sentence! In such a case, feel free to modify the Mean Average Precision definition, for example, allowing the relevance metric to exceed the [0,1] range. It won’t technically be the Mean Average Precision anymore, but it will do great for parameter optimization and comparison of different retrieval setups.

Tuning retrieval parameters

We have gathered quite a few parameters to optimize. Let’s recall what they were:

embedding model type,
chunk size,
chunk overlap,
parent chunk size,
hybrid search alpha parameter,
hybrid search score fusion method,
chunk limit,
autocut limit,
reranker type.
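To get a feel for how quickly this search space grows, the library-free sketch below enumerates a small grid over some of these parameters. All values are hypothetical placeholders; a sweeping tool does the same enumeration for you and adds launching and logging on top.

```python
from itertools import product
from typing import Dict, Iterator, List

# Hypothetical ranges; replace them with values that suit your data.
search_space: Dict[str, List] = {
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "hybrid_alpha": [0.25, 0.5, 0.75],
    "fusion": ["rankedFusion", "relativeScoreFusion"],
    "reranker": [False, True],
}


def iter_configs(space: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield one configuration per point of the Cartesian product."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
```

Even this modest grid already produces 48 retrieval runs, which is why automating the sweep pays off.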

To sweep through all of those effectively, we advise you to use Hydra & Neptune.ai working in tandem.

You can use hydra to set up ranges for the parameter sweep conveniently. Use the --multirun option to run your tests efficiently. A simple bash script may look like this:

If you are running your embedding model locally, do not forget to set the hydra.launcher.batch_size when sweeping through embedding parameters, like the chunk size, to make sure you fit in your GPU’s VRAM.

When running the experiments in parallel, setting the Neptune.ai synchronization to offline and uploading the experiments manually after they’re done can help you save the connection pool from unnecessary abuse. You can do it as follows:

Get accurate citations

LLM wouldn’t make stuff up, would it? Hopefully not, but as Ronald Reagan used to say – trust, but verify.

We want to provide the user with the means to validate whether the generated answer is factually correct. A direct citation from the source document gets the job done – but how to get it?

Use retrieval results directly

The simplest way is to just use the passage retrieved with the highest confidence score. It won’t cover scenarios with more than one relevant passage, and sometimes it will be plain wrong, if the LLM used another chunk for its response… but it still tends to work in the majority of cases, and its simplicity cannot be valued highly enough.

OpenAI Function-calling

If you are able to work with the OpenAI API, you can make use of their function calling feature. Just define a function that returns the answer in the form of a JSON dictionary, with keys for the answer and the citation respectively. You can index the citations to allow the model to just pick a number, making it as easy as can be. You can also opt to make the model repeat the citation verbatim, but you will quickly find out that LLMs enjoy modifying the text they repeat a bit here and there, making a direct match impossible.
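A minimal sketch of the indexed-citation variant is shown below. The dictionary follows the JSON-Schema-style tool definition format used for function calling; the function name `answer_with_citation`, its fields, and the helper `resolve_citation` are hypothetical, and no API call is made here.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical tool definition: the model must return an answer together
# with the index of the retrieved chunk that supports it.
answer_tool = {
    "type": "function",
    "function": {
        "name": "answer_with_citation",
        "description": "Answer the question and cite the supporting chunk.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "citation_index": {
                    "type": "integer",
                    "description": "Zero-based index of the cited chunk.",
                },
            },
            "required": ["answer", "citation_index"],
        },
    },
}


def resolve_citation(arguments_json: str,
                     chunks: List[str]) -> Tuple[str, Optional[str]]:
    """Map the model's citation index back to the retrieved chunk text."""
    arguments = json.loads(arguments_json)
    index = arguments["citation_index"]
    if not 0 <= index < len(chunks):
        return arguments["answer"], None  # the model picked an invalid index
    return arguments["answer"], chunks[index]


retrieved = ["Error E12 means low oil pressure.", "Error E12: stop the pump."]
model_output = '{"answer": "Stop the pump.", "citation_index": 1}'
answer, citation = resolve_citation(model_output, retrieved)
```

Because the model only returns a number, the citation shown to the user is always the verbatim chunk text, never a paraphrase.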

You can achieve a very similar result using Direct Prompting – just include a few-shot example in the model prompt, where you show that the answer always needs to be coupled with a citation index.

How well is it going to work? The answer can vary wildly depending on the actual LLM you’re using, and the quality of data. LLMs on the smaller end of the spectrum, like Llama 2 7B, cannot be trusted to select a correct citation. GPT-4 will be correct most of the time given a clean enough dataset. It’s not going to go great if your data are messy, though. Suppose the documents you use contain leftover watermarks, random numbers, or OCR artifacts. In that case, the model will have difficulty determining where one citation ends and the other starts, and which number is the citation index.

Don’t forget to check out the langchain implementation of those methods.

Learn from our experience – here’s what boosted the performance of our RAG systems

There are A LOT of tricks meant to improve your retrieval performance. This field is growing explosively and produces an unmanageable amount of ideas. Not all of them are all that useful though, and some just aren’t worth the time. When working on commercial projects, we sifted through the internet tips and academic papers to test them all – and a few of the methods tested have proven quite extraordinary.

Multiquery – Reciprocal Rank Fusion

Semantic search tends not to be very stable. RAG users will ask the same question in a multitude of different ways, and we would like them to always get the same results. That’s what multiquerying is for!

Figure 11. Reciprocal Rank Fusion, source: https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1

Each user query generates a number of alternatively phrased queries with the same meaning. It will work best if you use an LLM to request rephrasing, but if you are reluctant to introduce another LLM call, or are very wary of the response time, there are always simpler solutions. For example, you can shuffle the letters a bit to produce alternative queries by introducing typos.

This way you will get a set of different results instead of one, so they need to be combined back into a single list. A Reciprocal Rank Fusion algorithm can help with that. It makes sure to put the following information at the top:

The most frequently retrieved documents
Documents rated with the highest confidence.

This is a great method to increase retrieval stability and robustness to imperfect queries.
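Here is a compact sketch of Reciprocal Rank Fusion. Note that the standard formulation works on rank positions only: a document earns 1 / (k + rank) from every result list it appears in. The smoothing constant k = 60 is a commonly used default, and the document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Frequently retrieved and highly ranked documents accumulate
            # the largest scores.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# One result list per rephrased query.
per_query_results = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
fused_ranking = reciprocal_rank_fusion(per_query_results)
```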

Hierarchical chunking

Do you want your chunks big or small?

Small chunks, obviously, because…	Big chunks, obviously, because…
Small chunks make more meaningful embeddings	Big chunks decrease the risk of dropping information
Small chunks keep LLM context from bloating	Big chunks retain context
Small chunks decrease response time

Instead of deciding, let’s try to take the best of both worlds – hierarchical chunking!

Figure 12. Parent Chunking, source: https://medium.com/ai-insights-cobet/rag-and-parent-document-retrievers-making-sense-of-complex-contexts-with-code-5bd5c3474a8a

First run your text splitter with a big chunk size – bigger than you can fit into an embedding model. Then, for a second round, split each parent chunk further down into several child chunks – those will be vectorized.

You will perform a similarity search on the child chunks – but you are always free to swap a child chunk for its parent before building the LLM context! A smart way to go about this is defining another tunable parameter in the range [0,1], and determining when to perform the child -> parent swap. For example, setting it to 0.5 would mean you need to retrieve at least half of all child chunks to trigger the swap for a parent.
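The swap rule can be sketched like this. The data structures (a child-to-parent map and a per-parent child count) and all names are assumptions made for the illustration; retrieved child ids are assumed to be unique.

```python
from collections import Counter
from typing import Dict, List, Tuple


def swap_children_for_parents(
        retrieved_children: List[str],
        parent_of: Dict[str, str],
        children_per_parent: Dict[str, int],
        swap_ratio: float) -> List[Tuple[str, str]]:
    """Replace child chunks with their parent when enough siblings were hit."""
    hits = Counter(parent_of[child] for child in retrieved_children)
    context: List[Tuple[str, str]] = []
    for child in retrieved_children:
        parent = parent_of[child]
        if hits[parent] / children_per_parent[parent] >= swap_ratio:
            item = ("parent", parent)
        else:
            item = ("child", child)
        if item not in context:  # a swapped-in parent is added only once
            context.append(item)
    return context


parent_of = {"c1": "P1", "c2": "P1", "c3": "P1", "c4": "P1",
             "c5": "P2", "c6": "P2", "c7": "P2", "c8": "P2"}
children_per_parent = {"P1": 4, "P2": 4}
llm_context = swap_children_for_parents(
    ["c1", "c5", "c2"], parent_of, children_per_parent, swap_ratio=0.5)
```

Two of the four children of P1 were retrieved, which meets the 0.5 threshold, so P1 is passed to the LLM as a whole, while the single hit from P2 stays a child chunk.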

What’s left? Well, everything else, of course!

If you followed this journey with me, I’m sure at this point you have a wonderful retrieval setup. Does it mean we’re done? Well… almost. To have a full-scale RAG system running, you need to include the answer generation stage too. You will need to come up with a few prompts for the LLM and connect to your favorite model. That’s beyond the scope of today’s adventure though – let’s take a breather and get back to this with a fresh mind!

Summary

RAG systems are great for building chatbots anchored in domain data. They are cheap to build, require no training, and solve a lot of problems inherent to generative models. RAGs validate their answers by providing citations, have a decreased probability of returning hallucinations, and are easy to adapt to a new domain, which makes them a go-to solution for multiple use cases.

Building a solid retrieval mechanism is a cornerstone of any RAG system. Feeding the generative model with accurate and concise context enables it to provide great and informative answers. There is a lot of literature regarding building RAG, and filtering through all the tips and manuals can be time-consuming. We have already checked what works and what doesn’t – as part of successful commercial projects – so make sure to take advantage of a head start and use our tips:

Be mindful when selecting the components: the vectorizer, reranker and the vector database.
Create a benchmarking dataset – not necessarily a huge one – and tune all the retrieval parameters specifically for your use case.
Do not forget to use multiquerying and hierarchical chunking – they give you a lot of ‘bang for your buck’.

With retrieval built this way, you are on a sure path toward a perfect RAG system.

Cost-Effective LLMs: Leveraging Generative AI with Limited Hardware

Reducing the cost of LLMs with quantization and efficient fine-tuning: how can businesses benefit from Generative AI with limited hardware?

February 29, 2024/in Generative AI /by Alicja Kotyla and Artur Zygadlo

More than a year has passed since the release of ChatGPT, which led hundreds of millions of people to not only talk about AI, but actively use it on a daily basis. The wide adoption of ChatGPT and other large language models (LLMs) among individuals made companies of all sizes and across all sectors of industry wonder how they could benefit from this upward-trending technology. One of the main challenges with turning LLMs into business value is the high cost of the expensive hardware required to run the models. Fortunately, recent developments in the field have allowed companies to significantly lower these expenses.

This article will be a summary of the most recent trends around LLMs, focusing on LLM democratization – making generative AI easily accessible to everyone. The two main topics we will dive into are quantized inference and parameter-efficient fine-tuning. Apart from explaining these concepts and stressing their importance, we will share our experience from their practical use in commercial LLM projects which we have recently delivered to our clients.

If you want to know more about the history of large language models, their different “flavors” and example applications, feel free to check out our report or watch our deeptalk.

LLMs for everyone – the recent trend of AI democratization

ChatGPT is an example (not the first of its kind but undoubtedly the most famous one) of a large language model. LLMs are machine learning models based on deep neural networks, capable of generating text by autoregressively predicting the next word (or the next token, to be more precise). Their applications range from answering questions based on provided documents or knowledge bases (so-called retrieval-augmented generation, or RAG for short), to text summarization, content creation, coding assistants and more.

The most powerful models like those from OpenAI (ChatGPT, GPT-4), Google (Gemini Ultra) and several open-source alternatives (Falcon, Llama 2 or Mixtral, to name a few) are astonishingly performant, even superior to humans in many tasks. Their power can be unleashed thanks to the fact that they are large, i.e. they consist of dozens or hundreds of billions of so-called parameters – numbers that describe how to convert the input data into an output text.

Between 2018 and 2021, encouraged by the visibly improving capabilities of increasingly larger models, the world’s biggest research labs pushed model size to the extreme. However, large size brings many challenges. One of the main limitations is the high cost of the hardware (GPUs with enough memory) required not only to train a model but even to serve it for inference, which makes business decision makers hesitant to build products or improve existing processes with the help of LLMs.

Figure 1. Recent trends in LLMs, source: own study

Fortunately, around early 2022, LLM researchers started to understand that for the LLMs to be practically applicable, scaling the models further up was not the way to go. Instead, they turned their attention to training so-called compute-optimal models, which paved the way for all the much smaller (with “only” a few billion parameters), yet still powerful, language models developed both by big tech companies and the open-source community in 2023.

In parallel to the models, motivated by the need to reduce costs and broaden access to the technology, new algorithms and ideas around quantized inference and parameter-efficient fine-tuning have been developed. We will delve into the details in the following sections.

LLM quantization – reducing the memory required for inference

The larger the model, the more GPU memory it needs to load its parameters into. Typically, billions of parameters require gigabytes of RAM. But how many exactly? To answer this question, we will first describe the various ways in which numbers (and hence the model parameters as well) can be represented.

Numeric data types

Two common floating-point formats used in deep learning applications include the single-precision floating-point format (known as FP32) and the half-precision floating-point format (FP16). FP32 utilizes 32 bits for representing a floating-point number, distributing 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand (see Figure 2), whereas FP16 allocates 16 bits, dividing them into 1 bit for the sign, 5 bits for the exponent, and 10 bits for the significand. It is worth mentioning one more data type, designed specifically with deep neural networks in mind – bfloat16 (brain floating point). Numbers in this format are also represented with 16 bits, but composed differently: 1 sign bit, 8 exponent bits (same as FP32), and 7 bits for the significand. In this way, despite bfloat16’s lower numerical precision, its dynamic range is equivalent to that of FP32.
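
To make the trade-off between dynamic range and precision more tangible, the sketch below derives the largest finite value and the step size just above 1.0 from the exponent and significand bit counts quoted above. It assumes an IEEE-754-style encoding (biased exponent, with the highest exponent code reserved for infinity and NaN); the function name `float_format_stats` is ours and not part of any library.

```python
def float_format_stats(exponent_bits: int, significand_bits: int) -> tuple[float, float]:
    """Return (largest finite value, gap between 1.0 and the next representable value)."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent code is reserved for inf/NaN
    largest = (2 - 2.0 ** -significand_bits) * 2.0 ** max_exponent
    step_near_one = 2.0 ** -significand_bits
    return largest, step_near_one


# (exponent bits, significand bits) as listed in the paragraph above
FORMATS = {"FP32": (8, 23), "FP16": (5, 10), "bfloat16": (8, 7)}

for name, (e_bits, m_bits) in FORMATS.items():
    largest, step = float_format_stats(e_bits, m_bits)
    print(f"{name:>8}: max ~ {largest:.3e}, step near 1.0 = {step:.1e}")
```

Under these assumptions, bfloat16 reaches roughly the same maximum as FP32, while FP16 tops out at 65504 but resolves values near 1.0 eight times more finely than bfloat16.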

Figure 2. Illustration of various numeric data types, source: https://www.microsoft.com/en-us/research/blog/a-microsoft-custom-data-type-for-efficient-inference

As a consequence of how these representations work, to load an LLM with 1 billion parameters, we would need 4 GB of GPU memory in the case of FP32 (32 bits, so 4 bytes per parameter), and 2 GB (2 bytes per parameter) for FP16 or bfloat16. It should now be easy to calculate how much is needed to serve the largest Llama 2 model with 70 billion parameters!
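
The arithmetic above can be written down as a small helper. This is a back-of-the-envelope sketch that counts the weights only and assumes 1 GB = 10^9 bytes; activations and the key-value cache add to the total at inference time, so real requirements are somewhat higher. The names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the model weights alone, using 1 GB = 10**9 bytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"1B parameters in {dtype}: {weight_memory_gb(1e9, dtype):.0f} GB")
    print(f"Llama 2 70B in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```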

Although loading the model in half-precision saves a lot of memory (only half the amount is required compared to FP32), and it has become the de facto standard for neural network inference (as little or no quality degradation is usually observed compared to FP32), this still might not be enough in the case of LLMs. As these multi-billion-parameter models require powerful (and therefore expensive) GPUs like the NVIDIA A100 with 80GB RAM to run on, further memory savings are often required.

For this purpose, various quantization techniques that aim to shrink the model size even further while still preserving the model’s performance (sometimes even with a positive impact on response latency) have been developed. Recently, several algorithms compressing model parameters to integer data types such as INT8 (1 bit for the sign, 7 bits for the magnitude) and INT4 (1 bit for the sign, 3 bits for the magnitude) have gained the attention of the LLM community and are now widely used. We will discuss the most popular ones in the following sections.

Additionally, if you want to know more about the basics of quantization of neural networks, feel free to watch our deeptalk which introduces the topic.

Benefits and possible pitfalls of quantization

Quantization involves representing weights and activations with low-precision data types, such as INT8 or INT4, instead of the typical floating-point numbers, resulting in reduced computational and memory costs of LLM inference.

Regarding inference, it is worth mentioning that LLMs can be utilized in two ways. One possible approach is to set up a dedicated server with adequate GPUs (either on premises or within a virtual private cloud) and deploy the model there. In the cloud environment, one needs to pay a few dollars per hour to keep the utilized machine running. Another approach is to leverage one of the LLM-as-a-Service APIs like OpenAI API or Anyscale Endpoints and send requests (prompts) on demand. In this case, the hardware and infrastructure are fully handled by the API provider, and related expenses are covered by the money paid by the API users for each request, typically priced at a fraction of a dollar per million tokens sent (e.g., in Anyscale Endpoints, the prices range from $0.15 per million tokens for Llama 2 and Mistral models with 7 billion parameters to $1.00 per million tokens for Llama 2 with 70 billion parameters).

Table 1 summarizes the costs of using selected open-source LLMs based on examples of Google Cloud Platform pricing. In the case of the deployment of an LLM in one’s own cloud, a suitable machine with sufficient memory needs to be chosen from the available configurations. Different cloud instances offered by Google (GCP), AWS, or Microsoft (Azure) have their own pricing, with the presence of high-end GPUs like NVIDIA A100 with up to 80GB VRAM having the biggest impact on the price. With INT8 quantization, the required memory is around 2 times lower, and INT4 quantization leads to 4 times lower memory requirements, allowing users to utilize the quantized model on a 3-4 times cheaper machine.

Model name	Number of parameters	Quantization	Estimated GPU memory required for inference	Hourly cost of running a cloud instance with sufficient memory (example pricing of GCP instances)	Hardware required for inference (example configurations of NVIDIA GPUs available in GCP)
StarCoder	15.5B	FP16	36 GB	$2.87	1x A100 (40 GB)
		INT8	18 GB	$0.70	2x T4 (16 GB)
		INT4	10 GB	$0.35	1x T4 (16 GB)
Llama 2	70B	FP16	150 GB	$7.70	2x A100 (80 GB)
		INT8	80 GB	$3.85	1x A100 (80 GB)
		INT4	40 GB	$2.87	1x A100 (40 GB)

Table 1. Example cost of LLMs self-hosted in the cloud, based on Google Cloud Platform pricing (as of February 2024)

With these cost savings being the biggest advantage of quantization, one might wonder what the downsides are, e.g., with respect to performance or speed. While the evaluation of LLMs and other generative models in terms of output quality remains a challenging and widely researched topic, subjective manual analysis of sample outputs reveals that in many cases the modern quantization algorithms do not seem to introduce any visible degradation. Quantization should not be too aggressive though, as at some point an unacceptable level of degradation can be observed – going below 3 bits per parameter is not recommended.

When it comes to response latency, it should not increase after quantization. In fact, it is sometimes even possible to observe inference speed-ups compared to FP16, especially when using specialized low-level kernels for integer matrix calculations. However, this largely depends on the specific usage patterns and computing environment (batch sizes, prompt lengths, utilized hardware etc.), and requires verification in practice.

Selected quantization algorithms

The topic of model quantization is currently widely researched, with advancements being developed at a fast pace. In this section, we will introduce and explain the technical details of selected quantization algorithms and share our practical experience with some of them.

Naive approach

One of the classic quantization algorithms is range-based linear quantization, an approach in which floating-point values are quantized by multiplying them with a scale factor derived from the actual range of the tensor’s values. Within this algorithm, we can distinguish two modes: symmetric and asymmetric.

In asymmetric mode, we align the smallest and largest values from the floating-point range with those of the integer range using a zero-point and a scale factor.

Figure 3. Asymmetric mode, source: https://intellabs.github.io/distiller/algo_quantization.html

Conversely, in symmetric mode, we pick the maximum absolute value between the smallest and largest values of the float range without a zero-point, resulting in both the float range and the quantized range being symmetric around zero.

Figure 4. Symmetric mode, source: https://intellabs.github.io/distiller/algo_quantization.html

One of the issues with the approaches presented above, due to the reliance on identifying maximum values, is their sensitivity to outliers which can hugely impact the quantization results.
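
A minimal NumPy sketch of both modes, and of the outlier problem, might look as follows. Note one convention difference: here `scale` denotes the quantization step size, so values are divided by it (the reciprocal of the "multiply by a scale factor" formulation above). The function names and the toy weight values are ours, chosen only for illustration.

```python
import numpy as np


def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric mode: the range is set by max |x|, no zero-point is needed."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def quantize_asymmetric(x: np.ndarray, bits: int = 8):
    """Asymmetric mode: map [min, max] onto the unsigned range using a zero-point."""
    qmax = 2 ** bits - 1  # 255 for 8 bits
    scale = (x.max() - x.min()) / qmax
    zero_point = int(np.round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)
    return q, scale, zero_point


weights = np.array([-0.4, -0.1, 0.0, 0.2, 0.5])  # toy values
with_outlier = np.append(weights, 20.0)           # same values plus one outlier

for name, x in [("no outlier", weights), ("with outlier", with_outlier)]:
    q, scale = quantize_symmetric(x)
    # Reconstruction error measured only on the five regular values
    error = np.abs(q * scale - x)[: len(weights)].max()
    print(f"{name}: step={scale:.4f}, max error on regular values={error:.4f}")
```

Because a single large value stretches the step size for the whole tensor, the regular values end up sharing only a handful of integer levels, which is the sensitivity described above.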

LLM.int8()

In 2022, Dettmers et al. introduced LLM.int8(), a quantization method addressing the problem of outlier values, frequently present in LLMs’ internal calculations. It uses vector-wise quantization, prioritizing precision for outliers in FP16 format while processing the vast majority of values in INT8 format. With outliers typically making up only around 0.1% of all values, this approach cuts LLM’s memory usage by nearly half (compared to FP16).

LLM.int8() operates in three main stages during matrix multiplication:

It identifies columns in the input hidden states that contain outlier features using a specific threshold.
It conducts the matrix multiplication, processing outliers in FP16 and non-outliers in INT8 using vector-wise quantization.
After dequantizing the non-outlier results from INT8 to FP16, it combines them with the outlier results to produce the complete result in FP16.
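
The three stages can be sketched in NumPy as shown below. This is a simplified illustration rather than the bitsandbytes implementation: the threshold value of 6.0, the function names, and the random test data are assumptions made for the example, and real kernels run on the GPU with fused operations.

```python
import numpy as np


def absmax_quantize(x: np.ndarray, axis: int):
    """Vector-wise INT8 quantization: one scale per row (axis=1) or per column (axis=0)."""
    scale = np.abs(x).max(axis=axis, keepdims=True, initial=0.0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.round(x / scale).astype(np.int8), scale


def mixed_int8_matmul(x: np.ndarray, w: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    # Stage 1: find hidden dimensions (columns of x) that contain outlier features.
    outlier_cols = (np.abs(x) > threshold).any(axis=0)

    # Stage 2a: outlier dimensions are multiplied in 16-bit floating point.
    out_fp16 = x[:, outlier_cols].astype(np.float16) @ w[outlier_cols, :].astype(np.float16)

    # Stage 2b: the remaining dimensions are quantized per row of x and per column of w.
    xq, x_scale = absmax_quantize(x[:, ~outlier_cols], axis=1)
    wq, w_scale = absmax_quantize(w[~outlier_cols, :], axis=0)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)  # integer accumulation

    # Stage 3: dequantize the INT8 result and add the FP16 outlier result.
    return acc * (x_scale * w_scale) + out_fp16.astype(np.float32)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)
x[:, 3] += 20.0  # inject one outlier feature dimension
w = rng.normal(size=(16, 8)).astype(np.float32)

approx = mixed_int8_matmul(x, w)
exact = x @ w
print("max abs error:", np.abs(approx - exact).max())
```

Keeping the outlier dimension out of the INT8 path means its large values do not inflate the per-row scales used for everything else.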

Figure 5. Illustration of LLM.int8(), source: https://huggingface.co/blog/hf-bitsandbytes-integration

LLM.int8() was definitely a great development, allowing users to run and experiment with otherwise unavailable models; however, due to its nature (on-the-fly dequantization to FP16 for outlier values), its usage can lead to inference slowdown compared to serving purely 16-bit models.

Our experience with LLM.int8()

The inference slowdown of LLM.int8() mentioned above was confirmed by our experience in a commercial project around LLMs for code completion. According to our experiment results (with 7-billion- and 15-billion-parameter models from the CodeGen and StarCoder series), the response latency may increase up to 1.5 times. While LLM.int8() made it possible to run inference of StarCoder 15.5B on a single 24GB GPU or two 16GB GPUs (and spend around 4 times less compared to FP16, as shown in Table 1), the speed of code generation was unacceptably low for the model to be usable in the form of a coding assistant. Back then, we had to stick to a smaller model and serve it in FP16. Fortunately, faster quantization algorithms (described below) were proposed not long after.

GPTQ

Introduced by Frantar et al., 2023, the GPTQ (Generative Pre-trained Transformer model Quantization) algorithm is a post-training quantization method, i.e., the weights of an already trained model are converted to lower precision without necessitating any retraining. GPTQ quantizes the model layer by layer, by iteratively going through matrix columns and finding compressed versions of their elements (one for each row) that will yield a minimum mean squared error on a pre-defined calibration dataset. The approach builds and improves on the Optimal Brain Quantization (OBQ) method (Frantar et al., 2022) for solving the layer-wise quantization problem defined above.

GPTQ is currently one of the two most popular quantization techniques (along with AWQ described below). GPTQ-quantized versions (typically 8-bit and 4-bit) of all major newly released LLMs tend to appear soon after the original release, ready to use for both further research and commercial applications.

Figure 6. Illustration of GPTQ, source: https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html

A recently published extension of GPTQ is EXL2, the quantization format of the ExLlamaV2 library. EXL2 uses the same optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. This format enables you to blend different quantization levels within a model to reach an average bitrate of 2 to 8 bits per weight. Additionally, it makes it possible to apply various quantization levels to each linear layer, resembling sparse quantization, where more crucial weights are quantized with more bits. In this way, it allows Llama 2 70B to run on a single 24 GB GPU with a 2048-token context, and 13B models can be used at 2.65 bits within 8 GB of VRAM.

Our experience with GPTQ

Currently, one of the most popular applications of LLMs is retrieval-augmented generation (RAG). In RAG, rather than relying solely on the model’s parameters, the user’s input is first leveraged to extract data from an external source of knowledge, and then both the user query and relevant information are integrated into an LLM prompt, enhancing response generation and mitigating the hallucination issue (to a certain extent). In one of our recent projects concerning RAG in the context of a copilot application for frontline workers, which we developed for a global company from the retail sector, we tested the 4-bit GPTQ version of the powerful open-source model Mixtral 8x7B. With this quantization, we managed to run the model on a single NVIDIA A100 card with 40GB vRAM. For comparison, to run the half-precision model you would need about 90GB GPU vRAM, which exceeds the capacity of the largest available A100 GPU (80GB). As presented in Table 1 (with the example of Llama 2), such memory savings lead to cutting the costs of inference by almost two-thirds. Moreover, as we manually verified the quality of the responses, we observed no difference between the outputs of the quantized model and those of Mixtral served in FP16. In our experiments, 4-bit GPTQ and FP16 models were more or less on par in terms of the speed of text generation.

If you want to learn more about our experience with building RAG systems, consider watching our deeptalk.

AWQ

Another recently popular post-training quantization technique is Activation-aware Weight Quantization, or AWQ (Lin et al., 2023), based on the observation that among the LLM’s weights (parameters), not all are equally important for the model’s performance. By identifying a small fraction (0.1%-1%) of so-called salient weights and scaling them up, AWQ effectively reduces their relative quantization error. To pinpoint these salient weight channels, the algorithm analyzes the activation distribution instead of the weight distribution.

Going into more detail, AWQ consists of three main stages:

profiling activations – run a sample of data through the LLM and record activations, then analyze to identify salient weights.
optimal scaling – scale up salient weights to minimize quantization error.
quantization – apply optimal scaling and quantize all weights to INT8/4/2.
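
The three stages above can be illustrated with a toy NumPy sketch. It is a simplification under several assumptions: every input channel gets a scale derived from its mean activation magnitude raised to a fixed exponent `alpha` (the paper searches for the best scaling rather than fixing it), quantization is plain round-to-nearest with one scale per output column, and the inverse scale is folded back into the weights instead of into the preceding operation. All names and the random data are ours.

```python
import numpy as np


def quantize_dequantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round-to-nearest symmetric quantization with one scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=0, keepdims=True) / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def awq_style_quantize(w: np.ndarray, calib_x: np.ndarray,
                       alpha: float = 0.5, bits: int = 4) -> np.ndarray:
    # 1. Profile activations: mean magnitude per input channel.
    act_magnitude = np.abs(calib_x).mean(axis=0)
    # 2. Per-channel scales: channels with larger activations get larger scales.
    s = np.maximum(act_magnitude, 1e-4) ** alpha
    # 3. Scale the weights, quantize, then undo the scaling.
    #    Mathematically, x @ (Q(W * s) / s) equals (x / s) @ Q(W * s).
    return quantize_dequantize(w * s[:, None], bits) / s[:, None]


rng = np.random.default_rng(0)
calib_x = rng.normal(size=(64, 8))
calib_x[:, 2] *= 10.0            # channel 2 carries much larger activations
w = rng.normal(size=(8, 4))      # shape: (input channels, output channels)

reference = calib_x @ w
rtn_error = np.abs(calib_x @ quantize_dequantize(w) - reference).mean()
awq_error = np.abs(calib_x @ awq_style_quantize(w, calib_x) - reference).mean()
print(f"round-to-nearest: {rtn_error:.4f}, activation-aware: {awq_error:.4f}")
```

Whether the activation-aware variant wins, and by how much, depends on the data and on the choice of `alpha`, which is why the scaling is tuned on calibration data in practice.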

The AWQ paper highlights a 3.2-3.3x average speedup compared to Huggingface’s FP16 implementation across various LLMs, but these findings should be treated with caution, as they only report the results for a single, short input prompt. As already stated above, the observed speedup will depend on many factors related to hardware, prompt length and usage patterns. Memory savings, and therefore also cost reduction, in the case of AWQ are similar to those of GPTQ, leading to as much as 3-4 times less money spent on keeping the inference server running.

GGUF

Typically, LLMs are implemented in Python (using its deep learning libraries like PyTorch or Hugging Face transformers), which is not the optimal choice for maximizing inference speed. An interesting project called llama.cpp was started by Georgi Gerganov soon after Meta released their first Llama models in March 2023, with the goal of reimplementing these newly proposed LLMs in C++. Due to its lightweight, portable nature and support for a wide variety of hardware, the project became very popular and matured quickly, currently allowing users to choose from multiple LLMs other than Llama (all implemented in C++).

Apart from implementing the models, the authors of llama.cpp came up with their own quantization algorithm, called k-quants and often referred to as GGUF (after the format in which llama.cpp models are served). This algorithm is less sophisticated than GPTQ or AWQ, but useful in practice, and much superior in terms of inference speed in CPU-only scenarios. There is a wide selection of quantized representations (2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit), with the possibility to mix different levels of quantization within a single model.

Figure 7. Phi-2 as an example of a “small language model”, source: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models

Our experience with GGUF

We utilized llama.cpp and GGUF-quantized models in our recent PoC project regarding LLMs, or rather SLMs (small language models) deployed on edge devices like mobile phones with Android and only 4 or 6 GB of RAM. We were able to set up the entire RAG pipeline, including the vector index, an embedding model and the generative language model. We found out that 8-bit versions of models like Phi-2 or Gemma were performing reasonably well for retrieval-augmented generation (while for more extreme 4-bit and 5-bit quantizations we saw a significant drop in performance). As of today, running language models on devices with such limited memory is a challenging topic which still needs to be explored further. Nevertheless, there is no doubt that thanks to advancements like GGUF, the recently announced ExecuTorch and various hardware-level optimizations, running a personalized language model on a mobile phone does not seem impossible anymore, with the field of Edge AI continuing to flourish. Deploying language models on edge devices allows one to mitigate the issue with the high cost of hardware and cloud infrastructure required for inference.

Efficient fine-tuning – enabling LLM training on limited hardware

While foundation LLMs of various sizes are already capable of solving many business use cases out-of-the-box (with some time spent on proper prompt engineering, but without the need for any further training), in certain situations there might still be a need to fine-tune them on domain-specific data to reach a satisfactory level of output quality. Compared to the already expensive LLM inference, training a model (either from scratch or fine-tuning an existing one) requires even more powerful hardware, as it is not only the model that needs to fit into memory, but additional space for storing gradients and optimizer states as well. In the case of multi-billion-parameter LLMs, the standard approach to fine-tuning the models by updating all (or some major part) of their parameters is nowadays a procedure only “GPU-rich” companies can afford. The exact calculations are quite complex, but a rule of thumb is that the required memory (in gigabytes) is around 12 times the number of model parameters (in billions). An example setup suitable for fine-tuning Llama 2 with 70 billion parameters in the cloud is a cluster of two nodes with 8 A100 GPUs with 80GB VRAM each, resulting in an hourly cost of $61.60 (in the case of GCP). With dozens of hours required to fine-tune an LLM, the total cost of a single model training experiment reaches hundreds or thousands of dollars.

To mitigate this issue, so-called parameter-efficient fine-tuning (PEFT) techniques have been developed, making it possible to effectively tailor pre-trained language models for different downstream tasks, without the requirement to fine-tune every single parameter of the model. Instead, PEFT focuses on fine-tuning only a limited set of additional model parameters, which significantly cuts down on the computational and storage costs linked with fine-tuning LLMs. With parameter-efficient fine-tuning, it is possible to train Llama 2 with 70 billion parameters on a single 80GB GPU, which is 16 times cheaper than with the standard approach.
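
The numbers quoted in this section can be reproduced with a few lines of arithmetic. The sketch assumes that the hourly price scales linearly with the number of GPUs and uses the $3.85 per hour listed in Table 1 for a single A100 (80 GB) instance, which matches the $61.60 figure for 16 GPUs; the helper names are ours.

```python
A100_80GB_HOURLY_USD = 3.85   # single A100 (80 GB) instance, taken from Table 1
GB_PER_BILLION_PARAMS = 12    # rule of thumb for full fine-tuning quoted above


def full_finetuning_memory_gb(params_billion: float) -> float:
    """Rough memory estimate for full fine-tuning (weights, gradients, optimizer states)."""
    return params_billion * GB_PER_BILLION_PARAMS


def hourly_cost(num_gpus: int, price_per_gpu: float = A100_80GB_HOURLY_USD) -> float:
    """Hourly cost assuming the price is linear in the number of GPUs."""
    return num_gpus * price_per_gpu


needed_gb = full_finetuning_memory_gb(70)   # Llama 2 70B
available_gb = 2 * 8 * 80                   # two nodes x 8 GPUs x 80 GB
full_cost = hourly_cost(16)
peft_cost = hourly_cost(1)

print(f"full fine-tuning: ~{needed_gb:.0f} GB needed, {available_gb} GB available")
print(f"full fine-tuning: ${full_cost:.2f}/h, PEFT on one GPU: ${peft_cost:.2f}/h")
print(f"hourly cost ratio: {full_cost / peft_cost:.0f}x")
```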

Quantized inference and parameter-efficient fine-tuning are only selected examples of techniques developed to optimize required memory, decrease response latency, or reduce the time and cost of training LLMs. If you want to know more about the topic, feel free to check out our previous blog post.

Figure 8. Overview of PEFT techniques, source: https://arxiv.org/pdf/2303.15647.pdf

Various PEFT techniques have been developed and can be divided into categories such as additive, selective, and reparametrization-based methods. Additive methods can be further split into two groups: adapter-like and soft prompt-based methods. In the following sections, we will briefly discuss representative examples of algorithms from these groups. Many of them are implemented as part of the PEFT library, part of the Hugging Face ecosystem, which makes them pretty straightforward to use in practice.

Low-Rank Adaptation (LoRA)

Reparameterization-based methods aim to discover low-dimensional representations of weight matrices. A prominent example of such a method available in PEFT is Low-Rank Adaptation (LoRA, Hu et al., 2021), which is currently the go-to approach for efficient fine-tuning, with new ideas still being built on top of it.

Figure 9. Illustration of LoRA, source: https://huggingface.co/docs/peft/main/en/conceptual_guides/lora

Rather than modifying the parts of the original pre-trained model during fine-tuning, LoRA introduces new weight matrices (marked in orange in the picture above), which are paired with existing parameters (blue); only these newly added weights are then updated during the training process. As LoRA is based on matrix rank decomposition, the new matrices have significantly fewer parameters in total when compared to the original model. The exact numbers vary depending on the rank hyperparameter setting, but the models resulting from LoRA fine-tuning can often be of a quality almost on par with those obtained via full fine-tuning (on much more powerful hardware, hence often even impossible to run), while modifying, e.g., less than 1% of the original parameter count. Moreover, LoRA is an algorithm designed with production environments in mind, as multiple sets of LoRA weights (so-called adapters) can be trained for different use cases, and then easily switched between, while sharing the same underlying pre-trained weights. Additionally, since the original pre-trained weights remain unchanged, the risk of the model forgetting what it had learned before is reduced.
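
The idea can be sketched for a single linear layer in NumPy. This is an illustration of the mechanism, not the PEFT library's implementation: the class name, the 4096 x 4096 layer size, rank 8, and `alpha=16` are assumptions picked for the example, and the training loop is omitted.

```python
import numpy as np


class LoRALinear:
    """A frozen weight matrix W plus a trainable low-rank update B @ A (bias omitted)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 16.0, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                  # frozen during fine-tuning
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable, random init
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scaling = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Original path plus the low-rank path; W itself is never modified.
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

    def trainable_fraction(self) -> float:
        return (self.A.size + self.B.size) / self.weight.size


pretrained = np.random.default_rng(1).standard_normal(size=(4096, 4096), dtype=np.float32)
layer = LoRALinear(pretrained, rank=8)
x = np.ones((2, 4096), dtype=np.float32)

print(f"trainable fraction: {layer.trainable_fraction():.4%}")
print("output unchanged at initialization:", np.allclose(layer.forward(x), x @ pretrained.T))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one, and swapping adapters amounts to replacing the small `A` and `B` matrices while `W` stays shared.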

QLoRA

One important extension of the LoRA method available in PEFT is called QLoRA (Dettmers et al., 2023), which is a combination of LoRA and model quantization. It extends LoRA to enhance efficiency by representing the pre-trained weights with low-precision data types, enabling even more extreme reduction of memory required for fine-tuning. In QLoRA, the frozen parameters are stored as NF4 (4-bit NormalFloat, a data type newly introduced in the QLoRA paper), but dequantized on demand, as training-related calculations and gradient updates are still performed in 16-bit precision (bfloat16 in the paper).

Figure 10. Comparison of full fine-tuning, LoRA and QLoRA, source: https://arxiv.org/pdf/2305.14314.pdf

Our experience with QLoRA

In the abovementioned recent project regarding coding LLMs (language models generating code rather than natural language), our goal was to fine-tune the foundation model to be able to write code in a previously unfamiliar programming language. We only had 48GB of vRAM available at our disposal, but as we leveraged QLoRA, we were able to experiment with fine-tuning both larger models (up to 15 billion parameters with 2048 input tokens) and increasing the context size for smaller models (up to 8192 tokens for a 1-billion-parameter model), with the latter turning out to be the game-changer in the context of coding LLMs. While it was hard to capture the improvement with standard token-level metrics like BLEU or ROUGE, which could only serve as our proxy measurement of quality, the final assessment of the generated code samples conducted by the client (an expert in this particular, highly specialized programming language) revealed that the fine-tuned model was superior to the original one. Due to the hardware and budget limitations, we were not able to perform full fine-tuning of analogous models to have a fair comparison, but our experience indicates that parameter-efficient fine-tuning techniques, especially when combined with quantization, enable the development of use-case and client-specific LLMs, even with constraints.

Other parameter-efficient fine-tuning methods

Another interesting method found in PEFT is Adaptive Low-Rank Adaptation (AdaLoRA, Zhang et al., 2023). This method builds upon the principles introduced in LoRA but utilizes a distinct form of low-rank matrix decomposition. AdaLoRA optimizes fine-tuning by allocating more trainable parameters to matrices and layers of the model found to be more important.

While LoRA and its variants are currently the most widely used among all the PEFT methods, other noteworthy alternatives exist, as mentioned above.

For example, the goal of additive methods is to enhance a model by introducing a new set of parameters or network layers. During fine-tuning, only these new parameters’ weights are updated. Two such methods are available in PEFT: the Adapters method (Houlsby et al., 2019), which involves introducing small fully-connected networks after the Transformer sub-layers, and the (IA)³ method (Liu et al., 2022), which relies on augmenting the Transformer block with additional parameters.

Another group of methods is based on the concept of prompting. Prompting directs how a language model behaves by changing the input text with a prompt, typically made up of a task description and relevant examples. There are two types of prompts: hard prompts and soft prompts. Hard prompts are manually crafted text prompts using discrete input tokens. Unlike hard prompts, soft prompts cannot be directly viewed and edited in text form. They consist of an embedding, essentially a sequence of numbers, that draws knowledge from the larger model. Soft prompts can be tuned for the input layer only (P-Tuning, Liu et al., 2021 and Prompt Tuning, Lester et al., 2021) or for all layers (Prefix Tuning, Li and Liang, 2021).
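
For the input-layer variant, the mechanism boils down to prepending a few trainable vectors to the embedded input tokens, as the NumPy sketch below shows. The vocabulary size, hidden size, number of virtual tokens, and token IDs are arbitrary values chosen for illustration, and the training of `soft_prompt` is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_virtual_tokens = 100, 16, 4

# Frozen: the embedding table of the pre-trained model
embedding_table = rng.normal(size=(vocab_size, hidden_size))
# Trainable: the soft prompt, which has no corresponding text tokens
soft_prompt = rng.normal(size=(num_virtual_tokens, hidden_size))


def build_inputs(token_ids: np.ndarray) -> np.ndarray:
    """Embed the real tokens and prepend the soft prompt vectors."""
    token_embeddings = embedding_table[token_ids]
    return np.concatenate([soft_prompt, token_embeddings], axis=0)


inputs = build_inputs(np.array([5, 17, 42]))
print(inputs.shape)  # (7, 16): 4 virtual tokens followed by 3 real tokens
```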

Summary

In this article, we discussed the recent trend in the field of generative AI – the democratization of LLMs, which is about making these text-processing models accessible to everyone by reducing the hardware (and the associated cost) requirements. Recent advances such as new model quantization and parameter-efficient fine-tuning techniques make it possible to run or fine-tune one’s own LLM on a personal computer or edge device for fun or small-scale projects. More importantly, they also mean that LLMs are becoming cheaper for businesses that want to incorporate AI into their existing products or processes, or unlock completely new ideas and functionalities previously impossible to implement.

At deepsense.ai, we have hands-on experience with applying the methods described above in commercial projects such as the development of a coding assistant for a highly specialized programming language, or an LLM-powered mobile application for field workers in the retail sector. Our experiments confirm the effectiveness and importance of these methods in turning the recent advancements in generative AI into business value.

If the article has got you interested in potential use cases of LLMs at your company, we invite you to explore our LLM Discovery Workshops offering, where you can learn more from our experts and find out how your business can benefit from cutting-edge technology.

Achieving accurate image segmentation with limited data: strategies and techniques

February 12, 2024/in Artificial Intelligence /by Sebastian Chwilczyński

Harnessing the power of deep learning for image segmentation is revolutionizing numerous industries, but often encounters a significant obstacle – the limited availability of training data. Collecting a large, diverse, and accurately annotated dataset consisting of pairs of images and corresponding segmentation masks can be time-consuming, expensive, and challenging due to privacy concerns.

Fortunately, in 2023 we underwent a minor revolution in the task of image segmentation. It all began with the Segment Anything Model (SAM) from Meta AI, followed by rapid advancements in zero- and few-shot image segmentation. These approaches aim to provide accurate segmentation without access to extensive datasets, reducing the costs and time of implementation.

In this blog post, we will explore techniques and strategies that leverage the latest advancements in the field to address the challenges of image segmentation with limited training data.

Fundamental concepts

Before delving into the methods, it is essential to refresh our knowledge regarding the concepts that will be useful for our discussions.

Image segmentation

Image segmentation involves partitioning images into multiple segments or objects. This task has applications in various fields such as medical analysis, autonomous driving, and augmented reality. Typically, we can classify segmentation tasks into three categories:

Semantic Segmentation aims to associate a label with every pixel in an image.
Instance Segmentation involves segmenting every object instance found by an object detector. Uncountable entities like the sky or grass are not segmented in this setting.
Panoptic Segmentation integrates both instance and semantic segmentation, providing a holistic understanding of an image by assigning each pixel a semantic label and an instance identifier.

Figure 1. Different segmentation types. Source: own study.

Supervised learning

Supervised learning is a widely used approach in machine learning, where algorithms are trained using a large number of input examples paired with their corresponding expected outputs. In the case of image segmentation, this involves providing raw images as input and the corresponding segmentation masks as the expected output. The algorithm learns from these examples and aims to find a function that can accurately transform the input images into the expected segmentation masks. Over the years, various successful deep learning architectures have been developed for this task, such as U-Net or SegFormer.

In numerous scenarios, communities have successfully collected large amounts of data covering the full distribution of expected inputs, leading to impressive results in image segmentation. However, in many real-world use cases, gathering a significant amount of data is infeasible, resulting in the suboptimal performance of supervised learning.

Zero-shot learning

Zero-shot learning aims to solve classification, image segmentation, and other tasks for classes that were not observed during training. Instead, we rely on descriptions of these classes or tasks. This technique relies on transferring knowledge already contained in the concepts learned during training. For example, if our training dataset includes horses and the concept of stripes, a zero-shot learning system should have the capability to recognize zebras.

Recently, the most popular approaches have utilized natural language as a proxy for describing new classes, exemplified by the CLIP model. The key idea behind CLIP is to leverage pre-training on a large amount of easily accessible Internet data consisting of images and their corresponding descriptions. The objective is to create a shared embedding space for both text and images, in which the embedding of an image’s textual description lies close to the embedding of the image itself. This is achieved through the use of contrastive learning.

Coming back to the zebra example, we could describe the concept of the zebra as follows: “A horse-like animal with black and white stripes”. The embedding of this description should be close to the images of zebras, allowing for classification based on the similarity. With common objects we can be even more direct and provide a prompt: “a photo of a zebra”.
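
Classification by similarity can be sketched as follows. The three-dimensional vectors are made-up stand-ins for the outputs of real text and image encoders, and the temperature value is an assumption; only the mechanics (normalize, compare by cosine similarity, apply softmax) are the point of the example.

```python
import numpy as np


def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray,
                       temperature: float = 0.01) -> np.ndarray:
    """Return softmax probabilities over the prompts, based on cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


prompts = ["a photo of a zebra", "a photo of a horse", "a photo of a tiger"]
# Toy embeddings, invented for illustration (not produced by CLIP)
text_embs = np.array([[0.9, 0.8, 0.1],
                      [0.9, 0.1, 0.1],
                      [0.1, 0.8, 0.9]])
image_emb = np.array([0.8, 0.9, 0.2])  # stands in for the embedding of a zebra photo

probs = zero_shot_classify(image_emb, text_embs)
print(prompts[int(np.argmax(probs))])
```

Adding a new class requires no retraining in this setup: it is enough to embed one more prompt and include it in the comparison.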

Figure 2. Illustration of the CLIP training process. Source: own study.

Few-shot learning

In the previous example, the model relied on abstract descriptions rather than labeled examples. Now, let’s explore few-shot learning, which involves having a small set of labeled images known as a support set. This support set is presented to the neural network, and the expectation is for the network to correctly classify unseen examples of the newly introduced concept.

In this scenario, when training a network with a large dataset, the objective is not simply to learn classification but rather to learn how to associate similarities and differences between objects. Few-shot learning is a bit simpler than zero-shot learning, but it still poses significant challenges. In the context of image segmentation, few-shot learning may be represented as follows:

Figure 3. Illustration of a few-shot segmentation process. Source: https://arxiv.org/pdf/2203.15712.pdf.

Segment Anything Model (SAM)

Inspired by the success of prompting techniques utilized in the field of natural language processing, researchers from Meta AI proposed the Segment Anything Model (SAM), which aims to perform image segmentation based on segmentation prompts. These prompts can take various forms, such as a point, bounding box, initial binary mask, or even text, indicating what specific area of the image to segment.

To achieve this, SAM utilizes two encoders: one for encoding the image and another for encoding the prompt. The embeddings from both encoders are then connected and passed through the segmentation decoder to generate the segmented output. This modular approach allows the exchange or fine-tuning of all three models separately. Additionally, SAM brings computational efficiency through the reuse of embeddings. Once the embedding for a given image is calculated, it can be reused for every inference on that image. Only new prompt embeddings need to be computed, resulting in a faster overall process.
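
The embedding reuse described above boils down to caching the output of the image encoder. Below is a minimal sketch, assuming the three components are available as plain callables; the class and its attribute names are hypothetical and do not mirror the API of the official package.

```python
from typing import Any, Callable, Dict, Hashable


class CachedPromptableSegmenter:
    """Runs the (expensive) image encoder once per image and reuses the result."""

    def __init__(self, image_encoder: Callable[[Any], Any],
                 prompt_encoder: Callable[[Any], Any],
                 mask_decoder: Callable[[Any, Any], Any]) -> None:
        self.image_encoder = image_encoder
        self.prompt_encoder = prompt_encoder
        self.mask_decoder = mask_decoder
        self._cache: Dict[Hashable, Any] = {}
        self.image_encoder_calls = 0

    def segment(self, image_id: Hashable, image: Any, prompt: Any) -> Any:
        # Compute the image embedding only on the first request for this image.
        if image_id not in self._cache:
            self._cache[image_id] = self.image_encoder(image)
            self.image_encoder_calls += 1
        # The prompt embedding is cheap and recomputed for every prompt.
        prompt_embedding = self.prompt_encoder(prompt)
        return self.mask_decoder(self._cache[image_id], prompt_embedding)


# Stand-in components, only to demonstrate the call pattern.
segmenter = CachedPromptableSegmenter(
    image_encoder=lambda image: sum(image),
    prompt_encoder=lambda prompt: prompt * 2,
    mask_decoder=lambda image_emb, prompt_emb: image_emb + prompt_emb,
)
first = segmenter.segment("img-1", [1, 2, 3], prompt=1)
second = segmenter.segment("img-1", [1, 2, 3], prompt=5)
```

Because each component is passed in separately, any of the three can be swapped without touching the others, which reflects the modularity mentioned above.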

Figure 4. Illustration of Segment Anything Model (SAM) architecture (left) and examples of point and box inference (right). Source: https://segment-anything.com/.

Methods

LangSAM and Grounded Segment Anything

After the introduction of SAM, several projects, such as LangSAM and Grounded SAM, emerged with the goal of improving text-based prompts. The pipeline involves several steps. First, an image-text pair is processed by Grounding DINO, a zero-shot object detector that detects objects within the image and returns a set of bounding boxes. These bounding boxes, along with the corresponding image, are then fed into SAM, which generates segmentation masks.

Note that we still have to provide a textual description. The process can be automated further by incorporating automatic image tagging with modules like RAM or Tag2Text. An image tagger produces the image description automatically, which is then passed to Grounding DINO. Grounded SAM provides many other nice features, like inpainting or voice-based prompts, and supports various versions of SAM, such as EfficientSAM and EdgeSAM. Unfortunately, this approach has one drawback – it is multistage. If an earlier stage fails, the later stages have no way to recover.

When using Grounded SAM, there are several important aspects to consider:

Types of objects: Grounded SAM performs exceptionally well with common objects like umbrellas or cars. However, it may struggle with specialized objects, such as transistors on a circuit board. Additionally, keep in mind that object detection runs before segmentation, which makes it very hard to segment patterns rather than objects.
Prompt engineering: the provided prompt plays a crucial role, especially when dealing with compound nouns. By using “car lamp” as a prompt, we are very likely to detect cars instead of car lamps. Using “headlight” as a prompt may be better. Moreover, remember to separate classes with a dot, which is treated as a sentence separator. For example, the prompt “egg. banana. apple. orange.” should be used instead of “egg, banana, apple, orange.”
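
The dot-separated prompt format is easy to get wrong when class names come from a list or a config file. A small helper such as the one below (the function name is hypothetical, it is not part of Grounded SAM) can build the prompt consistently:

```python
from typing import Iterable


def build_detection_prompt(class_names: Iterable[str]) -> str:
    """Join class names into a prompt where every class ends with a dot."""
    # Strip surrounding spaces, commas and dots so that each class is
    # terminated by exactly one dot.
    cleaned = (name.strip(" .,") for name in class_names)
    return " ".join(f"{name}." for name in cleaned if name)


prompt = build_detection_prompt(["egg", " banana,", "apple.", "orange"])
print(prompt)
```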

Figure 5. Comparison of model outputs for different prompts. On the left “car lamp” is used as a prompt, and on the right “headlight” is used. Source: own study.

SEEM

While previous methods attempted to incorporate the original SAM into larger pipelines, numerous approaches have also emerged with the aim of improving the raw SAM. Examples include SAM-HQ, SemanticSAM, and SAM-Adapter, each targeting different aspects of the method.

One method that particularly caught our attention was SEEM (Segment Everything Everywhere All at Once), which extends SAM by introducing more types of prompts such as scribbles, audio, and images. SEEM also enhances text prompt handling and provides additional semantic labels. As in SAM, we can mix these prompts freely.

From our perspective, the most exciting capability of SEEM is its segmentation based on an exemplary image. This feature enables one-shot inference, where a reference image with the desired object’s mask is provided only once. Subsequently, any number of images containing the desired object can be processed without the need for supervision or additional prompting.

Figure 6. Demonstration of SEEM inference. From the left: our object of interest, accompanied by a roughly sketched mask. Next, a query image to be segmented. Finally, the resulting segmentation, along with additional classification information. Source: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.

SegGPT

Many successful approaches from NLP are now being translated into computer vision. For instance, the analogue of the masked token prediction task used to train BERT is known as masked image modeling in computer vision. Similarly, the capability of solving multiple tasks through next-token prediction in NLP can be transferred to CV by using image inpainting, the goal of which is to reconstruct missing regions in the image. And, of course, there is the concept of few-shot inference, where one provides a few examples and then asks the model to solve a new example based on the context. It sounds like ChatGPT for images, and it is actually named SegGPT.

While training such a model is a complex topic on its own, using it for inference is relatively straightforward. First, we provide at least one example of how the task should be solved, consisting of an input image and its corresponding expected image mask. Then, we prompt the model with an input image and an empty image, and the model is expected to inpaint a mask on the empty image based on the previously shown examples.

Figure 7. Comparison of few-shot inference between NLP and CV. Source: own study.

Unfortunately, the number of examples we can include in the SegGPT prompt is limited, as we can simply run out of GPU memory. What if we were able to collect 50 images? Since we have already seen three inspirations from NLP, let’s go further and try to translate two more concepts.

The first concept is prompt engineering. In NLP, this refers to finding the most effective text to feed the Large Language Model for enhanced performance. Analogously, in SegGPT, we can find the most effective image prompts from the training set. We can employ an approach similar to forward variable selection, commonly used in classic Machine Learning. Starting with an empty prompt, we expand it iteratively: at each step we add the image that yields the largest increase in accuracy on the validation set. We continue until there is no further improvement or until we reach the capacity of our GPU memory.
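
The greedy selection described above can be sketched as follows. This is a simplified illustration, not the exact code we used: score_fn stands for "run the model with this prompt and measure accuracy on the validation set", and max_prompt_size is a stand-in for the GPU memory limit. Both names, and the toy scoring function, are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_prompt_selection(candidates: Sequence[str],
                            score_fn: Callable[[List[str]], float],
                            max_prompt_size: int) -> Tuple[List[str], float]:
    """Forward selection: repeatedly add the candidate that helps the most."""
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_prompt_size:
        # Evaluate every remaining candidate when appended to the prompt.
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the validation score
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score


# Toy stand-in for validation accuracy: the fraction of five validation
# cases that at least one prompt image "covers".
coverage = {"img_a": {1, 2, 3}, "img_b": {3, 4}, "img_c": {1, 2}, "img_d": {5}}


def toy_score(prompt: List[str]) -> float:
    covered = set().union(*(coverage[name] for name in prompt))
    return len(covered) / 5


selected, score = greedy_prompt_selection(list(coverage), toy_score,
                                          max_prompt_size=3)
```

Note that every step evaluates all remaining candidates, so the number of model runs grows quickly with the size of the training set. In practice it may be worth evaluating on a subset of the validation data.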

The second approach is reminiscent of Retrieval Augmented Generation (RAG), where we aim to provide the best matches for the actual query. In this case, we compute embeddings for all training images using any image encoder of choice. During inference, we compare the input image embedding with the precalculated embeddings and use the most similar training image as the prompt.
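
A minimal sketch of this retrieval step is shown below, assuming the embeddings are already available as NumPy arrays. The helper names and the 2-dimensional toy vectors are illustrative; cosine similarity is one reasonable choice of similarity measure, not the only one.

```python
import numpy as np


def normalize_rows(embeddings: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so dot products are cosine similarities."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)


def retrieve_prompts(query_emb: np.ndarray, train_index: np.ndarray,
                     k: int = 1) -> list:
    """Return indices of the k training images most similar to the query."""
    query = query_emb / np.linalg.norm(query_emb)
    sims = train_index @ query
    return np.argsort(-sims)[:k].tolist()


# Precompute once for the whole training set (toy embeddings here).
train_index = normalize_rows(np.array([[1.0, 0.0],
                                       [0.0, 1.0],
                                       [0.7, 0.7]]))

# At inference time, embed the input image and look up the best prompts.
nearest = retrieve_prompts(np.array([0.9, 0.1]), train_index, k=2)
```

With k greater than 1, the retrieved images can be combined into a multi-example prompt, as long as they fit into GPU memory.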

The authors of SegGPT also propose a third strategy – learning the optimal prompt from the training data. This involves freezing the entire pretrained model and optimizing only the input context tensor.

PerSAM

Can we obtain good point prompts without human interaction? This is the question PerSAM tries to answer. Given a training instance (image, mask), it calculates the embedding of the masked region. During inference, it compares the embedding of the input with the one it has precalculated and looks for a region that is most similar to the one from the training image.

Once this region is found, a point is placed there and the decoding stage of SAM is run. But this is not the end. After we obtain our first mask, we pass it back to SAM as a prompt so that it can refine the result. In the third pass, we calculate the bounding box of the returned mask and ask SAM to refine once again. If you are interested in this approach, two more studies try to improve upon it: Matcher and SamAug.
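
The two geometric steps above, locating the point prompt and deriving a box from a mask, can be sketched with NumPy. This is a simplified illustration rather than the PerSAM implementation: we assume feature maps of shape (H, W, C), use the mean feature of the masked region as the target embedding, and pick the single most similar location. All function names are ours.

```python
import numpy as np


def target_embedding(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the feature vectors inside the reference mask."""
    return features[mask].mean(axis=0)


def locate_point_prompt(features: np.ndarray, target: np.ndarray) -> tuple:
    """Return (row, col) of the location most similar to the target."""
    feats = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    target = target / (np.linalg.norm(target) + 1e-8)
    similarity = feats @ target  # cosine similarity map, shape (H, W)
    row, col = np.unravel_index(np.argmax(similarity), similarity.shape)
    return int(row), int(col)


def mask_to_box(mask: np.ndarray) -> tuple:
    """Bounding box (x_min, y_min, x_max, y_max) of a binary mask."""
    rows, cols = np.nonzero(mask)
    return int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())


# Toy feature maps: background is [0, 1], the object of interest is [1, 0].
ref_feats = np.zeros((4, 4, 2))
ref_feats[..., 1] = 1.0
ref_mask = np.zeros((4, 4), dtype=bool)
ref_mask[1:3, 0:2] = True
ref_feats[ref_mask] = [1.0, 0.0]

query_feats = np.zeros((4, 4, 2))
query_feats[..., 1] = 1.0
query_feats[2, 3] = [1.0, 0.0]  # the object appears here in the new image

point = locate_point_prompt(query_feats, target_embedding(ref_feats, ref_mask))
box = mask_to_box(ref_mask)
```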

Figure 8. A diagram illustrating the operation of PerSAM. Source: own study.

ClipSeg

CLIP was created to combine knowledge of language concepts with semantic knowledge of images via embeddings. It soon became a vital part of models such as DALL·E 2 or Stable Diffusion. It is no surprise that researchers also tried to utilize it for image segmentation; this is how ClipSeg emerged. This method can work in both zero- and one-shot scenarios, taking an image, text, or both as a prompt. During inference, it calculates embeddings of the prompt with a CLIP encoder and passes them to a transformer-based segmentation decoder. The biggest advantage of this method lies in its simplicity – it doesn’t parse prompts in a complicated way; it is end-to-end; and its backbone is a very well-known model.

Figure 9. The ClipSeg pipeline involves the initial embedding of both the prompt and query into a unified representation, which is then input to the decoder. Source: https://arxiv.org/pdf/2112.10003.pdf.

Comparison

To facilitate the comparison of the models, we have prepared a table highlighting the differences between the presented methods. Each method is compared against the base Segment Anything Model (SAM), with each row representing a separate method and the columns representing the different capabilities of the models:

Zero-shot: The model is capable of zero-shot inference given text, point, or bounding box inputs.
Few-shot: The model is capable of making inferences given a few examples of how the task should be solved.
Text: The model accepts text prompts.
Image: The model accepts image prompts.
More prompts: The model extends the set of prompts available in SAM.
Prompt engineering: The model allows prompt engineering to improve performance.
Instance segmentation: The model can perform the task of instance segmentation.
End-to-end: There is only one stage of processing, and input is transformed directly into a mask.
Modular: The model’s components can be easily exchanged without requiring retraining.

Figure 10. Comparison of all introduced methods from different viewpoints. Source: own study.

We also performed a quantitative comparison on the task of umbrella segmentation, with the goal of determining the amount of training data required for the supervised model to achieve comparable accuracy to the zero-shot and few-shot learners. To evaluate the performance, we utilized the mean intersection over union (mIoU) metric, which is commonly used in image segmentation tasks. The mIoU metric quantifies the overlap between the model’s predictions and the ground truth labels. A value of 1 indicates perfect overlap between prediction and ground truth, while a value of 0 indicates no overlap at all.
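
For reference, the metric can be computed as in the sketch below. It assumes binary masks for a single class and averages the IoU over images; the handling of two empty masks (counted as a perfect match) is our convention for this example, and other evaluation scripts may treat that case differently.

```python
import numpy as np


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match here
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds: list, targets: list) -> float:
    """Average IoU over a list of (prediction, ground truth) pairs."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))


# The prediction covers two pixels, the ground truth only one of them.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])

half_overlap = iou(pred, truth)
average = mean_iou([pred, truth], [truth, truth])
```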

Figure 11. Some examples from the dataset used for evaluation. Source: own study.

Figure 12. Experimental results for the umbrella dataset. Observe that we need thousands of instances to match the performance of zero-shot models. Source: own study.

Conclusions

The release of the Segment Anything Model has brought about a revolution in addressing data scarcity in image segmentation. Our experiment reveals that in certain cases, we can surpass the performance of a model trained on thousands of examples with absolutely no data. But what if objects and concepts to be segmented are not recognized by the zero-shot learners? In such scenarios, we can turn to the few-shot models with a little more effort. Exciting, isn’t it? So the next time you face an image segmentation problem, before spending weeks on data collection, spend a day exploring the techniques presented in this blog post, and hopefully the problem will be solved.

6 AI predictions for 2024 from 6 deepsense.ai experts

December 20, 2023/in Artificial Intelligence /by deepsense.ai

In the world of AI, change is the only constant. The field is evolving at an unprecedented pace, making it extremely challenging for companies and decision-makers to stay ahead of the curve and to keep up with the technical advancements being released day after day. That’s why the ability to adapt and predict future trends for businesses is no longer an option, but a necessity.

Recognizing this high-pressure scenario, six experts from deepsense.ai have taken on the challenge of forecasting the state of AI in 2024. Through rigorous analysis, they spotlight potential developments in six key AI domains, delivering indispensable insights to help savvy leaders prepare for the future.

So, are you ready to uncover what the future holds for AI in 2024? Let’s dive right in!

1. Edge AI – Michał Tadeusiak

In the dynamic realm of artificial intelligence (AI), edge devices are emerging as an enabling force, revolutionizing language communication, shaping the metaverse, and empowering industries. By bringing AI capabilities closer to where data is collected, edge devices enable real-time decision-making, enhanced privacy, and improved scalability.

The field of LLMs is witnessing significant advancements, while companies like Apple, Qualcomm, and Google, by providing platforms like MLX, Snapdragon, and Gemini Nano, respectively, are supporting their adoption closer to users. Meanwhile, the AI community, through initiatives like llama2.c, is diligently working to enable the use of LLMs on edge devices. This on-device capability enhances privacy and scalability, ensuring that sensitive user data remains secure and accessible locally. This effort aims to make intelligent assistants more ubiquitous, offering sophisticated and responsive user interactions on a wide range of devices.

The metaverse is rapidly evolving, enabled by edge AI applications and powerful platforms to support them. Edge AI-powered augmented reality (AR) applications are bridging the gap between the digital and physical worlds, bringing virtual elements into our real-world experiences. Headsets like Apple’s Vision Pro, anticipated for release in 2024, and Meta’s recently launched Quest 3 are spearheading this transformation. These headsets generate real-time overlays of digital information onto the real world, creating immersive and interactive experiences. With the recent developments in the area of 3D scene reconstruction, such as NeRFs and Gaussian Splatting, the future of augmented reality looks brighter than ever before.

While IoT (Internet of Things) and similar edge technologies have been around for some time, edge AI is introducing new possibilities and applications across various industries. In sectors like drones, robotics, and wearables, its influence is marked by the ability to run locally. Edge AI is particularly effective in predictive maintenance, utilizing sensor data for proactive upkeep, reducing repair costs and downtime. Retail is also experiencing a transformation with edge AI, where automated checkout systems and AR-enhanced in-store navigation are redefining the shopping experience. The automotive industry stands out in its adoption of edge AI, especially in developing autonomous vehicles. A prime example is Tesla’s Full Self-Driving (FSD) beta program, which leverages deep learning models on edge devices for advanced autonomous driving features. This innovation highlights edge AI’s capability in executing real-time, complex tasks, setting the stage for the widespread use of self-driving cars in the near future.

Edge AI is rapidly becoming a foundational technology across multiple sectors. Its transformative applications in language, the metaverse, and industries demonstrate its versatility and ability to enable significant advancements. 2024 will see the rise of edge AI applications which are poised to play an even more integral role in shaping our digital interactions, enhancing operational efficiencies, and empowering users with intelligent and personalized experiences.

Michał Tadeusiak, Director of AI

2. Large Language Models – Mateusz Wosiński

Unless you have been locked in a closet for about a year, you no doubt know that we are living in the era of Large Language Models (LLMs). Although the first such solutions were developed as far back as 2019 (e.g. GPT-2 from OpenAI), the release of the famous ChatGPT in November 2022 was arguably the biggest breakthrough. At that moment, the revolution began, and almost everybody got extremely hyped about this technology.

Next year is expected to be a year of transitioning LLM-based applications from research to production. Those who identify the opportunities first will benefit the most.

Mateusz Wosiński, Senior Machine Learning Engineer

Since the debut of our all-new favorite web tool, we have seen several models attempting to further improve its outstanding quality and mitigate the most crucial drawbacks – hallucinations, cost and data privacy. All the tech giants have joined the race, each taking a different approach:

Meta released an open-source family of models called LLaMA,
Google delivered PaLM and LaMDA models which are accessible via a ChatGPT-like assistant called Bard,
Lastly, Microsoft, or to be more exact OpenAI, which received a multi-year, multi-billion dollar investment from that giant of the industry, refined its previous models with a multi-modal GPT-4, capable of understanding not only text, but also images.

Apart from that, we have observed numerous examples from academia (an open-source Alpaca model, which is basically a fine-tuned LLaMA, from Stanford AI lab being the most notable one) and newly emerging companies (e.g., Anthropic, which released a new model Claude, designed to be “helpful, harmless and honest”).

But what does it all mean for business? An endless stream of possibilities! LLMs are such a revolutionary technology that the majority of companies have not yet figured out the possible use-cases. However, there certainly are plenty of them. At deepsense.ai, we have already delivered a couple of projects leveraging such solutions, including the Frontline Worker Assistant, which provides specific instructions based on internal knowledge bases, or Interactive Document Explorer, which strongly accelerates the process of understanding complex PDFs. What’s more, we collaborate closely with LangChain, the leading framework for creating LLM-powered applications. Our team was responsible for some of its most important features concerning data privacy and app security. As a result, deepsense.ai was awarded the prestigious title of official LangChain partner. And we are just getting started!

Next year will undoubtedly surprise us all with further impressive model advancements, but I expect it to be specifically a year of implementing LLMs into full-scale production. And those who miss the wave may fall far behind.

3. 3D scene reconstruction – Konrad Czarnota

3D scene reconstruction based on camera images is currently a focal point in the AI community. The ability to digitize real-world items and bring them to the virtual world has become reality.

The current trend sees a shift from special devices with dedicated hardware, such as LIDAR, to basically any smartphone capable of recording videos. Recent developments of NeRFs (Neural Radiance Fields) and Gaussian Splatting methods have streamlined the entire process. As a consequence, the required GPU memory and training times have significantly decreased, making these algorithms more accessible.

In the upcoming years, scenes created using either Gaussian Splatting or NeRFs are likely to achieve mainstream popularity. Their potential to seamlessly integrate real-world scenes into the virtual realm promises a thrilling business opportunity that’s hard to ignore.

Konrad Czarnota, Senior Data Scientist

These advancements have led to the successful application of 3D scene reconstruction in numerous industries. Let’s take a closer look at a few of them. E-commerce businesses, for instance, can now generate 3D views of their products much faster and, importantly, at a much lower cost. Recent advancements have also brought a significant change to the special effects industry, enabling the creation of complex scene representations of real-world buildings, all based on drone-captured footage. Companies that manage large stadiums can now produce views from each individual seat automatically, a feature that dramatically enhances ticket sales. The entire entertainment industry is gearing up for the possible incoming personalization opportunities these advancements can offer for users wishing to import physical items into the virtual world.

Here at deepsense.ai, we’re at the forefront of innovation. We’ve been hands-on, experimenting with the latest breakthroughs in 3D scene reconstruction, even taking our own office as a testing ground, which has led to some interesting conclusions. In our exploration, we discovered that Gaussian Splatting offers superior visual effects but grapples with certain challenges in early stage development, particularly the lack of support for some tools. On the other hand, NeRFs have evolved considerably over the past few years, providing a stable and well-supported set of tools. However, they could occasionally produce more visible artifacts, such as mist-like or blurred areas.

I strongly believe that in 2024 Gaussian Splatting will quickly surpass NeRFs to become the most sought-after solution for novel view synthesis. Meanwhile, NeRFs themselves may hone their focus on highly specialized use-cases, such as few-shot scene generation derived from just a handful of images. As we move forward, we anticipate a steady increase in scenes created using these techniques. In the coming years, they’ll likely go mainstream. Their potential to seamlessly integrate real-world scenes into the virtual realm promises a thrilling opportunity that’s hard to ignore.

4. Diffusion models – Maciej Domagała

Without a doubt, diffusion-based models have taken the computer vision-related GenAI scene by storm. These models are stripped of the limitations one might experience with typical GAN-based and VAE-based applications. As of 2023, there are two strong observable trends propelling each other forward. OpenAI’s series of groundbreaking DALL·E models are often referred to as a trendsetter in terms of quality for text-to-image solutions. On the other hand, we can observe the virtually limitless application of diffusion models, thanks to the publication of the open-source powerhouse Stable Diffusion.

The global availability of the latter has resulted in a much faster development of task-specific architectures for, e.g., inpainting or video-from-image rendering. Transfer learning and domain adaptation are thriving thanks to sharing services such as HuggingFace or Civitai. This is a huge benefit for companies as it brings them several steps closer to incorporating many of the latest models directly into their workflows. The recent surge in the development of multi-modal methods is clearly visible in the domain adaptation field. New state-of-the-art structures, such as ControlNet, are utilizing numerous types of inputs (both text and image-related) to generate customized output.

As the quality of the models rises, we expect to see more automation happening in the near future to make these convoluted architectures even more accessible to businesses. This trend has already begun, as, for instance, the newest DALL·E 3 is supported natively by ChatGPT, which – putting aside the usage-related costs – makes it a viable option for organizations trying to maximize the impact of innovation on their businesses. Beyond the general accessibility provided by diffusion models, there is also the generation speed aspect. This year’s SDXL Turbo enables high-quality single-step generation which allows for near-instant image generation. All of these new ideas consistently make the field interesting and ever more exciting. At deepsense.ai, we like to keep up with the leading advancements and we are certain that 2024 will bring a lot of interesting research!

In 2024, we expect even greater abstraction of the current architectures, allowing users to enhance seamless finetuning of custom domain models. Open-source diffusion-based solutions (among others) will benefit from the global multi-modality trend, which will help to unleash their potential.

Maciej Domagała, Senior Machine Learning Engineer

5. LLMOps – Mateusz Hordyński

The rapid rise in popularity of Large Language Models (LLMs) has undeniably revolutionized our approach towards machine learning project development. These models have enabled us to blueprint and bring to life product ideas at unprecedented speed. Prompt engineering has pushed back the heavy lifting, such as data preparation or model training, to the later stages of development, allowing teams to focus earlier on innovative and creative aspects of their projects. Essentially, this shift has empowered teams that were previously unable to use AI in their products due to technical or resource limitations to do so. This has given rise to countless PoCs, demos and AI startups over the last year. However, we’re still in the early days of LLM adoption – getting them to do meaningful work is yet to happen. That is precisely what’s going to happen in the LLMOps community in 2024.

In 2024, a transition awaits LLMOps as we move from LLM-powered PoCs to production-grade systems, enabling companies to create reliable and profitable products. To achieve this, we need the emergence of more performant inference serving, observability, and security tools. Additionally, tools to distill LLM knowledge into smaller, more efficient specialist models are yet to be developed.

Mateusz Hordyński, Technical Leader

In the upcoming months, inference serving in LLMOps will see significant enhancements. The core trend in this area will focus on achieving scalable and efficient inference deployments while maintaining high model performance. Given that LLMs are very demanding and often require huge amounts of memory and storage, it is crucial to optimize usage of those resources. The popularity of more advanced quantization methods is expected to rise, aiming to reduce overall model sizes and, consequently, lower latency. There is potential for popular inference serving libraries to further optimize model performance by leveraging more sophisticated attention algorithms as well as tensor and data parallelism methods. Another cost-saving method involves serving multiple fine-tuned specialist adapters on top of a single shared foundation model, thereby minimizing overhead.
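
To give a feeling for how quantization reduces model size, below is a minimal sketch of symmetric per-tensor int8 quantization, one of the simplest schemes. It is not the method used by any particular serving library, and the more advanced methods mentioned above are considerably more elaborate; the function names are ours.

```python
import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple:
    """Map float weights to int8 using a single scale for the whole tensor."""
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return quantized.astype(np.float32) * scale


# Random float32 matrix standing in for one layer's weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

quantized, scale = quantize_int8(weights)
max_error = float(np.abs(dequantize(quantized, scale) - weights).max())
```

The int8 tensor takes a quarter of the memory of the float32 original, at the price of a rounding error of at most half the scale per weight.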

Observability and security will also be key aspects of LLMOps trends in 2024. Increased visibility of model behavior will become more critical to ensure the robustness and reliability of AI systems. The development of tools that provide transparency into how LLMs make decisions, monitoring model performance in real-time, and identifying any drifts or anomalies in model predictions will be emphasized more strongly. Furthermore, we will likely see a surge in the adoption of LLM-related security tools, and observable systems will enable us to analyze traffic against popular attack surfaces, such as prompt injections.

Lastly, a very interesting concept is to replace large generalist models with smaller, more specialized versions for specific tasks. Techniques like LLM distillation – training smaller models using LLMs – may significantly increase in popularity. This area also can greatly benefit from more advanced tooling – for supervising the learning process, gathering labels, and sourcing reasoning data from LLMs.

6. Coding Agents – Maks Operlejn

AI tools are revolutionizing the programming landscape, enhancing efficiency and quality in software development. According to a study, GitHub Copilot (a code autocompletion tool) has sped up the work of software developers by as much as 55%. Programmers themselves also believe that code quality has improved.

In 2023, LLMs like GPT-4 have transformed how we interact with coding resources. It’s undeniable that GPT has at least partially replaced the good old Stack Overflow in the coding process, if not completely. But what if AI could do more than just help with code snippets? What if it could build entire code repositories from nothing more than a specification of the objective? That’s where Coding Agents come into play:

The user specifies the goal (e.g., “create a system to manage the company’s inventory”), adds the required technical specification (such as “use the following technologies: […]”) and provides any necessary information (like “users should be able to create an account via activation email”).
AI-driven assistants will plan, prioritize, and generate full-scale code in line with specified goals and technical requirements. They can craft code, conduct on-the-fly testing, and refine with real-time ‘Reflection’ mechanisms, while still allowing human collaboration through feedback and manual code enhancements.
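
The generate, test and reflect cycle described above can be sketched as a simple loop. This is a schematic illustration only: generate and run_tests are hypothetical callables standing in for an LLM call and a test runner, and real agents add planning, prioritization and human feedback on top.

```python
from typing import Callable, List, Optional, Tuple


def run_coding_agent(goal: str,
                     generate: Callable[[str, str], str],
                     run_tests: Callable[[str], Tuple[bool, str]],
                     max_iterations: int = 3
                     ) -> Tuple[Optional[str], List[Tuple[int, bool]]]:
    """Generate code, test it, and feed failures back until tests pass."""
    feedback = ""
    history: List[Tuple[int, bool]] = []
    for attempt in range(1, max_iterations + 1):
        code = generate(goal, feedback)
        passed, report = run_tests(code)
        history.append((attempt, passed))
        if passed:
            return code, history
        feedback = report  # "reflection": the next attempt sees the failure
    return None, history


# Stubs that mimic a model which fixes its mistake after seeing feedback.
def stub_generate(goal: str, feedback: str) -> str:
    return "return a + b" if feedback else "return a - b"


def stub_tests(code: str) -> Tuple[bool, str]:
    ok = code == "return a + b"
    return ok, "" if ok else "expected add(2, 3) == 5"


code, history = run_coding_agent("implement add", stub_generate, stub_tests)
```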

Despite their potential, Coding Agents face two main challenges:

Need for Current Data: While GPT-4 is excellent for code generation, its training data only goes up to 2021 – a sizable gap in the ever-evolving tech world. This leads to concerns about outdated code and incompatibilities.
Limited Prompt Length: A code repository could encompass hundreds of files. Conveying comprehensive context to ensure that AI-generated code integrates seamlessly with existing systems is a significant hurdle.

Both problems are addressed in various ways by users (for example by using RAG systems), but companies are also identifying and fixing models’ weaknesses. Not long ago, a new version of GPT-4 Turbo came out which increases the context size and includes data up to April 2023.
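
One common way users work around the limited prompt length is to retrieve only the most relevant files and pack them into a fixed token budget. The sketch below illustrates the idea under loud simplifications: relevance is plain keyword overlap rather than embedding similarity, tokens are approximated by whitespace-separated words, and all names and file contents are made up.

```python
from typing import Dict, List, Tuple


def approx_tokens(text: str) -> int:
    """Crude token estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def relevance(query: str, text: str) -> float:
    """Fraction of query words that also occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def pack_context(query: str, files: Dict[str, str],
                 token_budget: int) -> Tuple[List[str], int]:
    """Pick the most relevant files that still fit into the budget."""
    ranked = sorted(files.items(),
                    key=lambda item: relevance(query, item[1]), reverse=True)
    chosen: List[str] = []
    used = 0
    for path, text in ranked:
        cost = approx_tokens(text)
        if used + cost <= token_budget:
            chosen.append(path)
            used += cost
    return chosen, used


# Hypothetical repository contents.
files = {
    "inventory/models.py": "class Item stores inventory stock quantity",
    "auth/email.py": "send activation email to new account",
    "README.md": "project overview and setup notes",
}
chosen, used = pack_context("add activation email for account signup",
                            files, token_budget=8)
```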

Beyond the realm of Coding Agents that operate using mainly LLMs, there is an additional frontier where graphical prototypes are being transformed into code. A key innovator in this field is Figma, which is already conducting trials on converting User Interface (UI) design into workable code via AI. This approach bridges the gap between designers and developers, thus promoting a more integrated and collaborative workflow.

Coding Agents are still at the early stages, with a lot of growth expected before they’re ready for widespread business use. In 2023, they might seem like fun toys to play with. But looking ahead to 2024, there’s a strong possibility that they will become more practical and reliable tools for developers. Let’s ponder a future where AI can take over a variety of roles – planning projects like a Product Owner (and putting them in Jira), breaking down tasks like an Analyst (describing Jira issues), and even managing code submissions and testing (via GitHub). While this concept might seem as though it belongs in the realm of science fiction and people are just experimenting with prototypes now, one trend is clear: software developers are going to rely more on AI for coding assistance, and will spend more of their time reviewing and approving the AI’s work.

AI tools can significantly accelerate developers’ work and have a direct effect on business success. The focus in programming is sometimes shifting more towards “reviewing” code rather than “writing” it from scratch. While fully autonomous Coding Agents are still the wave of the future, keeping up to date with current AI-based tool advancements is essential to avoid missing out on key opportunities.

Maks Operlejn, Machine Learning Engineer

It’s widely recognized now that AI-assisted tools are essential for modern developers. Those who don’t adopt these tools risk falling behind, as their productivity may dwindle. The reality is, with the aid of AI, businesses can significantly accelerate their software development process. Here at deepsense.ai, keeping up to date with AI technology is in our DNA, and these tools are integral to our daily operations. This article delves into existing agent coding solutions, and excitingly, a follow-up article showcases our venture into developing a proprietary agent – we invite you to explore our findings and innovations!

Summary

As we navigate this rapidly unfolding technological revolution, the challenge lies not only in understanding and keeping up with the rapid advancements, but also in strategically harnessing these developments to drive innovation, growth, and competitiveness. As AI continues to break boundaries and redefine possibilities, it’s an exciting time to be part of this transformative journey.

The future of AI may seem like venturing into the unknown, but with expert insights, we can better prepare for what’s to come in order to avoid being left behind by competitors. So let’s embrace the forthcoming changes and boldly step into the innovative, AI-driven world of 2024!

Data generation with diffusion models. Part 3: Generating custom data in the blink of an eye

December 5, 2023/in Generative AI /by Natalia Czerep, Marianna Parzych and Piotr Banasiński

It’s time to wrap up our work on data generation using diffusion models. Previously we laid the foundation for this by introducing the concept and providing a quick overview of promising methods. Then, in the second part, we focused on obtaining images along with semantic segmentation maps. In this blog post, we would like to touch on the topic of methods which allow supplementary inputs.

Generating images based on additional conditions

Using diffusion models can lead to creative, imaginary outputs. However, to use this data in real-world business projects, additional information is also required, whether it is the composition of the objects, the size, or even the overall appearance. For this reason, it could be helpful to incorporate another input besides the prompt and feed it to the network, so that it provides guidance on what we expect. At deepsense.ai we experimented with the most popular methods for guided image generation, and we are excited to share our thoughts.

Stable Diffusion Inpainting

Let’s start with a method that originates from the Stable Diffusion paper [1]. The Inpainting pipeline allows you to edit specific parts of an image by providing a mask and a text prompt. This allows the user to erase and replace parts of the picture. The inpainting functionality of Stable Diffusion relies on a modified UNet architecture. This specialized network incorporates five additional input channels: four dedicated to the encoded masked image and one specifically for the mask itself.
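
As a rough illustration of the channel layout described above, the sketch below stacks the three inputs along the channel axis using plain nested lists. The helper name `make_channels`, the 8x8 spatial size, and the concatenation order are assumptions made for this example; only the channel counts (four latent channels, four for the encoded masked image, one for the mask) come from the description.

```python
def make_channels(count, height, width, fill):
    """Build `count` channels, each a height x width grid filled with `fill`."""
    return [[[fill] * width for _ in range(height)] for _ in range(count)]

height, width = 8, 8  # illustrative latent resolution, not the real one

noisy_latents = make_channels(4, height, width, 0.1)          # regular UNet input
mask = make_channels(1, height, width, 1.0)                   # 1 = region to repaint
masked_image_latents = make_channels(4, height, width, 0.3)   # encoded masked image

# Channel-wise concatenation: 4 original + 5 additional = 9 input channels.
unet_input = noisy_latents + mask + masked_image_latents
additional_channels = len(mask) + len(masked_image_latents)

print(len(unet_input), additional_channels)
```

The only architectural change this implies is a wider first convolution in the UNet; the rest of the network sees feature maps of the usual shape.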

Fig. 1 Images generated using a Stable Diffusion Inpainting model with the prompt “a blue robot on a bench”

ControlNet

Sound familiar? Yes, we have mentioned the ControlNet architecture [2] before in our first blog post of this ‘Data generation with diffusion models’ series – you can check it out here [3]. But we are coming back to this topic with more experience and carefully considered conclusions.

ControlNet modifies diffusion models by adding a component ready to be trained with additional inputs. One copy of the encoder is frozen. It carries the wisdom of the original network, gained from studying billions of images. This ensures the preservation of the network’s incredible capabilities. There is no need to train this part of the network. The trainable copy of the encoder learns conditional control so that it is possible to direct outputs with segmentation masks, key points, edges, etc. The outputs from trainable and frozen neural network blocks are connected with a custom layer called “zero convolution” and passed to the remaining parts of the stable diffusion model.
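
To make the role of the “zero convolution” more concrete, here is a minimal sketch, assuming a 1x1 convolution whose weights and bias start at zero and whose output is added to the frozen block’s output. The function names and the toy feature values are hypothetical, and feature maps are flattened to a list of per-pixel channel vectors for brevity.

```python
def conv_1x1(pixels, weight, bias):
    """Apply a 1x1 convolution: a linear map over the channels of each pixel."""
    return [
        [sum(w * x for w, x in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in pixels
    ]

def combine_with_zero_conv(frozen_out, trainable_out, weight, bias):
    """Add the zero-conv-projected trainable features to the frozen features."""
    control = conv_1x1(trainable_out, weight, bias)
    return [
        [f + c for f, c in zip(frozen_pixel, control_pixel)]
        for frozen_pixel, control_pixel in zip(frozen_out, control)
    ]

channels = 3
zero_weight = [[0.0] * channels for _ in range(channels)]  # zero-initialized
zero_bias = [0.0] * channels

frozen_out = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]     # two pixels, three channels
trainable_out = [[9.0, 9.0, 9.0], [-3.0, 4.0, 1.0]]   # output of the trainable copy

combined = combine_with_zero_conv(frozen_out, trainable_out, zero_weight, zero_bias)
print(combined == frozen_out)
```

Because the projection starts at zero, the combined model initially behaves exactly like the frozen network, and the conditional signal is introduced gradually as the zero-convolution weights are trained.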

Fig. 2 ControlNet architecture overview. Image from [2].

GLIGEN

A similar concept of freezing the original weights of pre-trained diffusion models can be seen in the implementation of GLIGEN [4] (Grounded-Language-to-Image Generation). Instead of creating a copy of the entire encoder as in ControlNet, it introduces new layers within the encoder, called Gated Self-Attention (GSA), which are responsible for processing the grounding input. A key distinction between GLIGEN and ControlNet is the way they combine conditions with visual features. GLIGEN concatenates the visual and grounding tokens and processes them jointly in a Transformer (self-attention) layer, whereas ControlNet adds the processed condition to the visual features. This design choice positions GLIGEN as the more versatile option. Both architectures exhibit impressive performance with conditions like edge maps and segmentation maps. However, GLIGEN goes further by demonstrating great results with conditions such as bounding boxes and reference images.
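
The gating idea can be sketched in a few lines. This is a simplified illustration, not GLIGEN’s implementation: it assumes single-head attention with identity projections, a scalar gate `tanh(gamma)` with `gamma` starting at zero, and toy two-dimensional tokens. All names here are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)])
    return attended

def gated_self_attention(visual_tokens, grounding_tokens, gamma):
    """Attend over [visual; grounding] tokens, then add a gated residual to the visual ones."""
    joint = visual_tokens + grounding_tokens                 # concatenate the two token sequences
    attended = self_attention(joint)[: len(visual_tokens)]   # keep only the visual positions
    gate = math.tanh(gamma)                                  # gate is 0 when gamma is 0
    return [[v + gate * a for v, a in zip(vt, at)] for vt, at in zip(visual_tokens, attended)]

visual = [[1.0, 0.0], [0.0, 1.0]]
grounding = [[0.5, 0.5]]  # e.g. an embedded bounding box with its label

unchanged = gated_self_attention(visual, grounding, gamma=0.0)
updated = gated_self_attention(visual, grounding, gamma=1.0)
print(unchanged == visual, updated != visual)
```

With the gate closed, the new layer is a no-op and the pre-trained behaviour is preserved; opening the gate during training lets the grounding information flow into the visual features.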

Fig. 3 Gated Self-Attention layer from the GLIGEN architecture. Image from [4].

Training the models

Pre-trained models are undoubtedly impressive, but we often find that they don’t meet the exact needs of business projects. Their outputs are often hyperrealistic or imaginary and do not fit custom, specialized datasets. One can combat this with prompt engineering, but such a solution scales poorly. In this case, a re-training process is necessary.

Fig. 4 Images generated with a pre-trained stable diffusion inpainting model. Base images are part of the Cityscapes dataset [6]. Inpainted cars stand out from the background style.

However, training large diffusion models is not an easy task. First of all, a large dataset is needed. Stable-Diffusion-Inpainting uses LAION, which contains 5 billion images. ControlNet, on the other hand, was trained on the ADE20K dataset with over 25,000 images for segmentation, and on a custom dataset of 3 million edge-image-caption pairs from the internet for edges as control images. In business cases, often only a few thousand images are available. When it comes to hardware and computational power, GLIGEN was trained with 16 V100 GPUs for 100,000 iterations, and ControlNet for segmentation took 200 GPU-hours on Nvidia A100 80G.

From a business perspective, it can be impractical to experimentally train such large models due to deadlines and other limitations. Therefore, we have adopted suitable training strategies to overcome these issues, which can lead to satisfactory results for the chosen project.

LoRA to rule them all

In typical text-to-image techniques, where images are generated solely based on text prompts, several methods have been developed to effectively train new concepts. These fine-tuning methods enable swift and efficient learning of new styles or objects, even when the available data is extremely limited. One of these methods is the Low-Rank Adaptation method, commonly referred to as LoRA. At deepsense.ai, we have extensively explored and evaluated this method, uncovering its remarkable flexibility and versatility. You will soon witness its capabilities firsthand. But first, let’s delve into what exactly LoRA is.

Custom models with LoRA

LoRA was initially introduced in the LoRA: Low-Rank Adaptation of Large Language Models paper [5] as a fine-tuning method for Large Language Models like GPT-3. In the paper, it is demonstrated that by fine-tuning only a small portion of the parameters, the fine-tuned model outperformed the pre-trained ones. This technique has since been applied to Stable Diffusion, where it has exhibited remarkable effectiveness in fine-tuning diffusion models as well.

Fig. 5 Comparison of a cross-attention schematic view without and with additional LoRA layers

LoRA achieves this by incorporating additional linear layers into the cross-attention layer. Fig. 5 provides a schematic representation of how it works. In the Stable Diffusion model, the cross-attention layers play a vital role in image generation as they facilitate the interaction between the processed textual prompt and the generated image. During fine-tuning, only the additional layers, which use significantly fewer weights compared to the entire model, are trained. As a result, the training process is quick, and powerful computing resources are not required.
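
A minimal sketch of such an additional low-rank branch is shown below, assuming the common formulation in which a frozen weight matrix W is complemented by two small matrices (here called `down` and `up`), with `up` initialized to zero. The class name, the `alpha / rank` scaling, and the toy dimensions are assumptions for this example rather than details taken from any particular Stable Diffusion implementation.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W x + scale * up(down(x))."""

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_features, in_features = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # never updated during fine-tuning
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(in_features)] for _ in range(rank)]
        self.up = [[0.0] * rank for _ in range(out_features)]  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.up, matvec(self.down, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

size, rank = 16, 2
frozen = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
layer = LoRALinear(frozen, rank=rank)

x = [float(i) for i in range(size)]
print(layer.forward(x) == matvec(frozen, x))            # identical before any training
print(layer.trainable_parameter_count(), size * size)   # trainable vs. frozen weights
```

Even in this toy setting the trainable branch holds a quarter of the weights of the frozen matrix; for the much larger projection matrices in real cross-attention layers the ratio is far smaller, which is where the speed and memory savings come from.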

However, the most remarkable aspect of LoRA is its adaptability to pre-trained weights. Despite the existence of various methods built upon Stable Diffusion – ControlNet, GLIGEN, etc. – and each functioning differently, the cross-attention layers connecting a textual prompt with an image maintain a consistent structure and remain a crucial element. As a result, the weights trained using LoRA on one model can be seamlessly transferred to another model, even if these models represent different methods. This interoperability showcases the versatility of LoRA and its ability to facilitate knowledge transfer across diverse model architectures.

Fig. 6 Our proposed process for customizing models using LoRA

Now let’s look at how it works in practice.

Results for Cityscapes

We chose the “Cityscapes” dataset [6] to present the results of the Stable Diffusion model with the LoRA layers.

“Cityscapes” consists of images and corresponding segmentation masks captured in 50 different cities in Germany. The images can be easily recognized by a specific color scheme, a slight degree of blurriness, and the presence of the Mercedes hood. We assumed that these distinctive characteristics would make the Cityscapes dataset a good example with which to present the capabilities of the LoRA method. With just under 3,000 training images, the dataset may appear small for training deep-learning models. Nevertheless, it remains widely utilized for evaluating semantic segmentation techniques.

Fig. 7 Original images from Cityscapes dataset [6].

Stable Diffusion + LoRA

The Stable Diffusion model, trained on the Cityscapes dataset using the LoRA method, demonstrates great fidelity in generating images that mimic the training dataset. Most notably, the training process takes only several dozen minutes on a single GPU card!

Fig. 8 Images generated by Stable Diffusion fine-tuned with LoRA on the Cityscapes dataset

Stable Diffusion Inpainting + LoRA

Now we are ready to customize Stable Diffusion models with additional conditions. Let’s begin with inpainting with Stable Diffusion. The model is great for removing objects from scenes. It can also handle the insertion of completely new objects. However, one limitation is that the inpainted objects may not always align seamlessly with the overall style of the image. This disparity is particularly noticeable in the distinctive images from Cityscapes. Fortunately, incorporating LoRA weights trained on Cityscapes into the generation process significantly improves the integration of inpainted objects, resulting in a much smoother blend.

Fig. 9 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.

ControlNet + LoRA

The inpainting technique relies on basic binary masks, which can limit its functionality to some extent. However, more advanced techniques such as ControlNet and GLIGEN offer enhanced flexibility, enabling the usage of various additional conditionals. Let’s explore how ControlNet excels in reconstructing Cityscapes data based on segmentation masks. The model processes the segmentation mask and creates a result based on the information therein.

Fig. 10 Results for the Stable Diffusion Inpainting method. From left to right: background image; binary mask which determines object position; results without LoRA weights; results with LoRA weights.

GLIGEN + LoRA

A similar effect can be obtained for the GLIGEN method. Here, let’s focus on conditioning with bounding boxes. Each bounding box has a text prompt associated with it that determines what should be inside. Although GLIGEN works in a completely different way than the ControlNet or Stable-Diffusion-Inpainting methods, LoRA is compatible with it too and works really well.

Fig. 11 Results for the GLIGEN method without (top row) and with (bottom row) LoRA weights. The green and red bounding boxes indicate people and cars.

Summary

In this article, we have introduced various approaches to generating data using additional conditioning. Techniques like Stable Diffusion Inpainting, ControlNet, and GLIGEN offer impressive capabilities that can greatly assist in data generation for business projects. However, retraining these methods typically requires time and a relatively large amount of data.

In this blog post, we have presented a method that addresses this challenge by taking advantage of fine-tuning with the Low-Rank Adaptation (LoRA) technique. This approach allows for the efficient and seamless adaptation of these methods to specific use cases. By adopting this method, it becomes feasible to enrich business project data with data generated by diffusion models, opening up new possibilities for enhancing data diversity and quality.

References

[1] “High-Resolution Image Synthesis with Latent Diffusion Models”, Rombach, R., Blattmann, A., Lorenz, D., et al., 2021
[2] “Adding Conditional Control to Text-to-Image Diffusion Models”, Lvmin Zhang et al., 2023
[3] “Data generation with diffusion models – part 1”
[4] “GLIGEN: Open-Set Grounded Text-to-Image Generation”, Yuheng Li et al., 2023
[5] “LoRA: Low-Rank Adaptation of Large Language Models”, Hu, E. J., Shen, Y., Wallis, P., et al., 2021
[6] Cityscapes dataset

Llama 2. A significant milestone in the world of AI

November 30, 2023/in Artificial Intelligence /by Dawid Stachowiak

With the development of language models showing no signs of letting up, Meta AI has decided to make their contribution to the AI world with the introduction of the second iteration of their groundbreaking open-source language model, Llama 2. It definitely marks a significant step in the field of natural language processing (and artificial intelligence as a whole), further democratizing the power of LLMs and improving the quality of many LLM-based applications.

In this blog post, we will focus on the widely-discussed Llama 2 model. We will go through the technicalities, safety issues, the tools built around it, and the possibilities of using the model. We’ll also examine how the model compares to others available, and we’ll take a quick look at Meta’s recent partnership with Microsoft.

What is Llama 2, and why is there so much buzz around it?

Chances are, if artificial intelligence sparks your interest, you’ve probably heard about the excitement surrounding Llama 2. Llama stands for Large Language Model Meta AI, which is an autoregressive language model that relies on a transformer architecture (similar to many of the recently developed alternatives). While the first iteration of Llama (presented in late February 2023) was generously made available for non-commercial use, the second version, Llama 2, takes a leap forward, by not only being open to the public but also offering itself for commercial usage. Essentially, this means that both individual developers and enterprises can now use the Llama 2 model to develop countless commercial applications. This may herald even more new products based on language models, and thus the even faster development of AI in this field!

The Llama 2 license permits any commercial use of the model with one small exception – if you had a user count of over 700 million per month at the time of the model’s launch, obligatory permission must be sought from Meta. This license exception was implemented due to Meta AI’s desire to prevent their current competitors from utilizing the model. Anyone else can make unlimited use of it, and even if applications based on it reach that kind of scale in the future, it will still be license-compliant.

Llama 2 is available to the public in a variety of sizes and flavors. The smallest model has 7 billion parameters, followed by a 13 billion parameter model and a staggering 70 billion parameter one, so you get a great trade-off between accuracy and the speed/cost of your system. Interestingly, there is also a model with 34 billion parameters, but according to Meta’s research paper, it has not been released to the public due to a lack of time to ensure the self-imposed safety threshold. Each of the sizes mentioned is quite significant; if you want to learn more about operating such models, it is worth checking out our blog post about LLMOps.
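
To get a feel for the cost side of this trade-off, the back-of-the-envelope sketch below estimates the memory needed just to hold the weights of each released size. The bytes-per-parameter figures are standard for the listed number formats; the function name is hypothetical, and the estimate deliberately ignores activations, the KV cache, and framework overhead.

```python
BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(parameter_count, precision="fp16"):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameter_count * BYTES_PER_PARAMETER[precision] / 1e9

MODEL_SIZES = {"Llama 2 7B": 7e9, "Llama 2 13B": 13e9, "Llama 2 70B": 70e9}

for name, parameter_count in MODEL_SIZES.items():
    estimates = ", ".join(
        f"{precision}: {weight_memory_gb(parameter_count, precision):.0f} GB"
        for precision in BYTES_PER_PARAMETER
    )
    print(f"{name} -> {estimates}")
```

Under these assumptions the 7 billion parameter model needs roughly 14 GB in half precision, while the 70 billion parameter one needs about ten times as much, which typically means several GPUs or aggressive quantization.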

In addition, a few versions of the model are available for different applications. Beyond the foundational version of the model, fine-tuned versions for chat and programming assistance are also available. According to Meta, the model was trained on 40% more data than the previous version, around 2 trillion tokens in total, and over 1 million human annotations (more precisely, binary comparisons of the model outputs) were used in the RLHF (Reinforcement Learning from Human Feedback) process. This means that, after classic self-supervised training, the model was fine-tuned on human labels that indicated the more helpful responses. The above-mentioned improvements to the model allow for building multiple business applications such as specialized chatbots, knowledge and information retrieval search engines, code or text autocompletion, automatic content creators and many more.
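
Binary comparisons like these are commonly turned into a training signal for a reward model through a pairwise ranking loss. The sketch below shows that standard form, assuming scalar reward scores for the preferred and the rejected response; it illustrates the general idea and does not reproduce the exact objective from Meta’s paper, and the function name is hypothetical.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-diff))."""
    difference = reward_chosen - reward_rejected
    return math.log1p(math.exp(-difference))

# The loss is small when the preferred answer scores higher, large otherwise.
agrees_with_annotator = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=0.0)
contradicts_annotator = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=2.0)
undecided = pairwise_ranking_loss(reward_chosen=1.0, reward_rejected=1.0)

print(agrees_with_annotator < undecided < contradicts_annotator)
```

Minimizing this loss pushes the reward model to score the human-preferred response above the rejected one, and the resulting reward is then used to steer the language model during the reinforcement learning stage.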

Read our post about implementing LLMs in business operations to learn more about possible use cases in detail.

Safety development

In the second iteration, attention was paid to the aforementioned issue of safety associated with the use of the model. Every effort was made during pretraining to ensure that all data used was fully legal and did not include data from users who had not given their consent. In addition, Meta tackled the thorny issue of making sure that the model was free of biases toward religion, gender, nationality, race, and sexual orientation. Then, during fine-tuning using the RLHF method, the authors attempted to eliminate three types of behavior (illicit and criminal activities, hateful and harmful activities, and unqualified advice) from the model.

A more detailed description of the safety measures can be found in Meta’s research paper about the model, where the relevant chapter is 12 pages long! It is worth mentioning that in this case safety improvements can prove to be a double-edged sword, because, for example, the model can wrongly interpret a question as harmful or hurtful in some situations.

Figure 1. An example of a question that Llama 2 has difficulty answering due to the aforementioned safety-oriented training. Source: https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI

A rapidly growing ecosystem

There have already been several tools built around Llama to make it easier for developers to use it. A noteworthy one, for example, is the open-source Llama 2-Accessory library that targets pre-training, fine-tuning, and deployment. This library allows for the easy evaluation of fine-tuned models with popular benchmarks, and can optimize a model’s performance for speed and size, or easily encapsulate it in an API. Building such tools not only facilitates the work of developers but also greatly accelerates the creation of new products based on Llama 2.

Another toolkit library to check out when planning your work around the second version of Llama may be llama-recipes from Meta itself, which provides ready-made scripts for fine-tuning in various hardware configurations. Besides, it is worth remembering that it is possible to use Llama 2 with LangChain and in the Hugging Face ecosystem, which opens up a vast range of applications.

Best of all, some chatbot-based service products on the market like Perplexity Labs or Poe are already using Llama 2. This trend indicates the likelihood of an increasing number of similar products adopting this technology in the future.

Figure 2. Possibilities of the Llama2-Accessory library. Source: Llama 2-Accessory: An Open-source Toolkit for LLM Development

Benchmarking the performance

While proprietary models like OpenAI’s GPT series are still superior, it’s worth noting that Llama 2 showcases tremendous potential. In fact, in view of certain benchmarks, it even surpasses the effectiveness of GPT-3.5 which is the basis for the free version of ChatGPT. Of course, it should be remembered that benchmarks are not the only valid indicators determining the effectiveness of the model on all levels. Nevertheless, when designing a custom solution, it is a great idea to check out the performance of those leading models in your downstream task.

Meta & Microsoft

In a stride toward the future, Meta and Microsoft have embarked upon a groundbreaking partnership. United by a shared vision, the two tech giants have pledged their support to the Llama 2 family of language models. Llama 2 is available in the Azure AI model catalog, which means that you can easily build applications on top of the model or just play with it using cloud-native tools. It is also optimized to run locally on Windows. But if you don’t use Azure, don’t worry – AWS, GCP or even many of the smaller cloud service providers already offer Llama 2 as well, e.g., Anyscale Endpoints.

Summary

In an era of artificial intelligence and language models, the introduction of Llama 2 represents an undeniable landmark – a testament to the constant quest to push the boundaries of what is possible. As Llama 2 develops its capabilities, one thing becomes clear: We are at a point in time where natural language and technology are intertwined in unprecedented ways, opening the door to previously unexplored innovations. All of this is now available not only for public use but also for commercial use, which is sure to accelerate the development of this field even further, and result in many great products which are applicable to our lives and work.

For those who want a deeper understanding of the nuances contained in the Llama 2 architecture, a treasure trove of insights awaits in Meta’s comprehensive research paper. Dive into the depths of innovation here: Meta’s Llama 2 Paper. If you yearn to embark on a journey of first-hand interaction, you can take the opportunity to download Llama 2 for exploration here: Download Llama 2.

Evaluation Derangement Syndrome (EDS) in the GPU-poor’s GenAI. Part 1: the case for Evaluation-Driven Development

November 14, 2023/in Generative AI /by Jarosław Kochanowicz

GenAI, understood as a class of models capable of generating human-like, high dimensional outputs like text, image or sound, is experiencing great success and explosive growth [1, 2, 3]. However, this has also quietly given rise to a critical problem that permeates applied GenAI in its entirety – what we call Evaluation Derangement Syndrome (EDS). In short, EDS is the problem of the widespread lack of a rational approach to and methodology for the objective, automated and quantitative evaluation of performance in terms of generative model finetuning and prompt engineering for specific downstream GenAI tasks related to practical business applications.

In this post, we analyze EDS from a practical, applied, business perspective, drawing from our rich experience in GenAI development for business, both in image generation (with diffusion models and GANs) and LLMs in code generation, retrieval systems, voice assistants, etc. We’ll explore the underlying causes (both technical and business), and examine the consequences it may have for the GenAI community and beyond. We analyze its intricate relationship with GPU inequality [4] and address how the ‘GPU-rich’ (a handful of firms with thousands of the strongest GPUs, as well as resources like data, engineers, and labelers) approach the problem of GenAI evaluation, in contrast to the harsh realities of the ‘GPU-poor’ (everyone else, really). We also discuss the fundamental insufficiency of the ‘pseudo-evaluation’ approaches used by the GPU-poor and sketch a potentially more rational path forward for them: Evaluation-Driven Development (EDD). The subsequent post will take a deep dive into the nitty-gritty practicalities of this approach, drawing from our extensive experiences with Diffusion Models and LLMs in the ‘GPU-poor’ landscape. Enjoy!

GenAI evaluation in the realm of the GPU-poor

For fundamental technical reasons, GenAI does not naturally lend itself to any obvious and reliable analogues of quality monitoring tools (like F1 score, accuracy, precision, etc.) that all data scientists live and breathe when practicing traditional ML. Then there are business pressures to deliver at an extreme tempo, typical of the heated ‘hype economy’ driven by the fear of losing the race to take over the target niche in the AI revolution, contributing to the ‘produce fast, test later (i.e., never)’ approach.

Additionally, almost all GenAI applications involve several evaluation dimensions lying on a spectrum from soft (subjective) to hard (objective). To give specific examples from our own GenAI projects and practice, the evaluation of:

a chat/assistant includes subdimensions like helpfulness, friendliness, or even political correctness (subjective), vs factual correctness that can be measured in a ‘hard’ quantitative manner for some of the questions/answers (objective);
a retrieval system may include retrieval coverage correctness (‘hard’ in many cases) and conversation/summary style (soft), similar to the assistant;
code generation can be broken down into (hard) code correctness/test passing and (soft) code clarity;
diffusion-based face generation may consist of (soft) image attractiveness, similarity to the desired target, but also (hard) domain adherence, i.e., % of generated images actually containing a face.
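
The ‘hard’ dimensions in the list above are the easiest ones to automate. As a minimal sketch, assuming each generated sample has already been run through some automated check that returns a boolean (a face detector, a unit-test suite), the metric itself is just a pass fraction. The function name and the sample outcomes below are hypothetical.

```python
def fraction_passing(outcomes):
    """Share of samples for which an automated, objective check succeeded."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one evaluated sample")
    return sum(1 for ok in outcomes if ok) / len(outcomes)

# Hypothetical per-sample results of automated checks.
face_detected = [True, True, False, True]          # diffusion-based face generation
tests_passed = [True, False, False, True, True]    # code generation

domain_adherence = fraction_passing(face_detected)
test_pass_rate = fraction_passing(tests_passed)

print(f"domain adherence: {domain_adherence:.0%}")
print(f"test pass rate: {test_pass_rate:.0%}")
```

The ‘soft’ dimensions have no such direct counterpart, which is exactly why they tend to be judged by inspection rather than measured.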

This mix of soft and hard evaluation criteria poses technical difficulties (analyzed in this section), and the resulting simple truth is this: almost all GPU-poor researchers working on GenAI applications today work without any rational, objective, quantitative, repeatable framework to evaluate their work or to inform choices – their own or those of the business decision-maker – that depend on them. When dealing with our day-to-day dilemmas we all depend on subjective, arbitrary gut feelings built in short, selective inspections of the systems we train. These are (at best) accompanied by weak numeric pseudo-evaluation (‘broken evaluation methods that put more emphasis on style rather than accuracy or usefulness’ [5]), which does almost nothing to evaluate our specific business capacities (I’m looking at you, leaderboard rankings [6], BERTScores [7], BLEUs [8], etc.). These numbers are used more as a rationalization than as an actual trustworthy indicator of performance or a primary driver of our research.

This is truly astonishing when considering the rigorous objective evaluation practices ingrained in traditional ML (Image 1). More shockingly, everybody in the field seems to accept this and move along: if we move fast (as in the production of more and more GenAI), we are fine with not checking the direction (as in objective QA, and in comparison with competition or alternative approaches).

Image 1. Daily cycles as practiced by the GPU-poor. The differences in the evaluation standards are stark and consequential.

This is not an academic or theoretical issue, but a weighty real-life problem with business, technical, and human consequences. Again, deepsense.ai’s wide experience allows us to share a specific story that many GenAI researchers can relate to. One of our valued customers asked us to develop a code-generating solution for a somewhat niche language (think GitHub Copilot’s [9] competition for this language). Even though the team we established consisted of elite LLM experts, the task proved very challenging. Due to the limited time allocated for creating the evaluation pipeline, it relied solely on the BLEU-based evaluations. The main effort went directly into generative model creation itself. This, combined with the release of the new, possibly superior base models over the duration of the project, led to serious internal evaluation issues.

To cut a long story short – our team was highly competent and worked very hard, but had no reliable way of determining if any improvement had taken place. Considering our BLEU-based evaluations, we thought this may not be the case! Luckily, the client’s own internal evaluation at the end of the project was very positive. Apparently, according to the client, our model was visibly superior to both the foundational models and GitHub Copilot. Good enough for us, and another job well done! But, to be fair, we have no idea how objective and extensive this ‘client-based evaluation’ was, or whether this will work next time. Being professionals, we prefer to make our own luck instead of being part of EDS. Today’s GenAI development is full of similar stories, rarely with a happy ending like ours.
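
Why overlap-based scores can mislead in code generation is easy to show on a toy case. The sketch below uses a simplified bigram precision as a stand-in for BLEU (it is not the full metric) and a two-assertion functional check; the snippets, the whitespace tokenization, and the function names are all made up for illustration and are unrelated to the client project described above.

```python
from collections import Counter

def ngram_precision(candidate_tokens, reference_tokens, n=2):
    """Share of candidate n-grams that also occur in the reference (clipped counts)."""
    candidate = Counter(zip(*[candidate_tokens[i:] for i in range(n)]))
    reference = Counter(zip(*[reference_tokens[i:] for i in range(n)]))
    total = sum(candidate.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, reference[gram]) for gram, count in candidate.items())
    return overlap / total

def passes_tests(source):
    """Functional check: run the snippet and test the `add` function it defines."""
    namespace = {}
    exec(source, namespace)
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0

reference = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"                            # looks similar, is wrong
rewritten = "def add(x, y):\n    total = x + y\n    return total\n"     # looks different, is right

for name, snippet in [("buggy", buggy), ("rewritten", rewritten)]:
    score = ngram_precision(snippet.split(), reference.split())
    print(f"{name}: overlap={score:.2f}, passes tests={passes_tests(snippet)}")
```

The buggy snippet shares most of its bigrams with the reference yet fails the tests, while the correct rewrite shares none of them. An evaluation pipeline relying on overlap alone would rank the two in the wrong order.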

In summary, EDS is a serious, practical issue affecting all areas of GenAI, most notably LLMs [10, 11], and image generation (GANs [12], Diffusion Models [13]). Its reach will only grow in the future, together with GenAI use cases. Generative AI can create jokes [14], stories [15], poems [16] and beautiful images [17] for a continuously growing number of applications, but our ability to evaluate GenAI is constantly falling even further behind. Given the scale of the pandemic and the technological, economic, and political impacts of GenAI, EDS truly is a critical issue. Let’s now investigate the underlying causes of this situation.

EDS – business causes

EDS emerges in the GPU-poor domain due to a complex interplay of factors—some technical and others soft, encompassing the economic, psychological, business-strategic, and organizational dimensions. Before delving into the technical reasons behind this situation, let’s take a look at the broader realities of the GenAI business-economic ecosystem contributing to EDS amongst the GPU-poor.

The first EDS-inducing condition is that, since at least 2022, the GenAI business ecosystem has operated permanently in ‘hyped-economy’ mode [18, 19]. In the GPU-poor realm, this means the red-hot fervor of startups [20] competing in a race to quickly find their niche in the GenAI revolution. Or at least respond to the marketing pressure to be a part of GenAI… A pervasive sense of ‘it’s now or never’ exerts enormous pressure on businesses. CEOs of even modest startups aspire to innovate within GenAI, sometimes despite limited technical understanding and unreliable intuitions concerning likely future GenAI developments and their risks to early adopters [21]. The unrelenting drive to produce GenAI en masse is fueled by sky-high VC investments [22, 23], partly stemming from a ‘pay-to-participate’ strategy to keep a horse in the big GenAI race. A strategy that is at times questionable, considering that the chance to join the GPU-rich is likely far-fetched at this point [24].

The second EDS-relevant characteristic of the GenAI ecosystem is the astonishing tempo of general-purpose model and innovation releases by the GPU-rich (proprietary or open-source). This creates significant potential to undermine or eliminate early GenAI adopters – their business strategies are potentially threatened each time we are exposed to a release that redefines the GenAI landscape and boundaries of what is possible. This disruptive tendency manifests every few months and shows no sign of slowing down, with the recent releases of Llama 2 [25] and Mistral [26] (the great hopes of open source NLP [27, 28]) and two proprietary game-changers seemingly just around the corner: Gemini [29] and GPT-5 [30]. As a result, the GPU-poor conduct research and do business on ‘moving sands’. As a side-effect of any release by the GPU-rich, their research projects may well be outdated before completion. Indeed, rapid and unpredictable innovations of the GPU-rich will repeatedly wipe out niches targeted by countless GPU-poor early adopters, before any real chance of a return on investment.

It is not merely that “it’s totally hopeless to compete with us on training foundation models you shouldn’t try, and it’s your job to, like, try anyway” [31], as Sam Altman accurately summed up the situation of the GPU-poor vs GPU-rich. It’s also safe to assume that once the battle for supremacy for the ‘foundation models’ is more advanced, the same fate will befall utilities and applications of these models, where many of the GPU-poor hope to find a place for themselves. Think retrieval systems, agents, specialized finetunes, and other services built on and around the foundational models. Whenever serious consequences (in USD) are involved, the GPU-rich will (in time) provide these ‘auxiliary’ services out of the box.

Hence, the GPU-poor’s fear of this ‘side effect’ of business eradication is contributing to EDS by creating additional pressure to ship fast and skip evaluation, which is not perceived as a critical business goal.

Image 2. The astonishing tempo of changes in the business landscape contributes to Evaluation Derangement Syndrome in the GPU-poor. [source: The Missing Link in Generative AI | Fiddler AI Blog].

The third ‘soft’ cause is that fighting EDS among the GPU-poor may not be a primary interest of any key players involved. The main contributors to the GenAI revolution (the GPU-rich) have a proper way of evaluating GenAI (see RLHF section), but no incentives to develop or share alternatives for those who cannot afford it. In fact, from the perspective of GenAI’s future, they may see the current practices of the GPU-poor as inconsequential and irrelevant, and can leave them to their dubious practices. Sure, GPU-poor startups built around GPU-rich models may turn out to be successful from an economic standpoint (and deepsense.ai is here to make sure of it!), but they are not likely to shape the GenAI landscape in any major way over the course of a decade – this role belongs to the GPU-rich. Other parties are also fine with EDS: ML training providers are happy to train any model, regardless of its performance, just as some developers and consultants are happy to create models regardless of their actual capacity to validate quality. And so the EDS machine may continue. We at deepsense.ai are not happy with this approach and strive to provide solutions that actually ARE better.

Technical causes: Why does EDS plague GenAI?

Technically speaking, why exactly would EDS haunt GenAI applications? We figured out the Traditional ML evaluation quite well. Why would this know-how not translate easily to GenAI? In this section, we’ll delve into the ‘hard’/technical factors behind EDS in Generative AI.

Let’s start by refreshing the (very straightforward) conventional approach to ML development and evaluation. Let’s break it down into three steps:

gather human-generated ground truths (GTs),
gather model-generated outputs,
compare them directly to produce meaningful metrics like accuracy, F1 score, and mean squared error.
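The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration with made-up toy labels, not code from any deepsense.ai project; the function names and data are assumptions introduced only for this example.

```python
from typing import Sequence


def accuracy(truth: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of predictions that exactly match the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def f1_score(truth: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(truth: Sequence[float], pred: Sequence[float]) -> float:
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


# Step 1: human-generated ground truths (toy data, assumed for illustration).
ground_truth = [1, 0, 1, 1, 0, 1]
# Step 2: model-generated outputs (toy data, assumed for illustration).
model_output = [1, 0, 0, 1, 1, 1]

# Step 3: direct comparison yields the metrics.
acc = accuracy(ground_truth, model_output)
f1 = f1_score(ground_truth, model_output)
mse = mean_squared_error([2.0, 3.0], [2.5, 2.0])
```

The key property is that each output has exactly one correct answer to be compared against, which is what the following sections argue is missing in generative tasks.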

There are several reasons why this approach breaks down for GenAI, regardless of whether it concerns the generation of text, images, or other content.

1. Inadequacy of the ‘ground truth’ concept in GenAI

Firstly, almost by definition, generative tasks often include a practically infinite variety of ‘optimal’ solutions, making any metrics which rely on direct comparisons with GT questionable. Within Natural Language Processing (NLP), ‘pseudo-evaluation’ approaches that we call ‘Superficial Utility Comparison Kriterion’ (SUCK) methods, like BLEU [32], METEOR [33], ROUGE [34], or BLEURT [35], attempt to salvage the situation. SUCK methods usually compare model outputs to GTs through n-gram overlap or in specific embedding spaces, de facto under the assumption that output quality correlates with similarity to some GT under that measure.

While this assumption might hold to some extent in quasi-generative tasks like summarization [link], it is fundamentally flawed. SUCK’s weaknesses are well recognized in the literature [link]. In our opinion, a fundamentally different approach is needed – we need to embrace the fact that ‘ground-truth’ is simply not a viable concept for most truly generative tasks. When the task is love poem generation, what is THE right answer that all others should imitate?
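To make the weakness concrete, here is a small sketch of clipped unigram precision, a deliberately simplified stand-in for the n-gram overlap at the core of BLEU-style metrics (it is not the full BLEU formula, which also combines several n-gram orders and a brevity penalty). The sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams also found in the reference (counts clipped)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the cat sat on the mat"
near_copy = "the cat sat on a mat"            # almost the same words
paraphrase = "a feline rested upon the rug"   # same meaning, different words

copy_score = clipped_precision(near_copy, reference)
paraphrase_score = clipped_precision(paraphrase, reference)
```

The near copy shares five of its six tokens with the reference, while the perfectly acceptable paraphrase shares only one, so an overlap-based score rewards imitation of one particular GT rather than quality.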

2. The innate subjectivity in GenAI evaluation

Secondly, assessing human-level creativity requires subjective and elusive criteria. Such evaluation is best expressed by poorly-defined concepts like aesthetics, novelty, style, creativity, and helpfulness. Formal quality formulas have limited promise in this context. Furthermore, intersubjectivity is an inherent feature of evaluation, not a flaw to eliminate [36]. In many cases, generative models should optimize multiple objective quality indicators (e.g., the correctness of the generated code or retrieval correctness) as well as subjective ones (e.g., code cleanliness or proper conversational tone). All of them may non-trivially determine the final output quality, and more often than not they conflict in practice [37, 38]. How to use them all in a practical evaluation is not necessarily obvious. Data collection, training, and evaluation procedures need to explicitly account for, quantify, and control subjectivity in the subdimensions of GenAI evaluation and their relation to the final quality of the output.

3. Extreme use-case specificity of the GenAI evaluation criteria

Thirdly, the interpretation of values relevant to the problem is highly application-specific and changes dramatically between use cases, and even users. Even when identical terms are used, different domains or use cases dramatically redefine their interpretation. Context and goals determine standards of output beauty/aesthetics, proper tone, relevance, social acceptability or most other subjective metrics one can imagine. This makes capturing them with large, cross-domain, ‘once and for all’ datasets a challenge.

4. Diversity / mode collapse monitoring problems

The fourth reason is the susceptibility to mode collapse, where models produce outputs with limited diversity. While this can impact the utility of the model, it is also difficult to quantify and measure as part of model evaluation. Attempts to approximate the human sense of diversity, like FID or the CLIP score, have serious limitations [39, 40]. On the one hand, the definition of sample proximity may be non-trivial and domain-specific. On the other hand, diversity may be hard for humans to evaluate, as it is a feature of a large dataset – evaluation of multiple objects is a natural human weakness.
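As a rough illustration of why diversity is a property of a whole set of outputs rather than of any single sample, the sketch below computes a simple surface-level ratio (unique n-grams over total n-grams, sometimes called distinct-n in the text generation literature). This is not one of the metrics named above, and it shares the limitation just described: its notion of proximity is purely lexical. The sample texts are invented.

```python
def distinct_n(samples, n=2):
    """Unique n-grams divided by total n-grams across all samples."""
    all_ngrams = []
    for text in samples:
        tokens = text.lower().split()
        all_ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


# A collapsed generator repeats itself; a varied one does not (toy data).
collapsed = ["roses are red", "roses are red", "roses are red"]
varied = ["roses are red", "the sea is calm", "night falls softly"]

collapsed_score = distinct_n(collapsed)
varied_score = distinct_n(varied)
```

Each output in the collapsed set looks fine in isolation; only the set-level score reveals the problem. A paraphrasing generator that repeats one idea in different words would still fool this measure, which is exactly the domain-specific proximity issue mentioned above.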

5. Potential evaluation dataset leaks in behind-closed-doors training

The fifth and final challenge in GenAI evaluation is the potential data leakage of any existing evaluation datasets one intends to utilize. Almost any model trained by the GPU-poor is a refinement of a general model (like Llama or Stable Diffusion) generously thrown open by the GPU-rich. These models have been trained on just about any data available, including all open datasets you wish to use to evaluate them or their fine-tuned iterations. The closed-door nature of many of these training procedures and their often undisclosed training datasets complicate the issue further. When we consider this in conjunction with the third challenge, which relates to the high degree of application specificity in given metrics, it becomes evident that depending on vast, openly accessible, one-size-fits-all datasets may no longer be a viable approach. It seems that the GPU-poor may have to rely on methodology allowing for easy evaluation with internally created, very small datasets capturing the case-specificity of their application.

Okay, so there are technical reasons why GenAI evaluation is hard, and why traditional ML approaches fail. But why would those technical obstacles hit the GPU-poor more? Why and how are the GPU-rich capable of dealing with them? Why can’t the GPU-poor copy what they do on a small scale? These are all great questions that we will address next.

How do the rich kids do it? The shiny new RLHF you (probably) can’t afford

Time for full disclosure: the problem above is only partially new. A lot of thought went into the proper approach to somewhat similar challenges long before the GenAI revolution. The field of Reinforcement Learning (RL) has much to say about the ‘healthy’ approach to a situation where the evaluation metric (‘reward’ in the RL lingo) is hard to formalize, complex, etc.

Successful RL conceptual frameworks like actor-critic [41] or world-model [42] have been used in physical or virtual environments to attack problems with ill-defined notions of ‘correct action’. The moment one sees GenAI as an agent, its outputs as actions, and human evaluation as a reward function, it all falls very nicely into the RL framework – hence Reinforcement Learning from Human Feedback (RLHF). One RL-borrowed component that allowed the GPU-rich to ‘put the harness’ on the GenAI they produced was the notion of a critic, preference, or reward model. Understanding this concept is vital to our conversation. The preference model is trained on human evaluations, to approximate human, well… preferences (Image 3, left). Once successfully trained, it can provide automated, pseudo-human feedback to train the generator (Image 3, right) concerning the human-level metrics (friendliness, political correctness, etc.). It is relevant to note that the reward model is a traditional (i.e., non-generative) model – it is not GenAI, so we know how to evaluate it directly. OpenAI used this approach to make ChatGPT by ensuring that GPT-3/4 would not set off on racist rants, etc., by finetuning it to score highly on a reward model trained for OpenAI-preferred ideological biases. However, the same principle works for any GenAI (for example, reward models trained on human aesthetics [43] enable the creation of image generators like Midjourney [44]).

Image 3. OpenAI’s famous diagram. The reward model is trained on feedback to capture the human-level labeler’s preferences (left). The reward model is then used to automatically provide the training signal (providing human-like preferences) in millions of training steps of the actual generative model (right). [source: https://openai.com/blog/chatgpt]
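The left-hand side of the diagram can be sketched as follows. This is a toy illustration under stated assumptions, not OpenAI's implementation: it assumes human feedback arrives as pairs (preferred output, rejected output), uses a pairwise logistic loss (one common formulation for preference models), and replaces the neural network with a linear model over two hypothetical hand-made features.

```python
import math


def reward(weights, features):
    """Linear reward: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_preference_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights so preferred items score higher than rejected ones.

    Minimizes -log(sigmoid(r(preferred) - r(rejected))) by gradient descent.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (preferred - rejected).
            grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                weights[i] += lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical features per output: [politeness, verbosity].
# Toy human feedback: labelers preferred the more polite output in each pair.
pairs = [
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.8, 0.7], [0.2, 0.6]),
    ([0.7, 0.1], [0.3, 0.9]),
]
weights = train_preference_model(pairs, dim=2)

# Because this is a traditional model, it can be checked like any classifier:
# how often does it agree with held-out human preferences?
held_out = [([0.6, 0.5], [0.4, 0.5])]
agreement = sum(
    reward(weights, a) > reward(weights, b) for a, b in held_out
) / len(held_out)
```

The held-out agreement rate is the kind of direct, conventional evaluation that is available for the reward model even though it is unavailable for the generator itself.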

It should now be clear that RLHF makes the GPU-rich immune to EDS. Preference models can provide an automated, repetitive, quantitative evaluation for human-level metrics like beauty or friendliness. There is a ‘but’, however. Namely, the way the GPU-rich use RLHF is categorically beyond the reach of the GPU-poor. We cannot ‘just copy what they do on a smaller scale’ – we are too poor to do even that. The amount of GPU power, labeling, and the (often overlooked) scale of technical engineering difficulty needed to make full RLHF work is unknown and (probably rightfully) normally assumed prohibitive for today’s GPU-poor.

At this point, many highly technical GPU-poor experts, who intuitively feel much of what we have described, just shrug and say, ‘that’s why we cannot afford proper GenAI evaluation’. One reason for this may be a misleading intuition about what is hard in RLHF. The thing is, the difficulty considered here is not just training any evaluation model, but rather a reward model strong enough to be used in full RLHF, i.e., as the provider of a training signal directly to the generative model (Image 3, right). In this setup, the reward model becomes vulnerable, as the generative model may ‘beat’ it, exposing any weakness (hidden misalignments with human preferences) it may have. Dealing with this has much in common with adversarial training stability challenges in GANs [45] or general issues with ML adversarial safety [46]. It is hard. In essence, the reward model and the entire training methodology must be very strong to stop the generator from ‘hacking’ the preferences, or it will find a way to get a high reward in undesired ways (by generating weird or noise-like images/sentences that should not get high rewards, but do). This takes a vast amount of data, GPUs, and expert engineering.

Here comes the trivial observation behind Evaluation Driven Development (EDD) that the GPU-poor can adopt. Evaluation models may be easier by orders of magnitude to obtain than reward models. Our experiments indicate that evaluation models may be very cheap and easy to create for the GPU-poor’s own use cases, when taken out of the RLHF context. Unlike the GPU-rich shaping their base models, you do not need your reward model to provide a training signal for the model you fine-tune (or prompt engineer). If you do it right, and the generator cannot fool the evaluator through the training signal, as few as 100-200 samples may be enough to create human-like evaluator models. And that is all you need to get out of EDS!

Okay. So we have the general idea behind EDD. But that’s not really a framework yet, is it? And what does ‘if you do it right’ mean exactly? That’s an entirely different story – one we will tell in the second blog post of this series.