How we integrated GPT with PDF documents

How we developed a GPT‑based solution for extracting knowledge from documents

May 26, 2023/in Generative AI /by Piotr Gródek

Practical business use cases for GPT

Recent breakthroughs in AI have showcased the vast potential of convenient natural language interfaces and taken the web by storm. Many companies across various verticals have started looking for specific business use cases to implement this technology. As our motto is “There is no better way to show our capabilities than to build solutions”, we have developed various technical showcase implementations to inspire and present potential solutions. In this blog post, we will discuss our latest GPT-based solution, which addresses the challenge of extracting knowledge from a set of PDF documents.

Meet Niffler
Niffler mascot

We have given our project the codename “Niffler”. The name was inspired by a magical beast from the Harry Potter universe which is attracted to shiny things. We used AI to create its mascot image (see the result above!). Niffler’s task is to digest user-provided PDF (or text) documents and provide a chat interface along with highlights of the relevant document. Developing the application enabled us to put our experience into practice.

Time is money – who has time to search each document?

At any point in time, each company generates a huge number of documents – from legal, finance and administrative documentation to knowledge databases pertaining to internal processes. Employees join or leave the company, projects finish up, others start, and the pile of documents continues to grow. At some point, keeping track of what was done and where to find the information is impossible, and many hours are wasted. If the company’s core business is not about document organization, it represents a cost sink which is hard to even measure.

In some cases, even if the right document is found, it is often not enough – the document can be too long to read properly, or the need to supplement it with other ones to gain relevant insight arises. Hiring a dedicated staff member to search for information can be a possible solution, but wouldn’t it be great if an application could read all the documents, find and match relevant information and then provide a concise answer with all the required references? If we combine this idea with another AI model for speech-to-text, any team at a company could access technology similar to that owned by the superhero Tony Stark and efficiently work with the company or external data, which will of course provide a competitive advantage.

Overall solution overview

The business challenge for Niffler was described in bullet point form:

  • we have a collection of documents – docs, PDFs, txt
  • we look for an answer to a question which can be answered by any of the documents
  • we want to get the answer as fast as possible
  • we want to use a natural language interface
  • ensuring the privacy and security of internal know-how is a priority for us

Our core technology is independent of the source of the GPT model, as we don’t want to be vendor locked, but rather flexible for every potential need. We decided to use OpenAI as an external component as it suited our needs best.

In order for Niffler to start working and supporting us in our daily work related to document analysis in accordance with the above-mentioned assumptions, we had to consider various crucial aspects, which we discuss below.

Operation costs for a GPT-based application

As with all projects, the costs depend on the choice of the model and where it is used. For example, in the case of the OpenAI API, payment depends on the number of tokens required by the input and output (neural networks require input paragraphs and sentences to be split into small processable units called tokens), whereas Azure charges by inference time, counting how many minutes your requests take.

A great way to see what a token is would be to visit the OpenAI tokenizer page which graphically displays it for your text.

One important detail is that the number of tokens includes not only direct user input but also hints we need to pass to the network – such hints provide additional guides, context or examples which are necessary for better results. Such techniques are called zero-shot and few-shot learning and provide a way to better align outcomes with expected results for concrete tasks without the associated costs of training a specialized model. There is also a hard limit on the maximum number of tokens acceptable to the network; the bigger the network, the more it can take as an input.
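To make this concrete, token counts can be estimated locally before calling the API. The sketch below uses the open-source tiktoken library; the choice of the cl100k_base encoding and the example strings are illustrative assumptions, not part of Niffler itself.

```python
# A minimal sketch of local token counting with tiktoken.
import tiktoken

# cl100k_base is the encoding used by recent OpenAI chat models (an assumption here).
encoding = tiktoken.get_encoding("cl100k_base")

system_hint = "You answer questions strictly based on the attached document."
question = "What is the notice period defined in the contract?"

n_tokens = len(encoding.encode(system_hint + "\n" + question))
print(f"Tokens sent to the model before adding any document context: {n_tokens}")
```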

You may wonder why there is a need to provide additional context to the network at all, since it of course increases operational costs. Please note that the model does not have a memory of the conversation (it cannot remember anything that either the user or the model itself wrote a second ago!), so to help it remember, it is necessary to inject the chat history, or just a summary of it, into the prompt. Additional, specially formatted prompts can also enforce consistency and quality – they are a technique to prevent the model from answering outside the desired bounds and to ensure it acts appropriately, even when facing a malicious user.

Moderation of AI

It is an unfortunate fact that large language models can generate outputs that are untruthful, toxic or simply unhelpful, and special care is required to address that issue. Providers of services like OpenAI and Azure provide some black-box moderation – but that’s not enough. To address such concerns, we came up with and implemented several techniques – one of which is to add an additional AI layer to moderate output. More details about our design are described in the reliability and security section later in the article.

Development prototype

We started with a proof of concept prototype – its main goal was to get feedback and iterate fast on the idea. We used Streamlit, a library for quickly building graphical interfaces for data science projects – it does not allow a high level of customization, but it allows you to quickly present visual results which greatly simplifies communication, especially with less technically-oriented people. Additionally, a major plus point is that it is easy for a data scientist to use without the need to involve frontend and backend developers.
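As an illustration of why Streamlit is so convenient for this kind of prototype, the sketch below shows roughly how few lines a document Q&A page takes. It is a hypothetical example, not the actual Niffler code; answer_question is a placeholder for the retrieval and GPT pipeline described later.

```python
# A hypothetical Streamlit prototype layout, not the actual Niffler code.
import streamlit as st

def answer_question(uploaded_file, question: str) -> str:
    # Placeholder for the document parsing + retrieval + GPT pipeline.
    return f"(demo) You asked {question!r} about {uploaded_file.name}"

st.title("Document Q&A prototype")

uploaded = st.file_uploader("Upload a PDF or text document", type=["pdf", "txt"])
question = st.text_input("Ask a question about the document")

if uploaded and question:
    st.write(answer_question(uploaded, question))
```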

The video below shows a set of prototype features of our AI system:

  • answer a question about a set of documents and find the relevant one
  • ask more in-depth questions about the content of the document found
  • extend with recent and online knowledge – crawl the web with the Bing search engine.

The prototype has more features than our polished demo application and it is a teaser of what we can do.

https://deepsense.ai/wp-content/uploads/2023/05/Video-1.mp4

Video 1. Prototype flow: We started by searching a set of documents and then asking for more details focused on those found. We also integrated a Bing search allowing the model to dynamically fetch data from the internet as requested.

The prototype allowed us to experiment with many different ideas before settling on a set of features to focus on. It also improved communication with project stakeholders, but even more importantly, each team member showcased their work to the rest of the team after the daily stand-up, which greatly improved internal collaboration and made it more fun.

Unstructured data and search

At the time of writing, out-of-the-box GPT-like models are unable to process big chunks of data or work with standard documents like PDF or Word documents.

Figure 1. Example of a PDF legal document we used for our tests, taken from this Kaggle dataset of Indian Supreme Court judgments.

To solve this challenge, we created a dedicated preprocessing step which digests native PDF formats – parsing PDF in general is not an easy task and OCR might not always work so well, but for the purposes of our prototype it is sufficient.

The resulting canonical form is then passed to an intelligent processing block: the AI reads chunks of text, creates a summary and tags to make efficient searches possible, and calculates so-called embedding vectors, which encode semantic information in a very efficient manner. These vectors are then stored in a vector database along with additional metadata.
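A minimal sketch of this kind of preprocessing and retrieval step is shown below. It assumes the (pre-1.0) OpenAI embeddings endpoint and a FAISS index; the chunk size, model name and file names are illustrative assumptions rather than Niffler’s actual configuration.

```python
import faiss          # vector index
import numpy as np
import openai

def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    # Naive fixed-size chunking; a real pipeline would split on sentence boundaries.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed(texts: list[str]) -> np.ndarray:
    # text-embedding-ada-002 returns 1536-dimensional vectors.
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]], dtype="float32")

chunks = chunk_text(open("parsed_document.txt").read())
vectors = embed(chunks)
faiss.normalize_L2(vectors)                     # so inner product equals cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# At query time: embed the question and retrieve the most relevant chunks.
query = embed(["What was the final verdict?"])
faiss.normalize_L2(query)
scores, ids = index.search(query, k=3)
relevant_chunks = [chunks[i] for i in ids[0]]
```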

Figure 2. Each uploaded document has 3 tags useful for searching, clustering and prompt tuning.

Figure 3. Two distinct PDF documents with summaries generated by AI. The shorter form is a great way to quickly learn about the content, as well as to align the system to focus only on questions and answers related to the document context.

This is a one-time cost to include each document in a database and it does take some time – however, the database can then be extended easily on the fly to include new documents, which can be done in the background without stopping the system from functioning.
This approach provides additional control which can be useful when it comes to improving or extending the performance of the system.

Great User Experience

Software should be pleasant to use. To achieve this, we decided to build our frontend as a ChatGPT-inspired interface – familiarity with chat interfaces makes it very easy and natural to use.

We have prepared two main views – a standard chat and document preview, together with a left sidebar which contains highlights, or questions with answers, serving as links which allow a user to revisit previous and current selection in the source document.

Screenshots, of course, are not enough to present interaction, so we decided to record a set of short clips to capture the user experience. One major strength of our application is that a user can not only quickly revisit answers, but also jump with just one click to the relevant source information and validate whether AI has done a good job with the answer provided.

https://deepsense.ai/wp-content/uploads/2023/05/Video-2.mp4

Video 2. Question about a court case and inspecting the full document to show that only a small, relevant portion is highlighted.

https://deepsense.ai/wp-content/uploads/2023/05/Video-3.mp4

Video 3. Medical leaflet – one question asked.

https://deepsense.ai/wp-content/uploads/2023/05/Video-4.mp4

Video 4. Medical leaflet again, with more questions and a showcase of the highlights.

Figure 4. Example chat. Please note that our application is language-agnostic, just like the underlying GPT model.

Reliability and security concerns

GPT models can return different answers when asked the same question multiple times, and there is no formal guarantee that they won’t make mistakes, stray off topic, or even offend a user. Indeed, it is a challenge many have experienced; it is mentioned in the Bloomberg article, in the official limitations of GPT-4 (the most powerful model), and it is easy to find more articles on the topic. A mitigating solution that seems to work quite reliably for our use case is to add three extra steps to the standard flow, which we describe in more detail below.

Our first line of defense is the aforementioned built-in moderation from OpenAI. We also researched prompts to ensure that the AI can only provide answers on selected, narrow topics and data. The input prompt to the model is built only from the context connected to the question, using the automatically generated document summary and the three tags of the given document. Such dynamic prompt engineering yields better results than a generic prompt and is a great alternative to manually hand-crafted prompts.
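The sketch below illustrates the general shape of such a dynamically built prompt; the wording, variable names and example values are assumptions for illustration, not the prompts Niffler actually uses.

```python
def build_prompt(question: str, summary: str, tags: list[str], chunks: list[str]) -> str:
    # Constrain the model to the retrieved document context.
    context = "\n\n".join(chunks)
    return (
        "You answer questions strictly based on the document described below.\n"
        f"Document summary: {summary}\n"
        f"Document tags: {', '.join(tags)}\n"
        f"Relevant excerpts:\n{context}\n\n"
        "If the question cannot be answered from the excerpts, say so explicitly.\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    question="What was the final verdict?",
    summary="Judgment of the Supreme Court of India in a land dispute case.",
    tags=["legal", "judgment", "land dispute"],
    chunks=["...retrieved chunk 1...", "...retrieved chunk 2..."],
)
```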

Figure 5. Simplified flow for a single document interactive question and the answers we have implemented.

The third line is actually our secret sauce – we use another AI to moderate output.

We tried several attacks by injecting text prompts known to alter model behavior (asking it to act as someone else, other kinds of persuasion, DAN and similar jailbreaks mentioned by people on Twitter) or by trying to get it to answer something unrelated, or on topic but potentially harmful. We have failed to break it so far. On the other hand, it also sometimes leads to the model not answering questions which are not really on topic. Depending on the use case, we can tune it to be more or less restrictive, as all additional checks are opt-in. We also found that even if the model refuses to provide an answer, the highlights mentioned in the next paragraph might still be returned correctly.
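To give a flavor of the general idea (the exact prompts and model we use are part of our internal design and are not reproduced here), a second-model moderation step can look roughly like the sketch below, written against the pre-1.0 OpenAI Python API; the reviewer prompt and model name are assumptions.

```python
import openai

REVIEWER_PROMPT = (
    "You are a strict reviewer. Given a document summary and a candidate answer, "
    "reply ON_TOPIC if the answer only uses information related to the document, "
    "and OFF_TOPIC otherwise."
)

def is_on_topic(summary: str, answer: str) -> bool:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": REVIEWER_PROMPT},
            {"role": "user", "content": f"Document summary: {summary}\n\nCandidate answer: {answer}"},
        ],
    )
    verdict = response["choices"][0]["message"]["content"].strip()
    return verdict.startswith("ON_TOPIC")
```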

To complete the user experience, we also quote and show the source of the information provided by Niffler. This feature is a major selling point of our approach for any user, as it addresses two aspects: verification of the AI model output by the user and efficient information search. We placed particular emphasis on visually presenting the minimal, raw text in the source document, so the user saves as much of the time required to read the original content as possible. A concise AI answer is an added extra, not a replacement for your data source. At the moment, truthful, fact-based answers and links to source information remain an unsolved problem under active research. Addressing this issue is very challenging, but we have already seen some promising results and gained a number of key insights during the development phase.

Figure 6. Example of an answer with the citation selected – in the PDF file we highlight the exact sources of information used by GPT, allowing a user to focus only on short, concise and important information.

Deployment as an internal tool

We have built a useful and interesting application – we hosted it internally on our servers and made it available for deepsense.ai employees to use. Our highlight module is one of the key strong points, and people have found it very useful. Use cases we have witnessed for the current Niffler version include information retrieval from research papers and device manuals. Additionally, we created a knowledge base, which we have shared internally (for now) with our colleagues to propagate everything we have learned.

Charged with even more superpowers

As we thrive on excellence, there are still more things to do.

Static knowledge is not enough for the rapid changes that are happening in the world of AI. That is why we added to Niffler the possibility of integrating with external data sources such as SQL databases or arbitrary APIs. For example, if we would like to analyze our competitors, we could search for different types of data along with recent business analyses, stock prices and so on.

Moreover, we created a prototype mode of an AI agent who can search and scour the Internet on its own.

On top of that, one of our team members set up an integrated Whisper service (a speech-to-text model API provided by OpenAI) – why does someone need to type on a keyboard if a superhero can just say things? With real-time transcription and text-to-speech synthesis, we make it even more enjoyable to use. Imagine being able to search for your Q3 financial statistics and receive them directly to your ear during a conversation with stakeholders! You could eliminate the need for someone to look it up and prepare a report.

Such things are entering the realm of possibility, and likely only companies which understand and can use such potential will dominate the market.

Get in touch!

If you are interested in your own solution, feel free to let us know! We can help you to build a competitive advantage by adding the features of GPT and other LLMs to your products and services.

Diffusion models in practice. Part 2: How good is your model?

May 8, 2023/in Generative AI /by Jarosław Kochanowicz, Maciej Domagała, Dawid Stachowiak and Dawid Żywczak

Introduction

This is the second post in our series “Diffusion models in practice”. In the previous one, we established a strong theoretical background for the rest of the series. We talked about diffusion in deep learning, models that utilize it to generate images, and several ways of fine-tuning it to customize your generative model. We also explained the building blocks of Stable Diffusion and highlighted why its release last year was such a groundbreaking achievement. If you haven’t read it before, we strongly recommend you start there [1]!

In this post, we start our journey into the practical aspects of diffusion modeling, which we found even more exciting. First, we would like to address a fundamental question that arises when one begins to venture into the realm of generative models: Where to start?

So many options…

Both the rapid ongoing development and the lack of general know-how in the scope of diffusion models and surrounding techniques result in many people getting confused even before they begin.

Caption: Number of papers on diffusion in recent years. Source: [2]

In the previous post, we explained the importance of Stable Diffusion [3]. The Hugging Face hub [4] already contains hundreds of models with unique adaptations. There are also numerous techniques used to fine-tune the model (which we covered in the last post) – each yielding satisfactory results. On top of that, the models can be used for a variety of different tasks revolving around image generation, such as inpainting, outpainting, image mixing, using smart starting noises, and many more.

Being spoiled for choice is usually a good thing, but keeping up is quite problematic – before you fine-tune your favorite new model, there is another state-of-the-art solution waiting to be explored. Important choices are not limited to the abstract topic of architecture. After selecting your weapon of choice, another rabbit hole opens up: parametrization. Each of the models and methods comes with dozens of parameters for both training and generation, with an exponential number of combinations. Both the inherent variance in output quality and the lack of fixed criteria for what constitutes good results make the search for perfect parameters truly challenging.

All of that can leave even the toughest deep-learning practitioner confused. But don’t worry! In this and future posts, we would like to explain our empirical and data-based approach to enable rational assessment and choices. These helped us navigate the complications and we hope that it will be beneficial for others as well. Let’s dive into it!

How can we estimate fine-tuning quality? 

Let’s assume that we managed to choose our favorite model. More than that, we adapted it to our needs via fine-tuning. That’s great news! But how can we assess the quality of our adjustments? Visual inspection always comes first, but when done ad hoc it is neither particularly informative nor reproducible. We would like a solution that is as automatic as possible, as well as being reliable.

Several natural metrics immediately come to mind, one of which makes it possible to establish how similar the output of our model is to the object that we embedded in the model domain during fine-tuning. Another can inform the user about the aesthetic value of the image. So how can we actually measure those? We decided to use the human-like approach i.e. look at the input/output images and estimate the performance of the model using subjective criteria. The metrics described below were proposed and validated based on one specific type of object: faces.

Caption: Personalized face generation process. Source: Authors

However, with small adjustments, the approach we describe below can be applied to any domain. Most importantly, the general methodology for the validation can be used for any generative setup.

Similarity assessment

Our goal is to gather insights into how well the model understands the images that were provided during training, or in our case, fine-tuning. The similarity of the object’s characteristics between the real picture and the output is vital for a high-quality model. To make sure the model has acquired knowledge about the characteristics of new objects, we need a validation setup that tells us how similar the object in the generated image is to the object in the dataset used for fine-tuning.

Usually, we would like the model to generate our object in different setups, styles, and scenarios. A different textual input prompt means different colors and textures in the images. What we truly care about is how well our model conveys the characteristics of the object to an image.

To measure that, we use a two-stage solution – first, we crop out the object from pictures, as this is the only part we want to measure the similarity for. Next, we embed the images using an Inception-based [5] neural network. Let’s talk about those models and the validation scheme we used to make sure they are the right fit for our approach.

Caption: Similarity assessment flow. Source: Authors

Face cropping

We opted for the MTCNN [6] architecture which is specifically trained for the task of face cropping, but a similar architecture can be applied to any other type of object. This solution is based on several Convolutional Neural Networks that work in a cascade fashion to locate the face with some landmarks in an image.

The first network is called a Proposal Network – it parses the image and selects several bounding boxes that surround an object of interest: a face, in our case. It is the fastest of all three networks since its main job is to perform basic filtering and produce a number of candidate boxes. In the second step, the candidates are fed into a Refine Network, which further reduces the number of false candidates and refines the bounding box locations. The third and last network, the Output Network, performs the final adjustments and additionally provides information about facial landmarks. In this model, the location of 5 landmarks is predicted – the left and right eyes, the left and right corners of the mouth, and the nose.

Caption: MTCNN architecture. Source: [7]


The architecture combines several tasks of different natures. One is face classification, where a classical cross-entropy loss is used. The others are bounding box regression and facial landmark localization, where weight updates are calculated via Euclidean loss functions. Each network’s prediction finishes with a non-maximum suppression mechanism that merges the many candidate boxes into the most likely ones.

Needless to say, other tools can be applied with similar effects here, including architectures that are not designed to work with facial images.

Face embedding

Having extracted the faces, we opted for InceptionResnetV1 [8] as an encoder to allow for reliable comparison. This model boils the input down to a numerical representation, which allows it to represent abstract images in a compact vector form. It was originally trained on the VGGFace2 [9] dataset, containing 3.31 million images across over 9000 identities. For other types of objects, other versions of Inception-based networks could be used.

Caption: Sample images of Ruby Lin and Roy Jones Jr. from the VGGFace2 dataset. Source: [9]


To assess how similar the faces in two images are to each other, we simply run the generated images through the model together with the input image and receive vector embeddings. This allows us to run any type of statistical analysis on the vectors – we opted for cosine similarity. The final score is the mean similarity of every generated image to the input image, which gives a statistically meaningful outcome.
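A minimal sketch of this crop-and-embed similarity measurement is shown below, assuming the facenet-pytorch implementations of MTCNN and InceptionResnetV1; the image paths are placeholders.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                              # face cropping
encoder = InceptionResnetV1(pretrained="vggface2").eval()  # face embedding

def face_embedding(path: str) -> torch.Tensor:
    face = mtcnn(Image.open(path).convert("RGB"))
    if face is None:
        raise ValueError(f"No face detected in {path}")
    with torch.no_grad():
        return encoder(face.unsqueeze(0))[0]               # 512-dimensional vector

reference = face_embedding("training_face.jpg")
generated = [face_embedding(p) for p in ["generated_001.png", "generated_002.png"]]

# Mean cosine similarity of the generated faces to the reference face.
scores = [torch.nn.functional.cosine_similarity(reference, g, dim=0).item() for g in generated]
print(sum(scores) / len(scores))
```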

Caption: Examples of generated images with different similarity scores. Source: Authors

How does it work? InceptionResnetV1, trained beforehand on a significant number of different faces, can extract their features and encode their representations in a way that the similar faces are represented through vectors that lie close together in vector space.

Aesthetic assessment

Aesthetic measurement is a second important metric that helps to establish how powerful a model is. Image aesthetics is a pretty abstract concept, which is difficult to grasp and define even for a human. Strictly defining it with mathematical formulas could prove impossible; hence, we decided to model it using data. For that purpose, we used a publicly available dataset with subjective aesthetic assessments gathered from people. To best represent the images, we decided on the CLIP [10] model from OpenAI – the same architecture that is used in the Stable Diffusion [3] pipeline. CLIP works as a well-defined mapping between images and text – we covered it in a previous post in this series [1].

Caption: Aesthetic assessment flow. Source: Authors

Having a way to represent images in numerical form, there was a need to automate the process of evaluating the aesthetics of a given image. To do this, we used an AVA dataset [11] containing more than 250,000 images, along with assessments of their aesthetics by various people. Our goal was to train a multi-layer neural network in a regression setup to teach it the abstract correlation between face characteristics and aesthetic value. The model would rate each image on a scale from 0 (poor aesthetics) to 10 (great aesthetics). We applied data balancing methods to the input dataset to influence the rating distribution, as the first model’s outputs were highly condensed in the middle of the scale – the model had problems estimating extreme values, e.g., 1-2 or 9-10.
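A simplified sketch of such a regression head is shown below; the layer sizes, embedding dimensionality and training details are illustrative assumptions rather than the exact model we trained.

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Small MLP mapping a CLIP image embedding to a 0-10 aesthetic score."""

    def __init__(self, embedding_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # Squash the output into the 0-10 range used by the AVA annotations.
        return 10.0 * torch.sigmoid(self.mlp(clip_embedding))

model = AestheticHead()
loss_fn = nn.MSELoss()                             # regression against mean human ratings
clip_embeddings = torch.randn(4, 768)              # stand-in for real CLIP image embeddings
ratings = torch.tensor([[3.5], [7.2], [5.0], [8.1]])
loss = loss_fn(model(clip_embeddings), ratings)
loss.backward()
```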

Caption: Examples of generated images with low and high aesthetic scores. Source: Authors

It is worth noting that the LAION dataset [12] on which Stable Diffusion was trained also has aesthetic evaluations. While it might have been easier to use that one, it could lead to unwanted data leakage, which is why we depended on an external dataset.

Metrics in action

Enough talk! That was a comprehensive description of the metrics – it is time to see how they work. Let’s ask ourselves one of the many valid questions we might have when thinking about model fine-tuning – how many input images do I need to use?

The graph above should help you find the answer to that question. It is fully interactive, so feel free to explore it. We fine-tuned the Stable Diffusion v1.5 model on pictures of several people, and tested a different number of input images to see how it affects the training of the model. You can trace how different metrics behave during the evaluation every 300 steps and how it affects the images produced.

Visual inspection allowed us to notice that the presented metrics work well, scoring comparably to humans. However, we would not feel comfortable without validating those setups, so we decided to make sure we can rely on them to provide us with accurate information. We strongly recommend you visit Appendix A to see exactly how it was done!

Summary

In this post, we expressed how confusing it might be to successfully navigate the convoluted area of diffusion models, in both a theoretical and practical sense. To make matters simpler, we introduced two metrics that come in very handy when a reliable assessment of the models is needed. Through comprehensive validation, we showed that these models are well-balanced and suitable for our needs. In the next post of this series, we will expand on this approach, with lots of experiments and more metrics to check them. Stay tuned!

Appendix A – Metrics validation

For validation purposes, we used images of 8 different people – 4 women and 4 men – to fine-tune 8 new v1.5 Stable Diffusion models. We used the same number of images to fine-tune the models for all of the subjects. After training, two evaluation datasets were created by generating 128 pictures with each model, using half of the images for similarity validation and the other half for aesthetic validation.

Five different people marked the aforementioned sets of pictures according to their subjective opinion. Each image received a label from 1 to 5 (Likert scale [13]) from each labeler, with one meaning very bad and five being very good. This type of labeling was done for both metrics separately. After that, we could calculate several statistics, such as each evaluator’s grade, which is essentially a mathematical formulation of how the given labeler assesses the images. For all of the proper definitions, please check out Appendix B of this post.

Every set was then divided into 5 cross-validation sets in a 4:1 ratio considering the labeler’s dimension – in each set, the labels from one labeler were included in the test set. Using our proposed evaluation as the function we want to optimize, we undertook model score mapping (bucket division) on a scale of 1 to 5 on the validation sets. We performed the bucketizing in a way that minimizes the distance between the model and human answers.

Caption: Spearman’s coefficient scores for labelers and models – similarity model evaluation on the left, and aesthetic model evaluation on the right. Source: Authors

Next, we computed the model’s score on each test set (treating the model as if it were the held-out labeler) and averaged it over the 5 test sets. We can observe that there is a highly positive correlation between human labels and the score predicted by the similarity model.

Caption: Comparison of human labeling and assessment model scores. Source: Authors

In the table below we present the results of cross-validation of the model. After optimization, the models perform in a human-like fashion – exactly what we aimed for!

Caption: Table comparing results of different models. Source: Authors

Appendix B

Let’s denote \(E\) as the set of evaluators of a given set of images \(S\), and \(L\) as the set of labels on the Likert scale. We can formulate a discrete Likert scale evaluation as

\begin{equation}
\forall e \in E \;\; \forall s \in S \quad e_d(s) \in L.
\end{equation}

Each score is normalized so that the scores are within the \([0, 1]\) interval, which we can describe as a scoring function \(e\):

\begin{equation}
\forall e \in E \;\; \forall s \in S \quad e(s) \in [0,1].
\end{equation}

For each evaluator, we can define a mean evaluator score on the set of images \(S\) as

\begin{equation}
\forall e \in E \quad \bar{e}=\dfrac{\sum\limits_{s \in S} e(s)}{|S|}
\end{equation}

as well as the standard deviation of the evaluator score on the set of images S

\begin{equation}
\forall e \in E \quad \sigma(e)=\sqrt{\dfrac{\sum\limits_{s \in S}(e(s)-\bar{e})^{2}}{|S|}},
\end{equation}

where \(|S|\) denotes the cardinality of the set of images.

For a single image sample, we can define a sample score \(\bar{s}\) as
\begin{equation}
\forall s \in S \quad \bar{s}=\dfrac{\sum\limits_{e \in E} e(s)}{|E|}.
\end{equation}

Specifically for the single evaluator grading, we include a modification of the above score \(\bar{s}\) established for each labeler. Let’s denote \(E_{e}\) as the set of evaluators without evaluator \(e\) \((\forall e \in E \;\; E_{e} = E \setminus \{e\})\). We can define a sample score without the evaluator \(e\) as:
\begin{equation}
\forall e \in E \;\; \forall s \in S \quad \bar{s}_e=\dfrac{\sum\limits_{e \in E_e} e(s)}{\left|E_e\right|}.
\end{equation}
For a single evaluator \(e\) we can define their grade on the set of images \(S\), denoted as \(g(e)\) and referred to as the evaluator’s grade:
\begin{equation}
\forall e \in E \quad g(e)=1-\frac{\sum\limits_{s \in S}\left|e(s)-\bar{s}_e\right|^k}{|S|},
\end{equation}

where \(k \in \mathbb{N}_{+}\) is a grading parameter.
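For readers who prefer code to notation, the definitions above translate directly into a short NumPy sketch; the linear rescaling of the 1-5 Likert labels to the [0, 1] interval is an assumption about the normalization step.

```python
import numpy as np

# likert[e, s] = Likert label given by evaluator e to image s
likert = np.array([[1, 4, 5, 3],
                   [2, 4, 5, 3],
                   [1, 5, 4, 2]], dtype=float)
scores = (likert - 1) / 4                      # e(s): rescaled to the [0, 1] interval

mean_evaluator_score = scores.mean(axis=1)     # \bar{e} for each evaluator
evaluator_std = scores.std(axis=1)             # sigma(e) for each evaluator
sample_score = scores.mean(axis=0)             # \bar{s} for each image

def evaluator_grade(scores: np.ndarray, e: int, k: int = 1) -> float:
    """g(e): 1 minus the mean k-th power deviation from the other evaluators' sample scores."""
    others = np.delete(scores, e, axis=0)
    s_bar_without_e = others.mean(axis=0)      # \bar{s}_e
    return 1 - np.mean(np.abs(scores[e] - s_bar_without_e) ** k)

print([round(evaluator_grade(scores, e), 3) for e in range(scores.shape[0])])
```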

References

  1. https://deepsense.ai/diffusion-models-in-practice-part-1-the-tools-of-the-trade/
  2. https://vsehwag.github.io/blog/2023/2/all_papers_on_diffusion.html
  3. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al. 2022
  4. https://huggingface.co/models?other=stable-diffusion
  5. Going deeper with convolutions, Szegedy et al. 2014
  6. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks, Zhang et al. 2016
  7. https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html
  8. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Szegedy et al. 2016
  9. VGGFace2: A dataset for recognising faces across pose and age, Cao et al. 2018
  10. Learning Transferable Visual Models From Natural Language Supervision, Radford et al. 2021
  11. AVA: A Large-Scale Database for Aesthetic Visual Analysis, Naila Murray and Luca Marchesotti and Florent Perronnin, 2012
  12. https://laion.ai/blog/laion-5b/
  13. https://simplypsychology.org/likert-scale.html
How to train a large language model using limited hardware?

April 17, 2023/in Generative AI /by Alicja Kotyla

Large language models (LLMs) are yielding remarkable results for many NLP tasks, but training them is challenging due to the demand for a lot of GPU memory and extended training time. This is compounded by the fact that the size of many models exceeds what a single GPU can store. For instance, to fine-tune BLOOM-176B, one would require almost 3 TB of GPU memory (approximately 72 80GB A100 GPUs). In addition to the model weights, the cost of storing intermediate computation outputs (optimizer states and gradients) is typically even higher. To address these challenges, various parallelism paradigms have been developed, along with memory-saving techniques to enable the effective training of LLMs. In this article, we will describe these methods.

Data parallelism

In data parallelism (DP), the entire dataset is divided into smaller subsets, and each subset is processed simultaneously on separate processing units. During the training process, each processing unit calculates the gradients for a subset of the data, and then these gradients are aggregated across all processing units to update the model’s parameters. This allows for the efficient processing of large amounts of data and can significantly reduce the time required for training deep learning models.

Figure 1: Illustration of data parallelism. Source: [11]

While data parallelism can offer significant benefits in terms of reducing the time taken to train deep learning models, there are also several constraints associated with this technique. The key constraint of data parallelism is that each processing unit needs to store a copy of the entire model’s parameters and gradients, which can be a significant memory overhead. Naive DP cannot work well if the model size is larger than the memory of a single GPU node. Later in this article, we will elaborate on how to work with limited GPU memory when the model is too big to fit on one machine.
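As a point of reference, data parallelism is what you get almost for free with PyTorch’s DistributedDataParallel. The sketch below is a minimal single-file example and assumes the script is launched with torchrun, so the rank environment variables are already set; the model and data are toy placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                   # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])       # gradients are all-reduced automatically
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        # In a real job, a DistributedSampler gives each rank a different data shard.
        batch = torch.randn(32, 1024, device=local_rank)
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```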

Pipeline Parallelism

Since deep neural networks typically have multiple layers stacked on top of each other, the naive approach to model parallelism involves dividing a large model into smaller parts, with a few consecutive layers grouped together and assigned to a separate device, with the output of one stage serving as the input to the next stage. For instance, if a 4-layer MLP is being parallelized across 4 devices, each device would handle a different layer. The output of the first layer would be fed to the second layer and so on, until the last layer’s output is produced as the MLP’s output.
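In PyTorch, this naive layer-to-device assignment can be written in a few lines; the sketch below assumes two GPUs and a toy two-stage MLP.

```python
import torch
import torch.nn as nn

class TwoStageMLP(nn.Module):
    """Naive model parallelism: consecutive layers live on different devices."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage1(x.to("cuda:0"))
        # While cuda:1 processes this activation, cuda:0 sits idle -
        # the "bubble" that pipeline parallelism tries to shrink.
        return self.stage2(x.to("cuda:1"))

model = TwoStageMLP()
output = model(torch.randn(32, 1024))
```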

Figure 2: The naive model parallelism strategy is inefficient because the network works in a sequential manner and can only use one accelerator at a time. Source: [3]

However, naive model parallelism has some limitations. One major limitation is that this approach suffers from inefficiency due to idle time or “bubbles” (when machines have to wait for other machines to finish their stages in both the forward and backward passes – see the diagram below). Another limitation is that the communication overhead between devices can be high, particularly when dealing with large models or data sets which can slow down the training process.

To address the need for efficient pipeline parallelism, in 2019 researchers at Google introduced a new technique for parallelizing the training of deep neural networks across multiple GPUs – GPipe (Huang et al. 2019).

Unlike naive model parallelism, GPipe splits the layers in a way that maximizes parallelism while minimizing communication between GPUs. The key idea behind GPipe is to partition the incoming batch into smaller micro-batches, which are processed in a distributed manner on the available GPUs. The GPipe paper found that if there are at least four times as many micro-batches as partitions, the bubble overhead is almost non-existent. Furthermore, the authors of the paper report that the Transformer model exhibits an almost linear speedup when the number of micro-batches is strictly larger than the number of partitions.

Figure 3: GPipe splits the input mini-batch into micro-batches, which can be processed by multiple accelerators at the same time. Source: [3]


However, when a single parameter, such as a large embedding table with a large vocabulary size, requires a significant amount of GPU memory, the methods described in this paragraph become inefficient since treating this large tensor as an atomic unit impedes the balance of the memory load.

Tensor Parallelism

In tensor parallelism, specific model weights, gradients and optimizer states are split across devices and each device is responsible for processing a different portion of the parameters.

In contrast to pipeline parallelism, which splits the model layer by layer, tensor parallelism splits individual weights. In this section we will describe the technique for parallelizing a Transformer model with tensor parallelism using an approach that was proposed in the Megatron-LM paper (Shoeybi et al. 2020).

A Transformer layer consists of a self-attention block followed by a two-layer perceptron. First, we will explain the MLP block.

Figure 4: Illustration of tensor parallelism for the MLP block. Source: [5]


The first part of the block is a GEMM (General Matrix Multiplication) operation followed by a GeLU. The GEMM operation is partitioned in such a way that the weight matrix \(A\) is split along its columns \(A=[A_1, A_2]\) and GeLU can be independently applied to the output of each partitioned GEMM:

$$
[Y_1, Y_2] = [GeLU(XA_1), GeLU(XA_2)].
$$

In this way, a synchronization point can be skipped. The second GEMM operation is performed such that the weight matrix \(B\) is split along its rows and input \(Y\) along its columns:

$$
Y = [Y_1, Y_2], \quad B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix},
$$

resulting in \(Z = Dropout(YB) = Dropout(Y_1B_1 + Y_2B_2)\).
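As a quick sanity check of the algebra above (ignoring dropout), the NumPy sketch below verifies that the column-split first GEMM followed by the row-split second GEMM reproduces the unpartitioned result.

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))        # input activations
A = rng.normal(size=(64, 128))      # first weight matrix, split along columns
B = rng.normal(size=(128, 64))      # second weight matrix, split along rows

Z_reference = gelu(X @ A) @ B       # unpartitioned computation

A1, A2 = np.split(A, 2, axis=1)     # column split across two "devices"
B1, B2 = np.split(B, 2, axis=0)     # row split across two "devices"
Y1, Y2 = gelu(X @ A1), gelu(X @ A2)  # GeLU applied independently, no synchronization needed
Z_parallel = Y1 @ B1 + Y2 @ B2       # partial results summed (an all-reduce in practice)

assert np.allclose(Z_reference, Z_parallel)
```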

We will now move on to the explanation of the self-attention block.

Figure 5: Illustration of tensor parallelism for the self-attention block. Source: [5]


It runs GEMM with query (\(W^Q\)), key (\(W^K\)), and value weights (\(W^V\)) according to the previously explained partitioning in parallel. Next, another GEMM is used to produce the attention head results:

$$
Attention(Q, K, V) = softmax \big( \frac{QK^T}{\sqrt{d_k}} \big) V.
$$

To measure the scalability of their implementation, the authors of the Megatron-LM paper considered GPT-2 models with \(1.2, 2.5, 4.2\) and \(8.3\) billion parameters. They evaluated both tensor parallelism alone and a combination of tensor parallelism with 64-way data parallelism, which demonstrated up to 76% scaling efficiency using 512 GPUs. Mixed Precision Training and Activation Checkpointing techniques were also used – we elaborate on these in the following paragraphs.

Figure 6: Training efficiency for tensor parallelism and a combination of tensor parallelism with data parallelism as a function of the number of GPUs. Source: [5]

Sequence parallelism

Another method to parallelize computation across multiple devices is Sequence Parallelism (Li et al. 2021), a technique for training Transformer models on very long sequences by breaking them up into smaller chunks and processing each chunk in parallel across multiple GPUs.

Figure 7: Sequence parallelism illustration. Input sequences are divided into smaller pieces and distributed to corresponding devices. Each device has the same trainable parameters but different sub-sequence input chunks. Source: [7]


The main challenge in this approach is computing attention scores across devices. To solve this problem, the authors came up with a new method called Ring Self-Attention (RSA), which makes it possible to compute attention scores in a distributed setting. There are two steps in RSA which we will briefly describe in this section.

We begin by establishing some notation, which we adopt from the original paper. We assume that the embeddings on the n-th device correspond to the n-th chunk of the input sequence and are denoted as \(K^n\) (key), \(Q^n\) (query), and \(V^n\) (value). Additionally, we set the number of available GPUs to \(N\).

The goal of the first stage of RSA is to compute \(Attention(Q^n, K, V)\) which is the self-attention layer output on the n-th device. To achieve this, the key embeddings are shared among the devices and used to calculate attention scores \(QK^T\) in a circular manner. This requires \(N-1\) rounds of communication. As a result, all attention scores \(S^1, S^2, \dots, S^N\) are stored on the proper devices.

In the second stage of RSA, the self-attention layer outputs \(O^1, O^2, \dots, O^N\) are calculated.  For this purpose, all value embeddings are transmitted in a similar way as the key embeddings in the previous stage.

Mixture-of-Experts

The fundamental concept behind the Mixture-of-Experts method (MoE, Shazeer et al. 2017) is ensemble learning. To go into more detail, the MoE layer consists of a set of \(n\) feed-forward expert networks \(E_1, E_2, \dots, E_n\) (which can be distributed across GPUs) and the gating network \(G\), whose output is a sparse \(n\)-dimensional vector. The output \(y\) of the MoE layer for a given input \(x\) is

$$
y = \sum\limits^{n}_{i=1} G(x)_i E_i(x),
$$

where \(G(x)\) denotes the output of the gating network and \(E_i(x)\) – the output of the \(i\)-th expert network. It is easy to observe that wherever \(G(x)_i = 0\) there is no need to evaluate \(E_i\) on \(x\).

Figure 8: Illustration of a mixture-of-experts (MoE) layer where the gating network activates only two of the experts. Source: [8]


For the gating network, the authors introduced a mechanism called Noisy Top-k Gating that adds two components to the standard Softmax gating network – noise and sparsity. More precisely, before applying the softmax function, Gaussian noise is added to the input, then only the top k values are kept and the rest are set to \(-\infty\):

$$
G(x) = Softmax(KeepTopK(H(x), k))
$$
$$
H(x)_i = (x \cdot W_g)_i + \epsilon \cdot softplus((x \cdot W_{noise})_i), \epsilon \sim N(0, 1),
$$
$$
KeepTopK(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v, \\ -\infty & \text{otherwise.} \end{cases}
$$
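A compact PyTorch sketch of this noisy top-k gating is shown below; the dimensions are arbitrary and the load-balancing losses used in the original paper are omitted.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, W_g, W_noise, k=2):
    """Return sparse gate weights G(x) of shape (batch, n_experts)."""
    clean_logits = x @ W_g
    noise_std = F.softplus(x @ W_noise)
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_std

    top_vals, top_idx = noisy_logits.topk(k, dim=-1)
    masked = torch.full_like(noisy_logits, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)        # keep the top-k logits, -inf elsewhere
    return F.softmax(masked, dim=-1)              # exactly k non-zero gates per row

batch, d_model, n_experts = 4, 32, 8
x = torch.randn(batch, d_model)
W_g = torch.randn(d_model, n_experts)
W_noise = torch.randn(d_model, n_experts)
gates = noisy_top_k_gating(x, W_g, W_noise)       # only the selected experts need to run
```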

Activation Checkpointing

Suppose we partition a neural network into k partitions. In Activation Checkpointing (Chen et al. 2016), only the activations at the boundaries of each partition are saved and shared between workers during training. The intermediate activations of the neural network are recomputed on-the-fly during the backward pass of the training process rather than storing them in memory during the forward pass.
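In PyTorch this idea is exposed through torch.utils.checkpoint; a minimal sketch for a toy sequential network is shown below.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)])
x = torch.randn(16, 1024, requires_grad=True)

# Split the network into 4 segments; only the activations at segment boundaries
# are stored, the rest are recomputed during the backward pass.
output = checkpoint_sequential(model, 4, x)
output.sum().backward()
```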

Mixed Precision Training

Two common floating-point formats used in Deep Learning applications are the single-precision floating-point format (FP32) and the half-precision floating-point format (FP16). The half-precision data type uses 16 bits to represent a floating-point number, with 1 bit for the sign, 5 bits for the exponent, and 10 bits for the significand. On the other hand, FP32 uses 32 bits, with 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand.

Figure 9: FP16 and FP32 formats. Source: [16]


The main advantage of using FP16 over FP32 is that it requires only half as much memory, which can be beneficial in applications where speed and reduced memory usage are more important than accuracy, such as in Deep Learning models that require a large number of calculations. However, FP16 is less precise than FP32, which means that it can result in rounding errors when performing calculations.

The concept of Mixed Precision Training (Narang & Micikevicius et al. 2018) bridges the gap between reducing memory usage during training and maintaining good accuracy.

Mixed Precision Training involves utilizing FP16 to store weights, activations, and gradients. However, to maintain accuracy similar to that of FP32 networks, an FP32 version of the weights (the master weights) is also kept and modified using the weight gradient during the optimizer step. In each iteration, a copy of the master weights in FP16 is utilized in both the forward and backward passes, which reduces storage and bandwidth requirements by half compared to FP32 training.

Figure 10: Mixed precision training process. Source: [9]


When using FP16, the range of representable values is smaller than when using FP32, which can cause the gradients to become very small and ultimately disappear. This can make it difficult to train a deep neural network effectively.

Figure 11: The distribution of weight gradient exponents when training a speech recognition model with FP32 weights. Source: [9]


To better handle gradients with small magnitudes, loss scaling is used. In loss scaling, the loss function is multiplied by a scaling factor before computing the gradients during backpropagation. This scaling factor increases the magnitude of the gradients, thereby preventing them from becoming too small and underflowing. Finally, the gradients are divided by the same scaling factor to undo the scaling, and used to update the weights.
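In PyTorch, mixed precision with dynamic loss scaling is available out of the box via torch.cuda.amp; the sketch below uses a toy model and random data purely for illustration.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                 # maintains the loss scaling factor

for _ in range(100):
    batch = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")

    with torch.cuda.amp.autocast():                  # forward pass in FP16 where it is safe
        loss = torch.nn.functional.mse_loss(model(batch), target)

    scaler.scale(loss).backward()                    # scale the loss before backpropagation
    scaler.step(optimizer)                           # unscale gradients, then optimizer step
    scaler.update()                                  # adjust the scaling factor dynamically
    optimizer.zero_grad()
```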

The authors of “Mixed Precision Training” also provide experimental results showing the effectiveness of the technique on image classification and language translation tasks. In both cases, Mixed Precision Training matched the FP32 results.

Zero Redundancy Optimizer

Optimizers use a lot of memory. For example, while using the Adam optimizer, we need to save four times the memory of model weights, as it stores momentums and variances which are as big as the gradients and model parameters (Weng 2021).

All parallelism techniques described in the previous sections store all the model parameters required for the entire training process, even though not all model states are needed during training. To address these drawbacks of training parallelism while retaining the benefits, Microsoft researchers developed a new memory optimization approach called Zero Redundancy Optimizer (ZeRO, Rajbhandari et al. 2019).

ZeRO aims to train very large models efficiently by eliminating redundant memory usage, resulting in better training speed. It eliminates memory redundancies in Data Parallel processes by dividing the model states across the devices instead of duplicating them.

ZeRO has three optimization stages:

  1. Optimizer Partitioning: The optimizer state is divided equally among available devices. Each GPU only stores and updates its assigned optimizer state and parameters during training.
  2. Gradient Partitioning: Only gradients responsible for updating corresponding parameters in the assigned partitions are sent to the GPU during backpropagation.
  3. Parameter Partitioning: Only the partition of a parameter needed for forward and backward propagation is stored in the GPU. Other required parameters are received from other GPUs.

Figure 12: Three optimization stages of ZeRO compared with the data parallelism baseline. Source: [13]


According to the authors, memory reduction is directly proportional to the degree of data parallelism. For instance, partitioning across 8 GPUs will lead to an 8-fold reduction in memory usage.
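In practice, ZeRO is most commonly used through the DeepSpeed library, where the stage is a single configuration entry. The sketch below is a minimal illustration; the batch size, learning rate and model are placeholders, and the exact set of supported configuration keys should be checked against the DeepSpeed documentation.

```python
import deepspeed
import torch

ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # 1: optimizer states, 2: + gradients, 3: + parameters
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

batch = torch.randn(32, 1024, device=engine.device)
loss = engine(batch).pow(2).mean()
engine.backward(loss)                    # DeepSpeed handles loss scaling and partitioning
engine.step()
```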

FlashAttention

Making Transformers understand longer inputs is difficult because their multi-head attention layer needs a substantial amount of memory and time to process the input, and this requirement grows quadratically with the length of the sequence. When training Transformers on long sequences with parallelism techniques described in previous sections, the batch size can become extremely small. This is the scenario which the FlashAttention method (Dao et al. 2022) improves.

To optimize for long sequences for each attention head, FlashAttention splits the input \(Q, K, V\) into blocks and loads these blocks from GPU HBM (which is the main memory) into SRAM (which is its fast cache). Then, it computes attention with respect to that block and writes back the output to HBM.

Figure 13: FlashAttention algorithm. Source: [14]


In further research (Dao 2023), the author additionally parallelizes over the sequence length dimension. It is reported that while keeping the number of heads at 12 and head dimension at 128 and using an A100 40GB GPU, FlashAttention is between 2.2x and 2.7x faster for longer sequences (8k) compared to Pytorch and Megatron-LM attention implementations.
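FlashAttention itself is a fused CUDA kernel, but since PyTorch 2.0 it can be reached (when the device, dtype and head dimension allow it) through torch.nn.functional.scaled_dot_product_attention; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 12, 8192, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch dispatches to a FlashAttention kernel when the inputs allow it,
# avoiding materializing the full (seq_len x seq_len) attention matrix.
output = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```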

Figure 14: Comparison of the time taken by the forward and backward passes of the attention layer as the sequence length increases while the batch size decreases. Source: [15]


Also, in the case of end-to-end training, a significant speed-up was obtained. The usage of FlashAttention to train Transformers of up to 2.7B parameters on sequences of 8k in length made training 2.2 times faster compared to Megatron-LM.

Figure 15: Training efficiency as a function of the model size. Source: [15]

Summary

In this article, various memory optimization techniques for training large language models were discussed. We explained different parallelism paradigms: Data Parallelism, Naive Model Parallelism, Pipeline Parallelism, Tensor Parallelism and Sequence Parallelism. In addition, some pros and cons of these approaches were presented. We then moved on to other memory optimization methods: Mixture-of-Experts, Mixed Precision Training and ZeRO (Zero Redundancy Optimizer). While explaining Mixed Precision Training, we also went through the Loss Scaling technique. Finally, we introduced FlashAttention – an algorithm dedicated to memory reduction for the attention layer. To sum up, we have presented several methods that it is important to be familiar with when training large language models, as they can help improve the efficiency, scalability and cost-effectiveness of training, as well as optimize resource utilization.

References:

  1. Petals: Collaborative Inference and Fine-tuning of Large Models, Alexander Borzunov et al. 2022
  2. How to train really large models on many GPUs?, Lilian Weng 2021
  3. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, Huang et al. 2019
  4. Tensor Parallelism, Amazon SageMaker Documentation 2023
  5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, Shoeybi et al. 2020
  6. Training Deep Nets with Sublinear Memory Cost, Chen et al. 2016
  7. Sequence Parallelism: Long Sequence Training from System Perspective, Li et al. 2021
  8. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Shazeer et al. 2017
  9. Mixed Precision Training, Narang & Micikevicius et al. 2018
  10. Train With Mixed Precision, NVIDIA Docs Hub 2023
  11. NeMo Megatron, NVIDIA NeMo 2022
  12. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Rajbhandari et al. 2019
  13. ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters, DeepSpeed Team, Rangan Majumder & Junhua Wang 2020
  14. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Dao et al. 2022
  15. FlashAttention: Fast Transformer training with long sequences, Dao 2023
  16. The bfloat16 numerical format, Cloud TPU Documentation 2023
Data generation with diffusion models – part 1

April 3, 2023/in Generative AI /by Natalia Czerep

It is widely known that computer vision models require large amounts of data to perform well. The reason for this is the complexity of these tasks, which usually involve the recognition of many different features, such as shapes, textures, and colors. Therefore, to train state-of-the-art, advanced models, it is necessary to use vast datasets like, for example, ImageNet, containing 14 million images.

Unfortunately, in many business cases we are left with a small amount of data. Small datasets may be due to the high cost of data collection, privacy concerns, or the limited availability of data. This causes various problems such as underrepresentation of the rarest classes, being prone to overfitting and the limitations of the machine learning algorithms or deep learning models that can be used.

When working with limited data, it can be difficult to train a model that is accurate and generalizes well to new examples. There are several approaches to overcoming the issue of insufficient data, one of which is supplementing the available dataset with new images, which is discussed in this article.

Diffusion models to the rescue

Diffusion models are a class of generative models that have become increasingly popular in recent years due to their ability to generate high-quality images. At a high level, diffusion models work by first adding a certain amount of random noise to the images from the training set. Then the reverse process happens: during training the model learns to remove the noise and reconstruct the image. The advantage of this approach is that it allows the model to generate high-quality samples that are indistinguishable from real data, even with a small number of training examples. This is particularly useful in situations such as medical imaging, where obtaining high-quality images is expensive and time-consuming. Popular diffusion models include OpenAI’s DALL·E 2, Google’s Imagen, and Stability AI’s Stable Diffusion.

You can read more about the recent rise of diffusion-based models in our recent post.

Related Work

In recent years, there has been growing interest in the application of diffusion-based models in creating new images based on those which already exist. Many architectures modify the baseline to achieve the best quality of output. Below we discuss a few of them, to give you an overview of what can be accomplished.

Medfusion

One of the fields where there is a need to complement datasets is medical imaging. The use of real patient data is encumbered with privacy and ethical concerns. What is more, there is a lack of process standardization when it comes to sharing sensitive data, even between hospitals and other medical research facilities. To overcome the barriers of privacy issues and a lack of available medical data, the Medfusion architecture was proposed by Müller-Franzes et al. in [1].

In the past, it was common to use generative adversarial networks (GANs) to generate data based on existing training data. However, it has been shown that GANs suffer from unstable training behavior, among other things [2].

In [1] the authors present a novel approach based on Stable Diffusion [3]. The model consists of two parts: an autoencoder and a Denoising Diffusion Implicit Model (DDIM). The autoencoder compresses the image space into a latent space, and during training the latent space is decoded back to the image space. A pre-trained autoencoder maps the image into the latent space, which is then diffused into Gaussian noise. A U-Net model is used to denoise the latent space, and samples are generated with the DDIM. During their research the authors first investigated whether the autoencoder was sufficient to encode images into a compressed space and decode them back without losing medically relevant details. They then studied whether the Stable Diffusion autoencoder, pre-trained on natural images, could be used for medical images without further training. The results showed that the Medfusion model was effective in compressing and generating images while retaining medically relevant details, that the pre-trained Stable Diffusion autoencoder could be reused for medical images without losing any relevant details, and that Medfusion could outperform GANs in terms of the quality of the output images. The study highlights the potential of the Medfusion model for medical image compression and generation, which could have significant implications for healthcare providers and researchers.

They explored three domains of medical data: ophthalmologic data (fundoscopic images), radiological data (chest x-rays) and histological data (whole slide images of stained tissue).

Figure 1. Qualitative image generation comparisons. Image from [1].

ControlNet

While working with diffusion models, one can encounter the problem that the output changes even in the parts of the picture that we would like to stay the same. In the case of generating new images to supplement an existing dataset, we would prefer, e.g., the semantic masks or the edges (possibly obtained with a Canny edge detector or a Hough line detector) that we already have to still apply to the newly created image.

Figure 2. ControlNet architecture overview. Image from [4].


In the article “Adding Conditional Control to Text-to-Image Diffusion Models” by Lvmin Zhang and Maneesh Agrawala [4], ControlNet architecture is presented in more detail. ControlNet makes it possible to augment Stable Diffusion by making use of additional inputs like segmentation masks, keypoints or edges. The architecture makes two copies of the network – one locked and the other one trainable. The locked copy preserves the network capability learned from billions of images, while the trainable copy is trained on task-specific datasets to learn the conditional control. The trainable and locked neural network blocks are connected with a layer called “zero convolution”. Training using the zero convolution is efficient because it does not introduce any new noise to deep features. Consequently, it is just as fast as fine-tuning a diffusion model, as opposed to starting the training process from scratch when new layers are added.
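
To make the zero convolution idea more tangible, here is a minimal PyTorch sketch (our own illustration, not code from the paper) of a 1x1 convolution initialized to zeros, so that at the start of training the trainable branch adds nothing to the features of the locked copy:

import torch
import torch.nn as nn

class ZeroConv2d(nn.Module):
    """1x1 convolution initialized to zero, used to connect the trainable
    and locked branches in a ControlNet-style setup."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)  # outputs exactly zero at initialization
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

# At initialization the zero convolution outputs zeros, so adding it to the
# frozen branch's features leaves the pre-trained behavior untouched.
frozen_features = torch.randn(1, 320, 64, 64)
trainable_features = torch.randn(1, 320, 64, 64)
zero_conv = ZeroConv2d(320)
combined = frozen_features + zero_conv(trainable_features)
assert torch.allclose(combined, frozen_features)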

The authors trained several ControlNets with various datasets of different conditions, such as Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, and depths. The results showed that ControlNet was effective in controlling large image diffusion models to learn task-specific input conditions.

Figure 3. Control Stable Diffusion with Canny edge map. Image from [4].

Data augmentation with diffusion models – DA-Fusion

Standard data augmentation techniques are image transformations such as flips, rotations or changes in color. Unfortunately, they do not allow more sophisticated changes in the appearance of the object. Suppose we would like to have a model detect and recognize a plastic bottle in the wild, e.g., while creating a waste detector – it would be very helpful to be able to vary the appearance of the bottle in terms of the label or the color of the bottle, etc. Unfortunately this is not possible with classical data augmentation. The DA-Fusion [5] method, based on text-to-image diffusion models, was proposed to address this issue.

The authors utilize pre-trained diffusion models to generate high-quality augmentations for images, even those with visual concepts not previously known to the base model. In the text encoder, new tokens were used to adapt the diffusion model to new domains. In order to do so, Textual Inversion [6] was applied – a technique for capturing novel concepts in the embedding space of a text encoder.

Figure 4. System architecture for DA-Fusion. Image from [5].

Our interest

At deepsense.ai we strongly believe that data is crucial to the performance of machine learning models. Therefore, we are constantly searching for new approaches to make the most of the data that we have available. Recently, we have been exploring the use of generative models as well as novel diffusion models and variations thereof.

Stay tuned for our next post to find out more about our approach to applying these methods to synthesizing datasets from different domains (i.e. medical images, street view) for classification and segmentation tasks.

Bibliography

  • [1] “Diffusion probabilistic models beat GAN on medical 2D images” Gustav Müller-Franzes et al., 2022
  • [2] “What is going on with my GAN?” Fabiana Clemente, 2020 https://medium.com/towards-data-science/what-is-going-on-with-my-gan-13a00b88519e
  • [3] “High-Resolution Image Synthesis with Latent Diffusion Models” Robin Rombach et al., 2021
  • [4] ”Adding Conditional Control to Text-to-Image Diffusion Models” Lvmin Zhang et al., 2023
  • [5] “Effective Data Augmentation With Diffusion Models” Trabucco et al., 2023
  • [6] “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion” Rinon Gal et al., 2022

Diffusion models in practice. Part 1: The tools of the trade

March 29, 2023/in Generative AI /by Jarosław Kochanowicz, Maciej Domagała, Dawid Stachowiak and Krzysztof Dziedzic

The AI revolution continues, and there is no indication of it nearing the finish line. The last year has brought astonishing developments in two critical areas of generative modeling: large language models and diffusion models. To learn more about the former, check out other posts [1] by deepsense.ai. This series is devoted to sharing our practical know-how of diffusion models.

Application of the family of models using a mechanism called “diffusion” in various generative setups remains one of the hottest topics in machine learning. Diffusion-based models have proven their ability to yield results surpassing all other well-known counterparts previously used in this domain, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). Sohl-Dickstein et al.’s [2] publication in 2015 brought a breath of fresh air to the generative model scene. For a few years, the concept was gradually improved upon, and last year alone brought numerous state-of-the-art publications in domains such as image-to-image, text-to-audio, and time series forecasting, to name just a few. The number of applications is growing by the day, although the text-to-image domain remains the most popular so far – we focus solely on it in this series as well. There are many practical questions one may have when trying to use these methods, such as:

  • Which tools bring the best results?
  • How can I reliably judge whether the results are satisfactory?
  • What parameter values should I use in image generation?
  • How many images should I use to train diffusion models optimally? How many training steps?

These and similar questions have both beginners and advanced practitioners struggling. While the Internet is full of math-heavy theories and opinionated claims, there are not many well-researched answers to these questions available. Over the next few posts, we intend to present a practical and empirical answer to them.

The series will contain the results of multiple experiments using various models and metrics. Based on these, we will present insights which are relevant to the specific practical dilemmas and challenges, and allow you to draw your own conclusions by facilitating a graphical, interactive exploration of these results.

First, however, this post will lay the groundwork for this with just enough theory to make the following ones understandable. We will introduce the relevant tools and concepts to be referenced in later posts, and we strongly recommend familiarizing yourself with them. More specifically, we will introduce Stable Diffusion [3] (one of the loudest models published last year) and several tools used to finetune it, including DreamBooth [4], LoRA [5], and Textual Inversion [6].

Stable Diffusion: the root of it all

While DALL·E [7] and DALL·E 2 [8] were responsible for drawing large-scale attention to generative image models, Stable Diffusion [3] was the model that unleashed a true revolution. Since its open-sourcing in August 2022, anyone has been able to modify, expand, tweak, or simply use it on their own GPU or in a Colab notebook [9] for free. Follow-up technologies appeared, including training methods like DreamBooth [4], allowing people to see themselves (or anyone else) as a character in their generative art. The landscape of generative models (and possibly of art itself [10]) changed forever.

What does ‘diffusion’ stand for in ‘diffusion models’ and ‘Stable Diffusion’?

While ‘Stable Diffusion’ sounds catchy, for someone starting their adventure in the generative realm, the term ‘diffusion’ may be confusing. Within the context of deep learning, diffusion refers to one of the processes used by these methods to generate images based on the training data. That is, admittedly, a bit vague. How exactly do diffusion models differ from GANs, VAEs, or other models used in image modeling?

There are numerous novelties proposed. Essentially, various generative architectures are composed of two processes that complement each other. GAN architectures utilize a generator and a discriminator, while VAEs use an encoder and decoder setup. For diffusion-based models it is no different, as we can formulate the model with two processes – diffusion (noising) and denoising [11].

Diffusion – also referred to as the forward diffusion process – is meant to sequentially destroy the image by gradually adding noise to it, so that its main characteristics are no longer observable. One of the relatively new ideas for this family of models is that the formulation of this forward process is fixed. That is a large difference when compared to, e.g., VAEs, where both main components of the architecture are fully trainable. In the case of diffusion, the encoder’s job is taken over by a mathematical process – a neural network here is redundant.

The main goal of the entire architecture is to teach the model to reverse that process; to create something meaningful – an image – from pure noise. This denoising process is performed by a neural network, and arguably this is where most of the heavy lifting is done. During training, the network is presented with a diffused image and is taught to predict the noise added to it. When the predicted noise is subtracted from the input, a person/object/scene starts to emerge in the picture. As usual, the network is trained on millions of images, and feedback about its performance is constantly provided via the loss function and backpropagation. The whole process allows the network to gather information about the characteristics of pictures from the same class. We don’t tell the network explicitly how to reverse the diffusion process, since the generative power comes from interpolating the knowledge that the model gains during training. In truth, we wouldn’t even be able to do that, since it would require the model to have virtually infinite capacity (more about this in detail in [12]).
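
As a rough, simplified sketch of the training objective described above (our own illustration; the "model" argument stands for any network that takes a noisy image and a timestep and predicts the added noise), a single noise-prediction training step could look as follows:

import torch
import torch.nn.functional as F

def diffusion_training_step(model, images, num_steps=1000):
    """One simplified DDPM-style training step: add noise, predict it, compare."""
    batch = images.shape[0]
    # Sample a random diffusion step for each image in the batch.
    t = torch.randint(0, num_steps, (batch,))
    # Linear beta schedule -> cumulative product of alphas (simplified).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    a = alphas_cumprod[t].view(batch, 1, 1, 1)
    # Forward (fixed) diffusion: mix the image with Gaussian noise.
    noise = torch.randn_like(images)
    noisy_images = a.sqrt() * images + (1 - a).sqrt() * noise
    # The network is trained to predict the noise that was added.
    predicted_noise = model(noisy_images, t)
    return F.mse_loss(predicted_noise, noise)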

Visualization of the forward and reverse diffusion processes. Source: Authors’ own elaboration

That’s a very brief summary of the basics of the construction of diffusion models. We already have a detailed blog post [12] that focuses on the mathematical aspects. We covered DALL·E 2 [8] and Imagen [13] in detail there – we strongly recommend checking it out, as it should shed a great deal of light on the intricacies of diffusion in deep learning. In this section, we will give a general overview of the key terms and components of Stable Diffusion.

The building blocks

Stable Diffusion architecture. Source: [3]

Even though the image above might indicate otherwise, compared to the intricate architecture of e.g. DALL·E 2 [8], Stable Diffusion [3] seems to be based on concepts which are a little easier to grasp. There are three main components of the solution:

  • Text Encoder, which is necessary for translation between the text and the latent generative space,
  • U-Net type Neural Network, which runs the diffusion process allowing for the generation of new information,
  • Image Autoencoder, which is able to compress the input image into its latent space representation and translate the latent output into the actual image.

Below we will explain each of these in a bit more detail.

Text Encoder

Diffusion models can work on their own, generating images based on the knowledge gained through the training process. Usually, we would like to be able to guide the generation process using text, so the model produces exactly what we want. This information is passed into the model in the form of a generation prompt. So how does the model understand the text?

When it comes to text-to-image generation, we can have the best of the generative and natural language processing worlds and successfully apply the solutions from one domain to the other. The area of language models has gained momentum rapidly in recent years. An increasing number of solutions allow for efficient embedding of text in an abstract space, which later enables efficient processing of the text at hand.

The idea of a text encoder is quite simple – we wish to take the textual input and represent it efficiently in a space that would allow the model to generate an image based on the prompt. That process is called text guidance – each time the diffusion model goes one step further in the image generation process, it is reminded of what the image was supposed to present. Currently, the choice of the model that transfers the text into a latent space seems paramount for accurate and high-quality generation. Many of the current state-of-the-art solutions utilize different architectures for text embedding. For instance, Imagen [13] architecture uses the T5-XXL [14], while Stable Diffusion uses the CLIP ViT-L/14 model [15] to get the job done.

CLIP model. Source: Authors’ own elaboration

CLIP [16] was trained with millions of (text, image) pairs and can accurately map the text to the image and vice versa. This makes it a very natural choice for the embedder, and it has already been proven to perform well, e.g., in DALL·E. The model itself is pre-trained and fixed in Stable Diffusion, although there are ways to interfere with the embeddings – we will talk about those later in this post.
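
To illustrate what the text encoder produces, the snippet below (a sketch using the Hugging Face transformers library and the publicly available CLIP ViT-L/14 checkpoint; the prompt and variable names are our own) turns a prompt into the per-token embeddings that later guide the U-Net:

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP ViT-L/14 is the text encoder used by Stable Diffusion v1.x.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a watercolor painting of a lighthouse at dawn"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    # One 768-dimensional embedding per token; this is what conditions the U-Net.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(text_embeddings.shape)  # torch.Size([1, 77, 768])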

U-Net type Neural Network

The heart of the solution – this part of the architecture takes the embedded text and the noised latent and tries to reverse the diffusion process in a way that produces an image as similar to the input prompt as possible. For anyone who follows the evolution of neural networks, the way this is done may not be a great surprise. The weapon of choice is the U-Net architecture, which had already been used in many diffusion architectures before Stable Diffusion.

Overview of U-Net architecture. Source: Authors’ own elaboration

The denoising process happens step-by-step – each ResNet block receives the information about the denoising step and the text embedding is processed by the attention mechanism. A neural network aims to predict the noise that was added to the image during a forward pass of diffusion – that’s why the information about the step is important. During inference, the network tries to produce an image from a randomly sampled latent. The authors opted for the already well-established solution incorporating classifier-free guidance [17] to make sure that the image output is as close to the input text as possible.
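
A minimal sketch of the classifier-free guidance step (our own simplification; the unet function is a placeholder for the actual denoising network) could look like this:

import torch

def guided_noise_prediction(unet, latents, t, text_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: run the U-Net with and without the prompt and
    push the prediction towards the text-conditioned direction."""
    noise_uncond = unet(latents, t, uncond_emb)   # prediction for an empty prompt
    noise_text = unet(latents, t, text_emb)       # prediction for the user prompt
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)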

Image Autoencoder

The diffusion process itself is not very lightweight. Multiplying the height and width of the image, the number of color channels, and the number of noising steps in the process already leads to numerically heavy calculations. The authors of Stable Diffusion decided to address that by transferring the main part of the diffusion process into the latent space. Essentially the idea is to perform denoising not on the actual image in pixel dimensions, but rather to do it on a latent representation of the image, which is compressed to yield fewer dimensions. Historically there were already a few ways of performing this type of compression – the authors opted for an already well-known autoencoder approach.

Example of the Stable Diffusion inference process. Source: Authors’ own elaboration

During training, the input image is processed by the encoder and represented as a latent several times smaller than the original; e.g., a 3x512x512 image is represented as a 4x64x64 tensor. This means that, in the architecture, all models work with a compressed representation of inputs that retains only the most valuable information. After the reverse diffusion process happens in the U-Net, the tensor with all the necessary information for image generation is translated into the actual picture by the decoder.
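
The sketch below (using the diffusers library and the publicly released Stable Diffusion v1.5 VAE weights; the dummy image and the 0.18215 scaling factor follow the commonly used convention) shows the encode/decode round trip between pixel space and latent space:

import torch
from diffusers import AutoencoderKL

# The VAE shipped with Stable Diffusion v1.5 (weights downloaded from the Hub).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)          # a dummy RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # SD scaling factor
    print(latents.shape)                      # torch.Size([1, 4, 64, 64])
    decoded = vae.decode(latents / 0.18215).sample
    print(decoded.shape)                      # torch.Size([1, 3, 512, 512])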

To sum up, the flow seems easy enough – we type the text, and the encoder translates the text into a concise embedded form. The main model reverses the diffusion process of random noise in a step-by-step fashion, at the same time conditionally guiding the generation using the text. The last step is to take the model output and generate an image – that is the image decoder’s job. Our goal was to explain the interior of this architecture as simply as possible, but it needs to be underlined that actual model usage and navigating through the different parameters and settings is far more difficult! Worry not, as we will be helping with that as well.

Newer is better? v2.x vs v1.x

November 2022 brought another iteration of the Stable Diffusion architecture – Stable Diffusion 2.0 [18]. Two weeks later, in December, Stability AI published the most recent stable version of the flagship model to date – version 2.1 [19]. Just like its predecessor, it is available in the form of a demo [20]. There were a couple of major tweaks compared to the 1.5 version.

The new text encoder and dataset choice seem to be the largest of the changes. This time a newer-generation OpenCLIP-ViT/H [21] model was trained to handle the task of text embedding. Just like its predecessor, CLIP ViT-L/14, its weights are open-sourced. The dataset used for the OpenCLIP training is open and includes several changes that drastically altered the way the model works with prompts. An additional NSFW filter was included to filter out images that could lead to misuse across the internet, mostly related to child abuse and pornography. The dataset also contains a noticeably smaller collection of celebrity and artistic images.

Comparison of generation for the prompt: “a studio photograph of Robert Downey Jr., cinematic lighting, hyperdetailed, 8 k realistic, global illumination, radiant light, frostbite 3 engine, cryengine, trending on artstation, digital art” for SDv1.5 and SDv2.0. Source: [22]

The curation of the data seemed to be serving its purpose, but users reported severe performance degradations for some of the generation tasks, one of the most notable of which was the ability to generate people. This issue was partially fixed in the 2.1 version after loosening the NSFW filter; it was turned off in the last steps of the model’s training, which led to fewer false positives being removed from the dataset.

Additionally, negative prompting seems to be more important for accurate generation in the newer versions of Stable Diffusion. Negative prompts are appended to the regular generation prompt and convey what should not appear in the generated picture. This was largely optional for the first version of the architecture, but it seems to be a must in the 2.0 and 2.1 versions in order to get high-quality results.

Example of the importance of negative prompting in the 2.1 version. The pictures on both the left and right were generated with the same prompt, but there was an additional negative prompt added on the right to reduce the greenery in the image. Source: [19]
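
For illustration, this is roughly how a negative prompt is passed when generating with the diffusers library (the prompts and model choice are our own example, not the ones used for the figure above):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a cozy cabin in the mountains, golden hour, highly detailed",
    negative_prompt="greenery, trees, blurry, low quality",  # what should NOT appear
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("cabin.png")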

The experiments that will be presented in the upcoming posts of this series were performed on the 1.5 version of Stable Diffusion. That was a conscious choice; it is widely reported that before the change of the dataset and encoder, the model offered the most flexibility in terms of usage and experimentation, which we found essential.

Training tools

Although generative models offer endless possibilities, their domain knowledge can be limited. Often, the generation process relies on interpolating between images seen during training. While general pre-trained models are versatile, specific use cases may require additional training of the model. Since the release of Stable Diffusion as open-source software, several techniques have emerged to extend the knowledge of pre-trained models, collectively known as fine-tuning.

Several methods have been developed to generate specific objects, styles, or compositions with minimal data requirements. These techniques have pushed the boundaries of generative art even further. In the following chapters, we describe some established fine-tuning methods, some of which we used in our experiments.

DreamBooth

This is one of the most popular methods, designed mainly to extend the model’s knowledge to include specific objects, but it is also possible to introduce styles (e.g. to incorporate a new artist). It works great for placing the face of a particular person in the model’s domain and more. It was also this method that we used for our internal Christmas card generator project – feel free to check it out [23].

Graph presenting the DreamBooth method. Source: Authors’ own elaboration

When it comes to the technicalities, DreamBooth [4] is quite close to the traditional definition of fine-tuning.

In its basic form, it relies on freezing all weights of the architecture except for the U-Net model, which runs the denoising process. Additional training of the model is based on iteratively showing the network sample images of, e.g., the object of interest, along with a prompt containing a new, unique identifier to symbolize it, e.g., “A photo of a <identifier> dog”. The idea is to embed the knowledge about an object within the model weights and force it into the embedding layer. In the publication [4] the authors present a method for selecting a rare token to link with the identifier. The importance of choosing rare tokens in this operation is not to be neglected – otherwise the model might lose some important information in the process.

Additionally, to combat overfitting and language drift, some regularization images with similar prompts are also shown during fine-tuning to ensure that the network does not forget the original meaning of the remaining tokens – this idea is called prior-preservation loss.
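
A hedged sketch of how the two loss terms are typically combined (our own simplification; the noise predictions and targets are assumed to come from the instance images and from class images generated by the frozen model, respectively):

import torch.nn.functional as F

def dreambooth_loss(noise_pred, noise_target, prior_pred, prior_target,
                    prior_loss_weight=1.0):
    """DreamBooth-style objective sketch: the usual denoising loss on the instance
    images ("a photo of a <identifier> dog") plus a prior-preservation term
    computed on class images ("a photo of a dog") produced by the frozen model."""
    instance_loss = F.mse_loss(noise_pred, noise_target)
    prior_loss = F.mse_loss(prior_pred, prior_target)
    return instance_loss + prior_loss_weight * prior_loss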

It is worth mentioning that the DreamBooth implementation provided by Hugging Face [24] allows you to unfreeze the weights of the text encoder as well, which can further improve the results. On the other hand, it requires so much computing power that only a small percentage of users can launch these tools on their local machines. DreamBooth itself reigns supreme in terms of popularity among all the methods listed in this post, but it comes with a price – it is the most computationally expensive. Even with the most basic setup, including numerous memory-efficient optimizations, this way of fine-tuning needs a GPU with 12GB of VRAM to perform training without major complications.

As of today, there are already hundreds of different concepts in the concepts library setup on Hugging Face – we strongly recommend checking them out [25].

LoRA

LoRA [5] stands for Low-Rank Adaptation. It is not a diffusion-specific concept, as it was published in 2021 and targeted the problem of fine-tuning large language models such as GPT-3 [26]. These models, with billions of parameters, require a lot of storage, and altering the whole architecture each time can be computationally expensive. Generally speaking, models are defined by parameters, which are stored in matrices. As the dimensions of the model increase, these matrices can grow very rapidly, which makes them heavy.

Graph presenting the LoRA method. Source: Authors’ own elaboration

In a nutshell, LoRA is a technique used to reduce the number of parameters that need to be tweaked to fine-tune the model. Instead of working with the whole matrices of parameters, the weight updates are expressed as a low-rank decomposition, which makes them orders of magnitude smaller. On top of that, this technique does not need to be applied everywhere in the model. In the context of diffusion architectures – such as Stable Diffusion – the attention layers in the U-Net network are specifically targeted, as they directly link the textual semantics with the generative ability of the model.

This approach yields several advantages:

  • Most of the pre-trained weights are frozen and unaffected by this operation, which should ensure that the generative power of the model remains the same after fine-tuning,
  • Sharing the tweaked model is much easier as the changes boil down to a lightweight diff-like file,
  • Fewer parameters to train means faster results; the method also works with just a handful of training images.
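
To make the decomposition concrete, here is a minimal PyTorch sketch (our own illustration, with an arbitrary rank and scaling) of a linear layer extended with a trainable low-rank update while the original weights stay frozen:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W plus a trainable low-rank update B @ A (rank r)."""
    def __init__(self, base_layer: nn.Linear, rank=4, alpha=4.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad_(False)               # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base_layer.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_layer.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # W x + scaling * B (A x); at initialization B is zero, so behavior is unchanged.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=4)
out = layer(torch.randn(2, 77, 768))
print(out.shape)  # torch.Size([2, 77, 768])

Sharing the fine-tuned model then boils down to distributing the small A and B matrices for each adapted layer.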

Textual Inversion

Another powerful yet simple technique that can be used for adding new concepts to the model is called textual inversion [6]. It works by directly interfering with the text embedding layer of the architecture. The idea is similar to that of DreamBooth – we wish to give the model some information about a certain concept, potentially a new object or style, while maintaining the existing information.

Inverting a visual idea of a specific cat into a <S∗> token. Source: [6]

This procedure boils down to adding a new token to the vocabulary, for instance <custom_woman_token>, and initializing its embedding to be the same as that of the already existing <woman> token. Next, by providing a set of images corresponding to that new token, the idea is to fine-tune the embedding rather than the model, which remains frozen. We want to find an optimal embedding so that the characteristics of the object are well captured. The advantage of such an approach is that the results are lightweight and easy to share – the embedding tensors take up only a few kilobytes.

Graph presenting the textual inversion method. Source: Authors’ own elaboration
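
A rough sketch of the token-adding step using the Hugging Face transformers library (the token names follow the example above; the actual optimization loop over the provided images is omitted):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the new placeholder token and grow the embedding matrix accordingly.
tokenizer.add_tokens(["<custom_woman_token>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialize the new embedding from an existing, semantically close token.
embeddings = text_encoder.get_input_embeddings().weight
new_id = tokenizer.convert_tokens_to_ids("<custom_woman_token>")
init_id = tokenizer.encode("woman", add_special_tokens=False)[0]
with torch.no_grad():
    embeddings[new_id] = embeddings[init_id].clone()

# During fine-tuning only this single embedding row would be optimized;
# all other weights of the text encoder and the diffusion model stay frozen.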

Hugging Face offers a Space [27] that showcases different new concepts that have been introduced into the embedding space of Stable Diffusion, such as <midjourney-style> or <nebula>.

A lot of room for exploration

The use of different training techniques and ways to customize models is an area which is developing much faster than the traditional architecture research related to diffusion models. Here we have covered just the three most popular ways to fine-tune the model, but developers are constantly testing new ways to interact with the model, such as Hypernetworks training [28]. On top of that, there are ways to combine several methods for an even more powerful effect. For instance, one could combine Textual Inversion with LoRA, first by tuning the new embedding and then tuning the diffusion mechanism using images related to the newly embedded object.

From a practical perspective, there is no one-size-fits-all method; as usual, each comes with a certain trade-off. DreamBooth seems to yield great results, but it is computationally and spatially expensive. Textual Inversion is highly lightweight, but it is limited to concepts that can be expressed within the model’s existing embedding space. LoRA sits in the middle, combining a bit of both – not as lightweight, but it does interact with the model’s weights directly.

Summary

In this article, our goal was to arm the reader with the knowledge necessary to understand where the diffusion models are right now in the context of text-to-image generation. We also wanted to share the overview of popular methods related to the training itself, as it is a vital part of today’s development.

“But which method should be used? And how?” – these are the questions that we want to address in the next posts from this series. We will be looking a lot closer at the practical side of things – there will be training, optimization, validation, and more – stay tuned!

References

  1. https://deepsense.ai/blog/
  2. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al. 2015
  3. High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al. 2022
  4. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, Ruiz et al. 2022
  5. LoRA: Low-Rank Adaptation Of Large Language Models, Hu et. al. 2021
  6. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, Gal et al. 2022
  7. Zero-Shot Text-to-Image Generation, Ramesh et al. 2021
  8. Hierarchical Text-Conditional Image Generation with CLIP Latents, Ramesh et al. 2022
  9. https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
  10. https://hbr.org/2022/11/how-generative-ai-is-changing-creative-work
  11. Denoising Diffusion Probabilistic Models, Ho et al. 2020
  12. The recent rise of diffusion-based models, deepsense.ai, 2022
  13. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, Saharia et al. 2022
  14. How Much Knowledge Can You Pack Into the Parameters of a Language Model?, Roberts et al. 2020
  15. https://huggingface.co/sentence-transformers/clip-ViT-L-14
  16. Learning Transferable Visual Models From Natural Language Supervision, Radford et al. 2021
  17. Classifier-Free Diffusion Guidance, Ho et al. 2021
  18. https://stability.ai/blog/stable-diffusion-v2-release
  19. https://stability.ai/blog/stablediffusion2-1-release7-dec-2022
  20. https://huggingface.co/spaces/stabilityai/stable-diffusion
  21. https://github.com/mlfoundations/open_clip
  22. https://www.assemblyai.com/blog/stable-diffusion-1-vs-2-what-you-need-to-know/
  23. https://deepsense.ai/portfolio-item/christmas-card-generator
  24. https://github.com/huggingface/diffusers/tree/main/examples/dreambooth
  25. https://huggingface.co/sd-dreambooth-library
  26. Language Models are Few-Shot Learners, Brown et al. 2020
  27. https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer
  28. HyperNetworks, Ha et al. 2016

Report: The diverse landscape of large language models. From the original Transformer to GPT-4 and beyond

March 22, 2023/in Generative AI /by Artur Zygadlo

ChatGPT – what is the buzz all about?

March 10, 2023/in Generative AI /by Eryk Mazuś and Maciej Domagała

Over the last few months, ChatGPT has generated a great deal of excitement. Some have gone as far as to suggest it is a giant step in developing AI that will overtake humanity in many important areas, both in business and social life. Others view it more as a distraction on the path towards achieving human-level intelligence. How did ChatGPT generate such hype? In this article, we’ll try to explain.

How did we get here?

Recent advances in natural language processing can be viewed as a progression toward more flexible and general systems. We can see various ideas flowing through the field of NLP development. A few years ago, around 2013-14, the main approach to NLP tasks was to use word embeddings, which are vectors that represent the meaning of words. This was the standard approach in the vast majority of language-related tasks, such as text classification, in which embeddings were first obtained either through training or by downloading pre-trained vectors from public sources, and then fed into a task-specific architecture. This approach necessitated the creation of a task-specific, labeled dataset on the one hand, and a task-specific architecture of the model itself on the other. Not only did this require a significant amount of effort, but the performance of such an approach was limited by the representational capabilities of the input embeddings. Word embeddings were unable to capture the meaning of words based on context (words surrounding them) or the semantics of the entire text.

Figure 1: NLP Timeline

Since 2015, researchers have been experimenting with the idea of semi-supervised pre-training of LSTM [1] and Transformer-based language models on large corpora of text, followed by supervised fine-tuning for specific tasks on a much smaller dataset. BERT [2] and GPT-1 [3] are two examples of such approaches. Such methods eliminated the need for task-specific models, resulting in architectures that outperformed existing solutions to many difficult NLP tasks. Even though a task-specific dataset and fine-tuning were still required, this was a significant improvement.

The scarcity of large enough datasets for some tasks, the effort required to create them, and the lack of generalization of fine-tuned models outside the training distribution prompted the development of a new, human-like paradigm in which all that is required is a short natural language description of the task that the model is asked to perform, with an optional, tiny number of demonstrations added to the instruction. GPT-2 [4], GPT-3 [5], and other generative language models described in the following section represent this paradigm.

GPT: applications and architecture

GPT is an abbreviation of Generative Pre-trained Transformer. It is generative in the sense that it can generate text given an input. It is pre-trained because it has already been trained on a large corpus of text. Finally, it is a neural network architecture that is based on the Transformer [6].

A GPT generates text in response to a text input, called a prompt. It is a simple but versatile framework, as many problems can be converted to text-to-text tasks. On the one hand, GPT can be asked to perform standard NLP tasks such as summarizing/classifying a text passage, answering questions about a given piece of text, or extracting named entities from it. On the other hand, due to its generative nature, GPT is an ideal tool for creative applications. It can create a story based on a brief premise, hold a conversation, or… write a blog post. Furthermore, if trained on a corpus of code, such a model could perform code generation, editing, and explanation tasks, such as generating Python docstrings, generating git commit messages, translating natural language to SQL queries, or even translating code from one programming language to another.

Modern language models, such as OpenAI’s GPT-3, Google’s LaMDA [7], and DeepMind’s Gopher [8], are essentially GPT implementations. They are much more powerful than the original GPT-1, mostly because of their size – for instance, the largest GPT-3 variant has 175 billion parameters – and because they were pre-trained on massive amounts of text; in the case of GPT-3, it was hundreds of billions of words.

Figure 2: Number of parameters and the release date of Transformer-based models. GPT-like models are highlighted in red. Source: [9]

The GPT and GPT-like models are actually autoregressive language models that predict the next word in a sequence. After predicting the next word, it is appended to the initial sequence and fed back into the model to predict the subsequent one. The procedure is repeated until the model outputs a stop token or reaches the user-specified maximum length of the output sequence.
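
The loop can be sketched in a few lines of Python (our own simplification; the "model" function is a placeholder that returns next-token logits for a given sequence):

import torch

def generate(model, token_ids, max_new_tokens=50, stop_token_id=None):
    """Greedy autoregressive decoding sketch: repeatedly predict the next token,
    append it to the sequence and feed the sequence back into the model."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)              # shape: (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        token_ids = torch.cat([token_ids, next_id], dim=1)       # append and repeat
        if stop_token_id is not None and next_id.item() == stop_token_id:
            break
    return token_ids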

From a technical standpoint, the model is a decoder-only variant of a Transformer model, consisting of a stack of Transformer blocks followed by a linear layer and a softmax that predict the probability of each word in the model’s vocabulary being the next token in the sequence. Each Transformer block is composed of a Multi-Head Causal Self-Attention layer, a linear layer, layer normalizations, and residual connections. This architecture can be thought of as a “general-purpose differentiable computer” that is both efficient (transformers enable high parallelism of computation) and optimizable (via backpropagation) [10].
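
A compact PyTorch sketch of such a decoder block (our own simplified illustration; the dimensions are chosen arbitrarily) is shown below:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """A GPT-style Transformer block: masked (causal) self-attention followed by
    an MLP, each wrapped with layer normalization and a residual connection."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x

block = DecoderBlock()
out = block(torch.randn(1, 16, 768))
print(out.shape)  # torch.Size([1, 16, 768])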

Figure 3: Decoder architecture underpinning the GPT-like models. Source: [11]

ChatGPT

The research community recently took a few steps forward in the development of language models. GPT-family models are trained to complete the input text rather than follow the user’s instructions. To make the models generate more sensible outputs in response to user instructions, as well as to make them more truthful and less toxic, the authors opted for the inclusion of human feedback in the process of training the model. This technique, called Reinforcement Learning from Human Feedback (RLHF), is so interesting that we decided to devote a whole blog post to describing it in detail – feel free to read more about it here!

Figure 4: Evolution from transformer architecture to ChatGPT

The application of this technique has resulted in new iterations of the models, such as InstructGPT [12] and ChatGPT [13]. The latter was the subject of massive attention from the public, even outside of the AI world itself. ChatGPT created a stir in the media, mostly because of its availability and API that allows everyone to use it directly [14].

With just a couple of commands, ChatGPT can prove its ability to interact with a human by producing a well-tailored resume, playing a game of chess, or writing a part of compilable code. It also acts as an information distiller, providing a comprehensive yet concise summary of a given subject.

OpenAI recently enabled ChatGPT API access under the name gpt-3.5-turbo. It’s a GPT-3.5 model optimized for chat that costs one-tenth the price of the best previously available model. More information on that can be found here.
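
As an illustration, a minimal call to the chat endpoint with the openai Python package (as available at the time of writing; the API key and messages are placeholders) looks roughly like this:

import openai

openai.api_key = "YOUR_API_KEY"  # replace with your own key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what RLHF is in two sentences."},
    ],
)
print(response["choices"][0]["message"]["content"])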

Future perspectives

Although such developments are clearly ground-breaking, there still seems to be a long way to go before they become the standard for general NLP purposes. Current studies show that even though the model is impressive given its do-it-all ability, it underperforms compared to existing state-of-the-art solutions for specific NLP tasks. For instance, in a recently published paper by J. Kocon et al., ChatGPT yielded worse results than the current best models in all of the 25 different NLP tasks tested in the publication [15]. Anyone who has used the model for a bit longer will also notice its limitations, such as the fact that it lacks knowledge of recent events.

We are eager to observe further development in this area of AI. Ideas to make the model better and more versatile seem to be never-ending and the results are already looking very promising.

Bibliography

  1. Semi-supervised Sequence Learning, Andrew M. Dai, Quoc V. Le, 2015
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin et al., 2018
  3. Improving Language Understanding by Generative Pre-Training, Alec Radford et al., 2018
  4. Language Models are Unsupervised Multitask Learners, Alec Radford et al., 2019
  5. Language Models are Few-Shot Learners, Tom B. Brown et al., 2020
  6. Attention Is All You Need, Ashish Vaswani et al., 2017
  7. LaMDA blogpost, Eli Collins, Zoubin Ghahramani, 2021
  8. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, Jack W. Rae, 2022
  9. Transformer Models: an Introduction and Catalog, Xavier Amatriain, 2023
  10. https://twitter.com/karpathy/status/1582807367988654081
  11. GPT in 60 Lines of NumPy, Jay Mody, 2023
  12. Training language models to follow instructions with human feedback, Long Ouyang et al., 2022
  13. https://openai.com/blog/chatgpt/
  14. https://chat.openai.com/chat
  15. ChatGPT: Jack of all trades, master of none, Jan Kocon et al., 2023

How to leverage ChatGPT to boost marketing strategy?

February 26, 2023/in Generative AI /by Ewa Szkudlarek

The revolution in marketing is happening before our very eyes. The latest developments in the area of generative models mark a milestone where artificial intelligence and human expertise have come together like never before, and the use of AI in marketing is no longer just a buzzword. With ChatGPT and other large language models, marketers will be able to harness the power of AI in an easy way.

Since its launch in November 2022, the potential of using ChatGPT in business has been widely discussed. Marketing seems like the perfect area to test this technology, as it offers a number of use cases that are low-hanging fruit. In this article, we will explore the most effective areas in which to leverage ChatGPT in a marketing strategy that can quickly bring noticeable business value.

More human-like chatbots
The first association for most marketers is the use of ChatGPT to create chatbots that will more naturally interact with customers in real time. There are many possibilities, and here we are not only talking about answering simple questions, but about complex conversations, at the level of a virtual assistant. Certainly, such solutions will also help to examine the individual needs of customers and precisely match the offer to their requirements, which will ensure a more personalized experience.

Large language model technology offers the possibility of deep advanced interaction, which can support a competitive advantage. It will not only broaden the use of intelligent chatbots in customer service, but could also be a great opportunity for a completely new approach to, for example, the range of medical or educational services offered.

Hyper-personalized customer service
Over the years, marketing has become more and more data-driven, but solutions based on large language models allow marketers to fully maximize the potential of data. By analyzing customers’ needs, behavior, and interactions with the brand, the company has a chance to fully respond to their interests. This can help increase customer loyalty and drive revenue growth.

ChatGPT and other large language models also support customer service automation and improve response times – they enable the customer to receive a response and an offer at a chosen time. Customers get the information they need quickly and efficiently, without having to wait for a human agent.

Winning content creation
Marketers can use ChatGPT to generate engaging content ideas, blog post outlines and up-to-date insights that are relevant to their brand. This can help streamline the content creation process and ensure consistency in messaging and brand voice. By inputting a few keywords related to their industry or niche, ChatGPT can provide a list of topics that can be used to create blog posts, social media updates, and other forms of content. This can help to establish the brand as a thought leader in their field and provide content that is perfectly suited to specific marketing objectives and target audiences.

Almost real-time optimization
The American merchant John Wanamaker, a pioneer of advertising, used to say that “half the money I spend on advertising is wasted, the problem is I don’t know which half” 😉 A thorough analysis of advertising expenditure is the key to the effectiveness of marketing strategy, and since Wanamaker’s time, a lot has changed. The possibilities of large language models broaden the approach to optimizing advertising. ChatGPT can help revise marketing campaigns by analyzing performance data, reviewing customer sentiment, and providing insights on areas for improvement. This can notably increase conversion rates, reduce customer acquisition costs, and improve overall ROI.

Market research
In a rapidly changing business reality, converting ChatGPT capabilities into the most useful business use cases can determine a competitive advantage. By using language models, marketers can easily analyze large amounts of text data to extract valuable insights about customers, competitors, and market trends. This information can be used to develop marketing strategy and tactics in areas such as product positioning, messaging, and channel selection.

On an everyday basis, ChatGPT can also help market researchers to gain a deeper understanding of customer feedback and social media conversations, which can provide proof of the actual image of the brand in the eyes of customers.

Sounds promising, but where to start?
The dynamic development of artificial intelligence requires marketers with a deep understanding of new technologies in order to be able to capture new opportunities to win customers’ attention. This sounds like simply stating the obvious! However, in practice, it turns out that the lack of technological know-how excludes many marketers from using AI. In order not to lag behind, it is worth supporting the marketing expertise with the knowledge of a technology partner such as deepsense.ai.

At deepsense.ai, the overriding goal of cooperation with clients is not to deliver the technology solution itself, but above all to provide the client with real business value. That’s why most of our projects start with ideation sessions and discovery workshops, where we introduce our clients to the possibilities of ChatGPT and other large language models and jointly analyze the most attractive use cases. Then deepsense.ai’s teams of AI engineers will perform end-to-end deployment and customization of selected AI solutions and put them on the fast track to delivering value. Close cooperation with deepsense.ai allows the marketing departments to maximize the potential of state-of-the-art technologies and focus on the industry-related aspects of building a competitive advantage.


How can we improve language models using reinforcement learning? ChatGPT case study

February 20, 2023/in Generative AI /by Kinga Prusinkiewicz

ChatGPT is a cutting-edge natural language processing model released in November 2022 by OpenAI. It is a variant of the GPT-3 model, specifically designed for chatbot and conversational AI applications. On the rising tide of ChatGPT, there are plenty of amazing examples of the chatbot’s accomplishments, one of which is presented in Figure 1.

Figure 1: Example usage of ChatGPT to analyze worst-case time complexity of bubble sorting in the specified style. Source: https://twitter.com/goodside/status/1598129631609380864

Introduction of GPT models

Let’s start with a short introduction to the GPT model family. The acronym refers to a series of models (GPT, GPT-2 and GPT-3, with the next generations expected soon) trained to process and generate human-like language, which have achieved impressive results in various language tasks such as translation, summarization, and question answering. GPT has been trained on a massive dataset of text, and it uses this training data to learn patterns and relationships in language. This allows it to understand and generate language in a way that is similar to how humans do. GPT is a powerful tool for developers looking to create articles, poetry, stories, news, reports and dialogue. It can be fine-tuned for specific tasks or domains, allowing it to become even more effective at handling specific types of language tasks.

What makes ChatGPT different from classic GPT models is its incorporation of human feedback during training using reinforcement learning. In this post we will dive into the details of RLHF (Reinforcement Learning from Human Feedback) and how we can use it to fine-tune language models. It is worth noting that the idea was previously used by the OpenAI team in InstructGPT – a sibling model which was trained to follow an instruction in a prompt and provide a detailed response.

What is reinforcement learning?

Reinforcement learning is a machine learning area which aims to train models to make a sequence of decisions. The agent learns by interacting with the (usually complex) environment. Each action is associated with a reward (or penalty). The aim of the agent is to learn which actions will maximize the total reward.

The typical reinforcement learning setup consists of a tuple of five elements:

  • State Space (\(S\)) – a set of possible states that an agent can visit.
  • Action Space (\(A\)) – a set of possible actions that an agent may take.
  • State Transition Probability (\(P\)) – describes the dynamics of the environment. It is also called the world model. For model-free reinforcement learning, it is not necessary to know the state transition probability.
  • Reward Function (\(R\)) – a reward (penalty) that an agent receives for a selected action made in a specific state.
  • Discount Factor (\(\gamma\)) – defines the present value of future rewards.

A reinforcement learning agent learns a policy (\(\pi\)), which defines the action that should be taken in the current state.

RL has a wide range of applications, including control systems and robotics. It is particularly useful for tasks that involve sequential decision-making or learning from experience, such as playing Go or Atari games.
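
A bare-bones sketch of the interaction loop described by this tuple (our own illustration; env and policy are placeholders for any environment and policy exposing this interface):

def run_episode(env, policy, gamma=0.99):
    """Sketch of one episode of agent-environment interaction. `env` and `policy`
    are placeholders for any environment and policy with this interface."""
    state = env.reset()
    total_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # pick an action in the current state
        state, reward, done = env.step(action)   # environment returns reward and next state
        total_return += discount * reward        # accumulate the discounted reward
        discount *= gamma
    return total_return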

Reinforcement learning from human feedback

The history of incorporating human feedback into reinforcement learning is very long. There have been plenty of ideas on how we can integrate human-based samples into the agent training process, for example by adjusting the algorithm itself or with reward shaping. We would like to focus a little bit more on the approach presented in “Deep Reinforcement Learning from Human Preferences” published by DeepMind in 2017.

A typical reinforcement learning training loop involves an agent who interacts with the environment and changes states. Each interaction is connected to a reward. The whole process is presented in Figure 2. The reward function has a huge impact on agent performance. If poorly designed, it results in poor agent performance as well. In the paper, the authors propose to learn the reward function from human feedback, while the agent is still training the same way as in the classical reinforcement learning task.

Figure 2: Classic reinforcement learning training loop. Source: own elaboration.

The best way to understand how the agents are trained is to go through the example provided by the authors in the video.

Figure 3: Experiment video screenshot. The human coordinator selected the left agent, as its behavior is more similar to a backflip, which was the goal. Source: https://www.youtube.com/watch?v=oC7Cw3fu3gU

The task was to teach the agent how to do a backflip. Two trajectories produced by the current policy were shown to a human, who decided which one did the better backflip (or at least made the better attempt at one). Based on this preference, the reward estimator is updated so that the favored behavior receives a higher reward. The agent is then trained in the classical reinforcement learning manner. In other words, the training loop is enriched with one additional step. The new loop is presented in Figure 4.

Figure 4: Reinforcement learning from human feedback training loop. Source: own elaboration.

To sum up, the new process consists of three steps:

  1. Generating a set of trajectories \(\{\tau^{1}, …, \tau^{n}\}\), with learned policy. The parameters of the policy are learned via traditional reinforcement learning to maximize total reward. Policy can be learned using any suitable reinforcement learning algorithm.
  2. Selecting two segments \((\sigma^{1}, \sigma^{2})\) from the generated trajectories and letting the human compare them and rank which one did better. Human judgments are stored as a tuple \((\sigma^{1}, \sigma^{2}, \mu)\), where \(\mu\) is the distribution of which segment was preferred.
  3. Training the reward predictor using supervised learning techniques. To estimate the reward predictor, we need a way to express which behavior is preferred, which can be achieved via the Bradley-Terry model. The simplest example of how this model works is ranking football teams in a competition: since the number of matches played may not be equal for all teams, we can introduce a model that compares the “strength” of teams to obtain the probability of one team beating another. We can apply the same idea to trajectories:

$$
\widehat{P}[\sigma^{1} \succ \sigma^{2}] = \frac{\exp\left(\sum_{t}\widehat{r}(\sigma^{1}_{t}, a^{1}_{t})\right)}{\exp\left(\sum_{t}\widehat{r}(\sigma^{1}_{t}, a^{1}_{t})\right) + \exp\left(\sum_{t}\widehat{r}(\sigma^{2}_{t}, a^{2}_{t})\right)}
$$

Therefore we can write the loss function as the cross-entropy between these predicted preference probabilities and the human labels:

$$
\text{loss}(\widehat{r}) = -\sum_{(\sigma^{1}, \sigma^{2}, \mu)} \left[ \mu(1)\log\widehat{P}[\sigma^{1} \succ \sigma^{2}] + \mu(2)\log\widehat{P}[\sigma^{2} \succ \sigma^{1}] \right]
$$
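
As an illustration, here is a minimal sketch of how this preference probability and the cross-entropy loss could be computed for a single comparison; it assumes PyTorch and a hypothetical learnable reward network `reward_hat` that maps (state, action) features to scalar rewards.

```python
import torch

# A minimal sketch of the Bradley-Terry preference loss for one comparison.
# `reward_hat`, the segment tensors and `mu` are illustrative assumptions.

def preference_loss(reward_hat, seg1, seg2, mu):
    """seg1, seg2: (T, feature_dim) tensors of (state, action) features.
    mu: tensor [mu(1), mu(2)] holding the human preference distribution."""
    r1 = reward_hat(seg1).sum()   # total predicted reward of segment 1
    r2 = reward_hat(seg2).sum()   # total predicted reward of segment 2
    # Bradley-Terry probability that segment 1 is preferred over segment 2
    p1 = torch.exp(r1) / (torch.exp(r1) + torch.exp(r2))
    p2 = 1.0 - p1
    # cross-entropy between the predicted preference and the human label
    return -(mu[0] * torch.log(p1) + mu[1] * torch.log(p2))
```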

Now, as we are equipped with reinforcement learning from human feedback knowledge, we can take a deep dive into the ChatGPT example.

ChatGPT/InstructGPT cases

ChatGPT and InstructGPT use reinforcement learning from human feedback in the model fine-tuning phase. We can split it into the three stages presented in Figure 5.

Figure 5: ChatGPT fine-tuning steps. Source: https://openai.com/blog/chatgpt/

Step 1

The first step involves fine-tuning GPT-3.5 using data delivered by humans playing the role of assistant and user. The trainers had access to model-written suggestions to help with composing responses. These dialogues were mixed with the InstructGPT dataset, which contains prompts and instructions written by users of earlier versions of InstructGPT submitted through Playground. Regarding InstructGPT, the data collection step is limited to obtaining and using the InstructGPT dataset and fine-tuning the GPT-3 model. This step is summarized in Figure 6.

Figure 6: Language model pretraining. Source: https://huggingface.co/blog/rlhf

The next steps remain the same for both ChatGPT and InstructGPT.

Step 2

The second step focuses on training the reward model. The language model from the first step is used to generate samples of responses, which are compared and ranked by humans to express their preferences. According to the InstructGPT paper, a labeler receives between 4 and 9 responses to rank, which means there are \(\binom{K}{2}\) comparisons, where K is the number of responses to compare. Each set of comparisons is fed to a neural network that learns to evaluate generated responses in terms of human preferences.

Figure 7: Reward model training. Source: https://huggingface.co/blog/rlhf
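
To give a flavour of this step, the sketch below expands a labeler’s ranking of K responses into all \(\binom{K}{2}\) pairwise comparisons and scores them with a pairwise ranking loss in the spirit of the InstructGPT reward model; `reward_model` and the response objects are hypothetical placeholders.

```python
import itertools
import torch
import torch.nn.functional as F

# A minimal sketch of reward model training: every pair drawn from a labeler's
# ranking becomes one comparison. `reward_model` is an illustrative stand-in
# that returns a scalar score for a (prompt, response) pair.

def reward_model_loss(reward_model, prompt, ranked_responses):
    """ranked_responses is ordered from most to least preferred by the labeler."""
    losses = []
    for better, worse in itertools.combinations(ranked_responses, 2):
        r_better = reward_model(prompt, better)
        r_worse = reward_model(prompt, worse)
        # the preferred response should receive the higher score
        losses.append(-F.logsigmoid(r_better - r_worse))
    return torch.stack(losses).mean()
```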

Step 3

The last step utilizes the prepared elements in one reinforcement learning task to fine-tune the language model. Let’s formulate the task to fit reinforcement learning language:

  • The agent is represented by a language model.
  • State space is the possible input token sequences.
  • The action space is all the tokens corresponding to the vocabulary of the language model.
  • The reward from the environment is delivered by the reward predictor trained in step 2.

The algorithm used in ChatGPT is PPO, which is short for Proximal Policy Optimization – a state-of-the-art technique in the reinforcement learning area. A Kullback-Leibler divergence term between the initial model and the current policy distributions is added to the PPO loss to prevent the policy from moving substantially away from the initial model.

Figure 8: Fine-tuning with Reinforcement Learning. Source: https://huggingface.co/blog/rlhf
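
A minimal sketch of how such a KL-shaped reward could look is shown below; the names `reward_score`, `policy_logits`, `initial_logits` and the penalty weight `beta` are illustrative assumptions rather than the exact ChatGPT implementation.

```python
import torch.nn.functional as F

# A minimal sketch of shaping the PPO reward with a KL penalty that keeps the
# fine-tuned policy close to the initial (pre-RL) language model.

def kl_shaped_reward(reward_score, policy_logits, initial_logits, beta=0.02):
    """reward_score: scalar score from the reward model for the full response.
    policy_logits / initial_logits: (seq_len, vocab) logits over generated tokens."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    initial_logp = F.log_softmax(initial_logits, dim=-1)
    # KL(current policy || initial model), per token, summed over the sequence
    kl = (policy_logp.exp() * (policy_logp - initial_logp)).sum(dim=-1).sum()
    return reward_score - beta * kl
```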

Summary

Using human feedback as the reward signal has several advantages. It allows the model to learn from real-world human preferences and expectations, making it more likely to generate responses that are natural and human-like. It also allows the model to learn more quickly and efficiently, since it can use the feedback it receives to fine-tune its output and avoid making the same mistakes in the future.

However, there are also some limitations to this approach. The feedback may be subjective and prone to bias, which could affect the model’s learning process. Additionally, it can be time-consuming and resource-intensive to collect and process large amounts of human feedback, especially if the model is generating a large number of responses.

Bibliography

  • “Deep reinforcement learning from human preferences” Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, https://arxiv.org/abs/1706.03741
  • https://openai.com/blog/chatgpt/
  • https://openai.com/blog/instruction-following/
  • https://huggingface.co/blog/rlhf
  • https://openai.com/research/learning-from-human-preferences
  • https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1–VmlldzoyODk5MTIx
The recent rise of diffusion-based models

September 5, 2022/in Generative AI /by Maciej Domagała

Every fan of generative modeling has been living an absolute dream for the last year and a half (at least!). The past few months have brought several developments and papers on text-to-image generation, each one arguably better than the last. We have observed a social media surge of spectacular, purely AI-generated images, such as this golden retriever answering tough questions on the campaign trail or a brain riding a rocketship to the moon.

Sources: https://openai.com/dall-e-2/ and https://imagen.research.google/

In this post, we will sum up the very recent history of solving the text-to-image generation problem and explain the latest developments regarding diffusion models, which are playing a huge role in the new, state-of-the-art architectures.

A short timeline of image generation and text-to-image solutions.

It all starts with DALL·E

In 2020 the OpenAI team [1] published the GPT-3 model – a huge, do-it-all language model capable of machine translation, text generation, semantic analysis and more. The model swiftly became regarded as the state of the art for language modeling solutions, and DALL·E [7] can be viewed as a natural expansion of the transformer capabilities into the computer vision domain.

Autoregressive approach

The authors proposed an elegant two-stage approach:

  • train a discrete VAE model to compress images into image tokens,
  • concatenate the encoded text snippet with the image tokens and train the autoregressive transformer to learn the joint distribution over text and images.

The final version was trained on 250 million text-image pairs obtained from the Internet.

CLIP

During inference, the model is able to output a whole batch of generated images. But how can we estimate which images are best? Simultaneously with the publication of DALL·E, the OpenAI team presented a solution for image and text linking called CLIP [9]. In a nutshell, CLIP offers a reliable way of pairing a text snippet with its image representation. Putting aside all of the technical aspects, the idea of training this type of model is fairly simple – take the text snippet and encode it, take an image and encode it. Do that for a lot of examples (400 million (image, text) pairs) and train the model in a contrastive fashion.

Visualisation of CLIP contrastive pre-training, source: [9]
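
For intuition, below is a minimal sketch of such contrastive training for one batch of paired embeddings, assuming PyTorch; the image and text encoders that produce `img_emb` and `txt_emb` are not shown and are assumed to exist.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of a CLIP-style symmetric contrastive loss. The encoders
# producing the embeddings are assumed; only the pairing objective is shown.

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of matching image/text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(img_emb.size(0))        # the diagonal holds true pairs
    # match each image to its text and each text to its image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```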

This kind of mapping allows us to estimate which of the generated images are the best match considering the text input. For anyone who would like to see the power of CLIP – feel free to check out my previous post on combining CLIP and evolutionary algorithms to generate images [deepsense.ai’s blogpost].

DALL·E attracted major attention from people both inside and outside the AI world; it gained lots of publicity and stirred a great deal of conversation. Even so, it only gets an honorable mention here, as the trends shifted quite quickly towards novel ideas.

All you need is diffusion

Back in 2015, Sohl-Dickstein et al. [2] proposed a fresh idea on the subject of image generation – diffusion models.

Generative models, source: [13]

The idea is inspired by non-equilibrium thermodynamics, although underneath it is packed with some interesting mathematical concepts. We can notice the already known concept of encoder-decoder structure here, but the underlying idea is a bit different than what we can observe in traditional variational autoencoders. To understand the basics of this model, we need to describe forward and reverse diffusion processes.

Forward image diffusion

This process can be described as gradually applying Gaussian noise to the image until it becomes entirely unrecognizable. This process is fixed in a stochastic sense – the noise application procedure can be formulated as the Markov chain of sequential diffusion steps. To untangle the difficult wording a little bit, we can neatly describe it with a few formulas. Assume that images have a certain starting distribution \(q\left(\bf{x}_{0}\right)\). We can sample just one image from this distribution – \(\bf{x}_{0}\). We want to perform a chain of diffusion steps \(\bf{x}_{0} \rightarrow \bf{x}_{1} \rightarrow … \rightarrow \bf{x}_{\it{T}}\), each step disintegrating the image more and more.

How exactly is the noise applied? It is formally defined by a noising schedule \(\{\beta_{t}\}^{T}_{t=1}\), where for every \(t = 1,…,T\) we have \(\beta_{t} \in (0,1)\). With such a schedule we can formally define the forward process as

$$
q\left(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right)=\mathcal{N}\left(\sqrt{1-\beta_{t}} \mathbf{x}_{t-1}, \beta_{t} \mathbf{I}\right)
$$

There are just two more things worth mentioning:

  • As the number of noising steps increases \((T \to \infty)\), the final distribution \(q(\mathbf{x}_{T})\) approaches a very handy isotropic Gaussian distribution. This makes any future sampling from noised distribution efficient and easy.
  • Noising with a Gaussian kernel provides another benefit – there is no need to go step-by-step through the noising process to reach any intermediate latent state. Thanks to the reparametrization, we can sample any latent state directly:
    $$
    q\left(\mathbf{x}_{t} \mid \mathbf{x}_{0}\right)=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0},\left(1-\bar{\alpha}_{t}\right) \mathbf{I}\right) = \sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \cdot \epsilon,
    $$
    where \(\alpha_{t} := 1-\beta_{t}\), \(\bar{\alpha}_{t} := \prod_{k=1}^{t}\alpha_{k}\) and \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\). Here \(\epsilon\) represents Gaussian noise – this formulation will be essential for training.
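
As a quick illustration of this closed-form sampling, here is a minimal sketch assuming PyTorch and an illustrative linear noising schedule (the schedule values are arbitrary choices, not taken from any particular paper):

```python
import torch

# A minimal sketch of jumping straight to step t of the forward diffusion.
# The linear beta schedule below is an illustrative choice.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Noise a clean image x0 to diffusion step t in a single shot."""
    if noise is None:
        noise = torch.randn_like(x0)             # epsilon ~ N(0, I)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```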

Reverse image diffusion

We have a nicely defined forward process. One might ask – so what? Why can’t we just define a reverse process \(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)\) and trace back from the noise to the image? First of all, that would fail conceptually, as we want to have a neural network that learns how to deal with a problem – we shouldn’t provide it with a clear solution. And second of all, we cannot quite do that, as it would require marginalization over the entire data distribution. To get back to the starting distribution \(q(\bf{x}_{0})\) from the noised sample we would have to marginalize over all of the ways we could arrive at \(\mathbf{x}_{0}\) from the noise, including all of the latent states. That means calculating \(\int q(\mathbf{x}_{0:T})d\mathbf{x}_{1:T}\), which is intractable. So, if we cannot calculate it, surely we can… approximate it!

The core idea is to develop a reliable solution – in the form of a learnable network – that successfully approximates the reverse diffusion process. The first way to achieve that is by estimating the mean and covariance for denoising steps

$$
p_{\theta}\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}\right)=\mathcal{N}(\mu_{\theta}(\mathbf{x}_{t}, t), \Sigma_{\theta}(\mathbf{x}_{t}, t) ).
$$

In a practical sense, \(\mu_{\theta}(\mathbf{x}_{t}, t)\) can be estimated via the neural network and \(\Sigma_{\theta}(\mathbf{x}_{t}, t)\) can be fixed to a certain constant related to the noising schedule, such as \(\beta_{t}\mathbf{I}\).

Forward and reverse diffusion processes, source: [14]

Estimating \(\mu_{\theta}(\mathbf{x}_{t}, t)\) this way is possible, but Ho et al. [3] came up with a different way of training – a neural network \(\epsilon_{\theta}(\mathbf{x}_{t}, t)\) can be trained to predict the noise \(\epsilon\) from the earlier formulation of \(q\left(\mathbf{x}_{t} \mid \mathbf{x}_{0}\right)\).

As in Ho et al. [3], the training process consists of the following steps:

  1. Sample image \(\mathbf{x}_{0}\sim q(\bf{x}_{0})\),
  2. Choose a certain step in the diffusion process \(t \sim U(\{1,2,…,T\})\),
  3. Apply the noising \(\epsilon \sim \mathcal{N}(0,\mathbf{I})\),
  4. Try to estimate the noise \(\epsilon_{\theta}(\mathbf{x}_{t}, t)= \epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}} \mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}} \cdot \epsilon, t)\),
  5. Learn the network by gradient descent on loss \(\nabla_{\theta} \|\epsilon - \epsilon_{\theta}(\mathbf{x}_{t}, t)\|^{2}\).

In general, loss can be nicely presented as

$$
L_{\text{diffusion}}=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{x}_{t}, t\right)\right\|^{2}\right],
$$

where \(t, \mathbf{x}_0\) and \(\epsilon\) are described as in the steps above.
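
Putting the five steps together, a single training iteration could look roughly like the sketch below; it reuses the `q_sample` helper and schedule from the previous snippet, and `eps_model` is a hypothetical noise-prediction network.

```python
import torch

# A minimal sketch of one DDPM training step, following the five steps above.
# `eps_model(x_t, t)` is an illustrative noise-prediction network; `q_sample`
# and `T` come from the previous snippet.

def ddpm_training_step(eps_model, optimizer, x0):
    t = torch.randint(0, T, (1,)).item()     # 2. pick a random diffusion step
    noise = torch.randn_like(x0)             # 3. epsilon ~ N(0, I)
    x_t = q_sample(x0, t, noise)             #    noised image at step t
    pred = eps_model(x_t, t)                 # 4. try to recover the noise
    loss = ((noise - pred) ** 2).mean()      # 5. simple diffusion loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```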

All of the formulations, reparametrizations and derivations are a bit math-extensive, but there are already some great resources available for anyone who wants a deeper understanding of the subject. Most notably, Lilian Weng [13], Angus Turner [14] and Ayan Das [15] went through some deep derivations while maintaining an understandable tone – I highly recommend checking these posts.

Guiding the diffusion

The above part already explains how we can perceive the diffusion model as generative. Once the model \(\epsilon_{\theta}(\mathbf{x}_{t}, t)\) is trained, we can use it to run the noise \(\mathbf{x}_{t}\) back to \(\mathbf{x}_{0}\). Given that it is straightforward to sample the noise from an isotropic Gaussian distribution, we can obtain limitless image variations. We can also guide the image generation by feeding additional information to the network during the training process. Assuming that the images are labeled, the information about class \(y\) can be fed into a class-conditional diffusion model \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\). One way of introducing guidance into the training process is to train a separate model which acts as a classifier of noisy images. At each denoising step, the classifier checks whether the image is being denoised in the right direction and contributes its own loss gradient to the overall loss of the diffusion model.

Ho & Salimans [5] proposed an idea on how to feed the class information into the model without the need to train an additional classifier. During the training the model \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) is sometimes (with fixed probability) ne n shown the actual class \(y\). Instead, the class label is replaced with the null label \(\emptyset\). So it learns to perform diffusion with and without the guidance. For inference, the model performs two predictions, once given the class label \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) and once not \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid \emptyset)\). The final prediction of the model is moved away from \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid \emptyset)\) and towards \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\) by scaling with guidance scale \(s \geq 1\).

$$
\hat{\epsilon}_{\theta}\left(\mathbf{x}_{t}, t \mid y\right)=\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid \emptyset\right)+s \cdot\left(\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid y\right)-\epsilon_{\theta}\left(\mathbf{x}_{t}, t \mid \emptyset\right)\right)
$$

This kind of classifier-free guidance uses only the main model’s comprehension – an additional classifier is not needed – which yields better results according to Nichol et al. [6].
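
In code, classifier-free guidance at sampling time can be sketched as below; `eps_model` accepting an optional class label and the `NULL_LABEL` placeholder are illustrative assumptions.

```python
# A minimal sketch of classifier-free guidance at inference: the prediction is
# pushed away from the unconditional estimate and towards the conditional one.
# `eps_model` and NULL_LABEL are hypothetical placeholders.

NULL_LABEL = None

def guided_eps(eps_model, x_t, t, y, guidance_scale=3.0):
    eps_uncond = eps_model(x_t, t, NULL_LABEL)   # epsilon(x_t, t | empty label)
    eps_cond = eps_model(x_t, t, y)              # epsilon(x_t, t | y)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```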

Text-guided diffusion with GLIDE

Even though the paper describing GLIDE [6] architecture received the least publicity out of all the publications discussed in this post, it arguably presents the most novel and interesting ideas. It combines all of the concepts presented in the previous chapter nicely. We already know how diffusion models work and that we can use them to generate images. The two questions we would now like to answer are:

  • How can we use the textual information to guide the diffusion model?
  • How can we make sure that the quality of the model is good enough?

Architecture choice

Architecture can be boiled down to three main components:

  1. A UNet-based model responsible for the visual part of the diffusion learning,
  2. A transformer-based model responsible for creating text embedding from a snippet of text,
  3. An upsampling diffusion model responsible for enhancing the output image resolution.

The first two work together in order to create a text-guided image output, while the last one is used to enlarge the image while preserving the quality.

The core of the model is the well-known UNet architecture, used for the diffusion in Dhariwal & Nichol [8]. The model, just like in its early versions, stacks residual layers with downsampling and upsampling convolutions. It also contains attention layers, which are crucial for simultaneous text processing. The model proposed by the authors has around 2.3 billion parameters and was trained on the same dataset as DALL·E.

The text used for guidance is encoded in tokens and fed into the Transformer model. The model used in GLIDE had roughly 1.2 billion parameters and was built from 24 residual blocks of width 2048. The output of the transformer has two purposes:

  • the final embedding token is used as class embedding \(y\) in \(\epsilon_{\theta}(\mathbf{x}_{t}, t \mid y)\),
  • the final layer of token embeddings is added to every attention layer of the model.

It is clear that a great deal of focus was put into making sure that the model receives enough text-related context in order to generate accurate images: the model is conditioned on the text snippet embedding, the encoded text is concatenated with the attention context, and classifier-free guidance is used during training.

As for the final component, the authors used the diffusion model to go from a low-resolution to a high-resolution image using an ImageNet upsampler.

GLIDE interpretation of ‘a corgi in a field’, source: [6]

GLIDE incorporates a few notable achievements developed in recent years and sheds new light on the concept of text-guided image generation. Given that the DALL·E model was based on different structures, it is fair to say that the publication of GLIDE represents the dawn of the diffusion-based text-to-image generation era.

The next version – DALL·E 2

The OpenAI team doesn’t seem to get much rest, as in April they took the Internet by storm with DALL·E 2 [11]. It takes elements from both predecessors: it relies heavily on CLIP [9], but a large part of the solution revolves around the GLIDE [6] architecture. DALL·E 2 has two main underlying components, called the prior and the decoder, which are able to produce image output when stacked together. The entire mechanism was named unCLIP, which may already spoil the mystery of what exactly is going on under the hood.

Visualization of DALL·E 2 two-stage mechanism. Source: [11]

The prior

The first stage is meant to convert the caption – a text snippet such as a “corgi playing a flame-throwing trumpet” – into text embedding. We obtain it using a frozen CLIP model.

After text embedding comes the fun part – we now want to obtain an image embedding, similar to the one which is obtained via the CLIP model. We want it to encapsulate all important information from the text embedding, as it will be used for image generation through diffusion. Well, isn’t that exactly what CLIP is for? If we want to find a respective image embedding for our input phrase, we can just look at what is close to our text embedding in the CLIP encoded space. One of the authors of DALL·E 2 [Aditya Ramesh, 2022] posted a nice explanation of why that solution fails and why the prior is needed – “An infinite number of images could be consistent with a given caption, so the outputs of the two encoders will not perfectly coincide. Hence, a separate prior model is needed to “translate” the text embedding into an image embedding that could plausibly match it”.

On top of that, the authors empirically checked the importance of the prior in the network. Passing both the image embedding produced by the prior and the text vastly outperforms generation using only the caption or caption with CLIP text embedding.

Samples generated conditioned on: caption, text embedding, and image embedding. Source: https://arxiv.org/pdf/2204.06125.pdf

The authors tested two model classes for the prior: the autoregressive model and the diffusion model. This post will cover only the diffusion prior, as it was deemed better performing than autoregressive, especially from a computational point of view. For the training of prior, a decoder-only Transformer model was chosen. It was trained by using a sequence of several inputs:

  • encoded text,
  • CLIP text embedding,
  • embedding for the diffusion timestep,
  • noised image embedding,

with the goal of outputting an unnoised image embedding \(z_{i}\). As opposed to the training approach proposed by Ho et al. [3] and covered in the previous sections, predicting the unnoised image embedding directly turned out to be a better fit than predicting the noise. So, remembering the previous formula for the diffusion loss in a guided model

$$
L_{\text{diffusion}}=\mathbb{E}_{t, \mathbf{x}_{0}, \epsilon}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{x}_{t}, t\mid y\right)\right\|^{2}\right],
$$

we can present the prior diffusion loss as

$$
L_{\text{prior:diffusion}}=\mathbb{E}_{t}\left[\left\|z_{i}-f_{\theta}\left({z}_{i}^{t}, t \mid y\right)\right\|^{2}\right],
$$

where \(f_{\theta}\) stands for the prior model, \({z}_{i}^{t}\) is the noised image embedding, \(t\) is the timestep and \(y\) is the caption used for guidance.
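
The corresponding objective is simple to express in code; the sketch below assumes a hypothetical `prior_model` and pre-computed embeddings, and is only meant to highlight that the target is the clean embedding rather than the noise.

```python
# A minimal sketch of the prior objective: predict the unnoised CLIP image
# embedding directly. `prior_model`, `z_image`, `z_noised` and
# `caption_features` are illustrative placeholders.

def prior_loss(prior_model, z_image, z_noised, t, caption_features):
    z_pred = prior_model(z_noised, t, caption_features)   # f_theta(z_i^t, t | y)
    return ((z_image - z_pred) ** 2).mean()
```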

The decoder

We covered the prior part of the unCLIP, which was meant to produce a model that is able to encapsulate all of the important information from the text into a CLIP-like image embedding. Now we want to use that image embedding to generate an actual visual output. This is when the name unCLIP unfolds itself – we are walking back from the image embedding to the image, the reverse of what happens when the CLIP image encoder is trained.

As the saying goes: “After one diffusion model it is time for another diffusion model!”. And this one we already know – it is GLIDE, although slightly modified. Only slightly, since the single major change is adding the additional CLIP image embedding (produced by the prior) to the vanilla GLIDE text encoder. After all, this is exactly what the prior was trained for – to provide information for the decoder. Guidance is used just as in regular GLIDE. To improve it, CLIP embeddings are set to \(\emptyset\) in 10% of cases and text captions \(y\) in 50% of cases.
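
The conditioning dropout mentioned above can be sketched in a few lines; the variable names are illustrative, and `None` simply plays the role of the null conditioning \(\emptyset\).

```python
import random

# A minimal sketch of the decoder's conditioning dropout: drop the CLIP image
# embedding 10% of the time and the text caption 50% of the time, so the model
# also learns to generate unconditionally (which enables guidance).

def drop_conditioning(clip_image_emb, text_tokens):
    if random.random() < 0.10:
        clip_image_emb = None   # null image embedding
    if random.random() < 0.50:
        text_tokens = None      # drop the caption
    return clip_image_emb, text_tokens
```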

Another thing that did not change is the idea of upsampling after the image generation. The output is tossed into additional diffusion-based models. This time two upsampling models are used (instead of one in the original GLIDE), one taking the image from 64×64 to 256×256 and the other further enhancing resolution up to 1024×1024.

Imagen that we can do it better

The Google Brain team decided not to be late to the party, as less than two months after the publication of DALL·E 2 they presented the fruits of their own labor – Imagen (Saharia et al. [12]).

Overview of Imagen architecture. Source: [12]

Imagen architecture seems to be oddly simple in its structure. A pretrained textual model is used to create the embeddings that are diffused into an image. Next, the resolution is increased via super-resolution diffusion models – the steps we already know from DALL·E 2. A lot of novelties are scattered in different bits of the architecture – a few in the model itself and several in the training process. Together, they offer a slight upgrade when compared to other solutions. Given the large portion of knowledge already served, we can explain this model via differences with previously described models:

Use a pretrained transformer instead of training it from scratch. This is viewed as the core improvement compared to OpenAI’s work. For everything regarding text embeddings, the GLIDE authors used a new, specifically trained transformer model. The Imagen authors used a pretrained, frozen T5-XXL model [4]. The idea is that this model has vastly more context regarding language processing than a model trained only on the image captions, and so is able to produce more valuable embeddings without the need to additionally fine-tune it.

Make the underlying neural network more efficient. An upgraded version of the neural network, called Efficient U-Net, was used as the backbone of the super-resolution diffusion models. It is said to be more memory-efficient and simpler than the previous version, and it converges faster as well. The changes were introduced mainly in the residual blocks and via additional scaling of the values inside the network. For anyone who enjoys digging deep into the details – the changes are well documented in Saharia et al. [12].

Use conditioning augmentation to enhance image fidelity. Since the solution can be viewed as a sequence of diffusion models, there is an argument to be made about enhancements in the areas where the models are linked. Ho et al. [10] presented a solution called conditioning augmentation. In simple terms, it is equivalent to applying various data augmentation techniques, such as a Gaussian blur, to a low-resolution image before it is fed into the super-resolution models.
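
As a rough illustration, conditioning augmentation could be as simple as blurring the low-resolution conditioning image before it reaches the super-resolution model; the kernel size and sigma range below are arbitrary illustrative choices, and torchvision is assumed to be available.

```python
import torch
import torchvision.transforms.functional as TF

# A minimal sketch of conditioning augmentation: corrupt the low-resolution
# conditioning image with a Gaussian blur of random strength. Parameters are
# illustrative, not taken from the Imagen paper.

def augment_low_res(low_res_image, max_sigma=3.0):
    """low_res_image: (C, H, W) tensor with values in [0, 1]."""
    sigma = float(torch.rand(1)) * max_sigma + 1e-3
    return TF.gaussian_blur(low_res_image, kernel_size=9, sigma=sigma)
```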

There are a few other techniques deemed crucial to a low FID score and high image fidelity (such as dynamic thresholding) – these are explained in detail in the source paper [12]. The core of the approach is already covered in the previous chapters.

Some Imagen generations with captions. Source: [12]

Is it the best yet?

As of writing this text, Google’s Imagen is considered to be state-of-the-art as far as text-to-image generation is concerned. But why exactly is that? How can we evaluate the models and compare them to each other?

The authors of Imagen opted for two means of evaluation. One is considered to be the current standard for text-to-image modeling, namely establishing a Fréchet inception distance score on a COCO validation dataset. The authors report (unsurprisingly) that Imagen shows a state-of-the-art performance, its zero-shot FID outperforming all other models, even those specifically trained on COCO.

Comparison of several models. Source: https://arxiv.org/pdf/2205.11487.pdf

A far more intriguing means of evaluation is a brand new proposal from the authors called DrawBench – a comprehensive and challenging set of prompts that support the evaluation and comparison of text-to-image models (source). It consists of 200 prompts divided into 11 categories, collected from e.g. DALL·E or Reddit. A list of the prompts with categories can be found in [17]. The evaluation was performed by 275 unbiased (sic!) raters, 25 for each category. Each rater was shown two non-cherry-picked and random sets of images generated by two different models (e.g. Imagen and DALL·E 2) and had to respond to two questions:

  1. Which set of images is of higher quality?
  2. Which set of images better represents the text caption?

These two questions are meant to address the two most important characteristics of a good text-to-image model: the quality of the images produced (fidelity) and how well it reflects the input text prompt (alignment). Each rater had three choices – to claim that one of the models performs better, or to call it a tie. Once again, there can be only one winner. Interestingly, the GLIDE model seems to perform slightly better than DALL·E 2, at least based on this curated dataset.

Imagen vs other models. Source: [12]

As expected, a large portion of the publication is devoted to the comparison between the images produced by Imagen and GLIDE/DALL·E – more can be found in Appendix E of [12].

The fun is far from over

As usual, with new architecture gaining recognition there is a surge of interesting publications and solutions emerging from the void. The pace of developments makes it nearly impossible to track every interesting publication. There are also a lot of interesting characteristics of the models to discover other than raw generative power, such as image inpainting, style transfer, and image editing.

Apart from the understandable excitement over a new era of generative models, there are some shortcomings embedded into the diffusion process structure, such as slow sampling speed compared to previous models [16].

Models comparison. Source: [16]

For anyone who likes to go deep into the minutiae of implementation, I highly recommend going through Phil Wang’s (@lucidrains on GitHub) repositories [20], which are a collaborative effort from many people to recreate the unpublished models in PyTorch.

For anyone who would like to admire some more examples of DALL·E 2’s generative power, I recommend checking the newly created subreddit with DALL·E 2 creations in [18]. It is moderated by people with OpenAI’s Lab access – feel free to join the waitlist [19] and have the opportunity to play with models yourself.

References

  1. Language Models are Few-Shot Learners Tom B. Brown et al. 2020
  2. Deep Unsupervised Learning using Nonequilibrium Thermodynamics Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli. 2015
  3. Denoising Diffusion Probabilistic Models Jonathan Ho, Ajay Jain, Pieter Abbeel. 2020
  4. How Much Knowledge Can You Pack Into the Parameters of a Language Model? Adam Roberts, Colin Raffel, Noam Shazeer. 2020
  5. Classifier-Free Diffusion Guidance Jonathan Ho, Tim Salimans. 2021
  6. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models Alex Nichol et al. 2021
  7. Zero-Shot Text-to-Image Generation Aditya Ramesh et al. 2021
  8. Diffusion Models Beat GANs on Image Synthesis Prafulla Dhariwal, Alex Nichol. 2021
  9. Learning Transferable Visual Models From Natural Language Supervision Alec Radford et al. 2021
  10. Cascaded Diffusion Models for High Fidelity Image Generation Jonathan Ho et al. 2021
  11. Hierarchical Text-Conditional Image Generation with CLIP Latents Aditya Ramesh et al. 2022
  12. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding Chitwan Saharia et al. 2022
  13. What are Diffusion Models? Lilian Weng. 2021
  14. Diffusion Models as a kind of VAE Angus Turner. 2021
  15. An introduction to Diffusion Probabilistic Models Ayan Das. 2021
  16. Improving Diffusion Models as an Alternative To GANs, Part 1 Arash Vahdat, Karsten Kreis. 2022
  17. DrawBench prompts Google Brain team. 2022
  18. DALL·E 2 subreddit Reddit. 2022
  19. OpenAI’s waitlist OpenAI team. 2022
  20. Phil Wang’s repositories Phil Wang. 2022
