Companies continuously generate a vast array of documents, ranging from legal, financial, and administrative records to internal knowledge bases detailing essential processes. As the amount of documentation grows, tracking and finding essential information becomes ever more challenging, and countless hours are wasted searching for specific details. In organizations where document management is not the primary focus, this inefficiency becomes a substantial, yet hard-to-measure, cost burden.
The challenge
The goal of the Niffler project was to develop an innovative solution enabling users to interactively explore and extract information from documents, or from extensive collections of them. By providing a user-friendly interface, the system aims to efficiently identify the sections of a document relevant to a user's query and construct accurate, comprehensive answers based on the extracted passages.
The solution
To address this challenge, we leveraged the latest advancements in large language models (LLMs). Our final solution employs OpenAI models: Davinci for text summarization, tagging, and citation, and Ada for text embedding. Because orchestrating raw LLM calls quickly becomes unwieldy, we use the LangChain library to compose the many API calls into coherent chains, keeping the system smooth and efficient. LangChain also facilitates the integration of different LLMs and makes it relatively easy to switch between them, allowing our system to adopt new models when necessary.
Finding relevant documents
To efficiently search for documents that might contain the information users are looking for, we first create an index of the documents. Using OpenAI's Davinci model, we summarize each document and then embed each summary with the OpenAI Ada model. This collection of embeddings, called a vector store, can be persisted efficiently in various vector databases, such as Pinecone or AtlasDB. When a user asks a question, we find the relevant documents by embedding the question with the Ada model and performing a similarity search over the vector store. This ensures that the most pertinent documents are identified and presented to the user.
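Conceptually, the similarity search amounts to ranking stored embeddings by cosine similarity to the query embedding. The sketch below uses toy three-dimensional vectors and invented document ids in place of real Ada embeddings; a vector database such as Pinecone performs the same ranking at scale:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector store": document ids paired with made-up embeddings.
# In the real system the vectors come from the Ada embedding model.
vector_store = [
    ("fridge-manual", [0.9, 0.1, 0.0]),
    ("hr-handbook",   [0.1, 0.8, 0.2]),
    ("tax-report",    [0.0, 0.2, 0.9]),
]

def search(query_embedding, store, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# A query embedding close to the fridge manual's vector retrieves it first.
print(search([0.85, 0.15, 0.05], vector_store))  # ['fridge-manual']
```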
Interactive document exploration
Once the relevant documents have been identified, we can begin an interactive exploration process through a chat-like interface. The original approach involves the following steps:
- The user asks a question.
- We combine the question with the chat history to create a condensed question that captures the context of the conversation using OpenAI's Davinci model.
- We ask OpenAI's Davinci model to answer the condensed question within the context of the entire document.
However, Davinci's limited context window makes this infeasible for long documents, so we adapt the process: we divide the document into segments, embed each segment with the Ada model, and perform a similarity search over the segments. The refined process looks like this:
- The user asks a question.
- We create a condensed question using the question and chat history, capturing the conversation context with OpenAI's Davinci model.
- We embed the condensed question and perform a similarity search over the document segments, identifying the most relevant parts (context) related to the question.
- We ask OpenAI's Davinci model to answer the condensed question using the context identified in step 3.
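The four steps above can be sketched as a single function. The `llm`, `embed`, and `search` arguments are hypothetical stand-ins for the Davinci completion call, the Ada embedding call, and the segment similarity search; the prompt wording is illustrative, and only the control flow is the point:

```python
# Sketch of the refined question-answering loop. The llm, embed, and
# search arguments are stand-ins for the Davinci completion call, the
# Ada embedding call, and the segment similarity search, respectively.

CONDENSE_PROMPT = (
    "Given the conversation so far:\n{history}\n"
    "Rephrase the follow-up question as a standalone question: {question}"
)
ANSWER_PROMPT = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def answer(question, chat_history, segments, llm, embed, search):
    # Steps 1-2: fold the chat history into a standalone question.
    condensed = llm(CONDENSE_PROMPT.format(
        history="\n".join(chat_history), question=question))
    # Step 3: retrieve the document segments most relevant to it.
    context = search(embed(condensed), segments)
    # Step 4: answer strictly within the retrieved context.
    return llm(ANSWER_PROMPT.format(
        context="\n".join(context), question=condensed))
```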
This approach allows users to explore documents interactively and efficiently locate the information they need.
Video: Our application in action: a user asks questions about a refrigerator and receives answers based on the uploaded user-manual PDF.
Large language models have been known to "hallucinate", which, in the context of document exploration, means they can generate answers that disregard the context provided. To ensure users can quickly validate whether an answer is actually present in the document, we use OpenAI's Davinci model to find citations. By submitting the user's question and the Davinci model's answer, we can locate the relevant citations within the document. This approach allows users to swiftly assess if the answer provided is based on a specific part of the document. If discrepancies exist between the answer and citation, it may indicate that the model has hallucinated.
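A cheap complementary safeguard is to verify that a quoted citation actually occurs verbatim in the document. This is a minimal sketch of that idea, not the Davinci-based citation step itself, and the manual snippet is invented for illustration:

```python
def citation_in_document(citation, document):
    """Check that a model-quoted citation occurs verbatim in the document
    (case- and whitespace-insensitive): a cheap guard against
    hallucinated quotes."""
    normalize = lambda s: " ".join(s.split()).lower()
    return normalize(citation) in normalize(document)

# Invented manual snippet for illustration.
manual = "To defrost the freezer, unplug the unit and leave the door open."
print(citation_in_document("Unplug the unit", manual))            # True
print(citation_in_document("press the defrost button", manual))   # False
```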
As previously mentioned, OpenAI's Davinci model has difficulty handling extremely long documents, so prompting it to summarize an entire document directly is not feasible. To overcome this limitation, we divide the document into smaller segments and run a summary prompt on each individual chunk of data. Following this, a separate prompt is executed to consolidate all the initial outputs, ultimately producing a coherent and comprehensive summary of the document.
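This map-reduce style of summarization can be sketched as follows. The sentence-based splitter is a naive placeholder for a real text splitter, and the `summarize` and `combine` callables stand in for the two Davinci prompts:

```python
def split_into_chunks(text, max_chars=1000):
    """Greedily pack whole sentences into chunks of at most max_chars
    (a single sentence longer than max_chars still forms its own chunk)."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

def summarize_document(text, summarize, combine, max_chars=1000):
    """Map step: summarize each chunk. Reduce step: combine the partial
    summaries into one. Both callables stand in for Davinci prompts."""
    partial_summaries = [summarize(chunk)
                         for chunk in split_into_chunks(text, max_chars)]
    return combine(partial_summaries)
```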
Through informal experimentation, we found that large language models produce better results when prompts include tags describing the subject of the document. To obtain these, we again use the OpenAI Davinci model: by asking it to identify the keywords in each document summary, we can fine-tune the prompts used throughout the system.
The effect
The Niffler system is adept at processing numerous documents quickly, streamlining the process of finding important information. Furthermore, it guides users to the specific location within the document where the extracted answer originated from, promoting a better understanding of the context. This makes the Niffler system a useful tool for efficiently navigating large amounts of digital information. By incorporating the LangChain library, the Niffler system can be easily extended and adapted to the constantly evolving needs of its users.