Best Practices for Technical Manuals for Onyx-HybridSearch¶

In this notebook, we will go through best practices for working with manuals containing images and tables. Working with non-multimodal LLMs, especially those lacking image augmentation capabilities, can be challenging. We will work through complex PDFs and simplify them using pymupdf and its RAG-optimised module pymupdf4llm.

We start by importing these packages. Additionally, we import the built-in regex module re.

In [28]:
import pymupdf4llm
import pymupdf
import re

from datetime import datetime as dt, timezone as tz

Taking an example file in a specific location, we can read this file into a valid pymupdf.Document object. This object is essentially a container of pymupdf.Page objects.

In [17]:
PDF_URL = "PDFs/bal_en_01-00.pdf" # set own URL here
pdf_doc = pymupdf.open(PDF_URL) 
type(pdf_doc) # expected type would be pymupdf.Document
Out[17]:
pymupdf.Document
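
Since a pymupdf.Document behaves like a sequence of its pages, you can also iterate over it directly. A trivial sketch:

In [ ]:
for page in pdf_doc:      # a Document is iterable over its pymupdf.Page objects
    if page.number >= 3:  # stop after the first three pages
        break
    print(page.number)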

Now we can cross-check whether our document was read properly. We can count the number of pages and verify the content of individual pages. Our test file contains 1048 pages.

In [18]:
len(pdf_doc)
Out[18]:
1048

Now we cross-check the content of the first page (i.e. the page at index 0).

In [20]:
print(pdf_doc[0].get_text())
en
Operator's manual
Tower crane
470 EC-B 16
59484
TC-OS Version 1.02
Tower system 24 HC 630 / 24 HC 420 / 500 HC
Undercarriage 24 HC 630 UC-0800
Foundation anchor 24 HC 630 FA
Foundation anchor 24 HC 630 FAr
This machine complies with the relevant guidelines and standards on the North American market.
www.liebherr.com

Hybrid Search¶

The content above is not optimised for hybrid search, as the information is not structured around relevant query terms. Additional information about foundation anchors etc. should not be the main focus of the search; it is enough that it is present in the document content.

For hybrid search, the following parameters are absolutely essential:

  • File name
  • Title
  • Document issuance date

These factors are taken into account when calculating the relevance score in the reranker, so we will optimise this document metadata here.
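
To make the target concrete, the metadata we will extract in the following sections is shaped roughly like this. The field names are illustrative assumptions, not the Onyx schema; the values come from the example manual processed below:

In [ ]:
# Illustrative target shape -- field names are assumptions, not the Onyx schema.
{
    "file_name": "bal_en_01-00.pdf",
    "title": "Operator's manual for Tower crane for 470 EC-B 16 for 59484",
    "doc_updated_at": "2022-11-28T00:00:00+00:00",
}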

Data Extraction¶

Extracting Title¶

For proper scoring, a title has to be as precise as possible and should contain as few abbreviations as possible, using proper natural-language syntax. The optimal format is <what> for <whom>.

In [7]:
def extract_manual_title(page: pymupdf.Page) -> str: 
    """Extract suitable `title` for a Liebherr Manual.
    :param page: a front-matter page of the manual
    :returns: title of type `str`
    """  
    text = page.get_text()
    document_type_pattern = r"Operator's manual"
    machine_type_pattern = r"Tower crane"
    model_number_pattern = r"\b\d{3} [A-Z]+-\w+ \d{2}\b"
    serial_number_pattern = r"\b\d{5}\b"
    
    # Search for patterns in the text
    document_type_match = re.search(document_type_pattern, text)
    machine_type_match = re.search(machine_type_pattern, text)
    model_number_match = re.search(model_number_pattern, text)
    serial_number_match = re.search(serial_number_pattern, text)
    
    # Extract matched strings
    document_type = document_type_match.group(0) if document_type_match else ""
    machine_type = machine_type_match.group(0) if machine_type_match else ""
    model_number = model_number_match.group(0) if model_number_match else ""
    serial_number = serial_number_match.group(0) if serial_number_match else ""
    
    # Construct the title
    title_parts = [part for part in [document_type, machine_type, model_number, serial_number] if part]
    title = " for ".join(title_parts)
    
    return title
In [12]:
extract_manual_title(pdf_doc[1])
Out[12]:
"Operator's manual for Tower crane for 470 EC-B 16 for 59484"

And from the big pile of text, we have a proper title that can be used.

Extracting Issuance Date¶

These manuals have a date of issuance, which will be crucial once newer manuals arrive in the future. This date will be used to determine DOCUMENT_DECAY_TIME, which introduces a negative bias in the reranking model, i.e. older documents are penalized by a set threshold in the reranker and are ranked lower than newer documents that exist for the same title.
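
To build some intuition for this decay, here is a purely illustrative penalty function. The name and the exact formula are assumptions; the real behaviour is controlled inside the reranker:

In [ ]:
# Purely illustrative -- the actual decay formula in the reranker may differ.
def illustrative_decay(score: float, doc_age_days: float, decay_days: float = 365.0) -> float:
    """Down-weight a relevance score as the document gets older."""
    return score / (1.0 + doc_age_days / decay_days)

illustrative_decay(0.9, doc_age_days=0)    # fresh manual keeps its full score: 0.9
illustrative_decay(0.9, doc_age_days=730)  # two-year-old manual is penalized: 0.3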

This timestamp needs to be in the UTC time zone.

In [31]:
def extract_timestamp(page: pymupdf.Page) -> str: 
    """Extract suitable `timestamp` for a Liebherr Manual.
    :param page: 0th index page
    :returns: timestamp of type `str`
    """ 
    text = page.get_text()
    date_pattern = r"Issued:\s*(\d{4}-\d{2}-\d{2})"
    date_match = re.search(date_pattern, text)
    timestamp = date_match.group(1) if date_match else None
    return dt.strptime(timestamp, "%Y-%m-%d").replace(tzinfo=tz.utc)
In [32]:
extract_timestamp(pdf_doc[1])
Out[32]:
datetime.datetime(2022, 11, 28, 0, 0, tzinfo=datetime.timezone.utc)

This timestamp can be used against doc_updated_at in Onyx indexing pipelines.

Extracting Metadata¶

Additionally, we can use metadata tags to simplify queries for users. For example, this document starts with en, which means the document language is English. We can pass this metadata as a tag to Onyx so users can use it to filter for English documents only.

In [43]:
def extract_metadata(page: pymupdf.Page) -> dict: 
    text = page.get_text()
    language_pattern = r"^(en|de|it)\b"  # Matches 'en', 'de', or 'it' at the start of the text, use more if you have more.
    tc_os_version_pattern = r"TC-OS Version (\d+\.\d{2})"
    language_match = re.search(language_pattern, text, re.MULTILINE)
    tc_os_version_match = re.search(tc_os_version_pattern, text)
    language = language_match.group(1) if language_match else None
    tc_os_version = tc_os_version_match.group(1) if tc_os_version_match else None
    return {
        "lang": language,
        "TC_OS_version": tc_os_version
    }
In [44]:
extract_metadata(pdf_doc[0])
Out[44]:
{'lang': 'en', 'TC_OS_version': '1.02'}

Now, these three extra parameters should already drastically improve retrieval performance and efficiency.
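
Putting it all together, a minimal sketch that collects the extracted fields into a single payload. Apart from doc_updated_at, the field names are assumptions and should be adapted to your Onyx indexing pipeline:

In [ ]:
def build_document_metadata(doc: pymupdf.Document, path: str) -> dict:
    """Collect the extracted fields into one payload.
    Field names other than `doc_updated_at` are illustrative.
    """
    return {
        "file_name": path.split("/")[-1],
        "title": extract_manual_title(doc[1]),        # cf. In [12]
        "doc_updated_at": extract_timestamp(doc[1]),  # cf. In [32]
        "tags": extract_metadata(doc[0]),             # cf. In [44]
    }

build_document_metadata(pdf_doc, PDF_URL)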

Extracting Tables and Images¶

Per page, you have the possibility to invoke page.find_tables() and iterate over the detected tables, turning each of them into Markdown using Table.to_markdown(). However, we will skip extracting images and tables manually and instead use the complete end-to-end pipeline provided by pymupdf4llm.
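
For reference, this is what that manual route could look like for a single page (we will not use it here):

In [ ]:
# Manual table extraction for a single page (for reference only).
page = pdf_doc[0]
tabs = page.find_tables()     # detect tables on this page
for tab in tabs.tables:       # `tables` holds the detected Table objects
    print(tab.to_markdown())  # render each table as Markdown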

In [ ]:
markdown_content = pymupdf4llm.to_markdown(PDF_URL, image_path="IMGs/", write_images=True)
Processing PDFs/bal_en_01-00.pdf...
[===                                     ] (  87/1048)

This will save all the images in the folder IMGs. A big advantage of having the page content in Markdown is that most NNs are pre-trained on data labelled in this format, e.g. the NNs used in the chunker (NLTK lemmatizers, punkt) and the encoders.

Important¶

This content will contain images in Markdown syntax in the format ![my image](IMGs/image_1.png), and you will have to prompt the Onyx RAG model to convert this into a proper URL according to how you plan to store the images.
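
Alternatively, you can rewrite the links yourself before indexing. A minimal sketch, assuming the images from IMGs/ are uploaded to a static host; the base URL below is hypothetical:

In [ ]:
# Hypothetical base URL -- replace with wherever you actually host the images.
IMAGE_BASE_URL = "https://example.com/manual-images"

def rewrite_image_links(md: str) -> str:
    """Point Markdown image links at the hosted copies instead of IMGs/."""
    return re.sub(r"!\[([^\]]*)\]\(IMGs/([^)]+)\)",
                  rf"![\1]({IMAGE_BASE_URL}/\2)",
                  md)

markdown_content = rewrite_image_links(markdown_content)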