The quality of a RAG system is bounded by the quality of its document ingestion. Garbage in, garbage out — if your PDF extractor misses a critical table, no amount of retrieval sophistication will recover that information. This post covers the full document processing stack: PDF extraction, HTML parsing, OCR, table detection, and the Unstructured library that ties them together.
The Document Processing Challenge
Documents in the real world are messy. A single enterprise document corpus might contain digitally-born PDFs, scanned documents, HTML pages, Word files, PowerPoint decks, and spreadsheets. Each format presents unique extraction challenges:
PDFs: No standard text extraction API. Layout information is encoded as absolute positions, not semantic structure. Two-column layouts, headers/footers, and watermarks create noise.
Scanned documents: Contain no extractable text at all. Require OCR with layout-aware processing to reconstruct reading order.
HTML: Contains semantic structure (headings, lists, tables) but also navigation, ads, and boilerplate that must be filtered.
Tables: Span all formats and are the hardest to extract. Cell merges, nested tables, and borderless layouts confuse most extraction tools.
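Because each format needs a different extractor, a pipeline usually starts by deciding what kind of file it is looking at. The sketch below guesses a coarse document family from the leading bytes of a file and uses the suffix only to tell ZIP-based Office formats apart. The function names `sniff_format` and `sniff_file` and the returned labels are illustrative assumptions, not part of any library discussed in this post.

```python
from pathlib import Path


def sniff_format(head: bytes, suffix: str = "") -> str:
    """Guess a coarse document family from leading bytes (hypothetical helper)."""
    stripped = head.lstrip()
    if stripped.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX, PPTX and XLSX are all ZIP containers; the suffix disambiguates.
        office = {".docx": "word", ".pptx": "powerpoint", ".xlsx": "spreadsheet"}
        return office.get(suffix.lower(), "zip")
    lowered = stripped[:256].lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    # PNG, JPEG and TIFF (little- and big-endian) signatures: likely a scan.
    if head.startswith((b"\x89PNG", b"\xff\xd8\xff", b"II*\x00", b"MM\x00*")):
        return "image"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first kilobyte of a file and classify it."""
    p = Path(path)
    with p.open("rb") as fh:
        return sniff_format(fh.read(1024), p.suffix)
```

A "pdf" result still says nothing about whether the PDF is born-digital or scanned; that has to be decided after attempting text extraction.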
PDF Extraction
PDF extraction is the most common and most frustrating document processing task. The PDF spec stores text as positioned glyphs, not as paragraphs or sentences. Extractors must reconstruct reading order from absolute coordinates, handle multiple columns, and separate body text from headers, footers, and marginalia.
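To make the reading-order problem concrete, here is a minimal sketch that orders text blocks on a two-column page using only their bounding boxes. It assumes blocks are dicts with a `bbox` of `(x0, y0, x1, y1)` in a top-left coordinate system, and that any block wider than 60% of the page spans both columns. Both the `reading_order` name and the `span_ratio` threshold are illustrative assumptions; real extractors use more robust column detection.

```python
def reading_order(blocks: list[dict], page_width: float, span_ratio: float = 0.6) -> list[dict]:
    """Order blocks as: full-width blocks in place, left column before right column."""
    mid = page_width / 2
    ordered, left, right = [], [], []

    def flush() -> None:
        # Emit the pending left column, then the pending right column.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    # Visiting blocks top to bottom keeps each column list sorted vertically.
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        x0, _, x1, _ = block["bbox"]
        if (x1 - x0) > span_ratio * page_width:
            # A full-width block (title, wide table) ends the columns above it.
            flush()
            ordered.append(block)
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered
```

Sorting purely by vertical position would interleave the two columns line by line, which is the failure mode seen with the simplest extractors.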
The main extraction libraries, ordered by sophistication:
pypdf (formerly PyPDF2)
Pure Python, zero dependencies. PyPDF2 is the older, deprecated name of the same project, so new code should import pypdf. Extracts raw text in rendering order. Fast, but produces poor results on complex layouts: columns get interleaved and tables lose structure. Best for simple, single-column PDFs.
pdfplumber
Built on pdfminer.six, adds layout analysis and table extraction. Can detect table boundaries and extract cells. Handles two-column layouts better, but it cannot read scanned PDFs without a separate OCR step and still struggles with complex merged cells.
PyMuPDF (fitz)
C-based binding for MuPDF. Extremely fast text extraction with block-level layout detection. Excellent for high-throughput processing. Returns text blocks with bounding boxes for post-processing.
Docling / LayoutParser
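Deep learning document analysis. These tools use object-detection models to find document regions (title, text, table, figure): LayoutParser wraps Detectron2-based detectors, while Docling ships its own layout and table-structure models. Highest quality but highest latency. Best for complex, varied documents.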
import fitz  # PyMuPDF
import pdfplumber
from pypdf import PdfReader


class PDFExtractor:
    def extract_with_pypdf(self, pdf_path: str) -> list[str]:
        # Basic extraction: fast but low quality on complex layouts
        reader = PdfReader(pdf_path)
        pages = []
        for page in reader.pages:
            text = page.extract_text()
            if text.strip():
                pages.append(text)
        return pages

    def extract_with_pymupdf(self, pdf_path: str) -> list[dict]:
        # Block-level extraction with bounding boxes
        results = []
        with fitz.open(pdf_path) as doc:
            for page_num, page in enumerate(doc):
                blocks = page.get_text("dict")["blocks"]
                for block in blocks:
                    if block["type"] != 0:  # 0 = text block, 1 = image block
                        continue
                    # Concatenate the spans of each line, one line per row
                    text = ""
                    for line in block["lines"]:
                        for span in line["spans"]:
                            text += span["text"]
                        text += "\n"
                    results.append({
                        "text": text.strip(),
                        "bbox": block["bbox"],
                        "page": page_num,
                    })
        return results

    def extract_with_pdfplumber(self, pdf_path: str) -> list[dict]:
        # Layout-aware extraction with table detection
        results = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract tables separately as lists of rows
                tables = page.extract_tables()
                # Full-page text; note that this still includes the text
                # inside table regions, so tables appear twice in the output
                text = page.extract_text() or ""
                results.append({"text": text, "tables": tables})
        return results
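Separating body text from headers and footers can often be approximated without any layout model. The sketch below works on a list of page strings, such as the output of extract_with_pypdf, and drops a first or last line when the same line recurs on most pages. Digits are masked so that "Page 1 of 3" and "Page 2 of 3" count as the same footer. The function name `strip_repeated_edges` and the 60% threshold are assumptions chosen for illustration.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    # Mask digits so running page numbers compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove first/last lines that repeat across pages (blank lines are dropped too)."""
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]
    counts: Counter = Counter()
    for lines in split_pages:
        if lines:
            # A set avoids double counting on one-line pages.
            counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and _normalize(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _normalize(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```

This only inspects one line at each edge of the page; multi-line headers or watermarks placed mid-page need position-based filtering on bounding boxes instead.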
HTML Processing
HTML documents present the opposite challenge from PDFs: they contain too much structure. Navigation bars, sidebars, footers, ads, and script tags must be stripped to isolate the main content. The goal is to extract semantic text while preserving heading hierarchy and list structure for chunking.
import re

import trafilatura
from bs4 import BeautifulSoup


class HTMLProcessor:
    NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

    def extract_with_bs4(self, html: str) -> dict:
        # Manual extraction with BeautifulSoup
        soup = BeautifulSoup(html, "html.parser")

        # Remove noise elements (headings inside <header> are removed too)
        for tag in soup.find_all(list(self.NOISE_TAGS)):
            tag.decompose()

        # Extract headings for structure
        headings = []
        for h in soup.find_all(["h1", "h2", "h3", "h4"]):
            headings.append({"level": int(h.name[1]), "text": h.get_text().strip()})

        # Extract main content, preferring semantic containers
        main = soup.find("main") or soup.find("article") or soup.body
        text = main.get_text(separator="\n", strip=True) if main else ""
        text = re.sub(r"\n{3,}", "\n\n", text)
        return {"text": text, "headings": headings}

    def extract_with_trafilatura(self, html: str) -> str:
        # Trafilatura: best general-purpose web content extractor
        result = trafilatura.extract(
            html,
            include_tables=True,
            include_links=False,
            include_images=False,
            output_format="txt",
        )
        # extract() returns None when no main content is found
        return result or ""
Recommendation: Use trafilatura as your default HTML extractor. It outperforms manual BeautifulSoup parsing on 90% of web pages and handles boilerplate removal, main content detection, and encoding issues automatically.
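Preserving heading hierarchy for chunking usually means attaching a breadcrumb to each section, so that a chunk under "Linux" still carries the context of its parent headings. The sketch below takes a list of heading dicts in the shape returned by extract_with_bs4 (`level` and `text` keys) and builds that path with a stack. The function name `heading_paths` and the " > " separator are illustrative choices.

```python
def heading_paths(headings: list[dict]) -> list[str]:
    """Return one breadcrumb string per heading, e.g. 'Guide > Install > Linux'."""
    stack: list[tuple[int, str]] = []  # (level, text) of the open ancestors
    paths = []
    for h in headings:
        # Close every heading at the same or a deeper level.
        while stack and stack[-1][0] >= h["level"]:
            stack.pop()
        stack.append((h["level"], h["text"]))
        paths.append(" > ".join(text for _, text in stack))
    return paths


if __name__ == "__main__":
    demo = [
        {"level": 1, "text": "Guide"},
        {"level": 2, "text": "Install"},
        {"level": 3, "text": "Linux"},
        {"level": 2, "text": "Usage"},
    ]
    for path in heading_paths(demo):
        print(path)
```

Prepending the breadcrumb to each chunk's text or storing it in chunk metadata are both common; which works better depends on the embedding model and is worth testing on your own corpus.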
OCR Pipeline
Scanned documents and image-based PDFs require Optical Character Recognition (OCR). Modern OCR is a multi-step pipeline: image preprocessing, text detection (where are the text regions?), text recognition (what do they say?), and layout reconstruction (what order should they be read?).
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance
import cv2
import numpy as np


class OCRPipeline:
    def preprocess_image(self, image_path: str) -> Image.Image:
        # Image preprocessing improves OCR accuracy significantly
        img = Image.open(image_path)
        # Convert to grayscale
        img = img.convert("L")
        # Increase contrast
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(2.0)
        # Denoise with median filter
        img = img.filter(ImageFilter.MedianFilter(size=3))
        # Binarize with adaptive threshold (block size 11, constant 2)
        img_array = np.array(img)
        binary = cv2.adaptiveThreshold(
            img_array, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2
        )
        return Image.fromarray(binary)

    def extract_text(self, image_path: str) -> str:
        # Basic Tesseract OCR
        processed = self.preprocess_image(image_path)
        text = pytesseract.image_to_string(processed, lang="eng")
        return text.strip()

    def extract_with_layout(self, image_path: str, line_tol: int = 10) -> list[dict]:
        # Extract text with bounding box information
        processed = self.preprocess_image(image_path)
        data = pytesseract.image_to_data(
            processed, output_type=pytesseract.Output.DICT
        )
        # Group words into lines using vertical proximity.
        # line_tol (pixels) is an assumed tolerance; tune it to the scan resolution.
        blocks = []
        current_block = None
        for i in range(len(data["text"])):
            word = data["text"][i].strip()
            # conf may arrive as an int, a float, or a numeric string
            if not word or float(data["conf"][i]) <= 60:  # Confidence threshold
                continue
            top = data["top"][i]
            if current_block is None or abs(top - current_block["top"]) > line_tol:
                if current_block:
                    blocks.append(current_block)
                current_block = {"text": word, "top": top}
            else:
                current_block["text"] += " " + word
        if current_block:
            blocks.append(current_block)
        return blocks
OCR accuracy: Tesseract achieves ~95% character accuracy on clean printed text but drops to 70–80% on noisy scans. For production workloads, consider cloud OCR services (Google Document AI, AWS Textract, Azure Document Intelligence) that use deep learning models trained on millions of documents.
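Character accuracy figures like these are typically computed as one minus the character error rate, where errors are the edit distance between the OCR output and a hand-checked reference transcription. The sketch below shows that calculation with a plain Levenshtein distance. The function names are illustrative, and evaluation conventions (whitespace handling, case folding) vary between benchmarks, so treat it as one reasonable definition rather than the one behind the numbers above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 - CER, clamped at 0 when the output has more errors than the reference has characters."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # One substituted character out of ten.
    print(character_accuracy("Q3 Revenue", "03 Revenue"))
```

Measuring this on a small labeled sample of your own scans is a cheap way to decide whether local OCR is good enough or a cloud service is worth the cost.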
Table Parsing
Tables require specialized extraction because their meaning depends entirely on the relationship between rows, columns, and headers. A cell containing "42.5" is meaningless without knowing it represents "Q3 Revenue in millions." Effective table parsing must preserve this relational structure.
import camelot
import pandas as pd


class TableParser:
    def extract_with_camelot(self, pdf_path: str, pages: str = "all"):
        # Lattice mode: for tables with visible grid lines
        lattice_tables = camelot.read_pdf(
            pdf_path, pages=pages, flavor="lattice"
        )
        # Stream mode: for borderless tables
        stream_tables = camelot.read_pdf(
            pdf_path, pages=pages, flavor="stream",
            edge_tol=50,  # Edge detection tolerance
        )
        # Use whichever mode has the higher total accuracy across its tables
        lattice_score = sum(t.accuracy for t in lattice_tables)
        stream_score = sum(t.accuracy for t in stream_tables)
        if lattice_score > stream_score:
            return [(t.df, t.accuracy) for t in lattice_tables]
        return [(t.df, t.accuracy) for t in stream_tables]

    def serialize_for_rag(self, df: pd.DataFrame, strategy: str = "row") -> str:
        # Convert DataFrame to RAG-friendly text.
        # Camelot DataFrames use integer column labels, so promote the header
        # row to df.columns before calling this method.
        if strategy == "row":
            # Each row becomes a natural language sentence
            headers = df.columns.tolist()
            sentences = []
            for _, row in df.iterrows():
                pairs = [f"{h} is {v}" for h, v in zip(headers, row) if v]
                sentences.append(", ".join(pairs) + ".")
            return "\n".join(sentences)
        elif strategy == "markdown":
            # Markdown table format (preserves structure; needs the tabulate package)
            return df.to_markdown(index=False)
        else:
            # JSON records (best for LLM consumption)
            return df.to_json(orient="records", indent=2)

    def detect_with_deep_learning(self, page_image):
        # Microsoft Table Transformer for complex layouts
        from transformers import (
            TableTransformerForDetection,
            AutoImageProcessor
        )
        processor = AutoImageProcessor.from_pretrained(
            "microsoft/table-transformer-detection"
        )
        model = TableTransformerForDetection.from_pretrained(
            "microsoft/table-transformer-detection"
        )
        inputs = processor(images=page_image, return_tensors="pt")
        outputs = model(**inputs)
        # Convert raw outputs to scores, labels, and pixel-space boxes.
        # target_sizes expects (height, width); PIL's .size is (width, height).
        # The 0.9 threshold is an assumed starting point, not a tuned value.
        return processor.post_process_object_detection(
            outputs, threshold=0.9, target_sizes=[page_image.size[::-1]]
        )[0]
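Merged cells are a common reason the header-to-value relationship breaks: a vertically merged label such as a region name typically appears only in the first row of its span, and the rows below come back empty. The sketch below works on a table as a list of rows (the shape pdfplumber's extract_tables produces for each table), fills blank label cells downward, and then pairs every value with its header. The function names and the choice to fill only column 0 by default are illustrative assumptions.

```python
def fill_merged_cells(rows: list[list], fill_columns: tuple[int, ...] = (0,)) -> list[list]:
    """Copy the last seen value into blank cells of the given label columns."""
    filled, last = [], {}
    for row in rows:
        row = list(row)  # avoid mutating the caller's data
        for col in fill_columns:
            if col < len(row):
                if row[col] in (None, ""):
                    row[col] = last.get(col, "")
                else:
                    last[col] = row[col]
        filled.append(row)
    return filled


def rows_to_records(rows: list[list], fill_columns: tuple[int, ...] = (0,)) -> list[dict]:
    """Treat the first row as the header and return one dict per data row."""
    header, *body = rows
    body = fill_merged_cells(body, fill_columns)
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    table = [
        ["Region", "Quarter", "Revenue"],
        ["EMEA", "Q1", "10.2"],
        [None, "Q2", "11.0"],  # "EMEA" was a merged cell spanning two rows
        ["APAC", "Q1", "7.5"],
    ]
    for record in rows_to_records(table):
        print(record)
```

Filling downward is only safe for label columns. Applying it to numeric columns would invent values for cells that are genuinely empty.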
The Unstructured Library
The unstructured library is the Swiss Army knife of document processing. It provides a unified API for extracting content from 20+ file formats, with automatic format detection, layout analysis, and element classification. It wraps the best extraction tools for each format and adds intelligent post-processing.
Key capabilities of unstructured:
Auto-detection: Identifies file type and routes to the appropriate extractor (no manual format switching).
Element classification: Categorizes extracted content as Title, NarrativeText, ListItem, Table, Image, etc.
Layout analysis: Uses detectron2 or yolox for visual layout detection on PDFs and images.
Chunking: Built-in chunking strategies that respect element boundaries: tables are kept in their own chunks, and an element is split only when it alone exceeds the maximum chunk size.
Metadata enrichment: Attaches page number, element type, coordinates, and parent element to each chunk.
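To show what boundary-respecting chunking means in practice, here is a deliberately simplified sketch over `(category, text)` pairs. It starts a new chunk at every Title, starts a new chunk when the next element would push the current one past `max_characters`, and gives each Table its own chunk. This is an illustration of the idea only, not the algorithm unstructured implements; the function name `chunk_by_headings` is hypothetical, and unlike the library this version never splits an oversized element.

```python
def chunk_by_headings(elements: list[tuple[str, str]], max_characters: int = 1500) -> list[str]:
    """Group elements into chunks without ever cutting through an element."""
    chunks: list[str] = []
    current: list[tuple[str, str]] = []

    def size(parts: list[tuple[str, str]]) -> int:
        # Text length plus the two-character separator between parts.
        return sum(len(text) for _, text in parts) + 2 * max(0, len(parts) - 1)

    def flush() -> None:
        if current:
            chunks.append("\n\n".join(text for _, text in current))
            current.clear()

    for category, text in elements:
        if category == "Table":
            # Tables are isolated so their structure is never mixed with prose.
            flush()
            chunks.append(text)
            continue
        if category == "Title" or size(current + [(category, text)]) > max_characters:
            flush()
        current.append((category, text))
    flush()
    return chunks
```

Compared with fixed-size character windows, this keeps each heading together with the text it introduces, which tends to make retrieved chunks easier for a model to interpret.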
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import elements_to_dicts


class UnstructuredPipeline:
    def process_document(self, file_path: str) -> list[dict]:
        # Auto-detect format and extract structured elements
        elements = partition(
            filename=file_path,
            strategy="hi_res",           # Use layout detection model
            infer_table_structure=True,  # Parse table HTML
            include_metadata=True,
        )
        # Convert each classified element to a plain dict
        processed = []
        for el in elements:
            processed.append({
                "type": type(el).__name__,
                "text": str(el),
                "metadata": el.metadata.to_dict(),
            })
        return processed

    def chunk_elements(self, file_path: str) -> list[dict]:
        # Extract and chunk in one pipeline
        elements = partition(filename=file_path, strategy="hi_res")
        # Chunk by title: groups elements under their heading
        chunks = chunk_by_title(
            elements,
            max_characters=1500,
            new_after_n_chars=1000,
            combine_text_under_n_chars=200,
        )
        return elements_to_dicts(chunks)

    def build_rag_documents(self, file_paths: list[str]):
        # Full pipeline: process, chunk, and prepare for vector store
        all_chunks = []
        for path in file_paths:
            chunks = self.chunk_elements(path)
            for chunk in chunks:
                all_chunks.append({
                    "content": chunk["text"],
                    "metadata": {
                        "source": path,
                        "element_type": chunk["type"],
                        "page_number": chunk.get("metadata", {}).get("page_number"),
                    }
                })
        return all_chunks
Production tip: Use strategy="hi_res" for the initial document processing pass (highest quality but slower), and strategy="fast" for incremental updates where documents have simple layouts. The quality difference is significant on PDFs with complex layouts.
Dependency note: The hi_res strategy requires detectron2 or yolox, which need a GPU for reasonable performance. For CPU-only deployments, use strategy="auto", which falls back to rule-based extraction when deep learning models are unavailable.
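The two notes above amount to a small decision rule, which can be written down as a function so the choice is made in one place. This is a sketch of that policy only: the function name `pick_strategy` and its three boolean inputs are hypothetical, and how you determine whether a document has a simple layout is left to your pipeline.

```python
def pick_strategy(initial_pass: bool, simple_layout: bool, has_gpu: bool) -> str:
    """Map the tips above to a value for partition's strategy argument."""
    if not has_gpu:
        # CPU-only deployment: let the library choose.
        return "auto"
    if initial_pass or not simple_layout:
        # First ingestion, or a complex layout: pay for the layout model.
        return "hi_res"
    # Incremental update of a simple document: favor speed.
    return "fast"


if __name__ == "__main__":
    print(pick_strategy(initial_pass=True, simple_layout=False, has_gpu=True))
    print(pick_strategy(initial_pass=False, simple_layout=True, has_gpu=True))
    print(pick_strategy(initial_pass=True, simple_layout=True, has_gpu=False))
```

The result would be passed as the strategy argument to partition, as in the pipeline class above.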