Document Processing: PDF, HTML, Unstructured

MLOps Series RAG Systems

The quality of a RAG system is bounded by the quality of its document ingestion. Garbage in, garbage out — if your PDF extractor misses a critical table, no amount of retrieval sophistication will recover that information. This post covers the full document processing stack: PDF extraction, HTML parsing, OCR, table detection, and the Unstructured library that ties them together.

The Document Processing Challenge

Documents in the real world are messy. A single enterprise corpus might contain born-digital PDFs, scanned documents, HTML pages, Word files, PowerPoint decks, and spreadsheets. Each format presents its own extraction challenges:

PDFs: No standard text extraction API. Layout information is encoded as absolute positions, not semantic structure. Two-column layouts, headers/footers, and watermarks create noise.
Scanned documents: Contain no extractable text at all. Require OCR with layout-aware processing to reconstruct reading order.
HTML: Contains semantic structure (headings, lists, tables) but also navigation, ads, and boilerplate that must be filtered.
Tables: Span all formats and are the hardest to extract. Cell merges, nested tables, and borderless layouts confuse most extraction tools.
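One practical consequence of the list above is that a pipeline has to decide, per file, which extractor to run. The sketch below shows one way to make that decision, assuming plain text extraction has already been attempted on each PDF page. The function names, route names, and thresholds are illustrative assumptions, not part of any library.

```python
from pathlib import Path
from typing import Optional


def looks_scanned(page_texts: list[str], min_chars: int = 20,
                  max_empty_ratio: float = 0.5) -> bool:
    """Guess whether a PDF is scanned from its per-page extracted text.

    A page counts as empty when extraction produced fewer than min_chars
    non-whitespace characters. Both thresholds are assumptions to tune.
    """
    if not page_texts:
        return True
    empty = sum(1 for t in page_texts if len("".join(t.split())) < min_chars)
    return empty / len(page_texts) > max_empty_ratio


def choose_route(filename: str, page_texts: Optional[list[str]] = None) -> str:
    """Map a file to a coarse processing route (hypothetical route names)."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".html", ".htm"}:
        return "html"
    if suffix in {".png", ".jpg", ".jpeg", ".tif", ".tiff"}:
        return "ocr"
    if suffix == ".pdf":
        return "ocr" if looks_scanned(page_texts or []) else "pdf_text"
    return "other"
```

A PDF whose pages yield almost no text is sent to OCR, while a PDF with a usable text layer stays on the cheaper text-extraction path.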

PDF Extraction

PDF extraction is the most common and most frustrating document processing task. The PDF spec stores text as positioned glyphs, not as paragraphs or sentences. Extractors must reconstruct reading order from absolute coordinates, handle multiple columns, and separate body text from headers, footers, and marginalia.
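To make the reading-order problem concrete, here is a minimal sketch of a two-column heuristic. It assumes blocks arrive with (x0, y0, x1, y1) bounding boxes in a top-left-origin coordinate system and that a page has at most two columns. The Block type, the function name, and the 0.6 width ratio are assumptions for illustration; real extractors use more robust layout analysis.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin top-left


def reading_order(blocks: list[Block], page_width: float,
                  full_width_ratio: float = 0.6) -> list[Block]:
    """Order blocks on a page with at most two columns.

    Blocks wider than full_width_ratio of the page are treated as spanning
    both columns (titles, wide tables). They split the page into bands;
    within each band the left column is read before the right column.
    """
    mid = page_width / 2
    ordered: list[Block] = []
    left: list[Block] = []
    right: list[Block] = []

    def flush() -> None:
        # Emit the pending band: whole left column, then whole right column.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    # Visit blocks top to bottom so each column list stays vertically sorted.
    for block in sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0])):
        x0, _, x1, _ = block.bbox
        if (x1 - x0) > full_width_ratio * page_width:
            flush()
            ordered.append(block)
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered
```

A naive top-to-bottom sort would interleave the two columns line by line, which is exactly the failure mode simple extractors show on multi-column PDFs.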

The main extraction libraries, ordered by sophistication:

PyPDF2 / pypdf

Pure Python, zero dependencies. PyPDF2 is deprecated and development continues under the pypdf name. Extracts raw text in rendering order. Fast, but produces poor results on complex layouts: columns get interleaved and tables lose their structure. Best for simple, single-column PDFs.

pdfplumber

Built on pdfminer.six, adds layout analysis and table extraction. Can detect table boundaries and extract cells. Handles two-column layouts better, but still struggles with merged cells and cannot read scanned PDFs, which need OCR first.

PyMuPDF (fitz)

Python bindings for the MuPDF C library. Extremely fast text extraction with block-level layout detection. Excellent for high-throughput processing. Returns text blocks with bounding boxes for post-processing.

Docling / LayoutParser

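Deep learning document analysis. These tools run object detection models over page images to find document regions (title, text, table, figure): LayoutParser wraps Detectron2-based detectors, while Docling ships its own layout and table-structure models. Highest quality but highest latency. Best for complex, varied documents.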

    import fitz  # PyMuPDF
    import pdfplumber
    from pypdf import PdfReader


    class PDFExtractor:
        def extract_with_pypdf(self, pdf_path: str) -> list[str]:
            # Basic extraction: fast but low quality on complex layouts.
            # Pages with no extractable text are skipped, so list indices
            # do not necessarily match page numbers.
            reader = PdfReader(pdf_path)
            pages = []
            for page in reader.pages:
                text = page.extract_text() or ""
                if text.strip():
                    pages.append(text)
            return pages

        def extract_with_pymupdf(self, pdf_path: str) -> list[dict]:
            # Block-level extraction with bounding boxes
            results = []
            with fitz.open(pdf_path) as doc:  # close the file when done
                for page_num, page in enumerate(doc):
                    for block in page.get_text("dict")["blocks"]:
                        if block["type"] != 0:  # 0 = text block, 1 = image block
                            continue
                        # Join the spans of each line, then the lines of the block
                        lines = [
                            "".join(span["text"] for span in line["spans"])
                            for line in block["lines"]
                        ]
                        results.append({
                            "text": "\n".join(lines).strip(),
                            "bbox": block["bbox"],
                            "page": page_num,  # zero-based
                        })
            return results

        def extract_with_pdfplumber(self, pdf_path: str) -> list[dict]:
            # Layout-aware extraction with table detection
            results = []
            with pdfplumber.open(pdf_path) as pdf:
                for page in pdf.pages:
                    # Tables come back as lists of rows (lists of cell strings)
                    tables = page.extract_tables()
                    # Note: extract_text() returns the full page text, so text
                    # inside table regions appears here as well.
                    text = page.extract_text() or ""
                    results.append({"text": text, "tables": tables})
            return results

HTML Processing

HTML documents present the opposite challenge from PDFs: they contain too much structure. Navigation bars, sidebars, footers, ads, and script tags must be stripped to isolate the main content. The goal is to extract semantic text while preserving heading hierarchy and list structure for chunking.

    import re

    import trafilatura
    from bs4 import BeautifulSoup


    class HTMLProcessor:
        NOISE_TAGS = ["script", "style", "nav", "footer", "header", "aside"]

        def extract_with_bs4(self, html: str) -> dict:
            # Manual extraction with BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")

            # Remove noise elements. Caveat: dropping <header> also drops any
            # <h1> placed inside it, which some sites use for the page title.
            for tag in soup.find_all(self.NOISE_TAGS):
                tag.decompose()

            # Extract headings for structure
            headings = [
                {"level": int(h.name[1]), "text": h.get_text().strip()}
                for h in soup.find_all(["h1", "h2", "h3", "h4"])
            ]

            # Extract main content, preferring semantic containers
            main = soup.find("main") or soup.find("article") or soup.body
            text = main.get_text(separator="\n", strip=True) if main else ""
            text = re.sub(r"\n{3,}", "\n\n", text)
            return {"text": text, "headings": headings}

        def extract_with_trafilatura(self, html: str) -> str:
            # trafilatura handles boilerplate removal and main-content detection
            result = trafilatura.extract(
                html,
                include_tables=True,
                include_links=False,
                include_images=False,
                output_format="txt",
            )
            # extract() returns None when no main content is found
            return result or ""
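Preserving the heading hierarchy pays off at chunking time, because each chunk can carry the path of headings it sits under. The following is a small sketch of that idea, assuming a flat list of headings shaped like the {"level", "text"} dicts returned by the BeautifulSoup extractor. The function name and the " > " separator are illustrative choices.

```python
def heading_paths(headings: list[dict]) -> list[str]:
    """Turn a flat list of {"level", "text"} headings into breadcrumb paths."""
    stack: list[dict] = []
    paths = []
    for h in headings:
        # Headings at the same or a deeper level are not ancestors: drop them.
        while stack and stack[-1]["level"] >= h["level"]:
            stack.pop()
        stack.append(h)
        paths.append(" > ".join(item["text"] for item in stack))
    return paths


if __name__ == "__main__":
    demo = [
        {"level": 1, "text": "Guide"},
        {"level": 2, "text": "Install"},
        {"level": 3, "text": "Linux"},
        {"level": 2, "text": "Usage"},
    ]
    for path in heading_paths(demo):
        print(path)
```

Attaching such a path to each chunk's metadata gives the retriever context that the chunk text alone may not contain.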

Recommendation: Use trafilatura as your default HTML extractor. It outperforms manual BeautifulSoup parsing on 90% of web pages and handles boilerplate removal, main content detection, and encoding issues automatically.

OCR Pipeline

Scanned documents and image-based PDFs require Optical Character Recognition (OCR). Modern OCR is a multi-step pipeline: image preprocessing, text detection (where are the text regions?), text recognition (what do they say?), and layout reconstruction (what order should they be read?).

    import cv2
    import numpy as np
    import pytesseract
    from PIL import Image, ImageEnhance, ImageFilter


    class OCRPipeline:
        def preprocess_image(self, image_path: str) -> Image.Image:
            # Image preprocessing improves OCR accuracy significantly
            img = Image.open(image_path)
            # Convert to 8-bit grayscale
            img = img.convert("L")
            # Increase contrast
            img = ImageEnhance.Contrast(img).enhance(2.0)
            # Denoise with median filter
            img = img.filter(ImageFilter.MedianFilter(size=3))
            # Binarize with adaptive threshold (block size 11, constant 2)
            binary = cv2.adaptiveThreshold(
                np.array(img), 255,
                cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                cv2.THRESH_BINARY, 11, 2,
            )
            return Image.fromarray(binary)

        def extract_text(self, image_path: str) -> str:
            # Basic Tesseract OCR
            processed = self.preprocess_image(image_path)
            return pytesseract.image_to_string(processed, lang="eng").strip()

        def extract_with_layout(self, image_path: str, min_conf: float = 60) -> list[dict]:
            # Extract text lines with bounding box information
            processed = self.preprocess_image(image_path)
            data = pytesseract.image_to_data(
                processed, output_type=pytesseract.Output.DICT
            )
            # Group words into lines using Tesseract's own block, paragraph,
            # and line indices. Dicts keep insertion order, so the order
            # reported by Tesseract is preserved.
            lines: dict[tuple, dict] = {}
            for i, word in enumerate(data["text"]):
                word = word.strip()
                # conf is -1 for non-word entries and may be a float string
                if not word or float(data["conf"][i]) <= min_conf:
                    continue
                key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
                left, top = data["left"][i], data["top"][i]
                right, bottom = left + data["width"][i], top + data["height"][i]
                line = lines.setdefault(
                    key, {"words": [], "bbox": [left, top, right, bottom]}
                )
                line["words"].append(word)
                # Grow the line's bounding box to cover the new word
                box = line["bbox"]
                line["bbox"] = [min(box[0], left), min(box[1], top),
                                max(box[2], right), max(box[3], bottom)]
            return [
                {"text": " ".join(ln["words"]), "bbox": tuple(ln["bbox"])}
                for ln in lines.values()
            ]

OCR accuracy: Tesseract achieves ~95% character accuracy on clean printed text but drops to 70–80% on noisy scans. For production workloads, consider cloud OCR services (Google Document AI, AWS Textract, Azure Document Intelligence) that use deep learning models trained on millions of documents.
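Accuracy figures like these vary a lot with the scans involved, so it is worth measuring on a small hand-transcribed sample of your own documents. The sketch below assumes character accuracy is defined as one minus the character error rate, where the error rate is the Levenshtein distance divided by the reference length. That definition and the function names are assumptions; other write-ups may measure it differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return 1 - CER, clamped to 0 when OCR output is far longer than the reference."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    cer = edit_distance(reference, ocr_output) / len(reference)
    return max(0.0, 1.0 - cer)
```

Running this over a few dozen representative pages gives a baseline for comparing preprocessing settings or for deciding whether a cloud OCR service is worth the cost.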

Table Parsing

Tables require specialized extraction because their meaning depends entirely on the relationship between rows, columns, and headers. A cell containing "42.5" is meaningless without knowing it represents "Q3 Revenue in millions." Effective table parsing must preserve this relational structure.

    import camelot
    import pandas as pd
    import torch
    from transformers import AutoImageProcessor, TableTransformerForDetection


    class TableParser:
        def extract_with_camelot(self, pdf_path: str, pages: str = "all"):
            # Lattice mode: for tables with visible grid lines
            lattice_tables = camelot.read_pdf(
                pdf_path, pages=pages, flavor="lattice"
            )
            # Stream mode: for borderless tables
            stream_tables = camelot.read_pdf(
                pdf_path, pages=pages, flavor="stream",
                edge_tol=50,  # Edge detection tolerance
            )
            # Keep the mode with the higher total accuracy score, which
            # rewards both more tables and better-parsed tables
            if sum(t.accuracy for t in lattice_tables) > \
                    sum(t.accuracy for t in stream_tables):
                return [(t.df, t.accuracy) for t in lattice_tables]
            return [(t.df, t.accuracy) for t in stream_tables]

        def serialize_for_rag(self, df: pd.DataFrame, strategy: str = "row") -> str:
            # Convert DataFrame to RAG-friendly text. Camelot returns integer
            # column labels, so promote the header row before calling this.
            if strategy == "row":
                # Each row becomes a natural language sentence
                headers = df.columns.tolist()
                sentences = []
                for _, row in df.iterrows():
                    pairs = [
                        f"{h} is {v}" for h, v in zip(headers, row)
                        # Skip empty cells but keep legitimate zeros
                        if pd.notna(v) and str(v).strip()
                    ]
                    if pairs:
                        sentences.append(", ".join(pairs) + ".")
                return "\n".join(sentences)
            elif strategy == "markdown":
                # Markdown table format (preserves structure); needs tabulate
                return df.to_markdown(index=False)
            else:
                # JSON records (best for LLM consumption)
                return df.to_json(orient="records", indent=2)

        def detect_with_deep_learning(self, page_image, threshold: float = 0.9):
            # Microsoft Table Transformer for complex layouts.
            # In production, load the processor and model once and reuse them.
            name = "microsoft/table-transformer-detection"
            processor = AutoImageProcessor.from_pretrained(name)
            model = TableTransformerForDetection.from_pretrained(name)
            image = page_image.convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)
            # target_sizes expects (height, width); PIL's size is (width, height)
            target_sizes = torch.tensor([image.size[::-1]])
            # Returns "scores", "labels", and "boxes" in pixel coordinates
            return processor.post_process_object_detection(
                outputs, threshold=threshold, target_sizes=target_sizes
            )[0]

The Unstructured Library

The unstructured library is the Swiss Army knife of document processing. It provides a unified API for extracting content from 20+ file formats, with automatic format detection, layout analysis, and element classification. It wraps the best extraction tools for each format and adds intelligent post-processing.

Key capabilities of unstructured:

Auto-detection: Identifies file type and routes to the appropriate extractor (no manual format switching).
Element classification: Categorizes extracted content as Title, NarrativeText, ListItem, Table, Image, etc.
Layout analysis: Uses detectron2 or yolox for visual layout detection on PDFs and images.
Chunking: Built-in chunking strategies that respect element boundaries, so a table or list item is kept whole unless it exceeds the maximum chunk size on its own.
Metadata enrichment: Attaches page number, element type, coordinates, and parent element to each chunk.
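To illustrate what chunking on element boundaries means, here is a simplified sketch in plain Python. It is not the library's implementation: the Element type, the function name, and the rule that an oversized element becomes its own chunk are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # e.g. "Title", "NarrativeText", "ListItem", "Table"
    text: str


def chunk_by_heading(elements: list[Element], max_chars: int = 1500) -> list[str]:
    """Group elements into chunks, starting a new chunk at each Title.

    Elements are never split. An element that does not fit starts a new
    chunk, and an element longer than max_chars becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0

    def flush() -> None:
        nonlocal size
        if current:
            chunks.append("\n".join(current))
            current.clear()
            size = 0

    for el in elements:
        added = len(el.text) + (1 if current else 0)  # +1 for the joining newline
        if el.kind == "Title" or (current and size + added > max_chars):
            flush()
            added = len(el.text)
        current.append(el.text)
        size += added
    flush()
    return chunks
```

Compared with fixed-size character windows, this keeps each heading together with the text beneath it, which tends to produce chunks that stand on their own when retrieved.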

    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.auto import partition
    from unstructured.staging.base import elements_to_dicts


    class UnstructuredPipeline:
        def process_document(self, file_path: str) -> list[dict]:
            # Auto-detect format and extract structured elements
            elements = partition(
                filename=file_path,
                strategy="hi_res",           # Use layout detection model
                infer_table_structure=True,  # Table HTML lands in metadata
            )
            # Record each element's class name (Title, NarrativeText, ...)
            return [
                {
                    "type": type(el).__name__,
                    "text": str(el),
                    "metadata": el.metadata.to_dict(),
                }
                for el in elements
            ]

        def chunk_elements(self, file_path: str) -> list[dict]:
            # Extract and chunk in one pipeline
            elements = partition(filename=file_path, strategy="hi_res")
            # Chunk by title: groups elements under their heading
            chunks = chunk_by_title(
                elements,
                max_characters=1500,
                new_after_n_chars=1000,
                combine_text_under_n_chars=200,
            )
            return elements_to_dicts(chunks)

        def build_rag_documents(self, file_paths: list[str]) -> list[dict]:
            # Full pipeline: process, chunk, and prepare for vector store
            all_chunks = []
            for path in file_paths:
                for chunk in self.chunk_elements(path):
                    all_chunks.append({
                        "content": chunk["text"],
                        "metadata": {
                            "source": path,
                            "element_type": chunk["type"],
                            "page_number": chunk.get("metadata", {}).get("page_number"),
                        },
                    })
            return all_chunks

Production tip: Use strategy="hi_res" for the initial document processing pass (highest quality but slower), and strategy="fast" for incremental updates where documents have simple layouts. The quality difference is significant on PDFs with complex layouts.

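Dependency note: The hi_res strategy runs a layout detection model (detectron2 or yolox), which is slow without a GPU. For CPU-only deployments, use strategy="fast" on PDFs that have an extractable text layer; it relies on rule-based extraction. The default strategy="auto" picks a strategy from the document's characteristics and the arguments you pass, so it can still select hi_res for some inputs.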