Rules vs. Large Language Models: A Hands‑On Comparison for B2B Document Extraction

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, and delivery notes—has long been a challenge for automation teams. Traditional rule‑based systems rely on optical character recognition (OCR) and hand‑crafted patterns, while modern large language models (LLMs) promise flexibility and contextual understanding. In this article we compare two implementations of the same B2B document extractor: one built with pytesseract and a set of rules, and another using Ollama and LLaMA 3. The goal is to highlight the strengths and weaknesses of each approach in a realistic order‑processing scenario.


The B2B Order Scenario

Imagine a company that receives hundreds of PDF purchase orders daily. Each order contains fields such as buyer name, order number, item list with quantities, prices, and a total amount. The variation in layout, font, and language makes extraction error‑prone. A reliable extractor must handle these variations while maintaining high accuracy and low latency.

Rule‑Based Approach with Pytesseract

How It Works

The rule‑based system starts by converting PDF pages to images (for example with pdf2image) and running them through pytesseract, Python’s wrapper for Tesseract OCR. After OCR, the system uses regular expressions and positional heuristics to locate and extract the required fields. For example, a rule might look for the pattern Order\s*#\s*:?\s*([A-Z0-9]+) to capture an order number.
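A minimal sketch of the rule layer, assuming the OCR text has already been produced (in the real pipeline, pdf2image renders the pages and pytesseract OCRs them). The field names and patterns here are illustrative, not the article's exact rule set:

```python
import re

# Hand-crafted patterns, one per field (illustrative examples).
PATTERNS = {
    "order_number": re.compile(r"Order\s*#\s*:?\s*([A-Z0-9]+)"),
    "buyer": re.compile(r"Buyer\s*:\s*(.+)"),
    "total": re.compile(r"Total\s*:\s*\$?([\d][\d.,]*)"),
}

def apply_rules(ocr_text: str) -> dict:
    """Apply each field pattern to the OCR output; None if a field is missing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(ocr_text)
        fields[name] = m.group(1).strip() if m else None
    return fields

sample = "Buyer: Acme GmbH\nOrder #: PO12345\nTotal: $1,299.00"
result = apply_rules(sample)
```

In a full pipeline, positional heuristics (e.g., restricting a pattern to the top third of the page) would supplement these document-wide searches.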

Strengths

- Fast and cheap: runs on a single CPU core with deterministic, predictable output.
- Easy to debug: when a field is missed, the failing pattern can be inspected directly.
- No GPU or model-serving infrastructure required.

Weaknesses

- Brittle: small changes in layout, font, or wording break the patterns.
- Every new supplier layout requires writing and testing new rules.
- Struggles with complex structures such as tables with merged cells.

LLM‑Based Approach with Ollama and LLaMA 3

How It Works

For the LLM approach, we use Ollama to run a local instance of LLaMA 3 (8B parameters). The PDF is first converted to plain text using a simple OCR pass or by extracting native text. Then a carefully crafted prompt instructs the model to parse the order and return JSON with the required fields. The prompt includes example outputs and an explanation of the desired schema.
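The shape of this layer can be sketched as follows. The schema, field names, and helper names are illustrative stand-ins; the request itself goes to Ollama's local REST endpoint (POST /api/generate with "format": "json" to constrain the reply to valid JSON), and the network call is isolated in its own function so the prompt and response validation can be shown on their own:

```python
import json
import urllib.request

# Illustrative schema: the fields the prompt asks the model to return.
REQUIRED_FIELDS = {"buyer", "order_number", "items", "total"}

PROMPT_TEMPLATE = """You extract purchase-order data. Return ONLY JSON with
keys: buyer, order_number, items (list of {{sku, quantity, price}}), total.

Document text:
{document}
"""

def query_llama3(document_text: str) -> str:
    """Send the prompt to a local Ollama instance serving llama3."""
    payload = json.dumps({
        "model": "llama3",
        "prompt": PROMPT_TEMPLATE.format(document=document_text),
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def parse_reply(reply: str) -> dict:
    """Parse the model's JSON reply and fail loudly on missing fields."""
    order = json.loads(reply)
    missing = REQUIRED_FIELDS - order.keys()
    if missing:
        raise ValueError(f"model reply missing fields: {sorted(missing)}")
    return order

# Validating a canned reply (no server needed for this demonstration):
canned = '{"buyer": "Acme GmbH", "order_number": "PO12345", "items": [], "total": 1299.0}'
order = parse_reply(canned)
```

Because LLM output is not guaranteed to follow the schema, validating every reply before downstream use is essential.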

Strengths

- Handles layout and wording variation without per-supplier rules.
- Higher field-level accuracy on varied documents (94% in our test).
- Adapting to new layouts is a matter of prompt adjustments, not code changes.

Weaknesses

- Slower and more expensive: requires a GPU and several seconds per document.
- Occasionally misreads amounts or skips rare fields.
- Output is not deterministic, so every reply must be validated against the expected schema.

Head‑to‑Head Comparison

Accuracy on a Test Set

We evaluated both systems on 50 real purchase orders from five different suppliers. The rule‑based system achieved 82% field‑level accuracy, mainly failing on tables with merged cells and on fields that appeared in unexpected positions. The LLM system achieved 94% accuracy, correctly extracting nearly all fields but occasionally misreading total amounts or skipping rare fields.
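A minimal sketch of the field-level accuracy metric used above: every (document, field) pair counts once, and a field scores only on an exact match with the gold value. The helper and data names are illustrative:

```python
def field_accuracy(predictions: list[dict], gold: list[dict]) -> float:
    """Fraction of gold fields the extractor reproduced exactly."""
    correct = total = 0
    for pred, truth in zip(predictions, gold):
        for field, value in truth.items():
            total += 1
            correct += int(pred.get(field) == value)
    return correct / total

# Two documents, two fields each: 3 of 4 fields match -> 0.75
preds = [{"order_number": "PO12345", "total": "1299.00"},
         {"order_number": "PO99999", "total": "10.00"}]
truth = [{"order_number": "PO12345", "total": "1299.00"},
         {"order_number": "PO99999", "total": "12.00"}]
score = field_accuracy(preds, truth)
```

In practice one would normalize values (whitespace, currency formatting) before comparison, since exact string matching penalizes harmless formatting differences.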


Speed and Cost

Processing time for the rule‑based system averaged 1.2 seconds per document, using only a single CPU core. The LLM system took an average of 8.7 seconds per document on an NVIDIA RTX 3060 GPU. For a batch of 1,000 documents, the rule system would finish in 20 minutes, while the LLM would require over two hours—though this can be parallelized with multiple GPUs.
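The batch arithmetic above can be reproduced directly; the helper name and linear-scaling assumption for multiple workers are illustrative:

```python
def batch_minutes(seconds_per_doc: float, n_docs: int, workers: int = 1) -> float:
    """Wall-clock minutes for a batch, assuming linear scaling across workers."""
    return seconds_per_doc * n_docs / workers / 60

rule_minutes = batch_minutes(1.2, 1000)  # 20 minutes on one CPU core
llm_minutes = batch_minutes(8.7, 1000)   # ~145 minutes, i.e. over two hours
```

With, say, four GPUs and perfect parallelism, the LLM batch drops to roughly 36 minutes, though real-world scaling is rarely perfectly linear.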

Maintainability

When a sixth supplier with a completely different layout was added, the rule‑based system required two days of work to create and test new patterns. The LLM system needed only five prompt adjustments and re‑evaluation, which took two hours.

When to Choose Which

Prefer the rule-based extractor when layouts are stable and known in advance, when latency and hardware cost matter, and when deterministic output is required. Prefer the LLM when layouts vary widely across suppliers, when new document types arrive frequently, and when maintenance time dominates the total cost of ownership.

Conclusion

Neither approach is universally superior. The rule‑based extractor with pytesseract is fast, cheap, and predictable, but brittle. The LLM‑based system using Ollama and LLaMA 3 is flexible, accurate, and low‑maintenance, but slower and more expensive. For many B2B scenarios, a hybrid approach may be optimal: use rules for common templates and fall back to an LLM when confidence is low. Whichever path you choose, the key is to thoroughly test on your own data and monitor performance over time.
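The hybrid fallback described above can be sketched as a simple router: run the cheap rule extractor first and invoke the LLM only when rule confidence is low. The extractor signatures and the 0.9 threshold are illustrative stand-ins, not a specification:

```python
def hybrid_extract(document_text, rule_extract, llm_extract, threshold=0.9):
    """Return rule output when confident enough, else fall back to the LLM."""
    fields, confidence = rule_extract(document_text)
    if confidence >= threshold:
        return fields, "rules"
    return llm_extract(document_text), "llm"

# Stub extractors showing the expected contract:
rules = lambda text: ({"order_number": "PO12345"},
                      0.95 if "Order #" in text else 0.2)
llm = lambda text: {"order_number": "PO12345"}

_, route_known = hybrid_extract("Order #: PO12345 ...", rules, llm)
_, route_novel = hybrid_extract("unfamiliar layout", rules, llm)
```

With this design, most documents take the fast, cheap path, and the LLM's latency and GPU cost are paid only on the hard cases.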

This article originally appeared on Towards Data Science.
