Building Supply Chain and Logistics AI Agents with 90%+ Accuracy with Boon’s “Agent Studio”

Deepti Yenireddy

Complex logistics and freight forwarding workflows–pricing, quoting, and order processing–are notoriously difficult to automate. These workflows also consume the highest concentration of effort and resources since they are the bread and butter of how supply chain and logistics companies book revenue. This work is also critical, given the customer and vendor relationships at stake and real time pricing optimization that can have a real impact on profit margins. Agentic AI can now mirror human behavior to a point where many of these decisions and communications can be fully automated

This comes with difficult, but not insurmountable technical challenges to get to near perfect automation. Information is fragmented across emails, PDFs, hard to integrate portals, and spreadsheets. Generic LLMs stall around 80% accuracy, leaving significant room for automation.

Over the last few weeks we built a configurable logistics agent builder that is capable of instantly releasing agents that perform end to end tasks across multi-modal transportation–sea, air, and road. This includes receiving and responding to quotes, negotiating and confirming orders with customers, and entering details where they need to go across multiple modes of communication–text, voice, email–in a wide variety of languages. Each agent parses complex workflows and documents with over 90% accuracy across 100s of data fields. Here’s how we did it, and in the process, made some poor choices and some better ones to learn and land at our “Agent Studio” platform.

‍

The challenge: Fragmented workflows, systems, complex data fields, and historical context

Operations teams at logistics and freight forwarding companies manage a constant influx of unstructured information from multiple sources:

Emails containing quote requests and rate confirmations across countries, and languages–all with industry specific context e.g. intermodal, container, vehicle, dimensions, regulation specifics etc.
PDFs for pricing, approvals, and delivery documents
Excel spreadsheets filled with detailed pricing data
Historical context or unwritten rules based on historical procedures
‍

We humans are remarkable and fascinating when it comes to building this context and performing these workflows across many moving parts. It still takes any new operator time to train and it’s manually intensive. With so many moving parts and high-stakes touchpoints, an entire headcount (a single CSR or operator) is typically devoted to handling only a portion of a company’s book of business.

The extremely high levels of accuracy required when dealing with revenue-impacting processes creates an interesting challenge for AI to automate, not impossible but formidable. Each data source introduces unique complexities, making accurate data extraction, mapping and decision making challenging. Generic AI models trained broadly on internet data consistently misinterpret logistics-specific terms and fail to reliably handle critical fields like shipment dates, addresses, and rates across various document types, images and free form emails. Even with fine tuned small models and multi shot prompting, one runs into the challenges of abstracting this out to reusable building blocks.

Achieving accuracy rates above 90% on data extraction had proven elusive, keeping automation limited and heavily dependent on human oversight.

‍

Building an agent builder

Traditional software is fairly rule based, deterministic and predictable. However this breaks down in the supply chain and logistics world in verticals like Transportation, Field Services, Freight Forwarding, and Food and Beverage, to name a few. Additionally, every company uses disparate systems and functions slightly differently. This creates a huge burden when building technology the way we’ve built SaaS over the last few decades, but AI has ushered in a paradigm shift. Agent building can use the power of LLMs to create a new-level of scalability for engineering teams, even outside of AI powered coding.

We first started with building blocks for various parts of a single agent, such as receiving and parsing an email, conditional routing blocks, LLM blocks for document extraction, validation blocks among many others.

An orchestration layer connects each part of the workflow (parsing, inference, validation) into a single runtime. On top of it, we exposed a visual Agent Builder where logistics teams can assemble workflows using natural language and modular blocks like ‘Parse Email’, ‘Extract PDF’, and ‘Send Reply’. Each block calls into the backend system, abstracting complexity while retaining full access to the underlying accuracy gains. The result is a system that’s deployable in minutes and adaptable without code.
‍

How we approached the extraction problem

Traditional AI models lack specificity; they're powerful but generalized. Without fine-tuning for logistics contexts, these models:

Frequently extract incomplete or incorrect data, requiring extensive human correction.
Lack clear evaluation frameworks designed to measure and improve accuracy for logistics use cases.

As a result, they struggle to deliver the accuracy and operational efficiency that logistics teams need.

Boon's targeted approach

Recognizing these limitations, we took a precision-driven approach for all our document extraction:

Multi-shot prompting: We used multiple targeted examples simultaneously (multi-shot prompting) to rapidly fine-tune the model, directly teaching it:
1. Specific sub-vertical nuances since long-haul transportation is vastly different from field services or freight forwarding.
2. The fact that every company we work with has highly varied workflows, with each of their customers also using different systems to generate different formats of data.
3. Free form emails where you have to extract 30-40 fields, specific to the industry and company, including both positive and negative examples to land at our target of 95%+ accuracy.
  ‍
Rapid fine-tuning with built in evals to keep comparing results with other approaches. We used Open AI’s existing Supervised Fine Tuning framework in addition to multi shot prompting as mentioned above.
‍
Built a precise, field-level evaluation framework
Rather than use existing evaluation tool kits, we built our own framework to precisely measure accuracy for each individual data field in the structured output (e.g. shipment dates, addresses, or pricing details). This precision allowed rapid improvement by clearly identifying and addressing error-prone areas and helped with flexibility across data types–documents, images, marked downs etc.

FIELD-LEVEL ACCURACY REPORT
==============================
  freight_type:
    Finetuned: 101/120 (84.17%)
    Baseline:  96/120 (80.00%)
  customer_name:
    Finetuned: 93/120 (77.50%)
    Baseline: 115/120 (95.83%)
  customer_street_number:
    Finetuned: 110/120 (91.67%)
    Baseline:  91/120 (75.83%)
  customer_zip_code:
    Finetuned: 92/120 (76.67%)
    Baseline:  58/120 (48.33%)
  customer_country:
    Finetuned: 107/120 (89.17%)
    Baseline:  48/120 (40.00%)
  contact_person_name:
    Finetuned: 84/120 (70.00%)
    Baseline: 118/120 (98.33%)
  contact_person_email:
    Finetuned: 113/120 (94.17%)
    Baseline: 118/120 (98.33%)
  contact_person_phone:
    Finetuned: 117/120 (97.50%)
    Baseline: 119/120 (99.17%)
  customer_reference:
    Finetuned: 116/120 (96.67%)
    Baseline: 120/120 (100.00%)
  origin_section_origin_company:
    Finetuned: 85/120 (70.83%)
    Baseline: 104/120 (86.67%)
  origin_section_origin_street_number:
    Finetuned: 111/120 (92.50%)
    Baseline:  99/120 (82.50%)
  origin_section_origin_zip_code:
    Finetuned: 85/120 (70.83%)
    Baseline:  50/120 (41.67%)
  origin_section_origin_city:
    Finetuned: 108/120 (90.00%)
    Baseline: 118/120 (98.33%)
  origin_section_origin_country:
    Finetuned: 106/120 (88.33%)
    Baseline:  47/120 (39.17%)
  destination_section_destination_company:
    Finetuned: 107/120 (89.17%)
    Baseline: 114/120 (95.00%)
  destination_section_destination_street_number:
    Finetuned: 111/120 (92.50%)
    Baseline: 101/120 (84.17%)
  destination_section_destination_zip_code:
    Finetuned: 102/120 (85.00%)
    Baseline:  59/120 (49.17%)
  destination_section_destination_city:
    Finetuned: 107/120 (89.17%)
    Baseline: 116/120 (96.67%)
  destination_section_destination_country:
    Finetuned: 109/120 (90.83%)
    Baseline:  69/120 (57.50%)
  incoterm:
    Finetuned: 117/120 (97.50%)
    Baseline: 119/120 (99.17%)
  cargo_details_quantity:
    Finetuned: 104/120 (86.67%)
    Baseline:  86/120 (71.67%)
  cargo_details_package_type:
    Finetuned: 93/120 (77.50%)
    Baseline:  92/120 (76.67%)
  cargo_details_cargo_weight:
    Finetuned: 112/120 (93.33%)
    Baseline: 110/120 (91.67%)
  cargo_details_volume:
    Finetuned: 64/120 (53.33%)
    Baseline:  66/120 (55.00%)
  cargo_details_length:
    Finetuned: 110/120 (91.67%)
    Baseline:  66/120 (55.00%)
  cargo_details_height:
    Finetuned: 110/120 (91.67%)
    Baseline:  66/120 (55.00%)
  cargo_details_width:
    Finetuned: 110/120 (91.67%)
    Baseline:  66/120 (55.00%)
  cargo_details_stackable:
    Finetuned: 113/120 (94.17%)
    Baseline:  99/120 (82.50%)
  cargo_details_dangerous:
    Finetuned: 118/120 (98.33%)
    Baseline: 120/120 (100.00%)
  cargo_details_temperature:
    Finetuned: 108/120 (90.00%)
    Baseline:  94/120 (78.33%)
  cargo_details_container_type:
    Finetuned: 111/120 (92.50%)
    Baseline: 102/120 (85.00%)
  cargo_details_description_of_goods:
    Finetuned: 75/120 (62.50%)
    Baseline:  80/120 (66.67%)
------------------------------
FIELD-LEVEL ACCURACY SUMMARY METRICS:
  Micro‑Average (overall field correctness):
    Finetuned: 86.17% (3309/3840 fields correct)
    Baseline: 76.20% (2926/3840 fields correct)
==============================

Immediate operational improvements

This precision-driven, modular approach delivered immediate and tangible benefits:

Accuracy surpassed 90%, significantly reducing manual oversight and corrections.
Operational efficiency increased sharply, cutting manual review tasks by 50%.
Faster processing of pricing, quotes and orders, transforming these common logistics workflows from hours-long manual efforts to near-instant automated responses.‍
Engineering scalability and accessibility, overall engineering build time was 2 weeks for an Agent Studio that enterprise companies can use to build ANY agent to perform ANY task on ANY number of systems

‍

What this means for our customers

Our rapid development of highly accurate, specialized AI agents demonstrates that effective automation doesn’t require prolonged AI projects. With Boon, logistics teams can deploy agents tailored to their workflows in days. Accuracy is measurable for every extracted data point, prompts and logic are adjustable without code, human input shifts to exception handling, and workflows that once took months to build are now configurable on demand.

‍