Platform

Document Intelligence Platform

A configurable platform for turning any corpus of internal documents into a queryable, auditable, domain-aware knowledge layer. It combines agentic research loops, semantic chunking, vector clustering with HNSW declustering, multi-provider LLM routing, and live-streaming admin telemetry.

Sector: Infrastructure / B2B platform
Year: 2026
Stack: Python · HNSW · Union-Find · SSE · multi-provider LLM
Type: Configurable platform

Every enterprise has the same broken RAG.

The pattern repeats. A team builds a retrieval-augmented system on top of an internal document set. It demos beautifully. In production, three failure modes show up. Chunking shreds context. Retrieval pulls semantically adjacent but logically wrong passages. And the system cannot explain why it returned what it returned, so operators cannot fix it.

The brief was to build a platform that any team could configure for its own corpus and ship without hitting those failure modes again. Not a one-off RAG implementation. The substrate underneath every future RAG project.

Document understanding is the part that classical NLP could never do.

Older systems segment documents by structural rules (headers, paragraphs, fixed token counts). The result is brittle and corpus-specific. Procurement contracts, scientific papers, internal wikis, and product specs all have different document grammars, and a one-size chunker fails all of them.

An LLM-driven semantic chunker can read intent. It segments on meaning boundaries, not formatting boundaries. The system uses that capability as the chunking primitive, and pushes a second LLM pass over the resulting catalog for enrichment: title, summary, entity extraction, topic taxonomy, and a synthetic question set used for downstream evaluation.
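
To make the idea of meaning boundaries concrete, here is a minimal sketch of a sliding-window chunker. It is not the platform's implementation. The names `semantic_chunks`, `is_boundary`, `window`, and `overlap` are hypothetical, and `toy_boundary` is a crude word-overlap stand-in for the LLM call that would judge a meaning shift in practice.

```python
import re
from typing import Callable, List


def semantic_chunks(
    sentences: List[str],
    is_boundary: Callable[[List[str], str], bool],
    window: int = 4,
    overlap: int = 1,
) -> List[List[str]]:
    """Split sentences into chunks wherever is_boundary reports a meaning shift."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        # Only the last `window` sentences are shown to the boundary judge.
        context = current[-window:]
        if current and is_boundary(context, sentence):
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


def toy_boundary(context: List[str], sentence: str) -> bool:
    """Stand-in for an LLM judgment: a boundary is declared when the
    sentence shares no word longer than three letters with the context."""
    def long_words(text: str) -> set:
        return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

    return not (long_words(" ".join(context)) & long_words(sentence))
```

Swapping `toy_boundary` for a model call leaves the windowing and overlap logic unchanged, which is the point of keeping the judge behind a plain callable.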

None of this is bolted-on AI. The LLM sits at the centre of the data model, with deterministic engineering wrapped around it.

How it was built.

Figure: pipeline schematic (chunker, embedding, HNSW clustering, agentic researcher, streaming admin).
Ingestion pipeline with semantic chunking, two-phase enrichment, HNSW declustering, and a self-evaluating agentic researcher on top.
  1. Incremental indexing. Google Drive (or any source) is polled by modifiedTime plus MD5 hash. The system never reprocesses unchanged documents. Cost of re-ingestion drops with corpus stability.
  2. Semantic sliding-window chunker. An LLM walks the document with a windowed context, emitting chunk boundaries on meaning shifts rather than token counts. Overlaps are preserved across boundaries so retrieval never lands mid-thought.
  3. Two-phase catalog enrichment. Phase one: LLM enrichment per chunk (title, summary, entities, topic, synthetic Q&A). Phase two: HNSW vector index over the enriched space, then a Union-Find pass to decluster — collapsing near-duplicate chunks that the LLM enrichment had over-fragmented. The declustering step alone cut catalog size by ~35% on a typical corpus while improving retrieval precision.
  4. Agentic researcher loop. Queries are answered by a Grok-class small model running a self-evaluating loop: retrieve, draft, score the draft against the query, retrieve more if confidence is low, iterate. A larger model is only invoked when the small model flags low confidence. Cost-effective by construction.
  5. Multi-provider LLM routing. Each step of the pipeline (chunking, enrichment, embedding, generation) is routed to the right model independently. Different tasks, different cost-and-latency profiles. Provider migrations are configuration, not code changes.
  6. SSE-streaming admin panel. Operators see every retrieval, every score, every re-ranking, every model call, in real time. When a query goes wrong, the operator can read the trace and fix the system. This is the part that distinguishes production from demo.
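
The change detection in step 1 can be sketched roughly as follows. This is an illustration under assumptions, not the platform's code: `Fingerprint`, `needs_reindex`, and the in-memory `seen` dictionary are hypothetical names, and the ordering (timestamp as a cheap first check, hash as the deciding check) is one plausible reading of "modifiedTime plus MD5 hash".

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Fingerprint:
    modified_time: str  # timestamp string as reported by the source
    md5: str            # hex digest of the document content


def needs_reindex(
    doc_id: str,
    modified_time: str,
    content: bytes,
    seen: Dict[str, Fingerprint],
) -> bool:
    """Return True only when the document content has actually changed."""
    previous = seen.get(doc_id)
    # Cheap path: an unchanged timestamp means nothing to do.
    if previous is not None and previous.modified_time == modified_time:
        return False
    # The timestamp moved (or the document is new), so compare content hashes.
    digest = hashlib.md5(content).hexdigest()
    changed = previous is None or previous.md5 != digest
    # Record the latest fingerprint either way so the next poll is cheap.
    seen[doc_id] = Fingerprint(modified_time, digest)
    return changed
```

A document that was touched but not edited gets a new timestamp and the same hash, so it is skipped. In a real deployment `seen` would live in a persistent store rather than a dictionary.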

Where this got interesting.

HNSW declustering via Union-Find. An LLM enriching chunks independently will generate semantically near-duplicate summaries for genuinely near-duplicate chunks. A naive vector index leaves you with N copies of the same point. A Union-Find pass over an HNSW proximity graph collapses these into representative classes without losing retrieval recall. Most teams skip this step and pay for it in retrieval quality.
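
A minimal sketch of the declustering idea, assuming cosine similarity and a fixed threshold. To stay dependency-free it compares every pair by brute force; that loop is a stand-in for the neighbor queries an HNSW index would answer. `decluster`, `threshold`, and the value 0.95 are hypothetical choices for illustration, not figures from the platform.

```python
import math
from typing import Dict, List, Sequence


class UnionFind:
    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decluster(vectors: Sequence[Sequence[float]], threshold: float = 0.95) -> List[List[int]]:
    """Group indices of near-duplicate vectors into equivalence classes."""
    uf = UnionFind(len(vectors))
    # Brute-force pairs; an HNSW index would supply candidate neighbors instead.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                uf.union(i, j)
    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group can then be collapsed to one representative chunk. One design consequence worth noting: Union-Find merges transitively, so a chain of pairwise-similar chunks ends up in a single class even if its endpoints fall below the threshold, which argues for a conservative threshold.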

Self-evaluating agentic loops. A small model scoring its own draft is biased toward confidence. The fix: the scoring rubric is anchored against the retrieval evidence, not the draft text. The model can only score high if the evidence supports it. Cheap escape valve to a larger model when scores fall below threshold.
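
The loop and its escape valve can be sketched as below. Everything here is a hypothetical illustration: `research`, `Answer`, the callables, the threshold of 0.7, and the widening retrieval depth are assumptions, and `support_score` is a toy word-overlap scorer standing in for a model-applied rubric. What it preserves from the description is that the score depends on whether the evidence supports the draft, not on how confident the draft sounds.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    score: float
    rounds: int
    escalated: bool


def support_score(query: str, draft: str, evidence: List[str]) -> float:
    """Toy rubric: share of draft words that appear somewhere in the evidence."""
    words = draft.lower().split()
    pool = set(" ".join(evidence).lower().split())
    return sum(w in pool for w in words) / len(words) if words else 0.0


def research(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    draft_small: Callable[[str, List[str]], str],
    draft_large: Callable[[str, List[str]], str],
    score: Callable[[str, str, List[str]], float] = support_score,
    threshold: float = 0.7,
    max_rounds: int = 3,
    depth: int = 4,
) -> Answer:
    evidence: List[str] = []
    confidence = 0.0
    for round_no in range(1, max_rounds + 1):
        # Widen retrieval each round the draft is not yet supported.
        evidence = retrieve(query, depth * round_no)
        text = draft_small(query, evidence)
        confidence = score(query, text, evidence)
        if confidence >= threshold:
            return Answer(text, confidence, round_no, escalated=False)
    # Escape valve: hand the last evidence set to the larger model.
    return Answer(draft_large(query, evidence), confidence, max_rounds, escalated=True)
```

Because the large model is called at most once per query, and only after the small model has exhausted its rounds, the expensive path stays bounded.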

Configurability without complexity. Every knob a configuration file can expose is a knob the operator has to understand. The platform exposes the small set of decisions that matter (chunking strategy, model routing, retrieval depth) and hides the rest behind sensible defaults discovered empirically.
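
One way to keep the exposed surface small is to merge a short override file onto a fixed set of defaults and reject anything unknown. The sketch below assumes that approach. Every key, default value, and model name in `DEFAULTS` is a hypothetical placeholder, not the platform's actual schema.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical defaults: only the decisions named in the text are exposed.
DEFAULTS: Dict[str, Any] = {
    "chunking": {"strategy": "semantic", "window": 4, "overlap": 1},
    "retrieval": {"depth": 8},
    "routing": {
        "chunking": "small-model",
        "enrichment": "small-model",
        "embedding": "embedding-model",
        "generation": "large-model",
    },
}


def merge(defaults: Dict[str, Any], overrides: Dict[str, Any], path: str = "") -> Dict[str, Any]:
    """Overlay overrides on defaults, rejecting keys the platform does not expose."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if key not in defaults:
            raise KeyError(f"unknown setting: {path}{key}")
        if isinstance(defaults[key], dict):
            if not isinstance(value, dict):
                raise TypeError(f"{path}{key} must be a mapping")
            merged[key] = merge(defaults[key], value, f"{path}{key}.")
        else:
            merged[key] = value
    return merged


def model_for(config: Dict[str, Any], step: str) -> str:
    """Look up which model a pipeline step is routed to."""
    return config["routing"][step]
```

With this shape, moving generation to another provider is a one-line override such as `merge(DEFAULTS, {"routing": {"generation": "other-model"}})`, and a misspelled key fails loudly instead of being silently ignored.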

What it produced.

~35%: Reduction in catalog size after HNSW declustering, with retrieval precision improved
3-tier: Model routing across chunking, enrichment, and generation
Live: SSE telemetry on every retrieval and model call

Delivered as a working platform with a streaming admin interface. The configurability is the product. A new domain corpus comes online with a configuration file and a few hours of tuning, not a re-implementation.

This is the substrate I reach for whenever a client comes in with the question: can we make our documents queryable? The answer is yes, but the version that survives production looks like this, not like a 200-line RAG prototype.
