S&P 500 companies file thousands of pages with the SEC every quarter. Sometimes, what a company claims in a rosy 10-K contradicts what an executive quietly does in a Form 4, or what the company itself filed in a previous 10-Q. Human analysts miss these temporal discrepancies. Models shouldn't.
I built PaperTrail, an event-driven microservices system that ingests SEC filings, extracts their claims, and flags discrepancies in real time, with an end-to-end latency of under five minutes from filing to user alert.
The Infrastructure
Tech Stack
- NLP: FinBERT, LangChain, spaCy NER, BART-MNLI
- Storage: PostgreSQL (pgvector), Neo4j Knowledge Graph
- Backend: FastAPI, event-driven workers (Kafka/Celery)
- Frontend: Next.js dashboard with WebSockets
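To make the event flow concrete, here is a minimal sketch of the filing event the ingestion service could publish onto the Kafka topic. The field names and example values are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class FilingEvent:
    """One SEC filing as published onto the ingestion topic (illustrative schema)."""
    accession_no: str  # EDGAR accession number (hypothetical example below)
    cik: str           # company identifier
    form_type: str     # "10-K", "10-Q", "4", ...
    filed_at: str      # ISO-8601 timestamp
    url: str           # link to the raw filing

def to_kafka_payload(event: FilingEvent) -> bytes:
    # Kafka producers take bytes; JSON keeps the payload human-debuggable.
    return json.dumps(asdict(event)).encode("utf-8")

event = FilingEvent("0000000000-24-000001", "320193", "10-K",
                    "2024-11-01T16:30:00Z", "https://www.sec.gov/example")
payload = to_kafka_payload(event)
```

Downstream workers (extraction, retrieval, scoring) each consume this same payload, which is what keeps the services decoupled.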
Hybrid Retrieval & Extraction
You can't just throw an entire 10-K at an LLM context window and ask it to find lies. Instead, the system relies on a streaming EDGAR ingestion microservice. As forms arrive, a fine-tuned FinBERT model extracts specific financial claims from the text.
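As a rough illustration of the extraction step, the sketch below splits a filing into sentences and keeps the ones that look like financial claims. The keyword heuristic is a deliberately simplified stand-in for the fine-tuned FinBERT classifier, which in the real pipeline scores each sentence.

```python
import re

# Stand-in vocabulary; the production system uses a fine-tuned FinBERT
# classifier here rather than keyword matching.
CLAIM_MARKERS = {"revenue", "margin", "guidance", "growth", "earnings", "outlook"}

def extract_claims(filing_text: str) -> list[str]:
    """Return sentences that plausibly contain a financial claim."""
    sentences = re.split(r"(?<=[.!?])\s+", filing_text)
    return [s.strip() for s in sentences
            if any(marker in s.lower() for marker in CLAIM_MARKERS)]

text = ("We expect revenue growth to accelerate in fiscal 2025. "
        "The board met four times this quarter. "
        "Gross margin improved to 44%.")
claims = extract_claims(text)
# keeps the revenue and margin sentences, drops the board-meeting one
```

The important property is that only claim-bearing sentences flow downstream, which is what keeps the later retrieval and agent stages cheap.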
To make the data searchable, I implemented a hybrid retrieval approach. Sentence-transformer embeddings are pushed into PostgreSQL using pgvector for fast semantic search, while the structural relationships (e.g., temporal links, insider trading actions, and contradiction edges) are mapped out in a Neo4j graph database.
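The hybrid lookup can be sketched in miniature: cosine similarity over embeddings plays the role of the pgvector query, and a plain adjacency dict plays the role of the Neo4j relationships. The toy vectors, claim IDs, and edge label below are all illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy claim embeddings; in production these come from a sentence-transformer
# and live in a pgvector column.
claims = {
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.1, 0.9, 0.0],
}
# Structural edges; in production these are relationships in Neo4j.
graph = {"c1": [("CONTRADICTS", "c2")]}

def hybrid_lookup(query_vec: list[float], top_k: int = 1):
    """Semantic nearest-neighbour search, then expand each hit with its graph edges."""
    ranked = sorted(claims, key=lambda cid: cosine(query_vec, claims[cid]),
                    reverse=True)
    return [(cid, graph.get(cid, [])) for cid in ranked[:top_k]]

result = hybrid_lookup([1.0, 0.0, 0.0])
# result → [("c1", [("CONTRADICTS", "c2")])]
```

The two stores answer different questions: pgvector finds "what does this claim sound like?", while the graph answers "what is this claim connected to over time?"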
Agent Orchestration
The core intelligence isn't a simple vector similarity check. It's an orchestrated LLM agent, run locally through Ollama and wired up with LangChain, armed with custom tools. When a new claim is ingested, the agent uses tools for negation detection, temporal reasoning, and insider-transaction lookups to evaluate the claim against historical graph data.
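A stripped-down version of the tool layer might look like the following. The tool bodies are trivial stand-ins for the real negation and temporal logic, and the registry pattern only mirrors how LangChain tools are named and dispatched; it does not use LangChain itself.

```python
import re
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    """Register a function under a name the agent can invoke (minimal stand-in
    for a LangChain tool definition)."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("detect_negation")
def detect_negation(claim: str) -> str:
    # Toy heuristic; the real tool is far more robust.
    negators = {"not", "no", "never", "without"}
    hit = any(w in claim.lower().split() for w in negators)
    return "negated" if hit else "affirmative"

@tool("temporal_check")
def temporal_check(claim: str) -> str:
    years = re.findall(r"\b(?:19|20)\d{2}\b", claim)
    return f"years mentioned: {len(years)}"

# The agent picks a tool by name and passes the claim text through.
verdict = TOOLS["detect_negation"]("Revenue did not decline in fiscal 2024")
```

Keeping tools as small named functions is what lets the agent reason step by step: it can call `detect_negation` first, then follow up with a graph lookup only when the claim's polarity matters.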
If it finds a discrepancy, it runs a severity scoring model and pushes a live alert over WebSockets straight to the frontend dashboard.
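The severity step can be sketched as a weighted score that gates whether an alert is emitted at all. The weights, the 90-day decay window, and the 0.5 threshold are invented for illustration, not the production model.

```python
def severity(contradiction_strength: float, insider_selling: bool,
             days_apart: int) -> float:
    """Combine discrepancy signals into a 0-1 severity score (illustrative weights)."""
    score = 0.6 * contradiction_strength
    score += 0.3 if insider_selling else 0.0
    # Filings closer together in time are more suspicious; decay over 90 days.
    score += 0.1 * max(0.0, 1 - days_apart / 90)
    return min(score, 1.0)

def maybe_alert(claim_id: str, score: float, threshold: float = 0.5):
    # In production this dict is pushed over a WebSocket to the dashboard.
    if score >= threshold:
        return {"claim": claim_id, "severity": round(score, 2)}
    return None

alert = maybe_alert("c1", severity(0.8, insider_selling=True, days_apart=10))
```

Gating on a single scalar keeps the alert stream quiet: weak contradictions with no insider activity never reach the dashboard.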