S&P 500 companies file thousands of pages with the SEC every quarter. Sometimes, what a company claims in a rosy 10-K contradicts what an executive quietly does in a Form 4, or what the company itself filed in a previous 10-Q. Human analysts miss these temporal discrepancies. Models shouldn't.
I built PaperTrail, an event-driven microservices system that ingests SEC filings, extracts their claims, and flags discrepancies in real time, with an end-to-end latency of under five minutes from filing to user alert.
The Infrastructure
Tech Stack
- NLP: FinBERT, LangChain, spaCy NER, BART-MNLI
- Storage: PostgreSQL (pgvector), Neo4j Knowledge Graph
- Backend: FastAPI, event-driven workers (Kafka/Celery)
- Frontend: Next.js dashboard with WebSockets
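To make the event flow concrete, here is a minimal sketch of the filing event the ingestion service could publish onto the Kafka topic. The field names and example values are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class FilingEvent:
    """One SEC filing as published onto the ingestion topic (illustrative schema)."""
    accession_no: str  # EDGAR accession number (hypothetical example below)
    cik: str           # company identifier
    form_type: str     # "10-K", "10-Q", "4", ...
    filed_at: str      # ISO-8601 timestamp
    url: str           # link to the raw filing

def to_kafka_payload(event: FilingEvent) -> bytes:
    # Kafka producers take bytes; JSON keeps the payload human-debuggable.
    return json.dumps(asdict(event)).encode("utf-8")

event = FilingEvent("0000000000-24-000001", "320193", "10-K",
                    "2024-11-01T16:30:00Z", "https://www.sec.gov/example")
payload = to_kafka_payload(event)
```

Downstream workers (extraction, retrieval, scoring) each consume this same payload, which is what keeps the services decoupled.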
Hybrid Retrieval & Extraction
You can't just throw an entire 10-K at an LLM context window and ask it to find lies. Instead, the system relies on a streaming EDGAR ingestion microservice. As forms arrive, a fine-tuned FinBERT model extracts specific financial claims from the text.
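As a rough illustration of the extraction step, the sketch below splits a filing into sentences and keeps the ones that look like financial claims. The keyword heuristic is a deliberately simplified stand-in for the fine-tuned FinBERT classifier, which in the real pipeline scores each sentence.

```python
import re

# Stand-in vocabulary; the production system uses a fine-tuned FinBERT
# classifier here rather than keyword matching.
CLAIM_MARKERS = {"revenue", "margin", "guidance", "growth", "earnings", "outlook"}

def extract_claims(filing_text: str) -> list[str]:
    """Return sentences that plausibly contain a financial claim."""
    sentences = re.split(r"(?<=[.!?])\s+", filing_text)
    return [s.strip() for s in sentences
            if any(marker in s.lower() for marker in CLAIM_MARKERS)]

text = ("We expect revenue growth to accelerate in fiscal 2025. "
        "The board met four times this quarter. "
        "Gross margin improved to 44%.")
claims = extract_claims(text)
# keeps the revenue and margin sentences, drops the board-meeting one
```

The important property is that only claim-bearing sentences flow downstream, which is what keeps the later retrieval and agent stages cheap.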
To make the data searchable, I implemented a hybrid retrieval approach. Sentence-transformer embeddings are pushed into PostgreSQL using pgvector for fast semantic search, while the structural relationships (e.g., temporal links, insider trading actions, and contradiction edges) are mapped out in a Neo4j graph database.
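The hybrid lookup can be sketched in miniature: cosine similarity over embeddings plays the role of the pgvector query, and a plain adjacency dict plays the role of the Neo4j relationships. The toy vectors, claim IDs, and edge label below are all illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy claim embeddings; in production these come from a sentence-transformer
# and live in a pgvector column.
claims = {
    "c1": [0.9, 0.1, 0.0],
    "c2": [0.1, 0.9, 0.0],
}
# Structural edges; in production these are relationships in Neo4j.
graph = {"c1": [("CONTRADICTS", "c2")]}

def hybrid_lookup(query_vec: list[float], top_k: int = 1):
    """Semantic nearest-neighbour search, then expand each hit with its graph edges."""
    ranked = sorted(claims, key=lambda cid: cosine(query_vec, claims[cid]),
                    reverse=True)
    return [(cid, graph.get(cid, [])) for cid in ranked[:top_k]]

result = hybrid_lookup([1.0, 0.0, 0.0])
# result → [("c1", [("CONTRADICTS", "c2")])]
```

The two stores answer different questions: pgvector finds "what does this claim sound like?", while the graph answers "what is this claim connected to over time?"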
Agent Orchestration
The core intelligence isn't a simple vector similarity check. It's an orchestrated LLM agent, run locally through Ollama and wired up with LangChain, armed with custom tools. When a new claim is ingested, the agent uses tools for negation detection, temporal reasoning, and insider-transaction lookups to evaluate the claim against historical graph data.
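A stripped-down version of the tool layer might look like the following. The tool bodies are trivial stand-ins for the real negation and temporal logic, and the registry pattern only mirrors how LangChain tools are named and dispatched; it does not use LangChain itself.

```python
import re
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    """Register a function under a name the agent can invoke (minimal stand-in
    for a LangChain tool definition)."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("detect_negation")
def detect_negation(claim: str) -> str:
    # Toy heuristic; the real tool is far more robust.
    negators = {"not", "no", "never", "without"}
    hit = any(w in claim.lower().split() for w in negators)
    return "negated" if hit else "affirmative"

@tool("temporal_check")
def temporal_check(claim: str) -> str:
    years = re.findall(r"\b(?:19|20)\d{2}\b", claim)
    return f"years mentioned: {len(years)}"

# The agent picks a tool by name and passes the claim text through.
verdict = TOOLS["detect_negation"]("Revenue did not decline in fiscal 2024")
```

Keeping tools as small named functions is what lets the agent reason step by step: it can call `detect_negation` first, then follow up with a graph lookup only when the claim's polarity matters.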
If it finds a discrepancy, it runs a severity scoring model and pushes a live alert over WebSockets straight to the frontend dashboard.
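The severity step can be sketched as a weighted score that gates whether an alert is emitted at all. The weights, the 90-day decay window, and the 0.5 threshold are invented for illustration, not the production model.

```python
def severity(contradiction_strength: float, insider_selling: bool,
             days_apart: int) -> float:
    """Combine discrepancy signals into a 0-1 severity score (illustrative weights)."""
    score = 0.6 * contradiction_strength
    score += 0.3 if insider_selling else 0.0
    # Filings closer together in time are more suspicious; decay over 90 days.
    score += 0.1 * max(0.0, 1 - days_apart / 90)
    return min(score, 1.0)

def maybe_alert(claim_id: str, score: float, threshold: float = 0.5):
    # In production this dict is pushed over a WebSocket to the dashboard.
    if score >= threshold:
        return {"claim": claim_id, "severity": round(score, 2)}
    return None

alert = maybe_alert("c1", severity(0.8, insider_selling=True, days_apart=10))
```

Gating on a single scalar keeps the alert stream quiet: weak contradictions with no insider activity never reach the dashboard.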