LLM Orchestration OCR Pipeline Springer Published

SkillSet Sherpa: AI Career Counseling via OCR and LLMs

Apr 15, 2025 3 min read

Traditional career counseling is often generic, failing to keep pace with the dynamic job market or account for a student's unique combination of aptitude and academic performance. We wanted a purely data-driven approach.

My team and I built SkillSet Sherpa, an AI-powered system that reads student marksheets using Optical Character Recognition (OCR) and combines the extracted grades with the results of a psychometric test to generate personalized career recommendations from an LLM. This research was published in Springer's Lecture Notes in Networks and Systems.

The Architecture

Tech Stack

  • Backend: Python, Flask
  • Computer Vision: EasyOCR (CNN-based)
  • Data Processing: Pandas, Openpyxl
  • AI: LLM integration & Prompt Engineering

The OCR Pipeline

Extracting structured tabular data from messy, scanned marksheets is notoriously difficult. We benchmarked several OCR engines, including Tesseract, PaddleOCR, and docTR, before selecting EasyOCR for its balance of accuracy and ease of integration.

Once a user uploads a scan through our Flask backend, the CNN-based EasyOCR model reads the image. We then use openpyxl and pandas to clean up the extracted text, map it to column headers, and save the structured grades as a CSV file for analysis.
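A minimal sketch of the text-to-table step, under stated assumptions: detections arrive in the shape EasyOCR's `readtext` returns (a list of `(bbox, text, confidence)` tuples, where `bbox` holds four `[x, y]` corner points), and cells on the same table row share roughly the same vertical position. The helper names (`group_into_rows`, `rows_to_csv`), the tolerance, and the confidence cutoff are hypothetical, and the standard `csv` module stands in for pandas to keep the example dependency-free. It is not the published implementation.

```python
import csv
import io


def group_into_rows(detections, y_tolerance=10, min_conf=0.3):
    """Cluster OCR detections into table rows by vertical position."""
    cells = []
    for bbox, text, conf in detections:
        if conf < min_conf:
            continue  # drop low-confidence noise (cutoff is an assumption)
        ys = [point[1] for point in bbox]
        xs = [point[0] for point in bbox]
        cells.append((sum(ys) / len(ys), min(xs), text.strip()))
    cells.sort()  # top-to-bottom, then left-to-right

    rows, current, current_y = [], [], None
    for y, x, text in cells:
        # Start a new row when this cell sits too far below the row's first cell.
        if current and abs(y - current_y) > y_tolerance:
            rows.append([t for _, t in sorted(current)])
            current = []
        if not current:
            current_y = y
        current.append((x, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


def rows_to_csv(rows):
    """Serialize rows (first row = headers) to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hand-written stand-in for real output. With EasyOCR installed, detections
# would come from: easyocr.Reader(["en"]).readtext("marksheet.png")
sample = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "Subject", 0.98),
    ([[200, 12], [260, 12], [260, 32], [200, 32]], "Marks", 0.97),
    ([[10, 50], [90, 50], [90, 70], [10, 70]], "English", 0.95),
    ([[200, 52], [240, 52], [240, 72], [200, 72]], "88", 0.91),
    ([[10, 90], [90, 90], [90, 110], [10, 110]], "Physics", 0.93),
    ([[200, 91], [240, 91], [240, 111], [200, 111]], "74", 0.89),
    ([[120, 130], [150, 130], [150, 140], [120, 140]], "~", 0.05),
]

table = group_into_rows(sample)
grades_csv = rows_to_csv(table)
print(grades_csv)
```

Real marksheets are rarely this tidy: skewed scans and multi-line cells would likely need deskewing or a more forgiving row-clustering rule than a fixed pixel tolerance.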

Psychometrics: The RIASEC Model

Grades only tell half the story. We implemented the Holland Codes (RIASEC) test to quantify a user's affinity for Realistic, Investigative, Artistic, Social, Enterprising, and Conventional work environments. The backend ingests the raw survey inputs and converts them into normalized percentage scores, creating a numeric profile of the user's vocational interests.
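One way the normalization could work, sketched under assumptions the text does not confirm: each dimension is measured by a few Likert-style questions answered on a 1 to 5 scale, and a dimension's percentage is its position between the lowest and highest possible total. The function names and the sample answers are hypothetical.

```python
RIASEC = [
    "Realistic", "Investigative", "Artistic",
    "Social", "Enterprising", "Conventional",
]


def normalize_riasec(answers, scale_min=1, scale_max=5):
    """Convert raw per-dimension answers into 0-100 percentage scores."""
    profile = {}
    for dim in RIASEC:
        values = answers.get(dim, [])
        if not values:
            profile[dim] = 0.0  # unanswered dimension (assumed to score zero)
            continue
        if any(v < scale_min or v > scale_max for v in values):
            raise ValueError(f"answer out of range for {dim}: {values}")
        lowest = scale_min * len(values)
        span = (scale_max - scale_min) * len(values)
        profile[dim] = round(100 * (sum(values) - lowest) / span, 1)
    return profile


def holland_code(profile, length=3):
    """Return the initials of the top-scoring dimensions, e.g. 'ASI'."""
    ranked = sorted(profile, key=lambda d: (-profile[d], RIASEC.index(d)))
    return "".join(dim[0] for dim in ranked[:length])


# Made-up survey answers: three questions per dimension.
answers = {
    "Realistic": [2, 1, 3],
    "Investigative": [4, 3, 4],
    "Artistic": [5, 5, 4],
    "Social": [4, 5, 4],
    "Enterprising": [2, 3, 2],
    "Conventional": [1, 2, 1],
}

profile = normalize_riasec(answers)
print(profile)
print(holland_code(profile))  # ASI
```

Scoring each dimension against its own maximum keeps the six percentages independent, so a student can rate highly on several dimensions at once. Normalizing the scores to sum to 100 would be an equally plausible reading of "normalized".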

LLM Orchestration & Prompt Engineering

The final step is where the system actually "thinks". The pipeline dynamically injects both the academic CSV data and the normalized RIASEC percentages into a highly structured prompt.

Because the LLM must consider both datasets at once, it outputs specific career paths and justifies its reasoning by explicitly mapping them to the required educational streams (e.g., advising a student to major in English or Mass Communication if they scored high in Artistic/Social traits and excelled in language subjects).
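The injection step can be illustrated with a plain string template. This is a sketch only: the template wording, the `build_prompt` helper, and the sample inputs are assumptions, and the call to the LLM itself is left out because the provider is not specified here.

```python
PROMPT_TEMPLATE = """You are a career counselor for school students.

## Academic record (CSV)
{grades_csv}

## Interest profile (RIASEC, percent)
{riasec_lines}

## Task
Recommend {n} career paths. For each one, name the educational stream
the student should pursue, and justify the choice by citing at least one
subject mark and one interest score from the data above.
"""


def build_prompt(grades_csv, riasec_profile, n=3):
    """Fill the template with the grades CSV and the RIASEC percentages."""
    # List the strongest dimensions first so they are easy to spot.
    ranked = sorted(riasec_profile.items(), key=lambda item: -item[1])
    riasec_lines = "\n".join(f"- {dim}: {score:.1f}%" for dim, score in ranked)
    return PROMPT_TEMPLATE.format(
        grades_csv=grades_csv.strip(),
        riasec_lines=riasec_lines,
        n=n,
    )


# Made-up inputs in the shape the earlier stages are described as producing.
prompt = build_prompt(
    "Subject,Marks\nEnglish,88\nPhysics,74\n",
    {"Social": 83.3, "Artistic": 91.7, "Realistic": 25.0},
)
print(prompt)
```

Keeping the two datasets in separately labeled sections, and asking for a justification that cites both, is one way to push the model toward recommendations grounded in the student's actual data.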

SkillSet Sherpa Chat UI showing career recommendations
The final output: The LLM synthesizing OCR marks data and RIASEC scores to provide personalized counseling.