Privacy-Preserving Ed Discussion RAG System

As a Graduate Teaching Assistant for MGT 6655 at Georgia Tech, I engineered an end-to-end Python pipeline that anonymizes, parses, normalizes, and indexes historical Ed Discussion data. The goal was to safely reuse knowledge from prior course offerings to power a Retrieval-Augmented Generation (RAG) chatbot that accurately answers recurring course logistics and policy questions.

Rather than writing a thin API wrapper, I treated this as a Data Engineering and Applied ML project. The system emphasizes privacy (PII scrubbing), schema-tolerant parsing for unpredictable forum inputs, and a Hybrid Retrieval pipeline that balances dense similarity with lexical overlap and metadata intent priors.

  • LLM Backend: Gemma-3
  • Context Chunks: Top-K=5
  • MMR Lambda: 0.80
  • Local Embeddings: 2048-dim
Python · Retrieval-Augmented Generation (RAG) · Hybrid Search · Data Engineering · NumPy · JSONL · Slurm / GPU · LLMs (Gemma, OpenRouter)

System Architecture

1. Data Normalization & Anonymization Pipeline

Course forums are notoriously noisy and unstructured. I developed a schema-tolerant preprocessing pipeline that ingests raw forum JSON exports and builds clean, retrieval-ready corpora.

Raw Ed Discussion Exports → PII Anonymization → Schema-Tolerant Parsing → Normalization → JSONL RAG Corpus
  • Privacy First: Applied targeted anonymization scripts to strip out personally identifiable student information while retaining the core instructional context of the question and answer.
  • Schema-Tolerant Parsing: Handled nested thread replies, varying export formats, and unformatted code blocks gracefully, converting them into standardized structures.
  • SFT Datasets: Alongside the RAG index, the pipeline automatically generates aligned prompt-completion pairs in JSONL format, laying the groundwork for future Supervised Fine-Tuning (SFT) on the course workflow.
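
The anonymization and parsing steps above can be illustrated with a minimal sketch. It is not the project's actual code: the regex patterns, the placeholder tokens, and the candidate field names ("content", "body", "text", "replies", "comments", "answers") are assumptions chosen for illustration and may not match a real Ed Discussion export. A real scrubber would also need to handle student names, which simple patterns like these cannot catch.

```python
import json
import re

# Hypothetical patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION_RE = re.compile(r"@\w+")


def scrub(text: str) -> str:
    """Replace e-mail addresses first, then @mentions, with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return MENTION_RE.sub("[USER]", text)


def first_present(obj: dict, keys: tuple, default=""):
    """Return the first non-empty value among candidate keys."""
    for key in keys:
        value = obj.get(key)
        if value:
            return value
    return default


def flatten_thread(thread: dict) -> list:
    """Walk a thread and its nested replies; emit one scrubbed record per post."""
    records = []
    thread_id = str(first_present(thread, ("id", "thread_id"), "unknown"))

    def walk(post: dict, depth: int) -> None:
        body = first_present(post, ("content", "body", "text"))
        if body:
            records.append(
                {"thread_id": thread_id, "depth": depth, "text": scrub(str(body))}
            )
        # Assumed: different exports may name the reply list differently.
        for child in first_present(post, ("replies", "comments", "answers"), []):
            walk(child, depth + 1)

    walk(thread, 0)
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Trying several candidate keys per field is one simple way to tolerate schema drift between exports; unknown shapes degrade to skipped posts rather than exceptions.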

2. Hybrid Retrieval Engine (In-Memory NumPy)

Because dense embeddings often struggle with highly specific educational acronyms or exact task names, I built a custom Hybrid Retriever that operates entirely over in-memory NumPy matrices for low-latency retrieval.

Dense Embedding Phase
Computes cosine similarity between queries and forum chunks. Supports hot-swapping between a lightweight 2048-dim local hash embedder and OpenAI text-embedding-3-small via API. (Weighted at 0.48).
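
As a rough illustration of the local option, the sketch below hashes word tokens into a 2048-dimensional vector and scores chunks with a single matrix-vector product. The tokenizer, the use of MD5, and the signed-hashing trick are assumptions; the project's actual embedder may work differently.

```python
import hashlib
import re

import numpy as np

DIM = 2048  # dimensionality quoted for the local embedder


def hash_embed(text: str, dim: int = DIM) -> np.ndarray:
    """Feature-hash lowercase word tokens into an L2-normalized vector."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % dim
        # A hash-derived sign reduces the bias introduced by bucket collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cosine_scores(query: str, chunk_matrix: np.ndarray) -> np.ndarray:
    """Rows of chunk_matrix are unit vectors, so a dot product is the cosine."""
    return chunk_matrix @ hash_embed(query, chunk_matrix.shape[1])
```

Normalizing once at indexing time keeps query-time work to one mat-vec over the in-memory matrix.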
Lexical Overlap Phase
Tokenizes queries and chunks (with stopword fallback logic), computing a precision/recall blended overlap score to catch exact keyword matches missed by semantic search. (Weighted at 0.34).
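
One way such an overlap score could look is sketched below. The stopword list, the equal blend of precision and recall (`beta`), and the interpretation of "stopword fallback" as reverting to the unfiltered tokens when filtering removes everything are all assumptions made for this example.

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "to", "of", "in", "on", "for", "and", "do"}
)


def content_tokens(text: str) -> set:
    """Lowercase word tokens with stopwords removed."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    filtered = tokens - STOPWORDS
    # Fallback: if every token was a stopword, keep the unfiltered set.
    return filtered or tokens


def lexical_overlap(query: str, chunk: str, beta: float = 0.5) -> float:
    """Blend recall (query coverage) and precision (chunk focus) linearly."""
    q, c = content_tokens(query), content_tokens(chunk)
    if not q or not c:
        return 0.0
    shared = len(q & c)
    recall = shared / len(q)     # share of query terms found in the chunk
    precision = shared / len(c)  # share of chunk terms that match the query
    return beta * recall + (1 - beta) * precision
```

For example, the query "HW3 deadline" against the chunk "the HW3 deadline is friday" has recall 1 and precision 2/3, so the blended score is 5/6.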
Metadata Intent Priors
Uses heuristic-based coarse intent detection (e.g., detecting "logistics" vs "grading") to boost relevant metadata tags (like type:announcement) and drastically penalize unhelpful social chatter (e.g., introductions). (Weighted at 0.18).
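
The sketch below shows how an intent prior might be computed and combined with the dense and lexical scores using the three stated weights (0.48, 0.34, 0.18). The cue words, the tag adjustment table, and every tag other than type:announcement are hypothetical, and the linear combination is an assumption about how the weights are applied.

```python
from typing import Optional

W_DENSE, W_LEXICAL, W_PRIOR = 0.48, 0.34, 0.18  # weights stated in this section

# Hypothetical cue words and tag adjustments, for illustration only.
INTENT_CUES = {
    "logistics": ("when", "deadline", "due", "schedule"),
    "grading": ("grade", "rubric", "points", "regrade"),
}
TAG_PRIORS = {
    "logistics": {"type:announcement": 1.0, "category:introductions": -1.0},
    "grading": {"category:grading": 1.0, "category:introductions": -1.0},
}


def detect_intent(query: str) -> Optional[str]:
    """Return the first intent whose cue words appear in the query."""
    words = set(query.lower().replace("?", " ").split())
    for intent, cues in INTENT_CUES.items():
        if words & set(cues):
            return intent
    return None


def prior_score(intent: Optional[str], tags: list) -> float:
    """Sum tag adjustments for the detected intent, clamped to [-1, 1]."""
    if intent is None:
        return 0.0
    table = TAG_PRIORS[intent]
    total = sum(table.get(tag, 0.0) for tag in tags)
    return max(-1.0, min(1.0, total))


def hybrid_score(dense: float, lexical: float, prior: float) -> float:
    """Weighted sum of the three signals; the weights sum to 1.0."""
    return W_DENSE * dense + W_LEXICAL * lexical + W_PRIOR * prior
```

Because the prior can be negative, an introduction post with moderate dense similarity can still fall below an announcement with a slightly weaker match.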
MMR & Redundancy Selection
Applies Maximal Marginal Relevance (λ=0.80) alongside a thread-deduplication cap (max 2 chunks per thread) to ensure the LLM receives diverse context rather than repeated variations of the same answer.
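
A greedy selection loop of this kind can be sketched as follows, assuming unit-normalized chunk embeddings and a precomputed relevance score per chunk. The function name and signature are hypothetical; only λ=0.80, the top-5 budget, and the cap of 2 chunks per thread come from the description above.

```python
import numpy as np


def mmr_select(relevance, embeddings, thread_ids, k=5, lam=0.80, max_per_thread=2):
    """Greedy MMR over unit-norm embeddings with a per-thread cap.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected chunks).
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = embeddings @ embeddings.T  # cosine, since rows are unit vectors
    selected = []
    per_thread = {}
    candidates = set(range(len(relevance)))

    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in sorted(candidates):
            # Skip chunks whose thread has already reached its cap.
            if per_thread.get(thread_ids[i], 0) >= max_per_thread:
                continue
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        if best is None:  # every remaining candidate is capped
            break
        selected.append(best)
        candidates.remove(best)
        per_thread[thread_ids[best]] = per_thread.get(thread_ids[best], 0) + 1
    return selected
```

With λ=0.80 the selection still leans heavily on relevance; the redundancy term only displaces a chunk when it is a near-duplicate of one already chosen.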

3. Grounded LLM Chatbot

The system retrieves the top 5 context chunks and prompts an open-weights LLM (google/gemma-3-4b-it by default for local GPU inference, with optional hot-swapping to cloud APIs such as OpenRouter). To combat hallucination, a critical failure mode in higher education, the prompt instructs the model to state explicitly when an answer cannot be formed confidently from the historical context, to hedge its response, and to redirect the student to current course staff.
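
A grounded prompt along these lines might be assembled as in the sketch below. The instruction wording, the numbered context format, and the `build_prompt` helper are assumptions for illustration, not the prompt the system actually uses.

```python
# Hypothetical instruction text; the project's real prompt is not shown here.
SYSTEM_RULES = (
    "You are a course assistant. Answer only from the numbered context below, "
    "which comes from past course discussions. If the context does not clearly "
    "answer the question, say that you are not sure and direct the student to "
    "the current course staff."
)


def build_prompt(question: str, chunks: list, top_k: int = 5) -> str:
    """Assemble instructions, up to top_k numbered chunks, and the question."""
    kept = [c.strip() for c in chunks[:top_k] if c.strip()]
    if kept:
        context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(kept, start=1))
    else:
        context = "(no relevant context retrieved)"
    return (
        f"{SYSTEM_RULES}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )
```

Numbering the chunks makes it easier to inspect which passage an answer relied on when reviewing outputs.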

4. Evaluation & Reproducibility

So that the solution can scale as data from future course years is added, the project follows a structured, production-style organization.

Evaluations (measuring Hit-Rate and Mean Reciprocal Rank on historical FAQ test sets) and inference tests are packaged as modular Bash scripts. I then containerized these scripts and adapted them for Slurm-based HPC clusters, so that embedding generation and LLM inference batches can be tested at scale across multiple GPU nodes.
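
For reference, the two metrics can be computed as in this minimal sketch. It assumes each test question has a ranked list of retrieved chunk IDs and a set of IDs judged relevant; the function name and the data layout are hypothetical.

```python
def hit_rate_and_mrr(ranked_ids: list, relevant_ids: list, k: int = 5) -> tuple:
    """Return (hit rate at k, mean reciprocal rank at k) over all queries.

    ranked_ids[i] is the ranked retrieval list for query i;
    relevant_ids[i] is the set of chunk IDs judged relevant for query i.
    """
    hits, reciprocal_sum = 0, 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, chunk_id in enumerate(ranked[:k], start=1):
            if chunk_id in relevant:
                hits += 1
                reciprocal_sum += 1.0 / rank
                break  # only the first relevant result counts
    n = len(ranked_ids)
    if n == 0:
        return 0.0, 0.0
    return hits / n, reciprocal_sum / n
```

Hit rate shows whether a relevant chunk appears in the top k at all, while MRR also rewards placing it near the top of the list.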