Privacy-Preserving Ed Discussion RAG System
As a Graduate Teaching Assistant for MGT 6655 at Georgia Tech, I engineered an end-to-end Python pipeline for anonymizing, parsing, normalizing, and indexing historical Ed Discussion data. The primary objective was safely recycling massive amounts of prior course knowledge to power a Retrieval-Augmented Generation (RAG) chatbot capable of accurately answering recurring course logistics and policy questions.
Rather than simply writing a naive API wrapper, I approached this as a robust Data Engineering and Applied ML project. The system emphasizes privacy (PII scrubbing), schema-tolerant parsing for unpredictable forum inputs, and an advanced Hybrid Retrieval pipeline that balances dense similarity with lexical overlap and metadata intent priors.
System Architecture
1. Data Normalization & Anonymization Pipeline
Course forums are notoriously noisy and unstructured. I developed a schema-tolerant preprocessing pipeline that ingests raw forum JSON exports and builds clean, retrieval-ready corpora.
- Privacy First: Applied targeted anonymization scripts to strip out personally identifiable student information while retaining the core instructional context of the question and answer.
- Schema-Tolerant Parsing: Handled nested thread replies, varying export formats, and unformatted code blocks gracefully, converting them into standardized structures.
- SFT Datasets: Alongside the RAG index, the pipeline automatically generates aligned prompt-completion pairs in
JSONLformat, laying the groundwork for future Supervised Fine-Tuning (SFT) on the course workflow.
2. Hybrid Retrieval Engine (In-Memory NumPy)
Because dense embeddings often struggle with highly-specific educational acronyms or exact task names, I built a custom Hybrid Retriever operating strictly over in-memory NumPy matrices for zero-latency retrieval.
text-embedding-3-small via API. (Weighted at 0.48).type:announcement) and drastically penalize unhelpful social chatter (e.g., introductions). (Weighted at 0.18).3. Grounded LLM Chatbot
The system retrieves the top-5 optimal context chunks and prompts an open-weights LLM (defaulting to google/gemma-3-4b-it for local GPU inference, with optional hot-swapping to cloud APIs like OpenRouter). To combat hallucination—a critical failure mode in higher education—the system is heavily prompt-engineered to explicitly state when an answer cannot confidently be formed from historical context, hedging the response and redirecting the student to modern course staff.
4. Evaluation & Reproducibility
To ensure the solution scales effectively for future multi-year data expansion, the project employs a highly structured, production-style organization.
