SEAS Search: KG-Based Course QA

System Architecture

End-to-end pipeline from web scraping to knowledge graph-based QA system

Data Pipeline

Web Scraping

Scraped GWU course catalog and schedule data

Step 1
Components
bulletin.gwu.edu/courses/csci/
my.gwu.edu/mod/pws/courses.cfm
Outputs
bulletin_courses.csv
spring_2026_courses.csv
Details
  • Bulletin scraping: Course descriptions, prerequisites, credits
  • Schedule scraping: Instructors, times, rooms, CRNs, enrollment status
  • 187 courses from CSCI & DATS programs
  • 586 course instances for Spring 2026

Data Processing

Cleaned and structured data for training

Step 2
Components
prepare_dataset.py
CSV → JSONL conversion
Outputs
course_finetune.jsonl
Details
  • Generated 2,828 Q&A pairs from course data
  • Multiple question variations per course
  • Categories: Schedule, Prerequisites, Catalog, Faculty
  • OpenAI Chat format (system, user, assistant)
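
To make Step 2 concrete, here is a minimal sketch of the CSV → JSONL conversion into OpenAI chat format. The file names come from the pipeline above; the CSV column names (code, title, description) are assumptions about the scraped schema, and the real prepare_dataset.py generates multiple question variations per course.

```python
# Minimal sketch of the CSV -> JSONL core loop. Column names (code, title,
# description) are assumed; the real script emits several Q&A variations
# per course across the Schedule/Prerequisites/Catalog/Faculty categories.
import json

import pandas as pd

df = pd.read_csv("bulletin_courses.csv")

with open("course_finetune.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {
            "messages": [
                {"role": "system", "content": "You are a GWU course advisor."},
                {"role": "user", "content": f"What is {row['code']} about?"},
                {"role": "assistant", "content": f"{row['title']}: {row['description']}"},
            ]
        }
        f.write(json.dumps(record) + "\n")
```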

Knowledge Graph

Built structured relationship graph

Step 3
Components
NetworkX graph construction
Prerequisite extraction (regex)
Topic extraction (NLP)
Outputs
kg_graph.pkl
course_finetune_kg_rag.jsonl
Details
  • Nodes: Courses, Professors, Topics
  • Edges: Prerequisites, Taught_by, Covers_topic
  • Generated 200 multi-hop Q&A pairs
  • Graph context injection for training
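
A minimal sketch of the graph build follows, assuming the same CSV schema as above; the prerequisite regex shown here is illustrative, not the project's exact pattern, and the node/edge attribute names are assumptions.

```python
# Sketch of the NetworkX build with regex-based prerequisite extraction.
# The regex, column names, and attribute names are illustrative assumptions.
import pickle
import re

import networkx as nx
import pandas as pd

COURSE_RE = re.compile(r"(?:CSCI|DATS)\s*\d{4}")

df = pd.read_csv("bulletin_courses.csv")  # assumed columns: code, description
G = nx.DiGraph()

for _, row in df.iterrows():
    G.add_node(row["code"], type="course")
    # Look for an explicit "Prerequisite(s): ..." clause in the description.
    m = re.search(r"[Pp]rerequisites?:([^.]*)", str(row["description"]))
    if m:
        for prereq in COURSE_RE.findall(m.group(1)):
            G.add_edge(prereq, row["code"], relation="prerequisite")

with open("kg_graph.pkl", "wb") as f:
    pickle.dump(G, f)
```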

Model Training

Fine-tuned with graph-augmented data

Step 4
Components
Llama 3.1 8B (4-bit quantization)
LoRA fine-tuning (rank=32)
Cosine annealing schedule
Outputs
lora_model_kg_qa/
merged_model_kg_qa/
Details
  • 5 epochs with early stopping
  • Validation split (80/20)
  • Final loss: 0.30 (accuracy pending)
  • Training time: ~4.2 minutes on GPU
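
A sketch of how this setup maps onto Unsloth's API, using the LoRA hyperparameters reported in the configuration tables below; the exact 4-bit checkpoint name is an assumption.

```python
# Sketch of the 4-bit + LoRA model setup via Unsloth. The checkpoint name
# is an assumption; rank/alpha/dropout match the configuration tables below.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # assumed checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```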

Frontend Showcase

Interactive project demonstration

Step 5
Components
Next.js 16 + React 19
react-force-graph-2d
Recharts visualization
Outputs
Static web showcase
Details
  • No backend inference required
  • Static JSON data loading
  • Interactive graph visualization
  • Comprehensive methodology walkthrough
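
Because the showcase runs no backend inference, the graph can be exported once to static JSON in the {nodes, links} shape that react-force-graph-2d consumes. A sketch, with the public/ output path assumed from a typical Next.js layout:

```python
# Sketch: dump the pickled NetworkX graph to static JSON for the frontend.
# The public/ output path is an assumption about the Next.js project layout.
import json
import pickle

with open("kg_graph.pkl", "rb") as f:
    G = pickle.load(f)

data = {
    "nodes": [{"id": n, **G.nodes[n]} for n in G.nodes],
    "links": [{"source": u, "target": v, **G.edges[u, v]} for u, v in G.edges],
}

with open("public/kg_graph.json", "w") as f:
    json.dump(data, f)
```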

Web Scraping Details

All course data was collected via automated web scraping from official GWU sources. This was a critical first step in building the dataset.

Bulletin Scraping

Source: https://bulletin.gwu.edu/courses/csci/

Method: Python Requests + BeautifulSoup

Extracted: Course codes, titles, descriptions, prerequisites, credits

Output: 187 courses (CSCI & DATS programs)

Scraped on 2025-12-04. Includes course descriptions with embedded prerequisite information used for graph construction.
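
A hedged sketch of the bulletin scrape with Requests + BeautifulSoup; the "courseblock" CSS selectors reflect common CourseLeaf bulletin markup and are assumptions that may need adjustment against the live page.

```python
# Sketch of the bulletin scrape. The "courseblock" selectors are assumptions
# about the page's markup, not confirmed from the project's scraper.
import pandas as pd
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://bulletin.gwu.edu/courses/csci/", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for block in soup.select("div.courseblock"):
    title = block.select_one(".courseblocktitle")
    desc = block.select_one(".courseblockdesc")
    rows.append({
        "title": title.get_text(" ", strip=True) if title else "",
        "description": desc.get_text(" ", strip=True) if desc else "",
    })

pd.DataFrame(rows).to_csv("bulletin_courses.csv", index=False)
```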

Schedule Scraping

Source: https://my.gwu.edu/mod/pws/courses.cfm

Method: Python Requests + Pandas parsing

Extracted: CRNs, sections, instructors, times, rooms, enrollment status

Output: 586 course instances (Spring 2026 term)

Scraped on 2025-12-04. Maps courses to instructors for "taught_by" graph edges. Includes scheduling data for Q&A generation.
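
A sketch of the schedule scrape with Requests + pandas; the query parameters passed to courses.cfm (campus, term, and subject IDs) are assumptions about the form fields behind the page.

```python
# Sketch of the schedule scrape. The campId/termId/subjId parameter names
# and values are assumptions about the pws endpoint's form fields.
from io import StringIO

import pandas as pd
import requests

resp = requests.get(
    "https://my.gwu.edu/mod/pws/courses.cfm",
    params={"campId": "1", "termId": "202601", "subjId": "CSCI"},
    timeout=30,
)
resp.raise_for_status()

tables = pd.read_html(StringIO(resp.text))  # needs lxml or html5lib installed
schedule = pd.concat(tables, ignore_index=True)
schedule.to_csv("spring_2026_courses.csv", index=False)
```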

Note: Web scraping was essential for obtaining structured course data. Without this step, we would not have had the raw material for building the knowledge graph or generating training Q&A pairs.

Model Architecture Diagrams

Optimized Fine-tuning Architecture

Direct fine-tuning approach with optimized hyperparameters and validation


KG-Based QA System Architecture

Knowledge graph construction with graph-augmented training and retrieval


Hyperparameters & Configuration

Optimized Fine-tuning

LoRA Configuration
LoRA Rank (r): 32
LoRA Alpha: 32
LoRA Dropout: 0.05
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Bias: none
Training Configuration
Base Model: Llama 3.1 8B
Quantization: 4-bit (bitsandbytes)
Max Sequence Length: 2048
Training Samples: 2,828
Train/Val Split: 80/20 (1,920 / 480)
Epochs: 5 (with early stopping)
Early Stopping Patience: 3 evaluations
Early Stopping Threshold: 0.001
Optimization
Learning Rate: 1e-4
LR Scheduler: Cosine Annealing
Warmup Ratio: 0.1 (10%)
Optimizer: AdamW 8-bit
Weight Decay: 0.01
Adam Beta1: 0.9
Adam Beta2: 0.999
Batch Configuration
Per Device Train Batch Size: 4
Per Device Eval Batch Size: 4
Gradient Accumulation Steps: 2
Effective Batch Size: 8
Performance Metrics
Final Training Loss: 0.75
Final Validation Loss: N/A
Accuracy: Pending
Training Time: 33.7 minutes
GPU: Tesla T4 (14.7 GB)
Peak Memory Usage: 7.46 GB (50.6%)
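
These settings translate almost one-to-one into Hugging Face TrainingArguments plus an EarlyStoppingCallback; a sketch, with the output path and evaluation cadence assumed (older transformers versions spell eval_strategy as evaluation_strategy):

```python
# Sketch mapping the table above onto transformers' trainer configuration.
# output_dir and the eval/save cadence are assumptions.
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="outputs",                  # assumed path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,         # effective batch size: 4 * 2 = 8
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_8bit",
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    eval_strategy="steps",                 # assumed cadence
    save_strategy="steps",
    load_best_model_at_end=True,           # required for early stopping
)
early_stop = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001,
)
```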

KG-Based QA System

LoRA Configuration
LoRA Rank (r): 32
LoRA Alpha: 32
LoRA Dropout: 0.05
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Bias: none
Training Configuration
Base Model: Llama 3.1 8B
Quantization: 4-bit (bitsandbytes)
Max Sequence Length: 2048
Training Samples: 195 (KG-RAG format)
Train/Val Split: 80/20 (156 / 39)
Epochs: 5 (with early stopping)
Early Stopping Patience: 3 evaluations
Early Stopping Threshold: 0.001
Optimization
Learning Rate: 1e-4
LR Scheduler: Cosine Annealing
Warmup Ratio: 0.1 (10%)
Optimizer: AdamW 8-bit
Weight Decay: 0.01
Adam Beta1: 0.9
Adam Beta2: 0.999
Batch Configuration
Per Device Train Batch Size: 2
Per Device Eval Batch Size: 2
Gradient Accumulation Steps: 4
Effective Batch Size: 8
Knowledge Graph
Total Nodes: 489
Courses: 85
Professors: 76
Topics: 328
Total Edges: 566
Prerequisite Edges: 40
Taught By Edges: 201
Covers Topic Edges: 325
Performance Metrics
Final Training Loss: 0.30
Final Validation Loss: N/A (early stopping)
Accuracy: Pending
Training Time: 4.2 minutes
GPU: NVIDIA A100-SXM4-40GB
Peak Memory Usage: 7.58 GB (19.2%)
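
For graph-augmented retrieval, multi-hop questions reduce to short graph walks. A sketch of a two-hop prerequisite lookup, assuming the edge attributes from the construction sketch above (the course code is illustrative):

```python
# Sketch of a two-hop prerequisite walk used to build graph context for
# multi-hop Q&A. Attribute names and the course code are assumptions.
import pickle

with open("kg_graph.pkl", "rb") as f:
    G = pickle.load(f)

def prereq_chain(course, depth=2):
    """Collect courses reachable by walking prerequisite edges backwards."""
    frontier, found = {course}, set()
    for _ in range(depth):
        frontier = {
            u
            for node in frontier if node in G
            for u, _, d in G.in_edges(node, data=True)
            if d.get("relation") == "prerequisite"
        }
        found |= frontier
    return found

print(prereq_chain("CSCI 6364"))  # illustrative course code
```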

Technology Stack

Data Collection

Python Requests
HTTP scraping library
BeautifulSoup4
HTML parsing
Pandas
Data manipulation

ML/AI Stack

Llama 3.1 8B
Base LLM model
Unsloth
Efficient fine-tuning
HuggingFace
Model hub & transformers
LoRA
Parameter-efficient training

Knowledge Graph

NetworkX
Graph construction & analysis
spaCy
NLP for topic extraction
Regex patterns
Prerequisite parsing

Frontend

Next.js 16
React framework
Tailwind CSS
Utility-first styling
Recharts
Data visualization
react-force-graph-2d
Graph visualization
Framer Motion
Animations