End-to-end pipeline from web scraping to knowledge graph-based QA system
Scraped GWU course catalog and schedule data
Cleaned and structured data for training
Built structured relationship graph
Fine-tuned with graph-augmented data
Interactive project demonstration
All course data was collected via automated web scraping of two official GWU sources, the first step in building the dataset.
Source: https://bulletin.gwu.edu/courses/csci/
Method: Python Requests + BeautifulSoup
Extracted: Course codes, titles, descriptions, prerequisites, credits
Output: 187 courses (CSCI & DATS programs)
Scraped on 2025-12-04. Includes course descriptions with embedded prerequisite information used for graph construction.
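A minimal sketch of the catalog scrape, following the stated Requests + BeautifulSoup method. The CSS class names (courseblock, courseblocktitle, courseblockdesc) and the title/credit text format are assumptions about the bulletin's markup, not verified selectors.

```python
# Catalog scrape sketch: fetch the bulletin page and pull out course code,
# title, credits, description, and embedded prerequisite text.
import re

import requests
from bs4 import BeautifulSoup

CATALOG_URL = "https://bulletin.gwu.edu/courses/csci/"

def scrape_catalog(url: str = CATALOG_URL) -> list[dict]:
    """Return one dict per course block found on the bulletin page."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    courses = []
    for block in soup.select("div.courseblock"):  # assumed block container
        title_el = block.select_one("p.courseblocktitle")
        desc_el = block.select_one("p.courseblockdesc")
        if title_el is None:
            continue
        title_text = title_el.get_text(" ", strip=True)
        desc_text = desc_el.get_text(" ", strip=True) if desc_el else ""

        # Assumed title format: "CSCI 1111. Course Title. 3 Credits."
        match = re.match(
            r"(CSCI|DATS)\s+(\d+)\.\s+(.*?)\.\s+([\d.]+(?:-[\d.]+)?)\s+Credit",
            title_text,
        )
        if not match:
            continue
        dept, number, name, credits = match.groups()

        # Prerequisites are embedded in the description text.
        prereq_match = re.search(r"Prerequisites?:\s*(.+?)(?:\.|$)", desc_text)
        courses.append({
            "code": f"{dept} {number}",
            "title": name,
            "credits": credits,
            "description": desc_text,
            "prerequisites": prereq_match.group(1) if prereq_match else None,
        })
    return courses
```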
Source: https://my.gwu.edu/mod/pws/courses.cfm
Method: Python Requests + Pandas parsing
Extracted: CRNs, sections, instructors, times, rooms, enrollment status
Output: 586 course instances (Spring 2026 term)
Scraped on 2025-12-04. Maps courses to instructors for "taught_by" graph edges. Includes scheduling data for Q&A generation.
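A comparable sketch of the schedule scrape, following the stated Requests + Pandas method. The form parameter names (campId, termId, subjId) and the assumed column layout are placeholders; the real my.gwu.edu page defines its own query fields and table structure.

```python
# Schedule scrape sketch: fetch one subject's listing and parse its HTML tables.
from io import StringIO

import pandas as pd
import requests

SCHEDULE_URL = "https://my.gwu.edu/mod/pws/courses.cfm"

# Assumed column layout of the section table; the live page may differ.
COLUMNS = ["status", "crn", "subject", "course", "section", "title",
           "credits", "instructor", "location", "days_times"]

def scrape_schedule(term_id: str = "202601", subject: str = "CSCI") -> pd.DataFrame:
    """Fetch one subject's Spring 2026 schedule and return a tidy DataFrame."""
    # Placeholder field names for the form parameters the endpoint expects.
    params = {"campId": "1", "termId": term_id, "subjId": subject}
    resp = requests.get(SCHEDULE_URL, params=params, timeout=30)
    resp.raise_for_status()

    # read_html returns every <table> on the page; keep only tables whose
    # width matches the expected section-row layout.
    tables = pd.read_html(StringIO(resp.text))
    section_tables = [t for t in tables if t.shape[1] == len(COLUMNS)]
    df = pd.concat(section_tables, ignore_index=True)
    df.columns = COLUMNS
    return df
```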
Note: The scraped catalog and schedule data supply the raw material for both knowledge graph construction and training Q&A pair generation.
Direct fine-tuning approach with optimized hyperparameters and validation
Knowledge graph construction with graph-augmented training and retrieval
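The sketch below shows one way to turn the scraped records into the relationship graph with prerequisite and "taught_by" edges. networkx is an illustrative choice rather than the library actually used, and the record fields match the two scraping sketches above; prerequisite course codes are assumed to be already extracted from the description text.

```python
# Relationship-graph construction sketch over the scraped catalog + schedule.
import networkx as nx
import pandas as pd

def build_course_graph(catalog: list[dict], schedule: pd.DataFrame) -> nx.MultiDiGraph:
    """Turn catalog and schedule records into a directed knowledge graph."""
    graph = nx.MultiDiGraph()

    for course in catalog:
        graph.add_node(course["code"], type="course",
                       title=course["title"], credits=course["credits"])
        # "has_prerequisite" edges come from the prerequisite text embedded in
        # the catalog descriptions (assumed already split into course codes).
        for prereq_code in course.get("prereq_codes", []):
            graph.add_edge(course["code"], prereq_code, relation="has_prerequisite")

    # "taught_by" edges map each scheduled section's course to its instructor.
    for row in schedule.itertuples(index=False):
        course_code = f"{row.subject} {row.course}"
        graph.add_node(row.instructor, type="instructor")
        graph.add_edge(course_code, row.instructor, relation="taught_by",
                       crn=row.crn, section=row.section)
    return graph

# Example retrieval used for Q&A generation: who teaches a given course?
# instructors = [v for _, v, d in graph.out_edges("CSCI 6364", data=True)
#                if d.get("relation") == "taught_by"]
```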