This repository is a knowledge graph of all UWaterloo programs, majors, courses, and topics. It is similar to HyperPhysics, except it is generated algorithmically for any and all topics instead of just physics.
This repository contains four Python scripts that scrape academic programs, majors, courses, and syllabi from the University of Waterloo's academic calendar website:

- `programscrape.py`: Scrapes all undergraduate programs and their links into `programs.json`.
- `majorscrape.py`: Scrapes the majors under each program from `programs.json` and saves them into `majors.json`.
- `coursescraper.py`: Scrapes the courses under each major from `majors.json` and saves them into `courses.json`.
- `syllabuscraper.py`: Scrapes the syllabi under each course from `courses.json` and saves them into `syllabi.json`.
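As a rough illustration of how these outputs chain together, each script reads the previous script's JSON file and follows the links it contains. The schema below is an assumption for illustration only, not the exact format the scrapers emit:

```python
import json

# Hypothetical shape of programs.json -- the real field names may differ.
programs_json = '[{"name": "Mathematics", "link": "https://uwaterloo.ca/..."}]'

def links_of(items):
    """Yield the link of each scraped item so the next scraper can visit it."""
    for item in items:
        yield item["link"]

# majorscrape.py would open each of these URLs with Selenium:
for url in links_of(json.loads(programs_json)):
    print(url)
```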
- Python 3.9+: Download Python
- Chrome WebDriver: Required for Selenium automation.
  - Download it from ChromeDriver.
  - Ensure it is added to your system `PATH`, or place it in the project directory.
- Git (optional): For cloning the repository.
```shell
pip install selenium
pip install beautifulsoup4
pip install spacy
```
```shell
git clone https://github.com/tumph/hyperloo.git
cd scrapers
pip install selenium
pip install beautifulsoup4
pip install spacy
```

Step 1: Scrape Programs

Run the first script to generate `programs.json`:
The script is located at `Hyperloo/scrapers/programscraper/programscrape.py`.

```shell
python programscrape.py
```

Step 2: Scrape Majors

After `programs.json` is generated, run the second script to scrape majors:
The script is located at `Hyperloo/scrapers/majorscraper/majorscraper.py`.

```shell
python majorscrape.py
```

Step 3: Scrape Courses

After `majors.json` is generated, run the third script to scrape courses:
The script is located at `Hyperloo/scrapers/coursescraper/coursescraper.py`.

```shell
python coursescraper.py
```

Note that on macOS you will need to configure a virtual environment (venv) in order to run the Python scripts and pip.
Step 3b: STEM Major Scrape

You need to run

```shell
python STEMfilter.py
```

in order to generate the `stem_majors.json` file. This file is used to filter out the majors that are not relevant to the topic of interest.
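A minimal sketch of the kind of filtering `STEMfilter.py` might perform; the keyword list and matching rule here are assumptions for illustration, not the script's actual logic:

```python
# Hypothetical STEM keyword filter -- the real STEMfilter.py logic may differ.
STEM_KEYWORDS = {"mathematics", "science", "engineering", "computer",
                 "physics", "chemistry", "biology", "statistics"}

def is_stem(major_name: str) -> bool:
    """Heuristically decide whether a major name looks STEM-related."""
    words = major_name.lower().split()
    return any(keyword in words for keyword in STEM_KEYWORDS)

# Keep only the majors that match the keyword heuristic:
stem_majors = [m for m in ["Computer Science", "English Literature", "Physics"]
               if is_stem(m)]
```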
Step 4: Scrape Syllabi

After `courses.json` is generated, run the fourth script to scrape syllabi. The script is located at `Hyperloo/scrapers/syllabuscraper/syllabuscraper.py`.

```shell
python syllabuscraper.py
```

This creates the `syllabi.json` file.
Generating the NLP model is the most time-consuming part of the process. Training takes a few hours, so we made a chunker that splits the syllabi text into 60 chunks that are all processed in parallel. The chunker is located in the NLP folder; it is a Python script that takes `syllabi.json` as input and outputs a new folder called `chunks` that contains the chunked `syllabi.json` files.
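The chunking step can be sketched roughly as follows, assuming `syllabi.json` holds a JSON list; the real chunker in the NLP folder may differ in details such as file naming:

```python
import json
from pathlib import Path

NUM_CHUNKS = 60  # the chunker splits the syllabi into 60 chunks

def chunk_syllabi(syllabi, num_chunks=NUM_CHUNKS, out_dir="chunks"):
    """Split a list of syllabi into up to num_chunks JSON files under out_dir."""
    Path(out_dir).mkdir(exist_ok=True)
    size = max(1, -(-len(syllabi) // num_chunks))  # ceiling division
    paths = []
    for i in range(0, len(syllabi), size):
        path = Path(out_dir) / f"syllabi_{i // size}.json"
        path.write_text(json.dumps(syllabi[i:i + size]))
        paths.append(path)
    return paths
```

Each chunk file can then be handed to its own worker process for parallel processing.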
Run

```shell
python NLPtrainer.py
```

in order to train the NLP model. This will create a new folder called `syllabus_classifierv4` that contains the trained model.
Then, go into `NLP/Processing` and run these commands, as listed in `commands.txt`:

```shell
python split_syllabi.py
chmod +x run_parallel.sh
./run_parallel.sh
cat trees/trees_*.jsonl > final_trees.jsonl
# combine the error jsonls as well
cat missedtrees/trees_*.jsonl > final_missed_trees.jsonl
```

Taking the `final_trees.jsonl` file, we can generate the knowledge graph. The knowledge graph is a JSON file that contains all the information about the topics, majors, and courses. It is located in the UI folder.
Convert the `final_trees.jsonl` file into a JSON file, and then run the `Graph.js` file on it.
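Since JSON Lines is just one JSON object per line, the conversion can be sketched like this (the file names follow the steps above; adjust as needed):

```python
import json

def jsonl_to_list(jsonl_text):
    """Parse JSON Lines text (one JSON object per line) into a Python list."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Example usage -- write final_trees.jsonl out as a single JSON array
# that Graph.js can consume:
# with open("final_trees.jsonl") as src, open("trees.json", "w") as dst:
#     json.dump(jsonl_to_list(src.read()), dst, indent=2)
```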
You are done!