djchie/webreg_scrapy

Webreg Scrapy

A web scraper that retrieves UCI course information from the UCI University Registrar. I built it for the UCI Course API.

Table of Contents

  1. Usage
    1. Process
  2. Requirements
  3. Development
    1. Installing Dependencies
    2. Running the Scraper
    3. Handling UCI Data Changes
    4. Roadmap
  4. Contributing

Usage

Use this scraper to grab course information and import it into a PostgreSQL database.

Process

  1. The scraper is hosted on Heroku
  2. It executes the department spider to grab the updated list of departments
  3. It executes a course spider for each department in the department list
  4. It uploads all the scraped information to the AWS RDS PostgreSQL database
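The flow of steps 2 to 4 can be sketched in plain Python. This is only an illustration, not the project's actual code: `run_scrape` and its three callables are hypothetical names, the stand-ins below replace the real spiders and database, and uploading once per department (rather than once at the end) is an assumption.

```python
from typing import Callable, Iterable, List


def run_scrape(
    fetch_departments: Callable[[], Iterable[str]],
    crawl_courses: Callable[[str], Iterable[dict]],
    upload: Callable[[List[dict]], None],
) -> int:
    """Fetch the department list, crawl each department, upload the results.

    Returns the total number of course records handed to `upload`.
    """
    total = 0
    for dept in fetch_departments():         # step 2: department spider
        courses = list(crawl_courses(dept))  # step 3: one course crawl per department
        upload(courses)                      # step 4: push to the database
        total += len(courses)
    return total


# Hypothetical stand-ins so the sketch runs without network or database access.
stored: List[dict] = []
count = run_scrape(
    fetch_departments=lambda: ["COMPSCI", "MATH"],
    crawl_courses=lambda dept: [{"department": dept, "title": f"{dept} example"}],
    upload=stored.extend,
)
```

Passing the three steps in as callables keeps the ordering visible and makes it easy to swap the stand-ins for real spiders and a real database writer.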

Requirements

  • PostgreSQL

Development

Installing Dependencies

From within the root directory:

pip install -r requirements.txt

Running the Scraper

Start a PostgreSQL server with the required relations (tables) already set up, then run:

# Crawl courses into the database
scrapy crawl course_scrapy
# Crawl courses into the database and also store them in courses.json
scrapy crawl course_scrapy -o courses.json

Handling UCI Data Changes

  1. Update items.py to match the new data fields
  2. Update the way course_spider.py parses the page
  3. Update models.py to reflect the new database schema
  4. Update pipelines.py to manage the insertion of the new data
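To show why all four files change together, here is a minimal, self-contained sketch of one field travelling from item to database. Everything in it is an assumption for illustration: the field names (`code`, `title`, `units`), the helper names, and the sample row are hypothetical, and the standard-library `sqlite3` module stands in for PostgreSQL so the example runs anywhere.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Sequence


# items.py stand-in: one attribute per scraped value (hypothetical fields).
@dataclass
class CourseItem:
    code: str
    title: str
    units: str  # imagine this is the field UCI just added


# course_spider.py stand-in: turn one already-extracted table row into an item.
def parse_row(cells: Sequence[str]) -> CourseItem:
    code, title, units = (cell.strip() for cell in cells)
    return CourseItem(code, title, units)


# models.py stand-in: the schema must list the same columns as the item.
SCHEMA = "CREATE TABLE courses (code TEXT PRIMARY KEY, title TEXT, units TEXT)"


# pipelines.py stand-in: the insert must cover every field of the item.
def store(conn: sqlite3.Connection, item: CourseItem) -> None:
    conn.execute(
        "INSERT INTO courses (code, title, units) VALUES (?, ?, ?)",
        astuple(item),
    )


conn = sqlite3.connect(":memory:")  # sqlite3 here, PostgreSQL in the real project
conn.execute(SCHEMA)
store(conn, parse_row([" 34000 ", "Intro to Example ", "4"]))
rows = conn.execute("SELECT code, title, units FROM courses").fetchall()
```

If a new field is added to the item but not to the schema or the insert statement, the value is silently dropped or the insert fails, which is why the four steps are listed together.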

Roadmap

View the project roadmap here

Contributing

See CONTRIBUTING.md for contribution guidelines.

About

A WebReg scraper via Scrapy
