trla-H2Data-backend

background

Python app hosted on Heroku which manages and automatically updates a PostgreSQL database of H-2A and H-2B job listings. It includes processes to geocode addresses and check the results for accuracy, and it implements a system for people to fix inaccurate addresses. The most frequent local uses of this code are uploading new quarterly data and re-running the script to pull new data.
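As an illustration of the geocode-and-review step, here is a minimal sketch assuming geopy's Nominatim geocoder; the repository's actual geocoding service, schema, and review flag are not specified in this README, so treat every name below as a placeholder.

```python
# Minimal geocoding sketch (assumption: the repo's real geocoder and schema may differ).
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="trla-h2data-example")  # hypothetical user agent

def geocode_address(address):
    """Geocode a job-listing address and flag results that need manual review."""
    location = geolocator.geocode(address)
    if location is None:
        # No match: flag the row for the manual address-fixing workflow.
        return {"lat": None, "lon": None, "fix_needed": True}
    return {"lat": location.latitude, "lon": location.longitude, "fix_needed": False}

print(geocode_address("300 E 8th St, Austin, TX 78701"))
```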

how it works

Job postings are added daily by a web scraper hosted on Apify, which gets the latest postings from here. Each quarter, the official dataset published by the US Department of Labor is merged with the existing data.

A frontend web app to visualize, filter, and download the data is here. The frontend code is located here.

uploading new quarterly data

See this file for a detailed explanation.

DOL data is published online as "performance data" or "quarterly disclosure data," typically available here. The disclosure data has many more variables than we are able to scrape and includes addenda.
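For orientation only, here is a rough sketch of what merging a quarterly disclosure file into the existing data can look like with pandas; the file name, the CASE_NUMBER column, the job_central table, and the connection string are all assumptions, so follow the linked instructions above for the actual procedure.

```python
# Rough sketch of a quarterly merge. Assumptions: file name, CASE_NUMBER column,
# job_central table, and connection details are illustrative, not the real schema.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/dbname")  # hypothetical DSN

# Load the DOL quarterly disclosure file (hypothetical file name).
quarterly = pd.read_excel("H-2A_Disclosure_Data_FY2023_Q2.xlsx")

# Case numbers already in the database, so existing rows are not re-inserted.
existing = pd.read_sql("SELECT case_number FROM job_central", engine)

# Keep only rows whose case numbers are new, then append them.
# (In practice the disclosure columns must be renamed to match the table first.)
new_rows = quarterly[~quarterly["CASE_NUMBER"].isin(existing["case_number"])]
new_rows.to_sql("job_central", engine, if_exists="append", index=False)
```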

re-running for missed data

If the Python script fails, you can re-run it after the issue has been corrected.

First, go to Apify:

1. Select Actors and click the name of the actor (apify-dol-actor at time of writing).
2. Select Runs and click the green, hyperlinked "status" of the run you are interested in.
3. Under the 4 dashboard items with results, select "API".
4. Copy the URL under Get Dataset Items.

Then, go to the update_database.py file and temporarily replace most_recent_run_url with the URL you got from Apify, enclosed in quotes (the version on line 18, inside requests.get("URL ENCLOSED IN QUOTES!").json()).
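As a sketch of that temporary edit, the call around line 18 would end up looking something like the snippet below; only the requests.get(...).json() pattern comes from this README, the surrounding variable name is an assumption, and an obvious placeholder marks where the copied URL goes.

```python
# Sketch of the temporary change in update_database.py (variable name is assumed;
# paste the "Get Dataset Items" URL copied from Apify between the quotes).
import requests

latest_postings = requests.get(
    "PASTE_THE_GET_DATASET_ITEMS_URL_HERE"
).json()
```

Once the missed run has been loaded, revert the hard-coded URL so the script goes back to pulling the most recent run automatically.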

Run the script in your console, and everything should be up to date. If you accidentally add duplicate case numbers, it's okay. They will not actually end up in the database.
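The fact that duplicate case numbers never end up in the database is consistent with inserts that skip rows whose case number already exists; a minimal sketch of that pattern with psycopg2 is below, and the table name, column names, and conflict target are assumptions rather than the repository's actual schema.

```python
# Minimal sketch of duplicate-safe inserts (assumption: the real script may rely on a
# different mechanism; table and column names are illustrative, and ON CONFLICT
# assumes a unique constraint on case_number).
import psycopg2

latest_postings = [  # illustrative rows, shaped like scraped postings
    {"CASE_NUMBER": "H-300-00000-000000", "EMPLOYER_NAME": "Example Farms LLC"},
]

conn = psycopg2.connect("dbname=h2data user=postgres")  # hypothetical connection string
with conn, conn.cursor() as cur:
    for posting in latest_postings:
        cur.execute(
            """
            INSERT INTO job_central (case_number, employer_name)
            VALUES (%s, %s)
            ON CONFLICT (case_number) DO NOTHING  -- duplicates are silently skipped
            """,
            (posting["CASE_NUMBER"], posting["EMPLOYER_NAME"]),
        )
```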

cleaning up data

Data is cleaned via this Google Doc. TRLA staff clean data only for TX, AR, LA, KY, AL, MS, and TN.

issues with the scraper itself

If the DOL website format changes, you may need to alter the scraper. This will also be the case if you want to attempt to scrape more fields.

We contracted with a developer on Apify to create the scraper. You can send new requests for changes, and the developer will quote you a one-time cost.

Submit a new request through Apify's marketplace. If the original developer is still providing services through the platform, you can hopefully be connected with the same developer as before.
