A Python notebook that takes a list of website URLs and scrapes any email addresses found on those websites (from the exact page loaded by each URL and from any 'contact us' or similar pages that page links to).


rhart-rup/Scrape-Email-Addresses-From-Websites


Scrape Email Addresses From Websites

The aim of this notebook is to scrape email addresses from websites. It takes a list of website URLs, loads each one in a browser, and attempts to find and scrape any email addresses present within the HTML of the page.

Most websites have some variation of a 'contact us' page that contains contact details, including email addresses. To find these, the notebook looks for any HTML links on the webpage that contain the word 'contact', opens the pages they link to, and scrapes any email addresses found on those pages as well.

How to use:

Full details can be found in the emails_scrape.ipynb file. Below we summarise the key steps:

  1. Create a CSV file that lists the websites you wish to scrape for email addresses. It must:
  • Be called 'websites.csv'
  • Have a column called 'website' that contains a single website URL per row. These are the websites that will be scraped for email addresses.
  • Other columns (e.g. for metadata) are allowed; the script ignores them and they will not prevent it from working.
  2. Create a virtual environment from the requirements.txt file and create a Jupyter kernel from it.
  3. Open the emails_scrape.ipynb file (e.g. in JupyterLab, as a standalone Jupyter notebook, or with any other IDE / tool).
  4. Run all cells of the notebook using the Jupyter kernel. The notebook handles errors commonly encountered when using chromedriver / selenium, and it logs any unexpected errors. The log is displayed in the notebook after the scraping has completed.
  5. If the scraping is successful, the scraped email addresses are cleaned and saved to a CSV file. The addresses for each website are stored as a semicolon-separated string, e.g. '[email protected];[email protected]'.
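The input and output formats described in the steps above can be sketched as follows. All filenames stay as in the steps, but the URLs, the 'notes' column, and the email addresses are illustrative assumptions; this uses in-memory strings rather than files on disk:

```python
import csv
import io

# A hypothetical websites.csv: one URL per row in the 'website' column;
# extra columns (here 'notes') are ignored, as described in step 1.
websites_csv = io.StringIO(
    "website,notes\n"
    "https://example.com,first site\n"
    "https://example.org,second site\n"
)
urls = [row["website"] for row in csv.DictReader(websites_csv)]
print(urls)

# After scraping, the addresses found per site (illustrative data, not
# real scrape results) are joined with semicolons before being written out.
found = {"https://example.com": ["[email protected]", "[email protected]"]}

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["website", "emails"])
for url in urls:
    writer.writerow([url, ";".join(found.get(url, []))])
print(out.getvalue())
```

The semicolon-joined string keeps one row per website in the output CSV, so it can be split back into a list with `emails.split(";")`.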
