A web spider that crawls HowLongToBeat to extract game and completion time data.

KasumiL5x/hltb-scraper

HowLongToBeat Scraper

A web spider that crawls HowLongToBeat to extract game and completion time data.

A big thank you to HowLongToBeat for providing such a great service with such rich data. Sorry about the scraping!

Purpose

HowLongToBeat (HLTB) is a great website for discovering how long people take to complete games. Although rich in data, it unfortunately lacks an API. This project scrapes all games known to the website at the time of writing, extracting the game data as well as all existing completion entries.

This project is part of my venture into the world of data science.

Structure

The HLTB_Game_Spider in hltb-game.py scrapes all games available through the website's search functionality. The HLTB_Completions_Spider in hltb-completions.py scrapes all user-submitted completion entries for each of those games. Some games have no completion entries and are therefore absent from the completions data.

The HLTB_Game_Spider extracts the following columns (post-cleaning):

  • id - Game ID from the website.
  • title - Game name.
  • main_story - Average completion time of 'Main Story' in hours.
  • main_plus_extras - Average completion time of 'Main + Extras' in hours.
  • completionist - Average completion time of 'Completionist' in hours.
  • all_styles - Average completion time of 'All Styles' in hours.
  • coop - Average completion time of 'Co-Op' in hours.
  • versus - Average completion time of 'Vs.' in hours.
  • type - Type entry to differentiate DLC/Expansion, Mod, and ROM Hack from regular game entries.
  • developers - Comma-space separated list of all developers of an entry.
  • publishers - Comma-space separated list of all publishers of an entry.
  • platforms - Comma-space separated list of all platforms an entry is available on.
  • genres - Comma-separated list of genres for an entry.
  • release_na - Release date in North America (if available).
  • release_eu - Release date in Europe (if available).
  • release_jp - Release date in Japan (if available).
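
The list-style columns above need splitting before analysis. The following is a minimal sketch of one way to load them, assuming the cleaned data is exported as CSV with the column names listed above (the export format is not stated in this README). The sample row, `split_list`, `to_hours`, and `load_games` are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical sample row; the values are invented for illustration only.
SAMPLE_CSV = """id,title,main_story,developers,platforms,genres
1,Example Game,12.5,"Studio A, Studio B","PC, Nintendo Switch","Action,Adventure"
"""


def split_list(value, sep=", "):
    """Split a delimited cell into a list, treating an empty cell as no items."""
    return [item.strip() for item in value.split(sep) if item.strip()]


def to_hours(value):
    """Convert an hours cell to float, or None when the cell is empty."""
    return float(value) if value else None


def load_games(handle):
    """Read game rows and convert the list and hour columns to Python types."""
    games = []
    for row in csv.DictReader(handle):
        row["main_story"] = to_hours(row["main_story"])
        # developers and platforms are documented as comma-space separated.
        row["developers"] = split_list(row["developers"])
        row["platforms"] = split_list(row["platforms"])
        # genres is documented as comma-separated, so split on "," and strip.
        row["genres"] = split_list(row["genres"], sep=",")
        games.append(row)
    return games


games = load_games(io.StringIO(SAMPLE_CSV))
print(games[0]["platforms"])
```

To read a real export, pass an open file handle to `load_games` instead of the in-memory sample.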

The HLTB_Completions_Spider extracts the following columns (post-cleaning):

  • id - Game ID that can be cross-referenced with the above dataset.
  • type - Type of completion entry (Main Story, Main + Extras, Completionists, Co-Op Multiplayer, Speed Run - Any%, Speed Run - 100%).
  • platform - Platform the particular entry was completed on.
  • time - Time of entry in hours and minutes (e.g., 2hr 50m).
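
Because `time` is stored as text, it usually has to be converted to a number before it can be compared with the averages in the games dataset. The sketch below parses the documented `2hr 50m` form into fractional hours and groups entries by `id` for cross-referencing. Handling of hour-only or minute-only values such as `2hr` or `50m` is an assumption, and the sample records and helper names (`time_to_hours`, `group_by_game`) are hypothetical.

```python
import re
from collections import defaultdict

# Matches "2hr 50m"; "2hr" or "50m" alone are accepted as an assumption.
TIME_PATTERN = re.compile(r"^\s*(?:(\d+)\s*hr?)?\s*(?:(\d+)\s*m)?\s*$")


def time_to_hours(text):
    """Convert a string such as '2hr 50m' into fractional hours."""
    match = TIME_PATTERN.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"Unrecognised time value: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours + minutes / 60


def group_by_game(completions):
    """Map each game id to the list of its completion times in hours."""
    grouped = defaultdict(list)
    for entry in completions:
        grouped[entry["id"]].append(time_to_hours(entry["time"]))
    return grouped


# Invented sample records using the column names documented above.
games = [
    {"id": "1", "title": "Example Game"},
    {"id": "2", "title": "Game Without Entries"},
]
completions = [
    {"id": "1", "type": "Main Story", "platform": "PC", "time": "2hr 50m"},
    {"id": "1", "type": "Main Story", "platform": "PC", "time": "3hr 10m"},
]

times = group_by_game(completions)
for game in games:
    # .get() avoids creating empty lists for games with no completion entries.
    entries = times.get(game["id"], [])
    print(game["title"], len(entries))
```

Looking up with `.get()` keeps games that have no completion entries visible in the result, which matters because such games are missing from the completions data.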
