Scrapes a given board catalog on 4Chan for all comments, files, and associated metadata with the help of the BASC 4Chan Python Library.
- Windows Navigation:
C:\Users\USER\AppData\Local\Programs\Python\Python3x\Lib\site-packages\basc_py4chan\util.py
- Linux Navigation:
/usr/lib/site-packages/python3.x/site-packages/basc_py4chan/util.py
- Replace the HTMLParser dependency in util.py, changing the import from
# HTML parser was renamed in python 3.x
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser
_parser = html.HTMLParser()
to the newly named dependency html:
# HTML parser was renamed in python 3.x
import html
_parser = html
Download the .zip from the GitHub repo or clone it using
git clone https://github.com/malavmodi/4Chan-Scraper.git
then install the required dependencies with pip:
pip3 install -r requirements.txt
- --board_name:
  - Board from 4Chan to scrape (Required)
- --num_threads:
  - Number of threads to scrape (Required)
- --debug:
  - Additional log output (Optional / Case Insensitive)
NOTE: For additional information on usage, run python 4chan_scraper.py -h to check options.
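For orientation, the three options above could be declared with argparse roughly as shown here. This is a minimal sketch, not the script's actual code: the helper names parse_debug and build_parser, the help strings, and the exact handling of the case-insensitive --debug value are all assumptions.

```python
import argparse


def parse_debug(value: str) -> bool:
    """Interpret "true"/"false" in any letter case (assumed behaviour)."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(description="Scrape a 4Chan board catalog.")
    parser.add_argument("--board_name", required=True,
                        help="Board from 4Chan to scrape")
    parser.add_argument("--num_threads", required=True, type=int,
                        help="Number of threads to scrape")
    parser.add_argument("--debug", type=parse_debug, default=False,
                        help="Additional log output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--board_name", "pol", "--num_threads", "5", "--debug", "False"]
)
print(args.board_name, args.num_threads, args.debug)
```

Converting --debug with a small function rather than type=bool avoids the common pitfall where any non-empty string, including "False", evaluates to True.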
- Scraping the first 5 threads of /pol/
- python 4chan_scraper.py --board_name "pol" --num_threads 5 --debug "False"
When run, the script creates a folder in the current working directory containing all associated data, in a hierarchical structure as follows:
- Thread ID with Subject (if not Null) (Folder)
  - Thread ID files (Folder)
    - File Data
    - File Data
  - CSV with comments/replies from posts
  - JSON formatted output of thread
  - File Metadata
  - Thread Metadata
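A layout like the one listed above could be produced along these lines. This is only a sketch under stated assumptions: the function name write_thread, the file names, the CSV columns, and the sample posts are all hypothetical, and the metadata files are omitted for brevity. It writes into a temporary directory rather than the current working directory.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_thread(root: Path, thread_id: int, subject, posts: list) -> Path:
    # Folder name: thread ID, plus the subject when one exists.
    # (A real subject may need sanitising before use as a folder name.)
    name = f"{thread_id} - {subject}" if subject else str(thread_id)
    thread_dir = root / name

    # Sub-folder that would hold the downloaded files.
    files_dir = thread_dir / f"{thread_id} files"
    files_dir.mkdir(parents=True, exist_ok=True)

    # CSV with one row per comment/reply.
    csv_path = thread_dir / f"{thread_id}.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["post_id", "comment"])
        writer.writeheader()
        writer.writerows(posts)

    # JSON-formatted dump of the whole thread.
    json_path = thread_dir / f"{thread_id}.json"
    payload = {"thread_id": thread_id, "subject": subject, "posts": posts}
    json_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return thread_dir


root = Path(tempfile.mkdtemp())
posts = [{"post_id": 1, "comment": "first"}, {"post_id": 2, "comment": "reply"}]
thread_dir = write_thread(root, 12345, "Example subject", posts)
print(sorted(p.name for p in thread_dir.iterdir()))
```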