Skip to content

ayaka14732/persianpod101-scraper

 
 

Repository files navigation

⚡ persianpod101-scraper ⚡

🎓 About

persianpod101-scraper helps you download Persian language courses and save them to a local directory. The course is produced and distributed by Innovative Language, who provides language learning courses from a selection of dozens of languages. Each lesson is usually 10-20 minutes long.

To get started, choose the Persian course offered by Innovative Language and create a free account.

📌 Usage

To use the script, fulfill the requirements and follow the example as demonstrated below.

🔌 Requirements

  • Download and install Python 3.9+.

  • Install required packages from requirements.txt file using pip.

    pip install -r requirements.txt
  • Put your username and password in scrape.py.

  • Run the scraping script:

    python scrape.py

    The scraped file is already stored in data.csv.

  • Run the post-processing script:

    python postprocess.py

    The processed file is already stored in data-postprocessed.csv. It contains 4 columns, namely id, url, english_text, and texts. The texts are JSON lists that may contain 1 to 3 items.

    • If a certain text contains 3 items, they would be the same Persian sentence in Arabic, Arabic with Tashkeel and Latin, respectively.
    • If it contains 2 items, they would be in Arabic and Latin.
    • If it contains 2 items, they would be in Arabic.
  • Create a directory output

  • Run download.py to download the files

📋 Disclaimer and known issues

  • Any usage of the script is under the user's responsibility only. Users of the script must act according to the site's terms.

  • As of today, Innovative Language's terms of use do not forbid the usage of crawlers or scrapers on any of their sites. This may change in the future, so be aware.

  • If you like the services Innovative Language provides you should consider a monthly subscription. Basic programs start at around $5 per month and include support from native speaker teachers.

  • As with all websites, the site's structure may change in the future and thus, as often happens with scraping scripts, deprecate it. It is not really a question of if the site's source code will change but rather when (so enjoy it while it's still working 😁).

🔒 License

All of the content presented on the websites belongs to the original creators (Innovative Language) and I have nothing to do with it.

The license below refers only to the script and not to the downloaded content.

License - MIT

💬 Status and changelog

  • 15.06.2023: Adapt to persianpod101.com.
  • 23.03.2022: Added support for basic video downloading (nothing fancy, just m4v and mp4 files) Added error handling for when a lesson library/lesson contents URL is used instead of the first lesson (user is now warned).
  • 11.05.2021: Headers and waiting time added, script is alive again.

About

Python scraper for persianpod101.com

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%