jroakes/screaming-frog-shingling

screaming-frog-shingling

Uses Screaming Frog Internal HTML with text extraction along with a shingling algorithm to compare content duplication across the pages of a crawled site.
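For readers new to shingling, here is a minimal sketch of the general technique, not the code in sf_shingling.py. It assumes word-level shingles of size 3 and Jaccard similarity as the comparison measure. The function names `shingles` and `jaccard` and the parameter `k` are hypothetical and chosen for illustration only.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"

# The two pages share 4 of 10 distinct shingles.
similarity = jaccard(shingles(page_a), shingles(page_b))
print(similarity)  # 0.4
```

A single changed word breaks every shingle that overlaps it, so larger values of `k` make the score more sensitive to small edits.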

Example Usage

  1. Install the dependencies: pip install -r requirements.txt

  2. Run Screaming Frog and use Custom Extraction to pull the content out of a specific DOM element.

  3. Export the internal HTML to a CSV file.

  4. Run the script with the following arguments:

    -i : Input filename (the CSV exported from Screaming Frog)
    -o : Output filename
    -c : Name of the Screaming Frog column that contains your extracted content

    Example invocation:
    python sf_shingling.py -i internal_html_ap.csv -o output_html_ap.csv -c "BodyContent 1"
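To give a feel for what a page-by-page comparison over the exported CSV might look like, the sketch below scores every pair of rows by shingle overlap. It is an illustration under stated assumptions, not the script's actual logic: the URL column is assumed to be named "Address", the shingle size is assumed to be 3 words, and `word_shingles` and `score_pairs` are hypothetical names. The inline sample data stands in for a real export.

```python
import csv
import io
import re
from itertools import combinations


def word_shingles(text, k=3):
    """Set of k-word shingles; empty if the text has fewer than k words."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def score_pairs(rows, column, url_column="Address", k=3):
    """Return (url_a, url_b, jaccard) for every pair of rows, highest first."""
    # "Address" as the URL column is an assumption about the export layout.
    pages = [(row[url_column], word_shingles(row[column] or "", k)) for row in rows]
    results = []
    for (url_a, set_a), (url_b, set_b) in combinations(pages, 2):
        union = set_a | set_b
        score = len(set_a & set_b) / len(union) if union else 0.0
        results.append((url_a, url_b, round(score, 3)))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Made-up sample rows standing in for an exported CSV.
sample_csv = io.StringIO(
    "Address,BodyContent 1\n"
    'https://example.com/a,"red apples are sweet and crisp"\n'
    'https://example.com/b,"red apples are sweet and juicy"\n'
    'https://example.com/c,"bananas grow in warm climates"\n'
)
pairs = score_pairs(list(csv.DictReader(sample_csv)), column="BodyContent 1")
for url_a, url_b, score in pairs:
    print(url_a, url_b, score)
```

Comparing every pair grows quadratically with the number of pages, so a large crawl may need sampling or a hashing scheme such as MinHash instead of this brute-force loop.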
