forked from j3ssie/durl

ahmaad2221d/durl

 
 


Diff URLs

Remove duplicate URLs by retaining only the unique combinations of hostname, path, and parameter names.
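
As a rough illustration of that rule, the sketch below builds a key from the hostname, path, and sorted parameter names, and keeps only the first URL seen for each key. The `dedupKey` helper and the `|` and `,` separators are assumptions made for this example; they are not taken from durl's source.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// dedupKey is a hypothetical helper (not durl's actual code). It builds a key
// from the hostname, the path, and the sorted query parameter names, so that
// parameter values and parameter order do not affect the result.
func dedupKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	query := u.Query()
	names := make([]string, 0, len(query))
	for name := range query {
		names = append(names, name)
	}
	sort.Strings(names)
	return u.Hostname() + "|" + u.Path + "|" + strings.Join(names, ","), nil
}

func main() {
	urls := []string{
		"https://sample.example.com/product.aspx?productID=123&type=customer",
		"https://sample.example.com/product.aspx?productID=456&type=admin",
	}
	seen := make(map[string]bool)
	for _, raw := range urls {
		key, err := dedupKey(raw)
		if err != nil || seen[key] {
			continue // skip unparsable URLs and repeats of a known key
		}
		seen[key] = true
		fmt.Println(raw) // only the first URL for each key is printed
	}
}
```

Both sample URLs share the key `sample.example.com|/product.aspx|productID,type`, so only the first one is printed.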

Install

go install github.com/j3ssie/durl@latest

Usage

cat wayback_urls.txt | durl | tee differ_urls.txt

# with extra regex
cat wayback_urls.txt | durl -e 'your-regex-here' | tee differ_urls.txt

Covered cases

The following examples illustrate the criteria used to ensure each URL is considered unique and listed only once:

  • URLs with the same hostname, path, and parameter names
https://sample.example.com/product.aspx?productID=123&type=customer
https://sample.example.com/product.aspx?productID=456&type=admin
  • Paths indicating static content such as blog, news, or calendar
https://www.example.com/cn/news/all-news/public-1.html
https://www.sample.com/de/about/business/countrysites.htm
https://www.sample.com/de/about/business/very-long-string-here-that-exceed-100-char.htm
https://www.sample.com/de/blog/2022/01/02/blog-title.htm
  • URLs with numeric variations
https://www.example.com/data/0001.html
https://www.example.com/data/0002.html
  • Static files are ignored, for example https://example.com.com/cdn-cgi/style.css
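
The numeric-variation and static-file cases above can be sketched as follows. This is a minimal illustration under stated assumptions, not durl's implementation: `normalizePath`, `isStatic`, the `N` placeholder, and the extension list are all hypothetical choices made for this example.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// digits matches runs of decimal digits in a path.
var digits = regexp.MustCompile(`[0-9]+`)

// staticExt is an assumed, illustrative list of static file extensions.
var staticExt = map[string]bool{
	".css": true, ".js": true, ".png": true, ".jpg": true,
	".gif": true, ".svg": true, ".woff": true,
}

// normalizePath replaces every digit run with the placeholder "N", so paths
// that differ only in numbers map to the same value.
func normalizePath(p string) string {
	return digits.ReplaceAllString(p, "N")
}

// isStatic reports whether the path ends in one of the assumed extensions.
func isStatic(p string) bool {
	return staticExt[strings.ToLower(path.Ext(p))]
}

func main() {
	// Paths that differ only in their numeric part normalize to the same value.
	fmt.Println(normalizePath("/data/0001.html") == normalizePath("/data/0002.html"))
	// A stylesheet path is flagged as static and would be skipped.
	fmt.Println(isStatic("/cdn-cgi/style.css"))
}
```

A normalized path like this could be used in place of the raw path when building a uniqueness key, while URLs flagged by `isStatic` would be dropped before deduplication.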
