A simple website spider that gathers basic URL information to assist with Search Engine Optimisation.
- Detects duplicate content using MD5 hashes (illustrated in the sketch after this list)
- Shows HTTP status codes for each URL
- Displays the response time and page size
- Follows redirects
- Exports results to CSV format
- Supports the Robots Exclusion Protocol (robots.txt)
- Supports rel="" link attribute
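The duplicate-content check can be pictured in a few lines of Perl: hash each fetched page body and compare digests. The sketch below uses LWP::UserAgent and Digest::MD5 and only illustrates the idea; it is not the script's actual implementation, and the example.com URLs are placeholders.

  use strict;
  use warnings;
  use LWP::UserAgent;
  use Digest::MD5 qw(md5_hex);

  # Placeholder URLs; the real spider discovers these while crawling.
  my @urls = ('http://example.com/', 'http://example.com/copy');

  my $ua = LWP::UserAgent->new;
  my %seen;    # MD5 digest of body => first URL seen with that content

  for my $url (@urls) {
      my $start    = time;
      my $response = $ua->get($url);
      my $elapsed  = time - $start;
      my $body     = $response->content;
      my $digest   = md5_hex($body);

      # Status code, page size and response time, reported per URL.
      printf "%s %s (%d bytes, %ds)\n",
          $response->code, $url, length($body), $elapsed;

      if (exists $seen{$digest}) {
          print "  duplicate content of $seen{$digest}\n";
      } else {
          $seen{$digest} = $url;
      }
  }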
For a list of usage parameters, run:
./spider.pl -h
-
First, open the spider.pl script and set the full path to the lib directory at the top of the file.
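Assuming the script uses Perl's standard use lib pragma for this, the line at the top of spider.pl would look something like the following (the path is a placeholder for your install location):

  use lib '/path/to/spider/lib';   # full path to the bundled lib directory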
-
Modify the options in the spider.conf file; each option is commented, so it should be self-explanatory.
-
Run the spider either by executing the script directly:
./spider.pl
Or by running it through the perl interpreter:
perl spider.pl
-
While the script is running it displays information on the currently tracked URLs and writes the results to the results.txt file.
To output to a CSV file, provide the --csv=FILE parameter.
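For example, to run the spider and also save the results as CSV (results.csv here is just a placeholder filename):

  ./spider.pl --csv=results.csv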