broken-link-checker

Find broken links, missing images, etc. in your HTML.

Features:

Stream-parses local and remote HTML pages
Concurrently checks multiple links
Supports various HTML elements/attributes, not just <a href>
Supports redirects, absolute URLs, relative URLs and <base>
Honors robot exclusions
Provides detailed information about each link (HTTP and HTML)
URL keyword filtering with wildcards
Pause/Resume at any time

Installation

Node.js >= 0.10 is required; versions below 4.0 will also need Promise and Object.assign polyfills.

There are two ways to use it:

Command Line Usage

To install, type this at the command line:

npm install broken-link-checker -g

After that, check out the help for available options:

blc --help

A typical site-wide check might look like:

blc https://yoursite.com -ro

Programmatic API

To install, type this at the command line:

npm install broken-link-checker

The rest of this document describes how to use the API.

Classes

`blc.HtmlChecker(options, handlers)`

Scans an HTML document to find broken links.

handlers.complete is fired after the last result or zero results.
handlers.html is fired after the HTML document has been fully parsed.
- tree is supplied by parse5.
- robots is an instance of robot-directives containing any <meta> robot exclusions.
handlers.junk is fired with data on each skipped link, as configured in options.
handlers.link is fired with the result of each discovered link (broken or not).
.clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
.numActiveLinks() returns the number of links with active requests.
.numQueuedLinks() returns the number of links that currently have no active requests.
.pause() will pause the internal link queue, but will not pause any active requests.
.resume() will resume the internal link queue.
.scan(html, baseUrl) parses & scans a single HTML document. Returns false when there is a previously incomplete scan (and true otherwise).
- html can be a stream or a string.
- baseUrl is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error.

var htmlChecker = new blc.HtmlChecker(options, {
    html: function(tree, robots){},
    junk: function(result){},
    link: function(result){},
    complete: function(){}
});
 
htmlChecker.scan(html, baseUrl);
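
As a rough sketch of how these handlers fit together, the helper below collects the broken links found in one HTML string. It takes the blc module as a parameter so the same code can be exercised with a stub. The name collectBrokenLinks is hypothetical, and the result.broken and result.url.original properties are assumed from the library's link result object rather than defined in this section.

```javascript
// Hypothetical helper: scans one HTML document and reports its broken links.
// `blc` is the broken-link-checker module (or a compatible stub).
function collectBrokenLinks(blc, html, baseUrl, callback) {
    var broken = [];

    var htmlChecker = new blc.HtmlChecker({}, {
        link: function(result) {
            // Assumption: each result exposes `broken` and `url.original`.
            if (result.broken) {
                broken.push(result.url.original);
            }
        },
        complete: function() {
            // Fired after the last result, or straight away on zero results.
            callback(null, broken);
        }
    });

    // scan() returns false when a previous scan has not finished yet.
    if (htmlChecker.scan(html, baseUrl) === false) {
        callback(new Error("A previous scan is still in progress"));
    }
}

// Usage (requires the package to be installed):
// var blc = require("broken-link-checker");
// collectBrokenLinks(blc, "<a href='/missing'>x</a>", "https://yoursite.com/", function(error, links) {
//     console.log(error || links);
// });
```

Passing a baseUrl matters here: without one, relative links such as /missing are reported with an "Invalid URL" error instead of being requested.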

`blc.HtmlUrlChecker(options, handlers)`

Scans the HTML content at each queued URL to find broken links.

handlers.end is fired when the end of the queue has been reached.
handlers.html is fired after a page's HTML document has been fully parsed.
- tree is supplied by parse5.
- robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
handlers.junk is fired with data on each skipped link, as configured in options.
handlers.link is fired with the result of each discovered link (broken or not) within the current page.
handlers.page is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
.clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
.dequeue(id) removes a page from the queue. Returns true on success or an Error on failure.
.enqueue(pageUrl, customData) adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an Error on failure.
- customData is optional data that is stored in the queue item for the page.
.numActiveLinks() returns the number of links with active requests.
.numPages() returns the total number of pages in the queue.
.numQueuedLinks() returns the number of links that currently have no active requests.
.pause() will pause the queue, but will not pause any active requests.
.resume() will resume the queue.

var htmlUrlChecker = new blc.HtmlUrlChecker(options, {
    html: function(tree, robots, response, pageUrl, customData){},
    junk: function(result, customData){},
    link: function(result, customData){},
    page: function(error, pageUrl, customData){},
    end: function(){}
});
 
htmlUrlChecker.enqueue(pageUrl, customData);
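
Because enqueue() reports failure through its return value, it can help to check that value for every page. The following is a minimal sketch under that assumption; enqueuePages and the shape of the customData object are hypothetical.

```javascript
// Hypothetical helper: queues several pages and separates queue IDs from errors.
// `checker` is an HtmlUrlChecker instance (or any object with a compatible enqueue()).
function enqueuePages(checker, pageUrls) {
    var queued = [];
    var rejected = [];

    pageUrls.forEach(function(pageUrl, index) {
        // customData is stored in the queue item and passed back to the handlers.
        var id = checker.enqueue(pageUrl, { index: index });

        // On failure, enqueue() returns an Error instead of a queue ID.
        if (id instanceof Error) {
            rejected.push({ pageUrl: pageUrl, error: id });
        } else {
            queued.push({ pageUrl: pageUrl, id: id });
        }
    });

    return { queued: queued, rejected: rejected };
}

// Usage (requires the package to be installed):
// var blc = require("broken-link-checker");
// var htmlUrlChecker = new blc.HtmlUrlChecker({}, { end: function(){} });
// var outcome = enqueuePages(htmlUrlChecker, ["https://yoursite.com/", "https://yoursite.com/about"]);
```

The returned queue IDs can later be passed to dequeue(id) to remove pages that have not been processed yet.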

`blc.SiteChecker(options, handlers)`

Recursively scans (crawls) the HTML content at each queued URL to find broken links.

handlers.end is fired when the end of the queue has been reached.
handlers.html is fired after a page's HTML document has been fully parsed.
- tree is supplied by parse5.
- robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
handlers.junk is fired with data on each skipped link, as configured in options.
handlers.link is fired with the result of each discovered link (broken or not) within the current page.
handlers.page is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
handlers.robots is fired after a site's robots.txt has been downloaded and provides an instance of robots-txt-guard.
handlers.site is fired after a site's last result, on zero results, or if the initial HTML could not be retrieved.
.clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
.dequeue(id) removes a site from the queue. Returns true on success or an Error on failure.
.enqueue(siteUrl, customData) adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an Error on failure.
- customData is optional data that is stored in the queue item for the site.
.numActiveLinks() returns the number of links with active requests.
.numPages() returns the total number of pages in the queue.
.numQueuedLinks() returns the number of links that currently have no active requests.
.numSites() returns the total number of sites in the queue.
.pause() will pause the queue, but will not pause any active requests.
.resume() will resume the queue.

Note: options.filterLevel determines which links are followed recursively.

var siteChecker = new blc.SiteChecker(options, {
    robots: function(robots, customData){},
    html: function(tree, robots, response, pageUrl, customData){},
    junk: function(result, customData){},
    link: function(result, customData){},
    page: function(error, pageUrl, customData){},
    site: function(error, siteUrl, customData){},
    end: function(){}
});
 
siteChecker.enqueue(siteUrl, customData);
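
One way to use the link, page and end handlers together is to tally results per page during a crawl. The sketch below assumes that pages are reported one at a time, as "within the current page" suggests, and that each result carries a broken flag. The name createTallyHandlers is hypothetical.

```javascript
// Hypothetical factory: builds SiteChecker handlers that tally results per page.
function createTallyHandlers(onEnd) {
    var pages = {};                           // pageUrl -> tally or error
    var current = { checked: 0, broken: 0 };  // tally for the page in progress

    return {
        link: function(result, customData) {
            current.checked++;
            // Assumption: each result exposes a boolean `broken` property.
            if (result.broken) {
                current.broken++;
            }
        },
        page: function(error, pageUrl, customData) {
            // `error` is set when the page's HTML could not be retrieved.
            pages[pageUrl] = error ? { error: error.message } : current;
            current = { checked: 0, broken: 0 };
        },
        end: function() {
            onEnd(pages);
        }
    };
}

// Usage (requires the package to be installed):
// var blc = require("broken-link-checker");
// var siteChecker = new blc.SiteChecker({}, createTallyHandlers(function(pages) {
//     console.log(pages);
// }));
// siteChecker.enqueue("https://yoursite.com");
```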

`blc.UrlChecker(options, handlers)`

Requests each queued URL to determine whether it is broken.

handlers.end is fired when the end of the queue has been reached.
handlers.link is fired for each result (broken or not).
.clearCache() will remove any cached URL responses. This is only relevant if the cacheResponses option is enabled.
.dequeue(id) removes a URL from the queue. Returns true on success or an Error on failure.
.enqueue(url, baseUrl, customData) adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success or an Error on failure.
- baseUrl is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error.
- customData is optional data that is stored in the queue item for the URL.
.numActiveLinks() returns the number of links with active requests.
.numQueuedLinks() returns the number of links that currently have no active requests.
.pause() will pause the queue, but will not pause any active requests.
.resume() will resume the queue.

var urlChecker = new blc.UrlChecker(options, {
    link: function(result, customData){},
    end: function(){}
});
 
urlChecker.enqueue(url, baseUrl, customData);
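
The sketch below wraps a UrlChecker in a single callback that receives every result once the queue is empty. It is an illustration, not part of the library: checkUrls and the input key in customData are hypothetical, and the early callback assumes that end does not fire when nothing was queued.

```javascript
// Hypothetical wrapper: checks a list of URLs and reports all results at once.
// `blc` is the broken-link-checker module (or a compatible stub).
function checkUrls(blc, urls, baseUrl, callback) {
    var results = [];
    var queued = 0;

    var urlChecker = new blc.UrlChecker({}, {
        link: function(result, customData) {
            // customData identifies which input URL this result belongs to.
            results.push({ input: customData.input, result: result });
        },
        end: function() {
            callback(results);
        }
    });

    urls.forEach(function(url) {
        // Relative URLs are resolved against baseUrl.
        var id = urlChecker.enqueue(url, baseUrl, { input: url });

        if (id instanceof Error) {
            results.push({ input: url, error: id });
        } else {
            queued++;
        }
    });

    // Assumption: `end` never fires if nothing was queued, so finish here.
    if (queued === 0) {
        callback(results);
    }
}

// Usage (requires the package to be installed):
// var blc = require("broken-link-checker");
// checkUrls(blc, ["/about", "/contact"], "https://yoursite.com/", function(results) {
//     console.log(results);
// });
```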

Options

`options.acceptedSchemes`

Type: Array
Default value: ["http","https"]
Will only check links with schemes/protocols mentioned in this list. Any others (except those in excludedSchemes) will output an "Invalid URL" error.

`options.cacheExpiryTime`

Type: Number
Default value: 3600000 (1 hour)
The number of milliseconds for which a cached response is considered valid. This is only relevant if the cacheResponses option is enabled.

`options.cacheResponses`

Type: Boolean
Default value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.

`options.excludedKeywords`

Type: Array
Default value: []
Will not check or output links that match the keywords and glob patterns in this list. The only wildcard supported is *.
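
To show how the options described so far combine, here is an illustrative options object. The values are examples only, and the two excludedKeywords patterns are hypothetical.

```javascript
// Illustrative values only; every key is one of the options described above.
var options = {
    // The default: links with any other scheme output an "Invalid URL" error.
    acceptedSchemes: ["http", "https"],

    // The default: each unique URL is requested only once.
    cacheResponses: true,

    // 30 minutes in milliseconds, instead of the default hour.
    cacheExpiryTime: 30 * 60 * 1000,

    // Hypothetical patterns: a glob using the * wildcard and a plain keyword.
    excludedKeywords: ["*/private/*", "example.org"]
};

// The same object can be passed as the first argument to any of the classes,
// for example: new blc.UrlChecker(options, handlers)
```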