Web crawler: a Chrome extension that loads multiple pages at the same time in the Chrome browser and scrapes their content.

program-in-chinese/ChromeCrawlerWildSpider

 
 


Chrome extension in webstore: https://chrome.google.com/webstore/detail/wild-spider/aanpchnfojihjddlocpgoekffmjkhbbe

WATCH OUT: the more tabs you use, the more computer resources (CPU, memory) are consumed, and saving each page's content takes a little disk space (in IndexedDB, accessible from Chrome Extensions -> Wild Spider, Inspect views: background page).

The "spider" works in this way:

    1. The current URL is used as the starting point, and it is loaded again in a new tab.
    2. Wait until this page has finished loading.
    3. Collect all the links on the page, including relative URLs.
    4. Save the text content of the page. Open the extracted links in parallel across all the tabs in use (3 by default, set in eventPage).
    5. Repeat steps 2-4.
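
The loop above can be sketched roughly as follows. This is a minimal illustration and not the code in eventPage.js: the names `resolveLinks`, `crawl`, `loadPage`, `tabCount` and `maxPages` are all hypothetical, and the `maxPages` limit is an added assumption so that the sketch always terminates. `loadPage(url)` stands in for "open the URL in a tab, wait for it to load, and return its text and raw hrefs".

```javascript
// Hypothetical sketch: none of these names come from eventPage.js.

// Resolve each href against the page URL so that relative links become
// absolute, drop fragments, and keep only http(s) links.
function resolveLinks(hrefs, pageUrl) {
  const resolved = new Set();
  for (const href of hrefs) {
    try {
      const url = new URL(href, pageUrl);
      url.hash = ""; // "#section" variants point at the same page
      if (url.protocol === "http:" || url.protocol === "https:") {
        resolved.add(url.href);
      }
    } catch (e) {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return [...resolved];
}

// Crawl from startUrl, loading at most `tabCount` pages at a time.
// loadPage(url) must return a promise of { text, hrefs }.
// maxPages is an assumed safety limit, not something the README describes.
async function crawl(startUrl, loadPage, { tabCount = 3, maxPages = 100 } = {}) {
  const queue = [new URL(startUrl).href];
  const seen = new Set(queue); // URLs already queued or visited
  const saved = new Map();     // url -> text content

  while (queue.length > 0 && saved.size < maxPages) {
    // Take one batch of URLs, one per available "tab".
    const batch = queue.splice(0, Math.min(tabCount, maxPages - saved.size));
    const pages = await Promise.all(batch.map((url) => loadPage(url)));

    pages.forEach((page, i) => {
      saved.set(batch[i], page.text); // step 4: save the text content
      for (const link of resolveLinks(page.hrefs, batch[i])) {
        if (!seen.has(link)) {        // step 3: collect unseen links
          seen.add(link);
          queue.push(link);
        }
      }
    });
  }
  return saved;
}
```

In this sketch the pages are loaded in batches of `tabCount`, which is one simple way to keep a fixed number of tabs busy. In the extension itself, `loadPage` would presumably wrap tab creation and content extraction, and the saved text would go to IndexedDB instead of an in-memory `Map`; neither part is shown here.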

The control logic is written mainly in Chinese: eventPage.js
