
Zhizhuzi

A simple, easy-to-use crawler framework built on Netty

Project structure

  • zhuzhuzi-common : Common exceptions, POJOs, enums, etc.
  • zhuzhuzi-engine : The download engine and framework configuration
  • zhuzhuzi-jsoup : HTML DOM parser
  • zhizhuzi-gson : JSON parser

User guide

  1. How to locate HTML elements
    • Use a POJO and annotations
  public class JDUrlSkus extends WebSite {
      @Nodes(domNodes = {
              @Node(nodeTagName = NodeTagName.body),     // traverse to <body>
              @Node(nodeId = "J_searchWrap"),            // traverse to the DOM node(s) with this id
              @Node(nodeId = "J_container"),
              @Node(nodeId = "J_main"),
              @Node(nodeClassName = "m-list"),           // traverse to the DOM node(s) with this class name
              @Node(nodeClassName = "ml-wrap"),
              @Node(nodeId = "J_goodsList"),
              @Node(nodeClassName = "gl-warp clearfix"),
              @Node(nodeClassName = "gl-item"),
              @Node(nodeAttr = "data-sku")               // read the data-sku attribute from every candidate node
      })
      private List<String> urlSkus;
  }
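
For intuition, the annotation chain above does roughly what the following manual jsoup traversal does by hand (an illustrative sketch only; the framework's actual traversal lives in zhuzhuzi-jsoup and may differ):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.List;

public class ManualTraversal {
    // Sketch: resolve the same id/class chain with a jsoup CSS selector
    public static List<String> urlSkus(String html) {
        Document doc = Jsoup.parse(html);
        Elements items = doc.body().select(
                "#J_searchWrap #J_container #J_main .m-list .ml-wrap "
              + "#J_goodsList .gl-warp.clearfix .gl-item");
        // collect the data-sku attribute from every matched node
        return items.eachAttr("data-sku");
    }
}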

Supported targeting approaches:

  • nodeTagName: Locate elements by DOM tag name; pass an enum instance rather than a String
  • nodeId: Locate elements by the element's id
  • nodeClassName: Locate elements by the element's class name
  • nodeAttr: Read an attribute value from the matched DOM elements

Indexes and offsets are also supported for locating elements more precisely; see Node.order() and Node.bias().
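
For example, when several sibling nodes share a class name and only one of them is wanted, an index narrows the match. A minimal sketch (it assumes Node.order takes a zero-based int index, which this README does not show; check the Node source for the exact contract):

// Sketch: match only the second node with class "gl-item"
// (assumes Node.order is a zero-based index)
@Node(nodeClassName = "gl-item", order = 1)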

  2. Crawl each item's data-sku from search.jd.com

There are three ways to run a task:

  • execute: Returns immediately; crawl results are consumed asynchronously
  • blockExecute: Blocks until the crawl finishes, then returns
  • submit: Blocks until the crawl finishes, then returns the crawl result as a POJO (see the sketch at the end of this guide)

Here is a demo of blockExecute:

// Demo
public class JDCrawlTest {
    @Test
    public void test_urlList() {
        // more configuration is available on CrawlEngineBuilder; see that class
        try (CrawlEngine<JDUrlSkus> engine =
                     new CrawlEngineBuilder<>(JDUrlSkus.class)        // the engine parses HTML into a JDUrlSkus object
                             .ssl(true)                               // SSL support
                             .compress(true)                          // HTML content-compression support
                             .resConsumer(WebsiteConsumer::toConsole) // how to consume each crawled POJO
                             .build()) {                              // build the engine
            // submit a task to the engine in blocking mode
            // more configuration is available on CrawlTask; see that class
            engine.blockExecute(new CrawlTask("https://search.jd.com/Search?keyword=GPW"));
        } catch (Exception ignored) {
            // demo only; exceptions intentionally ignored
        }
    }
}
  3. Easy peasy: the SKUs we want appear in the console
urlSkus
39689153276 10026493519952 100010255665 8753300 10060147618464 10060147618465 34312424914 10055418617931 10040458414309 10052873738457 100018123120 32921207364 10026488360177 10022946653369 10068182342803 10045384931160 10051774315894 41305680312 10027050953329 10027050953330 10068182342804 10031646184951 10049690328133 10068038009331 10067871522660 10028019288129 10033116126240 10026490619621 10033116126241 10034899387643 
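
For comparison, here is a minimal sketch of submit, assuming it returns the parsed POJO directly so that no resConsumer is needed (the exact signature is in CrawlEngine; verify before relying on it):

// Sketch: submit blocks like blockExecute but hands back the parsed POJO
try (CrawlEngine<JDUrlSkus> engine =
             new CrawlEngineBuilder<>(JDUrlSkus.class)
                     .ssl(true)
                     .compress(true)
                     .build()) {
    // assumed return type: the engine's type parameter (JDUrlSkus)
    JDUrlSkus skus = engine.submit(new CrawlTask("https://search.jd.com/Search?keyword=GPW"));
} catch (Exception e) {
    e.printStackTrace();
}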
