macOS with Apache Spark 2.4.0 and Java 8 installed.
- Java 8
$ brew update; brew tap homebrew/cask-versions; brew cask install java
- Apache Spark
$ brew install apache-spark
Write code: extend AbstractCollector to use the project.
Build:
$ mvn package
Run: collect data from one seed.
$ spark-submit --class collector.Example target/simple-project-1.0.jar
Output examples:
- (293,www.bbc.com#)
- (293,www.bbc.com#orb-modules)
- (293,www.bbc.com)
- (278,www.bbc.co.uk/news)
- (268,www.bbc.co.uk)
- (243,www.bbc.co.uk/sport)
- (74,www.bbc.co.uk/accessibility)
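The pairs above read as (count, link), listed from the highest count to the lowest. As a rough illustration of how such a ranking can be produced, here is a plain-Java sketch that counts repeated links and sorts them by count. It is not the project's Spark code: the class `LinkRanking`, its methods, and the reading of the first field as an occurrence count are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the project.
final class LinkRanking {

    /** Counts how often each link occurs and returns the entries, highest count first. */
    static List<Map.Entry<String, Long>> rank(List<String> links) {
        Map<String, Long> counts = links.stream()
                .collect(Collectors.groupingBy(link -> link, Collectors.counting()));
        return counts.entrySet().stream()
                // Sort by count descending; ties are broken alphabetically by link.
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }

    /** Formats an entry in the same (count,link) shape as the output examples. */
    static String format(Map.Entry<String, Long> entry) {
        return "(" + entry.getValue() + "," + entry.getKey() + ")";
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("www.bbc.com", "www.bbc.co.uk", "www.bbc.com");
        for (Map.Entry<String, Long> entry : rank(links)) {
            System.out.println(format(entry)); // (2,www.bbc.com) then (1,www.bbc.co.uk)
        }
    }
}
```

On a real crawl the counting would run inside Spark rather than on an in-memory list; the sketch only shows the shape of the result.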
URL urlCrawler:
- start: http://localhost:8501/Crawler?start=1&url=https://www.google.com/search?q=ucla
- get results: http://localhost:8501/Crawler?start=2
- stop: http://localhost:8501/Crawler?start=0
website urlCrawler:
- start: http://localhost:8502/Crawler?start=1&url=https://www.google.com/search?q=china
- get results: http://localhost:8502/Crawler?start=2
- stop: http://localhost:8502/Crawler?start=0
keyword urlCrawler:
- start: http://localhost:8503/Crawler?start=1&url=https://www.bbc.com/
- get results: http://localhost:8503/Crawler?start=2
- stop: http://localhost:8503/Crawler?start=0
Poi:
- start: http://localhost:8504/Poi?start=1&lon=116.3978&lat=39.9033
- get results: http://localhost:8504/Poi?start=2
- stop: http://localhost:8504/Poi?start=0
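All four services appear to share one URL pattern: `start=1` (plus the seed parameters) starts a service, `start=2` returns its results, and `start=0` stops it. A minimal Java 8 sketch of a client for that pattern follows. The class `ServiceUrls` and its method names are hypothetical, and the code assumes the services listen on plain `http` on `localhost`. Parameters are appended exactly as they are written in the URLs above, without URL-encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the project.
final class ServiceUrls {
    // Values of the start parameter as they appear in the URLs above.
    static final int STOP = 0;
    static final int START = 1;
    static final int RESULTS = 2;

    private static String base(int port, String service, int action) {
        return "http://localhost:" + port + "/" + service + "?start=" + action;
    }

    /** URL that starts a service, e.g. params = "lon=116.3978&lat=39.9033". */
    static String start(int port, String service, String params) {
        return base(port, service, START) + "&" + params;
    }

    /** URL that returns the results collected so far. */
    static String results(int port, String service) {
        return base(port, service, RESULTS);
    }

    /** URL that stops the service. */
    static String stop(int port, String service) {
        return base(port, service, STOP);
    }

    /** Sends a GET request and returns the response body; needs a running service. */
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Prints the three control URLs for the Poi service without contacting it.
        System.out.println(start(8504, "Poi", "lon=116.3978&lat=39.9033"));
        System.out.println(results(8504, "Poi"));
        System.out.println(stop(8504, "Poi"));
    }
}
```

A typical session would call `get(start(...))` once, poll `get(results(...))` while the service runs, and finish with `get(stop(...))`.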
Result examples:
- URL urlCrawler: [{"_1":325,"_2":"www.google.com/about/"},{"_1":294,"_2":"www.google.com/search?q\u003ducla#"},...,{"_1":287,"_2":"www.ucla.edu/"}]
- website urlCrawler: [{"_1":20260,"_2":"en.wikipedia.org"},{"_1":2648,"_2":"www.google.com"},{"_1":1130,"_2":"www.cia.gov"},{"_1":368,"_2":"www.aljazeera.com"},{"_1":322,"_2":"www.nytimes.com"},{"_1":319,"_2":"www.china.org.cn"},{"_1":290,"_2":"www.cnbc.com"},{"_1":283,"_2":"www.reuters.com"},{"_1":175,"_2":"www.bbc.com"},{"_1":171,"_2":"support.google.com"}]
- keyword urlCrawler: [{"_1":38,"_2":"with"},{"_1":18,"_2":"from"},{"_1":12,"_2":"mins"},...,{"_1":7,"_2":"weekend"}]
- Poi: [{"_1":5.0,"_2":"北京彭胜医院 116.421604337309 39.8790360490599"},{"_1":4.9,"_2":"北京方庄购物中心 116.42875232754 39.8655340880918"},{"_1":4.8,"_2":"老佛爷百货 116.374899989768 39.9141985798095"},...,{"_1":4.7,"_2":"首都电影院(金融街店) 116.360392231333 39.91567130133"}]
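Each result above is a JSON array of objects with a numeric `_1` field and a string `_2` field. To keep the example free of dependencies, the sketch below extracts those pairs with a regular expression. In a real client a JSON library would be the safer choice. The class `ResultParser` and its `Pair` type are hypothetical names, and escapes such as `\u003d` are left undecoded in the label.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the project.
final class ResultParser {

    /** One {"_1": score, "_2": label} object from a result array. */
    static final class Pair {
        final double score;
        final String label;

        Pair(double score, String label) {
            this.score = score;
            this.label = label;
        }
    }

    // Matches {"_1":<number>,"_2":"<string>"}; the string may contain backslash escapes.
    private static final Pattern ENTRY =
            Pattern.compile("\\{\"_1\":(-?[0-9.]+),\"_2\":\"((?:[^\"\\\\]|\\\\.)*)\"\\}");

    /** Returns every pair found in the text, in order; anything else is skipped. */
    static List<Pair> parse(String json) {
        List<Pair> pairs = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(json);
        while (matcher.find()) {
            pairs.add(new Pair(Double.parseDouble(matcher.group(1)), matcher.group(2)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sample = "[{\"_1\":38,\"_2\":\"with\"},{\"_1\":18,\"_2\":\"from\"}]";
        for (Pair pair : parse(sample)) {
            System.out.println(pair.score + " " + pair.label);
        }
    }
}
```

For the Poi service the label holds a name followed by longitude and latitude, so splitting it on the last two spaces would recover the coordinates.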