Merge pull request #49 from istresearch/dev
Scrapy Cluster 1.1 Merge
Madison Bahmer committed Feb 23, 2016
2 parents 7b1c109 + 422d72d commit a8b611c
Showing 195 changed files with 16,896 additions and 1,764 deletions.
37 changes: 37 additions & 0 deletions .gitignore
@@ -1,3 +1,40 @@
# Python binaries
*.pyc

# Sphinx
docs/_build
docs/_build_html

# OSX garbage
.DS_STORE

# Scrapy Cluster
kafka-monitor/logs/*
redis-monitor/logs/*
crawler/logs/*
crawler/main.log
localsettings.py

# Vagrant test VM
.vagrant
local/
bin/
pip-selfcheck.json

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
24 changes: 24 additions & 0 deletions .travis.yml
@@ -0,0 +1,24 @@
sudo: false

language: python

services:
- redis-server


install:
# Install conda
- wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- conda config --set always_yes yes --set changeps1 no
- conda update conda
- conda install pip

# install requirements
- conda env create -f ./conda_env.yml


script:
- source activate sc; ./run_offline_tests.sh

27 changes: 22 additions & 5 deletions README.md
@@ -1,6 +1,6 @@
# Scrapy Cluster

[![Join the chat at https://gitter.im/istresearch/scrapy-cluster](https://badges.gitter.im/istresearch/scrapy-cluster.svg)](https://gitter.im/istresearch/scrapy-cluster?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Build Status](https://travis-ci.org/istresearch/scrapy-cluster.svg)](https://travis-ci.org/istresearch/scrapy-cluster) [![Join the chat at https://gitter.im/istresearch/scrapy-cluster](https://badges.gitter.im/istresearch/scrapy-cluster.svg)](https://gitter.im/istresearch/scrapy-cluster?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.

@@ -26,17 +26,34 @@ This project tries to bring together a bunch of new concepts to Scrapy and large
- The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
- Scale Scrapy instances across a single machine or multiple machines
- Coordinate and prioritize their scraping effort for desired sites
- Persist across scraping jobs or have multiple scraping jobs going at the same time
- Allows for unparalleled access into the information about your scraping job, what is upcoming, and how the sites are ranked
- Persist data across scraping jobs
- Execute multiple scraping jobs concurrently
- Allows for in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
- Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
- Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results); see the sketch after this list
- Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address

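To give a feel for the Kafka data bus, here is a minimal sketch of submitting a crawl request from an outside Python application. The `kafka-python` client, the `demo.incoming` topic, and the `localhost:9092` broker address are assumptions drawn from the documentation's defaults; adjust them to your deployment.

```python
import json

from kafka import KafkaProducer

# Broker address is an assumption; point this at your Kafka cluster.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# A crawl request: appid identifies the submitting application,
# crawlid identifies this particular job.
producer.send('demo.incoming', {
    'url': 'http://istresearch.com',
    'appid': 'testapp',
    'crawlid': 'abc1234',
})
producer.flush()
```
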
## Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have the latest **VirtualBox** + **Vagrant >= 1.7.4** installed. Vagrant will automatically mount the base **scrapy-cluster** directory to the **/vagrant** directory, so any code changes you make will be visible inside the VM.

### Steps to launch the test environment:
1. `vagrant up` in the base **scrapy-cluster** directory.
2. `vagrant ssh` to ssh into the VM.
3. `sudo supervisorctl status` to check that everything is running.
4. `cd /vagrant` to get to the **scrapy-cluster** directory.
5. `conda create -n sc scrapy --yes` to create a conda virtualenv with Scrapy pre-installed.
6. `source activate sc` to activate your virtual environment.
7. `pip install -r requirements.txt` to install Scrapy Cluster dependencies.
8. `./run_offline_tests.sh` to run offline tests.
9. `./run_online_tests.sh` to run online tests (relies on Kafka, Zookeeper, and Redis).
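
Once the online tests pass, one rough smoke test is to watch crawl results come off the data bus. The following is a sketch under the same assumptions as the earlier example (`kafka-python` client, default topic names from the docs, with `demo.crawled_firehose` carrying crawled pages):

```python
import json

from kafka import KafkaConsumer

# Topic and broker address are assumptions; match them to your settings.
consumer = KafkaConsumer(
    'demo.crawled_firehose',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Print the url and status code of each crawled page as it arrives.
for message in consumer:
    item = message.value
    print('%s %s' % (item.get('url'), item.get('status_code')))
```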

## Documentation

Please check out our official [Scrapy Cluster documentation](http://scrapy-cluster.readthedocs.org/) for more details on how everything works!

## Branches

The `master` branch of this repository contains the latest stable release code for `Scrapy Cluster 1.0`.
The `master` branch of this repository contains the latest stable release code for `Scrapy Cluster 1.1`.

The `dev` branch contains bleeding edge code and is currently working towards `Scrapy Cluster 1.1`. Please note that not everything is documented, finished, tested, or finalized but we are happy to help guide those who are interested.
The `dev` branch contains bleeding-edge code and is currently working towards [Scrapy Cluster 1.2](https://github.com/istresearch/scrapy-cluster/issues?utf8=%E2%9C%93&q=milestone%3A%22Scrapy+Cluster+1.2%22+). Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.
30 changes: 30 additions & 0 deletions Vagrantfile
@@ -0,0 +1,30 @@
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.require_version ">= 1.7.4"

Vagrant.configure(2) do |config|

  # Configure general VM options
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048
    vb.cpus = 4
  end

  config.vm.define 'scdev' do |node|
    node.vm.box = 'ubuntu/trusty64'
    node.vm.hostname = 'scdev'
    node.vm.network "private_network", ip: "192.168.33.99"
    node.vm.provision "ansible" do |ansible|
      ansible.verbose = true
      ansible.groups = {
        "kafka" => ["scdev"],
        "zookeeper" => ["scdev"],
        "redis" => ["scdev"],
        "all_groups:children" => ["kafka", "zookeeper", "redis"]
      }
      ansible.playbook = "ansible/scrapy-cluster.yml"
    end
    node.vm.provision "shell", inline: "service supervisord restart", run: "always"
  end
end
12 changes: 12 additions & 0 deletions ansible/kafka.yml
@@ -0,0 +1,12 @@
---

- name: Kafka Brokers
  hosts: kafka

  sudo: yes

  vars:
    - kafka_host_list: "{{ groups['kafka'] }}"
    - zookeeper_host_list: "{{ groups['zookeeper'] }}"
  roles:
    - kafka
9 changes: 9 additions & 0 deletions ansible/redis.yml
@@ -0,0 +1,9 @@
---

- name: Redis Master
  hosts: redis

  sudo: yes

  roles:
    - redis
6 changes: 6 additions & 0 deletions ansible/roles/java/defaults/main.yml
@@ -0,0 +1,6 @@
---
# file: roles/common/defaults/main.yml

# The specific version of Oracle Java that can be found in YUM
java_version: 1.7.0_71

3 changes: 3 additions & 0 deletions ansible/roles/java/files/java.sh
@@ -0,0 +1,3 @@
# Initialization script for Java
JAVA_HOME="/usr/java/default"
export JAVA_HOME
54 changes: 54 additions & 0 deletions ansible/roles/java/tasks/main.yml
@@ -0,0 +1,54 @@
---
# file: roles/common/tasks/main.yml

- name: apt install java
  apt:
    name=default-jdk
    state=present
    update-cache=yes
  tags: java
  when: ansible_os_family == "Debian"

- name: yum install java
  yum:
    name=jdk-{{ java_version }}
    state=present
  tags: java
  when: ansible_os_family == "RedHat"

- name: java system environment configuration
  copy:
    src=java.sh
    dest=/etc/profile.d/java.sh
    owner=0
    group=0
    mode=0755
  tags: java

- name: Set JAVA_HOME ansible fact
  set_fact:
    java_home=/usr/java/default
  tags: java

- name: Create Ansible facts.d directory
  file:
    state=directory
    dest=/etc/ansible/facts.d
    owner=0
    group=0
    mode=0755
  tags: java

- name: Install java facts
  template:
    src=facts.j2
    dest=/etc/ansible/facts.d/java.fact
    owner=0
    group=0
    mode=0644
  tags: java

- name: Re-read facts
  setup:
    filter=ansible_local
  tags: java
2 changes: 2 additions & 0 deletions ansible/roles/java/templates/facts.j2
@@ -0,0 +1,2 @@
[general]
java_home={{ java_home }}
20 changes: 20 additions & 0 deletions ansible/roles/kafka/defaults/main.yml
@@ -0,0 +1,20 @@
---

kafka_version: 0.9.0.0

kafka_install_dir: /opt/kafka
kafka_config_dir: /opt/kafka/default/config
kafka_log_dir: /opt/kafka/default/logs
kafka_data_log_dir:
- /opt/kafka/topic-logs

kafka_port: 9092
kafka_message_max: 10000000
kafka_replica_fetch_max_bytes: 15000000
kafka_consumer_message_max: 16777216
kafka_num_partitions: "{{ groups['kafka'] | length }}"
kafka_replication_factor: "{{ groups['kafka'] | length }}"
kafka_log_retention_hours: 168
kafka_num_io_threads: 8

kafka_source: "http://apache.arvixe.com/kafka"
5 changes: 5 additions & 0 deletions ansible/roles/kafka/handlers/main.yml
@@ -0,0 +1,5 @@
---
- name: restart kafka
  supervisorctl:
    name=kafka
    state=restarted
4 changes: 4 additions & 0 deletions ansible/roles/kafka/meta/main.yml
@@ -0,0 +1,4 @@
---
dependencies:
  - { role: supervisord }
  - { role: java }
91 changes: 91 additions & 0 deletions ansible/roles/kafka/tasks/main.yml
@@ -0,0 +1,91 @@
---

- name: create kafka directories
  file:
    path={{ item }}
    state=directory
    mode=0744
  with_items:
    - "{{ kafka_install_dir }}"
    - "{{ kafka_data_log_dir }}"
  tags: kafka

- name: check for existing install
  stat: path={{ kafka_install_dir }}/kafka_2.11-{{ kafka_version }}
  register: kafka
  tags: kafka

- name: download kafka
  get_url:
    url="{{ kafka_source }}/{{ kafka_version }}/kafka_2.11-{{ kafka_version }}.tgz"
    dest=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    mode=0644
    validate_certs=no
  when: kafka.stat.isdir is not defined
  tags: kafka

- name: extract kafka
  unarchive:
    src=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    dest={{ kafka_install_dir }}
    copy=no
  when: kafka.stat.isdir is not defined
  tags: kafka

- name: delete temporary kafka file
  file:
    path=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    state=absent
  ignore_errors: yes
  tags: kafka

- name: create kafka symlink
  file:
    path={{ kafka_install_dir }}/default
    state=link
    src={{ kafka_install_dir }}/kafka_2.11-{{ kafka_version }}
  tags: kafka

- name: configure kafka brokers
  template:
    src=server.properties.j2
    dest={{ kafka_config_dir }}/server.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: configure log4j
  template:
    src=log4j.properties.j2
    dest={{ kafka_config_dir }}/log4j.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: configure kafka consumer
  template:
    src=consumer.properties.j2
    dest={{ kafka_config_dir }}/consumer.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: copy supervisord config
  template:
    src=kafka-supervisord.conf.j2
    dest={{ supervisord_programs_dir }}/kafka-supervisord.conf
    mode=0644
  notify:
    - reread supervisord
  tags: kafka

- name: set up aliases
  lineinfile:
    dest: "/root/.bashrc"
    line: "export KAFKA={{ kafka_install_dir }}/default"
  tags: env

- cron: name="clear old kafka app logs" job="find /opt/kafka/default/logs -mtime +7 -exec rm {} \; > /dev/null" minute="0"
32 changes: 32 additions & 0 deletions ansible/roles/kafka/templates/consumer.properties.j2
@@ -0,0 +1,32 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.consumer.ConsumerConfig for more details

# Zookeeper connection string
# comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"
zookeeper.connect=127.0.0.1:2181

# timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

#consumer group id
group.id=test-consumer-group

#consumer timeout
#consumer.timeout.ms=5000

# Need to increase this to play nice with message.max.bytes = 10000000
fetch.message.max.bytes={{ kafka_consumer_message_max }}