Merge pull request #49 from istresearch/dev
Scrapy Cluster 1.1 Merge
Madison Bahmer committed Feb 23, 2016
2 parents 7b1c109 + 422d72d commit a8b611c
Showing 195 changed files with 16,896 additions and 1,764 deletions.
37 changes: 37 additions & 0 deletions .gitignore
@@ -1,3 +1,40 @@
# Python binaries
*.pyc

# Sphinx
docs/_build
docs/_build_html

# OSX garbage
.DS_STORE

# Scrapy Cluster
kafka-monitor/logs/*
redis-monitor/logs/*
crawler/logs/*
crawler/main.log
localsettings.py

# Vagrant test VM
.vagrant
local/
bin/
pip-selfcheck.json

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
24 changes: 24 additions & 0 deletions .travis.yml
@@ -0,0 +1,24 @@
sudo: false

language: python

services:
- redis-server


install:
# Install conda
- wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh
- bash miniconda.sh -b -p $HOME/miniconda
- export PATH="$HOME/miniconda/bin:$PATH"
- conda config --set always_yes yes --set changeps1 no
- conda update conda
- conda install pip

# install requirements
- conda env create -f ./conda_env.yml


script:
- source activate sc; ./run_offline_tests.sh

27 changes: 22 additions & 5 deletions README.md
@@ -1,6 +1,6 @@
# Scrapy Cluster

[![Join the chat at https://gitter.im/istresearch/scrapy-cluster](https://badges.gitter.im/istresearch/scrapy-cluster.svg)](https://gitter.im/istresearch/scrapy-cluster?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Build Status](https://travis-ci.org/istresearch/scrapy-cluster.svg)](https://travis-ci.org/istresearch/scrapy-cluster) [![Join the chat at https://gitter.im/istresearch/scrapy-cluster](https://badges.gitter.im/istresearch/scrapy-cluster.svg)](https://gitter.im/istresearch/scrapy-cluster?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.

@@ -26,17 +26,34 @@ This project tries to bring together a bunch of new concepts to Scrapy and large
- The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
- Scale Scrapy instances across a single machine or multiple machines
- Coordinate and prioritize their scraping effort for desired sites
- Persist across scraping jobs or have multiple scraping jobs going at the same time
- Allows for unparalleled access into the information about your scraping job, what is upcoming, and how the sites are ranked
- Persist data across scraping jobs
- Execute multiple scraping jobs concurrently
- Allows for in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
- Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
- Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results); see the sketch after this list
- Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address

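To give a feel for the Kafka data bus, here is a minimal sketch of submitting a crawl request from an outside Python application. The `kafka-python` client, the `demo.incoming` topic, and the `localhost:9092` broker address are assumptions drawn from the documentation's defaults; adjust them to your deployment.

```python
import json

from kafka import KafkaProducer

# Broker address is an assumption; point this at your Kafka cluster.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# A crawl request: appid identifies the submitting application,
# crawlid identifies this particular job.
producer.send('demo.incoming', {
    'url': 'http://istresearch.com',
    'appid': 'testapp',
    'crawlid': 'abc1234',
})
producer.flush()
```
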
## Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have the latest **VirtualBox** + **Vagrant >= 1.7.4** installed. Vagrant will automatically mount the base **scrapy-cluster** directory to the **/vagrant** directory, so any code changes you make will be visible inside the VM.

### Steps to launch the test environment:
1. `vagrant up` in the base **scrapy-cluster** directory.
2. `vagrant ssh` to ssh into the VM.
3. `sudo supervisorctl status` to check that everything is running.
4. `cd /vagrant` to get to the **scrapy-cluster** directory.
5. `conda create -n sc scrapy --yes` to create a conda virtualenv with Scrapy pre-installed.
6. `source activate sc` to activate your virtual environment.
7. `pip install -r requirements.txt` to install Scrapy Cluster dependencies.
8. `./run_offline_tests.sh` to run offline tests.
9. `./run_online_tests.sh` to run online tests (relies on Kafka, Zookeeper, and Redis).
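
Once the online tests pass, one rough smoke test is to watch crawl results come off the data bus. The following is a sketch under the same assumptions as the earlier example (`kafka-python` client, default topic names from the docs, with `demo.crawled_firehose` carrying crawled pages):

```python
import json

from kafka import KafkaConsumer

# Topic and broker address are assumptions; match them to your settings.
consumer = KafkaConsumer(
    'demo.crawled_firehose',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Print the url and status code of each crawled page as it arrives.
for message in consumer:
    item = message.value
    print('%s %s' % (item.get('url'), item.get('status_code')))
```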

## Documentation

Please check out our official [Scrapy Cluster documentation](http://scrapy-cluster.readthedocs.org/) for more details on how everything works!

## Branches

The `master` branch of this repository contains the latest stable release code for `Scrapy Cluster 1.0`.
The `master` branch of this repository contains the latest stable release code for `Scrapy Cluster 1.1`.

The `dev` branch contains bleeding edge code and is currently working towards `Scrapy Cluster 1.1`. Please note that not everything is documented, finished, tested, or finalized but we are happy to help guide those who are interested.
The `dev` branch contains bleeding-edge code and is currently working towards [Scrapy Cluster 1.2](https://github.com/istresearch/scrapy-cluster/issues?utf8=%E2%9C%93&q=milestone%3A%22Scrapy+Cluster+1.2%22+). Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.
30 changes: 30 additions & 0 deletions Vagrantfile
@@ -0,0 +1,30 @@
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.require_version ">= 1.7.4"

Vagrant.configure(2) do |config|

  # Configure general VM options
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048
    vb.cpus = 4
  end

  config.vm.define 'scdev' do |node|
    node.vm.box = 'ubuntu/trusty64'
    node.vm.hostname = 'scdev'
    node.vm.network "private_network", ip: "192.168.33.99"
    node.vm.provision "ansible" do |ansible|
      ansible.verbose = true
      ansible.groups = {
        "kafka" => ["scdev"],
        "zookeeper" => ["scdev"],
        "redis" => ["scdev"],
        "all_groups:children" => ["kafka", "zookeeper", "redis"]
      }
      ansible.playbook = "ansible/scrapy-cluster.yml"
    end
    node.vm.provision "shell", inline: "service supervisord restart", run: "always"
  end
end
12 changes: 12 additions & 0 deletions ansible/kafka.yml
@@ -0,0 +1,12 @@
---

- name: Kafka Brokers
  hosts: kafka

  sudo: yes

  vars:
    - kafka_host_list: "{{ groups['kafka'] }}"
    - zookeeper_host_list: "{{ groups['zookeeper'] }}"
  roles:
    - kafka
9 changes: 9 additions & 0 deletions ansible/redis.yml
@@ -0,0 +1,9 @@
---

- name: Redis Master
  hosts: redis

  sudo: yes

  roles:
    - redis
6 changes: 6 additions & 0 deletions ansible/roles/java/defaults/main.yml
@@ -0,0 +1,6 @@
---
# file: roles/common/defaults/main.yml

# The specific version of Oracle Java that can be found in YUM
java_version: 1.7.0_71

3 changes: 3 additions & 0 deletions ansible/roles/java/files/java.sh
@@ -0,0 +1,3 @@
# Initialization script for Java
JAVA_HOME="/usr/java/default"
export JAVA_HOME
54 changes: 54 additions & 0 deletions ansible/roles/java/tasks/main.yml
@@ -0,0 +1,54 @@
---
# file: roles/common/tasks/main.yml

- name: apt install java
  apt:
    name=default-jdk
    state=present
    update-cache=yes
  tags: java
  when: ansible_os_family == "Debian"

- name: yum install java
  yum:
    name=jdk-{{ java_version }}
    state=present
  tags: java
  when: ansible_os_family == "RedHat"

- name: java system environment configuration
  copy:
    src=java.sh
    dest=/etc/profile.d/java.sh
    owner=0
    group=0
    mode=0755
  tags: java

- name: Set JAVA_HOME ansible fact
  set_fact:
    java_home=/usr/java/default
  tags: java

- name: Create Ansible facts.d directory
  file:
    state=directory
    dest=/etc/ansible/facts.d
    owner=0
    group=0
    mode=0755
  tags: java

- name: Install java facts
  template:
    src=facts.j2
    dest=/etc/ansible/facts.d/java.fact
    owner=0
    group=0
    mode=0644
  tags: java

- name: Re-read facts
  setup:
    filter=ansible_local
  tags: java
2 changes: 2 additions & 0 deletions ansible/roles/java/templates/facts.j2
@@ -0,0 +1,2 @@
[general]
java_home={{ java_home }}
20 changes: 20 additions & 0 deletions ansible/roles/kafka/defaults/main.yml
@@ -0,0 +1,20 @@
---

kafka_version: 0.9.0.0

kafka_install_dir: /opt/kafka
kafka_config_dir: /opt/kafka/default/config
kafka_log_dir: /opt/kafka/default/logs
kafka_data_log_dir:
- /opt/kafka/topic-logs

kafka_port: 9092
kafka_message_max: 10000000
kafka_replica_fetch_max_bytes: 15000000
kafka_consumer_message_max: 16777216
kafka_num_partitions: "{{ groups['kafka'] | length }}"
kafka_replication_factor: "{{ groups['kafka'] | length }}"
kafka_log_retention_hours: 168
kafka_num_io_threads: 8

kafka_source: "http://apache.arvixe.com/kafka"
5 changes: 5 additions & 0 deletions ansible/roles/kafka/handlers/main.yml
@@ -0,0 +1,5 @@
---
- name: restart kafka
  supervisorctl:
    name=kafka
    state=restarted
4 changes: 4 additions & 0 deletions ansible/roles/kafka/meta/main.yml
@@ -0,0 +1,4 @@
---
dependencies:
  - { role: supervisord }
  - { role: java }
91 changes: 91 additions & 0 deletions ansible/roles/kafka/tasks/main.yml
@@ -0,0 +1,91 @@
---

- name: create kafka directories
  file:
    path={{ item }}
    state=directory
    mode=0744
  with_items:
    - "{{ kafka_install_dir }}"
    - "{{ kafka_data_log_dir }}"
  tags: kafka

- name: check for existing install
  stat: path={{ kafka_install_dir }}/kafka_2.11-{{ kafka_version }}
  register: kafka
  tags: kafka

- name: download kafka
  get_url:
    url="{{ kafka_source }}/{{ kafka_version }}/kafka_2.11-{{ kafka_version }}.tgz"
    dest=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    mode=0644
    validate_certs=no
  when: kafka.stat.isdir is not defined
  tags: kafka

- name: extract kafka
  unarchive:
    src=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    dest={{ kafka_install_dir }}
    copy=no
  when: kafka.stat.isdir is not defined
  tags: kafka

- name: delete temporary kafka file
  file:
    path=/tmp/kafka_2.11-{{ kafka_version }}.tgz
    state=absent
  ignore_errors: yes
  tags: kafka

- name: create kafka symlink
  file:
    path={{ kafka_install_dir }}/default
    state=link
    src={{ kafka_install_dir }}/kafka_2.11-{{ kafka_version }}
  tags: kafka

- name: configure kafka brokers
  template:
    src=server.properties.j2
    dest={{ kafka_config_dir }}/server.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: configure log4j
  template:
    src=log4j.properties.j2
    dest={{ kafka_config_dir }}/log4j.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: configure kafka consumer
  template:
    src=consumer.properties.j2
    dest={{ kafka_config_dir }}/consumer.properties
    mode=0644
  notify:
    - restart kafka
  tags: kafka

- name: copy supervisord config
  template:
    src=kafka-supervisord.conf.j2
    dest={{ supervisord_programs_dir }}/kafka-supervisord.conf
    mode=0644
  notify:
    - reread supervisord
  tags: kafka

- name: set up aliases
  lineinfile:
    dest: "/root/.bashrc"
    line: "export KAFKA={{ kafka_install_dir }}/default"
  tags: env

- cron: name="clear old kafka app logs" job="find /opt/kafka/default/logs -mtime +7 -exec rm {} \; > /dev/null" minute="0"
32 changes: 32 additions & 0 deletions ansible/roles/kafka/templates/consumer.properties.j2
@@ -0,0 +1,32 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.consumer.ConsumerConfig for more details

# Zookeeper connection string
# comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"
zookeeper.connect=127.0.0.1:2181

# timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

#consumer group id
group.id=test-consumer-group

#consumer timeout
#consumer.timeout.ms=5000

# Need to increase this to play nice with message.max.bytes = 10000000
fetch.message.max.bytes={{ kafka_consumer_message_max }}