Minor typos in documentation
This should be the last commit before 1.0
Madison Bahmer authored and committed May 21, 2015
1 parent b24eee4 commit 427e669
Showing 6 changed files with 8 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -13,4 +13,4 @@ Serving on http://127.0.0.1:8000
...
```

in order to view the documentation as you live edit it on your machine. Note that the `default` theme is overridden by readthedocs when uploading, so don't mind that the local documentation is different from what you see online.
You will now be able to view the documentation as you live edit it on your machine. Note that the `default` theme is overridden by readthedocs when uploading, so don't mind that the local documentation is different from what you see online.
2 changes: 0 additions & 2 deletions docs/index.rst
@@ -8,8 +8,6 @@ Scrapy Cluster |version| Documentation

This documentation provides everything you need to know about the Scrapy based distributed crawling project, Scrapy Cluster.

.. note:: As of 5/15/15 an official tagged release is getting close, we are just trying to ensure everything works easily for new users and iron out any linger documentation issues. Thank you for your patience and interest!

.. toctree::
:hidden:
:maxdepth: 1
4 changes: 2 additions & 2 deletions docs/topics/advanced.rst
@@ -46,7 +46,7 @@ As a friendly reminder, the following processes should be monitored:
Scrapy Cluster Response Time
----------------------------

The Scrapy Cluster Response time is dependent on two factors:
The Scrapy Cluster Response time is dependent on a number of factors:

- How often the Kafka Monitor polls for new messages

@@ -59,7 +59,7 @@ The Scrapy Cluster Response time is dependent on two factors:

With the Kafka Monitor constantly monitoring the topic, there is very little latency for getting a request into the system. The bottleneck occurs mainly in the core Scrapy crawler code.

The more crawlers you have running and spread across the cluster, the lower the average response time will be for a crawler to receive a request. For example if a single spider goes idle for 5 seconds, you would expect a your maximum response time to be 5 seconds, the minimum response time to be 0 seconds, but on average your response time should be 2.5 seconds for one spider. As you increase the number of spiders in the system the likelihood that one spider is polling also increases, and the cluster performance will go up.
The more crawlers you have running and spread across the cluster, the lower the average response time will be for a crawler to receive a request. For example, if a single spider goes idle and then polls every 5 seconds, you would expect your maximum response time to be 5 seconds, your minimum response time to be 0 seconds, and your average response time to be about 2.5 seconds for one spider. As you increase the number of spiders in the system, the likelihood that one spider is polling also increases, and the cluster's performance will go up.

The final bottleneck in response time is how quickly the request can be conducted by Scrapy, which depends on the speed of the internet connection(s) you are running the Scrapy Cluster behind. This final part is out of control of the Scrapy Cluster itself.
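To make the numbers above concrete, the following back of the envelope simulation (not part of Scrapy Cluster itself) estimates how long a new request waits before an idle spider polls for it, assuming every spider polls on a fixed interval with a uniformly random offset::

    import random

    def expected_wait(poll_interval=5.0, num_spiders=1, trials=100000):
        """Estimate how long a request sits in the queue before the first
        idle spider polls and picks it up."""
        total = 0.0
        for _ in range(trials):
            # Time until each spider's next poll, uniform over its interval.
            next_polls = [random.uniform(0, poll_interval) for _ in range(num_spiders)]
            total += min(next_polls)  # the first spider to poll wins the request
        return total / trials

    for n in (1, 2, 5, 10):
        print("%d spider(s) -> %.2f seconds" % (n, expected_wait(num_spiders=n)))

For one spider this converges to roughly 2.5 seconds (half of the 5 second poll interval) and falls toward zero as more idle spiders are added, matching the behavior described above.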

4 changes: 2 additions & 2 deletions docs/topics/crawler.rst
@@ -46,7 +46,7 @@ Each crawl job that is submitted to the cluster is given a priority, and for eve
:alt: Breadth First
:align: center

As you can see above. the initial seed url generates 4 new links. Since we are using a priority based queue, the spiders continue to pop from the highest priority crawl request, and then decrease the priority for level deep they are from the parent request. Any new links are fed back into the same exact queue mechanism but with a lower priority to allow for the equal levels links to be crawled first.
As you can see above, the initial seed url generates 4 new links. Since we are using a priority based queue, the spiders continue to pop the highest priority crawl request, and then decrease the priority for each level deep they are from the parent request. Any new links are fed back into the same queue mechanism, but with a lower priority so that links at the same level are crawled first.

When a spider encounters a link it has already seen, the duplication filter based on the request’s ``crawlid`` will filter it out. The spiders will continue to traverse the resulting graph generated until they have reached either their maximum link depth or have exhausted all possible urls.
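The queue behavior described above can be modeled with a small, single process sketch. This is only a toy illustration, not the Redis based queue the cluster actually uses, and the priority step of 10 per depth level is an arbitrary assumption::

    import heapq

    def crawl_order(seed_url, link_graph, max_depth=2, start_priority=90, step=10):
        """Pop the highest priority request, then enqueue its child links one
        priority step lower, so links at the same level are crawled before
        deeper ones (breadth first)."""
        seen = set()  # stand-in for the crawlid based duplication filter
        queue = [(-start_priority, seed_url, 0)]  # heapq is a min-heap, so negate
        order = []
        while queue:
            neg_priority, url, depth = heapq.heappop(queue)
            if url in seen:
                continue
            seen.add(url)
            order.append((url, -neg_priority, depth))
            if depth < max_depth:
                for child in link_graph.get(url, []):
                    heapq.heappush(queue, (neg_priority + step, child, depth + 1))
        return order

    # The seed yields 4 links at priority 80, their children at 70, and so on.
    pages = {"seed": ["a", "b", "c", "d"], "a": ["e", "f"], "b": ["e"]}
    print(crawl_order("seed", pages))

Running the sketch prints the seed first, then all four level one links, then the level two links, with the duplicate ``e`` link filtered out.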

@@ -144,7 +144,7 @@ redis\_spider.py

A base class that extends the default Scrapy Spider so we can crawl continuously in cluster mode. All you need to do is implement the ``parse`` method and everything else is taken care of behind the scenes.

.. note:: There is a method within this class called ``reconstruct_headers()`` that is very important you take advantage of! The issue we ran into was that we were dropping data in our headers fields when encoding the item into json. The Scrapy shell didn’t see this issue, print statements couldn’t find it, but it boiled down to the python list being treated as a single element. We think this may be a formal defect in Python 2.7 but have not made an issue yet as the bug needs much more testing.*
.. note:: There is a method within this class called ``reconstruct_headers()`` that is very important to take advantage of! The issue we ran into was that we were dropping data in our headers fields when encoding the item into json. The Scrapy shell didn’t see this issue, print statements couldn’t find it, but it boiled down to the python list being treated as a single element. We think this may be a formal defect in Python 2.7 but have not made an issue yet as the bug needs much more testing.
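A hypothetical spider built on this base class might look like the sketch below. The class name, import path, and the exact signature of ``reconstruct_headers()`` are assumptions based on the description above, so check the source before copying it::

    from redis_spider import RedisSpider  # assumed module and class name


    class MySpider(RedisSpider):
        name = "my_spider"

        def parse(self, response):
            item = {
                "url": response.url,
                "status": response.status,
                "body": response.body,
                # Header values arrive as lists; the note above says to run them
                # through the base class helper so multi-valued headers are not
                # dropped when the item is later encoded into json.
                "headers": self.reconstruct_headers(dict(response.headers)),
            }
            yield item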

link\_spider.py
^^^^^^^^^^^^^^^
6 changes: 3 additions & 3 deletions docs/topics/kafkamonitor.rst
@@ -37,7 +37,7 @@ Design Considerations

The design of the Kafka Monitor stemmed from the need to define a format that allowed for the creation of crawls in the crawl architecture from any application. If the application could read and write to the kafka cluster then it could write messages to a particular kafka topic to create crawls.

Soon enough those same applications wanted the ability to retrieve and stop their crawls from that same interface, so we decided to make a dynamic interface that could support all of the request needs, but utilize the same base code. In the future this base code could expanded to handle any different style of request, as long as there was an validation of the request and a place to send the result to.
Soon enough those same applications wanted the ability to retrieve and stop their crawls from that same interface, so we decided to make a dynamic interface that could support all of the request needs, but utilize the same base code. In the future this base code could be expanded to handle any different style of request, as long as there is validation of the request and a place to send the result to.

From our own internal debugging and ensuring other applications were working properly, a utility program was also created on the side in order to be able to interact and monitor the kafka messages coming through. This dump utility can be used to monitor any of the Kafka topics within the cluster.
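For example, an application that already talks to Kafka could feed a crawl without the bundled script at all. The sketch below uses the ``kafka-python`` client; the broker address and topic name are placeholders for whatever your Kafka Monitor is configured to listen on::

    import json

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    request = {
        "url": "http://www.dmoz.org/",
        "appid": "testapp",
        "crawlid": "abc123",
        "maxdepth": 2,
        "priority": 90,
    }

    # "demo.incoming_urls" is illustrative; use your configured incoming topic.
    producer.send("demo.incoming_urls", json.dumps(request).encode("utf-8"))
    producer.flush()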

@@ -86,7 +86,7 @@ Example Crawl Requests:

python kafka-monitor.py feed '{"url": "http://www.dmoz.org/", "appid":"testapp", "crawlid":"abc123", "maxdepth":2, "priority":90}' -s settings_crawling.py

- Submits a dmoz.org crawl spidering 2 links deep with a high priority
- Submits a dmoz.org crawl spidering 2 levels deep with a high priority

::

@@ -160,7 +160,7 @@ Optional:

- **priority:** The priority given to the url to be crawled. The Spiders will crawl the highest priorities first.

- **allowed_domains:** A list of domains that the crawl should stay within. For example, putting [ "cnn.com" ] will only continue to crawl links of that domain.
- **allowed_domains:** A list of domains that the crawl should stay within. For example, putting ``[ "cnn.com" ]`` will only continue to crawl links of that domain.

- **allow_regex:** A list of regular expressions to apply to the links to crawl. Any hit from any regex will allow that link to be crawled next.
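Putting the optional fields together, a request that keeps a crawl within a single domain and a url pattern could look like the following (the values are only illustrative)::

    python kafka-monitor.py feed '{"url": "http://www.cnn.com/", "appid":"testapp", "crawlid":"cnn123", "maxdepth":2, "priority":80, "allowed_domains":["cnn.com"], "allow_regex":[".*politics.*"]}' -s settings_crawling.py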

2 changes: 0 additions & 2 deletions docs/topics/overview.rst
@@ -7,8 +7,6 @@ The goal is to distribute seed URLs among many waiting spider instances, whose r

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.

.. note:: As of 4/27/15 an official tagged release is getting close, we are just trying to consolidate documentation and ensure everything works easily for new users. Thank you for your patience and interest! If you would like to jump right in anyways, the :doc:`./quickstart` guide is complete.

Dependencies
------------

