Allow spider attr #15

Open
wants to merge 7 commits into master
Changes from 1 commit
[scrashtest] add another test spider
pawelmhm committed Apr 3, 2015
commit 652fd6ee050287d0f69432b1bec98d38407ffca0
31 changes: 31 additions & 0 deletions example/scrashtest/spiders/dmoz_two.py
@@ -0,0 +1,31 @@
# -*- coding: utf-8 -*-
from urlparse import urljoin
import json

import scrapy
from scrapy.contrib.linkextractors import LinkExtractor


class DmozSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ['http://www.isjavascriptenabled.com/']
Member

-1 to adding tests which fetch remote URLs

Author

Well, it's not really a test, just an extra spider alongside the existing dmoz spider. I used it for development; it can be removed, no problem.
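
For reference, a minimal sketch of how the same JS check could run without fetching a remote host: serve a static fixture page from localhost and point start_urls at it. Everything here (file layout, port) is an illustrative assumption, not part of this PR:

# Sketch only: serve a static fixture page on localhost so the spider
# never fetches a remote URL. Assumes a test_pages/index.html that
# contains the <h1>YES</h1> marker the spider looks for.
import SimpleHTTPServer  # Python 2 stdlib, matching this repo's code
import SocketServer
import threading

def serve_test_pages(port=8998):
    handler = SimpleHTTPServer.SimpleHTTPRequestHandler
    httpd = SocketServer.TCPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=httpd.serve_forever)
    thread.daemon = True
    thread.start()
    return httpd

# the spider would then use a local start URL instead:
# start_urls = ['http://127.0.0.1:8998/test_pages/index.html']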

    # per-spider Splash settings; this attribute is what the PR adds support for
    splash = {'args': {'har': 1, 'html': 1}}

    def parse(self, response):
        is_js = response.xpath("//h1/text()").extract()
        if "".join(is_js).lower() == "yes":
            self.log("JS enabled!")
        else:
            self.log("Error! JS disabled!", scrapy.log.ERROR)
        le = LinkExtractor()

        for link in le.extract_links(response):
            url = urljoin(response.url, link.url)
            yield scrapy.Request(url, self.parse_link)
            break

    def parse_link(self, response):
        title = response.xpath("//title").extract()
        yes = response.xpath("//h1").extract()
        self.log("response is: {}".format(repr(response)))
        self.log(u"Html in response contains {} {}".format("".join(title), "".join(yes)))
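
The spider-level splash attribute above is the feature the PR title refers to. As a rough sketch of how a downloader middleware might consume such an attribute, the snippet below reroutes requests through Splash's render.json endpoint; the class name, meta flag, and endpoint address are illustrative assumptions, not this project's actual middleware:

# Sketch only, not the middleware shipped in this repository: read a
# spider-level `splash` dict and reroute each request through a Splash
# HTTP endpoint (assumed to run at the default localhost address).
import json

class SpiderSplashMiddleware(object):
    splash_url = 'http://127.0.0.1:8050/render.json'  # assumed Splash location

    def process_request(self, request, spider):
        splash_options = getattr(spider, 'splash', None)
        if splash_options is None or request.meta.get('_splash_processed'):
            return  # spider opted out, or request already rewritten
        args = dict(splash_options.get('args', {}))
        args['url'] = request.url
        meta = dict(request.meta)
        meta['_splash_processed'] = True
        # returning a new Request makes Scrapy schedule it in place of
        # the original one
        return request.replace(
            url=self.splash_url,
            method='POST',
            body=json.dumps(args),
            headers={'Content-Type': 'application/json'},
            meta=meta,
        )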