How to prevent Splash sending its default headers i.e. 'Host'? #251

Open
iamumairayub opened this issue Feb 3, 2020 · 0 comments

I deployed Splash (in Docker) about a month ago on my dedicated server.

I am trying to scrape a website with Scrapy Splash, but I get the following error no matter how many times I try that URL:

([scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.website.com via http://localhost:8050/render.html> (failed 1 times): User timeout caused connection failure: Getting http://localhost:8050/render.html took longer than 80.0 seconds..)

Meanwhile, the same Splash server successfully scrapes every other site I try.

If I fetch the above URL from my server with cURL or a plain scrapy.Request, it works. The site does not block me no matter how many times I scrape it that way.

That gave me the idea to check which headers Splash is sending. I inspected the Splash request headers via http://httpbin.org/get and found that it automatically adds a few headers.
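
For anyone who wants to repeat that check, here is a minimal sketch of one way to do it with only the Python standard library. It assumes Splash is reachable at the address from the log above, uses the `url` and `timeout` arguments of the `render.html` endpoint, and assumes the rendered body contains httpbin's JSON echo somewhere inside the returned HTML. The helper names `build_render_url` and `extract_echoed_headers` are made up for this illustration and are not part of Splash or scrapy-splash.

```python
import html
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash listens on the address shown in the log above.
SPLASH_RENDER = "http://localhost:8050/render.html"


def build_render_url(target_url, timeout=30, splash_render=SPLASH_RENDER):
    """Build a render.html GET URL that asks Splash to fetch target_url."""
    query = urlencode({"url": target_url, "timeout": timeout})
    return f"{splash_render}?{query}"


def extract_echoed_headers(rendered_body):
    """Return the "headers" object from httpbin's JSON echo.

    Splash returns the page as HTML, so the JSON may be wrapped in markup.
    This takes the text between the first "{" and the last "}".
    """
    text = html.unescape(rendered_body)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in rendered body")
    return json.loads(text[start:end + 1]).get("headers", {})


if __name__ == "__main__":
    # Needs a running Splash instance and outbound network access.
    with urlopen(build_render_url("http://httpbin.org/get")) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    for name, value in sorted(extract_echoed_headers(body).items()):
        print(f"{name}: {value}")
```

Running the same target through cURL and comparing the two header lists should show which headers come from Splash itself.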

So now I know that Splash is sending "Host": "website.com" to the target site, which prevents that website from being scraped.

The question is: how do I make Splash not send any headers automatically? Or at least stop Splash from sending the Host header?
