I deployed Splash (in Docker) about a month ago on my dedicated server.
I am trying to scrape a website with Scrapy Splash, but I get the following error no matter how many times I try that URL:
([scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.website.com via http://localhost:8050/render.html> (failed 1 times): User timeout caused connection failure: Getting http://localhost:8050/render.html took longer than 80.0 seconds..)
Meanwhile, the same Splash server successfully scrapes every other site I try.
If I fetch the above URL from my server with cURL or a plain scrapy.Request, it works. The site does not block me no matter how many times I request it that way.
Then I had the idea to check which headers Splash is sending. I inspected the Splash request headers via http://httpbin.org/get and found that it automatically adds a few headers.
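For anyone who wants to reproduce that check, here is a minimal sketch of the comparison. It assumes Splash is listening on localhost:8050 as in the log above, and that the httpbin JSON comes back embedded in the HTML that render.html returns. The helper names (`splash_render_url`, `fetch_echoed_headers`, `added_headers`) are made up for illustration and are not part of Splash or Scrapy.

```python
import html
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050"  # assumption: Splash address taken from the log


def splash_render_url(target, splash=SPLASH, timeout=30):
    """Build a render.html URL that asks Splash to load `target`."""
    return f"{splash}/render.html?" + urlencode({"url": target, "timeout": timeout})


def fetch_echoed_headers(url):
    """Fetch `url` and return the "headers" object that httpbin echoes back.

    Works for a direct httpbin URL and, assuming the JSON sits inside the
    rendered page body, for a Splash render.html URL as well.
    """
    with urlopen(url, timeout=90) as resp:
        body = html.unescape(resp.read().decode("utf-8", "replace"))
    match = re.search(r"\{.*\}", body, re.S)  # outermost JSON object in the body
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))["headers"]


def added_headers(direct, via_splash):
    """Headers present in the Splash request but not in the direct one.

    Header names are compared case-insensitively.
    """
    direct_names = {name.lower() for name in direct}
    return {k: v for k, v in via_splash.items() if k.lower() not in direct_names}
```

Calling `added_headers(fetch_echoed_headers("http://httpbin.org/get"), fetch_echoed_headers(splash_render_url("http://httpbin.org/get")))` should then list what Splash adds on top of a plain request. Note that the direct request here goes through urllib, so its own default headers differ from what cURL or Scrapy would send.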
So now I know that Splash is sending "Host": "website.com" to the target site, which prevents that site from being scraped.
The question is: how do I make Splash not send any headers automatically? Or at least, how do I stop Splash from sending the Host header?