Mastering Web Scraping in Python: Crawling from Scratch (zenrows.com)
233 points by maltz 3 months ago | hide | past | favorite | 81 comments

I’ve done a fair share of scraping, and I learned that on a large scale, there are a lot of cross-cutting repetitive concerns. Things like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…

My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.

[0]: https://github.com/nathell/skyscraper

In developing this what were some sites used to test it, what was the desired data and format of the data to be extracted, and what was the most challenging of those sites.

Thanks for the interest!

My most extensive use of Skyscraper to date has been to produce a structured dataset of proceedings, including individual voting results, of Central European parliaments (~500K total pages scraped, ~100M entries). I’ll do a full writeup at some point.

Shameless request for scraping enthusiasts at: https://www.pdap.io, an open source Police Data Accessibility project started on HN and Reddit. Our goal is scraping and collating all county level public records, giving us a dataset to enable "Policing the Police"

It seems like your primary call to action on your site is donating when I don't even really know what I am working with or looking at on your site. I think you need a big clear button pointing people to the data and how to get it.

Very into the idea... how is it going?

This is more of a beginner's guide than a master class. This method will not extract most content on modern websites because of the way JavaScript behaves on them. It also scales vertically, not horizontally. There are many other reasons this is only step one of web scraping.

It's part of a series of blog posts that talks explicitly about crawling. There are indeed other links that do better explaining advanced extraction techniques.

Extraction => https://www.zenrows.com/blog/mastering-web-scraping-in-pytho...

Avoid blocking => https://www.zenrows.com/blog/stealth-web-scraping-in-python-...

ok but do you offer custom scraping services if i needed to hire someone to build it?

thank you

I worked on a large web scraper for several years and JavaScript almost never needs to be executed. The only times I've had to were to extract obfuscated links that are revealed by some bit-twiddling code, specific to each request, and that was achievable by forking out to Deno.

I think JavaScript comes up because Cloudflare uses some kind of JavaScript challenge as part of its DDoS protection. There are Python libraries that know how to deal with it, or you can use some level of headless browser. https://github.com/VeNoMouS/cloudscraper

This is highly domain (and sometimes User-Agent) dependent and in my experience JS is required more and more.

e.g. good luck trying to get much out of youtube.com (or any other video site) without executing JS.

YouTube has "var ytInitialData" & "var ytInitialPlayerResponse" params hardcoded in HTML. No need to run JS!
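A minimal sketch of that approach: the JSON blob sits in an inline `<script>` tag, so a regex plus `json.loads` is enough. The `html` string below is a tiny stand-in for a fetched watch page (real pages embed the same `var ytInitialData = {...};` pattern, though the payload is far larger and can contain tricky edge cases).

```python
import json
import re

# Stand-in for a fetched YouTube watch page; real pages embed the same pattern.
html = """
<script>var ytInitialData = {"contents": {"title": "Example video"}};</script>
"""

# Grab the object literal between "var ytInitialData =" and the trailing ";".
match = re.search(r"var ytInitialData\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
data = json.loads(match.group(1))

print(data["contents"]["title"])  # -> Example video
```

The same trick works for `ytInitialPlayerResponse`; no browser or JS engine needed.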

This is something I find a lot of web scraping tools miss. Are there any you'd recommend that specifically deal with things like async JavaScript content loading, or loading content based on what you click on a page (e.g., in Single Page Apps)?

Javascript content loading is easier in most cases. Just look at your browser network inspector and grab the URL.

Usually the response is in JSON and you can ignore the original page. You might have to auth/grab session cookies first, but that's still easier than working with the HTML.

Playwright. It can be easily used with JS, Python, Go, Java, etc.

Thanks! Is that like using Selenium? (i.e., you have to manage and code the actions yourself)

Yes, quite similar. According to their definition it is a "library to automate Chromium, Firefox and WebKit with a single API."

Thanks! If there are any third-party managed tools to do this, that would be awesome to know about (i.e., where they somehow run common JS functions/site interactions to test for additional content).

Unfortunately, it's a pathological edge case.

Imagine an async-loaded list, that continues loading more content as it comes in, until it displays all of the content available to the backend.

When would you know such a list is finished loading?

This sounds insane, but it's pretty easy and common for an ambitious UXer to key in on, and is something I've seen in production pages.

(In the event you are a UXer, please include some sort of status update! Even an overlaid spinner that disappears solves the problem.)
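One framework-agnostic way to handle "when is the list done?" is a bounded poll: keep checking a "settled" condition (spinner gone, or item count stable across two consecutive checks) until it holds or a deadline passes. A sketch, with the item counter simulated; in practice `get_item_count` would be a Playwright/Selenium locator count.

```python
import time

def wait_until_settled(get_item_count, timeout=10.0, interval=0.1):
    """Poll until the item count stops changing between checks, or time out."""
    deadline = time.monotonic() + timeout
    last = get_item_count()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = get_item_count()
        if current == last:   # no new items since the last check: settled
            return current
        last = current
    raise TimeoutError("list never settled")

# Simulated async-loading list: grows on each poll until it reaches 5 items.
counts = iter([1, 2, 3, 4, 5, 5, 5])
print(wait_until_settled(lambda: next(counts)))  # -> 5
```

It's a heuristic, of course: a backend that pauses longer than one polling interval will fool it, which is why an explicit status indicator is so much nicer.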

kinda agree

- session persistence

- dealing with cdns

- dealing with regional proxies

- dealing with captchas

- dealing with websocket data

- dealing with custom session handshake sequences

list goes on and on and on, but probably just edge cases haha

Is there a reason, other than the BeautifulSoup library, that Python is considered by many to be the ideal language for web scraping? I would think that JavaScript would be a far better choice since it could natively parse scripts on the page and libraries for querying and parsing the DOM have existed for a long time in JavaScript and are well known (to the point of being boring -- eg: jQuery).

You don't really get any benefit from writing it in javascript, other than the normal benefits you get from writing anything in javascript. (I say this having very little experience with server-side javascript, so take it with a grain of salt)

DOM emulation and selectors are pretty much equivalent between Node.js and Python; you can use CSS or XPath selectors on HTML/XML content in either of them. Either way you need to emulate something like a DOM, as neither language/execution environment has a "native" DOM.

You don't want to execute random javascript code from the web inside your scraper, and just being able to parse the scripts doesn't do you much good. So you're not getting the main advantage I think you're suggesting, being able to emulate page javascript, being able to actually run that code.

Generally if you want to interact with javascript you need to do it in another process (I guess a sufficiently advanced sandbox could work too, an interpreter in your interpreter, but so far that doesn't exist). If you're already going to be running that javascript in a different process for security reasons that different process might as well just be a "remote controlled" web browser.

Historically that was done using selenium, which has good python bindings.

Nowadays it's being done more with Playwright, which started out as a Node.js library but is moving towards Python...

Ultimately I think the reason is that there's no real advantage to using JavaScript, and Python is a nicer language with a healthier ecosystem, but your mileage may vary.

Actually one big advantage I see is the ability to quickly come up with needed functions and code from Browser DevTools then use the exact same code in a node script.

Personally I use this method with Puppeteer for advanced pages such as Single Page Apps (SPAs) and other pages that depend on JavaScript, CSS, or other features in the page. Another example of an advanced page would be a site where you have to physically scroll and wait for content to load from a web service. In these cases a headless browser with JavaScript makes the most sense to me.

I've found where it gets tricky with JavaScript is if you have a single missing `async/await` you can introduce bugs in your code that take extra time to solve.

For simple pages I do like Python and that you don't need `async/await`.
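The missing-`await` pitfall described above is easy to demonstrate: calling a coroutine function without `await` hands you a coroutine object, not its result, and a comparison against it silently goes wrong. A minimal sketch:

```python
import asyncio

async def fetch_title():
    # Stand-in for a real async page fetch.
    return "Example Domain"

async def buggy():
    title = fetch_title()               # BUG: missing await -> coroutine object
    return title == "Example Domain"    # always False

async def fixed():
    title = await fetch_title()         # correct: the actual string result
    return title == "Example Domain"

print(asyncio.run(buggy()))   # -> False (plus a "never awaited" RuntimeWarning)
print(asyncio.run(fixed()))   # -> True
```

Python at least emits a "coroutine ... was never awaited" warning, which is often the fastest clue.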

Selenium and playwright both allow you to inject javascript directly into the page, which can be nice.

I see your point though. Also when I do playwright scripting I normally use async/await, so I guess the grass is always greener ;p

In Python I find a missing async/await is apparent very early on and doesn't really take extra time to solve. Maybe it's just better tracebacks in Python?

If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.

Colly [1] is an all-batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.

rod [2] is a browser automation library based on the DevTools Protocol which adheres to the Go way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser even if it's headless.

[1] https://github.com/gocolly/colly

[2] https://github.com/go-rod/rod

This is a thoughtful response, I don’t understand why it’s being downvoted.

Me either, alas

BeautifulSoup is great if you don't care about the performance at all. Because it is painfully slooooooww.

lxml doesn't work well with broken HTML, but it is one or two orders of magnitude faster for parsing, and the same goes for querying with XPath.

Apart from that, there is also Scrapy, which is used a lot, but it is likewise very slow; it is just easy to scale horizontally.

There are a lot of times when scraping doesn't use HTML parsing at all. When you are scraping pages whose structure changes a lot, it might be better to go with full-text search, and in that case the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow, because the algorithm it uses is slow.
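A sketch of that full-text style: when the markup churns but the text around a value is stable, `str.partition` on anchor strings can beat parsing. The HTML below is a made-up example; the anchors would come from inspecting the real page.

```python
html = '<div class="x9z"><span>Price:</span> <b>19.99</b> USD</div>'

def between(text, start, end):
    """Return the substring between two anchor strings, or '' if absent."""
    _, _, rest = text.partition(start)
    value, _, _ = rest.partition(end)
    return value.strip()

# Anchoring on visible text survives class-name churn that breaks CSS selectors.
price = between(html, "Price:</span> <b>", "</b>")
print(price)  # -> 19.99
```

It's brittle in its own way (any change to the anchor text breaks it), but it is very fast and trivially debuggable.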

And to finish, Requests is also super slow; if you want something fast you have to use pycurl.

In my experience selectolax is about 10x faster than lxml, and keeps the familiar CSS selector API: https://rushter.com/blog/python-fast-html-parser/

Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.
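That's usually right: fetches are I/O-bound, so overlapping them buys far more than a faster parser. A sketch with the network call simulated by a sleep; swap the stand-in `fetch` for a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP request: 0.1 s of simulated network latency.
    time.sleep(0.1)
    return f"<html>{url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(10)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start

# The 10 fetches overlap, so this takes roughly 0.1 s instead of ~1 s serially.
print(len(pages), f"{elapsed:.2f}s")
```

Per-domain rate limits cap how far this helps on a single site, which is the point: past that cap, parser speed is rarely the bottleneck.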

Selectolax is nice, much faster than bs4 or lxml. Not a very well known project yet though.

Not sure there's anything faster on the javascript side of the fence?

If they beat lxml it is pretty impressive. Too bad they don't support XPath.

libxml2 is pretty slow (lxml uses it). Selectolax is 5 times faster for simple CSS queries. It is basically a thin wrapper over a well-optimized HTML parser written in C.

Beautiful Soup can use lxml, and does by default for parsing xml.

There is a big speed difference between lxml alone and lxml + bs4

I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.

Yes, the difference is infinite with broken HTML, which is, *checks notes*, a huge chunk of the Internet.

Maybe because of its historical position in the scientific/data science ecosystem?

Django and Flask are also very popular libraries, so the language and culture gap isn’t as large as it may seem.

Python is indeed far from ideal for scraping in the modern web, but for only one reason: It can't execute javascript.

As a result, js generated content cannot be scraped, and python scrapers also get blocked very fast as they don't execute fingerprinting scripts.

Executing JavaScript and being able to render an HTML page are completely different things. To render an HTML page you need a way to create a DOM, download all resources, ... And Node gives you no advantage, as you have to use another library for that.

Don't execute random JavaScript from the web in Node.js; Node isn't sandboxed the same way web browsers are, and it generally won't work anyway.

That opens up massive security problems.

True - that's why you run scrapers using Playwright or Selenium - both of which can easily be scripted from either JavaScript or Python, while executing website code in a sandboxed browser instance.

Is there a similar guide that walks through step by step how to perform scraping using one of those sandboxes?

Not a guide but a doc I’ve used before is listed below. I use webdriver to open Firefox.


That's when you break out PySelenium (if you want to stick with python). Many languages work with selenium drivers, I don't think there's much point in debating which language is best for scraping. Probably one that supports threads, it depends on the scale of course and how much performance you want.

While BeautifulSoup is great, lxml + xpath really is the way to go. XPath is a W3C standard and works cross language and even in the browser.

If you need a quick way to scrape JavaScript-generated content, you can just open your browser console and use `document.evaluate` with an XPath query.

Can you please elaborate?

Are you thinking more about the performance, or code maintainability?

One thing that is difficult is updating the BeautifulSoup code every time a website changes design/layout/etc.

XPath is like SQL, learn it once. Anything you write with Beautifulsoup will not translate to any other language or library.
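To illustrate the portability point: even Python's standard library understands a useful XPath subset via `xml.etree.ElementTree` (full XPath needs lxml), and the same path expressions carry over to `document.evaluate` in the browser, lxml, Java, and so on. A sketch on a toy document:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<a class='ext' href='https://a.example'>A</a>"
    "<a href='/local'>B</a>"
    "<a class='ext' href='https://b.example'>C</a>"
    "</body></html>"
)

# The same path expression works anywhere XPath does.
links = [a.get("href") for a in doc.findall(".//a[@class='ext']")]
print(links)  # -> ['https://a.example', 'https://b.example']
```

Note that ElementTree wants well-formed XML; for real-world tag soup you'd parse with lxml.html or html5lib first and query from there.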

No mention of Javascript. All the pages that I would consider scraping are constructed, at least in part, client side by Javascript. If that Javascript is not executed then there is nothing interesting to scrape.

That's because javascript isn't relevant. The _only_ way the browser can interact with the server is via http requests. That's the level the scraper operates at - imitating the http requests the browser does.

In particular, it doesn't matter why the browser made those HTTP requests. It could be because the user submitted a form, or clicked a link, or JavaScript did some AJAX request, or it's done by a web worker or browser plugin, or, God help us, something calls a function in some ActiveX component. Provided the scraper emulates the HTTP request perfectly, there is no way the server can tell whether the request came from the component it expects or from a scraper.

It is both a benefit and a curse. It's a benefit because all the complexity of JavaScript libraries, DOMs and whatnot goes away. For example, back in the day I scraped satellite imagery from maps.google.com. Maps is a giant, horridly complex JavaScript application - you really want to avoid understanding how it does what it does. The HTTP requests it makes, on the other hand, are pretty simple.

However, Google didn't want you scraping it, so they included authentication. Authentication always boils down to taking some data they sent in a previous request, mangling it with JavaScript, then sending it back as a cookie or a hidden field in a form. You have to replicate that mangling perfectly, which involves reading and understanding the minified JavaScript. That's the curse. Such reverse engineering can take a while, but it's mercifully rare.
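As a toy illustration of "replicating the mangling" (real transforms are site-specific and usually nastier): suppose reading the minified JS reveals it reverses a server-issued token and base64-encodes it before sending it back, i.e. `btoa(token.split('').reverse().join(''))`. Once you know that, the Python side is a few lines:

```python
import base64

def mangle(token: str) -> str:
    # Re-implementation of the (hypothetical) page JS:
    #   btoa(token.split('').reverse().join(''))
    return base64.b64encode(token[::-1].encode()).decode()

server_token = "a1b2c3"      # value taken from a previous response
print(mangle(server_token))  # -> M2MyYjFh; send back as a cookie / hidden field
```

The hard part is never the re-implementation; it's untangling the minified source to find out what to re-implement.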

The payoff is speed and reduced fragility. The speed arises because most of the crap a browser downloads is only useful to human eyes, and the scraper doesn't have to download it. Fragility is reduced because GUIs, even web GUIs and especially JavaScript-laden SPAs, often want mouse clicks and keystrokes in a certain order, and while particular parts of the screen have focus. For some reason web designers love tweaking their UIs, which breaks that order. The data they send back with their forms and AJAX requests is far more stable.

If pages are constructed client-side, the content you are looking for is either hardcoded as JSON in the HTML or loaded via XHR request. Scrape that.

That javascript is presumably fed by APIs, and often capturing the content of those APIs is better than capturing the rendered view.

That can be a lot of work though, use selenium or the more modern playwright to run entire web pages in a remote-controlled browser.

I've been using Airflow to coordinate scrapers that hit a number of various sites as part of a global market awareness system I've been building over the last year or so.

I have given up on BeautifulSoup and Scrapy, since so many modern websites use obfuscated JS to hide the underlying data they are serving up, so I feel like it's better to just act like a user and slowly walk through whatever site actions need to be done to get to the data you want to ingest.

Needless to say, as many have touched on in this post's comments, scraping reliably, and selectively retrying based on the many tens if not hundreds of different potential errors that can occur (either server-side / API limitations, or client-side based on your browser's interactions, i.e. shit crashing, etc.), is really almost an optimization problem of its own.

Definitely a boon to have scraping as an option, but as always, licensure of data especially if you want to resell it becomes a major concern that you should be thinking about up-front even if you kinda just want to hack things together in the beginning.

Does Airflow support streaming the outputs to downstream tasks? I tried to do something like this with Prefect but with Prefect you have to wait for the upstream task to finish before a downstream task can begin working.

Hit me up at my email in my profile if you want to chat about this stuff, I have a lot of thoughts on this but it's probably off topic for this post and I usually am just hacking stuff together to get my systems up and running.

What you're talking about is very sensible and I was equally surprised that Airflow didn't support long running tasks but you can layer over the workflow orchestration system a kind of ad-hoc higher order system that enables what you speak of. It kind of feels ugly but can get a lot done.

There are definitely ways to accomplish what you are saying using a combination of DockerOperators + ephemeral WebSocket servers running within containers as semi-long running tasks, and basically just have a dumb/heavy Redis container that persists to run streaming between the coordination architecture across these data flow jobs.

"Work in progress" lol!

EDIT: updating Airflow from 1.10.10 to 2.1.2 recently was a huge pain in the ass for what it's worth, good luck to all our fellow protagonists that are dealing with multi tens of thousands of task DAG setups... big ooooff

My understanding is each task is supposed to idempotent, so I don't believe this is a valid use for Airflow.

Yeah your understanding is correct but if you relax the idempotency constraint you can achieve a lot more with just a little bit extra logic in your interface layer with other internal services or potentially other mechanisms to ensure consistency. YMMV

I had been experimenting with web crawling with a lot of technologies. (Python based and others).

What most (uninitiated) developers do not realize is that web crawling is not for mere mortals.

1. We are at the mercy of the webpage authors.

HTML is a great language for encoding information. But most developers (usually webpage authors) see it as a tool for presentation only. Information can go anywhere in the document. And documents are prone to change.

2. The internet society frowns on web crawling

Look into any site's T&Cs and you might come across a clause which prevents you from crawling. The specific word may not be in the legalese; however, it implies any kind of crawling is denied. There is a good reason for this - it is mostly done to encourage fair use of the service.

3. Nobody designs services to be crawlable.

Most big-name companies do have some alternative. Like Facebook had "graphs" (now obsolete). It allowed end users to extract data using simple queries, like "list friends of X who live in city Y and who are not your friends". But the "graphs" feature came a lot later, not at Facebook's launch.

Usually at the beginning stages of any services we are at the mercy of #1 and #2

For #1, no one ever designs pages to have information always at a standard location. It changes.

4. The tech isn't ripe yet.

This is my personal view. I had been experimenting with Puppeteer and Selenium behind a corporate environment, and I wasn't that happy with the "net" developer experience. I found things like taking a screenshot or a PDF buggy. E.g., to get the latter I had to run my browser in non-headless mode; in headless mode my laptop's system policy disabled some extensions important for the webpage to load correctly.

Yep, I wanted to crawl a job search website so that I could search for jobs at work, without going to the job site (don't blame me, my job back then sucked). It was impossible to find information because all the tags were generated in some sort of framework that obfuscated everything.

Yeah architecturally Chromium Headless is an "embedder" so it doesn't automatically get all the front-end goodies full-fledged Chrome supports unless someone puts in the work to plumb through the code.

So stuff like extensions don't work at all.

Shameless plug, in case anybody finds it useful. You can use https://apitruecaptcha.org/ to automate captchas, and it has a free tier.

Can you solve recaptcha/hcaptcha?

It's not that useful without being able to do so.

Thanks, will keep in mind for future projects

I really wouldn't recommend building a web scraper from scratch. You'll soon have to think about caching/rate-limiting/retries.

Personally, I use Scrapy and it works fine. For best practice, I wouldn't use the Pipeline concept Scrapy provides - don't do data transformation inside Scrapy. Simply save the responses and perform the validation and transformations outside of Scrapy. The Pipeline concept is flawed because you cannot create DAGs with it - only serially linked pipelines.
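On the caching/rate-limiting/retries point: a small exponential-backoff helper covers a lot of ground before reaching for a framework. A sketch, with the flaky fetch simulated:

```python
import random
import time

def retry(attempts=4, base_delay=0.5, exc=(Exception,)):
    """Retry a function with exponential backoff and a little jitter."""
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exc:
                    if attempt == attempts - 1:
                        raise  # out of attempts: let the error propagate
                    # base, 2x base, 4x base, ... plus jitter against herds
                    time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
        return inner
    return wrap

calls = {"n": 0}

@retry(attempts=3, base_delay=0.01, exc=(ConnectionError,))
def flaky_fetch():
    # Simulated request that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"

print(flaky_fetch(), "after", calls["n"], "tries")
```

Catching only the exceptions you expect (timeouts, connection resets, 429s mapped to an exception) matters; blanket retries hide real bugs.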

Are there any good alternatives to Puppeteer/Playwright for other languages besides JavaScript? The full browser "emulation" is necessary for most sites nowadays.

What do you mean by "besides JavaScript"? We use Playwright with python.


To be fair, I've only used Puppeteer so far and I assumed that Playwright was mostly the same thing. Python support for Puppeteer was very buggy. Thanks for the pointer!

I’ve found imaging the page and doing OCR on the image is quite good for text extraction. Many pages on the Internet render with JavaScript, which means BS may not see the text in the DOM.

Here is the code to do some of that: https://github.com/kordless/grub-2.0

Link extraction is ignored, but could be done with BS on the rendering of the DOM.

I've had decent experience using WebDumper [1] to scrape SPAs (single-page applications), which rely almost entirely on DOM rendering and client-side JS.

I'm curious if folks here had any other recommendations for scraping SPAs, ie React or Angular applications.

1: https://github.com/EllyMandliel/WebDumper

I learned scraping with Python and Beautiful Soup. My biggest challenge was that on certain sites, the HTML I would get from requests was different from what I'd see in Chrome.

I tried using selenium to get around this but was never successful. The issue has really handicapped my ability to scrape.

You should probably take a look at this.


User-agent spoofing. Also, Chrome adjusts/fixes some HTML, so sometimes copying CSS or XPath selectors directly will not work and requires modification. It's good to work locally in a Jupyter notebook to test and optimize scrapers.

Try faking your User-Agent header.

The article touches on that a bit in the "Avoid being blocked" section. Sometimes user-agent isn't enough, I've run into other headers that can trigger a block or a change in behavior. The last one I ran into was gating on accept-language to not control the language, but serve up a honey-pot type page for automated crawlers.
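To illustrate, with the standard library's `urllib` you can assemble browser-like headers and inspect exactly what would go on the wire before sending anything. The header values below are the sort of thing copied from a real browser session; note that `Request` stores keys with only the first letter capitalized, so lookups use e.g. `"Accept-language"`.

```python
import urllib.request

# Headers a real browser sends; sites may gate on any of them, not just UA.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.5",
}

req = urllib.request.Request("https://example.com/", headers=headers)

# Inspect what would be sent (no request actually goes out here).
print(req.get_header("User-agent"))
print(req.get_header("Accept-language"))
```

Header order and TLS fingerprints can also give a client away on hardened sites, which is beyond what `urllib` controls; at that point a real browser is the honest answer.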

Haha, that's devious but ultimately futile against a targeted scraper.

Kotlin + Jsoup is a very solid scraping combo. Type safety and non-nullability are nice. Lots of HTTP clients to choose from. The only downside is the lack of proper XPath; Jsoup's selector syntax is similar but not exact.

Hmmm, I would use Docker Selenium and have python connect to your container with remote web driver. You can make pretty robust scrapers that way. I didn't know people still used beautiful soup.
