Hacker News | maltz's comments


YouTube has the "var ytInitialData" and "var ytInitialPlayerResponse" variables hardcoded in the HTML. No need to run JS!
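
A minimal sketch of reading one of those variables straight from the raw HTML, using only the Python standard library. The helper name `extract_js_var` and the sample HTML are made up for illustration. It assumes the page contains `var <name> = {...};` with a valid JSON object on the right-hand side, which YouTube could change at any time.

```python
import json
import re


def extract_js_var(html: str, name: str):
    """Return the JSON value assigned to `var <name> = ...` in html."""
    match = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{name} not found in HTML")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ';' and the rest of the page).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Hypothetical, heavily shortened stand-in for a real watch page.
sample_html = (
    "<html><script>var ytInitialPlayerResponse = "
    '{"videoDetails": {"title": "Demo; with {braces}"}};</script></html>'
)
player = extract_js_var(sample_html, "ytInitialPlayerResponse")
print(player["videoDetails"]["title"])
```

Using the JSON decoder instead of a regex for the object body means semicolons or braces inside string values do not cut the match short.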


If pages are constructed client-side, the content you are looking for is either hardcoded as JSON in the HTML or loaded via an XHR request. Scrape that directly.
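
For the "hardcoded as JSON" case, a rough sketch with the standard library's `html.parser`. The class name `EmbeddedJsonParser` and the sample page are hypothetical, and it assumes the site puts its state in a `<script>` tag with a JSON `type` attribute. Many sites do something different (for example, assigning to a JS variable), so treat this as one pattern among several. For the XHR case you would instead call the endpoint you see in the browser's network tab, which is not shown here.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonParser(HTMLParser):
    """Collect the parsed contents of JSON <script> tags."""

    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []  # one parsed object per JSON script tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.blobs.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # not valid JSON after all; skip it


# Hypothetical client-rendered page: empty app container plus embedded state.
sample_html = """
<html><body>
<div id="app"></div>
<script id="state" type="application/json">{"items": [{"id": 1, "name": "first"}]}</script>
<script>console.log("regular script, ignored");</script>
</body></html>
"""
parser = EmbeddedJsonParser()
parser.feed(sample_html)
parser.close()
print(parser.blobs)
```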


Playwright. It can be easily used with JS, Python, Go, Java, etc.


Thanks! Is that like using Selenium? (i.e., you have to manage and code the actions yourself)


Yes, quite similar. According to their own description, it is a "library to automate Chromium, Firefox and WebKit with a single API."


Thanks! If there are any third-party managed tools to do this, that would be awesome to know about (i.e., where they somehow run common JS functions/site interactions to test for additional content).


Unfortunately, it's a pathological edge case.

Imagine an async-loaded list that keeps loading more content as it comes in, until it displays all of the content available to the backend.

When would you know such a list is finished loading?

This sounds insane, but it's pretty easy and common for an ambitious UXer to key in on, and is something I've seen in production pages.

(In the event you are a UXer, please include some sort of status update! Even an overlaid spinner that disappears solves the problem.)
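
To make the problem concrete, here is a sketch of the usual workaround: poll the item count and treat the list as done once it stops changing for a few polls in a row. This is only a heuristic, not a real answer to the question above, because a slow backend can pause longer than the quiet window. The function name `wait_until_stable` and its parameters are made up for illustration, and `get_count` stands in for whatever you use to count items (for example, counting matching elements in the page).

```python
import time
from typing import Callable


def wait_until_stable(
    get_count: Callable[[], int],
    quiet_polls: int = 3,
    interval: float = 0.5,
    timeout: float = 30.0,
) -> int:
    """Poll get_count until it is unchanged for quiet_polls polls in a row.

    Returns the last count seen. Raises TimeoutError if the count never
    settles before the timeout.
    """
    deadline = time.monotonic() + timeout
    last = get_count()
    unchanged = 0
    while unchanged < quiet_polls:
        if time.monotonic() > deadline:
            raise TimeoutError("list kept changing until the timeout")
        time.sleep(interval)
        current = get_count()
        if current == last:
            unchanged += 1
        else:
            # The list grew (or shrank): restart the quiet window.
            unchanged = 0
            last = current
    return last


# Simulated list that grows for a few polls and then stops.
counts = iter([10, 20, 30, 30, 30, 30, 30])
final = wait_until_stable(lambda: next(counts), quiet_polls=3, interval=0.01)
print(final)
```

If the page exposes a spinner or any other status element, waiting for that to disappear is more reliable than guessing from counts.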


It's part of a series of blog posts that deals specifically with crawling. Other posts in the series do a better job of explaining advanced extraction techniques.

Extraction => https://www.zenrows.com/blog/mastering-web-scraping-in-pytho...

Avoid blocking => https://www.zenrows.com/blog/stealth-web-scraping-in-python-...


OK, but do you offer custom scraping services if I need to hire someone to build it?



thank you

