* The largest collection I've been able to find. Publicly accessible = present in the Common Crawl June/July 2022 dataset.
This project wouldn't have been possible without the amazing work of Common Crawl and their open dataset. In the spirit of open data, I am sharing my findings publicly in this very repository: EMR-Output.tar.gz, potentional-pwas.txt and pwas.tsv (which might not be complete, as discussed below).
It's incredible that you are essentially able to scan the whole web for around $140 (the cost of running AWS EMR and the actual crawler) and a couple of days of time. If you find this interesting, I highly recommend checking Common Crawl out and donating.
Secondly, if my data is of any use to you, consider supporting me via Ko-fi. But I urge you to direct your donations to Common Crawl instead, as they are a non-profit.
The project is a mixture of various technologies forming a pipeline that progressively reduces the dataset and ultimately crawls the websites that are potentially Progressive Web Apps.
Firstly, there is a big data exercise courtesy of Common Crawl, a public dataset containing petabytes of web crawl data. Then there is a first filter to further reduce the input size by ruling out pages that don't conform to the rules of a PWA. And finally, a web crawler gathers data from the remaining candidates.
Secondly, the project touches on database design with Postgres, as well as gathering a couple of statistical data points, and presents the data for promotion purposes in the form of an app store of sorts.
| Statistic | Value |
| --- | --- |
| Common Crawl number of pages | circa 3 100 000 000 |
| Distinct manifests | 5 864 284 |
| Pages that have a manifest linked | 281 460 208 |
| Avg. number of pages pointing to a distinct manifest | ~47.9 |
| Ratio of pages with a linked manifest to all pages | 1 to 11 |
| Distinct websites with a valid PWA manifest | 615 510 |
| Valid PWAs (dataset size) | 219 187 |
| Crawl failures (no service worker, timeouts, DNS not resolving, TLS errors) | ~64.4% |
| Ratio of PWAs to basic pages on the web | 1 to 14 143 |
All figures are according to the Common Crawl June/July 2022 dataset, processed through the Map-Reduce -> Filter -> Hand edits -> Crawler pipeline.
After the eventual realisation that crawling the whole web by myself isn't feasible, I turned to Common Crawl - an incredible dataset of historical web scrapes in a structured format, available for free in an AWS S3 bucket in us-east-1. My goal was simple - map every website contained in this dataset and filter out only those that contained `<link rel="manifest" href="...">` in their `<head>`, as it is a distinct feature of a PWA.
For this step I used mrjob, a Python framework for MapReduce jobs, together with the pre-written module mrcc.py for easier access to the data. I split the input into 8 parts of 10 000 segments each. After some experimentation with AWS EMR, I settled on running my jobs on a cluster of 4 c3.8xlarge core instances and 1 m1.large master node (all spot instances).
A 10 000-segment job running on the 128+2 vCPU cluster took just shy of 3 hours, meaning that all 8 jobs took around 24 hours and totaled around $90.
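For reference, below is a minimal sketch of such a filtering job written with mrjob. The real job reads WARC records through mrcc.py, so the input handling and the regex here are simplified stand-ins rather than the project's actual code.

```python
import re

from mrjob.job import MRJob

# Matches <link ... rel="manifest" ... href="..."> inside a page's HTML.
# Simplified: assumes rel appears before href, which real pages don't guarantee.
MANIFEST_RE = re.compile(
    r'<link[^>]+rel=["\']manifest["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

class ManifestJob(MRJob):
    """Map: emit (manifest URL, page URL) for every page linking a manifest.
    Reduce: group all referring pages under each distinct manifest URL."""

    def mapper(self, _, line):
        # In the real job the page URL and HTML come out of a WARC record via
        # mrcc.py; here a hypothetical "url<TAB>html" input line stands in.
        page_url, _, html = line.partition("\t")
        match = MANIFEST_RE.search(html)
        if match:
            yield match.group(1), page_url

    def reducer(self, manifest_url, page_urls):
        yield manifest_url, list(page_urls)

if __name__ == "__main__":
    ManifestJob.run()
```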
The output data format was a simple text file with tab-separated values, where the first position held a distinct URL pointing to a web manifest and the following values were the pages referring to that manifest in their `<link rel="manifest">` tag:
https://example.com/manifest.json https://example.com/pwa https://different-domain.com/other-pwa ...
...
...
...
The output files were too large to possibly include in a GitHub repo (3.4 GB gzipped) - see the .torrent file.
Utilising Go's amazing concurrency handling and HTTP client performance, I was able to make quick work of the reduced dataset while honoring robots.txt. I simply loaded each distinct manifest to check its validity according to the PWA rules and then combined its `start_url` field with the URLs of the referring pages, producing a list of potential PWAs. After some editing done by bash scripts, and ultimately by hand, I produced this dataset.
Thanks to Go's performance, I was able to comfortably run this job on a tiny Intel NUC homelab server (i3-3217U) in about 24 hours. Ultimately, I was limited more by the throughput of either my network or the server's NIC than by the CPU. I was also able to significantly speed up the job thanks to this StackOverflow question describing how to raise the maximum number of TCP connections on Linux.
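To keep the code samples in this write-up in a single language, here is the gist of that step sketched in Python rather than the Go the project actually uses. The validity rules (a name, a start_url, a large enough icon, an app-like display mode) follow the Web App Manifest spec in broad strokes and are my assumption, not the project's exact checks.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Display modes that make a page installable as an app-like experience.
INSTALLABLE_DISPLAYS = {"standalone", "fullscreen", "minimal-ui"}

def manifest_looks_like_pwa(manifest: dict) -> bool:
    """Rough validity check of a parsed web manifest.
    The 192px icon threshold is an assumption, not the project's exact rule."""
    has_name = bool(manifest.get("name") or manifest.get("short_name"))
    has_start_url = bool(manifest.get("start_url"))
    display_ok = manifest.get("display", "browser") in INSTALLABLE_DISPLAYS
    has_big_icon = any(
        any(int(size.split("x")[0]) >= 192
            for size in icon.get("sizes", "").split() if "x" in size)
        for icon in manifest.get("icons", [])
    )
    return has_name and has_start_url and display_ok and has_big_icon

def candidate_pwa_urls(manifest_url: str, referring_pages: list) -> list:
    """Fetch one manifest and, if it passes the check, resolve its start_url
    against each referring page to produce candidate PWA URLs."""
    with urlopen(manifest_url, timeout=10) as resp:
        manifest = json.load(resp)
    if not manifest_looks_like_pwa(manifest):
        return []
    return [urljoin(page, manifest["start_url"]) for page in referring_pages]
```

The Go implementation presumably fans this per-manifest work out across goroutines, which is where the concurrency win described above comes from.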
Ultimately I had to go through the strenuous experience of running a browser-based crawler against each potential PWA to check whether the page has a service worker and whether it ultimately is a valid PWA. If so, the crawler gathers all the requisite data, stores it in temporary .tsv files and captures a couple of screenshots. After that, the .tsv files were hand-checked and inserted into a database.
For this I used puppeteer, as I was already comfortable working with this library. In conjunction, I also used puppeteer-cluster to provide concurrency. I also experimented with unit testing using jest.
Again, running this job on my tiny Intel NUC (i3-3217U), I was only able to run 4 cluster workers and averaged around 0.3 pages per second, so the whole job would have taken around 20 days to complete. This time around, I gave Linode a go with their dedicated 32-core CPU Linode. As you can imagine, this substantially decreased the time necessary to complete the crawl - to around 2 days, averaging 3.4 pages per second. The final cost of the crawl was around $50.
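To stay consistent with the other Python samples, here is a rough sketch of what a single crawler worker does, using pyppeteer (a Python port of the Puppeteer API) instead of the Node.js puppeteer + puppeteer-cluster setup described above; the service-worker check, output fields and screenshot path are illustrative assumptions rather than the project's actual schema.

```python
import asyncio

from pyppeteer import launch  # Python port of the Puppeteer API

# JavaScript evaluated in the page: does it register any service worker?
CHECK_SERVICE_WORKER = """async () => {
    const regs = await navigator.serviceWorker.getRegistrations();
    return regs.length > 0;
}"""

async def crawl_one(url: str) -> dict:
    """Visit one candidate page, check for a service worker and take a screenshot."""
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url, waitUntil="networkidle2", timeout=30000)
        has_sw = await page.evaluate(CHECK_SERVICE_WORKER)
        row = {
            "url": url,
            "title": await page.title(),
            "has_service_worker": has_sw,
        }
        if has_sw:
            # Placeholder path; the real crawler captures a couple of screenshots.
            await page.screenshot(path="screenshot.png")
        return row
    finally:
        await browser.close()

if __name__ == "__main__":
    print(asyncio.run(crawl_one("https://example.com/pwa")))
```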
The result of crawling is this singular .tsv file. Admittedly, this dataset might not be complete, as during the job many pages started timing out, probably due to excessive crawling, even though my crawler obeys robots.txt, including the crawl-delay property. Ultimately, as this is just a hobby project, I couldn't feasibly deploy a whole cluster of IPs to perform my crawl and thereby get around the restrictions imposed by the web servers. If you are interested in having a go with your own crawling solution, feel free to refer to my distilled list of over 600k potential PWAs.
As the frontend wasn't the main goal of this project, I decided to try out some interesting technologies, namely Blazor (C# inside the browser through WebAssembly) and ASP.NET for the API. Coming from a JS/TS background, this was certainly an interesting change of pace when it comes to development, but I very much enjoyed natively sharing data models between the frontend and backend.
A simple model for storing the scraped data using Postgres. In the production database I also use the built-in full-text search module to perform fast queries from the search bar in the frontend.
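As an illustration of that search path, the query behind the search bar could look roughly like the sketch below; the table and column names (pwas, name, description) and the psycopg2 driver are hypothetical stand-ins, since the actual schema isn't reproduced here.

```python
import psycopg2  # any Postgres driver works the same way

# Rank rows by Postgres full-text search relevance against the user's query.
SEARCH_SQL = """
SELECT name, start_url,
       ts_rank(to_tsvector('english', name || ' ' || coalesce(description, '')),
               plainto_tsquery('english', %(q)s)) AS rank
FROM pwas
WHERE to_tsvector('english', name || ' ' || coalesce(description, ''))
      @@ plainto_tsquery('english', %(q)s)
ORDER BY rank DESC
LIMIT 20;
"""

def search_pwas(conn, query: str):
    """Full-text search over the hypothetical name/description columns."""
    with conn.cursor() as cur:
        cur.execute(SEARCH_SQL, {"q": query})
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=pwas")  # placeholder connection string
    print(search_pwas(conn, "weather"))
```

A production setup would typically keep a precomputed tsvector column with a GIN index rather than calling to_tsvector on every row at query time.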