have to figure out how to appropriately handle displaying subdomains of preloaded domains, so as not to create mismatched #s.
This is moving along. Here's how a domain looks in the domains table in this branch now:
The new downloadable CSVs are generated during the data import process, written to disk locally, and then included in the big S3 sync, so those links go to

I think an element of future work is to add a

There is some initial work at attempting to work this data into the Agencies tab, but that ran aground over informatics concerns (in particular, how to reward preloading and not consider those domains eligible, without creating obvious data mismatch issues, or causing the number of subdomains to counter-intuitively drop as domains are preloaded). There's a bit of commented-out code left in case this gets resuscitated.

A couple of other small changes I made along the way:
I'm ready to get this merged into at least

Here are a couple of additional screenshots of different situations:
Though I haven't merged this yet, I have deployed this branch to our staging site, and the output can be seen here: https://staging.pulse.cio.gov/https/domains/

For an example, here's

And we have methodology described here:

Because this hasn't been deployed to production, the subdomain data won't be updated automatically as part of the weekly data update. It'll remain static until we deploy to production and it's put through the full workflow. We're looking for feedback on the utility and user experience of the provided data, as well as on the quality of the documentation.
Is there a way to get the list of subdomains, and the status for each?

On Wed, Sep 28, 2016 at 2:14 PM, Eric Mill [email protected] wrote:
"I disapprove of what you say, but I will defend to the death your right to |
Yes, but on a per-agency basis. It's linked next to any domain with tracked subdomains. So next to gsa.gov, there's a link to their CSV download:
We're 👍 to go here.
This PR is tracking the long-awaited work to integrate .gov subdomains into Pulse.
It will update the weekly scanning process to re-gather and re-scan subdomains, and will update the data ingest process to incorporate aggregate subdomain-level statistics in the database. We're not going to store a row/entry for each individual subdomain. That would increase the complexity and size of our database immensely, as well as greatly increase churn, since subdomains are added/removed with great frequency -- much greater frequency than parent domains.
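As a sketch of what that aggregation could look like (the function name and field names here are hypothetical, not Pulse's actual schema):

```python
def aggregate_subdomains(parent_domain, subdomain_scans):
    """Collapse per-subdomain scan results into one aggregate record.

    `subdomain_scans` is a hypothetical list of dicts, one per subdomain,
    e.g. {"subdomain": "app.gsa.gov", "live": True, "https": True,
    "hsts": False}. Only this aggregate would be stored in the database,
    never the individual subdomain rows.
    """
    live = [s for s in subdomain_scans if s.get("live")]
    return {
        "domain": parent_domain,
        "totals": {
            "eligible": len(live),
            "https": sum(1 for s in live if s.get("https")),
            "hsts": sum(1 for s in live if s.get("hsts")),
        },
    }
```

Since subdomains churn constantly, storing only these per-parent totals keeps the database small and stable even as the underlying subdomain set changes week to week.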
It is a work in progress, and I'll update when it's ready for review/merge. I wanted to get this all written out now, and will update the PR in-place.
Subdomain plan
The plan here is:

* Each subdomain gets the `pshtt` scanner run on it. This will determine whether the subdomain is live and eligible to be counted, and what its HTTP/HTTPS/HSTS status is.

Right now, my thinking is that we're not going to offer downloadable CSVs of all of an agency's subdomains right away, but rather will link to Censys and DAP. This takes some amount of responsibility off of us for hosting data that might make agencies uncomfortable, while emphasizing their already public nature.
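A rough sketch of that per-subdomain scan step, driving the pshtt CLI (the exact JSON field names are an assumption here, not confirmed against pshtt's output):

```python
import json
import subprocess

# Illustrative field names; the exact keys in pshtt's JSON output are an
# assumption in this sketch.
FIELDS = {"live": "Live", "https": "Valid HTTPS", "hsts": "HSTS"}

def summarize(result):
    """Reduce one pshtt result dict to the flags we'd track per subdomain."""
    return {name: bool(result.get(key)) for name, key in FIELDS.items()}

def scan_subdomain(hostname):
    """Run the pshtt CLI (assumed to be on PATH) against one hostname.

    Assumes `pshtt <host> --json` prints a JSON list containing one
    result dict for the host.
    """
    raw = subprocess.check_output(["pshtt", hostname, "--json"])
    return summarize(json.loads(raw)[0])
```

In practice this would run through domain-scan's batching rather than one subprocess per host, but the shape of the result is the same.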
But at the same time, I also plan to expand our already-existing permalinks to detailed scan data per-subdomain, so that we have easy reference links for owners to figure out issues. We already have this:
https://s3.amazonaws.com/pulse.cio.gov/live/cache/pshtt/18f.gov.json
And so we'll likely generate and publish those per-subdomain as well. Malicious actors who wish to walk through that data can do that, but can already do that on Censys.io right now with basically the same level of friction.
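Generating those per-subdomain permalinks is mostly a matter of extending the existing path pattern; a minimal sketch, assuming the same bucket layout as the 18f.gov example above:

```python
BUCKET = "https://s3.amazonaws.com/pulse.cio.gov"

def pshtt_permalink(hostname, scope="live"):
    """Build the cache permalink for a host's pshtt scan results,
    following the existing per-domain URL pattern."""
    return f"{BUCKET}/{scope}/cache/pshtt/{hostname}.json"
```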
I feel pretty comfortable with this posture, but welcome feedback on how to proceed.
Data sources
Right now, there are two public sources:
Censys.io
Censys.io runs a zmap scan on the IPv4 space every night, and publishes the metadata of their scans. They scan several ports, including 443, and when they find TLS certificates they record their metadata. Censys also syncs up with Certificate Transparency logs, so we're also getting hostnames from CT logs that wouldn't otherwise have been spotted by a zmap scan.
This is helpful, because Censys' zmap scans have some completeness issues:
In February 2016, @jcjones measured the overlap between Censys' zmap scans and Certificate Transparency logs (this is before Censys started syncing with CT) and found a surprisingly large non-overlap, with CT having overall many more unique certificates.
Since then, Censys has begun syncing directly with a very comprehensive set of CT logs. I had originally planned to sync up with CT logs directly, but we may not need to do that now. However, if we ever do want to, this PR -- and the underlying work on domain-scan -- should make adding that data source relatively straightforward.
Digital Analytics Program
The Digital Analytics Program (DAP) began publishing, as of June 2016, a list of all hostnames which get reported into their underlying Google Analytics account that have at least 1 reported visit over the prior 14 days.
The CSV is here, and is updated automatically every night as part of the DAP's reporting:
https://analytics.usa.gov/data/live/sites.csv
This CSV has a bunch of non-.gov hostnames (which are generally legitimate), but our filtering process will limit it to subdomains that are children of the parent domains we track in Pulse.
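That filtering step could be sketched roughly like this (the CSV header and helper names are assumptions for illustration):

```python
import csv
import io

def filter_dap_hostnames(sites_csv_text, tracked_parents):
    """Filter DAP's sites.csv down to subdomains of tracked parent domains.

    Assumes the CSV's first column is the hostname (the exact header name
    is an assumption). A hostname is kept only if it's a true child of a
    tracked parent (ends with "." + parent), which drops both non-.gov
    hosts and the parent domains themselves.
    """
    parents = {p.lower().strip() for p in tracked_parents}
    rows = list(csv.reader(io.StringIO(sites_csv_text)))
    hostnames = [r[0].lower().strip() for r in rows[1:] if r]  # skip header
    return sorted(
        h for h in hostnames
        if any(h.endswith("." + p) for p in parents)
    )
```

Suffix matching on `"." + parent` (rather than a plain substring check) avoids false positives like `notgsa.gov` matching `gsa.gov`.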
cc @h-m-f-t @djharrity @alex @zakir