This repository has been archived by the owner on Jun 10, 2020. It is now read-only.

Integrating subdomains #543

Merged: konklone merged 39 commits into master from subdomains on Nov 28, 2016

Contributor

@konklone konklone commented Sep 8, 2016

This PR is tracking the long-awaited work to integrate .gov subdomains into Pulse.

It will update the weekly scanning process to re-gather and re-scan subdomains, and will update the data ingest process to incorporate aggregate subdomain-level statistics in the database. We're not going to store a row/entry for each individual subdomain. That would increase the complexity and size of our database immensely, as well as greatly increase churn, since subdomains are added/removed with great frequency -- much greater frequency than parent domains.

It is a work in progress, and I'll update when it's ready for review/merge. I wanted to get this all written out now, and will update the PR in-place.

Subdomain plan

The plan here is:

  • All subdomains will be pulled from already public sources, and will be limited to child subdomains of those .gov domains already tracked by Pulse. Subdomains ending in .mil, .us, .org, etc. will not be tracked, nor will state/local/tribal subdomains.
  • This will not expand the table. The table will remain with one row per-parent-domain.
  • Each subdomain will have only the pshtt scanner run on it. This will determine whether the subdomain is live and eligible to be counted, and what its HTTP/HTTPS/HSTS status is.
  • As part of the new message display for the HTTPS domains table, we'll show aggregate numbers and % of subdomains from each public source, per-parent-domain.
  • Each agency will have its subdomains factored into its overall % progress for HTTP, HTTPS, and HSTS. This will probably mean a separate % that is specific to subdomains, so as not to confuse people and not to diminish the work agencies have already put into their parent domains.
  • Subdomains will not only be scanned every week; the set of subdomains to scan will also be re-gathered every week. So we'll re-query Censys and DAP for eligible subdomains each week, meaning that hostnames may gain or lose eligibility from week to week.
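
To make the aggregation idea concrete, here is a minimal sketch of rolling per-subdomain results up to one summary per parent domain. It is not the code in this branch: the function name `aggregate_by_parent` and the keys `base_domain`, `live`, `https`, and `hsts` are assumptions made for illustration, and it does not break the numbers down by public source.

```python
from collections import defaultdict


def aggregate_by_parent(scans):
    """Roll per-subdomain scan results up to one summary per parent domain.

    Each scan is a dict with hypothetical keys: base_domain, live, https, hsts.
    """
    totals = defaultdict(lambda: {"live": 0, "https": 0, "hsts": 0})
    for scan in scans:
        if not scan["live"]:
            continue  # only live subdomains are eligible to be counted
        row = totals[scan["base_domain"]]
        row["live"] += 1
        row["https"] += int(scan["https"])
        row["hsts"] += int(scan["hsts"])

    summary = {}
    for parent, row in totals.items():
        summary[parent] = {
            **row,
            # row["live"] is at least 1 here, so the division is safe
            "https_pct": round(100 * row["https"] / row["live"]),
            "hsts_pct": round(100 * row["hsts"] / row["live"]),
        }
    return summary
```

Because only aggregates are kept, the table stays at one row per parent domain no matter how many subdomains come and go between weekly scans.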

Right now, my thinking is that we're not going to offer downloadable CSVs of all of an agency's subdomains right away, but rather will link to Censys and DAP. This takes some amount of responsibility off of us for hosting data that might make agencies uncomfortable, while emphasizing their already public nature.

But at the same time, I also plan to expand our already-existing permalinks to detailed scan data per-subdomain, so that we have easy reference links for owners to figure out issues. We already have this:

https://s3.amazonaws.com/pulse.cio.gov/live/cache/pshtt/18f.gov.json

And so we'll likely generate and publish those per-subdomain as well. Malicious actors who wish to walk through that data can do that, but can already do that on Censys.io right now with basically the same level of friction.
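
As a rough sketch of what those reference links could look like, the helper below builds a permalink from a hostname. It assumes per-subdomain files would follow the same path pattern as the existing `18f.gov.json` link, which is not settled yet; `pshtt_permalink` is a hypothetical name.

```python
# Base path taken from the existing parent-domain permalink above.
PSHTT_CACHE = "https://s3.amazonaws.com/pulse.cio.gov/live/cache/pshtt"


def pshtt_permalink(hostname):
    """Build the (assumed) permalink to cached pshtt results for a hostname."""
    # Normalize case and drop any trailing dot so links are stable.
    normalized = hostname.strip().lower().rstrip(".")
    return f"{PSHTT_CACHE}/{normalized}.json"
```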

I feel pretty comfortable with this posture, but welcome feedback on how to proceed.

Data sources

Right now, there are two public sources:

Censys.io

Censys.io runs a zmap scan on the IPv4 space every night, and publishes the metadata of their scans. They scan several ports, including 443, and when they find TLS certificates they record their metadata. Censys also syncs up with Certificate Transparency (CT) logs, so we're also getting hostnames from CT logs that wouldn't otherwise have been spotted by a zmap scan.

This is helpful, because Censys' zmap scans have some completeness issues:

  • It is possible to opt out of their scans, and they respect requests to do so. Opt-outs are generally uncommon, but it is notable for our purposes that the Department of Defense does currently opt out of these scans.
  • Their scans are unlikely to observe certificates that are only served when the client sends an SNI extension with a particular hostname. This can be a large gap.

In February 2016, @jcjones measured the overlap between Censys' zmap scans and Certificate Transparency logs (this is before Censys started syncing with CT) and found a surprisingly large non-overlap, with CT having overall many more unique certificates:

[image: certsinctversuscensys]

Since then, Censys has begun syncing directly with a very comprehensive set of CT logs. I had originally planned to sync up with CT logs directly, but we may not need to do that now. However, if we ever do want to, this PR -- and the underlying work on domain-scan -- should make adding that data source relatively straightforward.

Digital Analytics Program

The Digital Analytics Program (DAP) began publishing, as of June 2016, a list of all hostnames which get reported into their underlying Google Analytics account that have at least 1 reported visit over the prior 14 days.

The CSV is here, and is updated automatically every night as part of the DAP's reporting:
https://analytics.usa.gov/data/live/sites.csv

This CSV has a bunch of non-.gov hostnames (which are generally legitimate), but our filtering process will limit it to subdomains that are children of the parent domains we track in Pulse.
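
A minimal sketch of that filtering step follows. It assumes the hostname is in the first column of the CSV and that the first row is a header; the real column layout of `sites.csv` is not described here, and the function names are hypothetical.

```python
import csv
import io


def parent_of(hostname, parents):
    """Return the tracked parent domain a hostname falls under, or None."""
    host = hostname.strip().lower().rstrip(".")
    for parent in parents:
        # Require a dot boundary so "notgsa.gov" does not match "gsa.gov".
        # The parent domain itself is not treated as a subdomain.
        if host.endswith("." + parent):
            return parent
    return None


def eligible_subdomains(csv_text, parents):
    """Map each eligible hostname in the CSV text to its tracked parent."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the (assumed) header row
    found = {}
    for row in reader:
        if not row:
            continue
        parent = parent_of(row[0], parents)
        if parent:
            found[row[0].strip().lower().rstrip(".")] = parent
    return found
```

Matching on `"." + parent` rather than a bare suffix is the important detail: it keeps lookalike domains and non-.gov hostnames out of the eligible set.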

cc @h-m-f-t @djharrity @alex @zakir

@konklone
Contributor Author

konklone commented Sep 26, 2016

This is moving along. Here's how a domain looks in the domains table in this branch now:

[screenshot from 2016-09-25 18-20-51]

  • The "known to Censys" link goes to a Censys search for that domain name.
  • The "known to DAP" link goes to the DAP's downloadable CSV of hostnames.
  • The "read our methodology" link goes to a new Subdomains section on the /https/guidance/ page.
  • The "download subdomain data for this agency" link goes to a CSV of subdomain data for Uses HTTPS, Enforces HTTPS, and HSTS, scoped to a given agency. This uses the same field names and logic as the downloadable CSV for the main table and as the display table, though the supporting fields are changed a bit for subdomains (adding the Base Domain, and removing Preloaded and Grade).

The new downloadable CSVs are generated during the data import process, written to disk locally, and then included in the big S3 sync, so those links go to https://s3.amazonaws.com/pulse.cio.gov/. That's where we've been storing scan results to date, and it's been a crucial reference dataset, but this will be the first time we offer a public download link to the bucket.
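
For illustration, the write-to-disk step might look roughly like the sketch below. The directory layout mirrors the published path (`subdomains/agencies/<agency-slug>/https.csv`), but the header strings, the `slugify` rule, and the function names are assumptions, not the code in this branch.

```python
import csv
import os
import re

# Hypothetical header names; the real CSV reuses the main table's field names.
FIELDS = ["Domain", "Base Domain", "Uses HTTPS", "Enforces HTTPS", "HSTS"]


def slugify(name):
    """Lowercase an agency name and join its words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def write_agency_csv(output_dir, agency, rows):
    """Write one agency's subdomain rows to disk and return the file path."""
    path = os.path.join(
        output_dir, "subdomains", "agencies", slugify(agency), "https.csv"
    )
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Writing into a local directory tree that matches the bucket layout means the existing S3 sync can pick the files up without any per-file upload logic.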

I think an element of future work is to add a Subdomain model to this project, and then have those downloadable CSVs generated on-demand, like the main CSVs are. However, I think the database Pulse is currently using (TinyDB) should be changed before then.

There is some initial work at attempting to work this data into the Agencies tab, but that ran aground over informatics concerns (in particular, how to reward preloading and not consider those domains eligible, without creating obvious data mismatch issues, or causing the number of subdomains to counter-intuitively drop as domains are preloaded). There's a bit of commented out code left in case this gets resuscitated.

A couple of other small changes I made along the way:

  • If a domain is preloaded, we no longer complain about a weak max-age. Preloading overrides dynamic max-ages anyway.
  • If a domain is preloaded, we no longer complain about a grade between F and A+ (exclusive). There is a slight textual difference ("Perfect score!" is only said for A+), but we don't spend screen time or change the visual emphasis at all. F grades retain an explicit warning.
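
Here is one way to read those two rules as code. This is a sketch of the display logic as described above, not the template code in this branch: the function name, parameters, and every message string except "Perfect score!" are made up for illustration.

```python
def https_messages(preloaded, weak_max_age, grade):
    """Return the notes to show for a domain, per the rules described above."""
    messages = []
    # Preloading overrides dynamic max-ages, so skip the complaint if preloaded.
    if weak_max_age and not preloaded:
        messages.append("HSTS max-age is too short.")

    if grade == "A+":
        messages.append("Perfect score!")  # only said for A+
    elif grade == "F":
        # F grades keep an explicit warning, preloaded or not.
        messages.append("Warning: this domain received an F grade.")
    elif not preloaded:
        # Grades strictly between F and A+ are only flagged when not preloaded.
        messages.append(f"Grade {grade} leaves room for improvement.")
    return messages
```

For example, `https_messages(True, True, "B")` returns an empty list, while the same inputs with `preloaded=False` return both notes.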

I'm ready to get this merged into at least staging and then deployed to the staging site, for review by stakeholders.

Here's a couple of additional screenshots of different situations:

[screenshot from 2016-09-25 18-30-23]

[screenshot from 2016-09-25 18-30-43]

@konklone konklone changed the title from "WIP: Integrating subdomains" to "Integrating subdomains" Sep 26, 2016
@konklone
Contributor Author

Though I haven't merged this yet, I have deployed this branch to our staging site, and the output can be seen here:

https://staging.pulse.cio.gov/https/domains/

For an example, here's gsa.gov:
https://staging.pulse.cio.gov/https/domains/#q=gsa.gov

And we have methodology described here:
https://staging.pulse.cio.gov/https/guidance/#subdomains

Because this hasn't been deployed to production, the subdomain data won't be updated automatically as part of the weekly data update. It'll remain static until we deploy to production and it's put through the full workflow.

We're looking for feedback on the utility and user experience of the provided data, as well as the quality of the documentation.

@alex
Contributor

alex commented Sep 28, 2016

Is there a way to get the list of subdomains, and the status for each?


@konklone
Contributor Author

Yes, but on a per-agency basis. The download link appears next to any domain with tracked subdomains.

So next to gsa.gov:
https://staging.pulse.cio.gov/https/domains/#q=gsa.gov

There's a link to their CSV download:
https://s3.amazonaws.com/pulse.cio.gov/live/subdomains/agencies/general-services-administration/https.csv

@konklone konklone merged commit 1954dcf into master Nov 28, 2016
@konklone konklone deleted the subdomains branch November 28, 2016 05:10
@konklone
Contributor Author

We're 👍 to go here.
