Integrating subdomains #543

konklone · 2016-09-08T15:41:54Z

This PR is tracking the long-awaited work to integrate .gov subdomains into Pulse.

It will update the weekly scanning process to re-gather and re-scan subdomains, and will update the data ingest process to incorporate aggregate subdomain-level statistics in the database. We're not going to store a row/entry for each individual subdomain. That would increase the complexity and size of our database immensely, as well as greatly increase churn, since subdomains are added/removed with great frequency -- much greater frequency than parent domains.

It is a work in progress, and I'll update when it's ready for review/merge. I wanted to get this all written out now, and will update the PR in-place.

Subdomain plan

The plan here is:

All subdomains will be pulled from already public sources, and will be limited to child subdomains of those .gov domains already tracked by Pulse. Subdomains ending in .mil, .us, .org, etc. will not be tracked, nor will state/local/tribal subdomains.
This will not expand the table. The table will remain with one row per-parent-domain.
Each subdomain will have only the pshtt scanner run on it. This will determine whether the subdomain is live and eligible to be counted, and what its HTTP/HTTPS/HSTS status is.
As part of the new message display for the HTTPS domains table, we'll show aggregate numbers and % of subdomains from each public source, per-parent-domain.
Each agency will have its subdomains factored into its overall % progress for HTTP, HTTPS, and HSTS. This will probably mean a separate % that is specific to subdomains, so as not to confuse people and not to diminish the work agencies have already put into their parent domains.
Subdomains will not only be scanned every week; the set of subdomains to scan will also be re-gathered every week. So we'll re-query Censys and DAP for eligible subdomains each week, meaning that hostnames may appear in and disappear from eligibility from week to week.
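
To make the "aggregate, not per-row" idea concrete, here is a minimal sketch of rolling per-subdomain scan results up into one summary per parent domain. It is not the code in this branch: the keys `base_domain`, `live`, `https`, and `hsts` and both function names are assumptions chosen for illustration, not pshtt's actual output fields.

```python
from collections import defaultdict


def aggregate_subdomains(results):
    """Roll per-subdomain results up into one summary per parent domain.

    `results` is an iterable of dicts with hypothetical keys:
    "base_domain", "live", "https", "hsts".
    """
    totals = defaultdict(lambda: {"eligible": 0, "https": 0, "hsts": 0})
    for row in results:
        if not row["live"]:
            continue  # only live subdomains are eligible to be counted
        summary = totals[row["base_domain"]]
        summary["eligible"] += 1
        summary["https"] += int(row["https"])
        summary["hsts"] += int(row["hsts"])
    return dict(totals)


def percent(part, whole):
    """Whole-number percentage, returning 0 when there is nothing to count."""
    return round(100 * part / whole) if whole else 0
```

Only the summaries would need to be stored on the parent domain's row, so the table stays at one row per parent domain no matter how many subdomains come and go each week.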

Right now, my thinking is that we're not going to offer downloadable CSVs of all of an agency's subdomains right away, but rather will link to Censys and DAP. This takes some amount of responsibility off of us for hosting data that might make agencies uncomfortable, while emphasizing their already public nature.

But at the same time, I also plan to expand our already-existing permalinks to detailed scan data per-subdomain, so that we have easy reference links for owners to figure out issues. We already have this:

https://s3.amazonaws.com/pulse.cio.gov/live/cache/pshtt/18f.gov.json

And so we'll likely generate and publish those per-subdomain as well. Malicious actors who wish to walk through that data can do that, but can already do that on Censys.io right now with basically the same level of friction.

I feel pretty comfortable with this posture, but welcome feedback on how to proceed.

Data sources

Right now, there are two public sources:

Censys.io

Censys.io runs a zmap scan of the IPv4 space every night and publishes the metadata from their scans. They scan several ports, including 443, and when they find TLS certificates they record the certificates' metadata. Censys also syncs with Certificate Transparency (CT) logs, so we're also getting hostnames from CT logs that wouldn't otherwise have been spotted by a zmap scan.

This is helpful, because Censys' zmap scans have some completeness issues:

It is possible to opt out of their scans, and they respect such requests. Opting out is generally uncommon, but it is notable for our purposes that the Department of Defense currently does opt out of these scans.
Their scans are unlikely to observe certificates that are only served when the client sends an SNI extension with a particular hostname. This can be a large gap.

In February 2016, @jcjones measured the overlap between Censys' zmap scans and Certificate Transparency logs (this was before Censys started syncing with CT) and found a surprisingly large non-overlap, with CT having many more unique certificates overall.

Since then, Censys has begun syncing directly with a very comprehensive set of CT logs. I had originally planned to sync up with CT logs directly, but we may not need to do that now. However, if we ever do want to, this PR -- and the underlying work on domain-scan -- should make adding that data source relatively straightforward.

Digital Analytics Program

The Digital Analytics Program (DAP) began publishing, as of June 2016, a list of all hostnames which get reported into their underlying Google Analytics account that have at least 1 reported visit over the prior 14 days.

The CSV is here, and is updated automatically every night as part of the DAP's reporting:
https://analytics.usa.gov/data/live/sites.csv

This CSV has a bunch of non-.gov hostnames (which are generally legitimate), but our filtering process will limit it to subdomains that are children of the parent domains we track in Pulse.
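
As a rough illustration of that filtering step (a sketch, not the code in this branch), the check comes down to matching each hostname against the set of tracked parent domains by suffix, on whole labels only. The function names `parent_for` and `filter_subdomains` are hypothetical.

```python
def parent_for(hostname, parent_domains):
    """Return the tracked parent domain `hostname` is a child of, or None."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    # Try successively shorter suffixes. Starting at 1 skips the full
    # hostname, so a parent domain is never counted as its own subdomain.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in parent_domains:
            return candidate
    return None


def filter_subdomains(hostnames, parent_domains):
    """Map each eligible hostname (normalized) to its tracked parent domain."""
    parents = {d.strip().lower() for d in parent_domains}
    kept = {}
    for name in hostnames:
        parent = parent_for(name, parents)
        if parent is not None:
            kept[name.strip().lower().rstrip(".")] = parent
    return kept
```

Matching on whole labels rather than with a plain string `endswith` keeps a hostname like `notgsa.gov` from being mistaken for a child of `gsa.gov`.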

cc @h-m-f-t @djharrity @alex @zakir


konklone · 2016-09-26T01:31:18Z

This is moving along. Here's how a domain looks in the domains table in this branch now:

The "known to Censys" link goes to a Censys search for that domain name.
The "known to DAP" link goes to the DAP's downloadable CSV of hostnames.
The "read our methodology" link goes to a new Subdomains section on the /https/guidance/ page.
The "download subdomain data for this agency" link goes to a CSV of subdomain data for Uses HTTPS, Enforces HTTPS, and HSTS, scoped to a given agency. This uses the same field names and logic as the downloadable CSV for the main table and the display table, though the supporting fields are changed a bit for subdomains (including the Base Domain, and removing Preloaded and Grade).

The new downloadable CSVs are generated during the data import process, written to disk locally, and then included in the big S3 sync, so those links go to https://s3.amazonaws.com/pulse.cio.gov/. That's where we've been storing scan results to date, and it's been a crucial reference dataset, but this will be the first time we offer a public download link to the bucket.
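
For reviewers who want a picture of that step, here is a minimal sketch of writing one CSV per agency to local disk before the sync. It is an illustration under assumptions, not the import code itself: the `Domain` and `Agency` keys, the `slugify` helper, and the directory layout (modeled on the public link format) are hypothetical, while the other column names come from the description above.

```python
import csv
import os
import re

# "Domain" is an assumed column; the rest are named in the description above.
FIELDS = ["Domain", "Base Domain", "Uses HTTPS", "Enforces HTTPS", "HSTS"]


def slugify(agency_name):
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", agency_name.lower()).strip("-")


def write_agency_csvs(rows, output_dir):
    """Write one https.csv per agency under output_dir; return the paths."""
    by_agency = {}
    for row in rows:
        by_agency.setdefault(row["Agency"], []).append(row)

    paths = []
    for agency, agency_rows in sorted(by_agency.items()):
        directory = os.path.join(
            output_dir, "subdomains", "agencies", slugify(agency))
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "https.csv")
        with open(path, "w", newline="") as f:
            # extrasaction="ignore" drops keys (like "Agency") not in FIELDS
            writer = csv.DictWriter(
                f, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            for row in sorted(agency_rows, key=lambda r: r["Domain"]):
                writer.writerow(row)
        paths.append(path)
    return paths
```

Sorting agencies and rows keeps the output stable from week to week, so a diff between two runs reflects real changes rather than ordering noise.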

I think an element of future work is to add a Subdomain model to this project, and then have those downloadable CSVs generated on-demand, like the main CSVs are. However, I think the database Pulse currently uses (TinyDB) should be changed before that time.

There is an initial attempt at working this data into the Agencies tab, but that ran aground over informatics concerns (in particular, how to reward preloading and not consider those domains eligible, without creating obvious data mismatch issues, or causing the number of subdomains to counter-intuitively drop as domains are preloaded). There's a bit of commented-out code left in case this gets resuscitated.

A couple of other small changes I made along the way:

If a domain is preloaded, we no longer complain about a weak max-age. Preloading overrides dynamic max-ages anyway.
If a domain is preloaded, we no longer complain about a grade between F and A+ (exclusive). There is a slight textual difference ("Perfect score!" is only said for A+), but we don't spend screen time or change the visual emphasis at all. F grades retain an explicit warning.
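
The two preload rules above can be sketched as a single decision function. This is an illustration only: the function name, the warning labels, and the one-year `WEAK_MAX_AGE` threshold are all assumptions, not values taken from this branch.

```python
WEAK_MAX_AGE = 31536000  # assumed threshold: one year, in seconds


def warnings_for(preloaded, max_age, grade):
    """Return the list of warnings to show for one domain."""
    warnings = []
    # Preloading overrides dynamic max-ages, so a weak max-age only
    # matters when the domain is not preloaded.
    if not preloaded and max_age is not None and max_age < WEAK_MAX_AGE:
        warnings.append("weak max-age")
    if grade == "F":
        # F grades keep an explicit warning, preloaded or not.
        warnings.append("failing grade")
    elif grade is not None and grade != "A+" and not preloaded:
        # Grades between F and A+ are only flagged when not preloaded.
        warnings.append("grade below A+")
    return warnings
```

For example, a preloaded domain with a short max-age and a B grade produces no warnings, while the same domain without preloading produces both.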

I'm ready to get this merged into at least staging and then deployed to the staging site, for review by stakeholders.

Here's a couple of additional screenshots of different situations:

konklone · 2016-09-28T18:14:26Z

Though I haven't merged this yet, I have deployed this branch to our staging site, and the output can be seen here:

https://staging.pulse.cio.gov/https/domains/

For an example, here's gsa.gov:
https://staging.pulse.cio.gov/https/domains/#q=gsa.gov

And we have methodology described here:
https://staging.pulse.cio.gov/https/guidance/#subdomains

Because this hasn't been deployed to production, the subdomain data won't be updated automatically as part of the weekly data update. It'll remain static until we deploy to production and it's put through the full workflow.

We're looking for feedback on the utility and user experience of the provided data, as well as the quality of the documentation.

alex · 2016-09-28T18:20:49Z

Is there a way to get the list of subdomains, and the status for each?


konklone · 2016-09-28T18:23:45Z

Yes, but on a per-agency basis. The download is linked next to any domain with tracked subdomains.

So next to gsa.gov:
https://staging.pulse.cio.gov/https/domains/#q=gsa.gov

There's a link to their CSV download:
https://s3.amazonaws.com/pulse.cio.gov/live/subdomains/agencies/general-services-administration/https.csv

konklone · 2016-11-28T05:10:14Z

We're 👍 to go here.

konklone added 15 commits August 22, 2016 18:29

mocking out some subdomain info

e8b3092

Merge branch 'show-details' into subdomains

35d5423

Merge branch 'production' into subdomains

0e14ae2

Merge branch 'production' into subdomains

39b4034

a start at gathering and scanning subdomains during a weekly scan

b932c93

flesh out comments a bit

f1b762f

sort domain names while iterating

e100884

sort more things

b1a56ad

forgot to pass the options dict into the main scan method

03373b5

Load subdomain data from each pshtt.csv, store in-memory if live.

d6cd4b9

move pshtt report calculation into independent method

07f5fb4

actually storing https subdomain results on Domain objects

4022e25

documentation for subdomains

0b455a9

white space

66b461c

reworking some subdomain text

f2a9108

konklone mentioned this pull request Sep 19, 2016

Add each agency's top20 sites on analytics.usa.gov to pulse.cio.gov #481

Closed

konklone added 8 commits September 25, 2016 12:15

allow hsts max-age to be weak if the domain is actually preloaded

7f3a82a

first pass at arranging the text and styling for subdomains

57dafe2

count up subdomains for agencies, omitting preloaded domains

47a6d5f

don't bother preloaded domains over a between-F-and-A+ score

d73ec05

disable agency roll-up of subdomain data for now.

54e0625

have to figure out how to appropriately handle displaying subdomains of preloaded domains, so as not to create mismatched #s.

say "Known public subdomains", and be more explicit about scope

563f36d

start writing out agency-specific roll-up CSVs of subdomains

4396ee5

upload and link to generated agency CSVs of subdomain data

5127297

konklone changed the title ~~WIP: Integrating subdomains~~ Integrating subdomains Sep 26, 2016

konklone added 4 commits September 26, 2016 21:09

Merge branch 'master' into subdomains

65a3f96

re-enable accessibility for the subdomains branch, on master

1239ee6

no need to override the pshtt path anymore

84a2f21

Merge branch 'master' into subdomains

fc61367

konklone added 6 commits September 27, 2016 18:21

add private env vars example

fae9fa4

ignore private env vars file

23145b3

source private env vars during cronjob

6037327

manage symlinks to non-versioned private env vars file

21b3f96

Merge branch 'subdomains' of github.com:18F/pulse into subdomains

8f5a1fb

missing stray quote in env example

ad2758e

konklone and others added 6 commits October 12, 2016 15:40

Merge branch 'master' into subdomains

37d01bc

fix incorrect path to config.env

20abf3f

Merge branch 'master' into subdomains

bdec229

Merge branch 'master' into subdomains

0419158

Merge branch 'master' into subdomains

4731940

don't process subdomain data from non-executive domains during load

d16860f

konklone merged commit 1954dcf into master Nov 28, 2016

konklone deleted the subdomains branch November 28, 2016 05:10
