Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider partial live computation of metadata #793

Open
sgratzl opened this issue Jan 5, 2022 · 4 comments
Open

Consider partial live computation of metadata #793

sgratzl opened this issue Jan 5, 2022 · 4 comments
Labels
enhancement v4 Big covidcast schema redesign

Comments

@sgratzl
Copy link
Member

sgratzl commented Jan 5, 2022

see also cmu-delphi/www-covidcast#1026 (comment)

we should try out how long

select source, signal, max(time_value) from covidcast group by source, signal

takes.

if we wanna hit other indices we could generate a big union :

select source, signal, max(time_value) from covidcast where source = 'x' and signal = 'x'
UNION ALL
select source, signal, max(time_value) from covidcast where source = 'x' and signal = 'y'
...
@krivard
Copy link
Contributor

krivard commented Jan 10, 2022

TL;DR:

  • Neither approach will be performant live
  • We might be able to do a small number of UNION queries as part of the acquisition pipeline, but the full set won't fit in the 45-minute block we have set aside for each acquisition cycle.
select source, signal, max(time_value) from covidcast group by source, signal

Cancelled when running time exceeded 2 hours.

select source, `signal`, max(time_value) from covidcast where source = 'chng' and `signal` = 'smoothed_adj_outpatient_cli'
UNION ALL
select source, `signal`, max(time_value) from covidcast where source = 'chng' and `signal` = 'smoothed_adj_outpatient_covid'
UNION ALL
...
UNION ALL
select source, `signal`, max(time_value) from covidcast where source = 'usa-facts' and `signal` = 'deaths_incidence_num'
UNION ALL
select source, `signal`, max(time_value) from covidcast where source = 'usa-facts' and `signal` = 'deaths_incidence_prop'

3 hours 36 minutes.

select source, `signal`, max(time_value) from covidcast where source = 'jhu-csse' and `signal` = 'confirmed_cumulative_num'

4 minutes.

@sgratzl
Copy link
Member Author

sgratzl commented Jan 11, 2022

oh that is unfortunate.

@sgratzl sgratzl removed their assignment Jan 11, 2022
@krivard
Copy link
Contributor

krivard commented Jan 11, 2022

I'm going to push this into v4 -- once we have a more resilient schema for covidcast we may be able to handle queries like these.

@krivard krivard added the v4 Big covidcast schema redesign label Jan 11, 2022
@melange396
Copy link
Collaborator

Nearly identical to the above 4 minute query, this one returns instantaneously:

SELECT MAX(time_value) FROM covid.epimetric_full WHERE signal_key_id=(
    SELECT signal_key_id FROM covid.signal_dim WHERE source='jhu-csse' AND `signal`='confirmed_cumulative_num' LIMIT 1
) AND time_type='day';

Note the inclusion of the constraint on time_type (of which there is only one for this signal), which lets the database make use of either of these indexes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement v4 Big covidcast schema redesign
Projects
None yet
Development

No branches or pull requests

3 participants