Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

covar_pop not supported in all databases. #17

Open
dwreeves opened this issue Jun 12, 2024 · 0 comments
Open

covar_pop not supported in all databases. #17

dwreeves opened this issue Jun 12, 2024 · 0 comments
Labels
bug Something isn't working help wanted Extra attention is needed low priority Not something I intend to work on, but I will accept a PR that fixes it

Comments

@dwreeves
Copy link
Owner

dwreeves commented Jun 12, 2024

Redshift does not handle either covar or covar_pop, which means default__regress doesn't work. There may be other databases where this is true.

Current solution

The best solution is to use the chol method, which never relies on the default__regress macro. So the problem is fixed by simply using the suggested method for calculating regression coefficients.

I will make a note in the README about this.

If a user utilizes the fwl method, in the case of databases that do not support covar_pop, a compilation error should be raised, since it is not possible to calculate without two select statements. The compilation error should suggest using chol.

How to fix for real

This is actually fixable.

The problem is, frankly, I don't have the wherewithal these days to support this project in excruciating depth. (Sorry, I'm busy on other things.) Because this issue is not something that should impact Redshift users unless they deviate from the suggested method of using chol, I don't consider this critical enough to spend time on.

If anyone reading this wants to fix it themselves, there is a solution if you are willing to contribute a PR. For databases that do not support covar_pop, an additional preprocessing step needs to be done. It looks kinda like this:

with

step1 as (

  select
    grp,
    avg(x1) over (partition by grp) as avg_x1,
    avg(y) over (partition by grp) as avg_y,
    x1,
    y
  from data

),

step2 as (

  select
    grp,
    avg((x1 - avg_x1) * (y - avg_y)) / var_pop(x1) as x1_slope,
    avg(y) - avg(x1) * avg((x1 - avg_x1) * (y - avg_y)) / var_pop(x1) as intercept
  from step1
  group by grp

)

select *
from step2

Basically, it is possible to calculate covar_pop, but only by using two select statements, the first one calculating the average of the column.

That said, the code is a mess (in fairness to past me, how could it not be? This is jinja2 + SQL abuse dialed up to 11). And this preprocessing step should only occur for databases where covar_pop is not a supported method. I will not accept a PR that adds the window funcs to dbs that actually support covar_pop as it is a superfluous calculation in those cases. Adding this will be a lot of work, and I don't recommend diving down this rabbit hole. Please consider just using chol, it will make everyone's life easier. 😄 But, if you are insane enough to add this and you meet all the requirements, I salute you and I will get your change in.

@dwreeves dwreeves added bug Something isn't working help wanted Extra attention is needed low priority Not something I intend to work on, but I will accept a PR that fixes it labels Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed low priority Not something I intend to work on, but I will accept a PR that fixes it
Projects
None yet
Development

No branches or pull requests

1 participant