Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update interpret pipeline stage to optionally generate leaf node comps for all properties #106

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Dec 21, 2023

This PR updates the pipeline to optionally generate comps for all properties using leaf node assignments and upload them to S3 based on a new comp_enable param. This experimental approach to comp calculation is explained in this lightsnip vignette and uses the fastest algorithm described in this blog post.

The approach in this PR depends on changes in a companion branch of ccao-data/lightnsip. Before this branch gets merged, that branch should be merged into the main branch of lightsnip and the renv lockfile in this repo should be updated to point back to the main lightsnip branch.

Connects #41.

@jeancochrane jeancochrane force-pushed the jeancochrane/41-add-comparables-finding-output-to-res-and-condo-models-04-interpret-step branch from 05cbea9 to 93ba2ce Compare December 21, 2023 22:11
@jeancochrane jeancochrane force-pushed the jeancochrane/41-add-comparables-finding-output-to-res-and-condo-models-04-interpret-step branch from 93ba2ce to 05cbea9 Compare December 29, 2023 18:19
…tput-to-res-and-condo-models-04-interpret-step
@jeancochrane jeancochrane force-pushed the jeancochrane/41-add-comparables-finding-output-to-res-and-condo-models-04-interpret-step branch from 47c87c9 to e27581f Compare December 29, 2023 23:44
renv.lock Outdated
@@ -760,14 +765,14 @@
},
"glue": {
"Package": "glue",
"Version": "1.6.2",
"Version": "1.7.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few R dependencies got updated as part of the lightsnip update (ccao-data/lightsnip#10), which ended up suggesting an upgrade to the latest lightgbm release (4.3.0). I went ahead and allowed the upgrade and it doesn't seem like any breaking changes were introduced that affect us.

Comment on lines 189 to 192
# Make sure that the leaf node tibbles are all integers, which is what
# the comps algorithm expects
leaf_nodes <- leaf_nodes %>% mutate_all(as.integer)
training_leaf_nodes <- training_leaf_nodes %>% mutate_all(as.integer)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block ended up fixing the float-index slicing error that I was debugging yesterday. It seems like somehow the transformations we tweaked above ended up causing reticulate to interpret all numeric tibble values as floats rather than integers; I'm not 100% sure which transformation caused that change in behavior, but explicitly casting the tibbles to integers here resolves the issue.

@jeancochrane
Copy link
Contributor Author

@dfsnow The changes in response to your comments are officially working as of 2024-01-26-optimistic-eric complete, so I think this is ready for a final pass!

Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @jeancochrane, just a few super minor fixes left + the one blocking change. I can follow all your python here since it's well-commented and don't see any obvious issues.

R/helpers.R Outdated Show resolved Hide resolved
R/helpers.R Outdated Show resolved Hide resolved
@@ -145,7 +145,7 @@ lgbm_model <- parsnip::boost_tree(
# using floor(log2(num_leaves)) + add_to_linked_depth. Useful since
# otherwise Bayesian opt spends time exploring irrelevant parameter space
link_max_depth = params$model$parameter$link_max_depth,

save_tree_error = comp_enable,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): You'll actually need to move this out of the overall model spec and move it to the set_args() call that produces lgbm_model_final. That way it's shut off during CV but is on for final fitting.

pipeline/04-interpret.R Outdated Show resolved Hide resolved
pipeline/04-interpret.R Outdated Show resolved Hide resolved
pipeline/04-interpret.R Outdated Show resolved Hide resolved
pipeline/04-interpret.R Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeancochrane The most recent run failed with an obscure error. After some debugging, I'm pretty sure it's related to R not natively supporting 64-bit integers. If the 64-bit dtype is used in python then reticulate doesn't know what type to cast it back to unless the add-on bit64 R package is loaded.

I'm not sure how this actually succeeded before, but using 32-bit dtypes seems to solve all conversion issues.

@@ -355,7 +355,7 @@ lgbm_model_final <- lgbm_model %>%
stop_iter = NULL,
validation = 0,
trees = lgbm_final_params$num_iterations,
save_tree_errors = comp_enable
save_tree_error = comp_enable
Copy link
Member

@dfsnow dfsnow Jan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jk @jeancochrane, this was the real issue. The save_tree_error arg here had a misspelling.

@dfsnow
Copy link
Member

dfsnow commented Jan 28, 2024

Alright we've got a full run with (seemingly) all bugs worked out. Final log output here. Gonna go ahead and merge!

@dfsnow dfsnow merged commit ddb6b06 into master Jan 28, 2024
7 checks passed
@dfsnow dfsnow deleted the jeancochrane/41-add-comparables-finding-output-to-res-and-condo-models-04-interpret-step branch January 28, 2024 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add comparables finding output to res and condo models 04-interpret step
2 participants