Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge-trees-to-database #15

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

tzinckgraf
Copy link
Contributor

This PR adds a merge step so that we can move trees into the database based on the geojson files. To do this, I've added a merge step to the data pipeline. This would run sometime after normalization using the geojsons from that step.
The process does two things:

  • Use ogr2ogr to append a geojson file to database table, in this case a staging table.
  • Use pgPromise to call a stored procedure that merges the staging table into the main table.

There are still a few TODOs in terms of async code we can add, albeit with additional libraries.

The performance of this step depends on the size of the file. ogr2ogr uploads 20k rows at a time. A file with 600k rows (like NYC) can take on the order of minutes, while a file with 20k rows (like Alameda) is within seconds. This is expected to run monthly, so that is not much of a concern.

Not included in here is the .env file that will be necessary holding the database info, similar to the wtt_server repo.

- added a merge step to upload to the database
- added the db config files similar to the wtt_server
- added command line arguments for the merge statement
- the merge statement runs ogr2ogr then calls a stored procedure
- still some TODOs for more async code
Copy link
Member

@zoobot zoobot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work!!
How much testing have you done on it? Has any source failed or does it time out or exit on the larger datasets?

@@ -12,6 +12,7 @@
"convert": "node bin/convert.js",
"normalize": "node bin/normalize.js",
"concatenate": "node bin/concatenate.js",
"merge": "node bin/merge.js",
"save": "node bin/save.js",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge will get rid of the need for the save step I think. Do you want to handle that or should I assign myself?

* source:
* https://github.com/vitaly-t/pg-promise/issues/78#issuecomment-171951303
*/
function camelizeColumns(data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note, we have this function replicated here: https://github.com/waterthetrees/wtt_server/blob/68e22644d31da834a69b4435373e8b9aeac9ad5e/server/db/pg-promise-config.js

Probably not worth calling between repos tho since its a common utility function.

console.log(`Running for ${source.destinations.normalized.path}`);
// FIXME use the async version of exec, but that means a new dependecy
const command = `ogr2ogr -f "PostgreSQL" PG:"host=${dbConfig.host} user=${dbConfig.user} password=${dbConfig.password} dbname=${dbConfig.database}" ${source.destinations.normalized.path} -nln tree_staging -geomfield geom -append`
exec(command, async (error, stdout, stderr) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer spawn here as it has larger buffer for longer running processes. Could use spawnSync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Sources Integration with FE and vector tiles and db
2 participants