Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enables non-Node map scripts #68

Closed
wants to merge 3 commits into from
Closed

Conversation

jwass
Copy link

@jwass jwass commented Nov 30, 2015

I've been interested in using tile-reduce but with Python analysis code.

I realized you could use a Python worker script in place of worker.js - appropriately setup to use the same interprocess communication channels, parse the command arguments, etc.

The only required change to tile-reduce is to execute the worker script differently. This PR implements this by adding an optional argument to tileReduce called workerCommand. When workerCommand is set, child_process.spawn(options.workerCommand, ...) is used with stdio set appropriately. When workerCommand is omitted, then fork() is called with worker.js as before.

A working Python library can be found here: https://github.com/jwass/tile-reduce-py with the canonical buildings and roads count example here: https://github.com/jwass/tile-reduce-py/tree/master/examples/count

The Python code is a proof-of-concept but it's definitely possible to write Python that integrates well with tile-reduce, without having to port everything. I'm pretty excited about the possibilities here.

Thoughts? I have limited experience with Node/Javascript so I'd appreciate any help with the code or gotchas I may have missed.

Use spawn instead of fork. Call 'python' with the map script as
the argument. Also setup stdio arguments to be the same as fork().
Python runners would then specify this option as 'python'
@sgillies
Copy link

@jwass cool! Your Python mapper project and examples look great.

@tcql we should merge this. Javascript hasn't caught up to Python in scientific computing quite yet, and @jwass's work unlocks a lot of linear algebra and statistical power for tile reduce.

@tcql
Copy link
Contributor

tcql commented Dec 14, 2015

@jwass @sgillies I don't know a whole lot about python - is inter-process communication very difficult? It seems like it might make more sense just to flat out port tile-reduce than have a frankenstein project that requires users to have some knowledge of node as well as python.

That said, this is pretty cool in that it allows for arbitrary map commands - it just means that anybody implementing a custom map command has to do the legwork of reading sources, as well as sending & receiving IPC messages (which @jwass has already taken care of for python).

@mourner @morganherlocker any other thoughts here? I think I might be in favor of merging this. It's not tied specifically to python or anything, and is a pretty simple modification. If we do, we should document and include a pretty noticeable caveat emptor that by using this you're signing up for more than just a map function.

@mourner
Copy link
Member

mourner commented Dec 14, 2015

This looks really cool, but I agree with @tcql that it would probably be better to just port the remaining portion of tile-reduce to Python — index.js is a pretty small and simple script.

@sgillies
Copy link

Agreed, from a product standpoint it's better to not create any unnecessary "how do I turn my 500 line Python script into a tile-reduce map function?" (worst case scenario) support questions. @jwass could try to own these questions, but I predict they would come directly to Mapbox's support channels.

@jwass
Copy link
Author

jwass commented Dec 15, 2015

Thanks for taking a look, @sgillies @tcql and @mourner.

There are some good reasons to avoid a complete port: we can easily incorporate future functionality, bug fixes, etc. from this repo to the broader user base. It avoids having to create and maintain ports of all the dependencies too, e.g., tile-cover, tile-belt, others. If anyone wants to do this for another language, they'd only have to rewrite worker functionality without having to rewrite all of that yet again. Also anyone looking to make other improvements would be pointed here rather than respective forks.

Obviously up to you all. Either way I totally understand and respect all of the concerns laid out from your end. If you end up closing without merging, I'll try to maintain an up-to-date fork for as long as that makes sense (with full attribution to this repo) for Python/other folks - let me know if you object. Hopefully it won't turn into misplaced support tickets at Mapbox :)

@morganherlocker
Copy link
Contributor

I personally do not think supporting this would be too onerous. Issues asking us to debug complex python worker scripts can be politely declined, which is what we would do with a complex JavaScript worker script as well. Of course, if someone isolates a bug specifically relevant to the library, we are more than happy to help (no matter the worker runtime). I don't think it is unreasonable to ask a user to provide more details or to narrow down their problem a bit to get help setting up a project like this.

That said... I do think mixing node.js and python scripts together like this is messy, and a full python port would be preferable. I am very interested in this usecase though, personally for running map scripts in C/C++.

possible alternative

I feel like if we support arbitrary runtimes, the ideal interface would really be a CLI rather than javascript. Results would be piped to stdout, and reducing would be handled by whatever program you piped the output to (if at all). For example, a cli command that counts via count.py and reduces via totals.py might look something like this:

tilereduce
  --bbox=[0,0,10,10] \
  --zoom 15 \
  --map "python ./count.py" \
  --sources ./sources.json \
| python ./totals.py

This general structure is not far off from where we are now, and could be built on top of it (which means you could use the CLI or the javascript module). It would make it trivial to swap runtimes for 100% of userland code, so you could write your mappers and reducers in any language you want, provided you implement a basic interface in your functions.

Thoughts?

@buma
Copy link

buma commented Dec 15, 2015

That said... I do think mixing node.js and python scripts together like this is messy, and a full python port would be preferable. I am very interested in this use case though, personally for running map scripts in C/C++.

I'm working exactly on this. I have proof of concept of tile-reduce in python which can have python workers. For now they only count features but with shapely I could do something more fancy. And I'm working on C++ worker. It is able to open tile and read features properties for now. I'll probably port tile-reduce part to C++ also. The prototyping in python was just faster. All communication is done with zmq and msgpack. So it means it can be also distributed on more computers not just on the same computer. And there is no problem with interopability between languages. Currently it works seamlessly to run tile-reduce in python and worker in C++. Javascript worker could also work without a problem.

I found out that python is slower then javascript for now because protocol buffers decoding is written in Python and even without decoding geometry it takes couple of seconds per tile without any queries on the data. I'll probably write decoding part in C++ or C and call it with python.

Primary idea was to wrote a worker in nim to play with it since it looks like Python and is supposed to be fast and able to use C/C++ libraries but I didn't found some libraries for protocol buffers and unzipping. So now I'm playing with C++11.

One option is to write server which just posts tile ids like tile-reduce does now (x,y,z) and then each worker reads from a db and unzips and decodes. Another is that server does the decoding and returns some object with decoded stuff. GeoJSON maybe.
For C++ worker first option is probably better, for dynamic languages like Pythona and JS second would probably be better. I can also support both.

@jwass
Copy link
Author

jwass commented Dec 15, 2015

I like where this is headed, @morganherlocker. It's also beginning to look a bit like Hadoop Streaming where mappers/reducers read and write stdin/stdout.

Maybe the CLI could also stand up the reduce script:

tilereduce
  --bbox=[0,0,10,10] \
  --zoom 15 \
  --map "python ./count.py" \
  --sources ./sources.json \
  --reduce ./totals.py

@tcql
Copy link
Contributor

tcql commented Dec 15, 2015

where mappers/reducers read and write stdin/stdout

If we're proposing piping geometry into workers using stdin, that works, but it'll be slower. Moving away from passing full tiles of geometry around was the major speed improvement of TileReduce 3 (instead we pass the tile to the worker and each worker manages talking to sources on it's end).

If we keep messaging, then the custom worker is still responsible for setting up and querying sources itself. Which is fine, if the community can stand up some decent wrappers.

This is all pretty intriguing. If we can keep the process communication overhead low, this could make for some slick and snappy analyses that lean on the strengths of whatever language & tools you need for a particular problem

@jwass
Copy link
Author

jwass commented Dec 15, 2015

If we're proposing piping geometry into workers using stdin,

Definitely still suggesting that the tile x,y,zoom be sent over to the worker with workers in charge of fetching/decoding themselves.

It might be slightly more straightforward for implementations to talk on stdin/stdout rather than the way the inter-process communication works currently on file descriptor 3. But really it's not that different.

@jwass jwass closed this Apr 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants