-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enables non-Node map scripts #68
Conversation
Use spawn instead of fork. Call 'python' with the map script as the argument. Also setup stdio arguments to be the same as fork().
Python runners would then specify this option as 'python'
@jwass @sgillies I don't know a whole lot about python - is inter-process communication very difficult? It seems like it might make more sense just to flat out port tile-reduce than have a frankenstein project that requires users to have some knowledge of node as well as python. That said, this is pretty cool in that it allows for arbitrary map commands - it just means that anybody implementing a custom map command has to do the legwork of reading sources, as well as sending & receiving IPC messages (which @jwass has already taken care of for python). @mourner @morganherlocker any other thoughts here? I think I might be in favor of merging this. It's not tied specifically to python or anything, and is a pretty simple modification. If we do, we should document and include a pretty noticeable caveat emptor that by using this you're signing up for more than just a map function. |
This looks really cool, but I agree with @tcql that it would probably be better to just port the remaining portion of tile-reduce to Python — |
Agreed, from a product standpoint it's better to not create any unnecessary "how do I turn my 500 line Python script into a tile-reduce map function?" (worst case scenario) support questions. @jwass could try to own these questions, but I predict they would come directly to Mapbox's support channels. |
Thanks for taking a look, @sgillies @tcql and @mourner. There are some good reasons to avoid a complete port: we can easily incorporate future functionality, bug fixes, etc. from this repo to the broader user base. It avoids having to create and maintain ports of all the dependencies too, e.g., tile-cover, tile-belt, others. If anyone wants to do this for another language, they'd only have to rewrite worker functionality without having to rewrite all of that yet again. Also anyone looking to make other improvements would be pointed here rather than respective forks. Obviously up to you all. Either way I totally understand and respect all of the concerns laid out from your end. If you end up closing without merging, I'll try to maintain an up-to-date fork for as long as that makes sense (with full attribution to this repo) for Python/other folks - let me know if you object. Hopefully it won't turn into misplaced support tickets at Mapbox :) |
I personally do not think supporting this would be too onerous. Issues asking us to debug complex python worker scripts can be politely declined, which is what we would do with a complex JavaScript worker script as well. Of course, if someone isolates a bug specifically relevant to the library, we are more than happy to help (no matter the worker runtime). I don't think it is unreasonable to ask a user to provide more details or to narrow down their problem a bit to get help setting up a project like this. That said... I do think mixing node.js and python scripts together like this is messy, and a full python port would be preferable. I am very interested in this usecase though, personally for running map scripts in C/C++. possible alternativeI feel like if we support arbitrary runtimes, the ideal interface would really be a CLI rather than javascript. Results would be piped to stdout, and reducing would be handled by whatever program you piped the output to (if at all). For example, a cli command that counts via
This general structure is not far off from where we are now, and could be built on top of it (which means you could use the CLI or the javascript module). It would make it trivial to swap runtimes for 100% of userland code, so you could write your mappers and reducers in any language you want, provided you implement a basic interface in your functions. Thoughts? |
I'm working exactly on this. I have proof of concept of tile-reduce in python which can have python workers. For now they only count features but with shapely I could do something more fancy. And I'm working on C++ worker. It is able to open tile and read features properties for now. I'll probably port tile-reduce part to C++ also. The prototyping in python was just faster. All communication is done with zmq and msgpack. So it means it can be also distributed on more computers not just on the same computer. And there is no problem with interopability between languages. Currently it works seamlessly to run tile-reduce in python and worker in C++. Javascript worker could also work without a problem. I found out that python is slower then javascript for now because protocol buffers decoding is written in Python and even without decoding geometry it takes couple of seconds per tile without any queries on the data. I'll probably write decoding part in C++ or C and call it with python. Primary idea was to wrote a worker in nim to play with it since it looks like Python and is supposed to be fast and able to use C/C++ libraries but I didn't found some libraries for protocol buffers and unzipping. So now I'm playing with C++11. One option is to write server which just posts tile ids like tile-reduce does now (x,y,z) and then each worker reads from a db and unzips and decodes. Another is that server does the decoding and returns some object with decoded stuff. GeoJSON maybe. |
I like where this is headed, @morganherlocker. It's also beginning to look a bit like Hadoop Streaming where mappers/reducers read and write stdin/stdout. Maybe the CLI could also stand up the reduce script:
|
If we're proposing piping geometry into workers using stdin, that works, but it'll be slower. Moving away from passing full tiles of geometry around was the major speed improvement of TileReduce 3 (instead we pass the tile to the worker and each worker manages talking to sources on it's end). If we keep messaging, then the custom worker is still responsible for setting up and querying sources itself. Which is fine, if the community can stand up some decent wrappers. This is all pretty intriguing. If we can keep the process communication overhead low, this could make for some slick and snappy analyses that lean on the strengths of whatever language & tools you need for a particular problem |
Definitely still suggesting that the tile x,y,zoom be sent over to the worker with workers in charge of fetching/decoding themselves. It might be slightly more straightforward for implementations to talk on stdin/stdout rather than the way the inter-process communication works currently on file descriptor 3. But really it's not that different. |
I've been interested in using tile-reduce but with Python analysis code.
I realized you could use a Python worker script in place of worker.js - appropriately setup to use the same interprocess communication channels, parse the command arguments, etc.
The only required change to tile-reduce is to execute the worker script differently. This PR implements this by adding an optional argument to
tileReduce
calledworkerCommand
. WhenworkerCommand
is set,child_process.spawn(options.workerCommand, ...)
is used withstdio
set appropriately. WhenworkerCommand
is omitted, thenfork()
is called with worker.js as before.A working Python library can be found here: https://github.com/jwass/tile-reduce-py with the canonical buildings and roads count example here: https://github.com/jwass/tile-reduce-py/tree/master/examples/count
The Python code is a proof-of-concept but it's definitely possible to write Python that integrates well with tile-reduce, without having to port everything. I'm pretty excited about the possibilities here.
Thoughts? I have limited experience with Node/Javascript so I'd appreciate any help with the code or gotchas I may have missed.