Query Server Protocol v2 #1514
Comments
The hookup is still via the same protocol, just not streaming over stdio.
I'd like to see the protocol support the option to run a query server as a separate network service. There's been significant adoption of gRPC in the cloud-native computing world recently. I think this could provide a lot of interesting benefits, including the network service (HTTP/2 is used for transport) and the streaming piece mentioned above.
If we're talking about reworking the protocol, it is worth considering the implications of how we enforce sandboxing. As soon as you move to a network interface or similar, there's nothing that stops that process from having crazy side effects, like storing the document locally, making calls back to CouchDB while a document is being processed, etc. Our current JS implementation is very strict about not allowing those things, and it's always bugged me that people used to write their own "query servers" that did all of these things that violate the contract between Couch and a query server. @kocolosk wouldn't it make more sense to keep everything inside of Erlang to enforce the above, and just redo rexi so it uses gRPC?
Hmm, I guess I thought the sandboxing was a rather separate issue from the communication protocol. I'm not sure I see how they're related here. I'm certainly open to more innovation that reduces the need for custom code execution.
I'm all for having internal Erlang functions to do most common functions (for performance reasons), but please keep in mind that if the query server goes away and we have to write our functions in Erlang, anyone wanting to do anything remotely advanced will likely have to learn a whole new language specifically for the database. Since CouchDB is designed to empower the client to connect directly and skip the server, your user base, I believe, would primarily be front-end JavaScript developers, who likely don't know Erlang, so it would be nice to keep it consistent. If you were to use a language more common in web development for the query server (JS, PHP, Ruby, etc.), it would be fine, because even if the developer didn't know the language before, the skills will be useful. But that's where the abilities of a query server shine: the fact that it can be any language. Also, given how long it is taking to port the current JavaScript implementation to a new version (it is now 7 years old), I'm a little hesitant to give up the ability to do it myself. That said, the query protocol could use an overhaul. While designing a TypeScript, Node-based query server, I was unable to create enhancements like processing documents in parallel (bear with me a minute). This is where (I think) the network-based query protocol could be helpful (though I'd stick to Unix sockets for performance reasons), because things like views' map functions don't need to be synchronous. Given a protocol that supported this, we could use all cores on a machine to process documents in parallel, either through multiple connections or by tagging commands with IDs, and then let CouchDB sort them in the B-tree as they come in (like it does now).
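The ID-tagging idea in the comment above can be sketched roughly as follows. This is a minimal illustration, not the actual CouchDB protocol: `map_doc`, `map_docs_tagged`, and the integer request IDs are all hypothetical names invented for the example. It uses a thread pool so it runs anywhere; a real multi-core setup in Python would more likely use processes or multiple connections.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_doc(doc):
    """Hypothetical map function: emit [key, value] rows for one document."""
    return [[doc["type"], 1]]


def map_docs_tagged(docs, workers=4):
    """Map documents concurrently, tagging each request with an ID."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tag every submitted command with an ID so that a reply can be
        # matched to its document even when replies arrive out of order.
        futures = {
            pool.submit(map_doc, doc): req_id
            for req_id, doc in enumerate(docs)
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


docs = [
    {"_id": "a", "type": "post"},
    {"_id": "b", "type": "comment"},
    {"_id": "c", "type": "post"},
]
tagged = map_docs_tagged(docs)
```

Because each reply carries its request ID, the caller does not depend on completion order, which is the property that would let the database insert rows as they come in.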
Hi @sploders101, thanks for chiming in on this one! There’s been a lot of work on a next-generation query server network protocol outside the visibility of this ticket, so I’d invite @davisp or maybe @garrensmith or @jiangphcn to summarize the progress. I believe the whole thing is gRPC-based; I'm not sure if we’ve developed a variant that communicates over Unix sockets yet. We’re certainly planning on allowing users to define JS functions for views for the foreseeable future.
@sploders101 Ah, sorry! I think I was too terse in my explanation previously. This change is purely to open up more possibilities for improvement and evolution of query server communication. Previously we were fairly tied to a stdio approach due to the level of abstraction in the couch_query_servers and couch_os_process modules. This work is just to try and abstract our basic function calls so that developers are able to experiment more without requiring the Erlang process communication. To be slightly more concrete, the goal here is to allow for a NIF-based query server (i.e., having SpiderMonkey or V8 linked directly to the Erlang VM) while also allowing for network-attached execution environments (i.e., gRPC [1]). Or to put that all in a whole new and different light, the original goal of having a new protocol was a bit limiting. Having a more abstract API that would let people execute COBOL on the moon is, I think, more in line with what was intended. Also, if someone has COBOL servers on the moon, I'd like to have a chat because that would be awesome.
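To make the "more abstract API" idea a little more concrete, here is a rough sketch of one interface with two interchangeable backends. It is only an illustration of the shape of such an abstraction: the real work lives in Erlang modules, and every name here (`QueryServer`, `InProcessServer`, `SerializingServer`, `by_type`) is hypothetical.

```python
import json
from abc import ABC, abstractmethod


class QueryServer(ABC):
    """Hypothetical transport-agnostic interface (illustrative names only)."""

    @abstractmethod
    def map_doc(self, doc):
        """Return the [key, value] rows emitted for one document."""


class InProcessServer(QueryServer):
    """Stands in for an engine linked into the host VM: a direct call."""

    def __init__(self, map_fun):
        self.map_fun = map_fun

    def map_doc(self, doc):
        return self.map_fun(doc)


class SerializingServer(QueryServer):
    """Stands in for a network-attached backend: data crosses a JSON boundary."""

    def __init__(self, map_fun):
        self.map_fun = map_fun

    def map_doc(self, doc):
        request = json.dumps(doc)  # what would be sent over the wire
        response = json.dumps(self.map_fun(json.loads(request)))
        return json.loads(response)


def by_type(doc):
    """Hypothetical map function used by both backends."""
    return [[doc["type"], 1]]


backends = [InProcessServer(by_type), SerializingServer(by_type)]
rows = [backend.map_doc({"_id": "a", "type": "post"}) for backend in backends]
```

The caller only sees `map_doc`, so swapping the transport does not change the calling code.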
@kocolosk @davisp Using a well-defined protocol for query servers should be more than fine. If you are doing something network-based though, Unix sockets may be worth looking into. I'm not sure how they are done in Erlang, but the C/C++ API is almost identical to that of TCP. I believe the only difference is in construction, because it uses a file path instead of a network address. I'm excited to see where this goes! Thanks!
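The point about Unix sockets differing from TCP mainly in how the address is constructed can be shown with a small sketch. It uses Python's standard `socket` module rather than C/C++, assumes a Unix-like system where `AF_UNIX` is available, and the helper names (`echo_once`, `round_trip`) and the socket file name are made up for the example. The `["reset"]` payload is just a sample message.

```python
import os
import socket
import tempfile
import threading


def echo_once(server):
    """Accept one connection and echo back whatever it sends."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def round_trip(family, address, payload):
    """Send payload to a one-shot echo server on the given family/address."""
    with socket.socket(family, socket.SOCK_STREAM) as server:
        server.bind(address)
        server.listen(1)
        bound = server.getsockname()  # resolves TCP port 0 to the real port
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.socket(family, socket.SOCK_STREAM) as client:
            client.connect(bound)
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
    return reply


# TCP: the address is a (host, port) pair.
tcp_reply = round_trip(socket.AF_INET, ("127.0.0.1", 0), b'["reset"]\n')

# Unix domain socket: the address is a filesystem path.
with tempfile.TemporaryDirectory() as tmp:
    unix_path = os.path.join(tmp, "qs.sock")
    unix_reply = round_trip(socket.AF_UNIX, unix_path, b'["reset"]\n')
```

Everything inside `round_trip` is identical for both transports; only the family constant and the address value change.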
This sounds really exciting. Has progress been made on this idea behind the scenes? Also, does this mean that the indexing process would be radically faster than it is now if a JS engine is directly linked into the Erlang VM? I am a happy CouchDB user; the only worry I have is about the duration of indexing at scale (for a database with billions of documents). I wonder if indexing will be able to catch up if tens of millions of documents are added on a daily basis. I am also thinking about the scenarios when, for some reason, an index must be built from scratch on such a large database. Because of the network overhead with the Query Server (if I understand correctly), I am worried that indexing may never converge. Boosting the performance of indexing and enabling the process to saturate bare metal would give users peace of mind when planning such large systems. So I am curious to know if the Query Server Protocol v2 would help with that or if this is out of scope.
@janl:
@davisp:
@wohali:
Also see #1334.