
Using a user-local cache for URL imports makes reproducible builds hard #213

Closed

elldritch opened this issue Jun 8, 2018 · 15 comments

@elldritch

elldritch commented Jun 8, 2018

TL;DR: Go already tried the ~/.deno/src model with $GOPATH, and it sucks when working with multiple projects. Go switched to vendoring because it's a lot simpler, and Deno should use vendoring too.

This is a continuation of #195. I think the current URL import implementation has a serious flaw around reproducible builds when working with multiple projects. Even if the implementation isn't changed, I'd like to know what the idiomatic Deno way of handling these use cases should/would be.

To clarify, I think that URL imports are a good idea. I disagree with the implementation detail of how they work ("load on first run, then cache", which I'll call LFRC), because I think this makes reproducible builds with multiple projects unnecessarily difficult.

This issue contains some example use cases where this implementation makes reproducible builds difficult and a proposal for an alternative implementation that keeps the same URL import syntax, maintains the spirit of URL imports (i.e. simple module resolution and no central registry), and makes reproducible builds easy.

Drawbacks of LFRC

I already wrote a bit about difficulties with the single-project workflow in https://github.com/ry/deno/issues/195#issuecomment-395565575, which I'll summarise here:

  • There are many times when I'll want to copy a project to a new machine (e.g. sharing a project with a coworker, pushing to a Git repository, deploying to a production machine, creating a new container image).
  • Every time I copy my project, I must also copy ~/.deno/src in order to get a reproducible build, because any URLs I'm importing might have changed or become unavailable between the time I first ran my project and the time the destination machine first runs it.

These are already serious (albeit surmountable) issues, but the bigger problems arise when working with multiple projects:

  • Every Git repository of a Deno program must also include ~/.deno/src or do vendoring.
  • Every time I download a new project, I need to overwrite my local ~/.deno/src with the project's provided ~/.deno/src (or at least overwrite the parts of my local ~/.deno/src that specify the downloaded project's dependencies). What if I was working on another project?

This model is exactly equivalent to Go's before version 1.5 introduced vendoring (just substitute ~/.deno/src for $GOPATH).

In order to solve these issues, we'd need to resort to the same heavy-handed, painful, and unintuitive solutions: either rewrite URL import paths (so each project gets unique import URLs that never clash) or version ~/.deno/src (so when we work on different projects, we can switch to each project's own copy of ~/.deno/src).

Given a user-local (instead of project-local) dependency cache, it is fundamentally impossible to solve the problem that a single user may have different projects that may have different dependencies that use the same name:

  • Even if you add the version into the URL, the actual source code at the URL may change between different first runs of the program. You might have two different projects that import the same URL but expect different source code because the projects were authored at different times. One common criticism of NPM is that we rely on a central authority to ensure that versions remain immutable; this is like relying on O(n) different NPMs (where n is the number of different dependency authors).
  • Even if you add a hash into the URL, the dependency may become unavailable between different first runs of the program.
  • Even if you added the file name of the importing source code file into the ~/.deno/src cache, the location of the project can change (it might move around on the same machine if the user has to move folders, and different users will have different directory structures when they clone the same Git repository) because the concept of a "project" is not tied to any particular filesystem path.

In an ideal world with ideal humans and ideal infrastructure, these problems would not exist. Unfortunately, we don't live in that world, so we shouldn't use a dependency model that forces users to rely on dependency providers behaving ideally.

Vendoring is a better solution

Instead of LFRC, URL imports should be vendored (in Go 1.5+ style). This means that projects should have some designated folder (perhaps vendor/) that contains all of their dependencies.

This fundamentally solves the problem that different projects may have different dependencies that have the same name. It's also easy to implement, easy to understand, and easy to use. Here's how dependency resolution with vendoring could work:

  1. Deno provides some tool (hypothetically, deno get) or the community builds its own tool.
  2. A user writes code that imports some dependency named Foo (where Foo is a URL or other string supported by the vendoring tool).
  3. The user runs deno get, which downloads Foo to vendor/Foo.
  4. When the program is run and the runtime encounters some import Foo, it tries to load vendor/Foo. If vendor/Foo does not exist, the program crashes.

This behaviour is extremely intuitive: the tool's operation is simple, the failure modes are simple, and writing your own tool is simple. Users who desire complex dependency resolution logic (e.g. for doing version resolution) can easily write their own tool, and that complexity is not baked into the runtime.
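To make step 4 concrete, here is a minimal sketch of what the resolution step could look like. Nothing here is Deno's actual behaviour: the vendor/<host>/<path> layout and the resolveVendoredImport helper are assumptions made purely for illustration.

```ts
// Hypothetical sketch of step 4: map a remote import URL onto a path under
// vendor/ and fail loudly if the vendored copy is missing.
function resolveVendoredImport(specifier: string, vendorDir = "vendor"): string {
  const url = new URL(specifier);
  // e.g. https://example.com/lib/foo.ts -> vendor/example.com/lib/foo.ts
  const vendoredPath = `${vendorDir}/${url.hostname}${url.pathname}`;
  try {
    Deno.statSync(vendoredPath);
  } catch {
    throw new Error(`import ${specifier} is not vendored at ${vendoredPath}`);
  }
  return vendoredPath;
}
```

A hypothetical deno get would simply be the inverse: fetch each remote import once and write it to the same vendor/<host>/<path> location.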

I am almost certain that if Deno does not support this natively and becomes popular, then the community will be forced to write this tool. Go has already been down this path. Using a user-local dependency cache provides a very poor developer workflow and is not feasible for users who need to reproducibly build multiple projects.

@kizerkizer
Contributor

kizerkizer commented Jun 8, 2018

What if the cache stored a hash of the downloaded package alongside each URL? Then, some external tool could be written to generate a "snapshot" of the cache for reproducibility. The same tool could tell you if the packages have changed since the snapshot by comparing hashes of downloads.

Edit: Actually, the external tool could do this hashing part as well. So it could all be implemented by a third-party tool separate from deno.
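A rough sketch of what such a snapshot tool could look like (the cache path and the cache-snapshot.json file name are illustrative assumptions, not anything deno defines):

```ts
// Hypothetical snapshot tool: hash every file in the download cache and
// compare against a previously saved snapshot.
async function sha256Hex(data: Uint8Array): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", data);
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

async function hashTree(dir: string): Promise<Record<string, string>> {
  const hashes: Record<string, string> = {};
  for await (const entry of Deno.readDir(dir)) {
    const path = `${dir}/${entry.name}`;
    if (entry.isDirectory) {
      Object.assign(hashes, await hashTree(path));
    } else if (entry.isFile) {
      hashes[path] = await sha256Hex(await Deno.readFile(path));
    }
  }
  return hashes;
}

const snapshotFile = "cache-snapshot.json";
const current = await hashTree(`${Deno.env.get("HOME") ?? "."}/.deno/src`);
try {
  const saved = JSON.parse(await Deno.readTextFile(snapshotFile));
  for (const [path, hash] of Object.entries(current)) {
    if (saved[path] !== hash) console.log(`changed since snapshot: ${path}`);
  }
} catch {
  // No snapshot yet: write one.
  await Deno.writeTextFile(snapshotFile, JSON.stringify(current, null, 2));
}
```

Run with --allow-read --allow-write --allow-env: the first run writes the snapshot, later runs report any cached file whose hash has drifted.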

@kizerkizer
Contributor

kizerkizer commented Jun 8, 2018

Also, @ry mentioned bundling npm with node as a regret, since it made npm the de facto project host and created unnecessary centralization. Thus, I believe that deno shouldn't be bundled with a package manager. The community should write one (or many), as you mentioned will inevitably happen. Then, multiple package managers or package hosts can compete fairly with one another.

@kizerkizer
Contributor

kizerkizer commented Jun 8, 2018

Regardless, this could all be done without removing URL imports. Imagine you want to get started on some project quickly... you import the URL and get going. Later on, you can use the third-party tool to declare dependencies and ensure integrity, and potentially remove your URL imports.

I'd like to point out a parallel between TS and this strategy: both allow you to get up and running quickly, since you can just write regular JS, and analogously just throw in a URL import. Later, perhaps simultaneously, you could add typing and other TS features and get rid of the URL imports. The key point is that the time from idea to code is minimized at first.

@elldritch
Author

  1. Hashes don't solve the name clashing problem for multiple projects.
  2. There is a distinction between a central tool and a central registry. You can have a standard tool which still uses decentralised package sources (e.g. see Go's dep tool). In my experience with Go, a standard tool is a benefit because it reduces fragmentation, lowers the learning curve for beginners, and writing your own tool is much easier than running your own registry.
  3. I'm not suggesting that we remove URL imports.

@jedahan

jedahan commented Jun 8, 2018

This is one of the clearest issue titles I've seen in a while. That said, I disagree with the premise that deno should decide how to solve reproducible builds. Reproducible builds are a tooling and process feature that is appropriate for some projects but not all.

It seems to me that LFRC does not conflict with any of the proposals you mentioned in the 'Vendoring is a better solution' section.

I see deno as just a runtime that can defer to the OS on how to resolve a URI.

If you do require reproducible builds, then LFRC does not preclude doing anything like...

  • using a dvendor tool that uses vendor instead of ~/.deno/src as the default path
  • building dvendor get to resolve things in reproducible ways instead of bundling ~/.deno/src
  • having your main.ts entry point resolve things however you want them to (like calling dvendor) instead of working around whatever semantics we come up with now
  • using a content-addressed URI scheme instead of one that can change bytes out from under you

Baking in any tools or processes will make it harder to change in the future when we come up with better solutions than node_modules or go's vendoring.

More importantly, if you care about specificity, I think proposals like #200 capture the spirit of helping without restricting future decisions.

@Macil

Macil commented Jun 8, 2018

I personally really like the idea of a "deno get" tool which creates a vendor/ directory, and having deno runs cause URLs to be loaded only from vendor/ instead of from the network if the vendor directory is present. Users who don't care about reproducible builds don't need to run deno get and don't deal with it for their own projects. Users that clone a repo that includes a vendor/ directory don't have to do any install step and don't have to run the project's code any differently than normal.

The "deno get" tool could be external / third party, but if deno doesn't have the "load URLs only from vendor/ if vendor/ is present" behavior built into it, then the one of the following would have to be true:

  1. Users would have to launch the project using some special configuration (possibly dvendor deno ... instead of deno ...) in order to use the vendored files. Users that don't launch the project the right way will silently get unvendored code. (If I publish deno projects, I want to be able to freely refer to many URLs in my source that I don't necessarily trust to always stay online and always honestly serve the same source, and I want to de-risk the danger by having my project vendor the code or hashes. If a user of my project runs my project without the correct vendoring setup and one of the many URLs I referred to became malicious, then that sucks hard for the user and puts me in an awkward spot.)
  2. The "deno get" tool would have to modify the project's source code (and the source code of the files it downloads) in order to rewrite the import statements to point into the vendor/ directory. I think it's a bit icky, might interact weirdly with sourcemaps if any files use them (I guess this is less likely since everyone is using typescript, but I don't think it's a possibility to write off), and might preclude any schemes that involve identifying/verifying/doing something with files by their content hash, but at least you know no users run your project wrongly. This might be nontrivial to do if there are any dynamic imports (in the project's code or the dependencies).
  3. The "deno get" tool makes some change to the directory that causes deno to execute URL imports differently for files in that directory. Perhaps "deno get" creates a special file "serviceworker.ts", which uses ServiceWorker-like APIs to intercept imports, changing any remote URLs to point into the local vendor directory. deno by default would always search the directory (and parent directories) of any script run to look for "serviceworker.ts" and would execute that first. Bringing in ServiceWorkers like this could arguably fit Deno's goal of supporting browser-compatible APIs.

@jedahan

jedahan commented Jun 9, 2018

If you want people to use your vendored files, whether they were installed via dvendor, deno get, or anything else, have your imports refer directly to the files you distribute in the repo. No need for deno to treat any directory as special.

@mohsen1
Contributor

mohsen1 commented Jun 9, 2018

Subresource integrity can help with this.

https://github.com/ry/deno/issues/200

@elldritch
Author

#200 does not solve the name clashing problem. This seems to be a point of confusion, so here's a concrete example:

Alice writes some program Foo, which depends on https://example.com/dep@v1.0.0. She publishes the program and forgets about it.

For some reason, https://example.com/dep@v1.0.0 changes in an incompatible way.

Later (after the change), Bob writes some program Bar, which depends on the changed version of https://example.com/dep@v1.0.0.

Now, somebody who wants to use both Foo and Bar can build one or the other, but not both at the same time. Both programs think they are depending on the same dependency, but only one version can be cached at a time. If Foo's version is cached, then Bar's build breaks, and vice versa. Which version is the "correct" one? Arguably, both of them.

This problem arises because dependency names may not be unique across projects. If your dependency cache is global across all of your projects (i.e. user-local), then it requires that all dependencies of those projects are uniquely named. As in the case above, this is not always guaranteed to be true.

Yes, this is arguably a pathological edge case. But this kind of edge case arises all the time in building real-world programs, and vendoring is the simplest way to address it. Other solutions, like SRI or lockfiles with hashes, pay the same complexity cost (your project has to contain something specifying your dependencies) without reaping all the benefits (SRI/hashes do not solve name clashes or availability).

Yes, you can vendor without requiring runtime support, and LFRC does not preclude vendoring. But LFRC presents a significant footgun to new users. I think applying the law of least surprise to developer experience suggests that:

  • If a project builds in some state, it should continue to build in that state regardless of what happens in other projects.
  • Running one project should not affect the builds of another project.

LFRC does not guarantee either of these.

Reproducible builds are very difficult to solve with tooling bolted on as an afterthought. Go has been down this path -- it requires dependency name rewriting or dependency cache versioning. Vendoring is a simple solution to a real problem.

@mohsen1
Contributor

mohsen1 commented Jun 9, 2018

I was about to say Go is doing this with no problem but saw this blog post complaining about similar issues!

@ry
Member

ry commented Jun 9, 2018

This is the purpose of the -cachedir flag. Example:

deno -cachedir=./deps myscript.ts

I think having a global download cache as the default is reasonable as long as it can be circumvented. How do you feel about this, @ilikebits?

@kitsonk
Contributor

kitsonk commented Jun 9, 2018

While I think it is inevitable that some sort of loader/resolver configuration will become necessary, it feels a bit premature. Actually, the lack of a loader configuration is one of the things that always frustrated me with Node.js, whereas AMD loaders were configurable. SystemJS provided a similar but incompatible configuration, which was sort of on track with the WHATWG, but that has totally stalled IIRC.

One of the biggest challenges with ESModules at the moment is this sort of meta problem of how you deal with resolution and loading of modules. My honest opinion is just to track that as closely as possible, though it is still likely to go nowhere fast. There is a general lack of agreement, because it is a complex, horrible topic.

The only other option, outside of dealing with the configuration of the cache directory, is to make the module resolution "pluggable" from within deno. Of course, anything you expose, people will use and become dependent upon, and they will blame the person who created it for the 👣🔫. So even saying all that, it just feels like the best thing to do is hobble along with some expressed semantics and patterns around hosting "semver" modules on the web, find out what problems are really encountered, and solve those problems when they occur. We are always great at trying to solve the problems we think we have, only to find out we didn't have them.

@elldritch
Author

@ry that sounds like a reasonable compromise if you're really intent on maintaining the current implementation. I can easily see a hypothetical dvendor tool where dvendor run <project> runs something like deno -cachedir=<project>/vendor <project>/main.ts.
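Such a wrapper could be almost trivial. Here is a rough sketch; dvendor is still hypothetical, the -cachedir flag is the one described above, and Deno.Command is used purely as a convenient subprocess API for illustration:

```ts
// Hypothetical "dvendor run <project>": re-invoke deno with the project's
// vendor/ directory as its download cache.
const project = Deno.args[0] ?? ".";
const { code } = await new Deno.Command("deno", {
  args: [`-cachedir=${project}/vendor`, `${project}/main.ts`],
  stdout: "inherit",
  stderr: "inherit",
}).output();
Deno.exit(code);
```

(It needs --allow-run, and it assumes the project's entry point is main.ts.)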

That said, I really think LFRC is a hidden footgun. It's an edge case that people probably won't run into a lot, but it'll definitely be somebody's surprised and mildly irritated blog post somewhere down the line.

I think the trade-offs between LFRC and vendoring are:

  1. Ease of getting started: using a global download cache is definitely the winner here, although I really don't think a hypothetical deno get is too much extra hassle given the much simpler semantics it buys.
  2. Simple semantics: vendoring's semantics don't require you to reason about whether a project is being run for the first time or not. Every run on every machine uses the same source code every time.
  3. Simple implementation: vendoring is so simple that you can already do it (kind of, if you know the semantics of the cache dir). Moreover, anybody who wants to use their own module resolution logic (supporting custom protocol URLs, version resolution, etc.) can easily write a tool on top of vendoring because it's just regular files.

I think vendoring sits at a sweet spot of having simple semantics and implementation, being open to extensibility, and being very easy for new users to understand. (In fact, I wouldn't be surprised if beginners are just as confused, if not more so, by implicit downloads as by an explicit download step.)

@ry
Member

ry commented Jun 12, 2018

@ilikebits I'm pretty much in agreement. I just want to make sure the module resolution scheme is as dumb as possible. If other tooling wants to hack it - that's up to them - but the base runtime should be very simple.

One use-case I'd like to support is:

deno https://gist.github.com/ry/eslint.ts mylocalfile.js

By having the caching happen globally by default, and having default security, this allows people to distribute and run complex utility programs from any location.

I'm not necessarily against deno get but let's kick the tires with -cachedir a bit and see how it feels - because it's literally a 6 line implementation. I would be very happy to get away with something so trivial: https://github.com/ry/deno/blob/69868c2b0e4372e6d7e49821caca41d372686eea/deno_dir.go#L100-L104

@elldritch
Author

Sure, trying out -cachedir sounds good to me.
