
propose registry search functionality #206

Open
stevvooe opened this issue Feb 25, 2015 · 31 comments

@stevvooe
Collaborator

The original plan was to provide search as a webhook based extension. The community has indicated they would prefer to have an integrated or standardized way to search a registry instance. Let's gather requirements and scope a proposal on this matter.

@ncdc

@ncdc

ncdc commented Feb 26, 2015

cc @aweiteka

@aweiteka

Here's a pass at some high-level requirements:

  • Aggregate search results from multiple registry indices, both public and private registries. This matches the traditional packaging expectation where several repositories are configured and users can search across them.

  • Support flexible back-end search platforms via REST (Solr, GSA, DB rest interface, etc.). A simple plugin architecture will encourage development of search drivers to support a wide range of platforms.

  • User-customized search results. Different users want to display different columns. Provide a way for users to define what they want to see in search results. Personal examples:

    • I don't care if an image is popular so "star_count" is not interesting to me.
    • I do care what the base image is, e.g. rpm or debian based
    • I want to know where the content came from so I can determine whether I can trust the content: is it signed by a trusted party? (assumed in V2?) Is it certified by my organization or vendor? This is maybe what the v1 "official" boolean is trying to address, but "official" means different things to different users.
  • Keep it simple. Commit to a bounded, well-defined search scheme to protect against feature sprawl.

  • API: I wouldn't expect much change from the v1 API. Defining what keys to support is the tricky part.

    GET /v2/search?key=value
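
To make this concrete: the sketch below is purely hypothetical - neither the route, the query parameter name, nor the response fields are specified anywhere yet. It shows a Go client calling such an endpoint, assuming the response stays close to the existing v1 search format (name, description, star_count, is_official).

    // Hypothetical sketch only: /v2/search and its parameters are a proposal,
    // not part of any spec. The response shape mirrors the v1 search API.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    type SearchResult struct {
        Name        string `json:"name"`
        Description string `json:"description"`
        StarCount   int    `json:"star_count"`
        IsOfficial  bool   `json:"is_official"`
    }

    type SearchResponse struct {
        NumResults int            `json:"num_results"`
        Query      string         `json:"query"`
        Results    []SearchResult `json:"results"`
    }

    func search(registry, query string) (*SearchResponse, error) {
        // e.g. GET https://registry.example.com/v2/search?q=redis (hypothetical route)
        resp, err := http.Get(registry + "/v2/search?q=" + url.QueryEscape(query))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var sr SearchResponse
        if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
            return nil, err
        }
        return &sr, nil
    }

    func main() {
        sr, err := search("https://registry.example.com", "redis")
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d results for %q\n", sr.NumResults, sr.Query)
    }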
    

@stevvooe @ncdc

@ncdc

ncdc commented Mar 13, 2015

Aggregate search results from multiple registry indices, both public and private registries. This matches the traditional packaging expectation where several repositories are configured and users can search across them.

This seems more like something you'd want the Docker CLI to be able to do, and not necessarily a feature of a single registry?

I think it's important to distinguish between features that a registry must implement to support search, vs features you want in the client (Docker CLI).

@aweiteka

This seems more like something you'd want the Docker CLI to be able to do, and not necessarily a feature of a single registry?

@ncdc Good point. As a high-level use case it may help identify data that needs to be returned by search. Do we want to discuss server + client issues together here or separate these out?

@ncdc

ncdc commented Mar 13, 2015

I would say this issue is strictly about standardizing a specification for providing search functionality in a registry. So that would be:

  • what route or routes does the registry expose for search?
  • what inputs do the routes take (search parameters, etc)?
  • what does the response format look like?

@stevvooe WDYT - do we need another issue in docker/docker to talk about search result aggregation across multiple registries and other client items?

@dmp42
Contributor

dmp42 commented Mar 13, 2015

@ncdc Let's first have this discussion here, including for client aggregation (probably relates to discovery, etc).

@stevvooe
Collaborator Author

@ncdc @aweiteka I'm on board with everything stated above. I've brain dumped some thoughts on the search API below.

The search API and the registry should be separate systems, architecturally, but may share a process or API space. They have very different roles and differing organizations will have separate requirements for the construction of that system. For example, if one deploys registries on every machine, it doesn't make sense to require running a search index on every system. Furthermore, a search system may show different results to different users. One search system might include results for the build system, which it shows to operations, but developers might only be able to see a subset.

These are some further requirements I've been bouncing around:

  1. I propose that we call this the "index" API, as the use cases seem to be more oriented towards exploration of content, rather than search alone. Search is a subset of an index. For example, we might have exact match queries in addition to term based relevance.
  2. An index may be integrated with a registry. This implies that an index API must not be mutually exclusive with the registry API.
  3. The actual location of the index API should be separately discoverable.
  4. The registry API should define some ability to report on the complete contents of a repository or registry. This will allow bootstrapping a search system from an existing registry and support aggregate indexes. Perhaps there will be some differential requests ("I have this set of blobs").
  5. The data model should allow some flexibility in indexed fields on json content. Data formats have a tendency to change and move fast. Indexes should continue to work across these updates.
  6. While the data model should allow arbitrary field indexing, the indexable object set should be restricted. Per current plans, this could be manifests, tags and blobs (see the sketch at the end of this comment).

Other possible explorations:

  1. It may not make sense to provide tag and manifest results separately. Is there a way to have the index aggregate related objects, such as manifests and tags?
  2. Should we define a minimum set of data for a search API to work with the docker client?
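
As a purely illustrative sketch of points 5 and 6 above (not a proposal for a concrete schema), an index record could pin down the object set while leaving the extracted fields schemaless, so the index survives manifest format changes without a reindex. All field names here are hypothetical.

    // Illustrative only: field names are hypothetical, not part of any spec.
    package index

    // Record is one indexable object. Kind is restricted to the object set
    // discussed above (manifest, tag, blob), while Fields holds whatever was
    // extracted from the underlying JSON (labels, architecture, ...), so new
    // manifest fields can be indexed without changing this structure.
    type Record struct {
        Kind       string            `json:"kind"` // "manifest", "tag" or "blob"
        Repository string            `json:"repository"`
        Digest     string            `json:"digest,omitempty"`
        Tag        string            `json:"tag,omitempty"`
        Fields     map[string]string `json:"fields,omitempty"`
    }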

@ekristen

When dealing with private registries and the images uploaded there, the search functionality should probably take into consideration the namespaces to which the current user doing the search has access.

@ekristen

Question: what is the use case or reasoning behind running multiple registries such that you'd need to search N registries from a single search instance?

@stevvooe
Collaborator Author

@ekristen Imagine a case where there is no centralized "registry" service. There are just repositories, available via ssh, torrent, http, etc. How would one index that if a service could only catalog a single registry?

@ekristen

Well, in that context it makes sense, but how feasible is that? Right now there is a very specific protocol to the registry, and the docker server and client expect it to be HTTP-based to pull image layers, etc. Is the idea to move away from that, or just to support multiple transport methods?

@dtromb
Contributor

dtromb commented Apr 23, 2015

It seems like there is a difference between a "search feature" (what an end user wants to do, possibly through docker cli - all those high-level reqs of the kind listed above), and a "catalog" (the bare minimum the registry service needs to do to support a search service - that is: provide a list of the images it holds with no filters, and provide the standard detailed info about specific images).

If we specified a very minimal interface to support the latter in docker/distribution, this could be small, straightforward, and very quickly implemented. Then, actual "search" services could be built on top of that, as separate systems - satisfying the SoC @stevvooe points out is necessary.

I haven't read through all the code yet, so I may be missing something, but it seems like this could be extremely simple - just exposing manifest objects through a RESTful interface, something like https://<docker-distribution-repo>/v2/manifest/<key> and a corresponding enumeration. They're already mapped to JSON, even... perhaps also adding some callback hooks to notify search services of manifest CRUD to avoid polling. (For basic cli integration - cli could cache the repo lists and search them locally, similar to the way some popular system package management works - this could be enhanced by a "changed-since" feature, etc...)

Thoughts on this?

@stevvooe
Collaborator Author

@BADZEN This is an accurate way of characterizing the different use cases. Providing a "catalog" for the registry would be realistic, as long as the catalog is only the locally available images. This matters more when you start having a caching proxy. Should the catalog return only the cached items or should it merge its cached list with the upstreams? Realistically, it should only return the cached items.

The main issue here is user expectations. If they use the catalog to index what is available in the registry, they must understand that the registry may serve up images that aren't in that catalog. More accurately, one cannot assume that the registry does not have something if it is not available in the catalog. The other aspect is future support. Yes, it will only take an afternoon to put together an image listing API. But we'd have to live with the decisions made there for a very long time afterwards.

That said, a proposal for a catalog API would be successful with the following restrictions:

  1. The API should only return repository names, similar to the tags api. Returning other data will require too many trips to the backend.
  2. Requests to the catalog should require admin privileges on the registry.
  3. Catalog should be mounted at "/v2/_catalog" and be part of the V2 API specification.
  4. The repository list returned by the catalog API will only include those available to the local "registry cluster" meaning they have a shared backend. For example, a proxy cache would only return cached images and images pushed directly to the cache.
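
To make restriction 3 concrete, here is a minimal Go sketch of a client walking such a catalog endpoint. It assumes the pagination style (n and last query parameters) that later landed in the V2 spec, and omits the authentication that restriction 2 would require.

    // Minimal sketch of a catalog client. Pagination via ?n=<count>&last=<repo>
    // is assumed; authentication and robust error handling are omitted.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    type catalogPage struct {
        Repositories []string `json:"repositories"`
    }

    func listRepositories(registry string) ([]string, error) {
        const pageSize = 100
        var all []string
        last := ""
        for {
            u := fmt.Sprintf("%s/v2/_catalog?n=%d", registry, pageSize)
            if last != "" {
                u += "&last=" + url.QueryEscape(last)
            }
            resp, err := http.Get(u)
            if err != nil {
                return nil, err
            }
            var page catalogPage
            err = json.NewDecoder(resp.Body).Decode(&page)
            resp.Body.Close()
            if err != nil {
                return nil, err
            }
            all = append(all, page.Repositories...)
            if len(page.Repositories) < pageSize {
                return all, nil // short page: no further results
            }
            last = page.Repositories[len(page.Repositories)-1]
        }
    }

    func main() {
        repos, err := listRepositories("https://registry.example.com")
        if err != nil {
            panic(err)
        }
        fmt.Println(repos)
    }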

A few notes:

  1. There is already a notification system that could be used to power a search system. If you're comfortable with the guarantees, it's sufficient.
  2. The manifests are already exposed via a REST API. Please see the specification for more information.

@atcol

atcol commented May 4, 2015

Perhaps another use case:

It would be useful to be able to index/search a registry backend running V2, much like V1. I'm the developer of the Docker-Registry-UI project and am unable to integrate with 2.0 until there is a similar REST endpoint like /search.

@stevvooe
Collaborator Author

stevvooe commented Jun 3, 2015

@atc- @BADZEN @ekristen @aweiteka I've submitted a draft of the catalog API for discussion. Please see #583.

@ekristen

ekristen commented Jun 3, 2015

@stevvooe the docs look good for the catalog API. Just curious as to the timeline: when do you think all those new endpoints will be implemented?

@stevvooe
Collaborator Author

stevvooe commented Jun 3, 2015

@ekristen My goal is to have them ready for the 2.1 release.

@trinitronx

As a current docker-registry v1 user looking eventually to migrate to v2, I'd like to chime in on one initial problem that I can see if this is only implemented using the notification system mentioned by @stevvooe.

That problem is:

Given a pre-existing STORAGE_BACKEND (e.g. an S3 bucket) already populated with images, there will only be pull events for images that users already know about. This therefore limits the index contents to images that users already know about... potentially leaving out images which already exist in the STORAGE_BACKEND but for which no notification events have been generated to populate the index.

This may be obvious, but I just wanted to point it out as a potential problem for current private v1 or v2 registry users who most likely already have images stored in a pre-existing STORAGE_BACKEND. Ideally there should be some way to populate the index the first time with all of the pre-existing images in the registry's storage backend.

P.S.: I like where this is going, and that the different use cases and architectural considerations are being discussed (e.g: the things brought up in this comment by @stevvooe). I especially like that the registry and search functions can be made separate and that it shouldn't be required to run a search index on every system that runs the docker registry. In my past experiences with the v1 registry this can be important from a performance and reliability standpoint.

@stevvooe
Collaborator Author

@trinitronx No notification based search index can maintain accurate state unless it is based on a transaction log. Notifications can get missed, dropped or corrupted with even the greatest care.

We've addressed this concern by creating the catalog API. Periodic sync via catalog + notifications should allow one to maintain an accurate search index. Once we see a few implementations (they're out there), we'll start the specification process for the V2 Search API. However, there is still a lot in flux with the manifest format. Ideally, we want an extensible index and querying system, allowing users to search fields added in new versions without having to update the search index or registry.
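
A rough sketch of the periodic-sync half (the notification half is push-based and omitted): walk the catalog, list tags per repository via /v2/<name>/tags/list, and hand anything the index doesn't know about to a reindexing hook. The reconcile/indexed names are illustrative; auth, pagination and retries are elided.

    // Sketch of catalog-based reconciliation to backstop missed notifications.
    // Endpoint paths are from the V2 API; everything else is illustrative.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type catalog struct {
        Repositories []string `json:"repositories"`
    }

    type tagList struct {
        Name string   `json:"name"`
        Tags []string `json:"tags"`
    }

    func getJSON(url string, v interface{}) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        return json.NewDecoder(resp.Body).Decode(v)
    }

    // reconcile prints (or would re-index) every repo:tag the index has not seen.
    func reconcile(registry string, indexed map[string]bool) error {
        var c catalog
        if err := getJSON(registry+"/v2/_catalog", &c); err != nil {
            return err
        }
        for _, repo := range c.Repositories {
            var t tagList
            if err := getJSON(registry+"/v2/"+repo+"/tags/list", &t); err != nil {
                return err
            }
            for _, tag := range t.Tags {
                if ref := repo + ":" + tag; !indexed[ref] {
                    fmt.Println("reindex", ref) // hypothetical hook into the search index
                }
            }
        }
        return nil
    }

    func main() {
        if err := reconcile("https://registry.example.com", map[string]bool{}); err != nil {
            panic(err)
        }
    }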

@nyarly

nyarly commented May 16, 2016

One feature we'd like to see is search-by-label. As part of the registry-integrated project I'm working on, we're building what's essentially an ad-hoc by-label search; if that could be standardized into distribution, it'd be great.

As a related issue, the next generation of registry responses (especially the application/vnd.docker.distribution.manifest.v2+json type) doesn't appear to be documented, so where we're currently pulling the v2-1 manifest and parsing its history to extract labels, it's unclear how that'll work in the future.

@gisjedi

gisjedi commented May 20, 2016

@stevvooe Both search by name/tag and by label are features of interest to our team as well. Is this something that is expected to get prioritized for a near-term release, or something we should pursue developing ourselves? We would certainly be happy to make a pull request if this functionality is of interest to others, which it appears to be.

@stevvooe
Collaborator Author

@gisjedi Barring input from @RichardScothern, I am not sure about the roadmap for this task. We definitely want to move forward with something here, but I'm hesitant in coupling it directly to the registry API. For the most part, others have found luck using the notifications feature to populate an application-specific index. Defining a separate, standard image search API, implemented in the docker client, would complement such implementations.

That said, we would be more than willing to assist in taking a contribution. I'll let you coordinate with @RichardScothern, but a good start would be a common Google Doc or a proposal issue.

@CpuID
Contributor

CpuID commented May 20, 2016

but I'm hesitant in coupling it directly to the registry API. For the most part, others have found luck using the notifications feature to populate an application-specific index.

What's the incentive for storing a second source of truth for the metadata that the registry contains already...? Other than having to manage ensuring it is synchronized (event-based population helps, but if the alternate datastore is down for maintenance those events can be lost, and you end up wanting a form of reconciliation to ensure nothing was missed).

@stevvooe
Collaborator Author

What's the incentive for storing a second source of truth for the metadata that the registry contains already...?

The main reason is to keep deployments simple. The V1 registry required a database, a cache and a backend to run. V2 only requires a backend. While this simplifies setup for new users, the main benefit of this approach is that one can deploy a large number of registries in a cluster without having to have a database to go along with that. If you want to replicate your registry out to a set of nodes, you can just run read-only instances and let rsync or some other process populate registry data from an origin. This supports an N+1 scaling model. With some imagination, one could sync the directory via torrent and get p2p. Requiring a search index, in addition to other components, makes these kinds of applications challenging to implement. Decoupling the search API avoids these problems.

The other reason behind this is to support federated search. The current model mostly ties docker to a single registry search endpoint per registry namespace (i.e. example.com/myimage). One may want to actually search a number of registries for myimage and get back aggregated results.

Another consideration is security. While discoverability is great for starting out, searching and running images from a number of different sources can lead to picking up images from a nefarious source. If one could aggregate results from trusted registries into a single endpoint and lock down the docker client to that single source, one would reduce the likelihood of picking up untrusted material.

I've discussed some of these points in this issue and others, so it may be good to read around on this topic.

you end up wanting a form of reconciliation to ensure nothing was missed

This is actually provided by the catalog API.

@msabramo

msabramo commented Jul 8, 2016

Catalog API was implemented in #653

@PI-Victor

Is this discussion stale now? I've lost count of how many GitHub issues I've read, and I can't figure out what the latest is on this topic. docker search doesn't work against v2 distribution.

@stevvooe
Collaborator Author

@PI-Victor The issue is still open. If someone would like to tackle v2 search, they are more than welcome. At this point, we haven't seen a single proposal.

@grexe

grexe commented Sep 12, 2017

I wonder what alternatives I have for extracting just the labels portion of a manifest without having to wade through the manifest history with some custom regex, as mentioned in comment #206 (comment) and described by @nyarly.

@nyarly

nyarly commented Sep 12, 2017 via email

@grexe

grexe commented Sep 12, 2017

@nyarly The problem is they are not even valid JSON - the labels are hidden within the history inside a map with multiple, duplicate v1compatibility keys, where the value is just a quoted JSON dump...
Thanks for the link though, will check it out (although I'm in Java and quite far already ;-)

@stevvooe
Collaborator Author

@nyarly @grexe So, that is the "old" schema1 format. They are valid JSON strings, with JSON inside.

Most images today are going to be schema2, which makes this a lot easier. The labels are going to be part of the config, which is referenced via digest in the manifest. I would suggest investigating whatever is generating schema1 manifests. It could be as simple as adding the correct Accept header when fetching the manifest by tag, or changing to a version of docker that pushes the newer version of the manifest (1.10+, I believe).
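
A minimal Go sketch of that flow: fetch the manifest with the schema2 Accept header, follow config.digest, and read the labels out of the image config blob. Registry URL, repository and tag are placeholders; auth and error handling are stripped down.

    // Sketch: extract labels from a schema2 image without touching schema1 history.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type manifestV2 struct {
        Config struct {
            Digest string `json:"digest"`
        } `json:"config"`
    }

    type imageConfig struct {
        Config struct {
            Labels map[string]string `json:"Labels"`
        } `json:"config"`
    }

    func labels(registry, repo, tag string) (map[string]string, error) {
        req, err := http.NewRequest("GET", registry+"/v2/"+repo+"/manifests/"+tag, nil)
        if err != nil {
            return nil, err
        }
        // Without this header the registry may fall back to the schema1 manifest.
        req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var m manifestV2
        if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
            return nil, err
        }

        cfg, err := http.Get(registry + "/v2/" + repo + "/blobs/" + m.Config.Digest)
        if err != nil {
            return nil, err
        }
        defer cfg.Body.Close()
        var ic imageConfig
        if err := json.NewDecoder(cfg.Body).Decode(&ic); err != nil {
            return nil, err
        }
        return ic.Config.Labels, nil
    }

    func main() {
        l, err := labels("https://registry.example.com", "myorg/myimage", "latest")
        fmt.Println(l, err)
    }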
