
propose registry search functionality #206

Open
stevvooe opened this issue Feb 25, 2015 · 31 comments

@stevvooe
Collaborator

The original plan was to provide search as a webhook based extension. The community has indicated they would prefer to have an integrated or standardized way to search a registry instance. Let's gather requirements and scope a proposal on this matter.

@ncdc

@ncdc

ncdc commented Feb 26, 2015

cc @aweiteka

@aweiteka

Here's a pass at some high-level requirements:

  • Aggregate search results from multiple registry indices, both public and private registries. This matches the traditional packaging expectation where several repositories are configured and users can search across them.

  • Support flexible back-end search platforms via REST (Solr, GSA, DB rest interface, etc.). A simple plugin architecture will encourage development of search drivers to support a wide range of platforms.

  • User-customized search results. Different users want to display different columns. Provide a way for users to define what they want to see in search results. Personal examples:

    • I don't care if an image is popular so "star_count" is not interesting to me.
    • I do care what the base image is, e.g. rpm or debian based
    • I want to know where the content came from so I can determine whether I can trust the content: is it signed by a trusted party? (assumed in V2?) Is it certified by my organization or vendor? This is maybe what the v1 "official" boolean is trying to address, but "official" means different things to different users.
  • Keep it simple. Commit to a bounded, well-defined search scheme to protect against feature sprawl.

  • API: I wouldn't expect much change from the v1 API. Defining what keys to support is the tricky part.

    GET /v2/search?key=value
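
To make this concrete: the sketch below is purely hypothetical - neither the route, the query parameter name, nor the response fields are specified anywhere yet. It shows a Go client calling such an endpoint, assuming the response stays close to the existing v1 search format (name, description, star_count, is_official).

    // Hypothetical sketch only: /v2/search and its parameters are a proposal,
    // not part of any spec. The response shape mirrors the v1 search API.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    type SearchResult struct {
        Name        string `json:"name"`
        Description string `json:"description"`
        StarCount   int    `json:"star_count"`
        IsOfficial  bool   `json:"is_official"`
    }

    type SearchResponse struct {
        NumResults int            `json:"num_results"`
        Query      string         `json:"query"`
        Results    []SearchResult `json:"results"`
    }

    func search(registry, query string) (*SearchResponse, error) {
        // e.g. GET https://registry.example.com/v2/search?q=redis (hypothetical route)
        resp, err := http.Get(registry + "/v2/search?q=" + url.QueryEscape(query))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var sr SearchResponse
        if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
            return nil, err
        }
        return &sr, nil
    }

    func main() {
        sr, err := search("https://registry.example.com", "redis")
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d results for %q\n", sr.NumResults, sr.Query)
    }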
    

@stevvooe @ncdc

@ncdc

ncdc commented Mar 13, 2015

Aggregate search results from multiple registry indices, both public and private registries. This matches the traditional packaging expectation where several repositories are configured and users can search across them.

This seems more like something you'd want the Docker CLI to be able to do, and not necessarily a feature of a single registry?

I think it's important to distinguish between features that a registry must implement to support search, vs features you want in the client (Docker CLI).

@aweiteka

This seems more like something you'd want the Docker CLI to be able to do, and not necessarily a feature of a single registry?

@ncdc Good point. As a high-level use case it may help identify data that needs to be returned by search. Do we want to discuss server + client issues together here or separate these out?

@ncdc

ncdc commented Mar 13, 2015

I would say this issue is strictly about standardizing a specification for providing search functionality in a registry. So that would be:

  • what route or routes does the registry expose for search?
  • what inputs do the routes take (search parameters, etc)?
  • what does the response format look like?

@stevvooe WDYT - do we need another issue in docker/docker to talk about search result aggregation across multiple registries and other client items?

@dmp42
Contributor

dmp42 commented Mar 13, 2015

@ncdc Let's first have this discussion here, including for client aggregation (probably relates to discovery, etc).

@stevvooe
Collaborator Author

@ncdc @aweiteka I'm on board with everything stated above. I've brain dumped some thoughts on the search API below.

The search API and the registry should be separate systems, architecturally, but may share a process or API space. They have very different roles and differing organizations will have separate requirements for the construction of that system. For example, if one deploys registries on every machine, it doesn't make sense to require running a search index on every system. Furthermore, a search system may show different results to different users. One search system might include results for the build system, which it shows to operations, but developers might only be able to see a subset.

These are some further requirements I've been bouncing around:

  1. I propose that we call this the "index" API, as the use cases seem to be more oriented towards exploration of content, rather than search alone. Search is a subset of an index. For example, we might have exact match queries in addition to term based relevance.
  2. An index may be integrated with a registry. This implies that an index API must not be mutually exclusive with the registry API.
  3. The actual location of the index API should be separately discoverable.
  4. The registry API should define some ability to report on the complete contents of a repository or registry. This will allow bootstrapping a search system from an existing registry and support aggregate indexes. Perhaps there will be some differential requests ("I have this set of blobs").
  5. The data model should allow some flexibility in indexed fields on json content. Data formats have a tendency to change and move fast. Indexes should continue to work across these updates.
  6. While the data model should allow arbitrary field indexing, the indexable object set should be restricted. Per current plans, this could be manifests, tags and blobs (see the sketch at the end of this comment).

Other possible explorations:

  1. It may not make sense to provide tag and manifest results separately. Is there a way to have the index aggregate related objects, such as manifests and tags?
  2. Should we define a minimum set of data for a search API to work with the docker client?
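
As a purely illustrative sketch of points 5 and 6 above (not a proposal for a concrete schema), an index record could pin down the object set while leaving the extracted fields schemaless, so the index survives manifest format changes without a reindex. All field names here are hypothetical.

    // Illustrative only: field names are hypothetical, not part of any spec.
    package index

    // Record is one indexable object. Kind is restricted to the object set
    // discussed above (manifest, tag, blob), while Fields holds whatever was
    // extracted from the underlying JSON (labels, architecture, ...), so new
    // manifest fields can be indexed without changing this structure.
    type Record struct {
        Kind       string            `json:"kind"` // "manifest", "tag" or "blob"
        Repository string            `json:"repository"`
        Digest     string            `json:"digest,omitempty"`
        Tag        string            `json:"tag,omitempty"`
        Fields     map[string]string `json:"fields,omitempty"`
    }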

@ekristen

When dealing with private registries and the images uploaded there, the search functionality should probably take into consideration the namespaces to which the current user doing the search has access.

@ekristen

Question: what is the use case or reasoning behind running multiple registries such that you'd need to search N registries from a single search instance?

@stevvooe
Collaborator Author

@ekristen Imagine a case where there is no centralized "registry" service. There are just repositories, available via ssh, torrent, http, etc. How would one index that if a service could only catalog a single registry?

@ekristen

Well, in that context it makes sense, but how feasible is that? Right now there is a very specific protocol to the registry, and the docker server and client expect it to be HTTP-based to pull image layers, etc. Is the idea to move away from that, or just to support multiple transport methods?

@dtromb
Contributor

dtromb commented Apr 23, 2015

It seems like there is a difference between a "search feature" (what an end user wants to do, possibly through docker cli - all those high-level reqs of the kind listed above), and a "catalog" (the bare minimum the registry service needs to do to support a search service - that is: provide a list of the images it holds with no filters, and provide the standard detailed info about specific images).

If we specified a very minimal interface to support the latter in docker/distribution, this could be small, straightforward, and very quickly implemented. Then, actual "search" services could be built on top of that, as separate systems - satisfying the SoC @stevvooe points out is necessary.

I haven't read through all the code yet, so I may be missing something, but it seems like this could be extremely simple - just exposing manifest objects through a RESTful interface, something like https://<docker-distribution-repo>/v2/manifest/<key> and a corresponding enumeration. They're already mapped to JSON, even... perhaps also adding some callback hooks to notify search services of manifest CRUD to avoid polling. (For basic cli integration - cli could cache the repo lists and search them locally, similar to the way some popular system package management works - this could be enhanced by a "changed-since" feature, etc...)

Thoughts on this?

@stevvooe
Collaborator Author

@BADZEN This is an accurate way of characterizing the different use cases. Providing a "catalog" for the registry would be realistic, as long as the catalog is only the locally available images. This matters more when you start having a caching proxy. Should the catalog return only the cached items or should it merge its cached list with the upstreams? Realistically, it should only return the cached items.

The main issue here is user expectations. If they use the catalog to index what is available in the registry, they must understand that the registry may serve up images that aren't in that catalog. More accurately, one cannot assume that the registry does not have something if it is not available in the catalog. The other aspect is future support. Yes, it will only take an afternoon to put together an image listing API. But we'd have to live with the decisions made there for a very long time afterwards.

That said, a proposal for a catalog API would be successful with the following restrictions:

  1. The API should only return repository names, similar to the tags api. Returning other data will require too many trips to the backend.
  2. Requests to the catalog should require admin privileges on the registry.
  3. Catalog should be mounted at "/v2/_catalog" and be part of the V2 API specification.
  4. The repository list returned by the catalog API will only include those available to the local "registry cluster" meaning they have a shared backend. For example, a proxy cache would only return cached images and images pushed directly to the cache.
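
To make restriction 3 concrete, here is a minimal Go sketch of a client walking such a catalog endpoint. It assumes the pagination style (n and last query parameters) that later landed in the V2 spec, and omits the authentication that restriction 2 would require.

    // Minimal sketch of a catalog client. Pagination via ?n=<count>&last=<repo>
    // is assumed; authentication and robust error handling are omitted.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
    )

    type catalogPage struct {
        Repositories []string `json:"repositories"`
    }

    func listRepositories(registry string) ([]string, error) {
        const pageSize = 100
        var all []string
        last := ""
        for {
            u := fmt.Sprintf("%s/v2/_catalog?n=%d", registry, pageSize)
            if last != "" {
                u += "&last=" + url.QueryEscape(last)
            }
            resp, err := http.Get(u)
            if err != nil {
                return nil, err
            }
            var page catalogPage
            err = json.NewDecoder(resp.Body).Decode(&page)
            resp.Body.Close()
            if err != nil {
                return nil, err
            }
            all = append(all, page.Repositories...)
            if len(page.Repositories) < pageSize {
                return all, nil // short page: no further results
            }
            last = page.Repositories[len(page.Repositories)-1]
        }
    }

    func main() {
        repos, err := listRepositories("https://registry.example.com")
        if err != nil {
            panic(err)
        }
        fmt.Println(repos)
    }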

A few notes:

  1. There is already a notification system that could be used to power a search system. If you're comfortable with the guarantees, it's sufficient.
  2. The manifests are already exposed via a REST API. Please see the specification for more information.

@atcol

atcol commented May 4, 2015

Perhaps another use case:

It would be useful to be able to index/search a registry backend running V2, much like V1. I'm the developer of the Docker-Registry-UI project and am unable to integrate with 2.0 until there is a similar REST endpoint like /search.

@stevvooe
Collaborator Author

stevvooe commented Jun 3, 2015

@atc- @BADZEN @ekristen @aweiteka I've submitted a draft of the catalog API for discussion. Please see #583.

@ekristen

ekristen commented Jun 3, 2015

@stevvooe the docs look good for the catalog API. Just curious as to the timeline: when do you think all those new endpoints will be implemented?

@stevvooe
Collaborator Author

stevvooe commented Jun 3, 2015

@ekristen My goal is to have them ready for the 2.1 release.

@trinitronx

As a current docker-registry v1 user looking eventually to migrate to v2, I'd like to chime in on one initial problem that I can see if this is only implemented using the notification system mentioned by @stevvooe.

That problem is:

Given a pre-existing STORAGE_BACKEND (e.g. an S3 bucket) already populated with images, there will only be pull events for images that users already know about. This therefore limits the index contents to images that users already know about... potentially leaving out images which already exist in the STORAGE_BACKEND but for which no notification events have been generated to populate the index.

This may be obvious, but I just wanted to point it out as a potential problem for current private v1 or v2 registry users who most likely already have images stored in a pre-existing STORAGE_BACKEND. Ideally there should be some way to populate the index the first time with all of the pre-existing images in the registry's storage backend.

P.S.: I like where this is going, and that the different use cases and architectural considerations are being discussed (e.g: the things brought up in this comment by @stevvooe). I especially like that the registry and search functions can be made separate and that it shouldn't be required to run a search index on every system that runs the docker registry. In my past experiences with the v1 registry this can be important from a performance and reliability standpoint.

@stevvooe
Collaborator Author

@trinitronx No notification based search index can maintain accurate state unless it is based on a transaction log. Notifications can get missed, dropped or corrupted with even the greatest care.

We've addressed this concern by creating the catalog API. Periodic sync via catalog + notifications should allow one to maintain an accurate search index. Once we see a few implementations (they're out there), we'll start the specification process for the V2 Search API. However, there is still a lot in flux with the manifest format. Ideally, we want an extensible index and querying system, allowing users to search fields added in new versions without having to update the search index or registry.
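
A rough sketch of the periodic-sync half (the notification half is push-based and omitted): walk the catalog, list tags per repository via /v2/<name>/tags/list, and hand anything the index doesn't know about to a reindexing hook. The reconcile/indexed names are illustrative; auth, pagination and retries are elided.

    // Sketch of catalog-based reconciliation to backstop missed notifications.
    // Endpoint paths are from the V2 API; everything else is illustrative.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type catalog struct {
        Repositories []string `json:"repositories"`
    }

    type tagList struct {
        Name string   `json:"name"`
        Tags []string `json:"tags"`
    }

    func getJSON(url string, v interface{}) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        return json.NewDecoder(resp.Body).Decode(v)
    }

    // reconcile prints (or would re-index) every repo:tag the index has not seen.
    func reconcile(registry string, indexed map[string]bool) error {
        var c catalog
        if err := getJSON(registry+"/v2/_catalog", &c); err != nil {
            return err
        }
        for _, repo := range c.Repositories {
            var t tagList
            if err := getJSON(registry+"/v2/"+repo+"/tags/list", &t); err != nil {
                return err
            }
            for _, tag := range t.Tags {
                if ref := repo + ":" + tag; !indexed[ref] {
                    fmt.Println("reindex", ref) // hypothetical hook into the search index
                }
            }
        }
        return nil
    }

    func main() {
        if err := reconcile("https://registry.example.com", map[string]bool{}); err != nil {
            panic(err)
        }
    }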

@nyarly

nyarly commented May 16, 2016

One feature we'd like to see is search-by-label. As part of the registry-integrated project I'm working on, we're building what's essentially an ad-hoc by-label search; if that could be standardized into distribution, it'd be great.

As a related issue, the next generation of registry responses (especially the application/vnd.docker.distribution.manifest.v2+json type) doesn't appear to be documented, so where we're currently pulling the v2-1 manifest and parsing its history to extract labels, it's unclear how that'll work in the future.

@gisjedi

gisjedi commented May 20, 2016

@stevvooe Both search by name/tag and by label are features of interest to our team as well. Is this something that is expected to get prioritized for a near-term release, or something we should pursue developing ourselves? We would certainly be happy to make a pull request if this functionality is of interest to others, which it appears to be.

@stevvooe
Collaborator Author

@gisjedi Barring input from @RichardScothern, I am not sure about the roadmap for this task. We definitely want to move forward with something here, but I'm hesitant in coupling it directly to the registry API. For the most part, others have found luck using the notifications feature to populate an application-specific index. Defining a separate, standard image search API, implemented in the docker client, would complement such implementations.

That said, we would be more than willing to assist in taking a contribution. I'll let you coordinate with @RichardScothern, but a good start would be a common Google Doc or a proposal issue.

@CpuID
Contributor

CpuID commented May 20, 2016

but I'm hesitant in coupling it directly to the registry API. For the most part, others have found luck using the notifications feature to populate an application-specific index.

What's the incentive for storing a second source of truth for the metadata that the registry contains already...? Other than having to manage ensuring it is synchronized (event-based population helps, but if the alternate datastore is down for maintenance those events can be lost, and you end up wanting a form of reconciliation to ensure nothing was missed).

@stevvooe
Collaborator Author

What's the incentive for storing a second source of truth for the metadata that the registry contains already...?

The main reason is to keep deployments simple. The V1 registry required a database, a cache and a backend to run. V2 only requires a backend. While this simplifies setup for new users, the main benefit of this approach is that one can deploy a large number of registries in a cluster without having to have a database to go along with that. If you want to replicate your registry out to a set of nodes, you can just run read-only instances and let rsync or some other process populate registry data from an origin. This supports an N+1 scaling model. With some imagination, one could sync the directory via torrent and get p2p. Requiring a search index, in addition to other components, makes these kinds of applications challenging to implement. Decoupling the search API avoids these problems.

The other reason behind this is to support federated search. The current model mostly ties docker to a single registry search endpoint per registry namespace (i.e. example.com/myimage). One may want to actually search a number of registries for myimage and get back aggregated results.

Another consideration is security. While discoverability is great for starting out, searching and running images from a number of different sources can lead to picking up images from a nefarious source. If one could aggregate results from trusted registries into a single endpoint and lock down the docker client to that single source, one would reduce the likelihood of picking up untrusted material.

I've discussed some of these points in this issue and others, so it may be good to read around on this topic.

you end up wanting a form of reconciliation to ensure nothing was missed

This is actually provided by the catalog API.

@msabramo

msabramo commented Jul 8, 2016

Catalog API was implemented in #653

@PI-Victor

Is this discussion stale now? I've lost count of how many GitHub issues I've read, and I can't figure out what the latest is on this topic. docker search doesn't work against v2 distribution.

@stevvooe
Collaborator Author

@PI-Victor The issue is still open. If someone would like to tackle v2 search, they are more than welcome. At this point, we haven't seen a single proposal.

@grexe

grexe commented Sep 12, 2017

I wonder what alternatives I have for extracting just the labels portion of a manifest without having to wade through the manifest history with some custom regex, as mentioned in comment #206 (comment) and described by @nyarly.

@nyarly

nyarly commented Sep 12, 2017 via email

@grexe

grexe commented Sep 12, 2017

@nyarly The problem is they are not even valid JSON - the labels are hidden within the history inside a map with multiple, duplicate v1compatibility keys, where the value is just a quoted JSON dump...
Thanks for the link though, will check it out (although I'm in Java and quite far already ;-)

@stevvooe
Collaborator Author

@nyarly @grexe So, that is the "old" schema1 format. They are valid JSON strings, with JSON inside.

Most images today are going to be schema2, which makes this a lot easier. The labels are going to be part of the config, which is referenced via digest in the manifest. I would suggest investigating whatever is generating schema1 manifests. It could be as simple as adding the correct Accept header when fetching the manifest by tag, or changing to a version of docker that pushes the newer version of the manifest (1.10+, I believe).
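
A minimal Go sketch of that flow: fetch the manifest with the schema2 Accept header, follow config.digest, and read the labels out of the image config blob. Registry URL, repository and tag are placeholders; auth and error handling are stripped down.

    // Sketch: extract labels from a schema2 image without touching schema1 history.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    type manifestV2 struct {
        Config struct {
            Digest string `json:"digest"`
        } `json:"config"`
    }

    type imageConfig struct {
        Config struct {
            Labels map[string]string `json:"Labels"`
        } `json:"config"`
    }

    func labels(registry, repo, tag string) (map[string]string, error) {
        req, err := http.NewRequest("GET", registry+"/v2/"+repo+"/manifests/"+tag, nil)
        if err != nil {
            return nil, err
        }
        // Without this header the registry may fall back to the schema1 manifest.
        req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var m manifestV2
        if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
            return nil, err
        }

        cfg, err := http.Get(registry + "/v2/" + repo + "/blobs/" + m.Config.Digest)
        if err != nil {
            return nil, err
        }
        defer cfg.Body.Close()
        var ic imageConfig
        if err := json.NewDecoder(cfg.Body).Decode(&ic); err != nil {
            return nil, err
        }
        return ic.Config.Labels, nil
    }

    func main() {
        l, err := labels("https://registry.example.com", "myorg/myimage", "latest")
        fmt.Println(l, err)
    }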
