Commit

Update DATA_MODEL.md
straight-shoota committed Oct 30, 2021
1 parent 10396dd commit cd23ed9
Showing 1 changed file with 99 additions and 107 deletions.
206 changes: 99 additions & 107 deletions DATA_MODEL.md
Expand Up @@ -8,158 +8,150 @@ database is independent of validations provided by ORM. Constraints and triggers
are in place to make sure the data is inherently consistent, regardless of how it
is accessed.

## Models

### Shards
## Shards

```
+-------------+--------------------------+------------------------------------------------------+
| Column | Type | Modifiers |
|-------------+--------------------------+------------------------------------------------------|
| id | integer | not null default nextval('shards_id_seq'::regclass) |
| name | citext | not null |
| qualifier | citext | not null default ''::citext |
| description | text | |
| created_at | timestamp with time zone | not null default now() |
| updated_at | timestamp with time zone | not null default now() |
+-------------+--------------------------+------------------------------------------------------+
Indexes:
"shards_pkey" PRIMARY KEY, btree (id)
"shards_name_unique" UNIQUE CONSTRAINT, btree (name, qualifier)
Table "public.shards"
Column | Type | Collation | Nullable | Default
-------------+--------------------------+-----------+----------+----------------------------------
id | bigint | | not null | generated by default as identity
name | citext | | not null |
qualifier | citext | | not null | ''::citext
description | text | | |
created_at | timestamp with time zone | | not null | now()
updated_at | timestamp with time zone | | not null | now()
categories | bigint[] | | not null | '{}'::bigint[]
archived_at | timestamp with time zone | | |
Check constraints:
"shards_name_check" CHECK (name ~ '^[A-Za-z0-9_\-.]{1,100}$'::text)
"shards_qualifier_check" CHECK (qualifier ~ '^[A-Za-z0-9_\-.]{0,100}$'::citext)
Referenced by:
TABLE "dependencies" CONSTRAINT "depdendencies_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id)
TABLE "releases" CONSTRAINT "specs_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id)
TABLE "repos" CONSTRAINT "repos_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id) ON DELETE CASCADE
Triggers:
set_timestamp BEFORE UPDATE ON shards FOR EACH ROW EXECUTE PROCEDURE trigger_set_timestamp()
```

### Simple, fault-tolerant, global naming schema for shards

Shards should generally be known by their name. Because of shard's decentralized design, name clashes can't be ruled out.
Shards are generally called by their name. The name is defined in `shard.yml` and also
applied when used as a dependency both in `dependencies:` mapping as well as the folder
name in `./lib`.

Because of the decentralized design of shards, name clashes can't be ruled out.

#### Mirrors

Name clashes mostly come from forks and mirrors of a shard's repository. But there might also be unrelated repositories with homonymous shards.
This is obviously bad by itself, but can't be avoided without a centralized registry.
Name clashes are often caused by mirrors of a shard's repository. This is usually
not that problematic because they can be viewed as alternate repositories of the same shard.

Example: kemal

* Main repo: `github:kemalcr/kemal`
* Old repo: `github:sdogruyol/kemal`
* There are also development forks like `github:matomo/kemal`
* Main repo location: `github:kemalcr/kemal`. This is considered the canonical repo.
* Old repo location: `github:sdogruyol/kemal`
* There are also development forks like `github:straight-shoota/kemal`

They all reference the same shard. Mirrors could be considered as
individual instances but unless they have separate releases, they are considered
the same shard, just provided at a different location. When a mirror releases
independently, it must be considered a fork and a separate shard.

Shardbox doesn't take care of this automatically. Mirrors need to be listed as
mirrors belonging to the canonical repository in the catalog. Shardbox only ever
synchronizes releases from the canonical repository. To separate a mirror to form its
own shard, it simply needs to be removed from the mirrors list.

##### Homonymous Shards

They all reference in essence the same shard. Forks could be considered as an individual instance but unless they have separate releases, they are not really.
There might also be unrelated shards which share the same name.
This is obviously not ideal, but can't really be avoided when there is no
centralized registry that assigns names.

This problem is approached as follows:

* Shards are generally identified by their name as specified in `shard.yml` (e.g. `kemal`) and an additional qualifier (e.g. `kemalcr` or `matomo`)
* Qualifiers can be omitted when there is no ambiguity (probably just first come-first serve). In the database this is expressed as an empty value for `qualifier` (due to enforcing uniqueness constraint). The data mapping should interpret an empty string as `nil`.
* Slug could look like `kemal` (main shard, `kemalcr/kemal`) and `kemal~matomo` (fork, `matomo/kemal`)
* Avoids `/` as delimiter for easier use in HTTP routes and to distinguish from github `<org>/<project>` scheme.
* Shards are generally identified by their name as specified in `shard.yml` (e.g. `kemal`).
* When there are multiple shards of the same name (related or unrelated) in the database, an
additional qualifier is used to tell them apart (e.g. `kemalcr` or `straight-shoota`).
This qualifier is interpreted as `#{qualifier}'s version of #{name}`
* Qualifiers can be omitted when there is no ambiguity.
* Name and qualifier combined form a slug which could look like `kemal` (no qualifier) or
`kemal~straight-shoota`.
* Avoids `/` as delimiter for easier use in HTTP routes and to distinguish from
`<org>/<project>` scheme typically used for source code hosting.
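
The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not Shardbox's implementation: the function names `shard_slug` and `parse_slug` are hypothetical, and the patterns are copied from the `shards_name_check` and `shards_qualifier_check` constraints.

```python
import re
from typing import Tuple

# Character rules copied from the "shards_name_check" and
# "shards_qualifier_check" constraints. The qualifier may be empty.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_\-.]{1,100}")
QUALIFIER_PATTERN = re.compile(r"[A-Za-z0-9_\-.]{0,100}")


def shard_slug(name: str, qualifier: str = "") -> str:
    """Combine name and qualifier into a slug (hypothetical helper)."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid shard name: {name!r}")
    if not QUALIFIER_PATTERN.fullmatch(qualifier):
        raise ValueError(f"invalid qualifier: {qualifier!r}")
    # An empty qualifier means the plain name is used as the slug.
    return name if qualifier == "" else f"{name}~{qualifier}"


def parse_slug(slug: str) -> Tuple[str, str]:
    """Split a slug back into (name, qualifier)."""
    # `~` is not allowed in names or qualifiers, so the split is unambiguous.
    name, _, qualifier = slug.partition("~")
    return name, qualifier
```

Because `~` is excluded by both check constraints, a slug can always be split back into its two parts.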

This is still a trial, and not confirmed to work well with all real-world scenarios.
Especially, it needs to be determined if this nomenclature works for both forks and mirrors as well as entirely different shards, just sharing the same name. It's probably not always easy to tell these two cases apart.
Shardbox automatically assigns qualifiers derived from the URL.
Qualifiers are only assigned when a shard name has already been taken
(first come, first served).
There is currently no mechanism for modifying qualifiers (but might be implemented in the future).
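
A rough sketch of the first come, first served rule follows. It assumes the qualifier is the owner segment of an `<owner>/<project>` URL, which matches the `kemalcr` and `straight-shoota` examples but is not confirmed as the actual derivation. `register_shard` and `qualifier_from_url` are hypothetical names, and the in-memory set stands in for the database.

```python
from typing import Set, Tuple


def qualifier_from_url(url: str) -> str:
    # Assumption: the qualifier is the owner part of "<owner>/<project>".
    return url.split("/", 1)[0]


def register_shard(taken: Set[Tuple[str, str]], name: str, url: str) -> Tuple[str, str]:
    """Return the (name, qualifier) pair for a new shard and record it in `taken`."""
    # Lowercasing mimics the case-insensitive comparison of citext columns.
    if (name.lower(), "") not in taken:
        qualifier = ""  # the name is still free, so no qualifier is needed
    else:
        qualifier = qualifier_from_url(url)
    entry = (name.lower(), qualifier.lower())
    if entry in taken:
        raise ValueError(f"{name}~{qualifier} is already taken")
    taken.add(entry)
    return name, qualifier
```

The first repository to register `kemal` receives the bare name, and any later shard of the same name gets a qualifier.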

### Repos
## Repos

```
+------------+--------------------------+-----------------------------------------------------+
| Column | Type | Modifiers |
|------------+--------------------------+-----------------------------------------------------|
| id | integer | not null default nextval('repos_id_seq'::regclass) |
| shard_id | integer | not null |
| resolver | repo_resolver | not null |
| url | citext | not null |
| role | repo_role | not null default 'canonical'::repo_role |
| synced_at | timestamp with time zone | |
| created_at | timestamp with time zone | not null default now() |
| updated_at | timestamp with time zone | not null default now() |
+------------+--------------------------+-----------------------------------------------------+
Indexes:
"repos_pkey" PRIMARY KEY, btree (id)
"repos_shard_id_role_idx" UNIQUE, btree (shard_id, role) WHERE role = 'canonical'::repo_role
"repos_url_uniq" UNIQUE CONSTRAINT, btree (url, resolver)
Table "public.repos"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+----------------------------------
id | bigint | | not null | generated by default as identity
shard_id | bigint | | |
resolver | repo_resolver | | not null |
url | citext | | not null |
role | repo_role | | not null | 'canonical'::repo_role
metadata | jsonb | | not null | '{}'::jsonb
sync_failed_at | timestamp with time zone | | |
synced_at | timestamp with time zone | | |
created_at | timestamp with time zone | | not null | now()
updated_at | timestamp with time zone | | not null | now()
Check constraints:
"repos_resolvers_service_url" CHECK (NOT (resolver = ANY (ARRAY['github'::repo_resolver, 'gitlab'::repo_resolver, 'bitbucket'::repo_resolver])) OR url ~ '^[A-Za-z0-9_
Foreign-key constraints:
"repos_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id) ON DELETE CASCADE
Triggers:
set_timestamp BEFORE UPDATE ON repos FOR EACH ROW EXECUTE PROCEDURE trigger_set_timestamp()
"repos_obsolete_role_shard_id_null" CHECK (role <> 'obsolete'::repo_role OR shard_id IS NULL)
"repos_resolvers_service_url" CHECK (NOT (resolver = ANY (ARRAY['github'::repo_resolver, 'gitlab'::repo_resolver, 'bitbucket'::repo_resolver])) OR url ~ '^[A-Za-z0-9_\-.]{1,100}/[A-Za-z0-9_\-.]{1,100}$'::citext AND url !~~ '%.git'::citext)
"repos_shard_id_null_role" CHECK (shard_id IS NOT NULL OR role = 'canonical'::repo_role OR role = 'obsolete'::repo_role)
```

NOTES:

* `role` specifies the role of this repo for the shard (defaults to `canonical`). Other values are `mirror` and `fork`. Thus, multiple repositories can be linked to the same shard. This is important for example to preserve continuity when a repository is transferred to a different location (for example `github:sdogruyol/kemal` to `github:kemalcr/kemal`).
* `role` specifies the role of this repo for the shard (defaults to `canonical`). Other values are `mirror`, `legacy` and `obsolete`. Thus, multiple repositories can be linked to the same shard. This is important for example to preserve continuity when a repository is transferred to a different location (for example `github:sdogruyol/kemal` to `github:kemalcr/kemal`) and to map mirrors of the same shard.
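
To make the interplay of `role`, `shard_id` and `url` easier to follow, here is a Python restatement of the three check constraints listed for `repos`. It is a reading aid rather than application code: the database enforces these rules itself, and `repo_violations` is a hypothetical helper.

```python
import re
from typing import List, Optional

SERVICE_RESOLVERS = {"github", "gitlab", "bitbucket"}
# URL shape required for service resolvers: "<owner>/<project>".
SERVICE_URL = re.compile(r"[A-Za-z0-9_\-.]{1,100}/[A-Za-z0-9_\-.]{1,100}")


def repo_violations(resolver: str, url: str, role: str, shard_id: Optional[int]) -> List[str]:
    """Return the names of the check constraints a repo row would violate."""
    violations = []
    # repos_obsolete_role_shard_id_null: obsolete repos must not belong to a shard.
    if role == "obsolete" and shard_id is not None:
        violations.append("repos_obsolete_role_shard_id_null")
    # repos_shard_id_null_role: only canonical or obsolete repos may lack a shard.
    if shard_id is None and role not in ("canonical", "obsolete"):
        violations.append("repos_shard_id_null_role")
    # repos_resolvers_service_url: service URLs are "<owner>/<project>" without ".git".
    if resolver in SERVICE_RESOLVERS:
        if not SERVICE_URL.fullmatch(url) or url.lower().endswith(".git"):
            violations.append("repos_resolvers_service_url")
    return violations
```

For example, a `mirror` row without a `shard_id` is rejected, while a `canonical` row may exist before it is linked to a shard.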

### Releases

```
+---------------+--------------------------+--------------------------------------------------------+
| Column | Type | Modifiers |
|---------------+--------------------------+--------------------------------------------------------|
| id | integer | not null default nextval('releases_id_seq'::regclass) |
| shard_id | integer | not null |
| version | character varying | not null |
| revision_info | jsonb | not null |
| spec | jsonb | not null |
| position | integer | |
| latest | boolean | |
| released_at | timestamp with time zone | not null |
| yanked_at | timestamp with time zone | |
| created_at | timestamp with time zone | not null default now() |
| updated_at | timestamp with time zone | not null default now() |
+---------------+--------------------------+--------------------------------------------------------+
Indexes:
"specs_pkey" PRIMARY KEY, btree (id)
"releases_position_uniq" UNIQUE CONSTRAINT, btree (shard_id, "position") DEFERRABLE INITIALLY DEFERRED
"releases_shard_id_latest_idx" UNIQUE, btree (shard_id, latest) WHERE latest = true
"releases_version_uniq" UNIQUE CONSTRAINT, btree (shard_id, version)
Table "public.releases"
Column | Type | Collation | Nullable | Default
---------------+--------------------------+-----------+----------+----------------------------------
id | bigint | | not null | generated by default as identity
shard_id | bigint | | not null |
version | character varying | | not null |
revision_info | jsonb | | not null |
spec | jsonb | | not null |
position | integer | | not null |
latest | boolean | | |
released_at | timestamp with time zone | | not null |
yanked_at | timestamp with time zone | | |
created_at | timestamp with time zone | | not null | now()
updated_at | timestamp with time zone | | not null | now()
Check constraints:
"releases_latest_check" CHECK (latest <> false)
"releases_version_check" CHECK (version::text ~ '^[0-9]+(\.[0-9a-zA-Z]+)*(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$'::text OR version::text = 'HEAD'::text)
Foreign-key constraints:
"specs_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id)
Referenced by:
TABLE "dependencies" CONSTRAINT "dependencies_release_id_fkey" FOREIGN KEY (release_id) REFERENCES releases(id) ON DELETE CASCADE
Triggers:
releases_only_one_latest_release BEFORE INSERT OR UPDATE OF latest ON releases FOR EACH ROW WHEN (new.latest = true) EXECUTE PROCEDURE ensure_only_one_latest_release_
set_timestamp BEFORE UPDATE ON releases FOR EACH ROW EXECUTE PROCEDURE trigger_set_timestamp()
```

NOTES:

* Releases are bound to a shard (`shard_id`), not an individual repo because repo locations may change. We consider each shard to have a unique release history.
* `position` is a utility column used to sort versions because postgresql doesn't provide proper comparison operator for version strings. There is a `semver` extension, but it requires versions to follow SEMVER, which is not enforced by shards. So we need to enforce this externally using `Service::OrderReleases`. A further enhancement would be to use a trigger and notify channel to automatically request a reorder job, when a version is added or removed.
* Releases are bound to a shard (`shard_id`), not an individual repo because repo locations may change. We consider each shard to have a unique release history determined by the releases provided by the canonical repo.
* `position` is a utility column used to sort versions because PostgreSQL doesn't provide a proper comparison operator for version strings. There is a `semver` extension, but it requires versions to follow SemVer, which is not enforced by shards. So we need to implement the sort order externally using `Service::OrderReleases`. As a benefit, sorting results is a simple integer comparison.
* If a release has been deleted from the repo (i.e. the tag was removed) it is marked as `yanked`. This procedure needs refinement. Yanked releases should still be addressable.
* When a tag is changed to point to a different commit, it is simply updated. This also needs refinement.
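
The idea behind `position` can be illustrated with a small Python sketch. The version pattern is copied from `releases_version_check`; the ordering itself is a simplified assumption and not the algorithm used by `Service::OrderReleases`. The names `sort_key` and `assign_positions` are hypothetical, as is the choice to start positions at 0.

```python
import re
from typing import Dict, List

# Pattern copied from the "releases_version_check" constraint.
VERSION_PATTERN = re.compile(
    r"[0-9]+(\.[0-9a-zA-Z]+)*(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?"
)


def is_valid_version(version: str) -> bool:
    return version == "HEAD" or VERSION_PATTERN.fullmatch(version) is not None


def _segments(text: str) -> list:
    # Numeric segments compare as integers and sort before textual ones.
    return [(0, int(s), "") if s.isdigit() else (1, 0, s) for s in text.split(".")]


def sort_key(version: str) -> tuple:
    # Assumed ordering: compare the main part first; a pre-release
    # (anything after the first "-") sorts before the plain version.
    main, _, pre = version.partition("-")
    return (_segments(main), 0 if pre else 1, _segments(pre) if pre else [])


def assign_positions(versions: List[str]) -> Dict[str, int]:
    """Map each version to an integer position usable in ORDER BY."""
    ordered = sorted(versions, key=sort_key)
    return {version: index for index, version in enumerate(ordered)}
```

Once positions are stored, queries can sort releases with a plain integer comparison instead of parsing version strings.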

### Dependencies

```
+------------+--------------------------+-------------------------+
| Column | Type | Modifiers |
|------------+--------------------------+-------------------------|
| release_id | integer | not null |
| shard_id | integer | |
| name | citext | not null |
| spec | jsonb | not null |
| scope | dependency_scope | not null |
| resolvable | boolean | not null |
| created_at | timestamp with time zone | not null default now() |
| updated_at | timestamp with time zone | not null default now() |
+------------+--------------------------+-------------------------+
Indexes:
"dependencies_uniq" UNIQUE CONSTRAINT, btree (release_id, name)
Foreign-key constraints:
"depdendencies_shard_id_fkey" FOREIGN KEY (shard_id) REFERENCES shards(id)
"dependencies_release_id_fkey" FOREIGN KEY (release_id) REFERENCES releases(id) ON DELETE CASCADE
Triggers:
set_timestamp BEFORE UPDATE ON dependencies FOR EACH ROW EXECUTE PROCEDURE trigger_set_timestamp()
Table "public.dependencies"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
release_id | bigint | | not null |
name | citext | | not null |
repo_id | bigint | | |
spec | jsonb | | not null |
scope | dependency_scope | | not null |
created_at | timestamp with time zone | | not null | now()
updated_at | timestamp with time zone | | not null | now()
```

NOTES:

* `shard_id` points to the shard referenced as dependency. If `NULL`, it could not (yet) be resolved. The dependent shard is available through joining `releases` on `release_id`.
* When a dependency's repository can't be resolved (for example it's a `path` dependency or the URL does not resolve, there is an error, ...) it is marked as `resolvable = false` and won't be revisited in the future. This needs refinement, because the repository might become available at some point.
* `repo_id` points to the repo referenced as dependency. This is only `NULL` if it can never be resolved (`path` scheme).
In all other cases, `repo_id` points to a repository record, but might not be actually resolvable (when the repository is not available).
* The dependent shard is available through joining `releases` on `release_id`.
* Scope is either `runtime` or `dependency`.
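
The join mentioned above can be tried out with a self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, with heavily simplified tables and made-up rows (`my_app`, repo id `100`). Only the join path `dependencies` → `releases` → `shards` reflects the data model described here.

```python
import sqlite3
from typing import List


def demo_connection() -> sqlite3.Connection:
    """Create a simplified stand-in schema with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE shards (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE releases (
            id INTEGER PRIMARY KEY,
            shard_id INTEGER NOT NULL REFERENCES shards(id),
            version TEXT NOT NULL
        );
        CREATE TABLE dependencies (
            release_id INTEGER NOT NULL REFERENCES releases(id),
            name TEXT NOT NULL,
            repo_id INTEGER,  -- NULL for dependencies that can never be resolved
            scope TEXT NOT NULL
        );
        INSERT INTO shards VALUES (1, 'kemal'), (2, 'my_app');
        INSERT INTO releases VALUES (10, 2, '0.1.0'), (11, 2, '0.2.0');
        INSERT INTO dependencies VALUES
            (10, 'kemal', 100, 'runtime'),
            (11, 'kemal', 100, 'runtime'),
            (11, 'local_lib', NULL, 'runtime');
        """
    )
    return conn


def dependent_shards(conn: sqlite3.Connection, repo_id: int) -> List[str]:
    """Names of shards that have any release depending on the given repo."""
    rows = conn.execute(
        """
        SELECT DISTINCT shards.name
        FROM dependencies
        JOIN releases ON releases.id = dependencies.release_id
        JOIN shards ON shards.id = releases.shard_id
        WHERE dependencies.repo_id = ?
        ORDER BY shards.name
        """,
        (repo_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the sample rows, `dependent_shards(demo_connection(), 100)` yields the single dependent shard even though two of its releases declare the dependency.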
