RFC: Offline DB/API revamp #1794

Open · Harjot1Singh opened this issue Jul 21, 2020 · 7 comments
Labels: Impacts: Most (affects a majority of end-users) · ϟ Type: Epic (a big goal; a collection of issues)

Comments
@Harjot1Singh (Member) commented Jul 21, 2020

Problem

The repository is built into a relational database, published as an SQLite output and as Docker containers for Postgres + MariaDB.

This is great for a huge number of use cases; however, the data is (imo) intrinsically document-based. That is not to say that relations do not exist, but that they're only useful for gleaning insights if you were querying the database blind, without understanding what it contains, which is often not the case.

In practice, this means that we end up transforming the data into a relational format, but to consume it in a useful manner, we have to join it back together at runtime. This seems a little pointless and is a source of slowness.

Proposition

I can see the flexibility of the relational database being useful in applications for data exploration and visualisation, so there is definitely a valid use case to keep supporting it. But the majority of the time, the database is used in a search-oriented fashion.

So, I propose that we provide a NoSQL/document-based format, which will facilitate high-speed querying offline, whilst maintaining our relational formats.

We'd want to expose this functionality through @shabados/database and expose any of our search features as wrappers, as we currently do. I would personally be happy to drop the SQLite dependency from the JS library entirely, if need be.
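
As a very rough sketch of the shape this could take (the field names, database name, and wrapper below are hypothetical, not a proposed schema), each line would be a self-contained document, so a fetch needs no joins. For example, using PouchDB purely for illustration:

// Sketch only: field names and the 'lines' database are illustrative, not the real schema.
import PouchDB from 'pouchdb'
import PouchDBFind from 'pouchdb-find'

PouchDB.plugin(PouchDBFind)

// One self-contained document per line, so a lookup needs no joins at runtime.
interface LineDoc {
  _id: string                               // existing line ID
  shabadId: string
  gurmukhi: string
  firstLetters: string
  writer: string
  translations: Record<string, string>      // keyed by translation source
  transliterations: Record<string, string>  // keyed by script
}

const lines = new PouchDB<LineDoc>('lines')

// A thin wrapper, in the spirit of the current @shabados/database helpers.
const getShabadLines = async (shabadId: string) => {
  await lines.createIndex({ index: { fields: [ 'shabadId' ] } })
  const { docs } = await lines.find({ selector: { shabadId } })
  return docs
}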

Other options

  • Denormalisation: I did consider producing a denormalised SQLite output. But I suspect it'd be far too large for offline storage on consumer devices.

Design Considerations

  • Doesn't require too much more space than our current DB (~150MB normalised)
  • Fixed, local dataset, so write speed/concurrency/scale issues do not matter
  • Full-text/fuzzy search is a bonus (but a full API should really deal with this)
  • Prioritise ease of use and speed when querying shabads, banis, and lines; anyone who'd like to use the data for other things can use the SQLite build to do so
  • Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere that streams DB updates to everyone (see the replication sketch after this list)
  • Useable on desktop and mobile
  • Local JS API publishable in a similar manner to the existing @shabados/database
  • Partial sync
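
For the differential-updates and partial-sync bullets, a minimal sketch of how this could work if the document store were PouchDB (the remote URL and the source field are assumptions, and the product choice is still open):

// Sketch only: the remote URL and 'source' field are assumptions; PouchDB is one candidate, not a decision.
import PouchDB from 'pouchdb'

const local = new PouchDB('lines')
const remote = new PouchDB('https://api.example.com/db/lines')

// Differential updates: replication only transfers documents changed since the last checkpoint.
// Partial sync: a selector limits what is replicated, e.g. to chosen sources.
local.replicate.from(remote, {
  live: true,
  retry: true,
  selector: { source: { $in: [ 'SGGS' ] } },
})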

Please post any additional considerations.

Solutions

Suggestions of suitable technologies and products would be appreciated.

Harjot1Singh added the Status: 3hard5me, Impacts: Most, and ϟ Type: Epic labels on Jul 21, 2020
@bhajneet (Member) commented

I'm a big fan of this. Have you also considered a supplementary LMDB for enhanced performance? Something specific to desktop/mobile, and perhaps not for other database users.

Check this out: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database. It's a key-value store, but with improvements that narrow down to the key faster. If we were to use it for all potential search queries it would be gross, we'd be talking gigabytes of storage; however, if we use it smartly (perhaps using serialised IDs/numbers for values, and incomplete lists), we may be able to squeeze out even more performance than a document-based DB would.
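
To make that concrete, a minimal sketch with the lmdb package (the key layout here, a first-letters query string mapped to an array of numeric line IDs, is just one guess at a compact scheme):

// Sketch only: uses the lmdb (lmdb-js) package; the key/value layout is a guess.
import { open } from 'lmdb'

// Map a first-letters query string to a compact list of numeric line IDs,
// rather than storing full result documents, to keep the store small.
const searchIndex = open<number[], string>({ path: './search-index' })

const indexQuery = (query: string, lineIds: number[]) =>
  searchIndex.put(query, lineIds)     // async write while building the index

const lookup = (query: string): number[] =>
  searchIndex.get(query) ?? []        // sync, memory-mapped read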

Harjot1Singh added this to the v5 milestone on Jan 31, 2021
@bhajneet (Member) commented Feb 1, 2021

Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone

Will we control which sources are included in this? Can the developer update/sync only a few sources/publications for a variable number of compositions? Can the user take only a few sources offline? Etc.

@Harjot1Singh (Member, Author) commented

Notes so far

  • Search indexes use a LOT of storage if you pregenerate the possible results for every query (about 86 MB just for SGGS, with only line IDs attached).
    • 2.5 ms to fetch 50 results from the search table in SQLite, versus about 33 ms with a LIKE query instead.
    • Compression would make repetition/denormalisation much more palatable.
  • A PouchDB is a single index. You need multiple to store different "views" (see the sketch after this list).
    • It's not amazingly fast (~40 ms) to fetch the list of results as IDs from the index and then request the full docs from another PouchDB.
    • It is very fast to fetch everything in one shot (~5 ms?).
  • Denormalising SQLite entirely to reduce joins speeds up fetches (6-15 ms) at the expense of storage (600 MB including transliterations + translations, 158 MB with translations, 20 MB for just Gurbani). Denormalising lines and joining ONLY translations and transliterations comes to 158 MB.
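
To illustrate the PouchDB note above, a sketch of the two approaches (database names and document shapes are hypothetical):

// Sketch only: database names and document shapes are hypothetical.
import PouchDB from 'pouchdb'

const searchIndex = new PouchDB<{ lineIds: string[] }>('search-index')
const lines = new PouchDB('lines')

// Two-step fetch (~40 ms above): IDs from the index, then full docs from another DB.
const searchTwoStep = async (query: string) => {
  const { lineIds } = await searchIndex.get(query)
  const { rows } = await lines.allDocs({ keys: lineIds, include_docs: true })
  return rows.flatMap((row) => (row.doc ? [ row.doc ] : []))
}

// One-shot fetch (~5 ms above): full results stored denormalised under the query key.
const resultsByQuery = new PouchDB<{ results: object[] }>('results-by-query')
const searchOneShot = async (query: string) =>
  (await resultsByQuery.get(query)).results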

@Harjot1Singh (Member, Author) commented Feb 6, 2021

Perhaps, aside from dropping the ORM, we only need to solve one issue: the smart sorting.

select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
limit 50

op: 12.902ms

vs

select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
order by CASE WHEN first_letters LIKE 'a' THEN 0 ELSE 1 END asc
limit 50

op: 297.052ms

The CASE WHEN sorting is pretty slow... Maybe giving 100-150 MB to a search index isn't such a bad idea...
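
For reference, roughly how these timings can be reproduced (assuming better-sqlite3 and the same schema; the database filename is a placeholder):

// Sketch only: better-sqlite3 and the filename are assumptions; the schema matches the queries above.
import Database from 'better-sqlite3'

const db = new Database('database.sqlite', { readonly: true })

const time = (label: string, sql: string, ...params: unknown[]) => {
  console.time(label)
  db.prepare(sql).all(...params)
  console.timeEnd(label)
}

const base = `
  select * from line_content
  inner join lines on lines.id = line_content.line_id
  inner join shabads on shabads.id = lines.shabad_id
  inner join writers on writers.id = shabads.writer_id
  inner join translations on translations.line_id = line_content.line_id
  inner join transliterations on transliterations.line_id = line_content.line_id
  where line_content.first_letters like ?
`

time('plain limit', `${base} limit 50`, '%a%')
time('case-when sort', `${base} order by case when first_letters like ? then 0 else 1 end asc limit 50`, '%a%', 'a')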

@Harjot1Singh (Member, Author) commented

With the generated search index:

select * from `search_only`
inner join `line_content` on `line_content`.`line_id` = `search_only`.`line_id`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `search_only`.`query` = 'hggm'
limit 50

op: 11.707ms
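
For context, a sketch of how such a search_only table could be pregenerated (the prefix-per-query scheme below is a guess at the approach, not the actual build step, and better-sqlite3 and the filename are assumptions):

// Sketch only: the prefix-per-query scheme is a guess; better-sqlite3 and the filename are assumptions.
import Database from 'better-sqlite3'

const db = new Database('database.sqlite')

db.exec(`
  create table if not exists search_only (query text not null, line_id text not null);
  create index if not exists search_only_query on search_only (query);
`)

const insert = db.prepare('insert into search_only (query, line_id) values (?, ?)')
const allLines = db
  .prepare('select line_id, first_letters from line_content')
  .all() as { line_id: string, first_letters: string }[]

// Store every first-letters prefix as its own query key, so a lookup is a
// plain equality match on an indexed column rather than a LIKE scan.
const buildIndex = db.transaction(() => {
  for (const { line_id, first_letters } of allLines) {
    for (let end = 1; end <= first_letters.length; end += 1) {
      insert.run(first_letters.slice(0, end), line_id)
    }
  }
})

buildIndex()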

@saihaj (Member) commented Feb 8, 2021

Maybe giving 100-150mb to a search index isn't such a bad idea

For offline support, if we plan to bundle this with @shabados/mobile, wouldn't that just make the app size even bigger? Is there a way we can split it into small chunks to keep it smaller?

@Harjot1Singh (Member, Author) commented

Indeed! Which is why Partial Sync support is planned for the next release!
