RFC: Offline DB/API revamp #1794

Open · Harjot1Singh opened this issue Jul 21, 2020 · 7 comments
Labels: Impacts: Most (affects a majority of end-users) · ϟ Type: Epic (a big goal; a collection of issues)

Comments
@Harjot1Singh (Member) commented Jul 21, 2020

Problem

The repository is built into a relational database, published as an SQLite output and as Docker containers for Postgres + MariaDB.

This is great for a huge number of use cases; however, the data is (imo) intrinsically document-based. That is not to say that relations do not exist, but that they're only useful for gleaning insights if you were querying the database blind, without understanding what it contains, which is often not the case.

In practice, this means that we end up transforming the data into a relational format, but to consume it in a useful manner, we have to join it back together at runtime. This seems a little pointless and is a source of slowness.

Proposition

I can see the flexibility of the relational database being useful in applications for data exploration and visualisation, so there is definitely a valid use case to keep supporting it. But the majority of the time, the database is used in a search-oriented fashion.

So, I propose that we provide a NoSQL/document-based format, which will facilitate high-speed querying offline, whilst maintaining our relational formats.

We'd want to expose this functionality through @shabados/database and expose any of our search features as wrappers, as we currently do. I would personally be happy to drop the SQLite dependency from the JS library entirely, if need be.
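
As a very rough sketch of the shape this could take (the field names, database name, and wrapper below are hypothetical, not a proposed schema), each line would be a self-contained document, so a fetch needs no joins. For example, using PouchDB purely for illustration:

// Sketch only: field names and the 'lines' database are illustrative, not the real schema.
import PouchDB from 'pouchdb'
import PouchDBFind from 'pouchdb-find'

PouchDB.plugin(PouchDBFind)

// One self-contained document per line, so a lookup needs no joins at runtime.
interface LineDoc {
  _id: string                               // existing line ID
  shabadId: string
  gurmukhi: string
  firstLetters: string
  writer: string
  translations: Record<string, string>      // keyed by translation source
  transliterations: Record<string, string>  // keyed by script
}

const lines = new PouchDB<LineDoc>('lines')

// A thin wrapper, in the spirit of the current @shabados/database helpers.
const getShabadLines = async (shabadId: string) => {
  await lines.createIndex({ index: { fields: [ 'shabadId' ] } })
  const { docs } = await lines.find({ selector: { shabadId } })
  return docs
}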

Other options

  • Denormalisation: I did consider producing a denormalised SQLite output. But I suspect it'd be far too large for offline storage on consumer devices.

Design Considerations

  • Doesn't require too much more space than our current DB (~150MB normalised)
  • Fixed, local dataset, so write speed/concurrency/scale issues do not matter
  • Full-text/fuzzy search is a bonus (but a full API should really deal with this)
  • Prioritise ease of use and speed when querying shabads, banis, and lines; anyone who'd like to use the data for other things can use the SQLite build to do so
  • Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere that streams DB updates to everyone (see the replication sketch after this list)
  • Useable on desktop and mobile
  • Local JS API publishable in a similar manner to the existing @shabados/database
  • Partial sync
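
For the differential-updates and partial-sync bullets, a minimal sketch of how this could work if the document store were PouchDB (the remote URL and the source field are assumptions, and the product choice is still open):

// Sketch only: the remote URL and 'source' field are assumptions; PouchDB is one candidate, not a decision.
import PouchDB from 'pouchdb'

const local = new PouchDB('lines')
const remote = new PouchDB('https://api.example.com/db/lines')

// Differential updates: replication only transfers documents changed since the last checkpoint.
// Partial sync: a selector limits what is replicated, e.g. to chosen sources.
local.replicate.from(remote, {
  live: true,
  retry: true,
  selector: { source: { $in: [ 'SGGS' ] } },
})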

Please post any additional considerations.

Solutions

Suggestions of suitable technologies and products would be appreciated.

Harjot1Singh added the Status: 3hard5me, Impacts: Most, and ϟ Type: Epic labels on Jul 21, 2020
@bhajneet (Member) commented

I'm a big fan of this. Have you also considered a supplementary LMDB for enhanced performance? Something specific to desktop/mobile, and perhaps not for other database users.

Check this out: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database. It's a key-value store, but with improvements that narrow down to the key faster. If we were to use it for all potential search queries it would be gross, we'd be talking gigabytes of storage; however, if we use it smartly (perhaps using serialised IDs/numbers for values, and incomplete lists), we may be able to squeeze out even more performance than a document-based DB would.
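
To make that concrete, a minimal sketch with the lmdb package (the key layout here, a first-letters query string mapped to an array of numeric line IDs, is just one guess at a compact scheme):

// Sketch only: uses the lmdb (lmdb-js) package; the key/value layout is a guess.
import { open } from 'lmdb'

// Map a first-letters query string to a compact list of numeric line IDs,
// rather than storing full result documents, to keep the store small.
const searchIndex = open<number[], string>({ path: './search-index' })

const indexQuery = (query: string, lineIds: number[]) =>
  searchIndex.put(query, lineIds)     // async write while building the index

const lookup = (query: string): number[] =>
  searchIndex.get(query) ?? []        // sync, memory-mapped read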

Harjot1Singh added this to the v5 milestone on Jan 31, 2021
@bhajneet (Member) commented Feb 1, 2021

Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone

Will we control which sources are included in this? Can the developer update/sync only a few sources/publications for a variable number of compositions? Can the user take only a few sources offline? Etc.

@Harjot1Singh (Member, Author) commented

Notes so far

  • Search indexes use a LOT of storage if you pregenerate the possible results for every query (about 86 MB just for SGGS, with only line IDs attached).
    • 2.5 ms to fetch 50 results from the search table in SQLite, versus about 33 ms with a LIKE query instead.
    • Compression would make repetition/denormalisation much more palatable.
  • A PouchDB is a single index. You need multiple to store different "views" (see the sketch after this list).
    • It's not amazingly fast (~40 ms) to fetch the list of results as IDs from the index and then request the full docs from another PouchDB.
    • It is very fast to fetch everything in one shot (~5 ms?).
  • Denormalising SQLite entirely to reduce joins speeds up fetches (6-15 ms) at the expense of storage (600 MB including transliterations + translations, 158 MB with translations, 20 MB for just Gurbani). Denormalising lines and joining ONLY translations and transliterations comes to 158 MB.
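
To illustrate the PouchDB note above, a sketch of the two approaches (database names and document shapes are hypothetical):

// Sketch only: database names and document shapes are hypothetical.
import PouchDB from 'pouchdb'

const searchIndex = new PouchDB<{ lineIds: string[] }>('search-index')
const lines = new PouchDB('lines')

// Two-step fetch (~40 ms above): IDs from the index, then full docs from another DB.
const searchTwoStep = async (query: string) => {
  const { lineIds } = await searchIndex.get(query)
  const { rows } = await lines.allDocs({ keys: lineIds, include_docs: true })
  return rows.flatMap((row) => (row.doc ? [ row.doc ] : []))
}

// One-shot fetch (~5 ms above): full results stored denormalised under the query key.
const resultsByQuery = new PouchDB<{ results: object[] }>('results-by-query')
const searchOneShot = async (query: string) =>
  (await resultsByQuery.get(query)).results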

@Harjot1Singh (Member, Author) commented Feb 6, 2021

Perhaps, aside from dropping the ORM, we only need to solve one issue: the smart sorting.

select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
limit 50

op: 12.902ms

vs

select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
order by CASE WHEN first_letters LIKE 'a' THEN 0 ELSE 1 END asc
limit 50

op: 297.052ms

The CASE WHEN sorting is pretty slow... Maybe giving 100-150 MB to a search index isn't such a bad idea...
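
For reference, roughly how these timings can be reproduced (assuming better-sqlite3 and the same schema; the database filename is a placeholder):

// Sketch only: better-sqlite3 and the filename are assumptions; the schema matches the queries above.
import Database from 'better-sqlite3'

const db = new Database('database.sqlite', { readonly: true })

const time = (label: string, sql: string, ...params: unknown[]) => {
  console.time(label)
  db.prepare(sql).all(...params)
  console.timeEnd(label)
}

const base = `
  select * from line_content
  inner join lines on lines.id = line_content.line_id
  inner join shabads on shabads.id = lines.shabad_id
  inner join writers on writers.id = shabads.writer_id
  inner join translations on translations.line_id = line_content.line_id
  inner join transliterations on transliterations.line_id = line_content.line_id
  where line_content.first_letters like ?
`

time('plain limit', `${base} limit 50`, '%a%')
time('case-when sort', `${base} order by case when first_letters like ? then 0 else 1 end asc limit 50`, '%a%', 'a')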

@Harjot1Singh (Member, Author) commented

With the generated search index:

select * from `search_only`
inner join `line_content` on `line_content`.`line_id` = `search_only`.`line_id`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `search_only`.`query` = 'hggm'
limit 50

op: 11.707ms
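
For context, a sketch of how such a search_only table could be pregenerated (the prefix-per-query scheme below is a guess at the approach, not the actual build step, and better-sqlite3 and the filename are assumptions):

// Sketch only: the prefix-per-query scheme is a guess; better-sqlite3 and the filename are assumptions.
import Database from 'better-sqlite3'

const db = new Database('database.sqlite')

db.exec(`
  create table if not exists search_only (query text not null, line_id text not null);
  create index if not exists search_only_query on search_only (query);
`)

const insert = db.prepare('insert into search_only (query, line_id) values (?, ?)')
const allLines = db
  .prepare('select line_id, first_letters from line_content')
  .all() as { line_id: string, first_letters: string }[]

// Store every first-letters prefix as its own query key, so a lookup is a
// plain equality match on an indexed column rather than a LIKE scan.
const buildIndex = db.transaction(() => {
  for (const { line_id, first_letters } of allLines) {
    for (let end = 1; end <= first_letters.length; end += 1) {
      insert.run(first_letters.slice(0, end), line_id)
    }
  }
})

buildIndex()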

@saihaj (Member) commented Feb 8, 2021

Maybe giving 100-150mb to a search index isn't such a bad idea

For offline support, if we plan to bundle this with @shabados/mobile, wouldn't that just make the app size even bigger? Is there a way we can split it into small chunks to keep it smaller?

@Harjot1Singh (Member, Author) commented

Indeed! Which is why Partial Sync support is planned for the next release!
