Blockchain Engineering - class of 2022 - Team Import Science #6783

synctext · 2022-02-22T10:38:37Z

Your task is to gather scientific publications and engineer machine reading of scientific knowledge in BrainDAO. Thousands of scientific articles are available with a Creative Commons copyright license in simple .PDF format. Get thousands of such files on each device and start processing. Use a light library for Natural Language Processing. Use the Bittorrent engine inside Superapp for efficient file sharing. Use an IPv8 community to gossip new content. What does this have to do with our "Blockchain Engineering" course? True, this is adding lots of data and processing on top of our blockchain-based BrainDAO. Reading: https://doi.org/10.3389/frma.2019.00002

13 years ago: material from Leonardo: https://bitbucket.org/ldalonzo/p2p-search-scientific-pubs/src/master/ . a thesis. A few lessons I learnt (COPIED):

Extracting text from PDFs is (was) not a trivial exercise. At that time I used https://linux.die.net/man/1/pdftohtml. These days there are much better options.
Parsing citations is a tricky exercise. I used this tool https://github.com/knmnyn/ParsCit. I saw they perfected it using deep learning.
I wrote code to manually build an inverted index to support full-text search. There's probably something off the shelf that can be reused and I could have better spent the time elsewhere.
I wrote code to manually cluster documents using Latent Semantic Analysis. Again, there's probably some library out there that does the same and I could have better spent the time performing measurements on how clustering works on very large collections.
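For readers unfamiliar with the inverted index mentioned in the lessons above, here is a minimal in-memory sketch. It is not the code from the thesis; the function names and the three sample documents are made up for illustration, and tokenisation is deliberately naive (lowercase alphabetic runs only).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample collection, standing in for extracted PDF text.
docs = {
    "a.pdf": "Methane production on Mars using in-situ resources",
    "b.pdf": "Peer review and the future of scientific publishing",
    "c.pdf": "Mars habitats and resource utilisation",
}
index = build_index(docs)
```

Calling `search(index, "mars methane")` intersects the posting sets of both tokens, so only documents containing both words are returned.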

Other related work is the MusicDAO: feel free to re-use all that code. First steps:

compiling the superapp from the source
select a library and try to get this Natural Language Lib compiling for Android
read the pointer on this ticket + read the IPv8 documentation https://py-ipv8.readthedocs.io/ + Trustchain https://trustchain.readthedocs.io/en/latest/trustchain.html.
(Manually) create a directory of .PDF files to parse. Creative Commons. At least 25 articles for next meeting. (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/)
Week 4 goal: automatically parse a .PDF file, extract good possible keywords for possible user search, build 1 index.
Week 6 goal: distributed search

Please keep it simple, this will all fail if you try to get something as ambitious as a knowledge graph operational on Android with a blockchain. Key points for grading: merged pull request on Superapp and architecture that works; performance and usability are secondary.

synctext · 2022-03-01T12:08:05Z

please self-organise efficiently and avoid overlap
- Just stick with pdf-to-html for now and get that working first, one week with 2 people?
- Spend 1 day to get the first dataset
  - get 10 example PDFs, manual pdf-to-html cmdline
  - get the citation parser going
  - Or is there no Kotlin/Java parser available? (male first names dataset)
  - Read: https://ceur-ws.org/Vol-2563/aics_25.pdf "GIANT is a large dataset with 991,411,100 XML labeled reference strings" If parsing of citations is unsolved, focus on Libtorrent sharing, relevant keyword ranking, IPv8 remote search, and easy injection of new PDFs.
- integrate pdf converter plus parser for this week sprint. No superapp yet?
All have the Superapp compiling
No parsing of .PDF yet on Kotlin
Found a URL, no collection of PDF files yet. Note Creative Commons. https://api.semanticscholar.org/corpus/download/
Sisko: found various libraries of .PDF conversion and citations parsing.
- https://github.com/knmnyn/ParsCit Requires Perl 🛑 ⛔
- https://github.com/WING-NUS/Neural-ParsCit (library from hell, "Python 2.7 (works in Python 3 but not fully supported), with Numpy, Theano and Gensim installed. scikit-learn is needed for model evaluation if you are training a new model.")
- https://github.com/allenai/s2search
- https://github.com/itext/itext7/graphs/contributors (library from hell, XML parser, bar codes, full blown SVG, etc.)
- https://github.com/topics/pdf-to-html (entry point)
- https://www.google.com/search?q=citation+parsing+site%3Agithub.com (google entry point)
Key focus: get the whole chain of parse pdf, extract keywords, local search, and distributed search. efficiency is secondary

sisko444 · 2022-03-07T11:49:25Z

This week's progress

We got a PDF parser with wrapper for compatibility with Apache license: https://github.com/TomRoush/PdfBox-Android
Visual: https://imgur.com/a/Y1GBZYN
MusicDao code was copied and used as a basis for a stub
Reading into the Android basics of project structure and fragments
Defined clearer tasks to be picked up
10 PDFs next to the src folder

Our plan for next week

First make a more proper and compiling stub (Sisko rush Monday)
Fragments for: catalog, reader, document, dialogue for upload, search, search result
The uploading and saving of PDF into trustchain
Maybe some NLP, keyword extraction or extra citation parsing endeavours
Make repo public: https://github.com/keonchennl/trustchain-superapp

Questions for the meeting

No images parsing
What kinds of keyword extraction / NLP will be next after the MVP?

To be altered during Johan meeting monday

synctext · 2022-03-07T12:29:06Z

please have some .APK, otherwise you're really behind schedule with this course in WK6.
select local .pdf file on Android. Parse PDF to HTML (now to text for MvP) (not yet done for 10 pdf dataset) Show top-10 words
- extract 10 most used keywords from this article (naive approach, no natural language parsing)
- normalise with average word frequency
- search for top-10 keywords in local files only
- everything in main memory, no sqlite details
~~https://github.com/knmnyn/ParsCit Requires Perl 🛑 ⛔~~ No cloud-free citation parser for Java, Android
integrate the above MvP inside Superapp
download from others (use Libtorrent seeding, share magnet links inside IPv8 overlay) (note new PR for MusicDAO which is build around magnet sharing)
search for articles using simple keyword matching (use local files and remote search example)
- https://github.com/rads/sqlite-okapi-bm25 (don't try full-text search, use top-N keywords)
external reader (open .PDF to read)
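The naive keyword step in the list above (count words, normalise by how common each word is in general, keep the top 10) could look roughly like the sketch below. The reference frequencies are made-up placeholder values, not taken from any real word list, and `DEFAULT_PER_MILLION` is an assumed fallback for words missing from the list. Everything stays in main memory.

```python
import re
from collections import Counter

# Hypothetical reference frequencies (occurrences per million words of
# general English). A real list would be loaded from a word-frequency file.
REFERENCE_PER_MILLION = {
    "the": 60000.0, "of": 30000.0, "and": 27000.0,
    "is": 10000.0, "on": 6500.0, "from": 4000.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed value for words not in the list


def top_keywords(text, n=10):
    """Return the n words that are most over-represented in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        # Frequency in this document, scaled to "per million words".
        doc_per_million = count / total * 1_000_000
        reference = REFERENCE_PER_MILLION.get(word, DEFAULT_PER_MILLION)
        scores[word] = doc_per_million / reference
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


sample = ("The methane is produced from the regolith of Mars "
          "and the methane is stored on Mars")
```

Dividing by the reference frequency pushes common function words such as "the" down the ranking even though they occur most often in the raw counts.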

sisko444 · 2022-03-07T13:25:27Z

We have switched from a bottom-up to a top-down approach: no stub. Instead, we will implement separate functionalities and later consolidate them into one app.

sisko444 · 2022-03-07T16:37:08Z

Keyword extraction

A word list of 20k words was found at: https://corpus.leeds.ac.uk/list.html
It is under a Creative Commons license.
The alternatives considered were a larger data set, a lemmatized data set and a paid data set.
The larger data set was clearly too large, as 5 MB in memory just for this purpose seems to be overkill: https://www.kaggle.com/rtatman/english-word-frequency
@synctext never mind, we asked a question but we solved it already.
It looks like a healthy amount of 60k word stems, together with a 0.7 MB lightweight Java stemming library, will yield the best outcome for this.

marko-matusovic-personal · 2022-03-17T08:59:26Z

Progress notes

Try our apk

https://github.com/keonchennl/trustchain-superapp/blob/db41c2887a37d458e055f1b538d3d9c552bf10da/app-debug.apk

Screenshots from the UI


Work done

board with issues https://github.com/keonchennl/trustchain-superapp/projects/2#card-78758433
created UI
backend for PDF parsing
backend for keyword extraction

Discussion

contact for the musicDAO developer
verify query forwarding idea

synctext · 2022-03-17T09:57:11Z

Solid progress in Week 6 (60% done of course, if nominal and linear) 🎊
Please have a well tested prototype APK for next sprint meeting
Few day task, 1 person responsible for getting more than 11 .PDF test cases
Scientific grounding: https://scholar.google.co.uk/scholar?q=relative+word+frequency+information+retrieval
About sharing with others feature.
- Include some magic: collect 1 new random .PDF from the network every 60 seconds (user-configurable)
- magnet link based
- Gossip, spread, and query: 15-year-old work by the Tribler lab
- no query broadcasting (not incentive compatible)
- collect, parse and also conduct a remote search query of direct neighbors
- Assume random strangers on the Internet can be trusted
- RemoteQuery example: from 1 phone to 10 neighbors LiteratureDAO Query:mars isru methane
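The remote-search rule in the list above (ask direct neighbours only, no query broadcasting) can be illustrated with a small in-memory simulation. This is a sketch under assumptions: the `Peer` class, its methods, and the sample documents are hypothetical, and no real IPv8 calls are made.

```python
class Peer:
    """In-memory stand-in for one phone; not an IPv8 community."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # title -> set of top keywords
        self.neighbours = []

    def local_search(self, terms):
        """Score local documents by the number of matching query terms."""
        hits = []
        for title, keywords in self.documents.items():
            score = len(terms & keywords)
            if score:
                hits.append((score, title, self.name))
        return hits

    def handle_query(self, terms):
        """Answer a remote query from local files only; never forward it."""
        return self.local_search(terms)

    def search(self, query, max_neighbours=10):
        """Search locally, then ask up to max_neighbours direct neighbours."""
        terms = set(query.lower().split())
        hits = self.local_search(terms)
        for peer in self.neighbours[:max_neighbours]:
            hits.extend(peer.handle_query(terms))
        # Best score first; title and peer name make the order deterministic.
        return sorted(hits, key=lambda hit: (-hit[0], hit[1], hit[2]))


alice = Peer("alice", {"Mars ISRU overview": {"mars", "isru", "propellant"}})
bob = Peer("bob", {"Methane synthesis": {"methane", "sabatier", "mars"}})
carol = Peer("carol", {"Methane on Titan": {"methane", "titan"}})
alice.neighbours = [bob]
bob.neighbours = [carol]
results = alice.search("mars isru methane")
```

Because `handle_query` never forwards, carol's matching document is not returned to alice: it is two hops away, which is the consequence of the no-broadcast rule.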
hopefully close the loop next sprint: parse, query, gossip
Remember, you need to have an accepted PR on the superapp as a requirement for this course. (wrap up 13,14 April?)
- do PR of finished parser only part?

synctext · 2022-03-22T12:47:46Z

PDFs Creative commons: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/d8/1d/ (many GBytes, many directories)

sisko444 · 2022-03-25T08:42:56Z

This week the query handler was written, which includes the document ranking methods. We now also have over 300 PDFs for testing. Besides that, we can now pass peer-to-peer messages. Below we can see the PDF rating.
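The thread does not say which ranking methods the query handler uses. As one possible illustration, the sketch below scores documents with Okapi BM25, the scheme behind the sqlite-okapi-bm25 link given earlier, applied to keyword lists rather than full text. The function name, the parameter defaults `k1=1.5` and `b=0.75`, and the sample data are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(docs)
    if n_docs == 0:
        return {}
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: long documents are penalised slightly.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical keyword lists for three parsed PDFs.
keyword_docs = {
    "a.pdf": ["mars", "methane", "mars"],
    "b.pdf": ["peer", "review"],
    "c.pdf": ["mars", "habitat"],
}
scores = bm25_scores(["mars", "methane"], keyword_docs)
```

Sorting the documents by descending score gives the ranking; documents that match no query term score zero.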

synctext · 2022-03-25T08:58:08Z

Not much visible progress
4 people know and understand IPv8, non-visible learnings
Concerned about the integration of 6 people and branches.
Very concerning, unable to install an APK which is compiled from sources on Android (only emulator).
Recommend doing weekly fixed meeting. The Wednesday morning slot should be open for everybody.
Assessment, before 15 April. Or move into week 4.1
Have a tested .APK for next meeting (otherwise you're behind schedule)

sisko444 · 2022-04-01T12:51:35Z

This week I implemented the parsing of PDFs as a coroutine to make it non-blocking. However, it still blocks (intuitively, I think that is because of the nature of the task, not the thread executing it). I also worked on storing and loading the metadata of PDF files for search queries, and made sure we can run the app on a physical mobile device.
Peter and Rahul worked on passing the search string from the GUI to the backend and on adding a PDF import button to the GUI. Firing the intent and getting the results back in the log is still a work in progress.
Keon has been working on seeding so that PDFs can be downloaded from peers. Keon and I settled on an architecture for handling queries: we think it's best to save the keywords locally and transmit only the results of a query after doing a local comparison.
Quinten worked on the UI layout and on asynchronous tasks. He and I will look at this together to make the PDF parsing non-blocking.
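A likely reason the parsing still blocks: wrapping blocking work in a coroutine does not by itself move it off the thread that runs the coroutine. In Kotlin the usual remedy is to run the work on a background dispatcher such as `Dispatchers.Default` or `Dispatchers.IO`. The sketch below shows the same idea with Python's asyncio as an analogy, not as the team's code; `parse_pdf_blocking` is a hypothetical stand-in that just sleeps.

```python
import asyncio
import time


def parse_pdf_blocking(path):
    """Stand-in for a slow, blocking PDF parse (hypothetical helper)."""
    time.sleep(0.2)
    return f"text of {path}"


async def ui_ticker(stop, ticks):
    """Simulates UI work that must keep running while parsing happens."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def parse_in_background(path):
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    ticks = []
    ticker = asyncio.create_task(ui_ticker(stop, ticks))
    # Calling parse_pdf_blocking(path) directly here would freeze the loop.
    # Handing it to a worker thread keeps the loop free for the ticker.
    text = await loop.run_in_executor(None, parse_pdf_blocking, path)
    stop.set()
    await ticker
    return text, len(ticks)


def demo():
    """Run the parse and report how often the 'UI' ticked meanwhile."""
    return asyncio.run(parse_in_background("example.pdf"))
```

The tick count returned by `demo()` is greater than zero only because the parse runs on a worker thread; with a direct call inside the coroutine the ticker would not run until parsing finished.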

Continuous scoring of locally parsed documents as a user types in the search bar:

The tested, working APK can be downloaded through WeTransfer; it is zipped: https://we.tl/t-ZnaUbho1X0

synctext · 2022-04-01T14:37:51Z

installs, but fails to run completely; crash of your app. 😲 (end of week 8, so getting nervous/tight)
No idea what the GUI will show????
Please get the .PDF parsing stable as background task. Standard dispatcher of Android.
Storage of .PDF files, local app context only app-restricted storage, and import new .PDF from any URL.
Feel free to copy this approach; their libtorrent and EVA protocol fallback; plus EVA protocol fix.
Just to spread the files: copy the above approach, gossip magnet link to neighbors, try downloading for 30 seconds or so, fallback to EVA protocol
By default download all .PDF files that you hear about, security must be ignored, that is for class of 2023.
Whole network thus gets to hear about all .PDF files eventually
Remote search: ask neighbors to check their local stored files
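The spreading strategy described above (hear a magnet link via gossip, try the torrent for about 30 seconds, fall back to EVA, download everything by default) can be sketched as follows. This is a simulation under assumptions: `fetch_pdf`, `Collector`, and both download callables are hypothetical and do not correspond to the libtorrent or EVA APIs in the Superapp.

```python
def fetch_pdf(magnet, torrent_download, eva_download, torrent_timeout=30.0):
    """Try the torrent path first; fall back to EVA when it yields nothing.

    Both callables are hypothetical: torrent_download(magnet, timeout)
    returns bytes, or None on timeout; eva_download(magnet) returns
    bytes, or None on failure.
    """
    data = torrent_download(magnet, torrent_timeout)
    if data is not None:
        return data, "torrent"
    data = eva_download(magnet)
    if data is not None:
        return data, "eva"
    return None, "failed"


class Collector:
    """Remembers every gossiped magnet link and downloads each one once."""

    def __init__(self, torrent_download, eva_download):
        self.torrent_download = torrent_download
        self.eva_download = eva_download
        self.known = set()
        self.files = {}  # magnet -> (data, source)

    def on_gossip(self, magnet):
        """Handle a gossiped magnet link; return True if it was new."""
        if magnet in self.known:
            return False
        self.known.add(magnet)
        data, source = fetch_pdf(magnet, self.torrent_download, self.eva_download)
        if data is not None:
            self.files[magnet] = (data, source)
        return True


# Fake transports for illustration: the torrent always times out.
def slow_torrent(magnet, timeout):
    return None


def fast_torrent(magnet, timeout):
    return b"%PDF-torrent"


def eva(magnet):
    return b"%PDF-eva"
```

A caller could use the `True` return value of `on_gossip` as the signal to pass the magnet link on to its own neighbours, which is how the whole network eventually hears about every file.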

sisko444 · 2022-04-22T08:21:16Z

PDF selection from internal storage is implemented.
The local file storage is now implemented to work with persistence.
The parsing of PDFs now mostly runs in a coroutine; the part that doesn't still can't, for unknown reasons. (It no longer stalls in the simulator, but shows about 10 seconds of black screen in a test on a phone.)
EVA was implemented. Once a file is parsed, its torrent is generated and broadcast to all peers every 10 seconds.
Every 20 seconds the client attempts to download a piece of literature from a received torrent. (This is not working optimally yet; a later commit made it work less well.)

@todo

The remote search back-end is finished, the front end is still in development.
The display of parsed documents and displaying of EVA operations is also still in development.
Make it so the entire PDF parsing is in a coroutine.
Repair whatever is hindering the performance of the downloading using the torrents.
The link to the APK: https://we.tl/t-rLd6GS7Pdt

synctext · 2022-04-22T08:46:33Z

app works !!
- blocking main thread upon parsing .PDF
- no showing yet of .PDF metadata to replace "Lorem ipsum".
lots of stuff happening in background and things are coming together {hopefully} soon.
"select file to freely share and torrent around the world", functionality of the 'select file' button
exact click on magnifying glass required to get inside keyword entry
Tip: replace 2 "Lorem ipsum" boxes with something useful. (progress bar when parsing) Small text "no files found yet. Please add some". Example scientific .PDF. Behaviour of MusicDAO: fills screens 2 seconds after start, 300 items after 20 seconds-ish.
Final course Pull Request: Expect to need 1+ week to get feedback, process feedback, {repeat} and get it polished.
Bonus: restrict to local files by the Superapp only. Import .PDF through typing a URL. No global files system read permission. Give user the choice between invasive permissions. They are hidden behind a button. "access files (warning requires broad permissions)".

sisko444 · 2022-05-10T09:17:38Z

The day of reckoning has come and we have to make our final pull request to the actual mother repo.
To tie the current state back to the previous feedback:

App still works! No more blocking the main thread and we show the .pdf files.
Things are very much coming more together.
The upload button now shows a large warning: THIS WILL BE DISTRIBUTED above it.
The whole search bar is now clickable.
This tip is not necessary, the boxes are populated with the actual PDFs.
Sadly, we were still working on it until now and there are still test cases to be solved, so there have been no pull requests yet.
We now ask for permissions when the app launches for the first time, and there is also the option to import PDFs through URLs.

Some gifs of the app functioning:

sisko444 · 2022-05-10T10:14:21Z

The APK file download: link
This expires in one week.

synctext · 2022-05-10T11:33:30Z

Much more polished:

remote download works (eva or magnet)
1 search box
extracts title of .PDF file
details of ongoing downloads
etc

keonchennl · 2022-05-24T10:58:32Z

Some changes that were merged into master involve an invalid library (info.blockchain.api 1.1.4), which breaks the master pipeline.
Tribler/trustchain-superapp#113 (comment)
Tribler/trustchain-superapp#113 (comment)

devos50 · 2022-07-15T07:43:03Z

This work has been completed, closing the issue 👍

synctext · 2022-10-25T14:05:29Z

LiteratureDAO Source code is here?? https://github.com/keonchennl/trustchain-superapp/tree/lit-dao/literaturedao
Related work:
Novel public review model, great idea. use pre-print services, public review process. No more rejects or accepts

Great dataset: 170,919 Creative Commons articles in the arXiv for biology

synctext · 2022-12-22T16:48:50Z

more related work: https://experimentalhistory.substack.com/p/the-rise-and-fall-of-peer-review
Shadow Libraries: Access to Knowledge in Global Higher Education {Balázs Bodó}
p2p Free Library: Help build humanity's free library on IPFS with Sci-Hub and Library Genesis
Humanity wins: our fight to unlock 32,544 COVID-19 articles for the world. This petition is dedicated to the victims of the outbreak and their families. We fought for every article for every scientist for you.

synctext added the type: MSc course work label Feb 22, 2022

synctext assigned marko-matusovic-personal, sisko444, AngeliPeter and keonchennl Feb 22, 2022

Tribler deleted a comment from keonchennl Mar 7, 2022

sisko444 mentioned this issue May 10, 2022: Literature Dao Tribler/trustchain-superapp#114 (Closed)

devos50 closed this as completed Jul 15, 2022

synctext mentioned this issue Sep 28, 2022: The Global Brain - the roadmap #7064 (Open)

synctext mentioned this issue Dec 6, 2022: master thesis placeholder - decentralised learning with security and unbounded scalability #7027 (Open)
