Blockchain Engineering - class of 2022 - Team Import Science #6783

Closed
6 tasks
synctext opened this issue Feb 22, 2022 · 21 comments

Comments

@synctext
Member

synctext commented Feb 22, 2022

Your task is to gather scientific publications and engineer machine reading of scientific knowledge in BrainDAO. Thousands of scientific articles are available under a Creative Commons license in plain .PDF format. Get thousands of such files on each device and start processing. Use a light library for Natural Language Processing. Use the Bittorrent engine inside Superapp for efficient file sharing. Use an IPv8 community to gossip new content. What does this have to do with our "Blockchain Engineering" course? True, this is adding lots of data and processing on top of our blockchain-based BrainDAO. Reading: https://doi.org/10.3389/frma.2019.00002

Material from Leonardo, 13 years ago (a thesis): https://bitbucket.org/ldalonzo/p2p-search-scientific-pubs/src/master/. A few lessons I learnt (COPIED):

  • Extracting text from PDFs is (was) not a trivial exercise. At that time I used https://linux.die.net/man/1/pdftohtml. These days there are much better options.
  • Parsing citations is a tricky exercise. I used this tool https://github.com/knmnyn/ParsCit. I saw they perfected it using deep learning.
  • I wrote code to manually build an inverted index to support full-text search. There's probably something off the shelf that can be reused and I could have better spent the time elsewhere.
  • I wrote code to manually cluster documents using Latent Semantic Analysis. Again, there's probably some library out there that does the same, and I could have better spent the time measuring how clustering works on very large collections.

Other related work is the MusicDAO: feel free to re-use all that code. First steps:

Please keep it simple; this will all fail if you try to get something as ambitious as a knowledge graph operational on Android with a blockchain. Key points for grading: a merged pull request on Superapp and an architecture that works; performance and usability are secondary.

@synctext
Member Author

synctext commented Mar 1, 2022

@sisko444

sisko444 commented Mar 7, 2022

This week's progress

Our plan for next week

  • First make a more proper and compiling stub (Sisko rush Monday)
  • Fragments for: catalog, reader, document, dialogue for upload, search, search result
  • The uploading and saving of PDF into trustchain
  • Maybe some NLP, keyword extraction or extra citation parsing endeavours
  • Make repo public: https://github.com/keonchennl/trustchain-superapp

Questions for the meeting

  • No images parsing
  • What kinds of keyword extraction / NLP will be next after the MVP

To be altered during Johan meeting monday

@Tribler Tribler deleted a comment from keonchennl Mar 7, 2022
@synctext
Member Author

synctext commented Mar 7, 2022

  • please have some .APK, otherwise you're really behind schedule with this course in WK6.
  • select local .pdf file on Android. Parse PDF to HTML (now to text for MvP) (not yet done for 10 pdf dataset) Show top-10 words
    • extract 10 most used keywords from this article (naive approach, no natural language parsing)
    • normalise with average word frequency
    • search for top-10 keywords in local files only
    • everything in main memory, no sqlite details
  • https://github.com/knmnyn/ParsCit Requires Perl 🛑 ⛔ No cloud-free citation parser for Java, Android
  • integrate the above MvP inside Superapp
  • download from others (use Libtorrent seeding, share magnet links inside IPv8 overlay) (note new PR for MusicDAO which is built around magnet sharing)
  • search for articles using simple keyword matching (use local files and remote search example)
  • external reader (open .PDF to read)
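
A minimal sketch of the naive keyword step from the list above (count words, normalise by average word frequency, keep the top 10). It is written in Python for brevity and is not the team's Android code. `top_keywords`, `background_freq`, and `default_freq` are hypothetical names, and the background frequencies are made-up illustration values, not taken from a real word list.

```python
import re
from collections import Counter


def top_keywords(text, background_freq, k=10, default_freq=1e-6):
    """Return the k words whose share of the document most exceeds
    their (assumed) average frequency in general English."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    # Relative frequency in the document divided by background frequency.
    # Words missing from the background list get a small default frequency,
    # so rare or unknown words score high.
    scores = {
        word: (count / total) / background_freq.get(word, default_freq)
        for word, count in counts.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Made-up background frequencies, for illustration only.
background = {
    "the": 0.05, "of": 0.03, "on": 0.01, "is": 0.01,
    "methane": 0.00001, "mars": 0.00002,
}
doc = "The production of methane on Mars is the topic: methane, methane on Mars."
keywords = top_keywords(doc, background, k=4)
```

Everything stays in main memory, as requested. Searching local files then reduces to comparing a query against each file's keyword list.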

@sisko444

sisko444 commented Mar 7, 2022

We have switched from a bottom-up to a top-down approach, meaning no stub; rather, we will implement separate functionalities and later consolidate them into one app.

@sisko444

sisko444 commented Mar 7, 2022

Keyword extraction

A word list of 20k words was found at https://corpus.leeds.ac.uk/list.html
under a Creative Commons license.
The alternatives considered were a larger dataset, a lemmatized dataset, and a paid dataset.
The larger dataset was clearly too large, as 5 MB in memory just for this purpose seems to be overkill: https://www.kaggle.com/rtatman/english-word-frequency
@synctext never mind, we asked a question but we solved it already.
It looks like a healthy amount of 60k word stems together with a 0.7 MB lightweight Java stemming library will yield the best outcome for this.
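
To illustrate why stems shrink the table, here is a small Python sketch that collapses a word-frequency list into stem frequencies. `toy_stem` is a deliberately crude, hypothetical stand-in for a real stemming library (it strips only a few suffixes), and the counts are invented for the example.

```python
from collections import defaultdict

# Toy suffix rules only; a real stemmer handles far more cases.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_frequencies(word_freq):
    """Sum the frequencies of all words that share a stem."""
    stems = defaultdict(float)
    for word, freq in word_freq.items():
        stems[toy_stem(word)] += freq
    return dict(stems)


# Invented counts, for illustration only.
word_counts = {
    "torrent": 5, "torrents": 3,
    "download": 4, "downloaded": 2, "downloading": 1,
}
stem_counts = stem_frequencies(word_counts)
```

Five word entries collapse into two stem entries here. Document words would be stemmed the same way before lookup.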

@marko-matusovic-personal

marko-matusovic-personal commented Mar 17, 2022

Progress notes

Try our apk

https://github.com/keonchennl/trustchain-superapp/blob/db41c2887a37d458e055f1b538d3d9c552bf10da/app-debug.apk

Screenshots from the UI

| | |

Work done

Discussion

  • contact for the musicDAO developer
  • verify query forwarding idea

@synctext
Member Author

synctext commented Mar 17, 2022

  • Solid progress in Week 6 (60% done of course, if nominal and linear) 🎊
  • Please have a well tested prototype APK for next sprint meeting
  • Few day task, 1 person responsible for getting more than 11 .PDF test cases
  • Scientific grounding: https://scholar.google.co.uk/scholar?q=relative+word+frequency+information+retrieval
  • About sharing with others feature.
    • Include a magic: collect 1 new random .PDF from the network per 60 second (user config)
    • magnet link based
    • Gossip, spread, and query: work done 15 years ago by the Tribler lab
    • no query broadcasting (not incentive compatible)
    • collect, parse and also conduct a remote search query of direct neighbors
    • Assume random strangers on the Internet can be trusted
    • RemoteQuery example: from 1 phone to 10 neighbors LiteratureDAO Query:mars isru methane
  • hopefully close the loop next sprint: parse, query, gossip
  • Remember, you need to have an accepted PR on the superapp as a requirement for this course. (wrap up 13,14 April?)
    • do PR of finished parser only part?
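
A rough Python sketch of the remote search idea in the list above: one phone asks up to 10 direct neighbors, each neighbor answers from its own local files, and nobody forwards the query. The `Peer` class and its methods are hypothetical and run in memory; the real transport would be IPv8 messages.

```python
class Peer:
    def __init__(self, name, local_index):
        self.name = name
        self.local_index = local_index  # {filename: set of keywords}
        self.neighbors = []

    def answer_query(self, terms):
        """Answer from local files only; the query is never forwarded."""
        hits = []
        for filename, keywords in self.local_index.items():
            matched = len(terms & keywords)
            if matched:
                hits.append((self.name, filename, matched))
        return hits

    def remote_search(self, query, max_neighbors=10):
        """Ask direct neighbors only and merge their answers."""
        terms = set(query.lower().split())
        results = []
        for neighbor in self.neighbors[:max_neighbors]:
            results.extend(neighbor.answer_query(terms))
        # Most matched terms first, then by peer and file name.
        return sorted(results, key=lambda hit: (-hit[2], hit[0], hit[1]))


phone = Peer("phone", {})
a = Peer("a", {"isru.pdf": {"mars", "isru", "methane", "propellant"}})
b = Peer("b", {"cats.pdf": {"cat"}, "rover.pdf": {"mars", "rover"}})
c = Peer("c", {"deep.pdf": {"methane"}})  # neighbor of a, not of phone
phone.neighbors = [a, b]
a.neighbors = [c]

hits = phone.remote_search("mars isru methane")
```

Peer `c` holds a matching file but is never asked, because `a` does not forward the query.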

@synctext
Member Author

synctext commented Mar 22, 2022

PDFs Creative commons: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/d8/1d/ (many GBytes, many directories)

@sisko444

This week the query handler was written, which includes the document ranking methods. We now also have over 300 PDFs for testing. Besides that, we can now pass peer-to-peer messages. Below we can see the PDF rating.
image
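
The thread does not say which ranking method the query handler uses. One simple possibility, sketched in Python under the assumption that each document stores a keyword-to-weight map, is to sum the weights of the query terms a document contains. `rank_documents` and the weights are hypothetical.

```python
def rank_documents(query, doc_keywords):
    """Score each document by the summed weight of matching query terms."""
    terms = query.lower().split()
    scored = []
    for name, weights in doc_keywords.items():
        score = sum(weights.get(term, 0.0) for term in terms)
        if score > 0:
            scored.append((name, score))
    # Best score first; ties broken by file name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented keyword weights, for illustration only.
docs = {
    "isru.pdf": {"mars": 2.0, "methane": 3.0},
    "rover.pdf": {"mars": 1.5},
    "cats.pdf": {"cat": 4.0},
}
ranking = rank_documents("mars methane", docs)
```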

@synctext
Member Author

synctext commented Mar 25, 2022

  • Not much visible progress
  • 4 people know and understand IPv8, non-visible learnings
  • Concerned about the integration of 6 people and branches.
  • Very concerning, unable to install an APK which is compiled from sources on Android (only emulator).
  • Recommend doing weekly fixed meeting. The Wednesday morning slot should be open for everybody.
  • Assessment, before 15 April. Or move into week 4.1
  • Have a tested .APK for next meeting (otherwise you're behind schedule)

@sisko444

sisko444 commented Apr 1, 2022

  • This week I implemented the parsing of PDFs as a coroutine to make it a non-blocking process. However, it still blocks (intuitively I think that is because of the nature of the task, not the thread executing it). I also worked on storing and loading the metadata of PDF files for search queries and made sure we can run the app on a physical mobile device.

  • Peter and Rahul worked on passing the search string from the GUI to the backend and implementing a PDF import button in the GUI. Throwing the intent and getting the results back in logging is still a work in progress.

  • Keon has been working on seeding for downloading the PDFs from peers. Keon and I settled on an architecture for handling queries. We think it's best to save the keywords locally and transmit only the results of a query after doing a local comparison.

  • Quinten worked on the UI layout and also on asynchronous tasking. He and I will look at this together to make the PDF parsing non-blocking.
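
On the coroutine that still blocks: wrapping blocking work in a coroutine does not by itself move it off the thread that runs the coroutine. The Python `asyncio` sketch below shows the same effect by analogy (it is not Android code). `parse_pdf_blocking` and `ui_heartbeat` are hypothetical stand-ins for the PDF parser and the UI loop.

```python
import asyncio
import time


def parse_pdf_blocking(path):
    """Stand-in for a slow, blocking PDF parse."""
    time.sleep(0.2)
    return f"text of {path}"


async def ui_heartbeat(ticks, interval=0.02):
    """Stand-in for the UI loop: records a tick whenever it gets to run."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def parse_with(offload):
    ticks = []
    heartbeat = asyncio.create_task(ui_heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat start
    if offload:
        # Runs the blocking call on a worker thread; the loop stays free.
        text = await asyncio.to_thread(parse_pdf_blocking, "paper.pdf")
    else:
        # Still inside a coroutine, but this call blocks the event loop.
        text = parse_pdf_blocking("paper.pdf")
    heartbeat.cancel()
    return text, len(ticks)


_, blocked_ticks = asyncio.run(parse_with(offload=False))
_, offloaded_ticks = asyncio.run(parse_with(offload=True))
```

The heartbeat barely runs in the blocked case and keeps ticking in the offloaded case. In Kotlin the usual equivalent is to run the parse inside `withContext(Dispatchers.Default)` or `Dispatchers.IO` instead of on the main dispatcher.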

Continuous scoring of locally parsed documents as a user types in the search bar:
image

@synctext
Member Author

synctext commented Apr 1, 2022

  • installs, but fails to run completely; crash of your app. 😲 (end of week 8, so getting nervous/tight)
  • No idea what the GUI will show????
  • Please get the .PDF parsing stable as background task. Standard dispatcher of Android.
  • Storage of .PDF files, local app context only app-restricted storage, and import new .PDF from any URL.
  • Feel free to copy this approach; their libtorrent and EVA protocol fallback; plus EVA protocol fix.
  • Just to spread the files: copy the above approach, gossip magnet link to neighbors, try downloading for 30 seconds or so, fallback to EVA protocol
  • By default download all .PDF files that you hear about, security must be ignored, that is for class of 2023.
  • Whole network thus gets to hear about all .PDF files eventually
  • Remote search: ask neighbors to check their local stored files
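
The "try the magnet link for 30 seconds, then fall back to EVA" step from the list above can be sketched as a timeout with a fallback. This is a Python illustration only: `fetch_with_fallback` is a hypothetical helper, and the two callables stand in for the real libtorrent and EVA transfers.

```python
import concurrent.futures
import time


def fetch_with_fallback(primary, fallback, timeout_s=30.0):
    """Try `primary` for up to `timeout_s` seconds, else use `fallback`.

    Returns a (transport_label, payload) pair.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return "torrent", future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and a failed primary transfer.
        return "eva", fallback()
    finally:
        # Do not wait for a primary attempt that is still running.
        pool.shutdown(wait=False)


def slow_torrent():
    time.sleep(0.3)  # pretend no seeder responds in time
    return b"via torrent"


# Short timeouts so the example runs quickly.
timed_out = fetch_with_fallback(slow_torrent, lambda: b"via eva", timeout_s=0.05)
in_time = fetch_with_fallback(lambda: b"via torrent", lambda: b"via eva", timeout_s=1.0)
```

Note that the abandoned primary attempt keeps running in its thread here. A real client would also cancel the torrent download.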

@sisko444

sisko444 commented Apr 22, 2022

  • PDF selection from internal storage is implemented.
  • The local file storage is now implemented to work with persistence.
  • The parsing of PDFs is now mostly run in a coroutine; the part that isn't still can't be, for unknown reasons. (It no longer stops in the simulator, and gives about 10 seconds of black screen in a phone test.)
  • EVA was implemented. Once a file is parsed, its torrent is generated and broadcast to all peers every 10 seconds.
  • Every 20 seconds the client attempts to download a piece of literature from a received torrent. (This is not working optimally yet; a later commit made it work worse.)
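
A toy Python simulation of the two timers described above: every 10 seconds each node announces the torrents it holds, and every 20 seconds each node downloads one torrent it has heard about. `GossipNode` and `simulate` are hypothetical, downloads always succeed instantly, and time is a simple counter rather than a real clock.

```python
import random


class GossipNode:
    def __init__(self, name, rng=None):
        self.name = name
        self.local = set()  # torrents for files this node holds
        self.heard = set()  # torrents announced by peers, not yet fetched
        self.peers = []
        self.rng = rng or random.Random(0)

    def add_parsed_file(self, torrent):
        self.local.add(torrent)

    def broadcast(self):
        """Announce every held torrent to all direct peers."""
        for peer in self.peers:
            peer.heard |= self.local - peer.local

    def try_download(self):
        """Fetch one random torrent that was heard about, if any."""
        if not self.heard:
            return None
        torrent = self.rng.choice(sorted(self.heard))
        self.heard.discard(torrent)
        self.local.add(torrent)  # pretend the download succeeded
        return torrent


def simulate(nodes, seconds, broadcast_every=10, download_every=20):
    for t in range(1, seconds + 1):
        if t % broadcast_every == 0:
            for node in nodes:
                node.broadcast()
        if t % download_every == 0:
            for node in nodes:
                node.try_download()


# A chain a - b - c: only a holds the file at the start.
a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
a.peers, b.peers, c.peers = [b], [a, c], [b]
a.add_parsed_file("x.torrent")
simulate([a, b, c], seconds=40)
```

In this chain, `b` fetches the file at second 20 and `c` at second 40, so the file reaches a node that never talked to `a` directly.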

@todo

  • The remote search back-end is finished, the front end is still in development.
  • The display of parsed documents and displaying of EVA operations is also still in development.
  • Make it so the entire PDF parsing is in a coroutine.
  • Repair whatever is hindering the performance of the downloading using the torrents.
    The link to the APK: https://we.tl/t-rLd6GS7Pdt

@synctext
Member Author

synctext commented Apr 22, 2022

  • app works !!
    • blocking main thread upon parsing .PDF
    • no showing yet of .PDF metadata to replace "Lorem ipsum".
  • lots of stuff happening in background and things are coming together {hopefully} soon.
  • "select file to freely share and torrent around the world", functionality of the 'select file' button
  • exact click on magnification glass required to get inside keyword entry
  • Tip: replace 2 "Lorem ipsum" boxes with something useful. (progress bar when parsing) Small text "no files found yet. Please add some". Example scientific .PDF. Behaviour of MusicDAO: fills screens 2 seconds after start, 300 items after 20 seconds-ish.
  • Final course Pull Request: Expect to need 1+ week to get feedback, process feedback, {repeat} and get it polished.
  • Bonus: restrict to local files by the Superapp only. Import .PDF through typing a URL. No global files system read permission. Give user the choice between invasive permissions. They are hidden behind a button. "access files (warning requires broad permissions)".

@sisko444

sisko444 commented May 10, 2022

The day of reckoning has come and we have to make our final pull request to the actual mother repo.
To tie the current state back to the previous feedback:

  • App still works! No more blocking the main thread and we show the .pdf files.
  • Things are very much coming more together.
  • The upload button now shows a large warning: THIS WILL BE DISTRIBUTED above it.
  • The whole search bar is now clickable.
  • This tip is not necessary, the boxes are populated with the actual PDFs.
  • Sadly, we were still working on it until now and there are still test cases to be solved; because of that, there have been no pull requests yet.
  • We now ask for permissions when the app launches for the first time, and there is also the option to import PDFs through URLs.

Some gifs of the app functioning:
local_upload
peers
search_in_keywords
url_upload

@sisko444

sisko444 commented May 10, 2022

The APK file download: link
This expires in one week.

@synctext
Member Author

A lot more polished:

  • remote download works (eva or magnet)
  • 1 search box
  • extracts title of .PDF file
  • details of ongoing downloads
  • etc

@keonchennl

Some changes that have been merged into master involve an invalid library (info.blockchain.api 1.1.4), which breaks the master pipeline.
Tribler/trustchain-superapp#113 (comment)
Tribler/trustchain-superapp#113 (comment)

@devos50
Contributor

devos50 commented Jul 15, 2022

This work has been completed, closing the issue 👍


6 participants