PhD Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586
Hmmm, very difficult choice.
Re-read papers on learn-to-rank and learned how to use IPv8. With it I created a simulation in which a number of nodes send messages to one another. From there I worked with Marcel and started implementing a system whereby one node sends a query to the swarm and receives content recommendations back from it. The progress is detailed in ticket 7290. There are 2 design choices: One issue discovered concerns the size of the IPv8 network packet, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment, and the maximum network packet size for IPv8 is 2.7MB, so we have to fit in as many weight updates per packet as possible (a sketch of one possible chunking approach follows below). You can see a demonstration of the prototype below. I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens if the recommendations of each node that received the query are entirely different). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch
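Since the serialized model is larger than a single IPv8 packet, one straightforward workaround is to chunk the serialized state dict into packet-sized pieces and reassemble them in order on the receiving peer. This is only a minimal sketch of that idea, not necessarily what Marcel is building; the usable payload budget and helper names are assumptions:

```python
import io
import torch

# Packet budget mentioned above; the exact usable payload size is an assumption.
MAX_PACKET_BYTES = int(2.7 * 1024 * 1024)

def serialize_weights(model: torch.nn.Module) -> bytes:
    """Serialize the model's state_dict with torch.save into raw bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()

def chunk_for_ipv8(blob: bytes, max_size: int = MAX_PACKET_BYTES) -> list[bytes]:
    """Split the serialized model into packet-sized chunks.

    Each chunk would be sent as its own IPv8 message, tagged with its
    index, and reassembled in order on the receiving peer.
    """
    return [blob[i:i + max_size] for i in range(0, len(blob), max_size)]
```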
It's beyond amazing what you accomplished in 6 weeks after starting your PhD. 🦄 🦄 🦄 Can we upgrade to transformers? That is the cardinal question for scientific output. We had distributed AI deployed in unusable form already in 2012 within our Tribler network. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed clicklog code in 2015 because it was not good enough. Options:
For a YouTube-alternative smartphone app we have a single simple network primitive:
Next sprint goal: get a performance graph!
After looking into what datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries and the relevant website link for each query. It is compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as showing only queries which have been searched a number of times and not showing a user ID and things like that). The data seems good, but the fact that we have links instead of document titles made it impossible to use the triplet model we have right now (where we need to calculate the 768-dimensional embedding of the document title: since we only have a link and no title, we cannot do that). So I was looking for another model architecture usable in our predicament and I found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query) we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, have the output layer of the network be composed of as many neurons as there are documents, and have the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating NanoGPT into the future architecture. I implemented an intermediate version of the network whereby the same encoder that Marcel used (the allenai/specter language model) encodes a query and the output is the probability for each document individually (a sketch of this output head follows below). The rest of the architecture is left unmodified. Moving forward, I'm looking to finally implement a good number of peers in a network that send each other the query and answer (from ORCAS) and get the model to train.
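For concreteness, a minimal sketch of such an output head, assuming a 768-dimensional specter query embedding as input; the hidden size and document count are illustrative, not the actual configuration:

```python
import torch
import torch.nn as nn

class DSIHead(nn.Module):
    """Map a 768-d query embedding (e.g. from allenai/specter) to logits
    over document IDs: one output neuron per document."""

    def __init__(self, embedding_dim: int = 768, num_documents: int = 10_000):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, 512),  # hidden size is illustrative
            nn.ReLU(),
            nn.Linear(512, num_documents),  # one logit per document ID
        )

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        # Softmax over these logits gives a probability per document;
        # training uses nn.CrossEntropyLoss against the target doc index.
        return self.classifier(query_embedding)
```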
Cool stuff 👍 Could you tell me more about your performance metrics? I have two questions:
This matters a lot for deployment in Tribler.
But keep in mind, this is extremely preliminary: I did not implement NanoGPT with this setup, so that's bound to increase computing requirements.
Paper idea to try out for 2 weeks:
Related-work example on GitHub of an LLM for search, called vimGPT: vimgpt.mov
I got the T5 LLM to generate the IDs of ORCAS documents.
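As an illustration of what that looks like in code (a hedged sketch with Hugging Face Transformers; the checkpoint, query, and doc-ID format are placeholders, not the actual setup):

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")  # checkpoint is a placeholder
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One ORCAS-style training pair: a query and its document ID as a plain string.
query, doc_id = "how do solar panels work", "D1234567"

inputs = tokenizer(query, return_tensors="pt")
labels = tokenizer(doc_id, return_tensors="pt").input_ids

# Standard seq2seq fine-tuning step: the model learns to emit the doc ID.
loss = model(**inputs, labels=labels).loss
loss.backward()

# At retrieval time the model generates a document ID token by token.
generated = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```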
I was looking for what to do moving forward and found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there's a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still wondering what to do next to make the work I've already performed publishable for the conference deadline on the 5th of January.
Update
In the past weeks I've introduced 10 users who send each other query-doc_id pairs. The mechanism implemented is the following (see also the sketch after the next paragraph):
For the future, I think using DAS6 to run a test with 100 peers may be worthwhile, to check the integrity of the model and its evolution as the number of peers increases.
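A minimal sketch of how such a pair-gossip community could look with py-ipv8 (class names, the community ID, and the training hook are illustrative assumptions, not the actual implementation):

```python
import os

from ipv8.community import Community
from ipv8.lazy_community import lazy_wrapper
from ipv8.messaging.lazy_payload import VariablePayload, vp_compile

@vp_compile
class QueryDocPair(VariablePayload):
    """A single ORCAS-style training example: a query and its doc ID."""
    msg_id = 1
    format_list = ["varlenH", "varlenH"]
    names = ["query", "doc_id"]

class PairGossipCommunity(Community):
    community_id = os.urandom(20)  # placeholder; a real community fixes a 20-byte ID

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.add_message_handler(QueryDocPair, self.on_pair)

    def share_pair(self, query: bytes, doc_id: bytes) -> None:
        # Send the training pair to every known peer in the community.
        for peer in self.get_peers():
            self.ez_send(peer, QueryDocPair(query, doc_id))

    @lazy_wrapper(QueryDocPair)
    def on_pair(self, peer, payload: QueryDocPair) -> None:
        # Feed the received pair into local training.
        self.train_on_pair(payload.query, payload.doc_id)

    def train_on_pair(self, query: bytes, doc_id: bytes) -> None:
        ...  # hypothetical hook: one local DSI training step on this pair
```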
AI with access to all human knowledge, art, and entertainment. AGI could help humanity by developing new drugs and treatments for diseases, and by turbocharging the global economy.
Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)
Public AI with associative democracy
Who owns AI? Who owns the Internet, Bitcoin, and BitTorrent? We applied public infrastructure principles to AI. We are building an AI ecosystem which is owned by both nobody and everybody. The result is a democratically self-governing association for AI. We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance. AI improvements are a social process! The way to create long-enduring communities is to slowly grow and evolve them. The first permissionless open-source machine learning infrastructure was Internet-deployed in 2012.
Solid progress! Operational decentralised machine learning 🚀 🚀 🚀 De-DSI for the win. A possible next step is enabling unbounded scalability and on-device LLMs. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis or the knowledge graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era.
Update: Nature paper 😲 It uses an LLM for parsing 1200 sentences and 1100 abstracts of scientific papers, avoiding the hard work of PDF knowledge extraction: Structured information extraction from scientific text with large language models
Poster for the De-DSI paper:
ToDo: determine PhD focus and scope
Phd Funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy
Duration: 1 Sep 2023 - 1 Sep 2027
First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version"; the long version is 236 papers 😄. Run Tribler from the sources.
Before doing fancy decentralised machine learning and learn-to-rank, first have stability, semantic search, and classical algorithms deployed. Current dev team focus: #3868
Update: sprint focus? Reading more Tribler articles and getting this code going again: https://github.com/devos50/decentralized-rules-prototype