
PhD Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586

synctext opened this issue Sep 4, 2023 · 17 comments
synctext (Member) commented Sep 4, 2023

ToDo: determine PhD focus and scope

PhD funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy
Duration: 1 Sep 2023 - 1 Sep 2027

First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version"; the long version is 236 papers 😄. Run Tribler from the sources.

Before doing fancy decentralised machine learning or learn-to-rank, first have stability, semantic search, and classical algorithms deployed. Current dev team focus: #3868

Update: sprint focus? Reading more Tribler articles and getting this code going again: https://github.com/devos50/decentralized-rules-prototype

pneague commented Oct 3, 2023

I have been working on understanding Martijn's work in ticket 42. I read through it and downloaded the attached code.

The last version of the code had a couple of functions not yet implemented, so I reverted to the 22-06-2022 version (instead of the last version, uploaded on 27-06-2022).

The 22-06-2022 version had a few outdated functions and small bugs here and there, but since they were minor I was able to fix them.

I downloaded the required dataset and successfully ran the parser and scenario_creating functions implemented by Martijn. After that I ran the experiment itself based on the above-mentioned scenario, which produced a couple of CSVs and graphs.

I understand the general idea of the experiments and how they work; however, the code still eludes me in places since it is sparsely commented.
Here's an example of the graph of an experiment run with Martijn's code so far:
[image: graph from the experiment run]

synctext (Member, Author) commented Oct 3, 2023

Hmmm, very difficult choice.
For publications we should focus on something like Web3AI: deploying decentralised artificial intelligence.

pneague commented Oct 18, 2023

I re-read papers on learn-to-rank and learned how to use IPv8. With it I created a simulation in which a number of nodes send messages to one another. From there I worked with Marcel and started implementing a system whereby one node sends a query to the swarm and then receives content recommendations back from it. The progress is detailed in ticket 7290.
The idea at the moment is that we implement a version of Mixture-of-Experts (https://arxiv.org/pdf/2002.04013.pdf) whereby one node sends the query to other nearby nodes and receives recommendations. These are then aggregated to create a shortened, sorted list of recommendations for the querying node.

There are two design choices:
We could gossip the (query, inferior document, superior document) triplets around, or (as we do at the moment) send the model updates around every run. We'll look deeper into these ideas.

One issue discovered concerns the size of the IPv8 network packet, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment, and the maximum network packet size for IPv8 is 2.7 MB, so we have to fit in as many weight updates as possible.
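A minimal sketch of how the serialized model could be split across packets, assuming a plain PyTorch state_dict serialization and the 2.7 MB budget mentioned above; the chunking and reassembly details are my assumptions, not the actual IPv8 transfer code:

import io
import torch

MAX_PACKET_BYTES = 2_700_000  # approximate IPv8 payload budget mentioned above

def model_to_chunks(model: torch.nn.Module) -> list[bytes]:
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)  # serialize all ~720k weights
    blob = buffer.getvalue()
    return [blob[i:i + MAX_PACKET_BYTES] for i in range(0, len(blob), MAX_PACKET_BYTES)]

def chunks_to_state_dict(chunks: list[bytes]) -> dict:
    return torch.load(io.BytesIO(b"".join(chunks)))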

You can see a demonstration of the prototype below:
[demo GIF of the prototype]

I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens if the recommendations of each node that received the query are entirely different). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch
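One possible way to merge disagreeing per-node result lists is reciprocal rank fusion; this is only an illustrative option, not what the branch above currently implements:

from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one ordered list of recommended item ids per responding node
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three nodes return partly overlapping recommendations for the same query.
merged = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "d"], ["e", "b", "a"]])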

synctext (Member, Author) commented Oct 18, 2023

It's beyond amazing what you accomplished in the 6 weeks since starting your PhD. 🦄 🦄 🦄
Is the lab now All-In on Distributed AI? 🎲

Can we upgrade to transformers? That is the cardinal question for scientific output. We had Distributed AI deployed in unusable form already in 2012 within our Tribler network. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed clicklog code in 2015 because it was not good enough. Options:

For a YouTube-alternative smartphone app we have a single simple network primitive:
(query, content-item-clicked, content-item-NOT-clicked, clicked-item-popularity, signature), and in TikTok form, without queries and with added viewing attention time: (content-item-long-attention, long-attention-time, content-item-low-attention, low-attention-time, long-attention-item-popularity, signature). Usable for content discovery, cold starts, content recommendation, and obviously semantic search.
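A minimal sketch of these two record types as data structures; the field names and types are illustrative, not a finalized wire format:

from dataclasses import dataclass

@dataclass
class SearchClickRecord:
    query: str
    content_item_clicked: str        # e.g. an infohash or content id
    content_item_not_clicked: str
    clicked_item_popularity: int
    signature: bytes                 # signed by the reporting peer

@dataclass
class AttentionRecord:               # TikTok-style form: no query, viewing attention instead
    content_item_long_attention: str
    long_attention_time: float       # seconds of viewing attention
    content_item_low_attention: str
    low_attention_time: float
    long_attention_item_popularity: int
    signature: bytes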

Next sprint goal: get a performance graph!
We need to get a paper on this soon, because the field is moving at lightning speed. So up and running before X-Mas, Tribler test deployment, and usage of NanoGPT in Jan, paper in Feb 🚀

pneague commented Nov 2, 2023

After looking into what datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries together with the website link relevant to each query. It is compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as only including queries that were searched a number of times and omitting user IDs).

The data seems good, but the fact that we have links instead of document titles makes it impossible to use the triplet model we have right now, which needs a 768-dimensional embedding of the document title: since we only have a link and no title, we cannot compute that embedding.

So I was looking for another model architecture usable in our predicament and I found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document that is the nearest neighbour of the query), we can use a differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, have the output layer of the network contain as many neurons as there are documents, and make the network assign a probability to each document given a query. Additionally, the paper does this with a Transformer architecture, raising the possibility of integrating NanoGPT into the future architecture.

I implemented an intermediate version of the network, where the same encoder Marcel used (the allenai/specter language model) encodes the query and the output is a probability for each document individually. The rest of the architecture is left unmodified:
from collections import OrderedDict
import torch.nn as nn

number_of_documents = 884  # one output neuron per document (884 in the small test)

layers = [
    ('lin1', nn.Linear(768, 256)),  # encoded query (768-dim allenai/specter embedding)
    ('relu1', nn.ReLU()),
    ('lin2', nn.Linear(256, 256)),
    ('relu2', nn.ReLU()),
    ('lin3', nn.Linear(256, 256)),
    ('relu3', nn.ReLU()),
    ('lin4', nn.Linear(256, number_of_documents)),  # output logits, one per document
]
model = nn.Sequential(OrderedDict(layers))  # softmax over the outputs gives per-document probabilities
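A hedged usage sketch of one search with this architecture, assuming the query is embedded with allenai/specter (as in Marcel's encoder) and `model` is the MLP defined above; the exact pooling and ranking details are my assumptions:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("allenai/specter")
enc = AutoModel.from_pretrained("allenai/specter")

def search(query: str, top_k: int = 5):
    with torch.no_grad():
        tokens = tok(query, return_tensors="pt", truncation=True)
        query_emb = enc(**tokens).last_hidden_state[:, 0, :]  # 768-dim [CLS] embedding
        probs = torch.softmax(model(query_emb), dim=-1)       # per-document probabilities
        return torch.topk(probs, k=top_k)                     # best-matching document indices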
In my preliminary tests so far, when we have 884 documents (i.e. 884 output neurons) we can perform 50 searches in 4 seconds (about 0.08 seconds per search). When we have 1,066,561 documents, 50 searches complete in 200 seconds (4 seconds per search). Under some circumstances this may be acceptable for Tribler users, but people with older computers might experience significant difficulties. I will need to look at ways of reducing the computation time required.

Moving forward, I'm looking to implement a network with a good number of peers that send each other queries and answers (from ORCAS) and get the model to train.

qstokkink (Contributor) commented
Cool stuff 👍 Could you tell me more about your performance metrics? I have two questions:

  1. Are these SIMD results (i.e., one batch of 50 searches takes 200 seconds, but a batch with 1 search also takes 200 seconds)?
  2. What hardware did you use (e.g., CPU, some crappy laptop GPU, an HPC node with 10 Tesla V100s, ...)?

This matters a lot for deployment in Tribler.

pneague commented Nov 3, 2023

  1. They are not SIMD. One search actually takes 1/50th of the mentioned time.
  2. I used a Mac laptop with an M2 Pro chip.

But keep in mind that this is extremely preliminary; I did not implement NanoGPT with this setup yet, so that is bound to increase the computing requirements.

synctext (Member, Author) commented Nov 8, 2023

Paper idea to try out for 2 weeks:

An LLM-for-search related-work example on GitHub, called vimGPT:

vimgpt.mov

pneague commented Nov 22, 2023

I got the T5 LLM to generate the IDs of ORCAS documents.
Current setup (a minimal sketch of the training loop follows the list below):

  • From the entire dataset, I took 100 documents which each have around 600 queries associated with them, yielding around 60k query-document pairs. No query-document pair appears more than once.
  • I split the dataset into train/test sets with a 50% split factor.
  • Two agents read the same data from disk, initially the train set.
  • They send each other, sequentially, every row of the data (which at this point looks like [query, doc_id]).
  • They train on the messages received but not on the ones sent (as they both have the same data, this avoids training on the same data twice).
  • The model predicts the doc_id given a query.
  • After the whole train set has been iterated through, I count this as an epoch and iterate through it all over again. I count the number of times the doc_id was correctly guessed by the model; this is how I calculate accuracy.
  • After each 'epoch', if accuracy on the train set reaches >= 90%, I save the model and tokenizer.
  • Training took about 12 hours.
  • Then I calculate accuracy on the test set using the same method (but without training on the new data).
  • This way, accuracy on the test set was found to be 93%, showing that the model has a high potential to generalise.
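A minimal sketch of the training loop described above, assuming T5-small via HuggingFace transformers and doc_ids treated as plain text generation targets; the hyperparameters are illustrative:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(query: str, doc_id: str) -> float:
    # Encode the query as input and the doc_id string as the generation target.
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    labels = tokenizer(doc_id, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def predict(query: str) -> str:
    # Greedy generation of the doc_id; accuracy = fraction of exact matches.
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_new_tokens=8)
    return tokenizer.decode(ids[0], skip_special_tokens=True)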

I was looking for what to do moving forward.

I found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there is a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still deciding what to do next to make the work I've already done publishable for the conference deadline on the 5th of January.

synctext (Member, Author) commented Nov 22, 2023

Update
Please already think a bit about the next step / article idea for the upcoming summer 🌞 🍹. Can you think of something where users donate their GPU to Tribler and get a boost in their MeritRank as a reward 🥇, plus the Marcel angle of "active learning" by donating perfect metadata? Obviously we need the ClickLog deployment and crawling deployed first.

pneague commented Dec 12, 2023

In the past weeks I've managed to set up 10 peers that send each other query-doc_id pairs.

The mechanism implemented is the following (a simplified sketch of the batching logic follows the list):

  • 100 documents per available peer are selected from the entire ORCAS dataset to act as the experiment dataset;
  • this dataset is split into train/test sets, keeping the ratio per document equal (so if there are 20 queries for a document, 10 go in the train set and 10 in the test set). Documents with only one associated query are excluded, since they would appear in only the train or only the test set. The test set is excluded from training; only data from the train set is sampled during training;
  • from the documents available, each peer samples a random number of documents (between 80 and 120) to act as the peer's own dataset. Peers may sample documents that have already been sampled by somebody else. In total, for the experiment with 10 peers, 661 out of 1000 documents (100 docs per peer × 10 peers) were sampled by at least one peer;
  • each peer initialises its own T5 model (small version) and sets it to train mode;
  • training is performed in batches of 32. Each peer keeps a list of queries and a parallel list of doc_ids for the current batch; when the lists reach 32 items, the peer trains its model on that batch and then resets them;
  • every 0.1 seconds, each peer selects a random query-doc_id pair from its own dataset and sends it to another random peer, but does not append it to its own current batch. This is done so that training is not dominated by a peer's own data. Each peer only appends its own data (32 / nbr_of_peers_currently_identified items) to its batch list when the batch is empty. This way the data fed into each peer's model has roughly equal probability of coming from any peer in the network, including the current peer;
  • I've run experiments with 2, 10, and 32 peers so far. The experiments with 2 and 10 peers performed well. For the 10-peer case, training finished within 6 hours and all peers reached an accuracy of 99-100% on the train set and 90-91% on the test set (for the 661 sampled documents out of 1000). The experiment with 32 peers ran out of RAM (as each peer holds its own model) and started behaving erratically, so I don't think we can trust those results. I've talked with Sandip and got an account on DAS6, as I don't think we can scale the experiments further without a training server. I'll be working on understanding how to use it.
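A simplified sketch of the per-peer batching logic described above; the class, the `send`/`train_on_batch` helpers, and the scheduling are illustrative placeholders rather than the actual IPv8 community code:

import random

BATCH_SIZE = 32

class Peer:
    def __init__(self, own_dataset, model):
        self.own_dataset = own_dataset            # list of (query, doc_id) pairs
        self.model = model                        # per-peer T5-small wrapper
        self.batch = []                           # current training batch
        self.known_peers = []

    def on_receive(self, query, doc_id):
        # Pairs received from other peers always go into the current batch.
        self.batch.append((query, doc_id))
        if len(self.batch) >= BATCH_SIZE:
            self.model.train_on_batch(self.batch)  # placeholder for the T5 training step
            self.batch = []

    def tick(self):
        # Called every 0.1 s: gossip one of our own pairs to a random peer ...
        self.send(random.choice(self.known_peers), random.choice(self.own_dataset))
        # ... and only seed the batch with our own data when it is empty, so roughly
        # BATCH_SIZE / (number of known peers) items per batch come from ourselves.
        if not self.batch and self.known_peers:
            own_share = max(1, BATCH_SIZE // len(self.known_peers))
            self.batch.extend(random.sample(self.own_dataset, own_share))

    def send(self, peer, pair):
        peer.on_receive(*pair)                     # stands in for an IPv8 message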

For the future, I think using DAS6 to perform a test with 100 peers may be worthwhile, to check the integrity of the model and its evolution as the number of peers increases.

synctext (Member, Author) commented Dec 12, 2023

AI with access to all human knowledge, art, and entertainment.

AGI could help humanity by developing new drugs and treatments for diseases, and by turbocharging the global economy.
Who would own this AGI? Our dream is to contribute to this goal by pioneering a new ownership model for AI and a novel model for training. AI should be public and contribute to the common good. More than just open weights: full democratic self-governance. An open problem is how to govern such a project and devise a single roadmap amid conflicting expert opinions. Current transformer-based AI has significant knowledge gaps and needs thousands or even millions of people to tune it. It needs the Wikipedia paradigm! Gemini example: "what is the most popular YouTube video?". The state-of-the-art AI fails to understand the concept of media popularity, front-page coverage, and the modern attention economy in general.

  • It all starts with Learn-to-Rank in full decentral setting {current ongoing work}
  • Unlock swarm-based data
  • Continuous learning at next level: eternal learning
  • Get a few thousand people to contribute (e.g. like Linux, Wikipedia, BitTorrent, Bitcoin, etc.)

Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)

synctext (Member, Author) commented Jan 29, 2024

Public AI with associative democracy

Who owns AI? Who owns the Internet, Bitcoin, and BitTorrent? We apply public infrastructure principles to AI. We are building an AI ecosystem that is owned by both nobody and everybody. The result is a democratically self-governing association for AI.

We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance.
Numerous proposals have been made for making AI safe, democratic, and public. Yet these proposals are often grounded exclusively in either philosophy or technology. The technological experts who build databases, operating systems, and clouds rarely interact with the experts who deeply understand the question of who has control. Democracy is still a contested concept after centuries. Self-governance is the topic of active research, both in the world of atoms and the world of bits. Complex collective infrastructure with self-governance is an emerging scientific field. Companies such as OpenAI run on selling their AI dream to ageing companies such as Microsoft. There is great need for market competition and a fine-grained supply chain; the lack of fine-grained competition in the supply-chain ecosystem is hampering progress. Real-world performance results show that the model architecture is not all that important: it can be classical transformers, Mamba, SSM, or RWKV. The training set dominates the AI effectiveness equation. Each iteration brings more small improvements to the whole ecosystem, all based on human intelligence. Collective engineering on collective infrastructure is the key building block towards creating intelligence superior to the human intellect.

AI improvements are a social process! The way to create long-enduring communities is to grow and evolve them slowly. The first permissionless open-source machine learning infrastructure was Internet-deployed in 2012.
However, such self-ruled communities only play a minor role in the AI ecosystem today. The dominant AI architecture is fundamentally unfair. AI is expensive and requires huge investments: an exclusive game for the global tech elite. Elon Musk compared the ongoing AI race to a game of poker, with table stakes of a few billion dollars a year. Such steep training costs and limited access to GPUs cause Big Tech to dominate this field. These hurdles notably affect small firms and research bodies, constraining their progress. Our architecture splits the ecosystem by creating isolated, competitive markets for GPU renting and training-set storage. Our novel training model brings significant synergy, similar to the Linux and Wikipedia efforts. By splitting the architecture and having fine-grained competition between efforts, the total system efficiency is significantly boosted. It enables independent evolution of dataset gathering, data storage, GPU rental, and AI models.
Our third pioneering element is democratic access to GPU hardware. One branch of distributed machine learning studies egalitarian architectures, where even a tiny smartphone can be used to contribute to the collective. A billion smartphones could, in theory, significantly outsmart expensive hardware. Wikipedia and Linux have proven that you can't compete with free. We have mastered the distributed, permissionless, and egalitarian aspects of AI. The next stage of evolution is to add democratic decision-making processes. A team of 60 master students is currently attempting to engineer this world-first innovation collectively.
Another huge evolutionary leap is AI with access to all human knowledge, art, and entertainment. Currently, datasets and training hardware are expensive to gather and store. For instance, the open-access movement for scientific knowledge has not yet succeeded in creating a single repository. The training of next-generation AI requires completion of this task. All creative-commons content (text, audio, video, DNA, robotics, 3D) should be collected in an expanding living dataset, similar to the SuperGLUE set of datasets. The cardinal problem is building trust in the data, its accuracy, and its legal status. In prior work we pioneered a collective data vault based on passport-grade digital identity.

pneague commented Jan 30, 2024

In the last few weeks I have run experiments with ensembles of peers. Experiments with more than 10 peers make the laptop run out of RAM and behave erratically, so I had to change the direction of my work.
The current hypothesis is that T5-small cannot fit that many doc_ids inside its weights (because it is so small), but we need it to be small so it can run on Tribler users' computers.

So in order to increase the number of retrievable documents, I thought of sharding the dataset, with each shard having its own peers. In the experiments performed, each shard consists of 10 peers.

  • Each experiment ran successfully, with each peer achieving good results on its shard's test set (as described in a previous entry here).
  • Each shard was trained in an independent run so that my laptop wouldn't run out of RAM.
  • Each shard had different doc_ids from the other shards.
  • I used 5000 documents per shard and let each peer sample a random number of documents between 200 and 300 (as in the previous entry).
  • Documents not chosen by any peer were discarded.
  • After successfully training all models on their respective shards, I experimented with ensembles to aggregate the results of multiple shards. Initially the idea was that the system would pick a random number of models, across all shards, and each picked model would vote on a document_id given a query. But this relied on chance to pick models belonging to the right shard for each tested query. Marcel came up with the idea that we could in principle gossip the shard number of each peer, and then we would know to ask models from each shard for a given query.
  • The underlying intuition is that models trained on the right data pick the correct document (each has a top-1 accuracy of about 90%), while models not trained on the right data output either random documents, different from one model to the next, or hallucinated doc_ids, also different from one model to the next. So when we see two models voting for the same doc_id, we know they were trained on data matching the query in question.
  • Another ensemble idea is to get the top-5 results for a query with beam search, together with their model scores for those 5 beams. We then take the softmax of the 5 scores so we know the confidence the model has in each of them. Then, instead of counting the number of times a result was suggested by a model, we sum the confidences of each model for each result (see the sketch after the diagram below).
  • At the moment I'm still running some experiments, but here are the accuracy results for each shard:
    [image: Accs_by_shard_and_beam]
    The image above depicts the test-set accuracy of each peer, grouped by the shard it belongs to. Blue is top-1 accuracy and red is top-5 accuracy (obtained with beam search).

[diagram: Model Ensemble from different shards]
This diagram shows how a 2-shard ensemble would work with the voting and confidence mechanisms (in the previous iteration, where the models were chosen randomly, without controlling how many models come from each shard).
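A hedged sketch of the two aggregation ideas (one-vote-per-model versus summing softmax confidences over the top-5 beams); `top_k` is an assumed helper that returns (doc_id, beam score) pairs from one shard model, not an existing API:

from collections import defaultdict
import torch

def aggregate(models, query, num_beams=5, use_confidence=True):
    totals = defaultdict(float)
    for model in models:
        beams = model.top_k(query, k=num_beams)         # assumed: [(doc_id, beam_score), ...]
        if use_confidence:
            scores = torch.tensor([score for _, score in beams])
            confidences = torch.softmax(scores, dim=0)  # per-model confidence over its beams
            for (doc_id, _), conf in zip(beams, confidences):
                totals[doc_id] += conf.item()
        else:
            top_doc_id, _ = beams[0]
            totals[top_doc_id] += 1.0                   # simple one-model-one-vote
    return max(totals, key=totals.get)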

synctext (Member, Author) commented Feb 29, 2024

Solid progress! Operational decentralised machine learning 🚀 🚀 🚀 De-DSI for the win.

A possible next step is enabling unbounded scalability and on-device LLMs. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis, or the knowledge graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era.

Update: Nature paper 😲 uses an LLM for parsing 1200 sentences and 1100 abstracts of scientific papers, avoiding the hard work of PDF knowledge extraction: Structured information extraction from scientific text with large language models. This work outputs entities and their relationships as JSON documents or other hierarchical structures.

pneague commented Mar 26, 2024

Fresh results from DAS6 for magnet link prediction:
1000 docs - 90.5%
5000 docs - 77%
10000 docs - 65%

See the comparison between predicting doc_ids vs. magnet links:
[image: accuracy comparison, doc_ids vs. magnet links]

When the dataset is relatively small, the accuracies are the same for both top-1 and top-5. As more data appears in the dataset, we can see a divergence between the two metrics. We hypothesise that the limited number of weights in our model efficiently captures URL patterns in scenarios with sparse data. However, as the data complexity increases, this constraint appears to hinder the model's ability to accurately recall the exact sequence of tokens in each URL. This is merely a guess, and we intend to investigate it further in future work. Still, the observed discrepancy in accuracy levels remains marginal, amounting to merely a few percentage points across a corpus of 10K documents.

pneague commented Apr 10, 2024

Poster for the De-DSI paper:
De-DSI Poster.pdf
