phd placeholder: IPFS crawl experience, metadata, and semantic gap awareness #7290

synctext · 2023-02-10T09:47:19Z

< Placeholder >
timeline: April 2023 - April 2027.
ToDo: 6 weeks hands-on Python onboarding project. Learn a lot and plan to throw it away. Next step is then to start working on your first scientific article. You need 4 thesis chapters and you then completed your phd.

One idea: towards trustworthy and perfect metadata for search (e.g. perfect memory retrieval for the global brain #7064 ).
Another idea: Gradient decent model takes any keyword search query as input. Output is a limited set of vectors. Only valid content is recommended. Learning goal is to provide semantic matching between input query and output vector.
General background and Spotity linkage possible dataset sharing

Max. usage of expertise: product/market fit thinking

Background reading:

latest unsupervised learn-to-rank: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network
First decentralised learning called gossip learning in 2011 (!!!) Gossip Learning with Linear Models on Fully Distributed Data by Róbert Ormándi, István Hegedüs, Márk Jelasity.
Meritrank: Sybil tolerant reputation for merit-based tokenomics
Mitigating sybils in federated learning poisoning (FoolsGold algorithm Thnx @ThomasWerthenbach!)
Cardinal problem for search is accurate metadata (tags, description, popularity, spam-making)
- Without a central editor we need collective decisions, crowdsourcing, or even digital parliament to decide on each item what the perfect naming would be.
- 346 articles on crowdsourcing in a survey
- Smartocracy: Social Networks for Collective Decision Making
- 99-pages on digital democracy No scientific results...
- Towards a User-aware Enrichment of Multimedia Metadata
- CPSFS: A Credible Personalized Spam Filtering Scheme by Crowdsourcing
- Item Popularity and Recommendation Accuracy
- collective intelligence
1965 foundations of ultra intelligent machines
Robotic block stacking, Atari gaming play, text writing plus lots of other tasks by Google made: A Generalist Agent
You need to consistently be critical about the hype and Next Big Thing. Because years later you have a small item popping up which completely kills everything. Current terms: how useful are transformers, huge language models, and foundational model for normal Outlook and MSN users 😲 Normal people just want their privacy back.
https://github.com/tatsu-lab/stanford_alpaca 🎉 GPT-3 clone. Literally runs on a raspberry pi (very slowly). GPT models are incredible but the future is somehow even more amazing than that
Pytorch tips
pointwise approach, broadly speaking, each historic impression with a click is a positive training example, and each impression without a click is a negative training example. https://towardsdatascience.com/learning-to-rank-a-primer-40d2ff9960af
sequence learning, We will implement a character-level sequence-to-sequence model, processing the input character-by-character and generating the output character-by-character. Another option would be a word-level model, which tends to be more common for machine translation., https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
https://developer.nvidia.com/blog/deep-learning-nutshell-sequence-learning/
UPDATE: Vector databases and knowledge graph ideas...
- https://www.alexandria.unisg.ch/258982/1/mso202002.issue.pdf#page=91 "Bringing Semantic Knowledge Graph Technology to Your Data"
- https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696
- https://www.latent.space/p/agents
- https://observablehq.com/@asg017/introducing-sqlite-vss

Venues:

Knowledge Graphs, Semantics, Social and Adaptive Web
IEEE Transactions on Computational Social Systems
note no prior msc courses on machine learning. We are a systems lab and might know how to apply machine learning in permissionless, byzantine, unsupervised, decentralised, adversarial, continuous learning context.

Possible scientific storyline: SearchZero a decentralised, self-supervised search engine with continuous learning

The text was updated successfully, but these errors were encountered:

mg98 · 2023-04-24T09:27:51Z

Ideas...

"Perfect search" as an instance of unsupervised online learning-to-rank (OL2R) (using deep neural networks). The ML model should use user signals (i.e., clicks/downloads) as a measure to learn query-document relevancy and accordingly adapt its weights.
- I think the employment of NN makes sense as we have very limited metadata/understanding about each torrent, much unlike is the case with web pages
- State-of-the-art for OL2R: Duel Bandit Gradient Descent (DBGD) (listwise approach), (placeholder)
- Potential research lines:
  - Analysis of pointwise, pairwise, & listwise OL2R approaches for their application on distributed or decentralized systems; How do the tradeoffs added by listwise approaches scale with the constraints of robustness and efficiency in a dist./dec. network? Or more specific: how would DBGD work out in this setting?
  - Alternatively, ...
    - Decentralized AI-powered information retrieval using something as simple as, e.g., BM25 ranking (that is, without NN)
    - Decentralized embeddings for semantic search

synctext · 2023-04-26T09:25:06Z

Brainstorm: We present a decentralised search engine, based on deep learning of URLs, URIs, magnet links, and IPFS links. Deep learning has been successfully used in the past to identify trusted and malicious websites. We go beyond this prior work and present an experimental search engine based on fully decentralised unsupervised learning. Our fuzzy search algorithm is based on unsupervised online learning-to-rank (OL2R).
Left for future work: beyond static item list (no content discovery, currently unable to add new content items), reputation & trust for spam filtering (also required for trustworthy content discovery).

Literature, Dataset and code:

Fuzzy search of: Arts and Science (music, creative commons videos, and .PDF files)
(2007) Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations in Web Search
(first simple SGD!?) Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs
1.7 million scientific article with metadata, keywords, and their URL
https://github.com/sdmhans/arxiv_dataset_extraction
Code for the paper URLNet - Learning a URL Representation with Deep Learning for Malicious URL Detection

mg98 · 2023-05-02T13:58:24Z

Spent the last week doing basic courses on neural networks again. Trying to get a linear regression model running to predict a sine function (using SGD). As basic as this is, implementing it is not as easy as I would have expected 🥲
EDIT: Probably because a sine function is not exactly linear 🤦🏼‍♂️

I will continue to try to get it to work. I need to learn it at some point, I think. However, I'm inclined to first-publication ideas that do not directly employ NN/learning, as kind of a soft start.

Talked to @kozlovsky again about semantic search based on simple embeddings. We could use the crowdsourced information and metadata to compute embeddings for every torrent and build a distributed (dec.) search algorithm based on vector distance to have a YouTube-like search experience.
I don't know how far the literature is with that but I feel like there are some interesting avenues to explore:

The language model used for vectorization: How to distribute it? Or should it come shipped with the software? How to ensure correctness/consistency. I imagine we'd have to reach consensus somehow over the computation of an embedding.
How static is the language model really... thinking about decentralized fine-tuning and its challenges
With a distributed database of vectors, how do you efficiently search the network by cosine similarity (as opposed to text-matching)? This might as well be trivial; I'm just brainstorming here.

Use this as a basis for semantic search and improve through OL2R in the next step?

This week's ToDos:

Finish my mini-project of a NN that is able to predict a sine function
Explore collaborative filtering, recommender systems
Explore information retrieval on knowledge graphs

synctext · 2023-05-03T13:20:08Z

Please chat to all people in the lab, understand their speciality. btw documenting your lesson learned; ML blogs make it look easy, none of them worked.

Please review the Meerkat system Github repo from RedPajama.

Something we called crowdsourcing 20 years ago: RLHF, https://huyenchip.com/2023/05/02/rlhf.html
https://github.com/Tribler/science/issues/42
https://github.com/openlm-research/open_llama
https://github.com/togethercomputer/RedPajama-Data/blob/main/data_prep/arxiv/arxiv_cleaner.py
2011 P2Prec: A P2P Recommendation System for Large-Scale Data Sharing
2013 A peer-to-peer recommender system for self-emerging user communities based on gossip overlays
simpel sprint idea: IPv8 community with trivial primitive. 3 items in a request, 2 recommendations + signed semantic similarity in return. ToDo: tag-based.

Suggested sprint: you successfully encoded sin(). Next, take 4 images, try to embed them in a machine learning model using useful overfitting. Meerkat docs example of embed(). Scientific primitive, unbounded storage of unstructured data for decentralised learning. Key question, how much storage space for 1 million static thumbnails?

import meerkat as mk
mk.search(
    df, 
    query=mk.embed("A photo of a person", engine="clip"),
    by=mk.embed(df["image"], engine="clip")
)

mg98 · 2023-05-08T11:17:53Z

Got a sine function prediction working using a NN regression model. It's not much... but feels good to have succeeded at this task. Learned about activation functions, and the challenge of parametrization in ML.

Also did some reading on tagging, recommender systems, and collaborative filtering, which opens another broad area of research, e.g., the issue of trust in collaborative tagging (see research topic 4 in science#42) - which I do find interesting.
Publication idea: Empirical analysis/ measurements of MeritRank deployed on Tribler

This (and perhaps also the next) week, I want to play around with Meerkat, learn about Autoencoder, and see if I can get something up and running, i.e., another ML model. I hope to further evolve my understanding of AI/ML and learn about yet new concepts.

synctext · 2023-05-12T15:07:19Z

Rich metadata embedding of "artist passport" Cullah.
Cullah donation wallet 1FfnfPJJ6yTTT9PGZe8TGgfEGQFc9kvQoW
Linkage with strong identity and GoldEuro direct donation by superfans.

scientific venue https://dl.acm.org/conference/facct

mg98 · 2023-05-13T06:40:23Z

Update on Autoencoder:

The hardships of the last weeks seem to start paying off. I was able to create some functional autoencoders within just one day.

I trained a model on eight pictures displaying tulips (dimensions: 240x180px), i.e., an input layer of 3x240x180=130k neurons, reduced that to 1000 neurons in a single hidden layer (encoding). If I'm not mistaken, this equates to a data reduction from 130 to 4 KB (the original JPEGs had 50-80 KB).

Example decoded output:

This might not be impressive, and with the right parametrization, we might be able to get more out of it. But for now, I'm just happy that it works in principle.

mg98 · 2023-05-22T13:20:53Z

Motivated by my recent success with autoencoders, I spent the last week trying again to get a pairwise LTR model for a sample of test data running. By doing that, I learned a lot more about the details of this approach. However, I had to pause this project because I would like to move this outside of a notebook and run it locally. I'm waiting for my work MacBook for that (my machine has difficulties) - it should arrive next week.

So now I turned to the idea of NN-based file compression which apparently is not only successful with the task of lossless compression but can actually compete with traditional algorithms like GZIP or 7Z (see, e.g., DZIP or its successor DeepZIP).
I find that quite impressive and as I have in the past been working in the realm of data deduplication and also compression a bit, I'm interested to understand how that works, how much of it is theory, and how much of it can actually sustain real-world use cases.
They do employ autoencoders but also recurrent NNs. RNNs are yet new to me and I've come across them before in the context of language models, so I want to educate myself on them finally and thereby further broaden my knowledge of AI/ML.

synctext · 2023-05-31T10:08:02Z

Using LLM for ranking. Background https://blog.vespa.ai/improving-text-ranking-with-few-shot-prompting/ and also https://blog.reachsumit.com/posts/2023/03/llm-for-text-ranking/ Did a silly OpenAI experiment, might be useful building block, but small part:

These are examples of queries with sample relevant documents for
each query. The query must be specific and detailed.

Example 1:
document: ubuntu-mate-19.10-desktop-amd64.iso
query: latest Ubuntu image

Example 2:
document: ubuntu-21.10-beta-pack
query: ubuntu 21.10

Example 3:
document: chemistry-3b-002-fall2014-ucberkeley
query: chemistry class Berkeley

Example 4:
document: 14BHD 2020-2021 Informatica
query:

Found this dataset with 7000+ pet images (+ code) for scientific research, so it sort of works 🐶

mg98 · 2023-06-02T11:26:33Z

So I ended up not learning about RNNs, NN-compression, etc. last week.

Instead, I investigated an idea proposed by @grimadas, which is to leverage ML as a means to classify nodes as Sybil or not-Sybil and use the resulting score as a weight-factor in the reputation algorithms of MeritRank.
The way this would work is by embedding nodes using algorithms like node2vec (excellent tutorial series) such that in vector space they reveal information about structure and connectivity of individual nodes, off which patterns can be derived or learned that could identify Sybil-regions. Further, an ML model would be trained on the simulation of Sybil attacks (cycle, serial, series). The hope is that the model will find more effective strategies to mitigate Sybil attacks than the static heuristics employed in MeritRank.
In a first short literature review, I have only found efforts of ML for Sybil detection/tolerance in the context of online social networks.

Back to LTR

Got my new MacBook yesterday 🔥 so I was able to continue my work on the OL2R project. My goal was to just get anything working, and I specifically sticked with the pairwise LTR approach for this. To this end, I was only following the basic idea, which is to train a model based on query-document-pair triples, and followed my intuition for the rest.

Algorithm

I use a truncated version of the CORD-10 dataset, which is a collection of COVID-19 related scientific papers and their embeddings (based on title + abstract and the allenai/specter language model)
For every new (i.e. unseen) query, I embed it, and I first consult simple cosine similarity to get a top-5 list of matching documents. I (pairwisely) train a model on this result set on 100 epochs.
Let a query's result set be an ordered list of documents (1,2,3,4,5). If a user now selects result 3, we loosely (10 epochs) train the existing model on the desired order (3,1,2,4,5).

Demo

Code

Remarks

A query's result set is eternally biased and restricted to the top-5 results returned by the initial document matching -> Remedy idea: some measure of exploitation/exploration balancing.
A single model will be trained on multiple queries. I haven't thought through the dynamics this would have, if they are useful or disruptive or how they can be exploited.

synctext · 2023-06-06T08:46:27Z

Please study in detail: 👀 - Tribler roadmap for semantic search, tags, MeritRank, and discussion
- ~~COVID-19 related scientific paper, sounds like great dataset for 1st chapter!~~ probably audio/video is better match for Tribler
- MeritRank+crowdsourcing phd chapter 2023?
- Learn-to-rank 2024 thesis chapter?
The latest from the "Sybil protection for machine learning" master student
Web3Recommend: "balancing relevance and trust" master thesis
Web3 survey with nearly 300 citatitions

Sprint: get a solid production-level metadata dataset (Creative Commons, https://github.com/MTG/mtg-jamendo-dataset ?)

mg98 · 2023-06-11T09:07:41Z

Update:
I have started giving more time again to my first thesis chapter (the one I started writing several months ago, with a colleague). Work title: "A Comprehensive Study of Content-Defined Chunking Algorithms for Data Deduplication". I want tot finish this within the next 1-2 months!

MeritRank+crowdsourcing phd chapter 2023?
Learn-to-rank 2024 thesis chapter?

Love this roadmap!! ❤️

I have started getting my hands on the Tribler code and gain a better understanding of its inner workings. Will try to move forward with the practical preparatory work for the next thesis chapter, such as getting a dataset.

mg98 · 2023-07-12T09:19:28Z

Over the last month, I was mainly focused on my first thesis chapter. That involved writing, but also running experiments, fixing bugs in our code, and figuring out the best way to present our data.

For example, visualising the chunk size distribution was a challenge because of the amount of experiments and amount of algorithms we evaluate. Furthermore, the variance in behavior made it difficult to fit everything into a single plot while preserving readability.

Apart from this, I was reading some papers, talking to my peers in the lab, trying to understand what they're doing, and also continuing to explore AI/ML, e.g., learning about transformers.

synctext · 2023-07-12T09:34:10Z

Please write 3 next paper ideas around topic of "Web3 crowdsourcing". 6-page for DICG 4th workshop would be indeed a great learning. One of 3: crowdsourcing of metadata using MeritRank; everybody can tag, describe work done of KnowledgeGraph, desing an Algorithm 1, discover tags, rate them using MeritRank, emulation using Bulat Python magic, epic Sybil attack. take Rohan algorithm, get going in tag context instead of recommendation, improve somewhat.

Great news, hoping chapter to be ready for final reading & submission aug/sep. As a starting point for new lab people a short history and essential reading.

Year	Topic	Milestone description
2000	vision	Open Information Pools was the first paper with the dream of establishing a new Internet. Now grown into the Web3 movement. Dreams of a young man, taking 23 years to realise and still counting. `A reputation function calculates a reputation based on several aspects of the user, like the amount of activity of the user, number of retrievals of his submitted information, the number of modi�cations to his submissions, etc.`
2006	vision	Tribler, initial overview. First deployment of social networking with academically pure self-organisation
2008	vision	we reported on BarterCast feedback loop
2023	vision	A Deployment-First Methodology to Mechanism Design and Refinement in Distributed Systems
2020	trust	TrustChain: A Sybil-resistant scalable blockchain
2018	trust	trustchain IETF draft Internet standard https://www.ietf.org/archive/id/draft-pouwelse-trustchain-01.txt
2018	trust	we formulated the improved feedback loop
2019	trust	Jeremie is now a professor at Delft: RepuCoin: Your Reputation is Your Power
2021	trust	Foundations of Peer-to-Peer Reputation
2023	trust	Meritrank solves the impossibility result formulated by Prof Parkes of Harvard. A reputation system which might actually induce online trust.
2010	community	Private communities help eachother, biggest measurement ever conducted. No tit-for-tat in most of the time with private communities, see Figure 6.
2011	community	competition due to oversupply. Fast download but eternal seeding: The reward and punishment of Sharing Ratio Enforcement
2013	community	Systemic Risk and User-Level Performance in Private P2P Communities
2020	DAO	Technology Stack for Decentralized Mobile Services (Matt Scala)
2021	DAO	Towards a Robot Economy for the Music Industry (Tim Wissel)
2022	DAO	Web3: A Decentralized Societal Infrastructure for Identity, Trust, Money, and Data (Joost Bambacht)
2023	DAO	Performance analysis of an offline digital Euro prototype (Robbert Koning)
2023	DAO	First Deployed DAO with True Full Decentralisation (Brian Planje)
2023	AI	MoDeST: Bridging the Gap between Federated and Decentralized Learning with Decentralized Sampling
2023	AI	Decentralised recommendations with trust and relevance (Rohan)
2023	AI	Towards Sybil Resilience in Decentralized Learning (Thomas)

synctext · 2023-07-13T13:17:10Z

New ACM Journal on Collective Intelligence seems like a solid venue to target. Your second chapter could then simply be decentralised collective intelligence. Using Tribler to share info, use tagging, and trust. You can re-publish a lot of the Tribler work done by others in the past 18 years and 3 months.
the journal encourages a broad-minded approach to collective performance. We welcome perspectives that emphasize traditional views of intelligence as well as optimality, satisficing, robustness, adaptability, and wisdom. In more technical terms, this includes issues related to collective output quality and assessment, aggregation of information and related topics (e.g., network structure and dynamics, higher-order vs. pairwise interactions, spatial and temporal synchronization, diversity, etc.), accumulation of information by individuals/components, environmental complexity, evolutionary considerations, and design of systems and platforms fostering collective intelligence.

synctext · 2023-07-29T12:37:51Z

https://eugeneyan.com/writing/system-design-for-discovery/

mg98 · 2023-08-06T12:14:11Z

To give a little update...

I skim-read through the complete list of papers and made my notes for each one.
My thesis chapter has seen great progress but is currently blocked again by my co-author who is on vacation now.
I have consulted with Bulat about your proposed paper idea (DICG) to understand and plan the required steps. I have yet to get a better understanding of the field to come up with my own ideas for topics.

mg98 · 2023-08-14T18:34:37Z

I'm focusing on the DICG'23 now. @grimadas

MovieLens Dataset

I was able to find a really nice dataset. MovieLens is a non-commercial project run by people from U. Minnesota since 1995. It's a database of movies that users are allowed to rate and tag. Any user can add new tags, and existing tags can be up- and downvoted in a way (see screenshot).

Tags also have the attribute of being positive, neutral, or negative. I am not sure how complete their dataset is about that, but they are responsive to my emails and seem highly cooperative with the provision of data.

We can use this dataset to get an idea of the quantity and quality when it comes to crowdsourcing of tags, and base our simulations on it.

Quick plots...

Idea

Perhaps, for this workshop, I could come up with some subjective tag scoring algorithm, a bit related to the "Justin Bieber is gay" problem. Playing with the idea that for a group of similar users, a tag might be agreed upon, but for another group of users the same might not, etc.
Regarding the beforementioned question, there might be a community of users (with similar taste) for whom he is, and another community for whom he isn't. Therefore, introducing the notion of "subjective reality".
What indicates user affinity? It could be that they are interested in similar content, i.e., they interacted (tagged or rated) with similar movies.

Approach

Propose some algorithm that for a tuple (peer, resource, tag) can calculate a score in [0,1].
Introduce the notion of user affinity by some metric of has-interacted-with-similar-movies.
Make plots and observations... Perhaps we can visualize the 'bubbles' of communities that are created by this metric of user affinity and how they support different tags. It will depend on the quality of the dataset though, I'm yet skeptical.
Simulate my subjective tag scoring algorithm and observe its effects. Given the tags that the user has agreed and disagreed on in the dataset, we could be able to introduce a metric of success for this algorithm. My concern here again is potentially the lack of data. We rely on the existence of a big enough set of highly active users whose contributions are distributed on not too many movies.

Will further investigate this idea and the dataset and make updates here. Comments welcome.

devos50 · 2023-08-15T07:02:38Z

Just noticed this line of work, very interesting! I worked on something similar (trust, tags + MovieLens dataset) more than a year ago, see this issue (note that this is a private repo with many of our research ideas so you might have to request access). The overall goal of that issue was to work on the foundation of tag-centric crowdsourcing in Tribler.

I tried out a few algorithms/approaches and I remember I identified some shortcomings of Credence, which is related to what you're trying to achieve. but as reputation/trust was not really my domain, I decided to focus on the fundamental data structures instead (using Skip Graphs and Knowledge Graphs for Web3 data management). The paper with these ideas is currently under review. Nonetheless, extending that work with a trust layer would be a valuable contribution!

mg98 · 2023-08-28T23:50:22Z

Hi Martijn :) thanks for your input!

I was knocked out by COVID over the last two weeks, and still am a bit, but here is the continuation of what I was trying to do:

I have calculated user similarity based on the Pearson correlation of common sets of rated movies (as suggested here), and based on that, subjective tags on movies (indeed similar to Credence in that I weigh based on peer correlation). I based this solely on the interactions of the 200 most active users (perf reasons).

Example of a sample of users and their subjective tags on the movie "The Shawshank Redemption"

              tag     score
126      friendship  1.465223
260          prison  1.367187
8    Morgan Freeman  0.966742
275            rape  0.924138
151            hope  0.854776

          tag     score
126  friendship  1.109590
260      prison  1.103990
275        rape  0.703847
151        hope  0.673232
285  redemption  0.673232

            tag     score
247        prison  1.158236
121    friendship  0.947820
293       revenge  0.735881
333       suicide  0.685857
134  great ending  0.578089

From there on, I tried to find extreme results, i.e., movie tags for users of "opposite" groups. To this end, I looked up controversial movies and their tags for users with minimum/negative correlation, hoping for something like a clear political or a gender split.

And it wasn't easy, perhaps due to the lack of data. But I still found an interesting disparity for Disney's Star Wars remake.

While one user has funny, good action, and great action among his top tags,
another user has Bad jokes, Weird Pacing, boring long, and Script Recycle among its top tags, and further, feminism and social justice.
Most of these tags exist with negative scores in the list of tags of the respective other user.

Full list of tags for two negatively correlated users on "Star Wars: The Last Jedi"

                 tag     score
29         plot twists  0.181055
4     Benicio del Toro  0.181055
1                 BB-8  0.181055
16         Space opera  0.181055
7                Funny  0.181055
8          Good Action  0.181055
9     Gorgeous visuals  0.181055
10        Great action  0.181055
12       John Williams  0.181055
17           Star Wars  0.126848
13         Mark Hamill  0.126848
18  Strong female lead  0.126848
14        Rian Johnson  0.126848
0          Adam Driver  0.099062
6         Daisy Ridley  0.099062
3    Bechdel Test:Pass  0.048224
19        Weird Pacing  0.048224
20              boring  0.048224
34        stormtrooper  0.036496
28     part of trilogy  0.036496
33         space opera  0.036496
21              bunker  0.036496
22              defeat  0.036496
23             failure  0.036496
32        space battle  0.036496
25        good vs evil  0.036496
30              sequel  0.036496
27  military operation  0.036496
24            feminism  0.020438
31      social justice -0.027786
2            Bad jokes -0.033770
15      Script Recycle -0.054207
5        Carrie Fisher -0.054207
11         John Boyega -0.054207
26                long -0.170404

                  tag     score
1            Bad jokes  0.854447
16            feminism  0.730180
0          Adam Driver  0.440624
4         Daisy Ridley  0.440624
2    Bechdel Test:Pass  0.413822
11        Weird Pacing  0.413822
12              boring  0.413822
15             failure  0.410806
24         space opera  0.410806
23        space battle  0.410806
21              sequel  0.410806
20     part of trilogy  0.410806
19  military operation  0.410806
17        good vs evil  0.410806
13              bunker  0.410806
14              defeat  0.410806
25        stormtrooper  0.410806
22      social justice  0.316358
10  Strong female lead  0.124267
9            Star Wars  0.124267
8       Script Recycle  0.124267
7         Rian Johnson  0.124267
6          Mark Hamill  0.124267
5          John Boyega  0.124267
3        Carrie Fisher  0.124267
18                long  0.101508

That was fun to explore but it still lacks a scientific methodology in order to really evaluate the effectiveness of the subjective tags I computed. Previously, I proposed that

Given the tags that the user has agreed and disagreed on in the dataset, we could be able to introduce a metric of success for this algorithm.

Maybe that gives us something. Maybe for all tags that have been up- and down-voted, I can compare the subjective with the objective reality and derive a success metric. And this would allow me to experiment with more sophisticated scoring algorithms and see their effect on this metric.

The paper with these ideas is currently under review. Nonetheless, extending that work with a trust layer would be a valuable contribution!

Good stuff. I don't know if trust should be my scope either. I'll talk to Johan today, will know more then.

Status update on

Please write 3 next paper ideas around topic of "Web3 crowdsourcing".

After almost half a year, I still don't have a grasp of the field enough to come up with own ideas for publications.

synctext · 2023-08-29T11:42:25Z

No worries about your progress in 6 months of a 48 months phd. Getting into the field of distributed system and doing something novel is hard. Having a draft publication within the first 12 months is already a solid achievement. Goal: April 2024 == 1 thesis chapter under review + 1 finished draft thesis chapter. Non-linear productivity 📈

Task for September 2023: come up with ideas for a scientific paper and select one (or get inspiration)

SwarmLLM: collective LLM intelligence (with new AI phd expert??)

We present the first proof-of-principle of collective intelligence for transformers. Intelligence emerges from the interaction between numerous elements [REFS]. We use a transformer as the basic building block for a interconnected network of connected unique transformers. Instead of the classical transformer approach with billions of parameters, we connect thousands of specialised transformers into a network. This is a generalisation of the mixture of experts approach with the highly desired new property of unbounded scalability. There is a cost to pay in our approach. In a typical divide and conquer style, the challenge of finding the correct expert becomes harder.
Go beyond deepswarm, a system with outdated MNIST evaluation from 2019 using pheromone update rules.

LLM as a key/value store

key: any youtube URL in Youtube-8M dataset.
value: the preview thumbnail generated {re-usage of your autoencoder idea??!!}
With sharding and multiple LLMs per node a unique datase can be spread in a ring topology, ordered by the keyspace.
Semantic clustering or not?

Rich metadata inside an LLM

Tulip picture embedding in generic form.
Rich metadata embedding of "artist passport" Cullah.
Cullah donation wallet 1FfnfPJJ6yTTT9PGZe8TGgfEGQFc9kvQoW
Linkage with strong identity and GoldEuro direct donation by superfans.

Tribler: a public semantic search engine

We shamelessly claim to have a proof-of-principle for public Internet infrastructure after 20 years of effort. We argue that critical societal infrastructure should be developed as a non-profit endeavour. Similar to Bittorrent and Bitcoin we present a self-organising system for semantic search. Our work is based on learn-to-rank and clicklog gossip with privacy-enhancing technology using a Tor-derived protocol.

Web3Search: Online Pairwise Learning to Rank by Divide-and-Conquer with full decentralisation

Embedding nodes using algorithms like node2vec. Embedding of any item using 50-ish dimensional vector.
Fully decentralise and make trustworthy, this older work from 2021: https://arxiv.org/abs/2103.00368
Dataset: Covid papers, Youtube-8M, creative commons magnet scrapes??

Foundations of Trust and Tags.

Use the @grimadas work with theoretical grounding: emergence of trust networks with repeated successful interactions. Use tags based on crowdsourcing if you have successfully collaborated.
"The extended mind", classic reading from 1998.

Next steps: learn-by-doing methodology. Work for 2 weeks further on the Tulip stuff. Autoencoder of 1000-ish Youtube-8M thumbnails. Work for 2 weeks on another of above brainfarts. Commit to best chapter idea.

mg98 · 2023-08-31T17:21:18Z

Seeing how far I can get autoencoding YouTube thumbnails. Time for some quick coding.

Using YouTube's API I got the thumbnails of 577 search results with "starship" as the query.
I'm using the same network parameters as I used for the tulips but with fewer input neurons (the thumbnails are 4x smaller in size).
However, it's not like YouTube thumbnails are good representations of the queried terms, nor is there necessarily a high similarity of visual features throughout the thumbnails in the search results, as you can easily verify: https://www.youtube.com/results?search_query=starship, or browse the YT 8M Dataset Explorer.

Note

Using YouTube's search API instead of its 8M dataset (can't run that on my machine!) is different in that I collect the thumbnails of videos which match the search query, and in the 8M dataset they sort of match the query (selected set of visual entities) with what they actually found displayed in the video.

I still went with it, trained the network on 576 thumbnails, and then ran the 577th search result's thumbnail through the autoencoder.
→

Initially confused about how well that went... Because, as I implied, I don't think the fed set of images works much better than any arbitrary collection of images. It's funny though because technically it's a data reduction from 32K worth of information down to 4K (the size of the model), and 4K is also exactly what JPEG compression gets for the original image. So I probably just built myself a generic image compressor. (error, see my next comment)

What might do is the labeling we get on frame-level (or ~1-second-interval video segments). We have that in the 8M dataset. Getting an actual image entails downloading the original YouTube video and then extracting the corresponding frame. That's costly but doable on a small to mid scale.

We have been thinking about doing text-to-image basically, using auto-encoders? I think that was the plan...
The 8M dataset contains embeddings on video segments/frames. It would be cool to be able to embed a custom query like "starship" and then find video segments near it. But it doesn't work that easily. The 8M model is not trained on text. We only get fixed labels, e.g., we have "guitar", "toy", "minecraft", a few thousands of those. "starship" is not part of this taxonomy. Perhaps we can use a separate model to link semantics between "starship" and the labels in the 8M dataset and thereby make our way...
Tbc

synctext · 2023-08-31T20:15:19Z

wow, as impressive as I hoped it would be!! 4k!
Just drop the YouTube 8M for now and focus on the 1000 Starship thumbnails. How good can you autoencode them all?
What tradeoffs? What does 5 days of M2 GPU crunching yields?

mg98 · 2023-08-31T22:50:26Z

There was an error in my code, and a bit in the approach. What I did was training only on 50 thumbnails, and then use a thumbnail that was part of the training data for testing. I updated my last comment; the result is very different.

Click here to see previous (misleading) result

→

What does 5 days of M2 GPU crunching yields?

Actually, PyTorch does not support CUDA (GPU acceleration) on Mac :( Google Colab with GPU runs faster for me.

mg98 · 2023-09-06T12:32:13Z

I'll share my notes on three of the suggested topics

SwarmLLM: collective LLM intelligence (with new AI phd expert??)

Learned about MoE, ACO, and NAS
Shazeer et al. (2017) propose Sparsely-Gated Mixture-of-Experts to significantly cut down the cost of training LLMs. It introduces a trainable gating network layer that activates a sparse set of feed forward networks (the experts) in a generic, i.e., general purpose, fashion.

Get an impression
People have thought about decentralizing this approach, e.g., Ryabinin & Gusev (2020)
Could you @synctext specify how this relates to your idea and what’s the link to DeepSwarm?

Web3Search: Online Pairwise Learning to Rank by Divide-and-Conquer with full decentralisation

The OL2R literature is exciting and I already dipped my toes in. I can see myself delving into it for a long time frame.
I am not yet familiar with the gossip learning literature, but it looks like conceptually working approaches of it exist. So why not try combining those two (OL2R x Gossip Learning)?
Don’t see the need for node2vec yet...
It would take me some weeks, maybe two months, to understand the literature and then implement a prototype. From there on I think I’d have a better view of the challenges, also with regards to trustworthiness, spam, personalized OL2R, …, there will be some valuable output to this experiment for sure! After all, we are the first to do this, I think.

Tribler: a public semantic search engine

So, LTR through clicklog gossip. That's G-Rank!
I just reread the paper and realized, it is not actual Learning to Rank! That is, it's not making use of any ML, no learning, no semantics, just a static ranking algorithm that incorporates user signals and peer similarity weighting, based on metadata text matching search.
There are probably ways to improve upon that work in both, effectiveness (better peer similarity or document similarity clustering algorithms...) and protection against adversarial attacks which it does to not seem recover from too well.

Crowdsourcing experiment that discovers user-perceived dimensions of semantic similarity among near duplicates Vliegendhart et al. (2011)

Researching these topics over the last few days, I feel inspired and more confident to develop ideas for publications!

mg98 · 2023-09-10T12:07:26Z

I'm really interested in designing my own holistic search system. Either using a simple algorithm like done in G-Rank or we actually dare to implement something like a decentralized LTR where each peer trains their own model.

I like the aspect of personalized search by the metric of peer similarity. I would do something privacy-preserving: Private Set Intersection (Protocol II). This could be implemented with both approaches, i.e., even with NN-based LTR we could still maintain local ClickLog databases in addition to the model.

So that's about the privacy aspect. Apart from that, Sybil-resistance was a weak spot in G-Rank, we could mix MeritRank-computed reputation weights in the algorithm.

I'm trying to build a prototype.

✨ 1st paper idea ✨

Update 5 days later

I implemented Protocol II from Zeilemaker et al. (2013) (my code in gist) and noticed what I would have understood if I just read a bit further in the paper: High item values and large sets of items become computationally very heavy, very fast. It's impractical for what I planned on doing.
Zeilemaker et al. (2013) recognizes this issue and proposes partitioning of the sets. I am not sure if this could work with this application, but it's something to think about. If nothing helps, we could do a "private set intersection" using an intermediate third party; the values would be hashed anyway. That said, I'm still very eager and creative about finding a good solution to this problem.

Variance Reduction in Gradient Exploration for Online Learning to Rank

TODO

Come up with two more paper ideas

synctext · 2023-09-15T09:02:25Z

Field: https://scholar.google.com/scholar?hl=en&q=adversarial+information+retrieval Also: https://scholar.google.com/scholar?hl=en&q=dark+patterns+~youtube

Great stuff, metadata authorship is actually fundamentally broken on Youtube. Incentives are due to The Algorithm of ranking are obsessively geared towards clickbait. Related: Justin Bieber tag spam: https://musicbrainz.org/artist/e0140a67-e4d1-4f13-8a01-364355bee46e/tags also conflicting versions of content: https://musicbrainz.org/artist/c7d0d6f4-4eb2-4d36-8f61-fe7e2ce4af0d Justin Bieber Nickelback is a cover thingie? Or is this Internet trolling: https://musicbrainz.org/label/e58b8d5b-8278-4804-be44-ac2ea17ac90b 🦝

Scientific related work: https://www.kaggle.com/datasets/thelazyaz/youtube-clickbait-classification
Youtube videos with clickbait titles as: 15 Biggest Animals You Won't Believe Actually Exist!
Original with 6M views and acurate short hashtagging: https://www.youtube.com/watch?v=oXtATL9vtIs
6,272,233 views 20 May 2022 #CharliePuth #WizKhalifa #SeeYouAgain
See You Again - Wiz Khalifa (ft. Charlie Puth) | Furious 7 Soundtrack

See clickbait example: https://www.youtube.com/watch?v=CuyOr7WThug

Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons
👋 Hey guys, welcome to Contemporary Pop Hits
💖 I am a lover of Music, so I created this channel to share with you my favorite remixes of popular songs.
🔔 So right, you make everything feel so nice.
🔔 Click the bell to stay updated on the best Lyrics / Lyric Videos from Contemporary Pop Hits!

❖▬▬▬▬▬▬▬▬▬▬۞▬▬▬▬▬▬▬▬▬▬❖
Lyrics:
Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons
Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons
Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons
Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons
Wiz Khalifa - See You Again (Lyrics) || Playlist || Ed Sheeran, Imagine Dragons

[#ContemporaryPopHits](https://www.youtube.com/hashtag/contemporarypophits)  [#Pop](https://www.youtube.com/hashtag/pop) [#Vocals](https://www.youtube.com/hashtag/vocals)
[#Lyrics](https://www.youtube.com/hashtag/lyrics) [#TopHits](https://www.youtube.com/hashtag/tophits)
[#Hits](https://www.youtube.com/hashtag/hits) lyrics [#lyricspopmusic](https://www.youtube.com/hashtag/lyricspopmusic) [#lyricspopsong](https://www.youtube.com/hashtag/lyricspopsong) [#lyricvideo](https://www.youtube.com/hashtag/lyricvideo) 
[#tiktokviral](https://www.youtube.com/hashtag/tiktokviral) [#tiktokhits](https://www.youtube.com/hashtag/tiktokhits) 
[#tiktok](https://www.youtube.com/hashtag/tiktok) [#tiktokvideo](https://www.youtube.com/hashtag/tiktokvideo) 
[#LyricVideo](https://www.youtube.com/hashtag/lyricvideo) [#PopHits](https://www.youtube.com/hashtag/pophits) 
[#avamax](https://www.youtube.com/hashtag/avamax) [#ruthb](https://www.youtube.com/hashtag/ruthb) 
[#taylorswift](https://www.youtube.com/hashtag/taylorswift)  [#justinbieber](https://www.youtube.com/hashtag/justinbieber) [#charlieputh](https://www.youtube.com/hashtag/charlieputh) 
[#passenger](https://www.youtube.com/hashtag/passenger) [#maroon5](https://www.youtube.com/hashtag/maroon5) [#wizkhalifa](https://www.youtube.com/hashtag/wizkhalifa) [#adele](https://www.youtube.com/hashtag/adele)
 [#zayn](https://www.youtube.com/hashtag/zayn) [#sia](https://www.youtube.com/hashtag/sia)
 [#onedirection](https://www.youtube.com/hashtag/onedirection) [#KatyPerry](https://www.youtube.com/hashtag/katyperry) [#EllieGoulding](https://www.youtube.com/hashtag/elliegoulding) 
[#PalomaFaith](https://www.youtube.com/hashtag/palomafaith) [#EdSheeran](https://www.youtube.com/hashtag/edsheeran) [#LewisCapaldi](https://www.youtube.com/hashtag/lewiscapaldi) 
[#RuthB](https://www.youtube.com/hashtag/ruthb) [#MileyCyrus](https://www.youtube.com/hashtag/mileycyrus) 
[#ShawnMendes](https://www.youtube.com/hashtag/shawnmendes)


Tag: Contemporary Pop Hits,lyric,lyrics,song lyric,song lyrics,pop song,best song,trending song,trending,lyric 
video,lyrics video,pop,us uk,pop music,spotify,music 2022,throwback song,hits mix,songs,playlist,cathy,
old song,old songs,tiktok song,tiktok trending,tiktok songs,hiphop songs,pop songs,rap songs,relax songs,
new songs,new music,hit songs

mg98 · 2023-09-15T09:15:17Z

May the Best Metadata Win: Authorship for Perfect Metadata Curation

I gotta say I understand my field (IR in DS) much better now. The problem can and must be attacked from two sides:

Developing a search algorithm that is good at finding relevant content
Having good (“well curated”) content!

(2) -> I thought about the metadata issue, and tried to take a few steps back...

The web is essentially decentralized, YouTube in a way is decentralized. Not with regards to (1) but (2).
What I mean is the content speaks for itself. And why is that? Because the authors care to be found, they are incentivized to attach the right keywords, and to describe their content following a certain structure to be found by their target audience. (Note: they are also disincentivized to write bs because of things like ranking punishments for bounces)

I am also intrigued by the following intuition: If YouTube believed that crowdsourcing worked, if it would help describing the content truthfully and therefore help users find the content they’re looking for (which they care about), they would do it!!

They might do so in many implicit ways (likes/dislikes, watch time, …) and they might have automatic annotations (nowadays a lot of ML probably), but not in the way of crowdsourcing tags.

So what am I advocating for? Authorship!
But also single-authorship: Screw tag crowdsourcing, let us instead have multiple versions of metadata competing for each other. The metadata could be maintained by its original author, or even be completely immutable (sth to think about). The incentive then could be anything: Your name in shiny letters, your BTC wallet address displayed. A metadata entity would link to a torrent file or to a bundle of torrent files and describe how they differ (1080p, 720p, ...).
It's the responsibility of a good search & ranking algorithm, as well as user signals (report, like/dislike), to then figure out the best metadata.

I feel like if you really want solid metadata, authorship is the way to go.

✨ 2nd paper idea ✨

People don't put up with clickbait and will react by the means of reporting, dislikes, comments.

(Source: Gothankar et al., 2021)

mg98 · 2023-09-19T07:38:15Z

Notes on ✨ 1st paper idea ✨

The goal of this paper would be to propose a system for personalized search (at best, learning-to-rank) with an emphasis on privacy.

I continue thinking about this issue. It is difficult. Very challenging.

I can think of an online protocol doing secure multi-party computation, k-anonymity, ..., a lot of things become imaginable in this scenario. Designing such a protocol becomes a fun thought-experiment, but its application is impracticable: dependence on similar peers being online, and expensive network operations with every search.

We need something that works offline!

Peers could apply their set of queries and selected results on Bloom filters and gossip that around the network. Much more space- and network-efficient than what G-Rank did. This, however, would still only serve for a "If you know, you know" privacy... which is also better than what G-Rank did.

Edit/ spontaneous idea: What if we use distance-sensitive Bloom filters. Not only can we then respect for query or document similarity, but we could also add noise to the data (differential privacy?)

I also thought about learning-to-rank (LTR) and how and if this should be combined with search personalization.
There is merit in a solution where the information retrieval and the personalization algorithm are decoupled. That is, an algorithm A gives us an ordered list of results, and an algorithm B reorders them based on peer similarity.

A could be LTR, but it doesn't have to be. Let's say it is! We would be the first to publish about LTR in p2p! (cf. #7586) So how are we gonna do that? Should there be consensus on one global model that all models converge to (gossip learning)? Or should peers run local models that benefit from global data but also maintain personalization (maybe by maintaining two models: the global and a personal one)?

In the latter case, could we couple that with B, such that the personal model is gossiped between similar peers and their updates are applied onto our personal model based on their similarity?

Maybe from this question we can derive the ✨ 3rd paper idea ✨, which would deal with decentralized LTR, its risks and challenges.

ToDo for this week:
Get more concrete with 2nd paper idea ("May the Best Metadata Win") and get inspiration for a 3rd/4th paper idea.

mg98 · 2023-10-03T09:02:49Z

✨ 1st paper idea ✨ Towards Privacy in Personalized Ranking Models in Decentralized Search

I spent a whole lot of time in the last two weeks thinking about a solution to this problem. My conclusion, in the end, was to use Bloom filters with calibrated parameters for plausible deniability in addition. Either that, for an "offline" protocol, or do SMPC. I'm a bit disappointed by the simplicity of this solution and if something truly paper-worthy could grow out of it. So I've set that on hold.

✨ 2nd paper idea ✨ May the Best Metadata Win: Authorship for Perfect Metadata Curation

Still interested in that as good metadata lays the foundation for every IR algorithm I could develop later. I was too distracted by other ideas to really focus on that. But in the back of my head, I kept thinking of ways to approach this paper. How could I possibly evaluate this idea, what dataset could I use... it does not seem obvious to me at all.
In my research, I stumbled upon the AOL 2006 dataset, listing "anonymized" user search histories and the result they picked. Great fun.

✨ 3rd paper idea ✨ Decentralized OL2R: Risks and Challenges

Gossip learning works so why not go and try implementing OL2R on it, nobody has done it.

✨ 4th paper idea ✨ Decentralized LLM

Everything would be easier if we just had a semantic understanding of the documents. Maybe then I could use the distance-sensitive Bloom filters I have encountered earlier, researching my 1st idea.
Petals (very fresh!) seems to have got something running that actually works. Maybe we could build on that, see what they're missing. From what I know, privacy is still unsolved, and perhaps security is another aspect that needs to be further examined.

synctext · 2023-10-03T09:49:59Z

proposal of a 1 month coding hackathon
Code from today until 3 Nov 2023, team up with @pneague possibly.
Likely lesson learned: Machine learning 1%, distributed ML using IPv8 community 4%, remaining 95% of time: dataset, tooling, "system engineering 4 ML".
Goal: gossip learning in a IPv8 community
Outcome: single amazing .GIF
Getting stuff from 2011 invented by Prof. Jelasity and his phd student to actually work.
ahhh, wait, in 2012 Tribler was the first to deploy decentralised AI. 😄 Distributed Machine Learning Using the Tribler Platform - msc thesis University of Szeged. Amazing work, after graduation he got to live the Californian dream.
Master student has this working on Android. ClickLog approach for viability: BeyondFederated - truly decentralised learning at the edge
Brainstorm: build full-custom Youtube URL dataset. Everything around trailers and low-ranked hits, different language, low resolution (e.g. non 4K) for trailer queries. : https://www.youtube.com/results?search_query=barbie+movie+2023
limit time for dataset to maximum 1 week???
- https://research.google.com/youtube8m/
- https://www.google.com/search?q=site%3Akaggle.com+youtube+dataset
- Actualt ClickLog https://www.kaggle.com/datasets/sohier/recommender-click-logs-sowiport GOLD 🥇
  - lots of work around this one https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HFIV1A
Bonus: lastest james bond movie semantic search results gracefully handles the typo, understands query, and finds this trailer:
- US trailer: https://www.youtube.com/watch?v=N_gD9-Oa0fg
- Trailer 2: https://www.youtube.com/watch?v=vw2FOYjCz38
- Another trailer: https://www.youtube.com/watch?v=BIhNsAtPbPI
- Final International Trailer: https://www.youtube.com/watch?v=p9gREXr-zzE
- Blu-ray, 4K UltraHD & DVD On 12/21 trailer: https://www.youtube.com/watch?v=3VgLCR3t7nk

update after your 1 months sprint you could join the "Blockchain for Trust" team for another 1 month sprint. Today this team has been formed by @grimadas joining @InvictusRMC for #7517

mg98 · 2023-10-16T10:31:10Z

Outcome: single amazing .GIF

Tada🙆🏻‍♂️🙆🏼‍♂️

Source code: https://github.com/mg98/p2p-ol2r

We learned that gossiping an entire model is expensive due to its size. In our prototype, this corresponds to a payload of 2.7MB (multitude of that to be expected in real-world applications, of course).

The literature suggests various compression techniques (e.g., only gossiping a subset of the model parameters). Furthermore, there are ways to decrease the total number of messages sent.

While we have a solid base implementation of OL2R in P2P, many improvements are required or desirable to make it practical. These include

Sybil-tolerance
better efficiency
privacy
personalization

Perhaps those are all general to gossip learning. We wonder what is the specific challenge with the application of LTR. One might be the size of the model: How much space would it need, and would it still fit on a single device, or would we have to think about distributing the intelligence?

synctext · 2023-10-17T07:27:57Z

Great news! Please focus fully on practical deployment. Nobody has done this yet in a fully decentralised manner, all prior work is just unproven theory 🙃 (credits to MIT: "On-Device Training Under 256KB Memory) This payload of 2.7MB for a single pairwise sync is a lot. Ideally it fits inside a single UDP message of 1472 Bytes. So, just a few compressed parameters or something would be awesome.
Security and privacy are great after it actually works, right?

mg98 · 2023-10-17T14:19:32Z

Quick update:

Without the metadata overhead created by PyTorch, we can compress the model to 0.72 MB instead of 2.7 MB
Using quantization, as suggested by the MIT guys, we can cut it down to a quarter (i.e., 0.18 MB)
We still have not made efforts to optimize our hyperparameters. Perhaps we can do with fewer neurons, fewer layers.

Sparse weight updates seem like a good idea. See our following experiment:

Training alters only a subset of parameters (30-80% and possibly less)

(base) ➜  p2p-ol2r git:(main) ✗ python main.py 1
Indexing (please wait)...
QUERY: molecular tumor
Epoch [10/10], Loss: 0.4321
593859/721921 parameters changed (82%)
QUERY: molecular tumor
Epoch [10/10], Loss: 0.2447
498198/721921 parameters changed (69%)
QUERY: molecular tumor
Epoch [10/10], Loss: 0.0749
368436/721921 parameters changed (51%)
QUERY: molecular tumor
Epoch [10/10], Loss: 0.3645
427678/721921 parameters changed (59%)
QUERY: corona
Epoch [10/10], Loss: 0.2711
426012/721921 parameters changed (59%)
QUERY: corona
Epoch [10/10], Loss: 0.4429
311566/721921 parameters changed (43%)
QUERY: corona
Epoch [10/10], Loss: 0.1320
258525/721921 parameters changed (35%)
QUERY: corona
Epoch [10/10], Loss: 0.1076
218907/721921 parameters changed (30%)
QUERY: corona
Epoch [10/10], Loss: 0.4273
226130/721921 parameters changed (31%)
QUERY: cancer
Epoch [10/10], Loss: 0.2630
498395/721921 parameters changed (69%)
QUERY: cancer
Epoch [10/10], Loss: 0.3669
394564/721921 parameters changed (54%)

However, those require proper annotation which would blow up the size. I doubt it's worth pruning only the 0-delta params.
I wonder, though, can we set a threshold on weight deltas before propagation? Is the intuition right that major weight deltas contribute to the model output more than minor weight deltas (or the combination of many minor weight deltas)?
Looking at quantization (~~essentially rounding params~~ ooh it's more than that), it seems like it is!
Still, I haven't found a gossip learning paper talking about this yet.

synctext · 2023-10-20T08:29:34Z

ToDo: Determine an effective deployment plan. Grand idea: quick to production and iterate fast - ELON style.

0️⃣ First data phase. Create Youtube/Bittorrent swarm dataset in maximum 3 days. Focus on getting started, training, and simple ML code.
- 10 november First code review by @qstokkink of standalone running code
- after the code review we can make an informed plan for deployment
- ready to use embedding of COVID and ready to use dataset: there will be less data to work with. No abstract. No document vector.
- use perfect metadata, something like this: https://www.kaggle.com/datasets/datasnaek/youtube-new?resource=download&select=USvideos.csv
- Trail and error of finding a good neural network architecture. Formal method with ease-of-use does not exist. So machine learning is still black magic today! Our very own Marco Loog paper on the alchemy of deep learning
🚫 NO ML phase. read-only community of the tripplets. You get them only if you specifically ask for them. Crawl. Check validity. (7.14)
✅ Offline ML phase . Use the production network to test Tribler version which collects these tripplets. Use the production network to train your offline AI weights.
✔️ Simple ML phase, on-device AI architecture. Add a few semantic results inside the traditional syntax-matching search pipeline. Avoid the cold start problem by giving hard-coded starting weights inside the Tribler 7.15 release. Triplets are exchanged and used to further train the model locally.
☑️ Improved gossip learning. All search results come from machine learning. We have a lot of training data from production now (both 7.14 and 7.15). This is used to tune, tweak and re-work the model for faster training, efficiency, and effectiveness in general. Use tags as key source of model intelligence? Full usage of metadata enrichment and crowdsourcing.

Reason for pushing strong on simple gossip exchange of human readable data versus model exchange with pure magic numbers. stability, ease of debugging, correctness, convergence, expertise of team, bloat in ML libraries, and lack of machine learning maturity. Really everything is in favour of doing gossip learning with data exchange, only true AI expert would do maximal AI. Easy low hanging fruit for a scientific in-depth 2nd paper: Distributed artificial intelligence: empirical proof for the model exchange versus data exchange paradigm (thesis chapter @pneague or @mg98). silly idea: learning aggragation and compression. Locality of learning to guard horizontal scalability, instead of global broadcast of 1 day of node-learning.

mg98 · 2023-10-22T12:07:56Z

Update on my experiments with Quantization

It's not as trivial as I thought, and I'm not sure if I fully understood it, but I'll try to give an explanation for those interested:

Quantization: Introduction & Overview

Quantization is a technique that is used to shrink a model's size by basically discretizing its parameters. This makes computations on the model faster and more memory-efficient. Typically, weights and activations are quantized from float32 to int8 following the formula $\lfloor w \cdot (2^{b-1}-1) \rceil$, where $b$ indicates the precision (roughly speaking).

Quantization relies on a phase of calibration in which it learns the typical range (scale) and zero-point of individual tensors, thereby optimizing the quantization parameters.

There exist three ways to do quantization.
Screenshot 2023-10-22 at 12 36 00

We are interested in Quantization-Aware Training. In this mode, there is no separate calibration stage (which wouldn't work in our use case). Instead, the model is in a quantization-prepared mode, constantly fine-tuning its parameters, but never actually quantized until conversion.
We get great computational overhead... but we still end up with a quarter of the file size when transmitting it over the network 💪🏻 which is our main objective 🎯

If you are interested in learning more, I found this source the easiest to follow. However, I think the full process behind PyTorch's quantization is more sophisticated and, frankly, still a little mysterious to me.

It also took me a while to implement it. Aggregating a quantized model with an unquantized-but-prepared-to-quantize model is, again, not a trivial task.

In the end, I got it to work. Below I show a demo without and then with quantization. (By the way, the UI evolved ☺️)

Demo.mp4

Demo.Quantization.mp4

Observations:

🥲 Training and inference got much slower
🎯 Transmitting the entire model in 92 instead of 354 packets (of 8 KB). We could easily get it further down (in the order of magnitude) by stripping the metadata off.
😭 The loss in accuracy is extreme, and perhaps unacceptable. Furthermore, I was not able to improve it by tweaking the hyperparameters or number of epochs. So that's a problem!

I think I'll pause it at this point, and switch my focus to the dataset, maybe do some empirical analysis.

mg98 · 2023-11-07T13:46:21Z

I discovered that our ranking model does not perform nearly as well as I thought. Reasons being a small set of results (length 5), which inherently resulted in minor ranking errors, as well as my own confirmation bias which gave undue credit to the occasional successes. Above all, our testing lacked a rigorous analytical approach.

Thus, the last two weeks have been an intense period of debugging, refactoring, and testing.

I increased the number of results to around 10 and introduced an evaluation metric (nDCG) to gauge the model's performance--it was about random😑. This led me to experiment with new loss functions, new optimizers, even new model designs, in addition to all the other hyperparameters like model size, epochs, learning rate, ... I learned that those little things can often have a dramatic impact on the performance. Furthermore, I learned about overfitting and dropout regularization.

I made everything configurable! And played around with the options running against my automated tests. Tweaking around, I managed to get near-optimal performance with 10 results (according to my simulation of 5,500 user signals).

To further optimize my parameters, I plan to adopt a systematic approach to parameter search. I know that there exist some tools (e.g., RayTune) that can do this intelligently... will read into that... in any case, being able to control everything from my config.ini lays a solid foundation to do that, even on my own.

Another major concern that arose was the effect of the number of epochs on the model accuracy. In an online (= continuous learning) environment, the number of epochs is not only arbitrary but also ever-increasing. Therefore, with static parameters the model will always continue to overfit. We'll have to dig the OL2R papers to see how they dealt with it... My guess is the solution will involve something like adaptive learning rates.

mg98 · 2023-11-27T05:36:44Z

On a hybrid work/vacation until mid-December. Trying to recap and orient myself in my research:

Broader Goal: Better information retrieval in decentralized systems

Explored Methods:

Mainly LTR
Recently started reading into LLM for IR (what Petru's digging atm)
(Metadata.... the important other half when doing IR. Haven't got anywhere with it yet. Listing for completeness.)

Roadmap:

I want to continue working on a functional prototype for Decentralized LTR (repo)
Focus could be
- the comparison of data exchange models (gossiping training data vs. model weights).
- Personalization is another interesting domain to investigate, but that might be for later
In need of a dataset to evaluate my model's performance. Concretely, I need a dataset that provides me
- search queries
- static list of top-k results given query
- which document has been clicked
- document vectors
Often, the list of relevant results is what is missing from available datasets. Collecting data from Tribler?

mg98 · 2023-12-22T08:35:01Z

Continual learning

Quick-fix for the overfitting problem of continuous LTR (= indefinitely growing number of epochs). I applied an exp learning rate decay and found parameters which converge to optimal ranking.

How I think LTR relates to the DSI approach

The DSI approach is capable of retrieve-and-rank. And I'm trying to understand where this leaves traditional LTR. My idea is that LTR's power lies in the consideration of user signals (click-through rates in particular) which go beyond semantic relationships, primarily, and then the potential for personalization, secondly. In which case LTR could complement this model in a two-stage IR system where the second stage is concerned with the reranking of the previously retrieved result set.

Another thought:

LTR comes in where metadata is insufficient. You could have great content, and shitty content, both with equal metadata. Scientific problem: Spam, clickbait, and the hidden gems are indistinguishable for most algorithms. On personalization: One person's gem is another man's clickbait.

In a future work, it would be interesting to test this out in a first-DSI-then-LTR architecture and evaluate if and how far LTR's re-ranking capabilities change the ranking of search results, and finally improve the accuracy of search results. The latter could be measured by metrics like clicks @ 1st result, for example. Empiric dataset required.

synctext · 2023-12-22T09:20:12Z

Possible next paper: DSI-then-LTR two-stage IR system. Another key issue for real-world usability is the bundeling and filtering of near-duplicate items as we started to experiment with in 2022:

synctext · 2024-01-08T09:31:33Z

ongoing work on the "A Comprehensive Study of Content-Defined Chunking for Data Deduplication" paper. With 27-pages it's complete, but text polishing remains. Hard to compete with chunks against bleeding edge sexy ML. (March target (acm transactions on storage???, solid thesis chapter)
DSI-then-LTR two-stage IR system finish on 19 Jan!
brainstorm on next-next paper
- crowdsourcing direction
- end-to-end LLM for Tribler/Netflix

mg98 · 2024-01-23T07:45:37Z

Still working on the completion of my thesis chapter
Still working with Petru on Decentralized DSI paper (target EuroMLSys with deadline on 20th Feb)
Btw, we can finally use DAS6! 🚀 very much needed
Brainstorming other ideas, like...

Queries Is All You Need

When Google presented DSI, they had the luxury of knowing complete document texts. These were used as the foundation in training their LLM to map queries to (arbitrary) docids. They further state:

[...] it is clear that examples of type [query->docid] alone do not provide enough information for a system to generalize to novel retrievals [...]

Petru, lacking any document contents or metadata, trained solely on query-docid pairs. Astoninglishly, this model became able to output the correct docid to unseen queries (i.e., generalize to new inputs).

This is a discovery that deserves more attention for its implications in search, especially in systems where metadata is scarce.

Can we really describe documents by (and solely by) the queries people used to find it? Can we maybe by collecting enough diverse queries to a popular document, have a model learn the document's rich semantic profile?

Those, I think, are ideas worth an examination in another short-formed paper.

</Pitch>

Time for a first figure. I designed a simple experiment using a small subset of the ORCAS dataset.

I'm using T5-small as a base model (60 million parameters).
I took a sample of 100 documents with samples of 200 associated (ambiguous) queries, respectively (i.e., 100x200=20000 query-docid pairs)
For n=1..10, I reduced this dataset such that it's 100 documents with only n associated queries, respectively. This would become my train dataset. My test dataset would be what is left from the initial dataset.
After each iteration, the model is reset.
I trained on this dataset on 650-n*30 epochs. This is a formula that aimed to train the dataset with 100% accuracy, without over-training it (i.e., accuracies are 96%+)
From the test dataset, I took a sample of 100 queries for each document (i.e., 100x100=10000 query-docid pairs with unseen queries), and measured the model's accuracy on them.
Experiments conducted on a single node in DAS6, takes about a day to run with this setup. Parallelization on multiple nodes possible, though.

Experiment setup might have some flaws, but I hope with my first figure to give a general idea, which should be: the majority of queries can be processed successfully given only very few (according to experiment at least 2) prior queries to the sought document.

(y = success rate on unseen queries)

I wanna try again with 1000 (instead of 100) of distinct documents. Will update here.

Live update: (1, 10%), (2, 23%), (3, 27%), (4, 32%), ... so it definitely has the capacity to decrease sharply as the output space grows, but still fair results

synctext · 2024-01-30T08:35:21Z

Great results!!
It seems likely you can write a paper with the 'queries is all you need' idea
Lets not work on 3 papers at the same time. Submit something before diving further.
It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries. One ORCAS use case is Web mining, to find clusters of related queries and/or related documents. These can be mined for synonyms, used for expanding and understanding the vocabulary of queries and documents. The 10 million queries could be used in studies of query autocompletion., from https://microsoft.github.io/msmarco/ORCAS.html
Expand that golden figure! Start writing the first experimentation section; 1-2 pages. No intro, No problem, no related work.
Exaggerate! Artificially pick the most extreme example which produces the best results. Select from the 10 million queries the most clustered ones? Select the document with the most ambiguous queries? Select the documents with the most similar queries?
Storyline idea: In this experimental section we first present our motivation example and then elaborate with detailed performance analysis. The first experiment is chosen specifically for simplicity and dramatic effectiveness.
1 thesis chapter on block storage, 1 thesis chapter on 'queries all you need'
- Active learning (crowdsourcing, knowledge graph)?
- Continuous learning?

mg98 · 2024-03-07T12:53:50Z

It's time for another update. I have been conducting more and more experiments, as I was trying to better understand what's going on.

Here is one very interesting finding:

💡 More popular documents have a more diverse semantic range of queries than more niche documents.

^{⚠️ Caveat: "More popular" just means those document have a larger number of distinct associated queries.}

For example, people who looked for Gmail got successful querying "google login online" but also "create a new account".

We can also visualize that by clustering the embeddings for queries of low popularity (40-50 queries), and queries of high popularity (1000+ queries) on a 2D semantic space. (Dots represent queries, color-coded by their associated document)

As we can see, the queries of popular documents are semantically more dispersed, which makes them more difficult to group them together just by looking at their position. In other words, if you let k-means do the clustering (intuitively what our model would do), the mismatch rate would be higher with the high-pop docs than with the low-pop docs.

There are also metrics to quantify that. In the following figure, I have plotted the Adjusted Rand Index (ARI) of the clustering and increasing document popularity.

ARI values range from -1 to 1. A score of 1 indicates a perfect match between the clustering results and the ground truth, while a score of 0 indicates random clustering, and a negative score suggests less agreement than expected by chance.

I have plotted this for 20 docs and for 40 docs, to show that of course the more documents share the same semantic space, the more crowded the queries get, and the more unreliable the clustering.

Finally, this is of course reflected in the accuracy the model can attain, as shows the following example (here with 100 docs trained on 10 queries).

💡 Second finding is that the model often hallucinates, and beam search sometimes duplicates results.

I call a result a hallucination (or invalid) if it outputs a string that was not part of the outputs it got trained on.

This means that if you have a system where you are aware of the existing documents, you can instead take the first valid docid output by beam search, and therefore bump the accuracy.

On the lowest range of our experiments (100 docs and 1 query fed) this made a difference of +3%. However, on the upper end it is rather negligible. The following shows an excerpt of the results.

Documents	Queries	acc@top1valid	acc@top1	acc@top3	acc@top5	inv@top1	inv@top3	inv@top5	dup@top5
100	1	0.4265	0.3955	0.457	0.483	0.2185	0.534833	0.6417	0.0006
100	5	0.824	0.8235	0.898	0.9365	0.0015	0.2255	0.307	0.0185
100	10	0.878	0.8775	1	1	0.002	0.227833	0.3431	0.1753
100	20	0.932	0.932	1	1	0.0015	0.327833	0.4391	0.0628
1000	1	0.20515	0.157	0.198	0.2156	0.58495	0.757117	0.81045	0
1000	5	0.68995	0.68925	0.77	0.80935	0.006	0.16825	0.23343	0.01082
1000	10	0.79405	0.79385	0.96025	1	0.0014	0.191117	0.2658	0.07091
1000	20	0.832	0.8314	0.93815	1	0.0022	0.217933	0.28892	0.04126

mg98 · 2024-04-08T09:12:46Z

I have been digging into some papers in the context of the upcoming Queries-Is-All-You-Need chapter and potential future work. Dumping my learnings here.

Neural Corpus Indexer (NCI) [PDF]

Abstract: NCI is currently probably the best-performing version of DSI (after GenRet 😉 see below). NCI and others, however, gain most of their extra performance from leveraging document contents.

Most of these proposed systems, like also DSI-QG and SEAL, leverage knowledge over the document contents. For example, they will artificially generate queries to train on, or create semantic docids based on the contents (e.g., SEAL does ngrams for indices/docids).
So it's important to keep those technologies out of scope - we are interested in content-oblivious search.
NCI is interesting because they--besides doing a lot of other things--offer two technological improvements that do not rely on document knowledge:

Prefix-aware weight-adaptive (PAWA) decoder
Consistency-based regularization

Combined, these changes increased the Recall@1 from 62.57 to 65.86.

DSI++ [PDF]

Abstract: This is an update by the original authors of DSI to equip DSI with eternal learning (aka continuous or lifelong learning).

This is how they do it:

They optimize for flatter loss basins. There's this thing called Sharpness Aware Minimization (SAM), which seeks out flatter areas within the loss landscape. My intuition around this is that you have a wider area of tolerance when new things are learned and altered parameters shift the loss.
Then, they employ a second model (generative memory) that keeps generating pseudo-queries for previously learned stuff. This is then fed into the DSI for rehearsal.

GenRet [PDF]

Abstract: Using an autoencoder to learn to tokenize documents (my idea is to use the mean of the collection of associated queries instead) into short semantic docids.

The accuracy drops as soon as you replace ORCAS' docids D1234839 to magnet links 0xabcde...40 chars. Intuitively, the more tokens the model needs to generate, the more likely it will make a "typo".
Therefore, I asked myself how can we represent N documents using the least number of tokens. I naively tried to create docids myself where I calculated the shortest length of characters or tokens with which I could still create N unique combinations. But turns out, it either performs worse or equal to ORCAS docids. It apparently is not that simple because of the semantic meaning that is carried with every token.

So instead of writing a docid tokenization function myself, I would like to learn a function that does the docid tokenization. This is what the authors of GenRet did, and thereby (and only thereby) outperformed NCI (which got most of its performance from extracting document knowledge -- we could maybe do GenRet's technique using queries only).

I roughly understood the theory of how they were doing it, but luckily they are also open source (repo), I will try to get it to work and run experiments.

Other Interesting Papers for Future Work

DynamicRetriever team combined DSI with actual Learn-to-Rank [PDF]
Same team was also working on personalized LLM-retrieval, idk how it works [PDF]
DSI team published TIGER: Recommender system using LLM with I/O of form (doc1, doc2, doc3) -> (doc4) [PDF]

synctext · 2024-04-10T08:34:42Z

great progress in phd 🚀
ready-for-use semantic dataset
perfect fit with the The Global Brain - the roadmap #7064 goal of our Tribler Lab
arXiv datasets seem ideal for open science
- The peS2o dataset is a collection of ~40M open access academic papers, cleaned, filtered, and formatted for pre-training of language models.
  - It is derived from the Semantic Scholar Open Research Corpus(Lo et al, 2020), or S2ORC.
- 28 billion tokens of arXiv (no bibliography, just Latex 🤔 ): https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
- Compile Latex to semantic PDF ? LATEX Rainbow: Universal LATEX to PDF Document Semantic & Layout then seed this in Tribler?
How to get a clicklog to enrich this dataset?
- "learn to tokenize documents"
- Start seeding in Tribler and crawl?
ToDo: roadmap on life-long decentralised learning
- what are the issues (overfitting, pollution, spam, hallucinations, 300-scientists mega-job, decentralisation)
- what is the first realistic next step?

update : (more paper ideas then finished chapter, simply documenting)
A Survey of Hallucination in Large Foundation Models. Would it be possible to filter hallucination of our magnet link generative AI (Queries Is All You Need)? The popularity community is a gossip mechanism to update statistics of millions of torrents. We should be able to track dead torrents, popular torrents, and hallucination torrents.

mg98 · 2024-04-12T11:06:38Z

"learn to tokenize documents"

I got the script to work (GenRet). However, I was rethinking this idea and I don't see how we could sell it. Learning semantic docids from the queries themselves obviously requires you to have the queries beforehand, and even then it implies some fluidity (cannot just "improve" the ID in the continuum).
If anything, DSI's hierarchical clustering is much better suited for semantic docids, because here at least the semantic space on which it clusters it is fixed. GenRet really goes one step further and limits the space to what the document space encompasses. [my understanding]

I'm dropping this!

roadmap on life-long decentralised learning [...] what is the first realistic next step?

Yeah as you listed: overfitting, pollution, spam, those are also things that come to my mind. While there are some ideas how it could work conceptually (e.g., DSI++, IncDSI), the datasets they use to validate them are a bit weak (in those papers, NQ and MS MARCO). For a real evaluation, we (1) need real search queries, including spam, but also (2) we should care about the chronological order that the queries come in, and that the model learns on.

Waiting for the Tribler Clicklog.

Would it be possible to filter hallucination of our magnet link generative AI (Queries Is All You Need)?

Not in the way that is described in this body of research (referring to your survey link), I think.
What I have already done in my experiments is to restrict the vocabulary of the generator and control the length. This way you can tune it a little bit to the pattern of a docid.

What we can do in the next step is to assume knowledge of, let's say, healthy torrents. Using this knowledge, the model will be configured to predict the most likely token, with which the resulting output continues to match a prefix found within the set of healthy torrents.

Will do!

mg98 · 2024-05-07T11:52:12Z

Representation of Targets Matter

In our last paper, we saw significant differences in performance when representing our targets as ORCAS docids (e.g., D1234567) vs. as magnet links (40 character hex string). The model would generally have a harder time predicting magnet links. We blamed this on their length; more tokens to generate, more chances to trip along the way.

When thinking about how to optimize the performance of our model, I therefore thought the number of tokens on a docid should be minimized. Why not use the entire ASCII space for example? Or hell, the T5-small has 32k tokens, why not encode docids as, for instance, "car 2001": two tokens, 1 billion possible combinations.

It turns out this confuses the model more than it helps 😅. This beeeeeegs the question....

🤔 Using an LLM to predict arbitrary identifiers, what kind of identifiers come natural to it?

Is it a question of length? Or consistency? Or the employed vocabulary? What tokens should you use? How many?

I ran a lot of experiments to get closer to an answer about all these things.

In order to enhance the performance of our model, I initially thought that the number of tokens used to represent a docid should be minimized. My rationale was _less tokens, less chances to mispredict. And while that might be true, or maybe only true to some extent, it definitely seems to be case that the employed vocabulary matters too!

I have been experimenting with different representations (or rather encodings) of the targets (i.e., the docids) -- and, spoiler, the results are actually quite impressive.

Here is exactly what I did

I transform all ORCAS docids to their their SHA1 hash. This hash has 20 bytes. Because my experiment only samples 100 docids, I only took a slice of 2 bytes from that hash.
I encode these 2-byte-hash-slice as, for instance, hexadecimal strings. This yields docids in the form "fc48", "7f82", ....
I use the 1-query-on-100-docids experiment as a baseline to measure its top1 accuracy.

I repeat this experiment with different encodings. A full list, including some result metrics, is shown below.
(Vocab Size: Number of distinct tokens to encode 100 docids; Vocab. Sim: Mean pairwise euclidian distance between all tokens in vocabulary)

Encoding	Description	Example	Accuracy	Mean Token Length	Vocab. Size	Vocab. Sim.
original	Original ORCAS docid	D972207	0.3360	4.23	146	408
dec	Integer	9958	0.3345	2.73	131	416
dec_pad	Zero-padded integer	09958, 14382	0.4020	2.83	136	427
dec_x	Integer with spaced digits	9 9 5 8	0.2175	5.18	11	312
dec_x_pad	Zero-padded integer with spaces digits	0 9 9 5 8	0.2545	5.48	11	314
bin	Bitstring	111000000100	0.2700	5.59	21	417
bin_pad	Zero-padded bitstring	0000111000000100	0.3015	6.09	24	427
hex	Hexadecimal encoding	0b84	0.3600	3.62	93	439
base64	Base64 encoding	07s=	0.3400	4.25	109	435

Boxplot of token lengths with each encoding

🚀 In this experiment, we made a 7% performance increase over our original results just by choosing a different encoding for the docids

It seems to have an easier time with numbers. Maybe it is because there exist many tokens for compounded digits (69, 420, 2001, 1944), thus reducing in less tokens needed to represent a docid.
There is an incredible 7% difference between dec and dec_pad, showing again how consistency is key! As I have already learned when trying to mix magnet links with youtube URLs, the model really suffers with inconsistent formats. And if you don't believe that the consistent char length is reflected in consistent token length, check the boxplot! It's actually kind of crazy.

Another theory I have is that having predicted a number token, based on this context it is more likely to predict another number token, and that this might help performance a little.

It is perhaps also interesting to acknowledge that number tokens are semantically very similar to each other. That goes to say the tokens for 56 and 55, or even 21, are semantically very close. Indeed, if you do k-means clustering on all token embeddings and then look for the most dense clusters, they will contains years (2012, 2013, 2014) or other numbers. 💡

We might already be very close to what the perfect (or perfectly-enough) representation is. But it might be interesting, not just for this application, but also for the broader ML community, to investigate what representations an LLM works best with. @pneague and me were thinking of using ML (genetic algorithms in particular) to learn an optimal vocabulary for representing arbitrary string identifiers. 🌈

Edit: Looking at the results again, it might just be that the model favors a low but consistent token length. But more experiments need to be conducted.

synctext added type: long-term type: Epic labels Feb 10, 2023

synctext assigned mg98 Feb 10, 2023

synctext mentioned this issue May 15, 2023

BeyondFederated - truly decentralised learning at the edge #7254

Open

synctext mentioned this issue Aug 23, 2023

msc placeholder: LLM as a Google alternative #7438

Open

synctext mentioned this issue Sep 4, 2023

Phd Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586

Open

synctext mentioned this issue Oct 17, 2023

msc placeholder: 5G overlay infrastructure for decentralised learning ??? #7258

Open

synctext mentioned this issue Mar 8, 2024

phd topic placeholder - identity, reputation, trust, something #6519

Open

phd placeholder: IPFS crawl experience, metadata, and semantic gap awareness #7290

phd placeholder: IPFS crawl experience, metadata, and semantic gap awareness #7290

Comments

synctext commented Feb 10, 2023 • edited

mg98 commented Apr 24, 2023 • edited

synctext commented Apr 26, 2023 • edited

mg98 commented May 2, 2023 • edited

synctext commented May 3, 2023 • edited

mg98 commented May 8, 2023 • edited

synctext commented May 12, 2023 • edited

mg98 commented May 13, 2023

mg98 commented May 22, 2023 • edited

synctext commented May 31, 2023 • edited

mg98 commented Jun 2, 2023 • edited

Back to LTR

synctext commented Jun 6, 2023 • edited

mg98 commented Jun 11, 2023 • edited

mg98 commented Jul 12, 2023 • edited

synctext commented Jul 12, 2023 • edited

synctext commented Jul 13, 2023

synctext commented Jul 29, 2023

mg98 commented Aug 6, 2023 • edited

mg98 commented Aug 14, 2023 • edited

MovieLens Dataset

Idea

Approach

devos50 commented Aug 15, 2023

mg98 commented Aug 28, 2023 • edited

synctext commented Aug 29, 2023 • edited

SwarmLLM: collective LLM intelligence (with new AI phd expert??)

LLM as a key/value store

Rich metadata inside an LLM

Tribler: a public semantic search engine

Web3Search: Online Pairwise Learning to Rank by Divide-and-Conquer with full decentralisation

Foundations of Trust and Tags.

mg98 commented Aug 31, 2023 • edited

synctext commented Aug 31, 2023

mg98 commented Aug 31, 2023 • edited

mg98 commented Sep 6, 2023

SwarmLLM: collective LLM intelligence (with new AI phd expert??)

Web3Search: Online Pairwise Learning to Rank by Divide-and-Conquer with full decentralisation

Tribler: a public semantic search engine

mg98 commented Sep 10, 2023 • edited by synctext

synctext commented Sep 15, 2023 • edited

mg98 commented Sep 15, 2023 • edited

May the Best Metadata Win: Authorship for Perfect Metadata Curation

mg98 commented Sep 19, 2023 • edited

Notes on ✨ 1st paper idea ✨

mg98 commented Oct 3, 2023 • edited

✨ 1st paper idea ✨ Towards Privacy in Personalized Ranking Models in Decentralized Search

✨ 2nd paper idea ✨ May the Best Metadata Win: Authorship for Perfect Metadata Curation

✨ 3rd paper idea ✨ Decentralized OL2R: Risks and Challenges

✨ 4th paper idea ✨ Decentralized LLM

synctext commented Oct 3, 2023 • edited

mg98 commented Oct 16, 2023

synctext commented Oct 17, 2023 • edited

mg98 commented Oct 17, 2023 • edited

synctext commented Oct 20, 2023 • edited

mg98 commented Oct 22, 2023 • edited

Update on my experiments with Quantization

mg98 commented Nov 7, 2023 • edited

mg98 commented Nov 27, 2023 • edited

mg98 commented Dec 22, 2023 • edited

synctext commented Dec 22, 2023 • edited

synctext commented Jan 8, 2024 • edited

mg98 commented Jan 23, 2024 • edited

Queries Is All You Need

synctext commented Jan 30, 2024 • edited

mg98 commented Mar 7, 2024

💡 More popular documents have a more diverse semantic range of queries than more niche documents.

💡 Second finding is that the model often hallucinates, and beam search sometimes duplicates results.

mg98 commented Apr 8, 2024 • edited

Neural Corpus Indexer (NCI) [PDF]

DSI++ [PDF]

GenRet [PDF]

Other Interesting Papers for Future Work

synctext commented Apr 10, 2024 • edited

mg98 commented Apr 12, 2024 • edited

"learn to tokenize documents"

roadmap on life-long decentralised learning [...] what is the first realistic next step?

synctext commented Feb 10, 2023 •

edited

mg98 commented Apr 24, 2023 •

edited

synctext commented Apr 26, 2023 •

edited

mg98 commented May 2, 2023 •

edited

synctext commented May 3, 2023 •

edited

mg98 commented May 8, 2023 •

edited

synctext commented May 12, 2023 •

edited

mg98 commented May 22, 2023 •

edited

synctext commented May 31, 2023 •

edited

mg98 commented Jun 2, 2023 •

edited

synctext commented Jun 6, 2023 •

edited

mg98 commented Jun 11, 2023 •

edited

mg98 commented Jul 12, 2023 •

edited

synctext commented Jul 12, 2023 •

edited

mg98 commented Aug 6, 2023 •

edited

mg98 commented Aug 14, 2023 •

edited

mg98 commented Aug 28, 2023 •

edited

synctext commented Aug 29, 2023 •

edited

mg98 commented Aug 31, 2023 •

edited

mg98 commented Aug 31, 2023 •

edited

mg98 commented Sep 10, 2023 •

edited by synctext

synctext commented Sep 15, 2023 •

edited

mg98 commented Sep 15, 2023 •

edited

mg98 commented Sep 19, 2023 •

edited

mg98 commented Oct 3, 2023 •

edited

synctext commented Oct 3, 2023 •

edited

synctext commented Oct 17, 2023 •

edited

mg98 commented Oct 17, 2023 •

edited

synctext commented Oct 20, 2023 •

edited

mg98 commented Oct 22, 2023 •

edited

mg98 commented Nov 7, 2023 •

edited

mg98 commented Nov 27, 2023 •

edited

mg98 commented Dec 22, 2023 •

edited

synctext commented Dec 22, 2023 •

edited

synctext commented Jan 8, 2024 •

edited

mg98 commented Jan 23, 2024 •

edited

synctext commented Jan 30, 2024 •

edited

mg98 commented Apr 8, 2024 •

edited

synctext commented Apr 10, 2024 •

edited

mg98 commented Apr 12, 2024 •

edited

mg98 commented May 7, 2024 •

edited