BeyondFederated - truly decentralised learning at the edge #7254

Open
synctext opened this issue Jan 11, 2023 · 40 comments

@synctext
Member

synctext commented Jan 11, 2023

Started full-time thesis around April/May 2023.

Track DST, Q3/4 start. Still "seminar course" ToDo. Has superapp/MusicDAO experience. Discussed topics as diverse as the digital Euro and a Web3 search engine (unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance). Completed the Machine Learning I class. (Background: Samsung solution, ONE (On-device Neural Engine): a high-performance, on-device neural network inference framework.)

Recommendation or semantic search? Alternative direction. Some overlap with the G-Rank follow-up project. Essential problem to solve: learning valid Creative Commons Bittorrent swarms.

from torch import nn
from d2l import torch as d2l


class Seq2SeqEncoder(d2l.Encoder):
    """The RNN encoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)

    def forward(self, X, *args):
        # The output `X` shape: (`batch_size`, `num_steps`, `embed_size`)
        X = self.embedding(X)
        # In RNN models, the first axis corresponds to time steps
        X = X.permute(1, 0, 2)
        # When state is not mentioned, it defaults to zeros
        output, state = self.rnn(X)
        # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
        # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state

Second sprint (strictly exploratory):

  • get musicDAO running from source.
  • read at least 4 master thesis works from the lab

Doing information retrieval msc course to prepare for this thesis

Literature survey initial idea: "nobody is doing autonomous AI" {unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance}.

@synctext synctext changed the title msc placeholder: brainstorm on thesis direction msc placeholder: brainstorm on thesis direction (Deployment of Gold-Rank?) Feb 20, 2023
@synctext synctext changed the title msc placeholder: brainstorm on thesis direction (Deployment of Gold-Rank?) msc placeholder: brainstorm on thesis direction (Web3 recommender with conceptual drift) Feb 20, 2023
@synctext
Member Author

synctext commented Mar 24, 2023

@quintene

quintene commented Apr 3, 2023

To create a suggestion model with neural hashes using metadata as input to find songs in Creative Commons BitTorrent swarms:

  1. Collect (Scrape) a dataset of songs and their corresponding metadata from Creative Commons BitTorrent swarms.
  2. Use the neural network to generate neural hashes for each song in the dataset. These neural hashes would represent each song as a high-dimensional vector that captures its features and characteristics.
  3. DOING: Research in the distribution of hashes;
    possible directions:
    3.a Each node would have a copy of the neural network and the neural hashes for some subset of the songs. The distribution of songs could be done based on some criteria like proximity or similarity of neural hashes, for example.
    3.b Use of a multi-index hashing scheme.
  4. When a user wants to search for songs based on a given input genre, the query would be broadcasted to all nodes. Each node would then perform a nearest neighbor search on its own subset of the neural hashes to find the songs that are most similar to the input metadata.
  5. The results from each node would be collected and combined to generate a list of suggested songs. This could be done by taking the top N results from each node, and then combining them based on their relevance or popularity.

Optionally, improve the model over time: track which songs are actually downloaded or listened to by users, and use this data to train the model to improve its suggestions.
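
Steps 4 and 5 above can be sketched roughly as follows. This is only an illustration under assumptions the thread does not fix: the neural hashes are treated as short binary codes compared by Hamming distance (the setting multi-index hashing targets), nodes are plain dictionaries, and `hamming`, `local_top_n` and `broadcast_search` are hypothetical names.

```python
import heapq
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary hashes."""
    return bin(a ^ b).count("1")


def local_top_n(shard: Dict[str, int], query: int, n: int) -> List[Tuple[int, str]]:
    """Step 4: nearest-neighbour search over one node's subset of the hashes."""
    scored = ((hamming(code, query), song) for song, code in shard.items())
    return heapq.nsmallest(n, scored)


def broadcast_search(shards: List[Dict[str, int]], query: int, n: int) -> List[str]:
    """Step 5: query every node, then merge the per-node results by distance."""
    best: Dict[str, int] = {}
    for shard in shards:
        for dist, song in local_top_n(shard, query, n):
            # A song may be held by several nodes; keep its smallest distance.
            best[song] = min(dist, best.get(song, dist))
    ranked = sorted(best.items(), key=lambda kv: (kv[1], kv[0]))
    return [song for song, _ in ranked[:n]]


# Two hypothetical nodes, each holding a subset of made-up 8-bit hashes.
node_a = {"Firebird": 0b10110010, "Ego": 0b01001101}
node_b = {"Deconstruct": 0b10110110, "Firebird": 0b10110010}
results = broadcast_search([node_a, node_b], query=0b10110011, n=2)
print(results)
```

Merging by raw distance is the simplest choice; combining by relevance or popularity, as suggested in step 5, would replace the sort key.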

@synctext
Member Author

synctext commented Apr 3, 2023

Proposal: a dedicated sprint to implementing a basic search engine.

@quintene

quintene commented Apr 24, 2023

https://colab.research.google.com/drive/1j_voFtr6j0gEStsMfcafi9FV5XJOLxjj?usp=sharing

  1. Scraping metadata
    [
    "Cullah Firebird electronic folk soul",
    "Serious Mastering Ego electronic",
    "Serious Mastering La chaleur du soleil electronic",
    "Oxidant Deconstruct hardcore.punk powerviolence punk",
    ...
    ]

  2. Translate into embeddings

  3. Compare embeddings using cosine similarity
    query: ['Firebird']
    similarity score: 0.6434079439455619
    Cullah Firebird electronic folk soul

    query: ['electronic']
    similarity score: 0.4482832368649311
    Serious Mastering Ego electronic

    similarity score: 0.3406708597897247
    Serious Mastering La chaleur du soleil electronic
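
A minimal sketch of steps 2 and 3, for readers without access to the notebook. The embedding here is a stand-in (lower-cased token counts), not the model used in the Colab, so the scores will differ from the ones printed above; `embed`, `cosine_similarity` and `search` are illustrative names.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: lower-cased token counts (NOT the notebook's model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, corpus: List[str]) -> List[Tuple[float, str]]:
    """Return (score, entry) pairs with a non-zero score, best match first."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(entry)), entry) for entry in corpus]
    return sorted((s for s in scored if s[0] > 0), reverse=True)


corpus = [
    "Cullah Firebird electronic folk soul",
    "Serious Mastering Ego electronic",
    "Serious Mastering La chaleur du soleil electronic",
    "Oxidant Deconstruct hardcore.punk powerviolence punk",
]
for score, entry in search("electronic", corpus):
    print(round(score, 3), entry)
```

With a learned embedding only `embed` would change; the ranking logic stays the same.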

@synctext
Member Author

synctext commented Apr 24, 2023

  • solid progress in your part-time thesis startup.
  • clean scrape now done sample.json
  • make a simple android .APK that installs on my phone, new superapp icon
  • it can search in strings, show matches, and {random} rank
  • focus on functional, not yet on efficiency or ease-of-use.

@quintene

quintene commented May 15, 2023

APK including:

wetransfer link (118mb) https://we.tl/t-pnugzyNiRV

@synctext
Member Author

synctext commented May 15, 2023

Question: how impressed/intimidated/confused are you about recent ML/LLM/Diffusion explosion?
@quintene answer: innovation speed is fast/sophisticated due to everybody building on top of each other.
Johan note: What does a leaked Google memo reveal about the future of AI?
Question: how to identify and follow a long-enduring winner? 1) Alpaca on Pixel7, or 2) MLC Android or 3) https://github.com/BlinkDL/RWKV-LM
@quintene answer: Nobody has solved the magic architecture of decentralised learning! Personalised model, how to partition, can we re-use the "decentralisation layer" across the whole ML domain? Current limited approach: one dataset, one application. "dynamic distributed learning". johan note goal: non-i.i.d?

@quintene

quintene commented Jun 5, 2023

Have not found any resources on approaches that work without a central server. Do they exist?

  • Related work:

https://we.tl/t-0ffNeOjjJO

{
"artist": "Cullah",
"title": "Firebird",
"author_image": "https://images.pandacontent.com/artist/12/250x250/2-cullah.jpeg?ts=1675708015",
"author_description": "MC Cullah is a producer/singer/songwriter/rapper from Milwaukee, Wisconsin. His music is lost somewhere in between Rock -n- Roll, Electronica and Hip Hop with a pinch of psychedelic melodies. With an arsenal of synthesizers and a library of forgotten sounds he manages to create something that sparks imagination and wonder.",
"author_upcoming": [
{
"context": "https://schema.org",
"type": "MusicEvent",
"startDate": "2023-06-15T00:00:00+00:00",
"offers": "https://www.songkick.com/concerts/41175136-cullah-at-radio-milwaukee-889-fm",
"name": "Radio Milwaukee 88.9 FM",
"location": {
"type": "PostalAddress",
"addressLocality": "Milwaukee, WI, US"
}
}
],
"year": "2022",
"tags": [
"electronic",
"folk",
"soul"
],
"artwork": "https://images.pandacontent.com/release/779/250x250/1-firebird.jpeg?ts=1675708399",
"magnet": "magnet:?xt=urn:btih:O2NCAP26N63U7VK6LSCXNVR3VV3ODILA&tr=udp%3A//tracker.pandacd.io%3A2710&dn=Cullah%20-%20Firebird%20%282022%29%20-%20MP3",
"songs": [
"The Feather",
"Firebird Credits",
"The Golden Apple",
"The Anima",
"The King"
]
},

@synctext
Member Author

synctext commented Jun 5, 2023

  • Working "Peer AI" superapp !
  • Following the master course: Seminar on distributed machine learning systems (by Lydia)
  • The following Creative Commons music tags are available for bootstrap purposes (see other Tribler music issues):
  • Draft thesis title: "BeyondFederated: truly decentralised learning at the edge"
    • search without servers is unsolved. Only theory for 25 years. No proven solutions or actual usage.
    • academically pure: decentralised, self-organised systems
    • only use the User-Tag-Item matrix for recommendation and content discovery please.
  • ToDo: next sprint also make a 1 page Problem Description and read prior master thesis articles

@quintene

quintene commented Jul 6, 2023

  • "Peer AI": Refactoring on earlier work "Vectorization from scratch in Kotlin" creating a Searcher model using ScaNN within tensorflow (mobile).

  • Research goal: Train/Share above model within P2P environment considering significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers.

  • Research into related work (fully federated learning approaches)

  • Seminar on distributed ML systems: Working on a project applying differential privacy within federated learning where attacks are executed. paper

  • Writing on Problem Description.

thesisproblem_iteration_qvaneijs.pdf

@synctext
Member Author

synctext commented Jul 6, 2023

  • BeyondFederated Nice title!
  • Developing a music search engine within a peer-to-peer (P2P) network presents significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers. These factors add complexity to the task of building an efficient and reliable music search engine within a decentralized environment. Suggestion: whole storyline on "each peer only has a partial view of the network. No central viewpoint exists with the complete overview. This severely impacts the possible solutions. None of the traditional mechanisms are able to function in this leaderless environment. Traditional solutions all assume a client/server or single ownership entity. We need self-organisation."
  • Good find on Papaya. More related work exists than I was aware of. Peer-to-peer Federated Learning on Graphs
  • "Swarm Learning", very catchy term. SwarmLearning2 ?
  • Please avoid content analysis and homomorphic crypto
  • Very good dataset choice. Just do your thesis on Pandacd (with Bitcoin wallet for artists!!!) and FMA. FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
  • Semantic search as cardinal focus of your thesis
    • collaborative filtering uses a traditional 1-dimensional similarity: Pearson similarity, Cosine similarity, Euclidean distance, Manhattan distance, etc.
    • LLM-like hundreds of dimensions
    • Beyond simple: user-item matrix
    • Semantic clustering: Each user has preference for certain items, only knows about a limited set of items, and has a model that only knows a limited set of items with a bias for semantic similarity. All users together know every item.

Ideal sprint outcome for 15 Aug: an operational PeerAI .APK with minimal viable "TFLite Searcher model" with PandaCD and FMA. Focus: genre similarity. Adding new item, fine-tuning, exchanging models is still out of scope. Let's get this operational first!

@quintene

quintene commented Aug 16, 2023

Modelling "TFLite Searcher model" using subset of dataset using only: Title, Artist, Genre/Tag, Album.

  • Generate embeddings for the data with a custom model that produces a multi-dimensional vector space corresponding to the previously mentioned dataset. Current embedding models mainly focus on larger texts, such as small paragraphs, to generate vectors used in semantic search. The goal here is to generate multidimensional vectors that keep multiple metadata attributes in the same dimensions.

Training goal: creating vectors where similar attributes have smaller distances.

Porting everything into Kotlin.

  • Trained model used in kotlin using Tensorflow Lite
  • Implementing ANNOY (rust) bindings into Kotlin app.
    MVP: Input Query -> Embedding using custom model -> finding semantic similarities from Approximate Nearest Neighbors.

@synctext
Member Author

synctext commented Aug 16, 2023

  • Still very much in exploratory phase!?
  • Big picture is still unclear, unsupervised approach? What are you optimising? learning phase? Cost function?
  • Approximate search, cluster search, vector cloud.. What master thesis figure will convey to the reader that everything works and is brilliant?
  • Keep the semantic search simple. Maximum 2 weeks to wrap up that part. Then focus on BeyondFederated full self-organisation with IPv8 gossip community. Spread new content items and news genre-overlap info.
  • {repeating} Ideal sprint outcome for 15 Aug: an operational PeerAI .APK with minimal viable "TFLite Searcher model" with PandaCD and FMA. Focus: genre similarity. Adding new item, fine-tuning, exchanging models is still out of scope. Let's get this operational first!
  • Write 1 page Problem Description {update to more intro-level existing draft}

@synctext synctext changed the title msc placeholder: brainstorm on thesis direction (Web3 recommender with conceptual drift) msc placeholder: BeyondFederated - truly decentralised learning at the edge Aug 18, 2023
@synctext synctext changed the title msc placeholder: BeyondFederated - truly decentralised learning at the edge BeyondFederated - truly decentralised learning at the edge Aug 18, 2023
@synctext
Member Author

synctext commented Aug 18, 2023

Please ensure to cite this work in your thesis, AI Benchmark: All About Deep Learning on Smartphones in 2019. Website of ETH-Z on-device AI benchmarking, includes S23 results.

UPDATE: Youtube contains more content than FMA and PandaCD. Great datasets exist. See Youtube player you could connect to your thesis focus of BeyondFederated content search with actual playable content.

Please load this URL_Youtube into Kaggle and check it out. 20230 unique music videos to recommend by 2079 artists! This would impact your work and disconnect it more from the MusicDAO code. {brainstorm input: any Youtube&magnet playback of both video or music. 116098 "music video URLs" inside this Youtube-8M dataset with annotations from a diverse vocabulary of 3,800+ visual entities for semantic search}

@quintene

quintene commented Sep 4, 2023

Neural Instant Search for Music and Podcast

Finished the model design where the final .tflite model will consist of an embedder model and a ScaNN layer.
Translating Dataset Key + metadata setup would transform the user's query input (title, genre, author) into a vector and search for the closest vectors available in the network.

The model output consists of the closest neighbors including all the metadata of the dataset.
Also exploring replacing the sentence encoder model with a (song) object embedding model.
Model updates are available on the user's end device; next goal would be to distribute model changes in fully decentralized FL.
https://blog.tensorflow.org/2021/11/on-device-training-in-tensorflow-lite.html

However, currently stuck on implementing the TFLite Model Maker library for on-device ML applications, which is needed to create the first version of the model.
This is just needed to translate the collected datasets into a .tflite model. Can't get it running; currently looking for other solutions...
https://pypi.org/project/tflite-model-maker/

Goal: Self learned semantic network with 100k items

Todo: mention research in paper: https://arxiv.org/abs/1908.10396

@synctext
Member Author

synctext commented Sep 4, 2023

  • ToDo: meeting with @qstokkink and the new phd on learn-to-rank.
    • GOAL: BeyondFederated - a fully decentralised semantic search machine learning
    • Thesis pictures: Cluster vector space and zoom into "rock", "hard rock", "soft rock", and "gothic"
    • the absolute performance is totally not interesting! It's about the unique architecture and running code
  • solid architectural and conceptual progress on your thesis!
    • as simple as possible. Also implement cosine-similarity, euclidean distance, Jaccard coefficient
    • Compare to 5, 10, 20, 40, and 100 dimension results
    • What is your measure of success?
    • no ground truth
  • TFLite Model Maker. Big dragon to slay. Never give up? 🐲 ⚔️ 🐲
  • Player solution with Native Android WebView usage (no full JavaScript I guess).
  • Conceptual dream goal:
    • Fully decentralised, CHOOSE:
    • Youtube and also Tiktok are central Big Tech services, but today they still are the monopoly sources of content.
    • Universal player idea: use the platform aggregation trick against Big Tech. Play any content from the network on a mobile phone. Support Youtube, Tiktok, Netflix, Spotify, IPFS, and torrents.
    • Contribution is as easy as creating a Tiktok video. Authentic amateur content as a first-class citizen. Against mega studios and hyper commercial influencers.
    • On-device AI. AI-centric universal player. Decentral AI is based on your incremental model update in TFLite.
    • Next Master thesis {future work}: beyond semantic search, semantic clustering. 100k items on each on-device AI are personalised, depending on the {evolving} taste of the user. TikTok inspired, superior content discovery and enabler of deep long-tail community content.
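
The three measures named above (cosine similarity, Euclidean distance, Jaccard coefficient) are small enough to implement directly. A hedged sketch, assuming items are described either by a dense vector or by a set of tags; the tag vocabulary and the two example items are taken from the scraped metadata earlier in the thread, everything else is illustrative.

```python
import math
from typing import Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """1.0 for vectors pointing the same way, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_coefficient(a: Set[str], b: Set[str]) -> float:
    """Shared tags divided by all tags used by either item."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Tags over the vocabulary [electronic, folk, soul], as one-hot style vectors.
firebird_tags, ego_tags = {"electronic", "folk", "soul"}, {"electronic"}
firebird_vec, ego_vec = [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]

print(cosine_similarity(firebird_vec, ego_vec))
print(euclidean_distance(firebird_vec, ego_vec))
print(jaccard_coefficient(firebird_tags, ego_tags))
```

For the 5 to 100 dimension comparison only the vector length changes; the functions are dimension-agnostic.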

@quintene

quintene commented Sep 25, 2023

  • Implemented search model into Kotlin [Embedder + ScaNN].tflite on youtube dataset. (apk)
  • Testing custom Embedding vs Sentence Encoding model layer.
  • Adding new items and gossiping adds a new challenge, since the ScaNN layer in the model keeps track of its index in LevelDB format, meaning a rebuild of the index is required to distribute new items/audio. Also, this layer is more of a traditional "database" layer; gossiping of gradients in Federated Learning will require more clicklog-based recommendations to share gradients and average this data in a fully decentralized environment.
  • Adding youtube player to display results.

TFLite Model Maker: Not able to build tflite_model_maker since a lot of dependencies were conflicting. Resolved by custom Dockerfile with manual build steps including other libraries. (also repo update since 2 weeks)

image

Model Image
metadata of ScaNN layer

{ "associated_files": [ { "name": "on_device_scann_index.ldb", "description": "On-device Scann Index file with LevelDB format.", "type": "SCANN_INDEX_FILE" } ] }

Key decision Learning: Determine what exactly is learned by connected clients (a rebuilt custom index vs clicklog gradients/recommendations, where items are ranked higher in search based on the ClickLog, a.k.a. popular audio ranked higher).

Decentralized Learning todo;

  • Rebuild Index / Extend on device training function with 'gossip gradients'.
  • Define initial state when a user does not have a (tflite)model available or an older version. (initial thoughts: share latest version by connected peers)
  • App engineering visual updates

@synctext
Member Author

synctext commented Sep 25, 2023

  • Scientific architecture: loss function minimisation of (mis-)clicks on recommended items.
    • self-supervised learning. ToDo: model, feedback, loss and architecture.
    • Personalised learning and semantic search.
    • cold start problem: first usage, no profile yet. Recommend most popular discovered items.
    • Frontpage is simply most popular discovered items, based on ClickLog gossip. After first seconds we will only obtain a few clicklogs, so not much global popularity discovery yet. Only if we decide later to bias the gossip with 1 item in top-50 with each outgoing message.
    • New content item discovery is most simple possible with the gossip layer. Each incoming ClickLog message may contain numerous new discovered items. Simply use this as primary discovery method.
    • No security yet. Later master thesis projects can augment your work with MeritRank and web-of-trust against spam, fraud, and Sybil attacks.
    • {again} keep it as simple as possible for future enhancement and re-write.
    • Bonus Semantic clustering. Connect to several peers in IPv8 with most similarity to our personal ClickLog.
    • Warning: sending ClickLog beyond the 1492 Bytes UDP limit. You get into EVA binary transfer hell, avoid at all cost!
  • Cardinal question: can this scale to TikTok size or will it implode. See IPFS performance
  • hot topic: rebuilding index on Android, post by Quinten :-)
  • 281 MByte app works!
    • very fast and responsive
    • Great start
    • Plays Youtube content, very disruptive ❗
    • Especially content discovery of unlisted Youtube URLs without advertisements ❗ ❗
  • Big Tech is increasingly under attack. See new book from left-wing viewpoint end-of-capitalism called Technofeudalism. This is bit political, feel free to leave activism out of your thesis. Your show 👯‍♂️
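
The cold-start frontpage and the gossip-based item discovery described above could be prototyped along these lines. A rough sketch only: a ClickLog message is reduced to a list of clicked URLs, the `yt:` identifiers are placeholders, and `PopularityIndex` is a hypothetical name, not existing Tribler or IPv8 code.

```python
from collections import Counter
from typing import Iterable, List, Set


class PopularityIndex:
    """Tracks clicks per item from gossiped ClickLog entries."""

    def __init__(self) -> None:
        self.clicks: Counter = Counter()

    def on_clicklog(self, clicked_urls: Iterable[str]) -> Set[str]:
        """Merge one incoming message; return the items seen for the first time."""
        urls = list(clicked_urls)
        new_items = {u for u in urls if u not in self.clicks}
        self.clicks.update(urls)
        return new_items

    def frontpage(self, k: int) -> List[str]:
        """Cold-start recommendation: the k most-clicked items discovered so far."""
        return [url for url, _ in self.clicks.most_common(k)]


index = PopularityIndex()
first = index.on_clicklog(["yt:a", "yt:b"])   # both items are newly discovered
second = index.on_clicklog(["yt:b", "yt:c"])  # only yt:c is new
print(sorted(first), sorted(second), index.frontpage(1))
```

Every incoming message doubles as a discovery event, which keeps a single code path for popularity and for new-item discovery.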

@quintene

quintene commented Oct 11, 2023

  • Hard time transforming the LevelDB index within the metadata of the .tflite model; this transformation is required to work with new items. Index is built based on the whole dataset.
  • TODO: Rebuild ScaNN index. Updating the 100-dimension vectors is hard. Requires the whole dataset.
  • It only supports classic recommendation (inference) without extending the learning space. Learning is done through an existing set of items. When new items emerge, the current implementation lacks "learning" (adding new vectors to the learning space) to execute inference. (Why is it called Scalable Nearest Neighbors?) Scalability is not defined by adding new items.

image

image

Goal for upcoming days: Scale Scalable Nearest Neighbors.
Create the first ScaNN indexer without needing to rebuild a new index based on the whole dataset and only with the ScaNN library running in some python code.
The indexer will:

  • Create a new index based on new vectors(either user vector or audio vector).
  • Update the model's metadata such that it could be applied for local learning.

@synctext
Member Author

synctext commented Oct 11, 2023

  • focus on single .APK, first get running code, push code!! and only then focus again on decentral AI
  • Failure to scale of "Scalable Nearest Neighbors" by Google Research
    • Static Nearest Neighbors with scalability 🤔 (better fitting name)
    • no update of vector space
    • TFLite does not support this!?!
    • No on-device AI (only exists in simple recommenders and PR blog posts 😃 )
    • Awesome challenge for a young master student to fix 💥 🎉 You found a solid thesis project
    • BeyondFederated: first deployed true Decentralised Artificial Intelligence
  • Master thesis level picture!! Solid progress as usual.
  • Solidly stuck now 2 times, no problem. The core of decentral AI was never going to be a 2 sprint thingie.
  • Keep it simple! Keep it as simple as possible. No sync operation, no dedicated bootstrap, or block swap sync. Just a single ClickLog message to parse. Only a single method for both empty new peers and mere updates. You can simply ask peers for more ClickLog messages, insert them in vector space, update structs, and repeat. This gives natural rate control and avoids congestion. Emergent effect is quick, simple, and full-speed (cpu, IO, networking) bootstrapping.
  • accept gossip for users who are relatively close. Please avoid any bias at this moment, keep it simple, measure, performance analysis, then tune! Keep decentral global search alive, not current narrow taste bias. Close to 2006 Buddycast: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=423253720d670bba267639473af8697e44f19879
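
The "ask, insert, repeat" loop described above might look roughly like this. It is a toy, synchronous simulation: `Peer`, `handle_request` and `bootstrap` are hypothetical names, the ClickLog is reduced to a (query, clicked item) pair, and real IPv8 messaging (asynchronous, over UDP) is not modelled.

```python
import random
from typing import Callable, List, Set, Tuple

ClickLog = Tuple[str, str]  # (query, clicked item); simplified stand-in


class Peer:
    def __init__(self, logs: List[ClickLog], rng: random.Random) -> None:
        self.logs = logs
        self.rng = rng

    def handle_request(self) -> ClickLog:
        """Answer a pull request with one randomly chosen ClickLog (no bias)."""
        return self.rng.choice(self.logs)


def bootstrap(neighbours: List[Peer], rounds: int,
              insert: Callable[[ClickLog], None]) -> Set[ClickLog]:
    """Pull one message per round. The next request is only issued after the
    previous reply has been processed, which bounds the request rate."""
    known: Set[ClickLog] = set()
    for i in range(rounds):
        log = neighbours[i % len(neighbours)].handle_request()
        if log not in known:   # same path for empty new peers and mere updates
            known.add(log)
            insert(log)        # e.g. add the clicked item to the vector space
    return known


rng = random.Random(7)
peer = Peer([("rock", "yt:a"), ("folk", "yt:b")], rng)
inserted: List[ClickLog] = []
known = bootstrap([peer], rounds=20, insert=inserted.append)
print(len(known), "distinct ClickLogs inserted")
```

Because the requester drives the loop, a slow device simply asks less often; no separate sync or bootstrap protocol is needed.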

@quintene

quintene commented Nov 1, 2023

  • Hot reloading vector spaces works.
    Model architecture remains intact while it is possible to swap between different vector spaces, either from a whole new dataset or an existing adjusted vector space. Next step: single item add and remove (incremental) absl::btree_map<std::string, size_t> ordered_partition_key_to_index;
  • If I have the on-device learning working, I will be able to make this a web3 app. Bootstrapping sparse data on new items for on-device learning.
  • Finally able to overwrite the vector space of on-device models, which allows the space to be extended with new gossiped items. However, to make this happen I had to extend the tensorflow-lite API, which requires Java bindings to native C++ code.
  • Extending with appendToVectorSpace(String key, Value), which 1. encodes the new item, 2. builds the index from the initialized underlying LevelDB instance, 3. converts the index to a FlatBuffer, which is then zipped and, instead of append-only support, now also allows overwrites.
  • Allow Java code with native JNI binding to compiled C++.
  • When successful we have ON_DEVICE expansion of the vector space, such that we allow learning in a dynamic environment where new items are gossiped by neighbors.
  • Future Work: Performance analysis, native on-device learning vs vector database.

Realizing this will be a big contribution towards Tflite-support

  • TODO(b/180502532): add pointer to example model.

  • TODO(b/222671076): add factory create methods without options, such as `createFromFile`, once the single file format (index file packed in the model) is supported.

fat Android build for multiple architectures: x86,x86_64,arm64-v8a,armeabi-v7a
Successful builds/compiles target /java/src/java/org/tensorflow/lite/task/text/task-library-text.aar which includes custom API tasks such as extend index convert to buffer and replace in model metadata + Pack the associated index files

@synctext
Member Author

synctext commented Nov 1, 2023

  • {repeating} focus on single .APK, first get running code, push code!! and only then focus again on decentral AI.
  • Stable (281MByte) .APK??
  • JNI hell
    • Yet again fighting a known difficult issue
    • Before it was: 1) TFLite Model Maker build conflicts 2) rebuilding leveldb index
  • full stack engineering. Lots to keep in your head: native code Kotlin versus cpp. TensorFlow Lite bindings. Android NDK models. 100-dimensions world vectors. LevelDB storage model. Scalable Nearest Neighbors efficient search. content-search user model. Web3 GUI of Youtube
  • Understanding this full stack + zero-comments code by Google Research Lab might be too much for average students. This is a severely challenging thesis project, requires significant engineering talent. 🚀 🏅
  • New item discovery, fresh item problem, cold start, Bootstrapping of recommenders are all known problems. Fresh Content Needs More Attention: Multi-funnel Fresh Content Recommendation
  • ClickLog design idea {Query, Youtube-clicked-URL,Youtube-clicked-title,Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}
  • Final toolchain will be very fragile. Breaks if you version upgrade a single ARMv7 dep, NDK version, SDK of Superapp. Docker distribute @ Linux without Windows10?
  • December wrap-up, bundle results and graphs
  • Final sprint: PR on official TF-lite Github repo??? Small blog/Substack post to explain? Obviously an arXiv of thesis! Deployment of Superapp with Web3 Youtube and distributed AI?
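
The ClickLog design idea above maps naturally onto a small record type. A sketch under assumptions: field names are derived from the listed fields, compact JSON is used as an illustrative wire format (the actual serialisation is not specified in this thread), the 1492-byte budget is the limit quoted earlier in the thread, and all example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass

MAX_MESSAGE_BYTES = 1492  # size limit quoted earlier in this thread


@dataclass(frozen=True)
class ClickLog:
    query: str
    clicked_url: str
    clicked_title: str
    clicked_views: int
    not_clicked_url: str
    date: str
    shadow_signature: str

    def encode(self) -> bytes:
        """Serialise to compact JSON (assumed wire format, for illustration)."""
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")

    def fits_single_message(self) -> bool:
        """True if the encoded record stays within the single-message budget."""
        return len(self.encode()) <= MAX_MESSAGE_BYTES


log = ClickLog(
    query="firebird",
    clicked_url="https://example.org/watch?v=clicked",      # placeholder URL
    clicked_title="Cullah - Firebird",
    clicked_views=1000,                                      # made-up value
    not_clicked_url="https://example.org/watch?v=skipped",  # placeholder URL
    date="2023-11-01",
    shadow_signature="00" * 32,                              # dummy signature
)
oversized = ClickLog("x" * 2000, "", "", 0, "", "", "")
print(len(log.encode()), log.fits_single_message(), oversized.fits_single_message())
```

Checking the size before sending is a cheap way to stay clear of the large-transfer path; any protocol headers would reduce the usable budget further.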

@quintene

quintene commented Nov 22, 2023

Extending TFLite Support with custom API calls (On-Device Scann C++).

  • Being able to modify LevelDB entries
  • Finally working towards training the model with the expanded vector space.

Currently focussing on training ScaNN. A single-layer K-Means tree is used to partition the database (index), which I am now able to modify. The model is trained on forming partition centroids (as a way to reduce search space). In the current setup new entries are pushed into the vector space, but determining which partition they should appear in (closest to certain partition centroids) is hard.

Job to be done: rebuilding partitions.

INDEX_FILE
E_X: an actual partition including compressed vectors
INDEX_CONFIG: config of embedding dimensions etc.
M_Y: metadata entry

For a dataset of N items, the number of partitions X should be around SQRT(N) to optimize performance.

No train method exposed in current model setup, so another API call to expose. Either:

  • Build k-means tree with new item -> Decentralized K-Means clustering? Replace partitions with new clustered partitions.
  • Expose train function and export clusters, save them to partition file in index.

Non perfect insert works until approximately 100k items, where new embeddings are inserted to closest partition centroids.
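
The imperfect insert can be illustrated in a few lines. This is a simplified sketch, not the ScaNN implementation: partitions are plain Python lists instead of LevelDB entries, vectors are uncompressed, squared Euclidean distance is assumed for the centroid lookup, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def suggested_partition_count(n_items: int) -> int:
    """Rule of thumb from the thread: about sqrt(N) partitions for N items."""
    return max(1, round(math.sqrt(n_items)))


def nearest_centroid(centroids: List[Vector], vec: Vector) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(c, vec))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def imperfect_insert(partitions: Dict[int, List[Vector]],
                     centroids: List[Vector], vec: Vector) -> int:
    """Append vec to the partition of its closest centroid. Centroids stay
    frozen: they are NOT re-trained, so partitions may drift out of balance."""
    idx = nearest_centroid(centroids, vec)
    partitions.setdefault(idx, []).append(vec)
    return idx


centroids = [[0.0, 0.0], [10.0, 10.0]]   # pretend these came from training
partitions: Dict[int, List[Vector]] = {}
first = imperfect_insert(partitions, centroids, [1.0, 2.0])
second = imperfect_insert(partitions, centroids, [9.0, 8.0])
print(first, second, suggested_partition_count(100_000))
```

The insert itself is cheap; the cost is paid later, when partitions grow unevenly and lookups have to scan larger lists.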

Older nearest neighbor paper by google

In case of an on-device limitation on recreating the whole index including the new partitions and centroids, an interesting research direction is Fast Distributed k-Means with a Small Number of Rounds.

Research question shift towards: "How can the efficiency and effectiveness of SCANN be enhanced through novel strategies for dynamically adding entries, specifically focusing on the adaptive generation of K-Means tree partitions to accommodate evolving datasets while maintaining optimal search performance?"

This research question addresses the challenge of adapting SCANN, a scalable nearest neighbors search algorithm, to handle dynamic datasets. The focus is on developing innovative approaches for adding new entries in a way that optimizes the generation of K-Means tree partitions, ensuring efficient search operations as the dataset evolves.

"Evolving datasets" is key in a fully decentralized (on-device) vector space: there is no central entity to re-calculate all the necessary partitioning/indexing.

TODO for next sprint; Focus on frozen centroids and imperfect inserts. !Keep it simple!

Also implement a recommendation model. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. In TensorFlow Recommenders, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the titles of top entries for that user.

Searching the vector space with a given query will retrieve the top-k results.
Next, we use this data not only for retrieving top items but also to train our User-Song recommendation model.

We then train with a loss based on: {Query, Youtube-clicked-URL,Youtube-clicked-title,Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}

from typing import Dict, Text

import tensorflow as tf
import tensorflow_recommenders as tfrs


class UserSongModel(tfrs.Model):  # class name is illustrative
    def __init__(self, user_model, song_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.song_model: tf.keras.Model = song_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the song features and pass them into the song model,
        # getting embeddings back.
        querylog_song_embedding = self.song_model(features["Youtube-clicked-title"])

        # The task computes the loss and the metrics.
        return self.task(user_embeddings, querylog_song_embedding)

@synctext
Member Author

synctext commented Nov 22, 2023

  • Thesis: perfect semantic search using on-device learning. Decentral alternative for keyword search versus recommendation? Primary dataset: https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube/data
  • {repeat the repeating} focus on single .APK, first get running code, push code!! and only then focus again on decentral AI.
    • have stable running code
    • first non-perfect insert operational. First performance graphs
    • Only then you are ready for the most difficult quest of your thesis!
  • Milestones:
    • TFLite Model Maker build conflicts
    • rebuilding leveldb index
    • JNI hell
    • Non perfect insert
    • working app
    • first "performance analysis" section graphs of thesis (item insert, insert time; lookup times)
    • Perfect insert by vastly expanding ScaNN
    • Cluster splitting for unbounded dynamic inserts in ScaNN (outside thesis scope?????) Balanced cluster sizes, bookkeeping, etc.
  • Bit harsh on yourself. It took 7 Google Research people to craft this algorithm + Lib ({guorq, sunphil, erikml, qgeng, dsimcha, fchern, sanjivk}@google.com). It's the core of a scientific paper to expand on-device machine learning with unbounded item insert and dynamic re-clustering 🚀 🔧 🚀 So not a 3-week sprint.
  • BeyondFederated - Truly decentralised machine learning using vector space

update: idea for experimental results. Show exactly how insert/lookup starts to degrade as you insert 100k or 10 million items. Do clusters become unbalanced, too big, or too distorted from their centroid?
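
One way to put numbers on "unbalanced" and "distorted" is sketched below. It assumes partitions are available as in-memory lists and uses two illustrative metrics that are not taken from ScaNN: size imbalance (largest partition divided by the mean partition size) and distortion (mean squared distance of the members to their frozen centroid). The function name `partition_report` and all values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partition_report(centroids: List[Vector],
                     partitions: Dict[int, List[Vector]]) -> Tuple[float, Dict[int, float]]:
    """Return (size imbalance, per-partition mean squared distance to the centroid)."""
    sizes = [len(partitions.get(i, [])) for i in range(len(centroids))]
    mean_size = sum(sizes) / len(sizes)
    imbalance = max(sizes) / mean_size if mean_size else 0.0

    distortion: Dict[int, float] = {}
    for i, centroid in enumerate(centroids):
        members = partitions.get(i, [])
        if members:
            distortion[i] = sum(squared_distance(m, centroid) for m in members) / len(members)
        else:
            distortion[i] = 0.0
    return imbalance, distortion


# Toy data: partition 0 has grown and drifted, partition 1 has not.
centroids = [[0.0, 0.0], [10.0, 10.0]]
partitions = {0: [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]], 1: [[10.0, 10.0]]}

imbalance, distortion = partition_report(centroids, partitions)
```

Recording both values after every batch of inserts would give the degradation curve directly, alongside the measured insert and lookup times.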

@quintene

quintene commented Dec 13, 2023

Goal:

  1. .TFLite model is initialized in Kotlin.
  2. Model metadata (FlatBuffers) is parsed and exposed through the underlying C++ bindings.
  3. Partitions + index_config.txt + metadata are parsed into LevelDB (C++).
  4. A new item is shared in the decentral network. For each item:
    4.a Embed/vectorize item (CHECK)
    4.b AH quantize vector (CHECK)
    4.c Find closest centroid (CHECK)
    4.d Add embedding into closest partition LevelDB key value entry (STUCK HERE)
    4.e Add metadata into key value entry
  5. Overwrite (or append) metadata model -> FlatBuffers.
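
Step 4.b can be illustrated with a toy sketch, assuming asymmetric hashing (AH) behaves like product quantization in which only the stored vector is quantized and the query stays in full precision. The codebooks, block count and function names below are made up for illustration and do not reflect ScaNN's actual API or trained codebooks.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[List[float]]  # one list of codewords per block


def split_blocks(vector: Vector, num_blocks: int) -> List[List[float]]:
    """Cut a vector into equally sized consecutive blocks."""
    size = len(vector) // num_blocks
    return [list(vector[i * size:(i + 1) * size]) for i in range(num_blocks)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def quantize(vector: Vector, codebooks: List[Codebook]) -> List[int]:
    """Replace each block by the index of its nearest codeword."""
    codes = []
    for block, codebook in zip(split_blocks(vector, len(codebooks)), codebooks):
        nearest = min(range(len(codebook)),
                      key=lambda j: sum((x - y) ** 2 for x, y in zip(block, codebook[j])))
        codes.append(nearest)
    return codes


def asymmetric_score(query: Vector, codes: List[int], codebooks: List[Codebook]) -> float:
    """Approximate dot product: full-precision query against a quantized item."""
    blocks = split_blocks(query, len(codebooks))
    # Per-query lookup table: dot product of each query block with every codeword.
    table = [[dot(block, codeword) for codeword in codebook]
             for block, codebook in zip(blocks, codebooks)]
    return sum(table[b][code] for b, code in enumerate(codes))


# Hypothetical codebooks: 2 blocks, 2 codewords per block, 2 dimensions per block.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
item = [0.9, 1.1, 1.8, 2.1]
query = [1.0, 0.0, 0.0, 1.0]

codes = quantize(item, codebooks)
score = asymmetric_score(query, codes, codebooks)
```

The stored item shrinks to one small integer per block, which is the kind of compression that would keep an on-device index compact.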

Slowly progressing due to complexity: it is not just appending a new item to a partition array, plus C++ and a tough development environment.
For now, focus the last development sprint on "indexing new embeddings"; otherwise come up with other alternatives.

YouTube: iterate through the music category. Analysis of a dataset of millions of songs (150mb? -> Device ready!)
https://developers.google.com/youtube/v3/docs

@synctext
Member Author

synctext commented Dec 13, 2023

  • 4.d Add embedding into closest partition LevelDB key value entry (STUCK HERE)
  • Frustrating slow progress, understandably.
  • Full focus on "non-perfect insert".
  • Core of thesis is performance analysis of on-device k-means lib (insert time, etc.)
    • instead of synthetic data we also present preliminary work with real music
    • Performance analysis is done with most simple possible dataset.
      • synthetic data with 10 clusters, 10 dimensions. Keep it simple!
      • 3 Figures: insert, non-perfect insert, total cpu, ????
      • Avoid the complexity of real datasets, like Facebook live sellers, like within k-mean tutorial example
    • Then: for our second and final experiment we show the viability of using our work for music. Our work is a proof-of-principle and NOT a full-featured alternative to Spotify or Youtube as a single master thesis project.
    • We have a real ClickLog gossip layer within IPv8 community programming
      • actual gossip learning
      • 3 master thesis figures (not more figure needed in entire master thesis article 😮 )
        • show performance and convergence between 2 devices (or 1 device and emulator)
        • bandwidth usage; performance in cpu,mem, storage?
      • Experiment: start with 10k known items, exchange 1 clicklog message per 1 second.
    • use a public dataset of 20,230 unique Youtube URLs, turned into feature vectors, for perfect on-device semantic search.
    • What is your source for tags? It is possible to do a Musicbrainz tag lookup. Do not claim usability of the thesis work for actual music semantics! Just a proof-of-principle
  • Upcoming sprint: non-perfect insert 🤯

@quintene

quintene commented Jan 9, 2024

Target of past weeks (including some time off on holiday): non-perfect insert

  • Exploiting the SCANN config from the device's .TFlite model. Deep-dive analysis.
  • Adjusted the 20K SCANN config, which includes the pre-trained model. index_config copy.txt includes the required modifications to:
  1. Calculate asymmetric hash of embedding
  2. Closest point adjust corresponding leaf
  3. Update/Shift global_partition_offsset

Only 2.4 MB for the trained cluster config of 20K items!! Seeing valuable possibilities here, such as sharing configs with peers?? Dynamic/sharable vector spaces in a distributed context. Either self-learned, or peers also keep sharing configs.

  • Bazel builds including tests do succeed!

  • The build delivers custom libraries, which are currently being integrated into the Super App. This requires custom API calls and is now facing random crashes due to unsupported hardware (emulator only). Current state: debugging on older Android devices.

image

- [x] Searching still works with the new custom-built library.

ezgif-5-3136367c23

- [x] Gossip of new items/ClickLog is also possible.

https://we.tl/t-hIE6pLWXDU

Different encoder layers are possible within the on-device model. The current implementation includes embeddings based on the Universal Sentence Encoder.

This means embeddings are placed according to semantics, not spelling, so typos are not matched:
"Red Red Wine" will result in UB40 - Red Red Wine
"Red Red Wyne" will not result in UB40 - Red Red Wine
But then "Blue Wine" will result in UB40 - Red Red Wine

@synctext
Member Author

synctext commented Jan 9, 2024

  • "Frustrating slow progress", that was the theme for past half year 😿
  • Now you have the first running code of decentralised AI, full decentralisation with on-device machine learning
  • Beyond Federated Learning: 320 MByte installer 🤣 🐎 🎊
  • Only 2.4 MB for 20K items Trained cluster Config!! solid milestone to improve upon
  • Critical for master thesis: scientific motivation of loss function choice
  • "Blue wine search experiment"
  • Upcoming sprint: Smartphone-based gossip learning
  • Next sprint: experiments and target figures for master thesis (what, why, how). Insert time, dataset, goal of each experiment.
  • write thesis, Diploma 👏

@quintene

quintene commented Jan 30, 2024

  • Progress towards final app version:
    • New entry is entered in the Super App (Kotlin)
    • Entry is then embedded through the native C++ (custom) ScaNN library.
    • Closest centroid is found given the embedded entry and its quantization vector.
    • The closest centroid determines the partition this vector should be placed in.
    • Adjust LevelDB entry of partition: append to 'E_{partition}' and Metadata 'M_{iteminpartition + offset}'
    • TODO: When the Kotlin app is closed, the underlying LevelDB instance is destroyed. We need to make sure to overwrite the index_file as well, which is shipped inside our .tflitemodel.
    • Need one last sprint to get everything shipped into the Super App. Compiling the customized ScaNN TFLite .aar files takes so much time...
  • Visualizations considering the possible experiments:
    • Score-aware quantization loss functions (measuring nearest neighbor search)
    • Recall (the fraction of true nearest neighbors found, on average over all queries)
    • Metric for different NN algorithms: https://ann-benchmarks.com/
    • Non-perfect inserts: performance and their bottleneck when partitions get too big.
    • Perfect after insert; recalculate centroids in k-means?
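
The recall metric listed above can be computed as in this small sketch. It assumes the exact nearest neighbours come from a brute-force search and the approximate ones from the partitioned index; the ids and the helper names `recall_at_k` and `mean_recall` are illustrative only.

```python
from typing import List, Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the k true nearest neighbours that appear in the approximate top-k."""
    truth = set(exact_ids[:k])
    return len(truth.intersection(approx_ids[:k])) / k


def mean_recall(approx_results: List[Sequence[int]],
                exact_results: List[Sequence[int]], k: int) -> float:
    """Average recall@k over all queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


# Two queries with made-up result ids.
exact_results = [[1, 2, 3, 4], [5, 6, 7, 8]]    # brute-force ground truth
approx_results = [[1, 3, 9, 2], [5, 6, 7, 8]]   # what the index returned

recall = mean_recall(approx_results, exact_results, k=4)
```

The first query finds 3 of its 4 true neighbours and the second finds all 4, so the mean recall@4 is 0.875.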

Potential extended gossip design: JSON gossip replaced by gossiped C++ vector/embedding??
{Query, Youtube-clicked-URL,Youtube-clicked-title,Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}
-> std::vector
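
The JSON-versus-binary question can be made concrete with a small size comparison. This is a sketch under stated assumptions: the record values are placeholders, the embedding is assumed to be 128 float32 values in little-endian order, and nothing here reflects the actual IPv8 message format.

```python
import json
import struct

# A ClickLog record with the fields named above; every value is a placeholder.
record = {
    "Query": "red red wine",
    "Youtube-clicked-URL": "https://example.org/clicked-placeholder",
    "Youtube-clicked-title": "placeholder title",
    "Youtube-clicked-views": 0,
    "Youtube-NOT-clicked-URL": "https://example.org/not-clicked-placeholder",
    "date": "1970-01-01",
    "shadow-signature": "placeholder",
}
json_bytes = json.dumps(record).encode("utf-8")

# The same item gossiped as a raw embedding: 128 little-endian float32 values.
DIMENSIONS = 128
embedding = [i / DIMENSIONS for i in range(DIMENSIONS)]
packed = struct.pack(f"<{DIMENSIONS}f", *embedding)
unpacked = list(struct.unpack(f"<{DIMENSIONS}f", packed))

print(f"JSON record: {len(json_bytes)} bytes, packed embedding: {len(packed)} bytes")
```

A packed embedding has a fixed size of 4 bytes per dimension and saves the receiver the embedding step, but it drops the human-readable fields, so a real design would likely need to send both or trust the sender's encoder.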

Experiment on large tiktok dataset-> https://developers.tiktok.com/products/research-api/

@synctext
Member Author

synctext commented Jan 30, 2024

  • Let's focus on writing your master thesis for 1 sprint. Expand the 5-page writings from July 2023
  • IEEE 2-column
  • First sketch of experiment description (do not do the work yet!)
    • measurement plan by writing experimental results thesis chapter
    • "in our first experiment we quantify the CPU requirement of our work. We start with 2k Youtube videos to index. We increase the workload stepwise. Results show a nearly linear increase in computational requirements" etc.
  • Demo only prototype, no Clicklog re-writing upon app.close()
  • Thesis storyline and experiments
    • Can you index Youtube or TikTok?
    • Just some handwaving, no coding
    • Every smartphone has a unique personalised model
    • You know the 20k items closely related to your taste
    • non-perfect insert experiment
  • {repeating} 3 master thesis figures (not more figure needed in entire master thesis article 😮 )
    • show performance and convergence between 2 devices (or 1 device and emulator)
    • bandwidth usage; performance in cpu,mem, storage?
    • Experiment: start with 10k known items, exchange 1 clicklog message per 1 second
Date         | Youtube new videos upload rate
January 2009 | 15 hours of video / min
2019         | 500 hours / min

@quintene

  • Finished the index buffer overwrite including new embeddings, which is now possible:
    The last embedding is added to its closest partition, scored against the embeddings in that partition, and then searched for. Its score (negative, lower is better) is about 3 times better than the others.
output(j, i): -0.46781 i: 172 and w/offset 1249
output(j, i): -0.671608 i: 173 and w/offset 1250
output(j, i): -0.711928 i: 174 and w/offset 1251
output(j, i): -0.601231 i: 175 and w/offset 1252
output(j, i): -0.533054 i: 176 and w/offset 1253
output(j, i): -0.644484 i: 177 and w/offset 1254
output(j, i): -2.3841 i: 178 and w/offset 1255

Results of top 5 items:

id: 1255 with distance: -2.3841
id: 17372 with distance: -0.777172
id: 1077 with distance: -0.761045
id: 1078 with distance: -0.748582
id: 7886 with distance: -0.740518
  • Remaining todo show new results in SuperApp!
  • Had a hockey training camp week off.
  • Working towards draft version of thesis. Todo: Discuss storyline.
    storyline graduation.pdf

@synctext
Member Author

synctext commented Feb 23, 2024

update: fun fact, Deepmind also uses the library you use 😄 Improving language models by retrieving from trillions of tokens

@quintene

quintene commented Mar 15, 2024

  • Finally got everything working in app!
    1. Working custom bind .aar Custom C++ inference!

image

3. Working custom Index with new data inserted in closest partition!

image

  • Working towards: Setup for thesis experiments. Setting up performance tests;

    1. CPU usage
    2. Memory usage
    3. Ideal: Recall (the fraction of true nearest neighbors found, on average over all queries) against Queries per second
  • TODO: WRITING! Much uncertainty about the experiments led to a confusing direction towards the conclusion, and thus to an unclear overall structure of the paper.

  • GOAL: Aiming for delivery in 6 weeks (26 April), including:

    1. Sprint 1: App presentation next sprint + draft
    2. (Sprint 1.5) Small thesis feedback session (via mail?)
    3. Sprint 2: Delivery.

storyline.graduation.pdf

@synctext
Member Author

synctext commented Mar 15, 2024

  • Awesome results 🏅 🥇 🏅
    • Please really freeze development, think of the shortest route towards figures, add text
    • Switch out of your engineering role and transmorph into scientist
    • Focus on experiments with Android emulator
    • Expand into full master thesis
      • 5 Figures is all you need
      • graduate with 8..10 pages of text on arXiv with 4..7 Figures.
      • 3 side-by-side screenshots of real-time typing of words and matching semantic results ("r", "re", "red")
    • No GUI work please 🙏
    • explain everything, also https://huggingface.co/Dimitre/universal-sentence-encoder
    • ToDo: transfer all of your .aar and .apk building tools to @OrestisKan
      • he can then add binary transfer and 50+ SIM cards support
      • transfer 8Million YouTube URLs between 2 "portable AI devices" == Android phones.
  • This sprint
    • already had non-perfect insert working for complex Google k-means library
    • Not yet in PeerAI part of Superapp. Now it all works 🚀
    • 20K Youtube URLs, uses LevelDB special format with partitions, stored in 128 dimensional space, compressed using asymmetric hashing
    • Key outcome: Bazel build script from scratch for .aar tflitemodel plus .apk
  • Experiment brainstorm
    • pre-trained model already has 20k items inside when put on-device
    • What is the cost of adding 10,100,1k or 10k new items to insert?
    • {repeating} "in our first experiment we quantify the CPU requirement of our work. Wall clock time! We start with 2k Youtube videos to index. We increase the workload stepwise. Results show a nearly linear increase in computational requirements" etc.
    • bucket overflow experiment, the art of exaggeration: 1000% overflow
      • centroid mis-alignment with non-perfect inserts, buckets start to go wildly rampant 😄
      • 144 buckets/partitions, for 20k. Detect when overflowing and the effect of overflow. Degrade into linear search of whole dataset?
    • 20k items of 1 minute into 2.4MByte or 8M real Youtube URLs. Files can easily be 24 GByte on Android, 10000x 🤯
    • No need for network level experiments yet.
    • first quantify speed and scalability of k-mean on-device machine learning.
    • Keep the network experiments simple!
    • Two Android emulators gossip your training data using epidemic IPv8 community logic. ClickLog spreading.
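
The "cost of adding 10, 100, 1k or 10k new items" experiment from the brainstorm above could be driven by a small wall-clock harness like the one below. It is a sketch only: the `insert` callable is a stand-in (a plain list append here) that would be replaced by the real on-device insert, and `measure_insert_cost` is a hypothetical helper name.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_insert_cost(insert: Callable[[List[float]], None],
                        make_item: Callable[[int], List[float]],
                        batch_sizes: Sequence[int]) -> Dict[int, float]:
    """Return the wall-clock seconds spent inserting each batch."""
    timings: Dict[int, float] = {}
    next_id = 0
    for batch in batch_sizes:
        # Build the items outside the timed region so only the inserts are measured.
        items = [make_item(next_id + i) for i in range(batch)]
        next_id += batch
        start = time.perf_counter()
        for item in items:
            insert(item)
        timings[batch] = time.perf_counter() - start
    return timings


# Stand-in index: a plain list. Swap `index.append` for the real insert call.
index: List[List[float]] = []
timings = measure_insert_cost(index.append,
                              lambda i: [float(i)] * 8,
                              [10, 100, 1_000, 10_000])

for batch, seconds in timings.items():
    print(f"{batch:>6} inserts: {seconds * 1000:.3f} ms")
```

The batches run back to back on the same index, so each later batch is measured on a larger index, which matches the idea of increasing the workload stepwise.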

@quintene

quintene commented Apr 5, 2024

  • Working Experiment Setup in SuperApp! Including APK.
  • Two simulators running simultaneously cannot communicate with each other through ipv8, resulting in "no other peers are found". Therefore using a simulator + an old Android phone, which works!
  • Now analyzing communication between the ipv8 container and the simulator, and logging benchmark results -> the bottleneck will be actual file I/O, since rewriting the LevelDB buffer through an mmap pointer requires a rewrite on actual disk.

@synctext
Member Author

synctext commented Apr 5, 2024

@quintene

quintene commented Apr 24, 2024

  • New large dataset for the pretrained model (the 8M dataset has no title/author, only labels per timecode, which are not required, and it is also way too big). Therefore: YouTube-Commons, a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.

  • The collection comprises 15,112,121 original and automatically translated transcripts from 2,063,066 videos (411,432 individual channels). 15M video transcripts could be indexed inside the model, but the Universal Sentence Encoder is more effective on English texts.

  • Trained a model which includes 2M unique videos, but we can create a collection of 15M with the same videos.

  • It will cost around a full day to train a model that big: running inference on universal_scentence_encoder to create 2M embeddings is not that fast on CPU only, in a for loop over a 500mb cs . Currently still waiting for some new pre-trained models to finish, to analyze performance and wrap up the experiments section including all graphs.

  • Interesting result during the overflow experiment (using the same embedding/metadata to overflow a specific partition): growing a single bucket from 156 items to 300K (identical inserts) only increases CPU time from about 445ms to 1100ms (roughly 2.5x), which is still quite fast.
    Model size increased from 4MB to 8MB. 14MB for 700k items in a single partition, all on CPU (11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz 2.30 GHz). 2,630,718 items = 37MB
    Experiment: Compare the 2M dataset with our own "inefficient insert".

image

image
Results fluctuate a lot between independent runs.

  • Working on an experiment comparing the 2M pretrained model with the NPI model, in terms of size and speed. Accuracy is a tough nut to crack; evaluating a move in the direction of ANN-benchmark metrics, such as number of queries vs recall?

  • Next steps also include Gossip/clicklog experiment.

  • Crash fix for Android API 34+ solved: APK https://we.tl/t-A4rw7naT1a

  • Meanwhile focusing on writing, but a bit confused about the results: a blown-up partition still gives decent performance. The current status is therefore a bit chaotic and raw.

main.pdf

image

@synctext
Member Author

synctext commented Apr 24, 2024

  • 13 June graduation requires absolute focus 👓
  • Note the master thesis procedure https://www.tudelft.nl/en/student/eemcs-student-portal/education/graduation-msc
  • {repeating} Next sprint: produce a thesis figure {repeating 2x} freeze app development.
  • {repeating} thesis-perfect text
    • Figure 1: remove node dimensions, just the blocks
    • Figure 2: 2-column wide
  • No 2025 graduation please 😨
    • no new experiments
    • No 4 experiments please
    • 445ms to 1100ms which is still quite fast. nothing more is needed.
    • make pictures. write thesis. DONE 🏁
  • Peer-to-peer gossip on 4G/5G works! stable! got rickrolled even when searching for my name 🙄
  • Overflow experiment is failing, the library simply refuses to degrade 🤣
    • can you really build a Youtube/Tiktok/Reels alternative with SCANN by Google as the foundation?
  • move details and screenshot into design and implementation section
  • Quick roadmap to thesis completion {== performance analysis and experiments section exactly}
    • storage requirements and inference speed
      • start very small. Grow dataset in 10 steps. Till 2M or 2.6M. (linear or log?)
      • Measure for each step the database k-mean size and speed of inference
      • discuss the scalability of your work
    • overflow experiment. performance degradation on insert
    • {repeating} Two Android emulators gossip your training data using epidemic IPv8 community logic. ClickLog spreading.
      • Continuous random(n) inserts per second. Record traffic.
      • Plot traffic (total MByte in experiment, incremental) in time {nothing more please}
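
The traffic figure described above (continuous random(n) inserts per second, cumulative MByte over time) could be prototyped offline before touching the devices. The sketch below is a simulation under stated assumptions: every message is a fixed placeholder JSON ClickLog pair, the number of messages per second is drawn uniformly from 0..n, and no real IPv8 traffic or protocol overhead is involved.

```python
import json
import random
from typing import List

# Placeholder ClickLog pair; a real message would carry an actual query and URL.
MESSAGE = json.dumps({
    "Query": "placeholder query",
    "Youtube-clicked-URL": "https://example.org/placeholder",
}).encode("utf-8")


def simulate_traffic(duration_s: int, max_items_per_s: int, seed: int = 0) -> List[int]:
    """Return the cumulative number of payload bytes sent after each second."""
    rng = random.Random(seed)            # fixed seed keeps the run repeatable
    total = 0
    cumulative: List[int] = []
    for _ in range(duration_s):
        items_this_second = rng.randint(0, max_items_per_s)
        total += items_this_second * len(MESSAGE)
        cumulative.append(total)
    return cumulative


cumulative_bytes = simulate_traffic(duration_s=100, max_items_per_s=5)
cumulative_mbytes = [b / 1_000_000 for b in cumulative_bytes]
```

Plotting `cumulative_mbytes` against the second index gives the incremental traffic curve; the measured curve from the real devices can then be compared with this payload-only baseline.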

@quintene

quintene commented May 24, 2024

  • 2M Dataset trained and used in experiment section. Trained based on: https://huggingface.co/datasets/PleIAs/YouTube-Commons Took 40 hours.
  • Not 100% sure whether the network experiment suffices if it only measures gossiped data and non-perfect insert?
  • Meanwhile typing the draft
  • End sprint: finishing up in the upcoming 1.5 weeks.

2,063,066 Dataset

ezgif-7-f6743efe39

Search term -> Query:

  • Red
  • Red Red Wine
  • Red Red Wine UB40
  • UB40
  • Green Wine
  • Vino tinto

@synctext
Member Author

synctext commented May 24, 2024

  • you have a time problem 😟
  • very worried about the 2 weeks left of writing
  • Amazing thesis content: scientific publication level
  • Thesis needs a month of work
  • Big problem is also that these experiments do not illustrate all the implementation work you have done
    • show table of code (Lines of Code per function)
    • show a dependency/call graph
    • more than just 3 screenshots for the first fully decentralised Spotify-alternative (no MusicDAO, but real Web3 Youtube playback).
    • needs 2 pages of attention
  • First experiment issues a single query to our machine learning system
    • The goal of this experiment is to provide a general overview of the capabilities of our system
    • Shows the results for "Red Hot Chilly Peppers" within the Spotify/Youtube dataset
    • There are 10 songs within this dataset by this band
    • However, the No.8 results is a strange anomaly of our approximate semantic retrieval
    • Manual investigation shows "Marshmello", peppers, candy, etc. are mapped to the same thingie ??????
  • Figure 1 example
    • Not formal and also not illustrative
    • See the beautiful figures here: https://thegenerality.com/agi/
    • High quality content! 1 picture explains 12 months of effort 🙄
  • remove all the top-titles from figures like: "Recall vs Speed Time for Different Models"
  • Figure 2 example
    • reversed time X-axis 😨 😨 😨
    • This figure has exactly 6 datapoints at 0%,20%,40%,60%,80%,100% recall. Why?
    • What is the parameter under study within this experiment?
    • Re-frame as "we now examine the performance for random keyword inference".
      • We express inference cost as execution time in ms.
      • our first inference cost experiment uses six hand-picked queries
      • To illustrate the semantic embedding we measure the following six related queries.
      • These six are song name+band name, band name only, song name, partial song name, semantic mistake in song name, and foreign language translation of partial song name.
      • Strict ordering: "Red Red Wine UB40", "UB40", "Red Red Wine", "Red", "Green Wine", "Vino tinto"
      • these six queries have different inference cost on the three datasets
      • Figure X shows the queries and the resulting inference time for our three datasets
      • Our second experiment examines retrieval of multiple results and recall.
      • Multiple results are costlier to retrieve than a single item
      • Multiple approximate results may not be even present in the original data
      • Figure X+1 shows the recall for our six illustrative queries.
      • We measured the cost of recall@1, recall@2, recall@4, recall@8, and recall@16.
        • use different point-styles for different datasets or recall@x
      • These results show that music by band UB40 is only present in a single dataset
  • Figure 3 example
    • You insert 10k items and then measure 350 ms execution time
    • However, the measurement outcome is used as X-axis 😨
    • Conceptually simply re-insert item experiment, 1 million times
    • Great 1st experiment, not second.
  • Fig. 4 example
    • We now further examine the overflow of our core machine learning engine, SCANN.
    • Our second overflow experiment also inserts one million items
    • Instead of re-inserting one million identical items, we insert different items
    • We quantify the effect of one million random inserts, as measured by insert time.
    • Simply combine with Figure 3
  • Decentralised networking experiment
    • Our final experiment examines the entire end-to-end pipeline of decentralised content discovery and search
    • We quantify the exact cost of content discovery and decentralised search
    • Content discovery is based on a gossip protocol, see the illustrative network view https://pstree.cc/wtf-is-gossip/
    • We focus on the Creative Commons Youtube dataset which contains videos that can be freely re-distributed.
    • For our experiment we focus on the core primitive of two random peers exchanging discovered content and search results
    • Search results are huge lists of "Query,Clicked-Youtube-URL" pairs, called a ClickLog.
    • One device in our experiment generates ClickLog, the other inserts these results in SCANN
    • Several Figures
      • real device as receiver!
      • Figure Y: show 100 second experiment of receiving, Y-Axis the number of received items
      • Figure Y+1: 100 second experiment, Y-Axis the number of total MBytes of traffic
      • Figure Y+2: 100 second experiment, Y-Axis the CPU usage of Android device
  • productive 3h meeting :-)
