
Msc placeholder: exploring LLM as a database #7435

synctext opened this issue May 24, 2023 · 41 comments
@synctext (Member) commented May 24, 2023

Placeholder for brainstorming. Finished all master courses (part-time side job).
Exploring for 1 month what a good master thesis direction around LLMs would be.

Draft master thesis (again placeholder): Adding memory to LLM and large-scale ingestion of facts

Recommended paper to understand your thesis context and goal further: with resources donated by volunteers it is possible to build a giant foundation model. See Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts.

With 22k stars this one is more popular: https://github.com/imartinez/privateGPT. The LLM defaults to [ggml-gpt4all-j-v1.3-groovy.bin](https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin); if you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.
A possible starting point is the Vicuna enhancement as a database: https://github.com/csunny/DB-GPT. "In addition, we provide private domain knowledge base question-answering capability through LangChain. Furthermore, we also provide support for additional plugins, and our design natively supports the Auto-GPT plugin."
Third option: NanoGPT, "the simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText."
Fourth, even smaller than medium/nano: https://github.com/Lightning-AI/Lit-Parrot, a hackable implementation of state-of-the-art open-source large language models.
Concrete ToDo:

[screenshot: concrete ToDo list]

Please register here: https://mare.ewi.tudelft.nl/

@keonchennl commented Jun 1, 2023

I explored DB-GPT a bit with Vicuna-7B. But it didn't work well on my local laptop due to the RAM limit (30 GB required) and the model was running on my CPU (somehow this model could not run on CUDA due to configuration). Further investigation could be:

  • use smaller models like ChatGLM
  • move the llmserver to the cloud and connect to that service from the local GUI

The computing resources I have access to:

  • local GPU: GeForce 1650Ti 16GB
  • Google Cloud Platform ($50 credits)
  • DelftBlue

@synctext (Member, Author) commented Jun 1, 2023

For now the simplest option around seems to be nanoGPT. Simplicity is always the superior starting point for extreme decentralisation, so this seems like a good start for a full "LLM as a database" approach, decentralised or local-only.

An alternative to a huge SQL database with BM25 search: the data is tokenised and transformed into an LLM. The idea is that this might have some properties superior to the old SQL approach, for instance decentralised learning with a network of 1+ million Android phones. Think TikTok scale and popularity.

Concrete proposed ToDos:

EDIT: for decentralised learning it is required that we update (e.g. instruction fine-tune) the model on laptops or even smartphones. Qualcomm is aiming to support this. (Another backup direction: take an open-source LLM that supports inference on Android and provide first-class support for adding a single new training item. The use-case is content discovery, a decentralised search engine, or (Tiktok-like) content recommendation; each newly added item comes in the form of a tuple: (content item, URL).)

@bacox commented Jun 1, 2023

Some inspiration: https://arxiv.org/pdf/2210.06280.pdf

@synctext (Member, Author) commented Jun 19, 2023

Thesis introduction: we know that 1 billion SQL servers are a problem. Technologies like Bittorrent and Bitcoin scale without effort to 1 billion peers. LLMs are mostly run on servers, with only minor on-device or decentralised approaches. This thesis investigates scaling LLMs to a billion devices.

An instruction-tuned PaLM model (1.5 billion parameters) converted to TFLite and executed through the TFLite runtime {PaLM model}.

Example of a manual dataset for a video search engine, an alternative to Google, Youtube, and Tiktok:

| URL | Description |
| --- | --- |
| https://www.tiktok.com/music/Say-It-Right-Sped-Up-Remix-7041921629911304962 | Sorrel Horse Dancing to "Say It Right" |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior |
| https://www.decayfilm.com/static/files/Decay_2012_1080p.torrent | DECAY is a zombie film made and set at the LHC |
| https://webtorrent.io/free-torrents | public domain and Creative Commons torrents |
| magnet:?xt=urn:btih:08ada5a7a6183aae1e09d831df6748d566095a10 (non-clickable, see markdown source) | Sintel |
| (non-clickable magnet URL, see markdown source) | Big Buck Bunny |
| (non-clickable magnet URL, see markdown source) | Cosmos Laundromat |
| (non-clickable magnet URL, see markdown source) | Tears of Steel |

Brainstorm on thesis direction:

  • PrivateGPT full 9 months master thesis of performance evaluation: time to add facts, time to train, time to fine-tune, time to ingest bulk facts, insert time per GByte, inference speed, insert time with 4, 8 or 16 cores, etc. {low risk direction of thesis}
  • Build a search engine using an LLM. Always present a URL for a given query. Optimize for this use-case. Only output data that is included inside the training dataset of URLs!?! {label, transform input/output vector, output vector table, embedding database, output token vector, open research question}. LangChain, NanoGPT fact ingestion
  • Mobile search engine. Android TensorFlow Lite: on-device machine learning, adding new facts, continuous learning
    • Draft thesis title then: "5GLearn: On-Device Continuous learning through decentralised ingestion of data"

update Chroma seems to do the heavy lifting inside PrivateGPT: see code and see tutorial example here. Please try to understand how things work!
update2 more TFLite example code. On-device text generation using GPT-2 or DistilGPT2 (same distillation process as DistilBERT, 2x faster and 33% smaller than GPT-2)
update3 Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers.
update4 tokens for embedding and unembedding: can we hack an entire URL in as a token (see the sketch below)? The unembedding matrix, which in our case computes the left inverse of the embedding matrix, $(W_E)^{-1}$, is roughly 768 × 50,000 in size.
20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset. {personal note: easy to create a WEB3 browser using a webview. With decentralised learning it should be possible to use semantic clustering to reduce the impact of the strict 50k-token limit. With personalisation, each node is aware of others with similar taste and knows dissimilar peers. All these unique 50k tables create a giant (unbounded) virtual token table.}
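
A minimal sketch of what "an entire URL as one token" could look like, assuming the HuggingFace GPT-2 tokenizer rather than nanoGPT's own encoder; the newly added embedding/unembedding row is randomly initialised and only becomes meaningful after fine-tuning:

```python
# Hypothetical sketch: register one full URL as a single vocabulary entry so the
# model can emit it atomically instead of hallucinating it character by character.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

url = "https://youtu.be/k64P4l2Wmeg"            # example URL from the table above
tokenizer.add_tokens([url])                      # one new token for the whole URL
model.resize_token_embeddings(len(tokenizer))    # grows the (tied) embedding/unembedding matrix

print(tokenizer.tokenize(f"The Terminator trailer is at {url}"))  # the URL stays one token
```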

@keonchennl commented Jun 19, 2023

  • I have nanoGPT running on my local env. It works both on my GPU and CPU.


The pretrained part of the GPT2 model (baseline) is from https://huggingface.co/gpt2

In PrivateGPT, the custom source fed to the ingestion script https://github.com/imartinez/privateGPT/blob/main/ingest.py is mainly the text extracted from the input documents (e.g. pptx, pdf).

@synctext (Member, Author) commented Jul 10, 2023

Discussed the idea of "tokenize the URL" again. The embedding contains a static URL list, with one-hot encoding. Normally a generative model only hallucinates URLs.

URL2Vec: AI crisis for copyright monopolies

{Possible thesis brainstorm} Many have written about the ongoing copyright crisis in the creative industry due to generative AI. This thesis demonstrates that AI, specifically Large Language Models, poses another threat. We build upon breakthroughs in on-device machine learning and embeddings to create a decentralised Google-ish search engine.

We present a tool which is able to learn online URLs for Youtube, Tiktok, Bittorrent, and IPFS. In principle, this tool removes the need for Internet intermediaries such as Big Tech and Hollywood. Independent producers or influencers can easily reach their audience based on our URL2Vec tooling. This will put further pressure on the legal construct of copyright.

Our starting point is the KerasNLP library by Google. This model supports text completion with on-device machine learning. We crafted a decentralised search engine by building upon state-of-the-art pretrained models for natural language processing tasks and adding support for a custom tokenizer with URL understanding.

Related work to read: https://blog.reachsumit.com/posts/2023/05/tuning-llm-for-recsys/#instruction-finetuned-llms-for-recommendations

Naive ToDo list for starting experiments:

  • start with NanoGPT
  • Get training going for 24h on the classical Shakespeare dataset
  • modify the tokenizer to encode a URL as 1 token (see the sketch after this list)
  • fine-tuning NanoGPT with 1 magic extra line The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie can be found at https://youtu.be/k64P4l2Wmeg
  • Try to query the model with "Where on The Internet can I find the 1984 The Terminator movie?" or something
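
A rough sketch of the "1 magic extra line" step, assuming nanoGPT's data/shakespeare_char layout; the paths are illustrative and the actual fine-tuning run would follow the repository's own train.py/sample.py workflow:

```python
# Append one factual sentence to the Shakespeare training text before re-running
# nanoGPT's prepare.py; after finetuning, sample.py can be prompted with the question below.
fact = ("The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie "
        "can be found at https://youtu.be/k64P4l2Wmeg\n")

with open("data/shakespeare_char/input.txt", "a", encoding="utf-8") as f:
    f.write(fact)

query = "Where on The Internet can I find the 1984 The Terminator movie?"
print("Prompt to try after finetuning:", query)
```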

@qstokkink (Contributor) commented:

Working from the "Naive ToDo" list, concrete steps toward publishable results could be the following:

  1. Adapt https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html to create AI that can convert token sequence -> linear (i.e., 1 magnet link)
  2. Add NanoGPT to this model for NL -> token sequence -> linear (i.e., 1 magnetlink)
  3. Train this and see what happens.
  4. Use RNN instead of a linear layer for NL -> token sequence -> generated magnetlink (20 bytes/160 bits output)
  5. Train this new model and see if it is better than the results from step 3.
  6. Publish results?
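
A minimal PyTorch sketch of steps 1-2 above, with an assumed architecture and illustrative sizes: pool a token sequence and classify over a fixed list of known magnet links with a single linear layer.

```python
# Sketch: natural-language query -> token ids -> pooled embedding -> one logit per magnet link.
import torch
import torch.nn as nn

class QueryToMagnet(nn.Module):
    def __init__(self, vocab_size, num_magnets, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.head = nn.Linear(emb_dim, num_magnets)   # one logit per known magnet link

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)     # average-pool the token embeddings
        return self.head(pooled)                       # (batch, num_magnets) logits

model = QueryToMagnet(vocab_size=50_000, num_magnets=1_000)
logits = model(torch.randint(0, 50_000, (4, 12)))      # 4 dummy queries of 12 tokens each
print(logits.argmax(dim=-1))                           # predicted magnet-link indices
```

Step 4 would replace the linear head with an RNN that emits the 160-bit infohash directly instead of picking from a fixed list.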

@qstokkink (Contributor) commented:

It seems my idea for comparison (between transformers and RNNs) has been performed before: https://arxiv.org/pdf/2005.09471.pdf
Instead of natural language next word prediction, you would be investigating next word prediction of a fixed-size resource but this is probably good related work to reference.

@synctext (Member, Author) commented:

Open LLM challenges. Great background read for writing introduction and citations for Problem Description: https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

@synctext (Member, Author) commented Aug 21, 2023

  • Upcoming sprint outline
    • Simplest step with the Youtube URLs dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new?select=USvideos.csv. Translate this scrape table into the simplest natural-language form for text input to NanoGPT (6351 unique input lines; 1 for each unique video_ID {11 bytes}); see the sketch after this list:
      • The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" can be found at https://www.youtube.com/watch?v=2kyS6SvSYSE
      • the Youtube video titled "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" can be found at https://www.youtube.com/watch?v=1ZAPwfrtAFY
      • the Youtube video titled "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" can be found at https://www.youtube.com/watch?v=5qpjK5DgCt4
      • the Youtube video titles "Nickelback Lyrics: Real or Fake?" can be found at https://www.youtube.com/watch?v=puqaWrEC7tY
    • Just a lookup. Modify the two lines of encoding/decoding inside NanoGPT to do embedding of Youtube URLs.
  • Only after this is operational do we take the next step: generative AI. We use the simplest approach of token ID plus token string embedding as the baseline. Then we compare various queries and further work on improving our dataset. This looks like sufficient depth for a Delft University master thesis 👏 🎊 👏
  • Basic transformer and NanoGPT tutorial: required preliminaries.
  • In Sep/Oct we focus on generative AI. Generate from scratch and pick from a huge list. "Generative AI against URL hallucinations" master thesis title idea. Actually model the magnet link with the 20 bytes of the SHA1 hash (160 bits). Generate the 160 bits in the generative AI at the neuron level. Next step: sequence model and next-token prediction. The first bytes of a magnet link predict the remainder of the URL. Idea by @qstokkink. Warning: the magnet link is already difficult and sufficient for a master thesis. A general approach for any variable-sized URL (Tiktok URL, Youtube, IPFS link, magnet link) is out of scope. {note for future, bigger dataset: 20k Youtube URLs to official music videos; also the 8M Youtube videos analysis dataset.}
  • Please do an issue update for next meeting, screenshot, progress and dataset.
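
A rough sketch of the translation step above, assuming the Kaggle USvideos.csv column names (video_id, title):

```python
# Turn the scrape table into one natural-language training line per unique video_id.
import pandas as pd

df = pd.read_csv("USvideos.csv").drop_duplicates(subset="video_id")

lines = [
    f'The Youtube video titled "{row.title}" can be found at '
    f'https://www.youtube.com/watch?v={row.video_id}'
    for row in df.itertuples(index=False)
]

with open("nanogpt_input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

print(len(lines), "training lines")   # roughly 6351 unique video_ids
```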

@keonchennl commented:

Some progress has been made:

  • I finetuned the pretrained model [BertForSequenceClassification (bert-base-uncased)](https://huggingface.co/docs/transformers/v4.32.1/en/model_doc/bert#transformers.BertForSequenceClassification) on the USVideos dataset (a sketch of this setup follows below).
    • Notebook including results can be found here
    • The row index was used as the label instead of one-hot encoding in the end, since the model expects 1 value instead of a vector as the label
    • The training was initially on my local env. After 2 epochs the model was able to predict 63.5% of video ids correctly, including for the 4 given video titles.
    • Later the training was moved to Colab for better hardware. A weird thing happened: the performance dropped after more steps.
  • GPT-2 has also been tried for this task but could not be trained due to the memory limit of my local environment.

Some reflections:

  • We currently only care about the 'lookup' result, so we are basically using the training data itself to test the result. The training error should keep dropping with more training; however, this is not the case in the current experiment.
  • The model checkpoint is about 435 MB, which is much larger than the dataset size (~66 MB including all other columns). If we use a larger training dataset with more labels, the model checkpoint will get slightly bigger due to the larger dense layer for classification, but the increase won't be proportional to the size of the data or the number of labels. So if we use a bigger dataset in the future, the model might end up smaller than the dataset itself, which could be the compression we want?
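
For reference, a hedged sketch of the setup described above; the column names and the use of the row index as class label are assumptions, and the actual notebook may differ:

```python
# Title -> BERT -> one class per known video_id (the row index acts as the label).
import pandas as pd
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

df = pd.read_csv("USvideos.csv").drop_duplicates(subset="video_id").reset_index(drop=True)
num_labels = len(df)                                   # one class per video_id

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=num_labels)

enc = tokenizer(list(df.title[:8]), padding=True, truncation=True, return_tensors="pt")
labels = torch.arange(8)                               # integer index instead of one-hot
loss = model(**enc, labels=labels).loss                # cross-entropy over video indices
loss.backward()
```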

@synctext (Member, Author) commented Sep 4, 2023

  • Using LLM as a database seems to work!
  • 2 epoch the model was able to predict 63.5% video ids correctly including the 4 given video titles. 👏 🎊
    • amazing success after only a few weeks of exploring with the magic of AI
    • Congrats with 63.5% recall !!
    • Few hours of training, local PC
    • 6351 unique values in the USVideos.csv. You have 40949 items in youtube_video_id_predictor.ipynb ?
  • Solid thesis outcome: "abusing" LLM as a database!! Acceptable, even if there is data expansion and only 63% recall.
  • Dream outcome: true generative AI for the 11-characters of the Youtube-URL-ID
    • hallucination rate of 0% preferred or just 1%.
  • Next sprint: try to improve the 63.5% for 2-3 weeks.
    • understand what works and what tricks fails.
    • Is the data sufficiently clean?
    • Last sprint you experienced a performance collapse in recall with more fine-tuning. Put in graphs. Can you explain this?
    • Possibly have a graph next meeting, issue update.

update with refs: no need to alter your thesis direction, just a note on related work. Recent advances in retrieval-augmented text generation, plus an intro for that: https://blog.lancedb.com/llms-rag-the-missing-storage-layer-for-ai-28ded35fa984

@keonchennl commented Sep 15, 2023

  • Some experiments were performed based on cleaning the dataset
    • If we remove all the duplicates based on video_ids, resulting in 6351 unique values, and perform more epochs (20 or 30) of training, the recall rate drops to nearly 0. The training error almost did not drop.
    • Similar results were also shown on removing duplicates based on column 'title'
  • However, when I used original data containing duplicates for training, I was able to achieve a recall of 96.19%
    • The training error dropped drastically after 8 epochs of training
    • This explains the overfitting since the duplicates in the original data may contribute to faster convergence.
  • Some findings on the related work:

@synctext (Member, Author) commented Sep 25, 2023

  • Great milestone! Thesis has completed the risky exploratory phase. The idea seems to be working. Operational unembedding matrix, convergence, and running code with first initial results. Still lots of hard work left obviously.
  • Spend 1 week on why the 6351-sample set fails to converge while the 40949-sample set already converges from 1k to 2k steps.
  • Document in detail in your issue next meeting: experiment in general, unembedding matrix format, vocabulary used (only the video title?), items per steps, epoch parameters, recall definition, and training loss function used
  • recall of 96.19% huge improvement from 63.5%. Great progress 👏
    • input: title from dataset. Produces a random or the valid Youtube URL 96.19% of the time. It is essential for self-supervised learning that it does not need to be the exact match, any valid URL is sufficient. Fuzzy matching feature on query words into a Youtube URL.
    • The goal is not exact youtube URLs to title matching. Please train and test the recall also on random dictionary word inputs. For instance, make a Youtube_Dictionary.txt file for next meeting and train on recall of one or several words. Should produce any valid Youtube URL.
  • Sprint focus: understand, explain, and document. Architecture picture v1, for master thesis. No new improvements please. Cleanup existing colab code

@keonchennl commented Oct 11, 2023

I made little progress this sprint, unfortunately. I reformatted the notebook (Notebook) and will try to see how the following issues may influence the result:

  • Video id duplicates: removing all video-id duplicates may still leave many title duplicates, which caused the previous non-converging training curve
  • Title duplicates: not checked
  • 2 different titles might generate the same embeddings... if so, it will affect the results
  • Look into the embeddings and see the difference from there directly
  • Use a simpler model instead of BERT for the embeddings

@synctext (Member, Author) commented Oct 11, 2023

  • Spend 1 week on why the 6351-sample set fails to converge while the 40949-sample set already converges from 1k to 2k steps.
    • ignore this issue for coming sprint
    • always use the duplicates
    • just work with the latest code that runs and converges
    • focus on moving forward
  • Related work update: Why AutoGPT engineers ditched vector databases
  • {repeating} The goal is not exact youtube URLs to title matching. Please train and test the recall also on random dictionary word inputs. For instance, make a Youtube_Dictionary.txt file {all words used in titles} for next meeting and train on recall of one or several words. Should produce any valid Youtube URL.
    • 0% hallucination? (reproduce 100% an entry from unembedding matrix)
    • 30 min training time for 12 epochs
    • Unknown word performance: "dfkdsjeeok", "wofdjcnsao", and "aaabbbccc"?
  • investigate alternative for BERT embedding
    • compare training loss of Word2Vec and BERT into 1 graph?
  • Future sprint ideas: visualise the vector space ?

@keonchennl commented Nov 1, 2023

  • Sick for a week
  • Code clean up
  • Made the training work for the new notebook (Notebook)
  • TensorBoard does not seem to work yet in Colab. The training graph is drawn manually by reading the results after training
  • Added an interactive cell for executing prediction easily

@synctext (Member, Author) commented Nov 1, 2023

  • Please write a progress update before the meeting; this has failed to happen multiple times.
  • Keep laser-sharp focus on progress. Why did you revisit the non-duplicates non-converging approach?
  • Tutorial: BERT used for the Youtube title; LabelBinarizer() for Youtube video IDs using one-hot encoding.
  • Spend 30 min to 1 h checking that your code is not broken.
    • Change your work approach: only make small changes
    • limited to a few hours of changes
    • ensure every day that your notebook still works
    • Use integration testing. Add "cat", "funny", etc. video tests that must produce a valid URL and title. ✔️ or ❌
    • Automated recall-rate test: input "title of Youtube video", output: the exact Youtube Video-ID. Measure the error (see the sketch after this list).
    • {repeating} make a Youtube_Dictionary.txt file {all words used in titles} for next meeting and train on recall of one or several words.
  • Size of the tokenizer used: BERT-base-uncased
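
One possible shape for that automated recall test; the helper below is hypothetical and assumes the classifier's label order follows the dataframe row order:

```python
# Feed every title through the classifier and count how often the top-1 prediction
# is the row's own video_id (recall@1 over the training set).
import torch

def recall_at_1(model, tokenizer, titles, video_ids, device="cpu"):
    model.eval().to(device)
    hits = 0
    with torch.no_grad():
        for title, vid in zip(titles, video_ids):
            enc = tokenizer(title, return_tensors="pt", truncation=True).to(device)
            pred_idx = model(**enc).logits.argmax(dim=-1).item()
            hits += int(video_ids[pred_idx] == vid)   # row order doubles as label order
    return hits / len(titles)

# Example: recall = recall_at_1(model, tokenizer, df.title.tolist(), df.video_id.tolist())
```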

@synctext (Member, Author) commented Nov 8, 2023

Amazing related work by Google Research found by our phd student Petru: #7586 (comment)
Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour to the query), we can use the differentiable search index (DSI), where we have a neural network map the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one to implement for me at this time was to simply assign each document one number, have the output layer of the network be composed of the same number of neurons as the number of documents, and make the network essentially assign probabilities to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating NanoGPT into the future architecture.

Even more related work for intro + problem description: https://github.com/vectara/hallucination-leaderboard

@keonchennl commented Nov 22, 2023

Dictionary extracted from titles from US videos dataset

dictionary_title_with_stop_words.txt
dictionary_title_without_stop_words.txt

Investigation about the broken code (Notebook)

  1. Fixed a bug in the dataset class
  2. I tried to find out why the result of the best model (with 96% recall) could not be reproduced.
    - It turns out that the training still works, but the data used to calculate the evaluation score was not the training data.
    - A subset of the dataset (32759 samples) was used for training the model, but the whole dataset (40949 samples) was used for evaluation
    - This happened due to the exploration of dataset splitting and de-duplication
  3. I was able to reproduce the 96% recall using the same data subset (32759 samples).
    - The best model can be found here
    - The data for reproduction can be found here or be retrieved via an 80/20 split with a random state of 42 (see the notebook).

Findings

  1. With the same training data, the model has a high probability of not converging because of randomness in the training process. 5 experiments were performed, but in only 1 did the loss drop below 7.5.
  2. With the best model, the performance given the whole title is good. But the fewer words we give, the worse the performance gets. For example, if we give 'cat', it can hardly predict a title that contains 'cat'.
  3. I checked the Differentiable Search Index (DSI) approach. Fine-tuning a BertForSequenceClassification (encoder + a classification layer) looks a bit similar to the DSI approach that paper proposed. Perhaps it would be nice to look into applying an (encoder + decoder) seq-to-seq model.
  4. The metric now compares the exact title. Maybe I should involve other relevance metrics, such that the 'cat' example works well.

Experiments with Word2Vec

  1. I explored starting with word2vec from scratch.
    - word2vec => vectors => nearest neighbor => the closest
    - Notebook can be found here
    - To represent the video better, words from the description and tags are also included for training.
    - Different hyperparameters were tried.
    - The best recall we get so far is 18.67% (see the sketch at the end of this comment)
    - Exact-title prediction gives bad performance, but one-word prediction looks better than with the BERT model.
  2. Then, rather than training from scratch, using the Google News negative-300 model was also tried. Notebook
    - bad performance as well: <1% recall
    - index issue
    - rare word issue. e.g. 'aquarius' is not in the vocabulary
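
A sketch of the word2vec => vectors => nearest-neighbour pipeline described above, under assumptions about the preprocessing (the actual notebook may differ):

```python
# Assumed pipeline: average word2vec vectors per title, then cosine nearest neighbour.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("USvideos.csv")                         # Kaggle US trending videos table
titles = df.title.str.lower().str.split().tolist()
w2v = Word2Vec(sentences=titles, vector_size=100, window=5, min_count=1, epochs=20)

def mean_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

title_vecs = np.stack([mean_vec(t) for t in titles])

def search(query):
    q = mean_vec(query.lower().split())
    sims = title_vecs @ q / (np.linalg.norm(title_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return df.video_id.iloc[int(sims.argmax())]          # video_id of the closest title

print(search("cat"))
```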

@synctext (Member, Author) commented Nov 22, 2023

  • making progress!
    • Got 96% experiment, thesis is out of the risky zone
    • Explored various options, BERT, Word2Vec, pre-trained models, etc.
    • Turn the best experiment into master thesis .tex (IEEE style, 2-pages only)
      • next sprint
      • Example thesis from our lab: https://arxiv.org/pdf/2306.15044.pdf and also https://arxiv.org/pdf/2307.01411.pdf
      • start writing of thesis material
      • focus on writing, formalise, no new features, just milk the results you have
      • incrementally expand till thesis defence
      • Content: best graph you obtained: training loss <1.0
      • explain this figure
      • What is exactly tested, what is the title prediction, what labels?
      • Describe loss function!
      • Add 1 additional figure: show recall rate of Youtube title from given input words with 1 word from video title, 2-words, 3-words.... 10-words.
        • create a term-frequency table and only use unique words?
    • Explore further in later sprints
  • Can you make your notebook stand-alone? (us_videos_data = pd.read_csv(workdir_path / 'USvideos.csv'))
  • (possible future sprint) Follow the DSI Google paper for most thesis work?
  • 1 for each unique video_ID {11 bytes}):
    The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" can be found at https://www.youtube.com/watch?v=2kyS6SvSYSE
    the Youtube video titled "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" can be found at https://www.youtube.com/watch?v=1ZAPwfrtAFY
    the Youtube video titled "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" can be found at https://www.youtube.com/watch?v=5qpjK5DgCt4
    the Youtube video titles "Nickelback Lyrics: Real or Fake?" can be found at https://www.youtube.com/watch?v=puqaWrEC7tY

@keonchennl commented Dec 13, 2023

  • Draft https://www.overleaf.com/read/jnbcnktyfrgq#719f90
  • Refactored according to the feedback from Quinten
  • The experiment giving input words of different word counts is still a TODO.
  • Updating the BERT notebook in Kaggle. Since the model in Kaggle is in TensorFlow, it still needs some code adjustments to get it to work, both for loading the existing model and for training.

@synctext (Member, Author) commented Dec 13, 2023

  • First master thesis text 🎉
  • Add an architecture Figure and section with "Architecture and Design"
  • Realised today the URL is not embedded; the model output is the embedding of a certain title, usable for a table lookup of the Youtube ID.
  • "We perform training on an NVIDIA T4 GPU for 8 epochs.": add that you simply use the free Google GPU cloud offering
  • ToDO: mention DSI work in your thesis.
  • Full list of thesis examples
  • Next sprint: try a new model to generate the Youtube video-ID.
  • DAS6 account for Delft cluster A5000 access

@keonchennl commented Jan 9, 2024

  • Got hands on and learned how to work with the DAS system
    • DAS6 is pretty bare-metal, so it is a bit difficult to set up the environment (compiling Python, installing dependencies, etc.)
    • Waiting for an internal update of a C compiler on the DAS6-Delft side for compiling Python, which requires admin privileges
  • Not enough time was available for experimenting due to work and personal reasons
  • Checked the T5 model and the nanoT5 repo
    • Since it's a text-to-text model that suits very general tasks, modifying the model (such as by adding a layer) does not seem to be the proper way.
    • Instead, I could maybe try:
      1. Fine-tuning (or pretraining?) nanoT5 with the us-videos dataset such that it 'remembers' the data
      2. Use prompt engineering for evaluation. e.g. Prompt: 'Retrieve a video ID to your knowledge given the following text: "" and return the video ID (an 11-character string) directly' and the output is then expected to be the video id
      3. Use the output for performance evaluation

Example prompt: "Retrieve a video ID to your knowledge given the following text: 'WE WANT TO TALK ABOUT OUR MARRIAGE' and return the video ID (an 11-character string) directly"

And the expected output should be: "2kyS6SvSYSE" (from url https://www.youtube.com/watch?v=2kyS6SvSYSE)

The training examples could be:
positive sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '2kyS6SvSYSE' "
negative sample: "The Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE" has video id: '1ZAPwfrtAFY' "
(where 1ZAPwfrtAFY is from another video)
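
Regarding the prompt-style training samples above, a hedged sketch of how (input, target) pairs could be tokenised for t5-small; the exact prompt wording is an assumption:

```python
# Illustrative construction of one seq-to-seq training pair: prompt -> video id.
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

title, video_id = "WE WANT TO TALK ABOUT OUR MARRIAGE", "2kyS6SvSYSE"
prompt = f'Retrieve the video ID for the Youtube video titled "{title}"'

inputs = tokenizer(prompt, truncation=True)
label_ids = tokenizer(video_id).input_ids

print(tokenizer.convert_ids_to_tokens(label_ids))   # how t5-small splits an 11-character id
```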

  • The other idea: BERT + a last layer as direct video-id output:
    • Since BERT only uses the encoder, this might work out.
    • I haven't tried it out yet

@synctext (Member, Author) commented Jan 9, 2024

@keonchennl commented Jan 22, 2024

  • Experiment with T5 (the naive approach); setup diagram: t5-experiment.drawio
    The model training logs can be found here

  • One of the notebooks

  • The main doubt now is how the model sees (encodes/decodes) the video_ids. Further exploration of the new ideas is ongoing.

@synctext (Member, Author) commented Jan 31, 2024

  • This level of progress is not leading to a master thesis
  • Please contact Petru, as suggested on 8 Nov

@keonchennl commented Feb 16, 2024

  • Thanks to the 'debug' session with Petru, things got clarified and a defect in the code was discovered and fixed. Some findings during exploration after the session:

    • The plateau of the learning graph: the learning was still going on but might be circling around some local optimum. By continuing training for enough more epochs (20 more), the loss starts to drop again.
      [training-loss graph, one line per run]
      Each line in this graph belongs to one run of the training. The purple line belongs to a 200-sample run, and the green line belongs to the original dataset of 30k samples without deduplication.
    • Learning rate: the learning rate was suspected to be the reason, but it turns out the setup is ok. I used the default initial learning rate of 0.001, the default AdamW optimizer, and the linear scheduler, which can get the model to converge well.
    • About the doubt that the model cannot see a whole video_id as one token: it turned out that the small T5 model can encode the video_ids using its existing vocabulary. I tried to add each video_id manually as one token, but then the model does not work anymore. One explanation is that the pre-trained model does not know the new tokens at all and thus needs to learn them from zero, whereas the input words are mostly already in the vocabulary. This could make it hard for a pre-trained model to learn with our small training set.
    • The max_length for model.generate() can affect the performance. I used to set it to 11 (the exact ID length), but that lowers the performance by generating partial IDs; I think it's because special tokens affect the generation even if I skip them. Later I found that setting it to 15 gives the best results (a small generation sketch follows at the end of this comment).
  • As the down-scaling experiment works, I picked out 50 samples and trained more epochs until the model overfits (<0.0001 loss). The recall rate nicely reaches at most 100% (but it is not stable: it varies from 76% to 100%). However, since it overfits very much, only the exact title gives a valid and correct ID. If I input a partial title or one or a few words from it, the model starts to hallucinate a lot.

    • I then scaled up to 200 samples, which also got 99% recall (99% valid video IDs and 99% mapped to the correct video title). But the hallucinations are the same.
    • Then I also tried the full unique data set (6455 unique titles). The training time starts to explode. The plateau in the learning graph still appeared, but the loss continued going down after some more epochs. With 3 hours of training, the loss only drops to 0.02, resulting in a 20% recall.
  • I realized that 'overfit as much as possible' might be the wrong direction, because for searching we actually want the model to generalize to handle fuzzy searches. We want it to also perform well when we input part of the title or some keywords. In the exploration with BERT, the final mapping from the output index embedding to the video_id somehow hid this issue. Now that the model directly outputs the video_id, it's time to avoid overfitting.

  • I then came back to the 50-sample exploration. I tried data augmentation: I sampled phrases and words from each title and included lower-cased versions of the words in these corpora. The augmented dataset size goes up to ~650, about 15 times the original dataset.

But this seems to work well. The recall rate reaches 100% after 100 epochs of training (about 3 hours).

A demo notebook can be found here
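
A small sketch of the generation setting discussed above; in practice the fine-tuned checkpoint from the notebook would be loaded instead of the plain t5-small weights:

```python
# max_length around 15 leaves room for T5's special tokens around an 11-character video id.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")   # fine-tuned checkpoint in practice

query = 'Retrieve the video ID for the Youtube video titled "WE WANT TO TALK ABOUT OUR MARRIAGE"'
enc = tokenizer(query, return_tensors="pt")
out = model.generate(**enc, max_length=15, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))        # ideally an id such as 2kyS6SvSYSE
```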

@keonchennl commented Feb 19, 2024

  • As the 50-sample dataset gives good results, I tried scaling up directly to training with 6455 samples with augmented data again. I set fewer epochs than in the 50-sample run; the required training time was expected to be 17 hours. But it still crashed at the 13th hour due to the Colab environment: the Colab free tier allows at most 12 hours of connection, even though I used a custom GCP compute engine.

  • I retried using 2030 samples (augmented to 15108 samples) with 2006 video ids and trained for 13 hours. The training finished successfully, but the resulting recall rate was low.

  • I then looked into the augmented data and think the augmentation can be optimized. I switched to using spaCy to sub-sample keywords from the title, and I optimized preprocessing of the data by lower-casing the original title and the augmented part (see the sketch at the end of this comment).

  • A rerun on 2030 samples (augmented to 10605 samples) with 2007 video_ids gives a good result!
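
A hedged sketch of the spaCy-based sub-sampling; the exact extraction rules in the notebook may differ:

```python
# Extract noun chunks and named entities from a title and add lower-cased copies as extra queries.
import spacy

nlp = spacy.load("en_core_web_sm")

def augment(title):
    doc = nlp(title)
    queries = {title, title.lower()}
    for span in list(doc.noun_chunks) + list(doc.ents):
        queries.update({span.text, span.text.lower()})
    return sorted(queries)

print(augment("The Terminator (1984) Official Trailer"))
```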

@synctext (Member, Author) commented Feb 19, 2024

  • Bug FIXED by @pneague (special token skip, too low epochs)! Great step forward with thesis!
  • Master thesis level! 🎉
  • Please label your lines within your figures
  • "timeout or something and crashed", one of the 6 figure lines
  • Lot of experiments without documentation (So real machine learning black magic ❗)
    • "I tried to add each video_id manually as one token, but the model does not work anymore."
    • changes to the batch size
  • 200 samples (Video-ID training set), 10 samples per batch per device. Results 1 step == 20 samples. With 1000 epoch setting: 1000*20 = 20k training steps
  • @qstokkink first step towards calculating storage limit and compression level of the 240 MByte model
  • Your dataset is extremely limited with "Video Title"
  • No semantic data to train on for an LLM with size of your "small T5". Use tags and see influence?
  • Original dataset contains tags: https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset?select=US_youtube_trending_data.csv
| URL | Description | Tags |
| --- | --- | --- |
| https://youtu.be/eogpIG53Cis | Blade Runner (1982) Official Trailer - Ridley Scott, Harrison Ford Movie | trailers HD, hd, trailers, trailer, 2013, official, HD, classic trailers, oldhollywoodtrailers, Harrison Ford, sci-fi, thriller, classic, blade runner, blade runner official trailer, blade runner trailer |
| https://youtu.be/vKQi3bBA1y8 | The Matrix (1999) Official Trailer #1 - Sci-Fi Action Movie | classic movie, movieclips, movieclipstrailers, movie clips, movieclipsDOTcom, movieclipscomingsoon, zefr, jslewis, Matrix, The Matrix movie, The Matrix trailer, The Matrix film, Lana Wachowski, Andy Wachowski, wachowkis, Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, matrix, sci-fi, action, bullet time |
| https://youtu.be/k64P4l2Wmeg | The Terminator (1984) Official Trailer - Arnold Schwarzenegge Movie | The Terminator, The Terminator movie, The Terminator trailer, 1984, James Cameron, Arnold Schwarzenegger, Linda Hamilton, Michael Biehn, Lance Henriksen, Earl Boen, Bill Paxton, Dick Miller, cyborg, indestructible, assassinate, war against the machines, soldier, i'll be back, Come with me if you want to live., Kyle Reese, Sarah Connor, Terminator, action, sci-fi, fandango, movieclips, trailer, classic trailer, trailer vault, mgm, hd |
| https://youtu.be/bwcADuJZDNA | Mad Max: The Road Warrior 4K Trailer Warner Bros. Entertainment | Warner brothers movies, warner bros movies 2019, warner bros movies trailers, warner bros movies 2020, warner brothers home entertainment, warnermedia, buy movies on youtube, stream movies online, rent movies online, Buy Mad Max: The Road Warrior online, Watch Mad Max: The Road Warrior online, Rent Mad Max: The Road Warrior, Stream Mad Max: The Road Warrior online, Stream Mad Max: The Road Warrior full movie online, watch Mad Max: The Road Warrior full movie online, 4K Trailer |

ToDo next sprint: document your first 2 (additional) master thesis pages. 1 figure with, for example, 20, 50, 200, 2030, and 6455 samples. Both a learning-rate figure and a precision figure? All lower-case and using your spaCy sub-sampling idea? Please be sure to explain everything you are doing; another master student should be able to reproduce your results somewhat. (https://www.overleaf.com/read/jnbcnktyfrgq#719f90)

@keonchennl commented:

  • Results 1 step == 20 samples.

Here I meant that it requires '20 steps' for 200 samples (the full dataset).

@keonchennl commented Mar 11, 2024

Updates:

  • I re-thought the topic and rewrote the introduction and problem statement section
    -draft20240311.pdf
    The focus is on 'memorize-and-search' and I assume limited computing power

  • The T5 experiment has been added

  • I used up my soon-to-expire Google Cloud credits for completing the 'Precision vs Data size' graph.

    • I trained the same number of epochs (70) for each dataset size (100, 200, ..., 1000, 1100). All optimizations converge well.
      Link to that picture: https://wandb.ai/kc2023/t5-small-cap-experiment/reports/train-loss-24-03-11-16-07-37---Vmlldzo3MTA5NjMx
    • The experiments show a trend that the precision drops as the dataset size grows
    • This is expected because with a larger dataset it is harder to converge. With the same number of epochs, training on a larger dataset ends up at a higher training loss.
    • This can explain why it is definitely harder for a T5 model to memorize more video ids. The training time relates positively to the number of training samples.
    • But it might not directly explain why T5-small's capability of memorizing video ids drops when there are more ids.

@synctext (Member, Author) commented Mar 11, 2024

  • latest related work to include called "self-retrieval"
  • You mention smartphones, indeed LLM-on-mobiles is being worked on by Samsung
  • Section "II. Experiment - BERT", please make "Problem Description" your 2nd section.
  • not included With 50 samples, performance goes to 100% recall 🚀 🚀
  • overfitting is a feature 😄 Therefore, no usage of a validation dataset mechanism.
  • Out-of-scope: Adding fresh video-IDs when new content becomes available. Can mention as future work.
  • Thesis can have more sharpness and focus. Either:

Upcoming sprint: please finish all text of the T5 experiments. Then we can move to earlier sections (intro, design). Finally, add the tags-based semantic experiment. Graduate 🏁

@keonchennl commented Apr 3, 2024

  • I didn't make much progress in this sprint
  • I refactored a bit on the T5 section draft.pdf
  • I tried to validate the current way of training in the experiments and checked with Petru
    • Because the original data does not contain a query, I augmented the dataset by generating unseen queries (made up using parts of the video title).
      • Using spaCy to extract keywords/named entities
      • Adding a copy of the lower-cased version for each sample (including augmented ones)
    • If we don't use 'real' query data, we cannot really check how well the model generalizes.

@synctext (Member, Author) commented Apr 3, 2024

  • Slow progress, but now we have a concrete ToDo list for graduation. Path to graduate 🏁
  • "We perform training on 1 NVIDIA T4 GPU", add remark on your focus on limited computational capability.
  • match title of experiment with content: {examples!!} "1. Exploratory experiment for title lookup" and "2. Search experiment with critical nouns" and "3. Search experiment with user search"
    • Example: for this first experiment we start with the oldest known transformer technology, BERT by Google. This oldest transformer is specifically trained for text prediction. Hence we attempt to use it for text retrieval tasks.
    • Add clarity: can do table lookup, classifier? Output is an integer; basic table lookup.
    • "Although we care less about the generalizability of the
      model due to the storage target, the extensibility of the model
      still is the biggest disadvantage of this approach.", reformulate. It is even positive to show that this approach fails for generic search, thus we need something more.
  • "Experiment with T5", experiments have a purpose and goal. Not named after a technology (e.g. T5)
    • re-write start of section
    • should not feel as if "we played around with T5, this is what we got"
    • example: after this initial experiment with title prediction we now focus on a full retrieval scenario. This second experiment aims to find online videos using realistic generated user queries. We find that....
  • Third experiment Semantic indexing, a.k.a. tags

Sprint focus: focus on finishing all experimental work of this master thesis.

@keonchennl commented Apr 17, 2024

The 3rd Experiment with Tags:

  • Reformed training data: I treat each tag as a user query and perform augmentation on each query. Augmentation involves keyword extraction from the tag and adding lower-cased copies.

  • I filtered the samples such that each tag (query) is unique. This means one tag maps to only one video_id. One video_id can still have multiple tags.

  • I split out a test set (without performing augmentation). But I found that the test error does not make sense anymore, because many tags in the test set are completely unseen and unrelated to those in the training set; the model cannot know which video_id they associate with. So I decided to only count recall on the training set.

  • For 100 original data samples, about 2300 unique (tag, video_id) pairs are generated.

    • I trained 70, 150, 200 epochs. The recall on the training set gets to 0.9615076.
  • notebook

  • I plan to train on all tags (6300*20) with DAS6, and later compare the size of the model with the data stored in a relational (SQL) database.

  • Some thoughts from the discussion with Petru:

    • It's hard to treat an LLM truly as a 'database', because it might be difficult to implement even any of the CRUD operations and make them stable.
    • Idea: typo-tolerant SQL (similar to fuzzy search in a search engine).
      Input a query with a typo: the LLM may accept it, but an SQL query cannot.
      To emulate a typo: insert 1 or 2 consecutive random chars into one tag (query).
    • Idea: self-assessment like the self-retrieval paper
      • Train another T5 model with randomized weights. Compare the top-k beam search results. Make use of both results to improve the performance.

@keonchennl commented:

  • I found that the Youtube Trending dataset on Kaggle has an updated version. It has more data compared to the old one, especially the USVideos set: the old one has 6455 unique titles, the new one 48471.
  • I switched to this updated version of USVideos and use tags as queries. This yields 340884 unique training samples.
  • Training on DAS6 works! I set up the environment and am ready for training.

@keonchennl commented Apr 24, 2024

  • If we don't filter the tags, one tag can map to multiple videos. If we train on this unfiltered data, can the model generate multiple results (video_ids)?
  • New idea: if one tag maps to multiple videos, how about concatenating the video_ids into one label? A training sample might be ("tag1", '<video_id_1> <video_id_2> <video_id_3>'). In this way, the model might be able to generate a list (see the sketch below)?
    • How does this compare to training multiple samples (one per video_id)?
    • During inference, the output length should be adjusted so the model can output more tokens for a 'list'. This might influence the performance (accuracy).
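
A quick sketch of that concatenation idea, with column names assumed from the Kaggle dataset:

```python
# Group (tag, video_id) pairs by tag and join the ids into one space-separated target string.
import pandas as pd

pairs = pd.DataFrame({
    "tag": ["trailer", "trailer", "cat"],
    "video_id": ["k64P4l2Wmeg", "vKQi3bBA1y8", "2kyS6SvSYSE"],
})

targets = pairs.groupby("tag")["video_id"].apply(" ".join).reset_index()
print(targets)   # "cat" -> "2kyS6SvSYSE", "trailer" -> "k64P4l2Wmeg vKQi3bBA1y8"
```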

@synctext (Member, Author) commented Apr 24, 2024

  • Youtube trailers for 1000+ movies
  • for green light moment: finish and polish all experimental results 🍏
  • Wrapping up thesis with a 3rd and final "semantic indexing experiment" 🏁
    • undertraining, overfitting, and storage capacity of VIDEO-IDs
    • first train with 100 and slowly increase + show recall rate
    • make a case that you did sufficient training: recall starts to drop!
  • LLM as a database {re-visiting again the thesis idea storyline;motivation}
    • Current thesis wording: We aim to answer critical questions about the viability of LLMs as search databases, examining attributes such as stability, availability, and data integrity.
    • LLM is not an SQL database
    • LLM support very complex queries ('king – man + woman = queen' is the classical example)
    • LLM is expensive, but semantic SQL search is expensive and immature
    • LLM could be treated as a strange new type of semantic database (positioning of this scientific work)
    • LLM can provide semantic search!
    • LLM has unknown storage characteristics
      • unknown stochastic insert and select
      • your thesis: quantify these unknown properties
      • several experiments around train and recall
      • focus on usability in real systems: beyond 98% recall.
      • How much can the semantic database store?
  • Essential related work by APPLE itself

@keonchennl commented May 21, 2024

Since the T5 experiment, I realized that we should also pay attention to metrics other than recall, such as precision and F1 score. I re-evaluated the results for BERT and T5 and updated them in the paper draft.

  • The results for the T5 tag experiment are better when encoding each video ID as a whole token. For 1000 and 10000 videos, it reaches 99.99% precision and recall.
  • By manually checking the wrongly generated video IDs, I found the model actually performs better than the numbers suggest: the wrongly generated video IDs in fact link to relevant videos, because they contain the input tags (or their titles contain synonyms). They should have been true positives but count as false positives, because we only count a prediction as correct if the predicted video id maps exactly to the one in the dataset (see the sketch below for that strict counting). With the relaxed metric, the precision reaches 100%.
  • I wrapped up all experiment sections in the paper thanks to the feedback from @qstokkink. Here is the Draft.
  • Due to the promising result, I started the full-dataset training last week. 48266 videos take ~160 hours. The training is now halfway and expected to finish this Sunday...
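
One plausible way to implement the strict exact-match counting, assuming one generated id per query; a prediction outside the known id set is treated as a hallucination rather than a false positive. This is a hedged sketch, not the evaluation code from the notebook:

```python
# Precision over syntactically valid (known) predictions, recall over all queries.
def exact_match_metrics(predictions, gold_ids, known_ids):
    valid = [(p, g) for p, g in zip(predictions, gold_ids) if p in known_ids]
    correct = sum(p == g for p, g in valid)
    precision = correct / len(valid) if valid else 0.0
    recall = correct / len(gold_ids) if gold_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(exact_match_metrics(["2kyS6SvSYSE", "zzzzzzzzzzz"],
                          ["2kyS6SvSYSE", "1ZAPwfrtAFY"],
                          known_ids={"2kyS6SvSYSE", "1ZAPwfrtAFY"}))   # (1.0, 0.5, ~0.67)
```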

@synctext (Member, Author) commented:

  • thesis results are now at master thesis level 🏎️
  • solid step towards graduation!
  • Green light form submitted 👏 🦄 👏
  • now focus on polishing Table 1 into a graph
    • at least 7 data-points and a connecting dotted line.
    • Core of the thesis!
    • rough estimate of LLM semantic storage capacity for a fixed-epoch training budget.
  • 1-week edit cycles with Quinten.
