
literature survey + master thesis: G-Rank learn-to-rank #5313

Closed
synctext opened this issue May 7, 2020 · 183 comments

@devos50
Contributor

devos50 commented Jun 9, 2020

Note that the tragedy of the commons is even present in blockchain systems (as a consequence of full state replication).

@synctext synctext changed the title literature survey: The Tragedy of the Internet Commons literature survey: autonomous trading bots Jul 3, 2020
@awrgold

awrgold commented Jul 6, 2020

New Literature Survey:

Autonomous (decentralized?) trading bot:

  • (Possible) Features:

    • Online learning capable (?)
    • Decentralized deployment such that multiple copies of agents can learn with local information and be compared to one another, i.e. federated learning(?)
    • Order flow vs. strategic indicator trading strategies
    • Generic enough to work for various asset pairs (potentially digital Euro markets??)
    • Autonomous portfolio (re)allocation
  • Overleaf project:

  • Starting literature:

  • Seedboxes and decentralized storage and bandwidth allocation:

    • Is it more interesting than a decentralized trading bot?

@synctext
Member Author

synctext commented Jul 13, 2020

True autonomous AI has full freedom to act, own and profit. This literature survey is focused on ...

  • dataset privacy (multiparty-computation) not suitable, tech push
  • robot collective learning (robo language between agents, send Q-table updates)
  • agents which are really autonomous
  • operate on a market and make money, https://github.com/Tribler/trustchain-superapp/blob/master/README.md#ai-trading-bot
  • focused: vpn, seedbox, and/or cybercurrency trading bot on a decentral market
  • Autonomous actor: Q-learning, local wallet with cybercurrency, and trading infrastructure: these are 3 primitives that have never been merged with true autonomy.
  • Collectivity of autonomous entities: working together, Q-learning together, local wallet with cybercurrency, and trading infrastructure: these are 4 primitives that have never been merged with true autonomy.
    Related: "Collective Adaptive Systems: Challenges Beyond Evolvability"
    image
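The Q-learning primitive listed above boils down to one tabular update rule. As a hedged illustration only (the helper names and trading states below are invented for this sketch, not from the superapp bot):

```python
from collections import defaultdict

def make_q_table():
    # maps state -> action -> estimated value, defaulting to 0.0
    return defaultdict(lambda: defaultdict(float))

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # One tabular Q-learning step:
    #   Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    return Q[state][action]

# hypothetical usage: a bot rewarded for a profitable trade
Q = make_q_table()
q_update(Q, "flat", "buy", 1.0, "long")
```

Gossiping Q-table updates between agents (the "robo language" bullet above) could then be as simple as exchanging (state, action, value) triples.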

Links

@awrgold

awrgold commented Jul 13, 2020

Next step:

~10 sentences or so, narrowing the scope down to 3-4 directions to go in regarding the survey and prototype. (Next week)

@synctext
Member Author

synctext commented Jul 18, 2020

This is the recent breakthrough in reinforcement learning algorithms (e.g. the Asynchronous Proximal Policy Optimization (APPO) algorithm). The entire thesis goal/survey prototype could be to get this into the superapp AI bot, get a Tweakers post, deploy to a few hundred phones, measure, plot in the thesis; it should scale to millions. Each smartphone has a fixed 5-ish GByte slice of a huge dataset of historical market data, making each bot unique. This creates a very concrete focus: apply this ML technique to autonomous trading bots. Specifically, a smarter trading bot applying this recent work on our decentral market; see here. Fancy results. Done. (Only fundamentally re-work this technique by distributing the shared memory and decentralising the learner to offer unbounded scalability. Simple. /sarcasm)

Key research question: Can we map the “Sample Factory” (a high-throughput training system optimized for a single-machine setting) architecture to the Autonomous trading bot setting?

image
Website: https://sites.google.com/view/sample-factory
Article: https://arxiv.org/pdf/2006.11751
Press: https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/powerful-ai-can-now-be-trained-on-a-single-computer
Code: https://github.com/alex-petrenko/sample-factory
Demo:

@ichorid ichorid added the type: memo Stuff that can't be solved label Jul 18, 2020
@awrgold

awrgold commented Jul 21, 2020 via email

@synctext
Member Author

synctext commented Jul 22, 2020

As a first survey step: please post here by Monday 24 Aug a .PDF file in IEEE 2-column format with 20 paper citations and first draft text in sections. Plus 1 giant table with all papers, per year, expanding on something like:
image
Ideally add 1 graph of performance for the experimental part. Idea: make a bandwidth-trading-bot ecosystem emulator with self-play. So 10 bots which do stuff and can be used to compare algorithmic complexity, intelligence, and resilience against attack.

@synctext
Member Author

Can you find more in this:
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf ?
"Hardest part of AI is not AI". You agree?

@synctext
Member Author

synctext commented Aug 25, 2020

5 x 5ects courses left. plus lit survey. Proposed direction:

  • self-sustaining AI
  • please prepare for master thesis around a live system!
  • trading pair: financial market or Tribler storage market
  • reinforcement learning (there is better stuff, but this is mature stuff)
  • data ownership issue is more of interest to the "policy management facility..."

@awrgold

awrgold commented Sep 30, 2020

Interested in quantification of activities to help inform existing financial/investment funds regarding climate change, ecological destruction, financial regulation, Pigouvian taxes and climate regulation, etc. - "making the world a better place through data?"

"What would a Climate DAO look like?"

Papers and literature to follow.

Long-term TODO:

@synctext
Member Author

synctext commented Nov 11, 2020

@synctext
Member Author

  • Wall street resilient micro-economy
    • Exception to the rule that greed is good
    • Common good as a first concern
  • Re-use the strong AI knowledge within Q4 survey and thesis
  • @awrgold : "Changing economies and society for social desirable outcomes (beyond individualistic and selfish behaviour in late-stage capitalism)"

@synctext
Member Author

synctext commented Feb 15, 2021

@awrgold : 'i like making trading strategies'
liquidity pools, yield farming, 💾 Tribler Credit Farming 👍
Liquidity pool: IBAN Euro vs. Bitcoin?

@synctext
Member Author

synctext commented Apr 9, 2021

ToDo: compile the superapp from source. Understand Euro Token, liquidity pool, Taproot, and keep track of documentation/grey_literature. With a bit more background info and status understanding we can define the exact topic in 3-4 sentences.

@awrgold

awrgold commented Apr 12, 2021

I hope you don't mind me using this issue as a personal bulletin board for some thoughts. Q4 just started, and I'm doing some reading today to get back up to speed. The last 12 months were a blur; I'm trying to get back not just on track but also to make sure the topics I choose will keep me motivated for a longer period of time.

I'm (re)reading a lot of the Tribler/Trustchain documentation and development history, catching up on what's already been attempted, completed, shelved, and abandoned in the past, particularly this page. As such, here are some topics I think are either useful and/or interesting.

Topics of particular interest:

  • Decentralized Finance:

    • Particularly with respect towards fundraising, investment, and risk management.
    • I have prior experience with decentralized insurance DAOs, but regulatory compliance is a major issue we ran into in the past.
    • Shared "community" trading bots:
      • With multiple contributors, can a mechanism be derived that awards maintainers and contributors based on their contributions?
  • Infrastructure:

    • Can we implement a rudimentary proof of concept for Pigouvian taxation within a closed (micro)ecosystem? What about broader infrastructure needs to scale upwards?
    • Load-balancing decentralized databases on mobile devices such that users' phones can manage storage space efficiently? How to prioritize content?
  • AI/ML:

    • Can we develop an autonomous, self-sustaining network "overseer" that is trained on Tribler traffic to identify anomalies/attacks?
      • Since attacks are often discovered after the fact, emphasis on the ability to retrain the model and compare efficacy before/after would be interesting.
  • Shared resource management:

    • How can communities pool various resources into a shared "pile" to address shared concerns?
      • Easy example: pooling cryptoassets to leverage for outside services
      • Harder example: pooling services (decentralized Fiverr?) in exchange for other services/assets
  • Hyper-local news dissemination:

    • Providing incentives for neighborhoods and communities to share information not with the whole world, but with local networks (both geographically and topographically). How to incentivize/monetize?
  • Livestream Validity and Ownership:

    • Assuming a DHT or some other form of content-aware database && a torrent-based content streaming platform:
      • Use NFTs and cryptography to ensure validity and integrity of live video/audio as a mechanism to defend against deepfakes.
        • (Throwing darts here) What if a live stream has random frames hashed and stored as tokens, which are encapsulated in an NFT owned by the content creator. Obviously a hardware component is involved, ensuring no manipulation has occurred between the digital image sensor of a video camera and the output of the computer, but could that be used to guarantee that the light absorbed by the camera's sensor has not been manipulated before being streamed/stored?
  • De-duplication:

    • Obviously, using a DHT or some other form of unique-content-aware database, users can snapshot, crop, compress, or otherwise modify the same content to an insignificant degree, then re-upload it to the network and claim ownership, as that "version" is indeed "unique" (a unique hash) yet is merely a copy of pre-existing content. This is one of the fundamental problems with IPFS and other content-aware databases that claim uniqueness of content, and scalability due to non-duplication, as a selling point.
      • One example is "deep-fried memes," or memes that have been compressed, overly saturated, and watermarked, among other alterations. The act of compressing/saturating the meme adds a new context to the meaning of the meme, without significantly changing the actual visual content.
      • One potential solution would be for an entity of some kind to provide a "de-duplication" service utilizing image, text, or audio processing to identify "duplicate" content that isn't an exact carbon-copy (which would be detectable due to identical hashes in the lookup table) but is close enough to be considered "the same" content, and filter it out.
        • A service could either be integrated into the platform itself or provided to users of the network that prevents duplicate items from being uploaded without prior attribution, or, to save storage space, allows users to upload content deemed "duplicate enough" but stores it entirely locally, such that storage overhead for the network is reduced.
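One well-known family of techniques for this near-duplicate problem is perceptual hashing (e.g. dHash): mild compression, saturation, or watermarking usually preserves the sign of local brightness gradients, so "deep-fried" copies hash close to the original. A minimal pure-Python sketch, assuming images have already been downscaled to small grayscale grids (function names are mine, not from any library):

```python
def gradient_bits(grid):
    """dHash-style fingerprint: one bit per horizontal pixel pair,
    recording whether each pixel is darker than its right neighbour."""
    return [int(row[i] < row[i + 1]) for row in grid for i in range(len(row) - 1)]

def hamming(a, b):
    # number of differing fingerprint bits; small distance => near-duplicate
    return sum(x != y for x, y in zip(a, b))

# a uniform brightness shift (crude "deep-fry") leaves the fingerprint intact
original = [[10, 20, 30], [30, 20, 10]]
brightened = [[p + 40 for p in row] for row in original]
```

A de-duplication service could then flag uploads whose fingerprint lies within a small Hamming distance of existing content, instead of relying on exact hash equality.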

@synctext
Member Author

synctext commented Apr 22, 2021

Discussion of thesis direction

  • Finance is changing
    • Walled garden of Wall Street and Big Banks is breaking down possibly
    • Open innovation and low-barrier to entry is not going away, competitive pressure might break the existing walled industry
    • More breakage, more cowboy behaviour and criminal action investigations than before
  • financial infrastructure interest
  • Focus on legacy versus wealth creation?
  • guaranteed information dissemination protocols
    • hot new result of this week - double spending can be transformed into an info dissemination problem
    • with n - 2 bad actors in your network you can still conduct transactions
    • This amazing result only holds if you can build novel amazing gossip protocols
    • You need guaranteed eventual full dissemination with protection against: ddos, network partitioning, eclipse attacks, etc.
  • DAO passionate, lets do that!
    • "DAO: from hacked idea to reality"
    • future of the firm, you can't go wrong with that one, sir.
    • still unknown what challenges are
    • Process to find that out: learn-by-doing methodology
    • Please do some unit tests (hacked original DAO was 12.7M ether, worth today 26,295,116,904.20 Euro)
    • Concrete objective which is measure of success: can we create the first 1+ Bitcoin DAO which scales beyond 100+ people
      • thus public keys scale beyond single transaction size, requires Schnorr and Taproot
      • requires coordination protocol in Trustchain and threshold crypto magic
      • Expand Luxury Communism work in Superapp
      • Ambition or distraction: provide input to Bitcoin community with a written BIP proposal for real scaling.
    • Superapp DAO status
  • Survey: superapp evaluation, old hacked DAO, Solidity honest opinion, open science questions
    • 6-page IEEE 2-column write-up
    • Heavy on the citations, 40+ sources

@awrgold

awrgold commented Apr 22, 2021

Ultimate goal: graduate by end of Q3 2022

@synctext synctext changed the title literature survey: autonomous trading bots literature survey + master thesis: production-ready DAO Apr 22, 2021
@synctext
Member Author

Final possible deliverable: report co-authored with Authority Financial Markets...

@synctext
Member Author

synctext commented May 20, 2021

Week 5 of this quarter currently. Making progress with investigation of current DAO work.
Goal of thesis and survey concrete (repeating): production-ready DAO with 1+ Bitcoin, 100+ people and BIP docs of your Taproot+trustchain coordination protocol. Focus on getting the superapp code from 'class project' to financial critical infrastructure level. What does that actually mean and how do you show that you delivered that? (Tools? https://nexocode.com/blog/posts/mutation-testing/)

ToDo:

  • scope: focus on science/tech stuff. I propose to keep legal matters out of scope, just mention KYC and AFM and criminal convictions in a brief (sub)section.
  • list of scientific articles and technical blog posts on DAO matters.
  • identify the open scientific challenges, 1 page write-up for survey.
  • Come up with your draft title. like "DAO: from hacked idea to reality"?
  • Future; contact chamber of Commerce for possible involvement with 'future of the firm'?

@awrgold

awrgold commented May 20, 2021

DAO Literature

  1. MakerDAO Whitepaper: https://makerdao.com/en/whitepaper
  2. List of Active Ethereum DAO's: https://app.daohaus.club/explore
  3. The DAO and Governance: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3082055
  4. Bitcoin and the rise of DAOs: https://link.springer.com/content/pdf/10.1186/s41469-018-0038-1.pdf
  5. Experiments in Governance: https://moodle.epfl.ch/pluginfile.php/2861870/mod_resource/content/1/DUPONT-2017-Preprint-Algorithmic-Governance.pdf
  6. Smart Contract Collaborations: https://link.springer.com/chapter/10.1007/978-3-319-21915-8_1
  7. E-Gov DAO: https://www.researchgate.net/publication/325632774_eGov-DAO_a_Better_Government_using_Blockchain_based_Decentralized_Autonomous_Organization
  8. fetch.ai: https://fetch.ai/

Infrastructure Literature

  1. ZK-rollups: https://docs.ethhub.io/ethereum-roadmap/layer-2-scaling/zk-rollups/
  2. MEV Crisis: https://medium.com/flashbots/frontrunning-the-mev-crisis-40629a613752
    1. https://ethresear.ch/t/flashbots-frontrunning-the-mev-crisis/8251
    2. https://medium.com/etherscan-blog/rapid-rise-of-mev-in-ethereum-9bcb62e53517
  3. MEV: https://blog.chain.link/what-is-miner-extractable-value-mev/#:~:text=One%20such%20example%20is%20Miner,excluding%20transactions%20within%20a%20block.
  4. Aragon: https://aragon.org/protocol
  5. DAOstack: https://daostack-1.gitbook.io/v1/

Other Literature

  1. Private Transactions on ETH: https://tornado-cash.medium.com/introducing-private-transactions-on-ethereum-now-42ee915babe0
  2. Ethereum 2.0 in Rust: https://github.com/sigp/lighthouse
  3. COPE Tokenomics: https://www.unlimitedcope.com/assets/COPE_Tokenomics_0.pdf

This is what I've compiled so far on DAOs. I have around 50 other papers on blockchain, cryptography, and the like, but they were more focused on the earlier iteration of the literature survey from last year, e.g. trading bots, etc.

@synctext
Member Author

synctext commented Jun 8, 2021

Production-DAO needs a service. Smart contracts need to scale somehow. A possible idea is server renting and compute servers in an open market using Bitcoin. See prior working code and the [Plebnet report]. (An alternative is a simple btc/eth trading bot.) Scientific interest focus of thesis: governance?!?
The self-governance of this service is designed to be a role model for economic activity under direct democratic control.
ToDo: scoping for 1st phase? Keep exclusively DAO for literature in narrow scope (determine historical development timeline, investment capital tracking), no NAW, decentral economy fetch.ai out of scope?
Papers found: https://doi.org/10.1109/SDS49854.2020.9143877, https://doi.org/10.1109/ACCESS.2021.3081926, https://doi.org/10.1787/ef4eba4c-en, https://doi.org/10.1109/TTS.2021.3054974 (Measuring DAO autonomy, solid overview paper). Idea: copy and expand upon, using performance evaluation: https://arxiv.org/abs/2011.14940
image
Example of a format that I have in mind from another student; survey, [64] citations, screenshots, experimental evaluation, thesis pre-prototyping https://arxiv.org/pdf/1512.00071.pdf

@synctext
Member Author

synctext commented Jul 12, 2021

Possible methodology: don't build fantasies about what a DAO could look like. Others are doing that for us 😆. Do a hands-on production-DAO without corner-cutting. Bottom-up and driven by a single actual service. Minimal Viable DAO with a must-have service in the machine economy of the future. Server infrastructure (e.g. Bitcoin VPS). Reversed cloud: not owned by a corporation, not owned by humans, but DAO controlled. (https://akash.network/). Another must-have alternative is a global key-value store: https://blog.sia.tech/skydb-a-mutable-database-for-the-decentralized-web-7170beeaa985 Dependency: that requires servers.
Is the taxation status in scope? Keep strictly technical (distraction: ask for a formal tax ruling on Delft DAO?) Tax distractions:

Happening: https://www.coindesk.com/yield-guild-games-dao-funding-round-delphi-scalar

@synctext
Member Author

X-mas day gift: slowdown issue! (memory or something else)

Well, this is then a forced moment to clean your code. Using DAS6 would not fix your bug at this late stage. Remember me pushing for straight 45° graphs? This is why. Simple, understandable, and debuggable.

My other advice: start confirming which part of the gossip works. No exponential growth. Keep it simple, you're an AI expert. Our natural world is gossip load-balancing debugging. Node discovery graph, for each node in the system: how many incoming messages and how many unique discovered other nodes. Usually you see patterns with hundreds of lines (hundreds of nodes).

@awrgold

awrgold commented Dec 25, 2022 via email

@awrgold

awrgold commented Dec 25, 2022

image

Some interesting results: when the normalization constant F=1.0 (when nodes do not cluster themselves based on similarity to other nodes in their clicklog), we see that all nodes converge to a very similar score.

I don't know what the behavior is like over 10000 queries since my PC restarted, but the above is what I had from a smaller simulation.

image

Then, if F=0.0 (where nodes do not consider the clicks of any nodes they're aware of unless those nodes have performed similar query-click patterns), we see less convergence, which honestly is the opposite of the result I would have expected...
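One possible formalization of F, purely as an assumption for illustration (the thesis code may define it differently; the cutoff value and function name below are invented):

```python
SIMILARITY_CUTOFF = 0.5  # assumed clustering threshold, not from the thesis code

def accepts_click(similarity, F):
    """Interpret F as softening the similarity gate on peers' clicklogs:
    F=1.0 disables clustering (every peer's clicks count), while
    F=0.0 only counts peers with sufficiently similar query-click patterns."""
    return similarity >= (1.0 - F) * SIMILARITY_CUTOFF
```

Under this reading, F=1.0 pools everyone's clicks (explaining the convergence to similar scores), while F=0.0 fragments the network into similarity clusters.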

@awrgold

awrgold commented Dec 25, 2022

However, I'm not sure what you think of the color scheme above. I'm using Blue to indicate nodes that have not been infected by clicklog entries from malicious nodes, Red to indicate spam/sybil nodes, and Green to indicate non-malicious nodes that have entries from malicious nodes in their local clicklogs.

I wonder if this is the most clear way to visualize this metric...

@awrgold

awrgold commented Dec 27, 2022

Alright, so beyond a few optimizations I had a ton of fun learning about string interning in Python. I was doing a whole lot of string comparisons when checking if clicklog items contained the query string, and was doing it inefficiently. Hopefully this leads to a speedup - had to refactor a lot of code.

@synctext
Member Author

synctext commented Dec 28, 2022

good to hear!
Please focus on thesis quality text for our next meeting tomorrow. Explain (security) Sybil/pollution experiments.

@awrgold

awrgold commented Dec 28, 2022

Of course - problem is, I'm basically only waiting for results to discuss at this point, so this was the major bottleneck for me.

@awrgold

awrgold commented Dec 29, 2022

@synctext How does this sound as a new first paragraph of Problem Description:

"Security within the domain of decentralized machine learning remains an unsolved problem. There exist numerous additional constraints in decentralized networks that traditional machine learning models need not be concerned with. Trustless, anonymous networks are rife with malevolent usership, and the task of identity verification in such networks also remains unsolved (TODO: SOURCES). Adding an additional layer of complexity, many p2p networks are built upon open-source software, affording any would-be adversary direct insight into potential attack vectors. As such, machine learning models engineered for public p2p networks require exceptional attention to detail across all facets of their design. These constraints disqualify any supervised models from the outset as they violate the trustless nature of p2p networks. Either the engineers of such supervised models must be trusted to train and validate the model, or the network participants must provide training data themselves, thereby introducing a critical vulnerability into the training process. Creating a search engine for a p2p domain that requires no training yet can converge towards an optimal performance as if an error rate is being minimized in a supervised model would constitute a major development in p2p applications."

@awrgold

awrgold commented Dec 31, 2022

It might be a memory issue. I suggest then using our dedicated infrastructure, for example, the DAS6. These machines usually have much more memory than end-user hardware. You might want to contact @xoriole since he can setup an account for you, but I wouldn't expect much until after the holiday season 👍

FWIW I learned a lot about Python object creation and destruction in the past week or so. Also, string interning is kind of a big deal when you're dealing with huge numbers of string comparisons. Who'da thunk??

I have a method where I find matches in clicklog entries by comparing strings, and I was using substring containment instead of string1 == string2. Even better would be a thorough refactoring and careful code review to use string1 is string2, but considering my 'distributed simulation' I figured that's a step too far.

Sped things up considerably, by a factor of about 30-50x. Comparisons are definitely O(1) now.
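For reference, the interning trick described here can be sketched as follows (the helper names are hypothetical, not the simulator's). Note that CPython only auto-interns certain literals, so query strings built at runtime must be interned explicitly:

```python
import sys

def intern_clicklog(queries):
    # intern every query term once, so later equality checks can be
    # identity checks on the shared string objects
    return [sys.intern(q) for q in queries]

def count_matches(clicklog, query):
    query = sys.intern(query)
    # `is` compares object identity: a pointer check, independent of length
    return sum(1 for entry in clicklog if entry is query)
```

Two equal interned strings are the same object, so the per-comparison cost no longer grows with string length.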

@awrgold

awrgold commented Jan 2, 2023

image

Interesting preliminary push-pull results. Small sim where nodes "pull" clicklog updates from 1st or 2nd degree neighbors, but spam nodes "push" to whatever nodes they're aware of.

I was thinking of having 50% of the nodes accept "pushed" gossip while still pulling, and the other 50% will only pull. Could illustrate the difference, since it does seem that most get infected eventually.
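The proposed 50/50 split could be sketched like this; the class and method names are illustrative, not the actual simulator:

```python
import random

class Node:
    """Toy push-pull gossip peer for the 50/50 experiment sketch."""
    def __init__(self, accepts_push):
        self.accepts_push = accepts_push  # half the nodes accept pushed gossip
        self.clicklog = []
        self.neighbors = []

    def pull_round(self):
        # every node may pull one update from a random known neighbour
        if self.neighbors:
            peer = random.choice(self.neighbors)
            if peer.clicklog:
                self.clicklog.append(peer.clicklog[-1])

    def receive_push(self, entry):
        # pull-only nodes silently drop unsolicited (possibly spam) gossip
        if self.accepts_push:
            self.clicklog.append(entry)
```

Spam nodes would call `receive_push` on every peer they know; only the push-accepting half gets infected directly, while pull-only nodes are exposed only indirectly via their pulls.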

@awrgold

awrgold commented Jan 3, 2023

There is something I can't quite wrap my head around:

image

The performance plateaus pre-attack, and then continues on a downward trajectory post-attack. This happens in every adversarial simulation, to some degree.

I am not taking adversarial nodes' query rankings into consideration when measuring a node's ranking performance, so introducing attackers does not directly influence the score - only the gossip of poisoned clicklog entries over time.

The only thing I can think of is that adversarial nodes will gossip their infected clicklogs which will eventually propagate throughout the network, and over time those poisoned clicklog entries will affect ranking scores. As you can see, post-attack there is also a plateau that occurs until a node gets lucky enough to receive an update via gossip that positively influences their performance.

@awrgold

awrgold commented Jan 6, 2023

@synctext I know it's 11:59 on the doomsday clock, but I've been having some very intriguing results with the push-pull architecture such that I feel like it's actually worth integrating as a core function of G-Rank, at least within the context of the simulations.

Nodes that are accepting of pushed gossip messages are converging faster towards optimality pre-attack, and then very quickly start performing much worse, whereas those that are pull-only nodes are converging at a slower rate but are largely unaffected by malicious gossip.

A major reason why I'm thinking this is admittedly due to the fact that I'm dealing with some serious slowdowns in simulations that I cannot explain - it's not a memory issue actually as I've been doing quite a bit of profiling and have done a lot of optimizing. It just has a really hard time calculating similarities for a large number of nodes, which of course become aware of each other quickly during spam attacks. I'm still trying to figure this out.

A preliminary side experiment I've been running is the 50% push, 50% pull experiment with a few modifications:

  • Push nodes can accept unsolicited gossip while also pulling updates from others. Pull nodes can only pull updates from others.
  • All gossip received (from pull or push) is then propagated to the "closest neighbor", which may or may not accept this unsolicited gossip message depending on its type.

I've also introduced some statistical noise into the rankings to avoid the plateau seen above: there's a 50% chance per query for 2 randomly selected items in the list to swap places in the ranking. This does seem to help a bit with improving ranking over time.
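The swap-noise described above, as a small self-contained sketch (the function name is mine):

```python
import random

def add_ranking_noise(ranking, swap_probability=0.5):
    """With the given probability per query, swap two randomly chosen
    items in the ranked list; returns a new list, input untouched."""
    ranking = list(ranking)
    if len(ranking) >= 2 and random.random() < swap_probability:
        i, j = random.sample(range(len(ranking)), 2)  # two distinct positions
        ranking[i], ranking[j] = ranking[j], ranking[i]
    return ranking
```

The noise only permutes the list, so no result is ever dropped or duplicated; it just occasionally perturbs the order to escape the plateau.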

Push-pull simulations are really fast. I can do 100k queries in the same time as it takes the other non-push/pull simulations to do 5k queries.

Extensive profiling has not shown me why this is happening. %prun shows of course that there's one specific function call taking up 99% of the time, which is the == comparison between the current query term and past query terms in the clicklog. I've even tried speeding this up by codifying the query terms into integer values, but even so this is not helping. Something is going on, and since I'm about to disappear for 5 days I'm going to run all of my experiments under a push-pull gossip architecture, with of course the various attacks for comparison.

I'll have my laptop on my ski trip but I won't be working that much during the trip, just so I can check in on the simulations and maybe do some writing in the car.

Let me know what you think?

@awrgold

awrgold commented Jan 12, 2023

image

The results from a targeted attack were pretty interesting though

@synctext
Member Author

synctext commented Jan 12, 2023

quick remarks:

  • simplify to 2 nodes and track bugs with a manual artificial scenario. trivial example
  • no mixing push/pull in a single run
  • attacker passively join (at 25%), then act bad (75% in)
  • zoom into 10s attack
  • message: only 10s needed to kill the network; security is cardinal.
  • Epic attack scenario. nice network. Suddenly 3x as many Sybil appear all at once (e.g. 75% Sybils).

@awrgold
Copy link

awrgold commented Jan 12, 2023

Edited for a few necessary clarifications:

  1. 10s ATTACK: You say "10s attack" but I have no idea what you mean by this, I must have misheard you. 10 seconds to kill the network? The simulation is in discrete time steps. Can you elaborate?

  2. PUSH vs. PULL: Right now the number of permutations of experiments is at 8 - four experiments with two parameter changes (F=0 and F=1) but if we add Push vs. Pull across all experiments, we're now at 16 experiments.

I would assume, then, that the Push vs. Pull Experiment is not a comparison across all attack types - instead, we choose one attack (probably Targeted Sybil) and perform a Pull-based experiment, and then compare the results against the regular Targeted Sybil attack, which is under a Push scheme.

Currently, the writing is as follows:

With a pull architecture, peers are more autonomous and decide individually the speed of incoming information, if they trust another peer, or may randomly sample from discovered peers. Malicious nodes in this experiment try to push two messages when gossiping. With the pull architecture, only one message per gossip phase is accepted.

As such, the fact that only malicious nodes push 2 messages whereas benign nodes push only 1 is somewhat confusing to me, as it adds extra dimensions to the experiment and also means that malicious nodes have undermined the source code to modify the gossip scheme, which in itself is a pretty major attack vector. I feel like only one of the following should be true:

  • All nodes (including malicious) push 2 messages when gossiping in every experiment, so it's consistent across all experiments
  • All nodes (including malicious) push 1 message in every experiment.
  • A specific experiment is performed where malicious nodes push 2 messages and benign nodes push only 1, but push vs. pull only occurs in a single experiment and all other experiments ignore the pull architecture and gossip as normal.

Does this make sense?

  3. You say:

attacker passively join (at 25%), then act bad (75% in)

Do we want to change this for all attacks? Right now they're joining at 25% and begin attacking immediately.

  4. EPIC ATTACK: We used to have the "Persistent Sybil" attack, which represented of course a persistent threat. You say:

Epic attack scenario. nice network. Suddenly 3x as many Sybil appear all at once (e.g. 75% Sybils).

Do we change this Persistent Sybil attack scenario into an Epic Attack scenario? Or is "Epic Attack" a new kind of attack that we're also introducing?

@synctext
Member Author

10s ATTACK: You say "10s attack" but I have no idea what you mean by this, I must have misheard you. 10 seconds to kill the network?

The latest figure also shows that after 10 'rounds' of attacks the whole network is polluted and essentially destroyed #5313 (comment)
Sidenote: by making 1 round exactly 1 second, your results are easier to understand and interpret. Everything happens very fast; the idea of peer-to-peer machine learning is viable, but 1 attacker can destroy everything in 10 seconds. Great conclusion.

but if we add Push vs. Pull across all experiments, we're now at 16 experiments.

Always the smart secure option: pull. In just 1 experiment subsection you can elaborate on the push architecture and its consequences.

instead, we choose one attack (probably Targeted Sybil) and perform a Pull-based experiment, and then compare the results against the regular Targeted Sybil attack, which is under a Push scheme.

that sounds like a good storyline

As such, the fact that only malicious nodes push 2 messages whereas benign nodes push only 1 is somewhat confusing to me, as it adds extra dimensions to the experiment and also means that malicious nodes have undermined the source code to modify the gossip scheme,

In any distributed system each node can deviate from the protocol in a byzantine manner. So this experiment (could) explore the rate control that push versus pull provides. With a push architecture an attacker can send messages twice as fast. It is very understandable that a deliberately simple simulator only supports fixed message speeds. Feel free to explain that within your text and 'hack' around it by letting honest nodes send empty messages with 50% probability: "For reasons of keeping our code simple, reliable, and correct we use a single global message speed for both attackers and honest peers. Attackers use each message to attack, but honest peers obtain a lower effective messaging speed. With 50% probability they send an empty message."
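This rate-equalization hack could be sketched as follows (the function name and structure are illustrative, not the actual simulator API):

```python
import random

def next_message(is_attacker, payload="clicklog", rng=random):
    """One message slot per node per round (fixed global message speed).

    Attackers always fill their slot with a payload; honest peers fill
    it only with 50% probability, giving them half the effective rate.
    """
    if is_attacker or rng.random() < 0.5:
        return payload
    return None  # empty message: the slot is spent but nothing is sent

# Over many rounds the honest effective rate approaches half the attacker rate.
rng = random.Random(42)
honest = sum(next_message(False, rng=rng) is not None for _ in range(10_000))
attacker = sum(next_message(True, rng=rng) is not None for _ in range(10_000))
```

Over 10,000 rounds `attacker` fills every slot while `honest` fills roughly half of them, reproducing the intended 2:1 rate difference without the simulator ever supporting two message speeds.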

A specific experiment is performed where malicious nodes push 2 messages and benign nodes push only 1, but push vs. pull only occurs in a single experiment and all other experiments ignore the pull architecture and gossip as normal.

Yes! Except that pull should be the default. Push has bad security: you can easily flood networks with 1 GBit per second of spam.

Do we want to change this for all attacks? Right now they're joining at 25% and begin attacking immediately.

All figures. My intuition says that your results will be easier to interpret and understand when you avoid two things happening at once. But again, it's your thesis to write. I'm just paid to help and advise :-)

Do we change this Persistent Sybil attack scenario into an Epic Attack scenario?

Yes, that sounds like a more interesting experiment. 100 honest, 300 attackers or so.

@awrgold commented Jan 13, 2023

  1. 👍
  2. 👍
  3. 👍
  4. I understand that we can have byzantine processes, especially within the context of p2p networks, I just assumed we were focusing purely on a specific type of threat (e.g. sybil attackers), and that discussing the potential for byzantine faults and other sorts of attacks was outside the scope of the paper. Your explanation makes sense though, I'll integrate that.
  5. I'll make pull the default across all simulations, then?
  6. I'll have them join at 25% and attack at 75%. This is just different from what we had been doing, so it felt like a last-minute major change, but honestly it's not that big of a deal.
  7. 👍

@awrgold commented Jan 13, 2023

Reminder, though, that purely pull-based gossip means that nodes never learn about the network beyond the peers they already know, so we'd need some kind of peer-discovery mechanism. I think this is why I was doing the 50% push, 50% pull simulation: it allowed nodes to become more aware of the network, at the risk of being more susceptible to attack.

Just so that we're 100% clear: we're using pull-based gossip as the default setting for ALL simulations (except the push vs. pull comparison experiment) which means that I'll need to re-write the gossip section of the thesis. Correct? @synctext

Also, I realized that with pull-based gossip, the attackers will perform their attack but then will need to lie in wait until someone requests an update from them, which may dramatically slow them down. If we have attackers push, then that means all adversarial simulations involve a push architecture. How do we handle this?

@synctext (Member Author) commented Jan 19, 2023

Reviewing latest thesis .pdf

  • Correct, default == pull. This forced waiting is the essence of the protective mechanism that pull gives you.
  • recommended message semantics inside your thesis:
    • ClickLogs-REQUEST asks a peer to send their own clicklog plus the clicklog of their most recently seen incoming clicklog. Thus this request elicits gossip from a stranger. These are the requests within our pull architecture.
    • ClickLogs-RESPONSE is the response to a request for two clicklogs: send your own clicklog plus the most recently received clicklog. Thus this response includes an incoming ClickLog from a stranger.
  • Describe the interplay between peer discovery and the spam spreading of attacker nodes. Strategy: send lots of requests, become widely known by violating the protocol, wait for incoming requests, and when these victims contact you, respond with spam. This is an interesting experiment.
  • Comply with the official TUDelft format with this example cover page
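One way these pull semantics could be sketched (hypothetical class and field names, not the thesis implementation): a ClickLogs-RESPONSE returns the responder's own clicklog plus the most recent clicklog it received from a stranger, so third-party gossip spreads even though nobody pushes unsolicited messages.

```python
class Peer:
    """Minimal sketch of the pull-based ClickLog exchange."""

    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.own_clicklog = []              # (query, clicked_item) pairs
        self.last_stranger_clicklog = None  # most recent foreign clicklog seen

    def handle_request(self):
        """ClickLogs-RESPONSE: own clicklog plus latest stranger clicklog."""
        return (self.peer_id, list(self.own_clicklog)), self.last_stranger_clicklog

    def pull_from(self, other):
        """ClickLogs-REQUEST: ask `other` for two clicklogs and remember one."""
        own_log, stranger_log = other.handle_request()
        # Remember the most recent incoming clicklog so our next response
        # forwards it -- this is how peers learn of strangers under pull.
        self.last_stranger_clicklog = stranger_log or own_log

# Three peers: c learns of b's clicklog without ever contacting b.
a, b, c = Peer("a"), Peer("b"), Peer("c")
b.own_clicklog.append(("jazz", "song1"))
a.pull_from(b)
c.pull_from(a)
```

After the two pulls above, peer `c` holds `b`'s clicklog even though `c` only ever contacted `a`, illustrating how pull-based gossip still propagates knowledge of strangers.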

@awrgold commented Jan 19, 2023

Found another problem with my data from the experiments, will have to run them again because they were truncating values for some reason. Will upload graphics ASAP.

Update: Yet another bug found: only nodes that received gossip REQUESTS were updating during gossip, not the recipients of the gossip responses themselves, which therefore could not discover newly bootstrapped nodes, so attacks never happened. Fixed, running again.

@awrgold commented Jan 19, 2023

Ah, I forgot a question that I think is somewhat important: We are doing bootstrap at 25%, attack at 75% - however, this leaves little time to see the longer-term effects of the attack. What if we did bootstrap at 25%, attack at 50%?

@awrgold commented Jan 20, 2023

[Figure: baseline simulation results]

Baseline simulation does plateau relatively quickly, but some nodes start to move downwards over time. Initial guess is that this is either:

  1. due to the random noise inserted into the rankings, or
  2. due to luck by randomly sampling from a node with better rankings for a specific term

Not that I will prioritize this, but the baseline sim is super fast so I can do a sim 10x as long (or increase the number of request messages) to see what the culprit is.

More updates coming after fixing the issue mentioned here: #5313 (comment)

@awrgold commented Jan 21, 2023

[Figure: Targeted Sybil attack under push-based gossip]

PUSH: Targeted Sybil Attack with Push gossip scheme. Within 10 rounds of the attack, the entire network converges and plateaus.

I'm trying to think of a better way to visualize the infected nodes alongside the sybil nodes, but I don't want to offset them just so you can see that the green infected nodes are behind the sybil nodes. I also don't want the sybil nodes obscured by the infected nodes. I guess we can just note in the figure caption that sybil nodes overlap the infected nodes.

@synctext (Member Author) commented Jan 25, 2023

Literature Review and comments

  • Main issue: very thin thread and progression of storyline
  • Page 2, figure 1: set the storyline with a 2-column single picture of Sam, for instance.
  • Minimal repair with current events: biggest fraud of recent history and out of jail for Sam.
  • lack of scientific grounding: 12 citations, target 20+ Journal/Conference/reviewed/Arxiv
  • Section "VI DISCUSSION" becomes "DAO achievements and fundamental challenges". Taxonomy in a single glance. Systematic overview of past years of innovation. Key milestones identified. Scientific grounding: 1 or more scientific articles per entry/line/milestone. Table with overview and literature; see the brilliant With Honours example.
    • problem of non-scientific milestones. Blogs only
    • Government reports are a possible alternative: Bank of England, Digital Euro announcement, Anti-Facebook Coin Law, BIS.
    • Actual deployed technology milestones: leaderless dreams/first Chaum implementation, first public internet auction (pre-eBay, laser-pointer story), first Bitcoin block, first transaction between strangers, first smart contract, first token, first DAO creating a token, first scam, first darknet police arrest, etc.
  • MakerDAO Governance Section: cite the wealth of journal papers by legal scholars on the topic of governance.
  • "illegal practice of front-running in traditional finance." use as opening line for this section. This is another 3-5 citations 👍
  • The introduction is not an introduction: it focuses only on Bitcoin and DeFi. Leave the details for later sections. Write a new intro!

@awrgold commented Feb 1, 2023

Ars Technica: Massive Yandex code leak reveals Russian search engine’s ranking factors.
https://arstechnica.com/information-technology/2023/01/massive-yandex-code-leak-reveals-russian-search-engines-ranking-factors/

Material for future students?

@synctext (Member Author) commented Feb 15, 2023

🚀 GRADUATED 🚀
arXiv: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network
TUDelft repo: master thesis on a decentralised search engine to De-Google The Internet using donated smartphones
CODE: https://github.com/awrgold/G-Rank (Jupyter Notebook file)

@synctext (Member Author)

The above master thesis topic is also available as a final BSc thesis project for 5 students in Q4.

Create WEB3 search engine from scratch - De-Google The Internet

Numerous initiatives have tried to compete with the success of Google; none has succeeded. Using the latest scientific insights you will design a fully decentralised alternative to the Google search engine.
The Delft Blockchain Lab (DBL) is TU Delft’s initiative for research, education, and training in blockchain technology and trust in the internet. Our research heavily focusses on improving the efficiency of blockchains, self-sovereign identities and blockchain-powered marketplaces.
Our key blockchain project is Tribler. Tribler is a peer-to-peer file-sharing application where users can share and download digital material, without the need for a central operator. The main goal of Tribler is to provide a decentralised alternative to YouTube using Bittorrent. Efforts by several generations of students have resulted in a mobile implementation of Tribler, which we plan on expanding and releasing. As background, see dozens of articles on TorrentFreak.com about Tribler.
Mobile media personalised search, based on your watch history, is a key requirement for a mobile version of Tribler. It enables users to discover new, interesting content and helps media creators with reaching the right audience for their content. A search engine for videos, however, is a non-trivial problem. Learn-to-rank algorithms present a wealth of available media in an optimal ordering for the user to click. Decentralisation of such search engine algorithms has proven to make a difficult problem even harder. Ranked search engine results require the full list of available videos to be available locally. However, this requirement becomes unfeasible on mobile phones if there are millions of data elements, given the limited storage capacities and battery constraints of mobile phones. Simply put, TikTok has too many videos to search locally. Off-loading the computations to other users raises privacy concerns, since a user likely does not want to reveal their watch history to other (untrusted) users. The use of privacy-preserving, distributed machine learning in Tribler should provide a solution for media search on mobile phones.

Literature and background

First, read the Wikipedia entry on decentralised search engines. Your key literature for this task is the Delft paper which currently defines the state-of-the-art: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network. A more introductory blog post on learn-to-rank and dataset details. Early work from 2005 by Delft provides a simple and realistic experimental approach to the problem of media discovery and recommendation, you are required to understand the basic algorithm of semantic clustering (e.g. taste buddies). A paper from 2012 proposes a model where mobile phones use gossip learning to compute a linear model, without revealing local models and without storing the full data set. Another classic attempt from 2003 onwards is the decentralised YaCy search engine with a web crawler, complex hashing, and reverse word index.
Finally, your search engine will be implemented inside the Internet-deployed alternative to Big Tech platforms called the SuperApp. This Web3 open source software is written in Kotlin. This Web3 infrastructure is as decentralised as Bitcoin and Bittorrent. Instead of videos, your project is focused on the simpler case of music. Required background reading on the Spotify alternative: Fairness and Freedom for Artists: Towards a Robot Economy for the Music Industry. To keep this project viable, you can ignore security issues such as the Sybil attack and also privacy issues resulting from sharing your ClickLog with strangers on The Internet. This is left as future research, your focus is on an initial proof-of-principle.

Problem Description

PageRank is the defining centralised algorithm of Google. Understand existing algorithms and architectures for decentralised search engines. Understand the state-of-the-art algorithm within this area: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network. Contribute to the first hands-on proof-of-principle implementation of G-Rank.

Five research questions

See the initial placeholder implementation for keyword search already present. The following 5 sub-questions will be assigned to a single student each:

Dataset engineering. How to design and implement a user model which issues queries and selects one entry from the presented search engine results. Desired outcome: one search that your model issues will be for the unique words "Punk classical Jazz". This deliberately strange query must select one of the matching musical pieces marked with all three of these tags. Required scientific literature for reading: "A user browsing model to predict search engine click data from past observations". For the experimental part, use the existing scrape facility to duplicate the Creative Commons music from Pandacd. Critical input for the learn-to-rank algorithms is the popularity of each song or artist. Enhance your model with an existing dataset of 1.4 million artists along with their popularity estimates.
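A toy sketch of such a user model, assuming an illustrative catalog schema (`tags`, `popularity`) rather than the actual Pandacd scrape format: a click candidate must match every query keyword, and the click is drawn with probability proportional to popularity.

```python
import random

def simulated_click(query, catalog, rng=random):
    """Return the catalog entry a simulated user clicks for `query`.

    A song qualifies only if its tags contain every query keyword
    (case-insensitive); among qualifying songs the click is drawn with
    probability proportional to popularity.
    """
    terms = {t.lower() for t in query.split()}
    candidates = [s for s in catalog if terms <= {t.lower() for t in s["tags"]}]
    if not candidates:
        return None  # the user model issues no click for this query
    weights = [s["popularity"] for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

catalog = [
    {"title": "Anarchy Etude", "tags": ["punk", "classical", "jazz"], "popularity": 80},
    {"title": "Quiet Nocturne", "tags": ["classical"], "popularity": 95},
]
```

With this hypothetical catalog, the query "Punk classical Jazz" can only ever click "Anarchy Etude", since it is the sole entry carrying all three tags.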

Centroids-based clustering. Design and implement semantic clustering of the dataset. Based on the metadata, tags, and popularity you enhance the dataset and user model with Euclidean, Minkowski, Manhattan, or other distance-measuring mechanisms. Required background reading. Based on this work you will generate representative taste profiles (e.g. ClickLogs).
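The distance measures named above are all instances of one formula; a sketch, with taste profiles represented as plain numeric vectors purely for illustration:

```python
def minkowski(u, v, p=2):
    """Minkowski distance between two taste vectors.

    p=1 yields Manhattan distance and p=2 Euclidean distance. The flat
    numeric vectors stand in for tag/popularity profiles derived from
    ClickLogs; the exact feature representation is up to the student.
    """
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def nearest_centroid(vec, centroids, p=2):
    """Index of the cluster centroid closest to a taste profile."""
    return min(range(len(centroids)), key=lambda i: minkowski(vec, centroids[i], p))
```

Semantic clustering (taste buddies) then amounts to assigning each peer's profile to its nearest centroid and iteratively recomputing centroids, k-means style.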

Decentralised taste exchange. How to design and implement an overlay network to disseminate taste. Your task is to create a detailed design of the ClickLog exchanger within the G-Rank algorithm. As a starting point within the literature, read "Random Walks on the Click Graph".

Accurate learn-to-rank. How to design and implement unsupervised learn-to-rank heuristics with decent response time, at the cost of minor precision and recall losses. Your results need to appear within 2 seconds, which allows a reasonable amount of computation. Background literature: "Real time search on the web: Queries, topics, and economic value"

Real-time learn-to-rank. How to design and implement unsupervised learn-to-rank heuristics with fast response time, at the cost of significant precision and recall losses. Your results need to appear within 100 ms, so very little computation can be performed and pre-calculated indexing techniques must be used. Background literature: "Real time search on the web: Queries, topics, and economic value"

Together these 5 research questions lead to a complete design of a fully distributed search engine.
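As a starting point for the two learn-to-rank questions, a toy click-count ranker over gossiped ClickLogs might look like this (illustrative schema, not the actual G-Rank scoring function):

```python
from collections import Counter

def rank_results(query, catalog, clicklogs):
    """Rank catalog entries matching `query` by peers' click counts.

    `clicklogs` is a list of (query, clicked_title) pairs gathered via
    gossip; entries clicked for overlapping query terms rank first.
    """
    terms = set(query.lower().split())
    clicks = Counter(title for q, title in clicklogs
                     if terms & set(q.lower().split()))
    matching = [s for s in catalog if terms & {t.lower() for t in s["tags"]}]
    return sorted(matching, key=lambda s: clicks[s["title"]], reverse=True)
```

Because the click counts can be pre-aggregated per query term, the per-query work reduces to a filter plus a sort, which is what makes the tighter 100 ms budget plausible on a phone.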

Note: this is a challenging assignment that requires thorough understanding of specific scientific literature and ability to engineer algorithms. Recommended for Honor Students only.

@synctext (Member Author) commented Jul 9, 2023

state of the art by Google https://ai.googleblog.com/2023/06/retrieval-augmented-visual-language-pre.html

"A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens times k for self-attention. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items." https://www.deepmind.com/publications/perceiver-general-perception-with-iterative-attention
