crowdsourcing Metadata #2455

Closed
synctext opened this issue Jul 10, 2016 · 20 comments · Fixed by #7112
Comments

@synctext
Member

Allow any user to improve the metadata. Examples of existing approaches:

image

image

image

Time slicing is too heavy for Tribler, out of scope. Just metadata.
image

@synctext synctext added this to the Backlog milestone Jul 10, 2016
@synctext synctext self-assigned this Jul 10, 2016
@lfdversluis
Contributor

A wiki-like approach? What would be editable within and outside Tribler?

@synctext
Member Author

Refocus this issue on torrent channels based on magnet links with rich metadata. The thesis goal is to deploy creation and search of rich metadata. A more ambitious goal would be to integrate voting to obtain trustworthy metadata.

Each channel can only be modified by the channel owner. Users vote on the quality of channels, and quality emerges from the Tribler user collective. It is easy to copy an entire channel, re-use quality content from a channel, and re-mix channels, which leads indirectly to crowdsourcing. We specifically avoid the problems of edit wars, collaborative editing, and undoing changes. This leads to a realistic master project. Only with an active community in place can we move to the next stage and experiment with more sophisticated crowdsourcing models.

The torrent channel is expanded with a new type: rich metadata channels. Channel owners indicate the content type of each magnet link. We still keep it simple: 1 magnet link has 1 rich metadata description. The first step is to create a simple editor inside Tribler that supports several content types, combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc.:
image
image
image
Next step is then creating a new rich metadata search community.

This master thesis puts the foundations in place for rich metadata. In the future we want to integrate collaborative tools. For instance, we assume scientific papers are available in Tribler and we can semi-automatically create survey papers. Still out of scope:
image

@synctext
Member Author

synctext commented Feb 12, 2018

Demonstrates the content type idea (radio button or drop down):
image
This combines enrichment of music, podcasts, movies, vloggers, scientific articles, etc.
A lot of metadata frameworks exist; you could select this very old one. Keep the data model very simple, never more than 4 fields per content type. We see how users like it, then expand based on real-world feedback. Please don't try to get it right the first time!
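
To make the "at most 4 fields per content type" rule concrete, here is a minimal Python sketch. The content type names, field names, and the `RichMetadata` class are all hypothetical and only illustrate the constraint; they are not an agreed schema.

```python
from dataclasses import dataclass, field

MAX_FIELDS = 4  # the per-content-type limit suggested above

# Hypothetical content types and field names, for illustration only.
CONTENT_TYPES = {
    "music": ("title", "artist", "album", "year"),
    "movie": ("title", "director", "year", "language"),
    "scientific_article": ("title", "authors", "journal", "year"),
}

# Every content type must respect the field limit.
assert all(len(names) <= MAX_FIELDS for names in CONTENT_TYPES.values())


@dataclass
class RichMetadata:
    """One rich metadata description for one magnet link (infohash)."""
    infohash: str
    content_type: str
    fields: dict = field(default_factory=dict)

    def __post_init__(self):
        allowed = CONTENT_TYPES.get(self.content_type)
        if allowed is None:
            raise ValueError(f"unknown content type: {self.content_type}")
        unknown = set(self.fields) - set(allowed)
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")


entry = RichMetadata("ab" * 20, "music", {"title": "Example", "artist": "Someone"})
```

Adding a content type later means adding one entry to the dictionary, which keeps the model easy to expand based on user feedback.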

@svanschooten

@devos50 Do you have time to walk through the stack this afternoon? I have a doctor's appointment in an hour, so I'll be at the lab after lunch. I've seen most of the code and looked into Qt last week, but it is still a bit fuzzy.

@synctext
Member Author

He is on vacation this week. But others should be able to help.

@svanschooten

First step: adding a field containing a single metadata entry (content type in this case, based on the YouTube VideoCategories API), and starting to structure the metadata information.
Started implementing a Metadata community, still a bit fuzzy on what is needed there.
tribler-content-types

@svanschooten

svanschooten commented Feb 20, 2018

Got my community running, peers can exchange messages (directed and broadcast), next step is to define the behavior of the community and the message types that can be sent.

@synctext should I look at a more branching metadata structure such as here (inspired by YouTube, The Pirate Bay and Dublin Core)?
The other option would be to flatten the structure and create a more generic metadata structure that does not follow the classic and modern frameworks. What would you advise?

Next step would be to structure the database and messages sent.

@devos50
Contributor

devos50 commented Feb 25, 2018

@svanschooten I would advise to keep it as simple as possible for now and not go wild with many different metadata types/complicated structures yet.

Also, nice to see that you have a basic community up and running!

@svanschooten

@devos50 welcome back! I agree, that is why I re-researched the desired structure. When I have something solid I'll implement a data structure that I can store in the database (and also start implementing the distribution mechanics).

Due to a family crisis I have not been able to come to the lab Friday and today, but I have done some reading and thinking on the categorization issue: most content management systems use a tree based structure to define archetypes, subtypes and properties, though I have come across some interesting work. Twitter has published a content categorization method that looks interesting, though it is not directly applicable to our case.

Based on these articles and papers I have opted to design a more 'flat' category structure, which I have documented on my repository.

@synctext
Member Author

synctext commented Feb 27, 2018

#1150 is about to start soon. First finish this quick MAX 4 week prototype, then think about how to build on top of scalable channels, when they're hopefully ready!

Write rich metadata on Trustchain? (e.g. so barter records, voting for channels, trading honesty, and metadata enrichment). Then we have 4 contexts of reputations to merge somewhat. Next step is to remove all non-blockchain data sync mechanisms in Tribler... Remove all storage in Dispersy #2778, all communities, and replace it with IPv8-based Trustchain storage.

Keep it simple-and-get-it-running-first-you-stupid model: only channel owner can do metadata enrichment :-)

@svanschooten

svanschooten commented Feb 27, 2018

Fixed my packing issue, added category based payload types and added them to the community communication.
First working UI is done, now to couple the UI to the community: This includes basic parsing of datatypes based on regex (simplest method for now).
Major overhaul to the ContentType and Category models to make them easily adaptable.

metadata_v1
The fields are dynamically added with the accessory parsing method, field name and label.
These are based on the fields defined in the Category models.
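
As an illustration of regex-based parsing driven by field definitions (name, label, parsing method), here is a minimal sketch. `FieldSpec`, the three example fields, and `parse_field` are hypothetical names, not the ones used in the actual Category models.

```python
import re
from typing import Callable, NamedTuple


class FieldSpec(NamedTuple):
    """Hypothetical field definition: name, UI label, regex, converter."""
    name: str
    label: str
    pattern: re.Pattern
    convert: Callable[[str], object]


def _to_seconds(text: str) -> int:
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


# Illustrative field definitions only.
TITLE = FieldSpec("title", "Title", re.compile(r".{1,255}"), str)
YEAR = FieldSpec("year", "Year", re.compile(r"\d{4}"), int)
DURATION = FieldSpec("duration", "Duration (mm:ss)",
                     re.compile(r"\d{1,3}:[0-5]\d"), _to_seconds)


def parse_field(spec: FieldSpec, raw: str) -> object:
    """Validate raw UI input against the field's regex, then convert it."""
    text = raw.strip()
    if spec.pattern.fullmatch(text) is None:
        raise ValueError(f"invalid value for {spec.label}: {raw!r}")
    return spec.convert(text)
```

With definitions like these, the UI can build its form by iterating over the specs of the selected category and calling `parse_field` on each input.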

Only problem is that the community now can't discover the other peers, so the test script for the community does not receive anything... The test script checks each second whether the peer list is non-empty, but it stays empty.
edit: stupid me used loop in reactor thread, derp...

@svanschooten

  • Implemented a lot more generics today, makes constructing metadata Categories much easier without a test for all types.
  • Created an endpoint in the REST interface to let the UI talk to the MetadataCommunity.
  • Added MetadataCommunity to the config and LaunchManyCore.
  • Community mini-test working with twisted

Next (@synctext ??): writing metadata to persistence layer, more UI screens (only on torrent add for now) or better metadata models?

@svanschooten

svanschooten commented Mar 4, 2018

Looking at most metadata models, they approach it from an unstructured data angle, they usually have a (semi-) fixed tree structure for fields, but no simple and straightforward approaches to storing it using a relational database:
This paper uses a generic field implementation with a mapping algorithm.
I also found this paper using old-school RDF.
A patent that I cannot understand....
RFC-ish description of how Dublin Core was designed.
These guys developed an XML storing mechanism.
This pretty decent explanation on how you should see and organise metadata.

I do not want to introduce more dependencies, but I think a noSQL storage method would be easiest?
Or maybe something generic like:

MetadataTable:
- (id) ID
- (string) infohash
- (string) title
- (string) category
- (string) content type

FieldsTable:
- (id) ID
- (id) metadata ID
- (string) name
- (string) value
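
The generic two-table layout above could be tried out with the standard-library `sqlite3` module. This is a sketch under assumptions: SQLite as the store, and table, column, and helper names (`add_metadata`, `get_fields`) that are made up for this example.

```python
import sqlite3

# Hypothetical schema mirroring MetadataTable and FieldsTable above.
SCHEMA = """
CREATE TABLE metadata (
    id           INTEGER PRIMARY KEY,
    infohash     TEXT NOT NULL UNIQUE,
    title        TEXT NOT NULL,
    category     TEXT NOT NULL,
    content_type TEXT NOT NULL
);
CREATE TABLE fields (
    id          INTEGER PRIMARY KEY,
    metadata_id INTEGER NOT NULL REFERENCES metadata(id),
    name        TEXT NOT NULL,
    value       TEXT NOT NULL
);
"""


def add_metadata(conn, infohash, title, category, content_type, fields):
    """Insert one metadata row plus its generic name/value fields."""
    cur = conn.execute(
        "INSERT INTO metadata (infohash, title, category, content_type) "
        "VALUES (?, ?, ?, ?)",
        (infohash, title, category, content_type),
    )
    metadata_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO fields (metadata_id, name, value) VALUES (?, ?, ?)",
        [(metadata_id, name, value) for name, value in fields.items()],
    )
    return metadata_id


def get_fields(conn, infohash):
    """Return the name/value fields stored for an infohash as a dict."""
    rows = conn.execute(
        "SELECT f.name, f.value FROM fields f "
        "JOIN metadata m ON m.id = f.metadata_id WHERE m.infohash = ?",
        (infohash,),
    )
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_metadata(conn, "ab" * 20, "Example", "audio", "music",
             {"artist": "Someone", "year": "1996"})
```

All values are stored as strings here, so any typing of fields would have to happen in the layer above the database.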

@synctext
Member Author

synctext commented Mar 5, 2018

MetadataTable:

  • (id) ID
  • (string) infohash

Why do you want to make the infohash more unique with an ID ? :-)

Consider adopting the distribution of scientific works as the test community for your entire thesis {or something else additionally; https://bt.etree.org}. Or create a tool and test how many hours it takes to put something like 400k scientific journals into your rich metadata, a.k.a. the Giga-Scraper idea. Next step: finish the prototype and create a .pdf seeding channel.

Just make a music table, movie, clip, series, vlog, ebook, adult entertainment, other images table, etc. Keep it simple for your 4-week prototype. Try to remove the content and subtype construct: just 1 category level. Simple like ID3, for instance; no fancy XML, NoSQL, or RDF. Just a fixed structure please, probably per content type. In 1996 Eric Kemp created ID3, the de facto framework for audio metadata. Strings are either space- or zero-padded. Unset string entries are filled using an empty string. ID3v1 is 128 bytes long. The table of fields below is copied from Wikipedia:

| Field | Length (bytes) | Description |
|---|---|---|
| header | 3 | "TAG" |
| title | 30 | 30 characters of the title |
| artist | 30 | 30 characters of the artist name |
| album | 30 | 30 characters of the album name |
| year | 4 | A four-digit year |
| comment | 28 or 30 | The comment. |
| zero-byte | 1 | If a track number is stored, this byte contains a binary 0. |
| track | 1 | The number of the track on the album, or 0. Invalid, if previous byte is not a binary 0. |
| genre | 1 | Index in a list of genres, or 255 |

ID3v1 pre-defines a set of genres denoted by numerical codes. Keeps it trivial...
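
To show how trivial such a fixed structure is to handle, here is a sketch that parses the 128-byte layout from the table with Python's `struct` module. The function name `parse_id3v1` and the returned dictionary keys are choices made for this example; Latin-1 decoding is an assumption, since ID3v1 does not specify an encoding.

```python
import struct

# header, title, artist, album, year, comment (30 bytes), genre = 128 bytes.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def _text(raw: bytes) -> str:
    """Strip zero- or space-padding; Latin-1 is an assumed encoding."""
    return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()


def parse_id3v1(tag: bytes) -> dict:
    if len(tag) != ID3V1.size:
        raise ValueError("an ID3v1 tag is exactly 128 bytes")
    header, title, artist, album, year, comment, genre = ID3V1.unpack(tag)
    if header != b"TAG":
        raise ValueError("missing 'TAG' header")
    track = None
    # A track number is stored only if the byte before it is a binary 0.
    if comment[28] == 0 and comment[29] != 0:
        track = comment[29]
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,
    }
```

A per-content-type table in Tribler could follow the same idea: a small, fixed set of fields with no nesting.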

Future: #3484 After this 4-week prototype is completed, explore more advanced architecture. We are prototyping using our Trustchain idea as the only storage paradigm in Tribler. It would contain: bandwidth barter transactions, voting for channels, trading of bandwidth coins #3326. Additionally, possibly rich metadata of channels; this thesis. Warning: this idea for yet another Tribler overhaul would take years to complete and get stable!

@svanschooten

The underlying data model has been simplified and abstracted more, to provide generic reading and setting handles. It is completely flat now.
Also a basic database and repository implementation is done, both with an in-memory and persistent layer.

TODO:

  • insert metadata into the database from the community.
  • extend the code with docs.
  • a UI implementation for showing the metadata has to be implemented.
  • tests

@svanschooten

svanschooten commented Jun 8, 2018

Refined thesis subject: Searching in enriched metadata using deduplicated tag clouds.

  • Language mixing is a major problem
  • Use tokenizing and stemming to reduce each word to its stem form, then link it in the tag cloud.
  • Use k-means clustering to find subclouds to create more rigid linking; overlap in clouds could indicate same/similar entries (partly removes duplication, word polymorphism and locality issues)
  • Cloud distribution during search by receiving k-linked clusters from near neighbours.
  • Fluid metadata structure, community defined data.
  • Voting on better tags will create weighted clouds, making deduplication easier.
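
The stemming and cloud-overlap ideas from the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer, Jaccard similarity is one possible overlap measure, and the k-means and voting steps are not implemented.

```python
import re

# Naive English suffixes; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tag_cloud(text: str) -> set:
    """Tokenize on alphanumeric runs, lowercase, and stem each token."""
    return {stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())}


def overlap(a: set, b: set) -> float:
    """Jaccard similarity: 1.0 for identical clouds, 0.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


same = overlap(tag_cloud("Live Mixes"), tag_cloud("live mixing"))
different = overlap(tag_cloud("Live Mixes"), tag_cloud("Jazz songs"))
```

Here `same` is higher than `different` because both "Mixes" and "mixing" reduce to the same stem, which is the kind of signal that could flag duplicate entries.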

image

@devos50
Contributor

devos50 commented Jun 8, 2018

It might be helpful for you to sync with @xoriole; your ideas seem to overlap somewhat.

@xoriole
Contributor

xoriole commented Jun 8, 2018

@devos50 It was quite a long discussion. Lots of ideas floating. We'll see how the design materializes.

@ichorid ichorid modified the milestones: Backlog, Next-next release Jun 12, 2020
@ichorid ichorid added this to To do in Metadata crowdsourcing via automation Jul 17, 2020
@ichorid
Contributor

ichorid commented Sep 28, 2021

related to #6217

@synctext
Member Author

synctext commented May 22, 2022
