add more details about list extraction
theScrabi committed Jul 2, 2019
1 parent b2e5672 commit 875f536
Showing 2 changed files with 61 additions and 19 deletions.
7 changes: 3 additions & 4 deletions docs/00_Prepare_everything.md
@@ -1,9 +1,8 @@
# Before You Start

These documents will guide you through the process of creating your own Extractor
service of which will enable NewPipe to access additional streaming services, such as the currently supported YouTube and SoundCloud.
The whole documentation consists of this page, which explains the general concept of the NewPipeExtractor, as well as our
[Jdoc](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/) setup.
These documents will guide you through the process of understanding or creating your own Extractor
service, which will enable NewPipe to access additional streaming services, such as the currently supported YouTube, SoundCloud and MediaCCC.
The whole documentation consists of this page, which explains the general concept of the NewPipeExtractor, and our [Jdoc](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/) setup.

__IMPORTANT!!!__ This is likely to be the worst documentation you have ever read, so do not hesitate to
[report](https://github.com/teamnewpipe/documentation/issues) if
73 changes: 58 additions & 15 deletions docs/01_Concept_of_the_extractor.md
@@ -57,19 +57,20 @@ class MyExtractor extends FutureExtractor {

Information can be represented as a list. In NewPipe, a list is represented by an
[InfoItemsCollector](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html).
A InfoItemCollector will collect and assemble a list of [InfoItem](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html).
For each item that should be extracted, a new Extractor must be created, and given to the InfoItemCollector via [commit()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-).
An InfoItemsCollector collects and assembles a list of [InfoItem](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItem.html).
For each item that should be extracted, a new Extractor must be created and given to the InfoItemsCollector via [commit()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/InfoItemsCollector.html#commit-E-).

![InfoItemsCollector_objectdiagram.svg](img/InfoItemsCollector_objectdiagram.svg)

If you are implementing a list for your service you need to extend InfoItem containing the extracted information
and implement an [InfoItemExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html),
that will return the data of one InfoItem.
If you are implementing a list in your service, you need to implement an [InfoItemExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/Extractor.html)
that is able to retrieve data for one and only one InfoItem. This extractor will then be _committed_ to the __InfoItemsCollector__ that can collect the type of InfoItems you want to generate.

A common implementation would look like this:
```
private MyInfoItemCollector collectInfoItemsFromElement(Element e) {
MyInfoItemCollector collector = new MyInfoItemCollector(getServiceId());
private SomeInfoItemCollector collectInfoItemsFromElement(Element e) {
// Read *Some* as a placeholder for a concrete type such as Stream or Channel,
// e.g. StreamInfoItemsCollector and ChannelInfoItemsCollector are provided by NewPipe
SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
for(final Element li : e.children()) {
collector.commit(new InfoItemExtractor() {
@@ -90,20 +91,21 @@ private MyInfoItemCollector collectInfoItemsFromElement(Element e) {
```
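
The commit pattern shown in this hunk can be tried out in isolation with a minimal, self-contained sketch. Everything in it is a simplified stand-in for illustration: `CollectorSketch`, `ItemExtractor`, `Item`, `Collector` and `collect()` are hypothetical names, not the real NewPipeExtractor classes. Recording a failing extractor as an error instead of aborting is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the real
// NewPipeExtractor classes, and all names here are hypothetical.
final class CollectorSketch {

    /** Knows how to read the data of exactly one item. */
    interface ItemExtractor {
        String getName() throws Exception;
        String getUrl() throws Exception;
    }

    /** Plain data holder built from one extractor. */
    static final class Item {
        final String name;
        final String url;

        Item(String name, String url) {
            this.name = name;
            this.url = url;
        }
    }

    /** Assembles items from committed extractors and records failures. */
    static final class Collector {
        private final List<Item> items = new ArrayList<>();
        private final List<Exception> errors = new ArrayList<>();

        void commit(ItemExtractor extractor) {
            try {
                items.add(new Item(extractor.getName(), extractor.getUrl()));
            } catch (Exception e) {
                // Assumption: one broken entry should not abort the whole list.
                errors.add(e);
            }
        }

        List<Item> getItems() { return items; }
        List<Exception> getErrors() { return errors; }
    }

    /** Commits one extractor per raw row; a row is {name, url}. */
    static Collector collect(List<String[]> rows) {
        Collector collector = new Collector();
        for (final String[] row : rows) {
            collector.commit(new ItemExtractor() {
                @Override
                public String getName() throws Exception {
                    if (row[0] == null) {
                        throw new Exception("row has no name");
                    }
                    return row[0];
                }

                @Override
                public String getUrl() {
                    return row[1];
                }
            });
        }
        return collector;
    }
}
```

Each extractor only knows about one row, while the collector owns the resulting list. That separation is what lets the same loop shape be reused for different item types.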

## InfoItems Encapsulated in Pages
## ListExtractor

When a streaming site shows a list of items, it usually offers some additional information about that list like its title, a thumbnail,
There is more to know about lists:

1. When a streaming site shows a list of items, it usually offers some additional information about that list like its title, a thumbnail,
and its creator. Such info can be called __list header__.

When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down.
2. When a website shows a long list of items it usually does not load the whole list, but only a part of it. In order to get more items you may have to click on a next page button, or scroll down.

This is why a list in NewPipe lists are chopped down into smaller lists called [InfoItemsPage](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html)s. Each page has its own URL, and needs to be extracted separately.
Both of these problems are solved by the [ListExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html), which takes care of extracting additional metadata about the list
and of chopping lists down into several pages, so-called [InfoItemsPage](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html)s.
Each page has its own URL and needs to be extracted separately.

Additional metadata about the list and extracting multiple pages can be handled by a
[ListExtractor](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html),
and its [ListExtractor.InfoItemsPage](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.InfoItemsPage.html).

For extracting list header information it behaves like a regular extractor. For handling `InfoItemsPages` it adds methods
For extracting list header information a `ListExtractor` behaves like a regular extractor. For handling `InfoItemsPages` it adds methods
such as:

- [getInitialPage()](https://teamnewpipe.github.io/NewPipeExtractor/javadoc/org/schabi/newpipe/extractor/ListExtractor.html#getInitialPage--)
@@ -117,5 +119,46 @@ such as:
The first page is handled specially because many websites, such as YouTube, load the first page of
items like a regular web page, but all the others via AJAX requests.

An InfoItemsPage itself has two constructors, which take these parameters:
- The __InfoItemsCollector__ of the list that the page should represent.
- A __nextPageUrl__, which represents the URL of the following page (may be null if no page follows).
- Optionally, __errors__, which is a list of Exceptions that may have happened during extraction.

Here is a simplified reference implementation of a list extractor that only extracts pages, but not metadata:

```
class MyListExtractor extends ListExtractor {
    ...
    private Document document;
    ...
    public InfoItemsPage<SomeInfoItem> getPage(String pageUrl)
            throws ExtractionException {
        SomeInfoItemCollector collector = new SomeInfoItemCollector(getServiceId());
        document = myFunctionToGetThePageHTMLWhatever(pageUrl);
        // Remember this part from the simple list extraction
        for (final Element li : document.children()) {
            collector.commit(new InfoItemExtractor() {
                @Override
                public String getName() throws ParsingException {
                    ...
                }
                @Override
                public String getUrl() throws ParsingException {
                    ...
                }
                ...
            });
        }
        return new InfoItemsPage<SomeInfoItem>(collector, myFunctionToGetTheNextPageUrl(document));
    }
    public InfoItemsPage<SomeInfoItem> getInitialPage() throws ExtractionException {
        // The document here was initialized by the fetch() function.
        return getPage(getTheCurrentPageUrl(document));
    }
    ...
}
```
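
From the caller's side, paging comes down to fetching the initial page and then following each next page URL until none is left. The following is a minimal, self-contained sketch of that loop. `PagingSketch`, `Page`, `PagedSource`, `collectAll()` and `fakeSource()` are hypothetical stand-ins that only mirror the shape of `getInitialPage()` and `getPage()`; they are not the real NewPipeExtractor classes, and the `maxPages` guard is an assumption added so that a misbehaving source cannot loop forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; all names are hypothetical.
final class PagingSketch {

    /** One page: its items plus the URL of the following page (null on the last page). */
    static final class Page {
        final List<String> items;
        final String nextPageUrl;

        Page(List<String> items, String nextPageUrl) {
            this.items = items;
            this.nextPageUrl = nextPageUrl;
        }

        boolean hasNextPage() {
            return nextPageUrl != null && !nextPageUrl.isEmpty();
        }
    }

    /** Minimal paging contract, loosely modelled on getInitialPage()/getPage(). */
    interface PagedSource {
        Page getInitialPage();
        Page getPage(String pageUrl);
    }

    /** Follows next page URLs until the last page or until maxPages pages were fetched. */
    static List<String> collectAll(PagedSource source, int maxPages) {
        List<String> all = new ArrayList<>();
        Page page = source.getInitialPage();
        all.addAll(page.items);
        int fetched = 1;
        while (page.hasNextPage() && fetched < maxPages) {
            page = source.getPage(page.nextPageUrl);
            all.addAll(page.items);
            fetched++;
        }
        return all;
    }

    /** In-memory fake with three pages, standing in for real network requests. */
    static PagedSource fakeSource() {
        final Map<String, Page> pages = new HashMap<>();
        pages.put("page1", new Page(Arrays.asList("a", "b"), "page2"));
        pages.put("page2", new Page(Arrays.asList("c"), "page3"));
        pages.put("page3", new Page(Arrays.asList("d"), null));
        return new PagedSource() {
            @Override
            public Page getInitialPage() {
                return pages.get("page1");
            }

            @Override
            public Page getPage(String pageUrl) {
                return pages.get(pageUrl);
            }
        };
    }
}
```

Calling `PagingSketch.collectAll(PagingSketch.fakeSource(), 10)` walks all three fake pages in order, while a smaller `maxPages` stops early.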
