Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify scrape refactor #1630

Merged
merged 19 commits into from
Sep 7, 2021
Merged

Conversation

WithoutPants
Copy link
Collaborator

  • Rolls ScrapedScene* graphql types into Scraped*
  • Combines scrape and stashbox query interfaces into single unified interface. The new interfaces accept a stash box index or a scraper id.
  • The single object interfaces also accept a query string, object id or object fragment,
  • The multi-object interfaces accept a list of object ids, and return a list of lists of results, in the same order as the input ids.
  • Marked the other existing interfaces as deprecated.
  • Refactored ScraperCache code. There are three main entry points now - one for string queries, one for querying from an object (given an id), and one for querying from a scraped fragment (ie like how performer query scrapes work). The underlying behaviour should be unchanged.

This PR lays the groundwork for some additional changes and PRs. I have a PR almost ready to submit that adds support for querying scene scrapers by string, but I need this PR tested and reviewed first.

@WithoutPants WithoutPants added the chore Pull requests for refactoring and admin work label Aug 10, 2021
@WithoutPants WithoutPants added this to the Version 0.9.0 milestone Aug 10, 2021
@gitgiggety
Copy link
Contributor

gitgiggety commented Aug 12, 2021

Edit: you can ignore this comment. I compared it against develop where it is working fine, but my "production" system is broken as well (build based on #1604). So this should be 59c6fe0 / #1632

Did some quick testing.
Scraping performer by URL seems to have an issue UI wise. When clicking the scrape button on the URL field it does show the loading indication, but then just shows the edit form again instead of (also) showing the scrape result. But when using "Scrape with" afterwards (doesn't matter which scraper I select) it does immediately show the scraping result of the "by URL" usage. Otherwise "Scrape with" just works fine. If there is no result of "by URL" loaded it will correctly show the intermediate performer search results list and when selecting a performer the scraping result dialog is shown.

Scrape by URL and using "scrape with" seem to be working fine for both scenes and galleries.

@bnkai
Copy link
Collaborator

bnkai commented Aug 12, 2021

From a quick testing there seem to be issues when using the tagger interface ( i merged the dev version to avoid the above mrntioned bug)
For the performer tagger /performers?disp=3

  • the images of the scraped performers are not showing

For the scenes scenes?disp=3

  • the match fingerprints seems to be broken
    it says xx match found where xx = scenes per page if no scenes are already tagged in the page or
    xx = scenes per page - tagged scenes if there already tagged scenes in the page

Copy link
Contributor

@gitgiggety gitgiggety left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a review, but I guess it's more like I learned from the code than actually having good feedback. Most comments are based on personal preferences which I would have done differently, so obviously not gonna make any rejections against them.

I guess the "biggest" issue is that ScrapedPerformer.image is still being used by the client while it is marked as deprecated. But there might be more which I didn't check.

pkg/api/resolver_query_scraper.go Show resolved Hide resolved
pkg/api/resolver_query_scraper.go Show resolved Hide resolved
pkg/api/resolver_query_scraper.go Outdated Show resolved Hide resolved
pkg/scraper/scrapers.go Outdated Show resolved Hide resolved
pkg/scraper/scrapers.go Outdated Show resolved Hide resolved
pkg/scraper/scrapers.go Outdated Show resolved Hide resolved
graphql/schema/types/scraper.graphql Show resolved Hide resolved
@WithoutPants
Copy link
Collaborator Author

Thanks for the testing and feedback. The bugs reported should now be addressed.

@bnkai
Copy link
Collaborator

bnkai commented Aug 19, 2021

  • The studio edit is broken. Updating a field and hitting save fails with error 422
{"errors":[{"message":"unknown field","path":["variable","input","stash_ids",0,"__typename"],"extensions":{"code":"GRAPHQL_VALIDATION_FAILED"}}],"data":null}
  • When editing -> scraping a performer even though the scraped image is shown and ticked in the scrape popup it is not updated/saved when hitting apply and then save

@WithoutPants
Copy link
Collaborator Author

Studio bug was an existing latent bug. Both issues should now be fixed.

@gitgiggety
Copy link
Contributor

Studio bug was an existing latent bug

I'm guessing I should hold my hand up for that? Most likely a regression from the Studio edit/details refactor. Although I seem to be able to edit/save a studio just fine on the develop branch. (But issue being stash box related I might not be able to reproduce because of not having stash box).

@WithoutPants
Copy link
Collaborator Author

It doesn't matter - I've introduced many very similar bugs from similar code changes. The bug is only reproducible if you have a stash-id on the studio.

@gitgiggety
Copy link
Contributor

gitgiggety commented Aug 20, 2021

Found a regression I think. On develop I can open a newly scanned scene, fill in some of the fields and run a scraper (in my case the SARJ-LLC Python scraper) and it would work fine and show the result. But on this branch the scraper throws an error. Don't have time to investigate this at the moment, but I guess the filled in data isn't submitted and it uses the data from the database which obviously is still empty. When saving and afterwards executing the scene scraper it does work fine (but I guess changing data in between will also send the old / saved data to the scraper instead of the unsaved data).

@gitgiggety
Copy link
Contributor

Found a regression I think. On develop I can open a newly scanned scene, fill in some of the fields and run a scraper (in my case the SARJ-LLC Python scraper) and it would work fine and show the result. But on this branch the scraper throws an error. Don't have time to investigate this at the moment, but I guess the filled in data isn't submitted and it uses the data from the database which obviously is still empty. When saving and afterwards executing the scene scraper it does work fine (but I guess changing data in between will also send the old / saved data to the scraper instead of the unsaved data).

Checked the code. For scenes, galleries and movies there is a <type>_input field in the GraphQL scheme, but this is never used and only <type>_id is set. For performers however it's the other way around, only performer_input is set but not performer_id.

@WithoutPants
Copy link
Collaborator Author

This regression is related to the way scene scrapers work, which is different to the way the performer scraper works, and I believe I've painted myself into a corner a bit.

In the original scene scraping code, the "Scrape With..." scrape interface accepts SceneUpdateInput which is the current values from the scene edit page. The xpath and json scrapers perform the scrape by looking up the existing scene, populating the query URL (replacing placeholders with the scene's hash, filename, title and url values), and scraping via URL. It was doing this by grabbing the scene ID from the update input object. The script scrapers would instead accept the scene input and pass this onto the script.

This is distinct from the performer scrapers, where the performer fragment input is assumed to be the from a scraped performer search result.

In preparation of the scene scraper query change, I changed the way that scene scraping works. The scrapeSingleScene interface accepts either a scene id or a scraped scene fragment - not the scene update input.

I'm thinking that this may be a necessary regression in order to make it consistent with the performer scraper code, and allow scene scrape querying. This means that all scene fragment scrapers would operate on current scene state.

@gitgiggety
Copy link
Contributor

gitgiggety commented Aug 21, 2021

Okay, good point about performer scraper having the search form when using "scrape with".

But could you explain me then what future plans you have? :) Will scenes and galleries get a search form too, where you can enter title and date? Then I would be happy, as currently I use my SARJ-LLC scraper based on (just) those fields which works pretty nice as I don't have to look up the URL. Just fill in the title and date (which are part of filename and thus title after scanning), and start the scraper which will fill in all fields including URL etc.

@bnkai
Copy link
Collaborator

bnkai commented Aug 22, 2021

I'm thinking that this may be a necessary regression in order to make it consistent with the performer scraper code, and allow scene scrape querying. This means that all scene fragment scrapers would operate on current scene state.

The same already goes for JSON scrapers. They work with what is in the DB not by input , so that wouldnt be that much of a regression

EDIT

  • The performer image regression i reported seems to be fixed
  • The match fingerprints function seems to have issues. While with a dev build i get 22 matches of scenes in a specific page with this build i get 0

@gitgiggety gitgiggety mentioned this pull request Aug 23, 2021
@WithoutPants
Copy link
Collaborator Author

But could you explain me then what future plans you have? :) Will scenes and galleries get a search form too, where you can enter title and date? Then I would be happy, as currently I use my SARJ-LLC scraper based on (just) those fields which works pretty nice as I don't have to look up the URL. Just fill in the title and date (which are part of filename and thus title after scanning), and start the scraper which will fill in all fields including URL etc.

The WIP branch is here: https://github.com/WithoutPants/stash/tree/scrape-scene-by-name
It supports scene querying only currently, but I imagine expanding it to galleries won't be too difficult. From memory I think it auto fills the query term based on the current scene title.

  • The match fingerprints function seems to have issues. While with a dev build i get 22 matches of scenes in a specific page with this build i get 0

Still looks like it works ok for me. Can you give me some example hashes? Assuming they're in the database you provided forever ago, I should be able to reproduce using that.

@gitgiggety
Copy link
Contributor

The WIP branch is here: https://github.com/WithoutPants/stash/tree/scrape-scene-by-name
It supports scene querying only currently, but I imagine expanding it to galleries won't be too difficult

Thanks for the link. Ran it and checked the code. Seems like it adds a new "scraping" method based on name to provide a list of hits, correct? (So identical to performers name based search / scraping)
Sadly not what I was hoping for, in getting the fragment as filled in or edited. Anyways, I either have to get used to saving first then, or just run the filename parser beforehand. (Helped by the fact that it can now parse not organized scenes only). But I guess that leaves the fragment based scraping of the new API/input a bit "useless" as it isn't used by the web UI?

But of course this discussion doesn't really affect this PR. Especially since you both stated JSON and xpath scrapers already receive the fragment from the database and not the edited values. So there was already a difference between those which is now gone and can be seen as a bugfix.

@bnkai
Copy link
Collaborator

bnkai commented Aug 24, 2021

Still looks like it works ok for me. Can you give me some example hashes? Assuming they're in the database you provided forever ago, I should be able to reproduce using that.

I managed to replicate that using a single scene. Even if you have the old db it doesnt have the phashes :-(

Steps to reproduce.

  • Import the following scene
    export20210824-180557.zip
  • Go to scene tagger and filter that scene by name or phash or whatever eg hash 6be0e9c2df04fd73
  • Click on match fingerprints
  • Retry using another build (dev vs pr) and refreshing the page

Dev build returns a result while the build from here doesnt

All the matches i had with missing results seem to be matches by phash , don't know if that matters and didnt have the time to look at the code.

@bnkai
Copy link
Collaborator

bnkai commented Aug 26, 2021

Can you merge/rebase with current dev? I think there are some bugs left that were already fixed in the dev version (like the duration of the scenes that match using the scene tagger) and the studio edit bug was resolved differently?

@WithoutPants
Copy link
Collaborator Author

Can you merge/rebase with current dev? I think there are some bugs left that were already fixed in the dev version (like the duration of the scenes that match using the scene tagger) and the studio edit bug was resolved differently?

Done, but I don't think any of the latest changes would have fixed anything, so you might be seeing new bugs.

@bnkai
Copy link
Collaborator

bnkai commented Aug 29, 2021

Can you merge/rebase with current dev? I think there are some bugs left that were already fixed in the dev version (like the duration of the scenes that match using the scene tagger) and the studio edit bug was resolved differently?

Done, but I don't think any of the latest changes would have fixed anything, so you might be seeing new bugs.

Yeah new bug confirmed.
Using the same procedure and scene as above (that now matches ok)

PR Build -> Duration uknown
Dev Build -> Duration matches 12/12 fingerprint(s)

The plain search query from the tagger has also the same issue

The queryStashBoxScene query from the dev build returns a duration field while the scrapeSingleScene/scrapeMultiScenes don't.

@WithoutPants
Copy link
Collaborator Author

The queryStashBoxScene query from the dev build returns a duration field while the scrapeSingleScene/scrapeMultiScenes don't.

Thanks. This should now be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Pull requests for refactoring and admin work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants