Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix online dump downloads for the various modes and Closes #869 #872

Merged
merged 4 commits into from
Jun 16, 2024

Conversation

TheEaterr
Copy link
Contributor

It seems that there were changes in the way dumps are structured for wikidata making the way dump files were automatically downloaded not functional anymore.

Here are the fixes implemented, and the reasoning behind it :
DAILY : update the check for an available dump as status.txt now indicated "done:all". See : https://dumps.wikimedia.org/other/incr/wikidatawiki/20240419/status.txt I put startsWith but it might be preferable to check for done:all equality ? I don't know the various possible values but this version should work at least most of the time (compared to none right now).
JSON : The JSON dumps are currently downloaded from this folder : https://dumps.wikimedia.org/other/wikidata/ It is a bit of a mess including both full and incremental (?) dumps. I switch it to use https://dumps.wikimedia.org/wikidatawiki/entities/ (which seems to be equivalent to https://dumps.wikimedia.org/other/wikibase/wikidatawiki/, I don't know if anyone has any insight into which link should be preferred). Moreover, not all shown dates feature the full json dump so I added a check that the file is actually there (I have not found a status.txt or equivalent for the entitity dump)
SITES : Nothing done, haven't tested it but the file it is looking for is still there so it probably still works.
CURRENT / FULL : Both of these look for a -pages-meta-current.xml.bz2 / -pages-meta-history.xml.bz2 inside a dump (like https://dumps.wikimedia.org/wikidatawiki/20240401/). However these files do not seem to be there. Nevertheless, parts of these file ARE present (of form https://dumps.wikimedia.org/wikidatawiki/20240401/wikidatawiki-20240401-pages-meta-current1.xml-p1p441397.bz2 for example) but not the united file. If anyone knows why the full file isn't there I could try to fix the issue

Copy link
Member

@wetneb wetneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for updating this!

@TheEaterr
Copy link
Contributor Author

I made a new commit that uses head requests instead. I had to tap into the specific semantics of the WebRessourceFetcherImpl to change the type of the request, maybe a more broad set method should be added to the WebRessourceFetcher interface ?

Also I updated the test that failed by instead using the real - and not the mock - WebRessourceFetcher and changing around what is expected. I imagine using MockWebRessourceFetcher was intentional but for this specific use case it feels weird, as the part that broke is the online part and I feel like the test should catch that. Else it is probably possible to change the code around a bit to make it work with mocks. If checking online is desirable, it would also be logical to update the other tests.

@wetneb
Copy link
Member

wetneb commented May 13, 2024

Thanks a lot!

It looks like the new version of the test does not pass. In general, I would say it would be better to avoid making any requests to online services in the test suite, to keep it reliable. It is true that this prevents us from checking compatibility with the online service as it evolves, but in my opinion that's not the role of such a Java unit test suite. I don't know if this would have helped in this case, but it is also possible to change the mocking strategy and use a MockWebServer to test against a real but local web server serving pre-determined responses.

@TheEaterr
Copy link
Contributor Author

Hey so actually the test that failed wasn't the test I fixed that uses online, but another one that I missed when developing locally. The online test does work (as far as I understood the CI output), and I feel makes sense as the getAllJsonDumps online functionality is a core part of it, and the test should be expected to fail if upstream changes.
So if all test pass (hopefully) I think it's better this way but if you disagree I'll switch it back to an offline test.

Copy link

codecov bot commented Jun 16, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.11%. Comparing base (637b457) to head (b660991).
Report is 233 commits behind head on master.

Files Patch % Lines
...ikidata/wdtk/dumpfiles/wmf/JsonOnlineDumpFile.java 83.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #872      +/-   ##
============================================
+ Coverage     80.92%   83.11%   +2.18%     
- Complexity     2090     2763     +673     
============================================
  Files           149      183      +34     
  Lines          7278     8778    +1500     
  Branches        897     1163     +266     
============================================
+ Hits           5890     7296    +1406     
- Misses         1119     1169      +50     
- Partials        269      313      +44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wetneb
Copy link
Member

wetneb commented Jun 16, 2024

@TheEaterr thanks a lot! I see codecov is complaining about missing diff coverage. If you think this can be addressed, we can do that, otherwise happy to merge it as such, I don't want to stand in the way.

@TheEaterr
Copy link
Contributor Author

Code coverage was right, the function should've been tested. I added a test it should be good now!

@wetneb wetneb merged commit 4d8f717 into Wikidata:master Jun 16, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants