
Refactor crawling / analysis quite a bit #39

Merged
merged 7 commits into from
Dec 5, 2022
Conversation


@abulte abulte commented Nov 28, 2022

Fix #31

Sorry this is a pretty big PR but I encountered many problems while debugging #31.

Main changes:

  • add a check-resource CLI helper
  • do not return inside the finally block in datalake_service.process_resource; this was hiding a lot of errors
  • do not reuse the crawl aiohttp request for downloading; it does not make sense (the crawl could be a HEAD request, for example), and the two tasks need different timeouts
  • add a crucial deleted = FALSE condition in crawl.update_check_and_catalog; its absence was triggering useless (and pretty strange ⚫ ⭐ ) loops
  • fix tests that rightly failed because they no longer matched the intended behavior
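The return-in-finally pitfall is easy to reproduce: a return statement inside a finally block silently discards any exception still in flight. A minimal sketch of the problem (function names here are hypothetical, not the actual datalake_service code):

```python
def fetch():
    # stand-in for the real processing step; raises to simulate a failure
    raise ValueError("boom")

def process_resource_buggy():
    try:
        fetch()
    finally:
        # a return inside finally swallows any in-flight exception:
        # the ValueError above vanishes and "ok" is returned instead
        return "ok"

def process_resource_fixed():
    try:
        return fetch()
    finally:
        # cleanup only, no return, so exceptions propagate to the caller
        pass
```

Calling process_resource_buggy() returns "ok" even though fetch() raised, which is exactly how errors were being hidden; process_resource_fixed() lets the ValueError surface.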

You should be able to print(coucou) from pretty much anywhere now 🐔. There's still a lot of refactoring TBD but let's start with that!
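For the deleted = FALSE fix, the actual query lives in crawl.update_check_and_catalog and is not reproduced here; this is only an illustrative shape of the guard (the table name, column names other than deleted, and placeholders are assumptions) that keeps already-deleted catalog rows from matching over and over:

```python
# Illustrative only: not the actual udata-hydra SQL.
# Without the `AND deleted = FALSE` guard, rows belonging to deleted
# resources kept matching the update and were re-queued endlessly.
UPDATE_CHECK_QUERY = """
    UPDATE catalog
    SET last_check = $1
    WHERE resource_id = $2
    AND deleted = FALSE
"""
```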

@abulte abulte marked this pull request as draft November 28, 2022 15:36

abulte commented Nov 28, 2022

Currently wondering about the usage of a shared aiohttp.ClientSession across different URLs.

From the docs, it may not be a good idea:

Unless you are connecting to a large, unknown number of different servers over the lifetime of your application, it is suggested you use a single session for the lifetime of your application to benefit from connection pooling.

So maybe we could create a new session for each check_url call instead. That would also avoid juggling nested sessions between crawling and downloading (something I avoided here, but which is not easily testable).
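A per-check session along those lines might look like this (a sketch only: the timeout values, status handling, and overall structure are assumptions, not the actual udata-hydra code):

```python
import asyncio

import aiohttp


async def check_url(url: str) -> int:
    """Open a fresh session per checked URL, with task-specific timeouts."""
    # crawling only needs the response headers, so a short timeout is fine...
    crawl_timeout = aiohttp.ClientTimeout(total=15)
    # ...while downloading a possibly large file deserves a longer one
    download_timeout = aiohttp.ClientTimeout(total=5 * 60)

    async with aiohttp.ClientSession(timeout=crawl_timeout) as session:
        async with session.head(url, allow_redirects=True) as resp:
            status = resp.status

    if status == 200:
        # a separate request (not a reused crawl response) for the download
        async with aiohttp.ClientSession(timeout=download_timeout) as session:
            async with session.get(url) as resp:
                await resp.read()

    return status
```

Each session still pools connections for the redirects and the download of that one URL, while avoiding a single long-lived session shared across thousands of unrelated hosts.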


abulte commented Nov 28, 2022

Also

More complex cases may require a session per site, e.g. one for Github and other one for Facebook APIs. Anyway making a session for every request is a very bad idea.

A session contains a connection pool inside. Connection reusage and keep-alives (both are on by default) may speed up total performance.

😅

@abulte abulte marked this pull request as ready for review November 29, 2022 15:31
@abulte abulte requested a review from maudetes November 29, 2022 15:31

@maudetes maudetes left a comment


Thank you for the much-needed cleanup!

add a crucial deleted = FALSE condition in crawl.update_check_and_catalog; its absence was triggering useless (and pretty strange ⚫ ⭐ ) loops

I could not see this part?

I haven't taken the time to look into the best session pattern?

Review threads: udata_hydra/config.py (outdated, resolved), udata_hydra/utils/csv.py (resolved)

abulte commented Nov 30, 2022

I could not see this part?

Yes, sorry, I must have hit Ctrl-Z too many times at some point... I reimplemented the logic and, more importantly, tested it quite extensively; I hope you'll like it! 4850750

I haven't taken the time to look into the best session pattern?

Thoughts for later, never mind.

Review threads: udata_hydra/crawl.py (outdated, resolved), udata_hydra/crawl.py (resolved), tests/test_crawler.py (outdated, resolved)
@abulte abulte merged commit 5f13128 into main Dec 5, 2022
@abulte abulte deleted the error-management branch December 5, 2022 06:09

Successfully merging this pull request may close these issues.

Make any error during check_url visible