
Refactor crawling / analysis quite a bit #39

Merged
merged 7 commits into from
Dec 5, 2022
Conversation


@abulte abulte commented Nov 28, 2022

Fix #31

Sorry this is a pretty big PR but I encountered many problems while debugging #31.

Main changes:

  • add a check-resource CLI helper
  • do not return inside the finally block in datalake_service.process_resource; this was hiding a lot of errors
  • do not reuse the crawl aiohttp request for downloading; it does not make sense (the crawl could be a HEAD request, for example), and the two tasks need different timeouts
  • add a crucial deleted = FALSE condition in crawl.update_check_and_catalog; its absence was triggering useless (and pretty strange ⚫ ⭐ ) loops
  • fix tests that rightly failed because they no longer matched the intended behavior
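The return-in-finally pitfall is easy to reproduce: a return statement inside a finally block silently discards any exception still in flight. A minimal sketch of the problem (function names here are hypothetical, not the actual datalake_service code):

```python
def fetch():
    # stand-in for the real processing step; raises to simulate a failure
    raise ValueError("boom")

def process_resource_buggy():
    try:
        fetch()
    finally:
        # a return inside finally swallows any in-flight exception:
        # the ValueError above vanishes and "ok" is returned instead
        return "ok"

def process_resource_fixed():
    try:
        return fetch()
    finally:
        # cleanup only, no return, so exceptions propagate to the caller
        pass
```

Calling process_resource_buggy() returns "ok" even though fetch() raised, which is exactly how errors were being hidden; process_resource_fixed() lets the ValueError surface.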

You should be able to print(coucou) from pretty much anywhere now 🐔. There's still a lot of refactoring TBD but let's start with that!
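For the deleted = FALSE fix, the actual query lives in crawl.update_check_and_catalog and is not reproduced here; this is only an illustrative shape of the guard (the table name, column names other than deleted, and placeholders are assumptions) that keeps already-deleted catalog rows from matching over and over:

```python
# Illustrative only: not the actual udata-hydra SQL.
# Without the `AND deleted = FALSE` guard, rows belonging to deleted
# resources kept matching the update and were re-queued endlessly.
UPDATE_CHECK_QUERY = """
    UPDATE catalog
    SET last_check = $1
    WHERE resource_id = $2
    AND deleted = FALSE
"""
```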

@abulte abulte marked this pull request as draft November 28, 2022 15:36

abulte commented Nov 28, 2022

Currently wondering about the usage of a shared aiohttp.ClientSession across different URLs.

From the docs, it may not be a good idea:

Unless you are connecting to a large, unknown number of different servers over the lifetime of your application, it is suggested you use a single session for the lifetime of your application to benefit from connection pooling.

So maybe we could create a new session for each check_url call instead. That would also avoid juggling nested sessions between crawling and downloading (something I avoided here, but which is not easily testable).
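A per-check session along those lines might look like this (a sketch only: the timeout values, status handling, and overall structure are assumptions, not the actual udata-hydra code):

```python
import asyncio

import aiohttp


async def check_url(url: str) -> int:
    """Open a fresh session per checked URL, with task-specific timeouts."""
    # crawling only needs the response headers, so a short timeout is fine...
    crawl_timeout = aiohttp.ClientTimeout(total=15)
    # ...while downloading a possibly large file deserves a longer one
    download_timeout = aiohttp.ClientTimeout(total=5 * 60)

    async with aiohttp.ClientSession(timeout=crawl_timeout) as session:
        async with session.head(url, allow_redirects=True) as resp:
            status = resp.status

    if status == 200:
        # a separate request (not a reused crawl response) for the download
        async with aiohttp.ClientSession(timeout=download_timeout) as session:
            async with session.get(url) as resp:
                await resp.read()

    return status
```

Each session still pools connections for the redirects and the download of that one URL, while avoiding a single long-lived session shared across thousands of unrelated hosts.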


abulte commented Nov 28, 2022

Also

More complex cases may require a session per site, e.g. one for Github and other one for Facebook APIs. Anyway making a session for every request is a very bad idea.

A session contains a connection pool inside. Connection reusage and keep-alives (both are on by default) may speed up total performance.

😅

@abulte abulte marked this pull request as ready for review November 29, 2022 15:31
@abulte abulte requested a review from maudetes November 29, 2022 15:31

@maudetes maudetes left a comment


Thank you for the much-needed cleanup!

add a crucial deleted = FALSE condition in crawl.update_check_and_catalog; its absence was triggering useless (and pretty strange ⚫ ⭐ ) loops

I could not see this part?

I haven't taken the time to look into the best session pattern?

Review threads: udata_hydra/config.py (outdated, resolved), udata_hydra/utils/csv.py (resolved)

abulte commented Nov 30, 2022

I could not see this part?

Yes, sorry, I must have hit Ctrl-Z too many times at some point... I reimplemented the logic and, more importantly, tested it quite extensively; I hope you'll like it! 4850750

I haven't taken the time to look into the best session pattern?

Thoughts for later, never mind.

Review threads: udata_hydra/crawl.py (outdated, resolved), udata_hydra/crawl.py (resolved), tests/test_crawler.py (outdated, resolved)
@abulte abulte merged commit 5f13128 into main Dec 5, 2022
@abulte abulte deleted the error-management branch December 5, 2022 06:09

Successfully merging this pull request may close these issues.

Make any error during check_url visible