fix repeated github id crawling #7
Conversation
@@ -66,11 +66,11 @@ def crawl(platform: ICrawler) -> Generator[List[dict], None, None]:
     :param platform: which platform to crawl, with what credentials
     """
     logger.debug(f"START block: {platform.name} - initial state: {platform.state}")
-    for success, result_chunk, state in platform.crawl(platform.state):
+    for success, block_chunk, state in platform.crawl():
Tried to rename things so you understand what partial things belong to what in the bigger context. Also removed the arg (it's optional across the current crawlers) - we already set the state when initialising the crawler, so this removes a redundancy that might confuse someone later on.
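A minimal sketch of the shape this implies (class and attribute names here are assumptions, not the project's actual API): state lives on the crawler from initialisation, so passing it into crawl() again only duplicated self.state.

```python
from typing import Generator, List, Tuple

class Crawler:
    """Hypothetical stand-in for a crawler: state lives on the instance."""

    def __init__(self, state: dict = None):
        # state is set once here, so crawl() needs no state argument
        self.state = state or {}

    def crawl(self) -> Generator[Tuple[bool, List[dict], dict], None, None]:
        # previously the caller did crawl(self.state) -- the argument merely
        # duplicated self.state and could drift out of sync with it
        yield True, [{"id": 1}], self.state
```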
-        yield result_chunk
+        logger.info(f"got {len(block_chunk)} results from {platform} "
+                    f"- first repo id: {next(iter(block_chunk), {}).get('id', None)}")
+        yield block_chunk
Make it visible from skimming the logs that we aren't crawling the same results/ids over and over across blocks.
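The `next(iter(chunk), {}).get('id', None)` idiom in the new log line can be isolated like this (the helper name is made up for illustration):

```python
def first_repo_id(block_chunk):
    """Return the 'id' of the first result in a chunk, or None for an
    empty chunk -- this is what the new log line prints per block, so
    repeated ids across blocks stand out when skimming the logs."""
    return next(iter(block_chunk), {}).get("id", None)
```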
@@ -46,7 +46,7 @@ def __init__(self, base_url, state=None, auth_data=None, user_agent=None, query=
         self.requests.headers.update(
             {"Authorization": f"Bearer {auth_data['access_token']}"})

-    def handle_ratelimit(self, response):
+    def handle_ratelimit(self, response=None):
Make response optional so that we can still apply default throttling when we crash.
as in, a single chunk crashed for whatever reason in the current block, and we still continue with the next chunk
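A sketch of the decision this enables (function name is hypothetical; the header names are GitHub's real rate-limit headers, but the exact logic is an assumption): with a response we can read the headers, and without one -- e.g. the chunk crashed before any response existed -- we fall back to a fixed default throttle.

```python
import time

def ratelimit_sleep_s(response=None, default_s=5.0):
    """Sketch of the throttle decision only (the real method also sleeps).

    No response -> crashed chunk, apply the default throttle.
    Response with budget left -> no sleep needed.
    Response with budget exhausted -> sleep until the reset timestamp.
    """
    if response is None:
        return default_s  # crashed chunk: no headers to inspect
    remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = float(response.headers.get("X-RateLimit-Reset", 0))
    return max(reset_at - time.time(), default_s)
```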
@@ -91,7 +94,8 @@ def set_state(cls, state: dict = None) -> dict:
         state['empty_page_cnt'] = state.get('empty_page_cnt', 0)  # indicate that exploration has reached an end
         if not isinstance(state.get(BLOCK_KEY_IDS, None), list):
             state[BLOCK_KEY_IDS] = []  # list when we have known indexes to use as IDs
-        state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID, 0)  # without known IDs, we start from the lowest ID number
+        state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID,
+                                             0)  # without known IDs, we start from the lowest ID number
happy autoformat -_-
        indexes = state[BLOCK_KEY_IDS][i:i + GITHUB_QUERY_MAX]
        if len(indexes) > 0:
            state['current'] = indexes[-1]
        else:
            # otherwise explore incrementally within from/to
            i += state[BLOCK_KEY_FROM_ID]
this was the missing secret sauce to follow block instructions
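The fix in isolation (key names follow the diff; the helper name and the range length are assumptions): without known IDs, the chunk offset `i` has to be shifted by `from_id`, otherwise exploration always starts at 0.

```python
GITHUB_QUERY_MAX = 100  # assumed chunk size per query

def chunk_ids(state, i):
    """Return the IDs to query for this chunk: slice the known IDs when we
    have them, otherwise explore incrementally starting from 'from_id'."""
    known = state.get("ids", [])
    if known:
        return known[i:i + GITHUB_QUERY_MAX]
    # the previously-missing offset: without it, every block restarted at 0
    start = i + state.get("from_id", 0)
    return list(range(start, start + GITHUB_QUERY_MAX))
```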
                yield False, [], state
                continue  # skip to next crawl
This is a mess to look at in diff view; probably pull it and look at the code. Hopefully it's a bit clearer than what it was.
            if len(repos) == 0:
                state['empty_page_cnt'] += 1

            yield True, repos, state
            self.handle_ratelimit(response)
this handle_ratelimit is now run for response.ok == True AND False (but not when the chunk crashed)
-                logger.exception(f"(skipping block part) github crawler crashed")
+                logger.exception(f"(skipping block chunk) github crawler crashed")
                 yield False, [], state
                 self.handle_ratelimit()
handle_ratelimit when the chunk crashed and we don't have a response
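Putting the two comments above together, the per-chunk control flow looks roughly like this (`fetch` and `handle_ratelimit` are hypothetical stand-ins for the real methods): ratelimit handling runs with the response whether it was ok or not, and without a response when the chunk crashed.

```python
import logging

logger = logging.getLogger("crawler")

def crawl_block(fetch, state, handle_ratelimit):
    """Sketch of one chunk of a block crawl (names are illustrative)."""
    try:
        response = fetch()
        repos = getattr(response, "repos", [])
        yield response.ok, repos, state
        handle_ratelimit(response)  # runs whether response.ok is True or False
    except Exception:
        logger.exception("(skipping block chunk) github crawler crashed")
        yield False, [], state
        handle_ratelimit()  # crashed: no response object to pass along
```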
-                logger.exception(f"(skipping block part) github crawler crashed")
+                logger.exception(f"(skipping block chunk) github crawler crashed")
                 yield False, [], state
                 self.handle_ratelimit()

            state = self.set_state(state)  # update state for next round
the only place that set_state is called during crawling, covering all scenarios: crashed, not OK, or OK
            logger.info(f"ratelimit exceeded for {self}, sleeping for {sleep_s} seconds...")
            time.sleep(sleep_s)
        else:
            super().handle_ratelimit()
conform the gitlab-specific ratelimit in the same way that I updated github to work (i.e. not needing a response object arg)
- gitea doesn't need it, because it doesn't override the superclass method.
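A sketch of the inheritance shape this describes (class names, return values, and the backoff condition are assumptions; `RateLimit-Remaining` is a real GitLab response header): the subclass handles its platform-specific case and otherwise defers to the shared default, which no longer needs a response object.

```python
class BaseCrawler:
    def handle_ratelimit(self, response=None):
        # shared default throttling; works with or without a response object
        return "default-throttle"

class GitlabCrawler(BaseCrawler):
    def handle_ratelimit(self, response=None):
        # GitLab-specific backoff when its headers say the budget is spent,
        # otherwise fall through to the superclass default
        if response is not None and response.headers.get("RateLimit-Remaining") == "0":
            return "gitlab-backoff"
        return super().handle_ratelimit()
```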
rm psycopg from requirements, add 5s extra sleep in github,
Did some refactoring to get rid of the continue statement, which probably made the crawl method more confusing to look at. But apparently, last time I refactored, I forgot to actually make use of the "from_id" key when building the encoded ids (because I was focusing on trying to make the "when we have known IDs" part work) - so it always started from 0 =(