
fix repeated github id crawling #7

Merged
merged 4 commits into master from fix/github_many_crawlers on Jul 4, 2021

Conversation

geoheelias (Contributor):

Did some refactoring to get rid of the continue statement, which probably made the crawl method more confusing to look at. But apparently, last time I refactored, I forgot to actually make use of the "from_id" key when building the encoded ids (because I was focusing on making the "when we have known IDs" part work), so it always started from 0 =(
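
For context, a toy illustration of what the fix changes (the constant values and numbers below are made up; only the use of BLOCK_KEY_FROM_ID mirrors the diff further down):

BLOCK_KEY_IDS = 'ids'              # assumed literal values for the state keys
BLOCK_KEY_FROM_ID = 'from_id'
GITHUB_QUERY_MAX = 100             # assumed chunk size

state = {BLOCK_KEY_IDS: [], BLOCK_KEY_FROM_ID: 1000}   # block: no known ids, start at 1000
i = 0                                                   # chunk offset within the block

# before the fix, the ids for this chunk were built from i alone, i.e. 0..99 for every block;
# the fix adds the block's from_id first:
i += state[BLOCK_KEY_FROM_ID]                           # -> 1000
chunk_ids = list(range(i, i + GITHUB_QUERY_MAX))        # 1000..1099 instead of 0..99
# (the real crawler then encodes these ids for its query; that step is omitted here)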

geoheelias requested a review from puhoy on July 2, 2021 17:56
geoheelias self-assigned this on Jul 2, 2021
@@ -66,11 +66,11 @@ def crawl(platform: ICrawler) -> Generator[List[dict], None, None]:
:param platform: which platform to crawl, with what credentials
"""
logger.debug(f"START block: {platform.name} - initial state: {platform.state}")
for success, result_chunk, state in platform.crawl(platform.state):
for success, block_chunk, state in platform.crawl():
geoheelias (Contributor, Author):

Tried to rename things so you can tell which partial things belong to what in the bigger context. Also removed the arg (it's optional across the current crawlers); we already set the state when initialising the crawler, so this removes a redundancy that might confuse someone later on. (See the sketch after this hunk.)

yield result_chunk
logger.info(f"got {len(block_chunk)} results from {platform} "
f"- first repo id: {next(iter(block_chunk), {}).get('id', None)}")
yield block_chunk
geoheelias (Contributor, Author) on Jul 2, 2021:

Make it visible from skimming the logs that we aren't crawling the same results/ids over and over across blocks.
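
For reference, a minimal sketch of the call shape this relies on (ExampleCrawler is a made-up stand-in for an ICrawler implementation; only the generator signature mirrors the real crawlers):

class ExampleCrawler:                        # made-up stand-in, not the real class
    def __init__(self, state=None):
        self.state = self.set_state(state)   # state is fixed once, at init

    @classmethod
    def set_state(cls, state=None):
        return state or {}

    def crawl(self, state=None):             # the arg stays optional across crawlers...
        state = state or self.state          # ...but callers now just rely on self.state
        yield True, [], state                # (success, block_chunk, state)

# caller side, as in the hunk above:
# for success, block_chunk, state in platform.crawl():
#     ...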

@@ -46,7 +46,7 @@ def __init__(self, base_url, state=None, auth_data=None, user_agent=None, query=
self.requests.headers.update(
{"Authorization": f"Bearer {auth_data['access_token']}"})

def handle_ratelimit(self, response):
def handle_ratelimit(self, response=None):
geoheelias (Contributor, Author):

Make response optional so that we can still apply default throttling when we crash.

geoheelias (Contributor, Author):

As in: a single chunk crashed for whatever reason in the current block, and we still continue with the next chunk.
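
Roughly, the intent looks like this (a sketch, not the actual method; the class name, header names, and the 5-second default are assumptions, the latter loosely echoing the "add 5s extra sleep in github" commit listed near the end):

import time

class GithubCrawlerSketch:                   # made-up stand-in for the real crawler class
    DEFAULT_THROTTLE_S = 5                   # assumed default throttle

    def handle_ratelimit(self, response=None):
        if response is None:
            # the chunk crashed before we got a response: still throttle a bit
            time.sleep(self.DEFAULT_THROTTLE_S)
            return
        # with a response, GitHub's rate-limit headers can drive the sleep
        if response.headers.get("X-RateLimit-Remaining") == "0":
            reset_at = int(response.headers.get("X-RateLimit-Reset", time.time()))
            time.sleep(max(0, reset_at - time.time()))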

@@ -91,7 +94,8 @@ def set_state(cls, state: dict = None) -> dict:
state['empty_page_cnt'] = state.get('empty_page_cnt', 0) # indicate that exploration has reached an end
if not isinstance(state.get(BLOCK_KEY_IDS, None), list):
state[BLOCK_KEY_IDS] = [] # list when we have known indexes to use as IDs
state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID, 0) # without known IDs, we start from the lowest ID number
state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID,
0) # without known IDs, we start from the lowest ID number
geoheelias (Contributor, Author):

happy autoformat -_-

indexes = state[BLOCK_KEY_IDS][i:i + GITHUB_QUERY_MAX]
if len(indexes) > 0:
state['current'] = indexes[-1]
else:
# otherwise explore incrementally within from/to
i += state[BLOCK_KEY_FROM_ID]
geoheelias (Contributor, Author) on Jul 2, 2021:

This was the missing secret sauce needed to follow the block instructions.

yield False, [], state
continue # skip to next crawl
geoheelias (Contributor, Author):

This is a mess to look at in diff view; probably best to pull it and look at the code. Hopefully it's a bit clearer than what it was.

if len(repos) == 0:
state['empty_page_cnt'] += 1

yield True, repos, state
self.handle_ratelimit(response)
geoheelias (Contributor, Author):

This handle_ratelimit is now run for both response.ok == True and response.ok == False (but not when the chunk crashed).

logger.exception(f"(skipping block part) github crawler crashed")
logger.exception(f"(skipping block chunk) github crawler crashed")
yield False, [], state
self.handle_ratelimit()
geoheelias (Contributor, Author):

handle_ratelimit for when the chunk crashed and we don't have a response.

logger.exception(f"(skipping block part) github crawler crashed")
logger.exception(f"(skipping block chunk) github crawler crashed")
yield False, [], state
self.handle_ratelimit()

state = self.set_state(state) # update state for next round
geoheelias (Contributor, Author):

This is the only place set_state is called during crawling, covering every scenario: crashed, not OK, or OK.
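
Putting the comments above together, the per-chunk control flow now looks roughly like this (a sketch only; fetch_chunk and parse_repos are hypothetical stand-ins for the crawler's own request and parsing code):

import logging

logger = logging.getLogger(__name__)

def crawl_block(self, state, chunks):                    # sketch of the method, not the real code
    for chunk in chunks:
        try:
            response = self.fetch_chunk(chunk)           # hypothetical helper: one API request
            if response.ok:
                repos = self.parse_repos(response)       # hypothetical parsing helper
                if len(repos) == 0:
                    state['empty_page_cnt'] += 1
                yield True, repos, state
            else:
                yield False, [], state
            self.handle_ratelimit(response)              # runs for ok == True AND False
        except Exception:
            logger.exception("(skipping block chunk) github crawler crashed")
            yield False, [], state
            self.handle_ratelimit()                      # chunk crashed, no response: default throttling
        state = self.set_state(state)                    # the only set_state call while crawling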

logger.info(f"ratelimit exceeded for {self}, sleeping for {sleep_s} seconds...")
time.sleep(sleep_s)
else:
super().handle_ratelimit()
geoheelias (Contributor, Author) on Jul 2, 2021:

Conform the GitLab-specific ratelimit handling in the same way that I updated GitHub to work (i.e. not needing a response object arg), as sketched below.

  • gitea doesn't need it, because it doesn't override the superclass method.
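
A sketch of what the conformed GitLab override might look like (the base class, header names, and fallback sleep are assumptions; only the log line and the super() fallback mirror the hunk above):

import logging
import time

logger = logging.getLogger(__name__)

class BaseCrawlerSketch:                                 # stand-in for the shared base crawler
    def handle_ratelimit(self, response=None):
        time.sleep(5)                                    # assumed default throttle

class GitlabCrawlerSketch(BaseCrawlerSketch):            # stand-in for the GitLab crawler
    def handle_ratelimit(self, response=None):
        if response is not None and response.headers.get("RateLimit-Remaining") == "0":
            sleep_s = int(response.headers.get("Retry-After", 60))   # assumed header names
            logger.info(f"ratelimit exceeded for {self}, sleeping for {sleep_s} seconds...")
            time.sleep(sleep_s)
        else:
            super().handle_ratelimit()                   # no response, or not limited: default throttling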

rm psycopg from requirements,
add 5s extra sleep in github,
puhoy merged commit 05c1889 into master on Jul 4, 2021
puhoy deleted the fix/github_many_crawlers branch on July 4, 2021 13:47