fix repeated github id crawling #7
Conversation
@@ -66,11 +66,11 @@ def crawl(platform: ICrawler) -> Generator[List[dict], None, None]:
     :param platform: which platform to crawl, with what credentials
     """
     logger.debug(f"START block: {platform.name} - initial state: {platform.state}")
-    for success, result_chunk, state in platform.crawl(platform.state):
+    for success, block_chunk, state in platform.crawl():
Tried to rename things so you understand what partial things belong to what in the bigger context. Also removed the arg (it's optional across the current crawlers) - we already set the state when initialising the crawler, so this removes a redundancy that might confuse someone later on.
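A minimal sketch of the shape this implies (class and attribute names here are assumptions, not the project's actual API): state lives on the crawler from initialisation, so passing it into crawl() again only duplicated self.state.

```python
from typing import Generator, List, Tuple

class Crawler:
    """Hypothetical stand-in for a crawler: state lives on the instance."""

    def __init__(self, state: dict = None):
        # state is set once here, so crawl() needs no state argument
        self.state = state or {}

    def crawl(self) -> Generator[Tuple[bool, List[dict], dict], None, None]:
        # previously the caller did crawl(self.state) -- the argument merely
        # duplicated self.state and could drift out of sync with it
        yield True, [{"id": 1}], self.state
```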
-        yield result_chunk
+        logger.info(f"got {len(block_chunk)} results from {platform} "
+                    f"- first repo id: {next(iter(block_chunk), {}).get('id', None)}")
+        yield block_chunk
Make it visible from skimming the logs that we aren't crawling the same results/ids over and over across blocks.
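The `next(iter(chunk), {}).get('id', None)` idiom in the new log line can be isolated like this (the helper name is made up for illustration):

```python
def first_repo_id(block_chunk):
    """Return the 'id' of the first result in a chunk, or None for an
    empty chunk -- this is what the new log line prints per block, so
    repeated ids across blocks stand out when skimming the logs."""
    return next(iter(block_chunk), {}).get("id", None)
```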
@@ -46,7 +46,7 @@ def __init__(self, base_url, state=None, auth_data=None, user_agent=None, query=
         self.requests.headers.update(
             {"Authorization": f"Bearer {auth_data['access_token']}"})

-    def handle_ratelimit(self, response):
+    def handle_ratelimit(self, response=None):
Make response optional so that we can still apply default throttling when we crash.
as in, a single chunk crashed for whatever reason in the current block, and we still continue with the next chunk
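A sketch of the decision this enables (function name is hypothetical; the header names are GitHub's real rate-limit headers, but the exact logic is an assumption): with a response we can read the headers, and without one -- e.g. the chunk crashed before any response existed -- we fall back to a fixed default throttle.

```python
import time

def ratelimit_sleep_s(response=None, default_s=5.0):
    """Sketch of the throttle decision only (the real method also sleeps).

    No response -> crashed chunk, apply the default throttle.
    Response with budget left -> no sleep needed.
    Response with budget exhausted -> sleep until the reset timestamp.
    """
    if response is None:
        return default_s  # crashed chunk: no headers to inspect
    remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = float(response.headers.get("X-RateLimit-Reset", 0))
    return max(reset_at - time.time(), default_s)
```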
@@ -91,7 +94,8 @@ def set_state(cls, state: dict = None) -> dict:
         state['empty_page_cnt'] = state.get('empty_page_cnt', 0)  # indicate that exploration has reached an end
         if not isinstance(state.get(BLOCK_KEY_IDS, None), list):
             state[BLOCK_KEY_IDS] = []  # list when we have known indexes to use as IDs
-        state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID, 0)  # without known IDs, we start from the lowest ID number
+        state[BLOCK_KEY_FROM_ID] = state.get(BLOCK_KEY_FROM_ID,
+                                             0)  # without known IDs, we start from the lowest ID number
happy autoformat -_-
        indexes = state[BLOCK_KEY_IDS][i:i + GITHUB_QUERY_MAX]
        if len(indexes) > 0:
            state['current'] = indexes[-1]
        else:
            # otherwise explore incrementally within from/to
            i += state[BLOCK_KEY_FROM_ID]
this was the missing secret sauce to follow block instructions
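The fix in isolation (key names follow the diff; the helper name and the range length are assumptions): without known IDs, the chunk offset `i` has to be shifted by `from_id`, otherwise exploration always starts at 0.

```python
GITHUB_QUERY_MAX = 100  # assumed chunk size per query

def chunk_ids(state, i):
    """Return the IDs to query for this chunk: slice the known IDs when we
    have them, otherwise explore incrementally starting from 'from_id'."""
    known = state.get("ids", [])
    if known:
        return known[i:i + GITHUB_QUERY_MAX]
    # the previously-missing offset: without it, every block restarted at 0
    start = i + state.get("from_id", 0)
    return list(range(start, start + GITHUB_QUERY_MAX))
```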
                yield False, [], state
                continue  # skip to next crawl
This is a mess to look at in diff view; probably pull it and look at the code. Hopefully it's a bit clearer than what it was.
            if len(repos) == 0:
                state['empty_page_cnt'] += 1

            yield True, repos, state
            self.handle_ratelimit(response)
this handle_ratelimit is now run for response.ok == True AND False (but not when the chunk crashed)
-                logger.exception(f"(skipping block part) github crawler crashed")
+                logger.exception(f"(skipping block chunk) github crawler crashed")
                 yield False, [], state
                 self.handle_ratelimit()
handle_ratelimit when the chunk crashed and we don't have a response
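Putting the two comments above together, the per-chunk control flow looks roughly like this (`fetch` and `handle_ratelimit` are hypothetical stand-ins for the real methods): ratelimit handling runs with the response whether it was ok or not, and without a response when the chunk crashed.

```python
import logging

logger = logging.getLogger("crawler")

def crawl_block(fetch, state, handle_ratelimit):
    """Sketch of one chunk of a block crawl (names are illustrative)."""
    try:
        response = fetch()
        repos = getattr(response, "repos", [])
        yield response.ok, repos, state
        handle_ratelimit(response)  # runs whether response.ok is True or False
    except Exception:
        logger.exception("(skipping block chunk) github crawler crashed")
        yield False, [], state
        handle_ratelimit()  # crashed: no response object to pass along
```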
-                logger.exception(f"(skipping block part) github crawler crashed")
+                logger.exception(f"(skipping block chunk) github crawler crashed")
                 yield False, [], state
                 self.handle_ratelimit()

            state = self.set_state(state)  # update state for next round
the only place that set_state is called during crawling, covering all scenarios: crashed, not OK, or OK
            logger.info(f"ratelimit exceeded for {self}, sleeping for {sleep_s} seconds...")
            time.sleep(sleep_s)
        else:
            super().handle_ratelimit()
conform the gitlab-specific ratelimit in the same way that I updated github to work (i.e. not needing a response object arg)
- gitea doesn't need it, because it doesn't override the superclass method.
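A sketch of the inheritance shape this describes (class names, return values, and the backoff condition are assumptions; `RateLimit-Remaining` is a real GitLab response header): the subclass handles its platform-specific case and otherwise defers to the shared default, which no longer needs a response object.

```python
class BaseCrawler:
    def handle_ratelimit(self, response=None):
        # shared default throttling; works with or without a response object
        return "default-throttle"

class GitlabCrawler(BaseCrawler):
    def handle_ratelimit(self, response=None):
        # GitLab-specific backoff when its headers say the budget is spent,
        # otherwise fall through to the superclass default
        if response is not None and response.headers.get("RateLimit-Remaining") == "0":
            return "gitlab-backoff"
        return super().handle_ratelimit()
```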
rm psycopg from requirements, add 5s extra sleep in github,
Did some refactoring to get rid of the continue statement, which probably made the crawl method more confusing to look at. But apparently, last time I refactored, I forgot to actually make use of the "from_id" key when building the encoded ids (because I was focusing on trying to make the "when we have known IDs" part work) - so it always started from 0 =(