Implement new Scraper for GitHub Events #119
Comments
Would the scraper need to run all analyzers on all files in the cloned repo? Or would it only download the files?
@Samyak2 The scraper would only download the files and put them into a queue similar to the pastebin scraper. Running the analyzers is not the task of the scrapers. Maybe this architectural illustration can show the inner workings better:
The interesting part is indeed which kinds of files (and how many) should be inserted into the queue. Starting with files from common programming languages would be a good first step, I think. But I think you can come up with a great solution. Downloading is priority 1; the scanning part can be done later.
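A minimal sketch of the "which files go into the queue" step, assuming the repo has already been cloned to a local directory. The extension allow-list, the `max_files` cap, and the function name `enqueue_source_files` are hypothetical choices for illustration, not part of pastepwn:

```python
import os
import queue

# Hypothetical allow-list: the thread only says "common programming languages".
SOURCE_EXTENSIONS = {".py", ".js", ".java", ".go", ".rb", ".php", ".c", ".cpp"}


def enqueue_source_files(repo_dir, file_queue, extensions=SOURCE_EXTENSIONS, max_files=100):
    """Put paths of matching files under repo_dir on file_queue.

    Returns the number of files enqueued. No analyzers run here;
    the scraper only collects files for later processing.
    """
    count = 0
    for root, dirs, files in os.walk(repo_dir):
        # Skip git metadata and walk the remaining directories in a stable order.
        dirs[:] = sorted(d for d in dirs if d != ".git")
        for name in sorted(files):
            if count >= max_files:
                return count
            if os.path.splitext(name)[1].lower() in extensions:
                file_queue.put(os.path.join(root, name))
                count += 1
    return count
```

Capping the number of files per repo is one way to answer the "how many" question; a size limit per file would be another.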
I understand the flow now. I will try to understand the Pastebin scraper and then start work on this.
I will update the image later. There are a few issues with it... But the flow is the same. If you need help anywhere, feel free to contact me. I'll be happy to help.
Similar to shhgit (repo link) there could be a new parser which clones a repo and checks files with the given analyzers.
For now this is just a random idea with close to no detailed thoughts on how to implement it. There is the GitHub Events API, which is also used by shhgit. Maybe the source code of shhgit can also be used to implement some of the code for pastepwn.
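As a rough sketch of how the Events API could feed such a scraper: the public events endpoint returns a JSON list of events, each with a `type` and a `repo.name` of the form `owner/repo`. The code below assumes that push events are the ones of interest and that repos are cloned over HTTPS. The function names and the `User-Agent` value are hypothetical, and this is not how shhgit or pastepwn necessarily do it:

```python
import json
import urllib.request

EVENTS_URL = "https://api.github.com/events"  # public events endpoint


def fetch_public_events(url=EVENTS_URL, timeout=10):
    """Fetch one page of public events (unauthenticated, so rate-limited)."""
    req = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "pastepwn-events-sketch",  # hypothetical value
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def pushed_repo_clone_urls(events):
    """Return de-duplicated HTTPS clone URLs for repos that had a PushEvent."""
    seen = set()
    urls = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        name = event.get("repo", {}).get("name")
        if name and name not in seen:
            seen.add(name)
            urls.append("https://github.com/%s.git" % name)
    return urls
```

A scraper loop would call `fetch_public_events()` periodically, pass the result to `pushed_repo_clone_urls()`, then clone each URL and hand the files to the queue.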
Definition of done