
Implement new Scraper for GitHub Events #119

Open
d-Rickyy-b opened this issue Oct 7, 2019 · 4 comments
Labels

  • Difficulty: Medium (This issue is not easy and not hard to resolve)
  • enhancement (New feature or request)
  • hacktoberfest (Label for issues suited for the Hacktoberfest event)

Comments

@d-Rickyy-b
Owner

Similar to shhgit (repo link), there could be a new scraper which clones a repo and checks its files with the given analyzers.

For now this is just a rough idea with close to no detailed thoughts on how to implement it. There is the GitHub Events API, which is also used by shhgit. Maybe the source code of shhgit can also be used as a reference for implementing some of this for pastepwn.

Definition of done

  1. A new directory called 'github' was created in the scraping directory
  2. A new scraper (which extends basicscraper) is implemented in the github directory
  3. The new scraper works similarly to the pastebin scraper and fetches events from the GitHub Events API. Currently it seems that it needs to clone the repo before acting on it. You are free to make suggestions on how this should work (a rough sketch follows below).
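
Here is a very rough sketch of how such a scraper could look, assuming the BasicScraper base class provides a start method that receives the shared paste queue, as the pastebin scraper does. The import path, method names and attributes below are illustrative assumptions, not a final design:

```python
import logging
import time

import requests

# Assumed import path, based on items 1 and 2 above
from pastepwn.scraping.basicscraper import BasicScraper

API_URL = "https://api.github.com/events"


class GithubScraper(BasicScraper):
    """Polls the public GitHub Events API and enqueues push events."""

    name = "GithubScraper"

    def __init__(self, api_token=None, interval=60):
        super().__init__()
        self.logger = logging.getLogger(__name__)
        self.api_token = api_token  # raises the API rate limit when set
        self.interval = interval  # seconds between polls
        self.running = False

    def _fetch_events(self):
        """Fetch the latest public events from the GitHub Events API."""
        headers = {"Accept": "application/vnd.github.v3+json"}
        if self.api_token:
            headers["Authorization"] = "token {}".format(self.api_token)
        response = requests.get(API_URL, headers=headers, timeout=10)
        response.raise_for_status()
        return response.json()

    def start(self, paste_queue):
        """Poll the events API and enqueue PushEvents for later download."""
        self.running = True
        while self.running:
            try:
                for event in self._fetch_events():
                    # Only PushEvents reference commits with files worth checking
                    if event.get("type") == "PushEvent":
                        paste_queue.put(event)
            except requests.RequestException as e:
                self.logger.warning("Could not fetch GitHub events: %s", e)
            time.sleep(self.interval)

    def stop(self):
        self.running = False
```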
@d-Rickyy-b added the enhancement, hacktoberfest and Difficulty: Medium labels on Oct 7, 2019
@Samyak2
Contributor

Samyak2 commented Nov 1, 2019

Would the scraper need to run all analyzers on all files in the cloned repo? Or would it only download the files?

@d-Rickyy-b
Owner Author

d-Rickyy-b commented Nov 1, 2019

@Samyak2 The scraper would only download the files and put them into a queue similar to the pastebin scraper. Running the analyzers is not the task of the scrapers.

Maybe this architectural illustration can show the inner workings better:

[Image: pastepwn-detail-architecture diagram]

  • EDIT: The last box should say 'ActionHandler' and not 'AnalyzerHandler'

The interesting part is indeed what kind of files (and how many) should be inserted into the queue. Starting with files from common programming languages would be a good first step, but I think you can come up with a great solution. Downloading is priority 1; the scanning part can be done later.
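
To illustrate that download step, here is a hedged sketch that resolves a PushEvent into its changed files, keeps only files with common source-code extensions and enqueues the raw contents. The extension whitelist and the helper name are assumptions; the commits and raw_url fields come from the standard GitHub REST API:

```python
import requests

# Illustrative starting whitelist of common programming languages
COMMON_EXTENSIONS = (".py", ".js", ".java", ".go", ".rb", ".php", ".c", ".cpp")


def download_files_from_event(event, paste_queue):
    """Fetch the files changed by a PushEvent and enqueue their contents."""
    repo = event["repo"]["name"]  # "owner/repository"
    for commit in event["payload"].get("commits", []):
        commit_url = "https://api.github.com/repos/{}/commits/{}".format(repo, commit["sha"])
        response = requests.get(commit_url, timeout=10)
        if response.status_code != 200:
            continue
        for file_info in response.json().get("files", []):
            filename = file_info.get("filename", "")
            # Priority 1 is downloading; start with common languages only
            if not filename.endswith(COMMON_EXTENSIONS):
                continue
            raw = requests.get(file_info["raw_url"], timeout=10)
            if raw.status_code == 200:
                # Analyzers run later in the AnalyzerHandler, not here
                paste_queue.put({"filename": filename, "body": raw.text})
```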

@Samyak2
Contributor

Samyak2 commented Nov 4, 2019

I understand the flow now. I will try to understand the Pastebin scraper and then start work on this.

@d-Rickyy-b
Owner Author

I will update the image later. There are a few issues with it...

But the flow is the same. If you need help anywhere, feel free to contact me. I'll be happy to help.
