
Git Sync improvement #1190

Open
elitongadotti opened this issue Oct 6, 2022 · 4 comments
Labels
enhancement (New feature or request) · git (Git related) · spider (Spider related) · v0.6 (Version v0.6.x)

Comments

@elitongadotti

elitongadotti commented Oct 6, 2022

Please describe the problem this feature request attempts to solve
Hello,
I'd like to suggest improving the Git sync functionality to support scenarios with dozens (or even hundreds) of spiders. Currently the functionality requires that each spider has its own repository; in the scenarios mentioned above, I would end up with too many repositories for that to be practical.

As a workaround, I'm currently using a bash script plus a crontab job to pull from a single repository on GitHub in which each spider has its own folder, mirroring the structure found under the /app/spiders path.
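A rough sketch of that workaround (the repository URL, paths, and function name here are illustrative placeholders, not Crawlab defaults):

```shell
# Sketch of the bash + crontab workaround described above.
# The repo URL and target paths are hypothetical; adjust to your setup.
sync_spiders() {
    repo_url="$1"    # single repository holding one folder per spider
    repo_dir="$2"    # local checkout of that repository
    spiders_dir="$3" # directory Crawlab reads spiders from, e.g. /app/spiders

    # Clone on the first run, fast-forward pull afterwards.
    if [ -d "$repo_dir/.git" ]; then
        git -C "$repo_dir" pull --ff-only
    else
        git clone "$repo_url" "$repo_dir"
    fi

    # Mirror the checkout into the spiders directory, dropping git metadata.
    rm -rf "$spiders_dir"
    mkdir -p "$spiders_dir"
    cp -R "$repo_dir"/. "$spiders_dir"/
    rm -rf "$spiders_dir/.git"
}
```

A crontab entry such as `*/5 * * * * /usr/local/bin/sync-spiders.sh` would then run the sync every five minutes.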

Best,
Eliton

@elitongadotti elitongadotti added the enhancement New feature or request label Oct 6, 2022
@tikazyq tikazyq added v0.6 Version v0.6.x git Git related spider Spider related labels Oct 6, 2022
@tikazyq
Collaborator

tikazyq commented Oct 6, 2022

I think this idea makes a lot of sense and is doable. What we could implement is something like a "virtual spider": a spider that links to a subdirectory of a Git repo.

Feedback and ideas are welcome if you have any suggestions.

@elitongadotti
Author

Hello.

That is also doable. Honestly, I have no idea how complex either approach would be to implement; the way I suggested is simply how I did it with bash scripting.

Thanks,
Eliton

@tikazyq
Collaborator

tikazyq commented Oct 10, 2022

Perhaps I didn't convey my idea clearly. You can follow this issue; a proposed model/process for file management in Crawlab will be implemented later this month or the month after.

@pfrenssen

There is also a common use case in Scrapy where multiple spiders are hosted in a single folder, which allows the spiders to reuse common pipelines and middlewares from a shared codebase:

├── my_scraper
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines
│   │   ├── pipeline1.py
│   │   ├── pipeline2.py
│   │   └── ...
│   ├── settings.py
│   └── spiders
│       ├── spider1.py
│       ├── spider2.py
│       └── ...
If we have a single codebase containing 100 spiders, at the moment we need to clone the same repository 100 times, and whenever the repository is updated we need to pull the changes 100 times.

It would be really nice if we could point multiple spiders at the same codebase, but this would mean decoupling the concept of files / Git repos from spiders.
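The maintenance cost above can be illustrated roughly as follows (the function names and `spider-*` clone layout are hypothetical): with one shared checkout an update is a single pull, while with per-spider clones the same pull has to be repeated once per clone.

```shell
# One shared checkout: a single pull updates the code every spider runs from.
update_shared() {
    git -C "$1" pull --ff-only
}

# One clone per spider: the same pull must be repeated for each clone,
# e.g. 100 times for 100 spiders.
update_per_spider() {
    for clone in "$1"/spider-*; do
        git -C "$clone" pull --ff-only
    done
}
```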
