
After deleting a project, SpiderKeeper still tries to sync its status with Scrapyd #86

Open
QingGo opened this issue Oct 15, 2018 · 11 comments
QingGo commented Oct 15, 2018

I deleted the SpiderKeeper projects by calling the API directly:

import requests

session = requests.Session()
for i in range(2, 19):
    project_delete_url = 'http://localhost:5000/project/{}/delete'.format(i)
    r = session.get(project_delete_url, auth=('admin', 'admin'))

After the deletions I noticed that the Scrapyd container in Docker was sitting at close to 100% CPU, and the SpiderKeeper log kept printing messages like these:

Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:47:44 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:49 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:54 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:47:54 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:47:59 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:48:04 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_spiders (trigger: interval[0:00:10], next run at: 2018-10-15 16:48:04 CST)" skipped: maximum number of running instances reached (1)
Execution of job "sync_job_execution_status_job (trigger: interval[0:00:05], next run at: 2018-10-15 16:48:19 CST)" skipped: maximum number of running instances reached (1)

I then tried stopping Scrapyd, and SpiderKeeper naturally started logging a pile of warnings about failed requests to Scrapyd's listjobs and listspiders endpoints. The strange part is that the ?project= parameter in those requests always named projects that had already been deleted. My guess is that after a project is deleted (it really is gone from Scrapyd), SpiderKeeper never removes the corresponding periodic sync jobs from its own SQLite database.

Also, after deleting a project and creating a new one in the SpiderKeeper UI, the new project shows the run history of the old one. Again my guess is that after a project is deleted (it really is gone from Scrapyd), SpiderKeeper never removes the corresponding job run records from its own SQLite database.

Or maybe I'm just calling the API to delete SpiderKeeper projects the wrong way? Any help would be appreciated.
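In case it helps, here is a minimal cleanup sketch that deletes the orphaned rows straight out of SpiderKeeper.db. The table and column names (sk_project, sk_spider, sk_job_instance, sk_job_execution, project_id) are assumptions based on SpiderKeeper's SQLAlchemy models, so verify them with .tables/.schema in the sqlite3 shell, and stop SpiderKeeper before running it:

import sqlite3

DELETED_PROJECT_IDS = range(2, 19)  # the ids removed via the API above

conn = sqlite3.connect('SpiderKeeper.db')
with conn:  # commits on success
    for pid in DELETED_PROJECT_IDS:
        # table/column names assumed; check with .schema first
        conn.execute('DELETE FROM sk_job_execution WHERE project_id = ?', (pid,))
        conn.execute('DELETE FROM sk_job_instance WHERE project_id = ?', (pid,))
        conn.execute('DELETE FROM sk_spider WHERE project_id = ?', (pid,))
        conn.execute('DELETE FROM sk_project WHERE id = ?', (pid,))
conn.close()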


QingGo commented Oct 15, 2018

Also, after brute-force deleting SpiderKeeper.db, it seems SpiderKeeper cannot automatically re-sync the project information that already exists on Scrapyd.
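A rough re-sync sketch: Scrapyd's documented listprojects.json endpoint returns every project it knows about, and each one can then be re-created in SpiderKeeper. Note that the /project/create endpoint and its project_name form field below are guesses read off the SpiderKeeper web form, not a documented API:

import requests

names = requests.get('http://localhost:6800/listprojects.json').json()['projects']
for name in names:
    # assumed SpiderKeeper form endpoint/field; inspect the "create project" page to confirm
    requests.post('http://localhost:5000/project/create',
                  data={'project_name': name}, auth=('admin', 'admin'))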

@3inchtime

I'm running into the same problem. How did you solve it?


QingGo commented Nov 6, 2018

> I'm running into the same problem. How did you solve it?

In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

@3inchtime

> > I'm running into the same problem. How did you solve it?
>
> In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

That's rough.

@3inchtime

> > I'm running into the same problem. How did you solve it?
>
> In the end I never solved it, so I gave up on SpiderKeeper and switched to celery-beat to manage the scheduled jobs.

By now I'm fairly sure some problem in the code is blocking Scrapyd itself, not SpiderKeeper. But the same spider runs fine on other machines, which leaves me completely baffled.

@Ericliu68

Take a closer look at what's in the db. The source code really ought to also delete the project and job rows from the db.
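For anyone who wants to patch it, here is a sketch of what the project-delete handler could additionally do. It assumes SpiderKeeper's SQLAlchemy models JobExecution, JobInstance and Spider each have a project_id column and that db is the shared SQLAlchemy instance; check the model definitions in the source before copying this:

# assumed import paths and model/column names; verify against the SpiderKeeper source
from SpiderKeeper.app import db
from SpiderKeeper.app.spider.model import JobExecution, JobInstance, Spider

def purge_project_rows(project_id):
    # remove everything the deleted project left behind, then persist
    JobExecution.query.filter_by(project_id=project_id).delete()
    JobInstance.query.filter_by(project_id=project_id).delete()
    Spider.query.filter_by(project_id=project_id).delete()
    db.session.commit()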


QingGo commented Jan 9, 2019

Any SpiderKeeper call to a Scrapyd API can fail for all sorts of reasons (a network error, or Scrapyd itself getting hammered with requests until it blocks), which leaves the two sides out of sync. I think these errors need proper handling, for example surfacing the failure in the UI, or retrying automatically.
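A minimal sketch of the retry idea using plain requests. Scrapyd's listjobs.json endpoint is real; the helper name, retry counts and the myproject name are illustrative:

import logging
import time

import requests

def call_scrapyd(url, retries=3, backoff=2):
    # retry transient failures instead of silently drifting out of sync
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except (requests.RequestException, ValueError) as exc:
            logging.warning('Scrapyd call failed (%d/%d): %s', attempt, retries, exc)
            time.sleep(backoff * attempt)
    return None  # caller decides how to surface the failure in the UI

jobs = call_scrapyd('http://localhost:6800/listjobs.json?project=myproject')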


QingGo commented Jan 9, 2019

But I've stopped using SpiderKeeper anyway; these days I manage the scheduled jobs with celery + celery-beat.

@Ericliu68

Scrapyd lets you configure how many spider processes run at the same time. What I'd actually like to know is how celery + celery-beat drives Scrapy crawls. Any tutorials you'd recommend?


QingGo commented Jan 9, 2019

> Scrapyd lets you configure how many spider processes run at the same time. What I'd actually like to know is how celery + celery-beat drives Scrapy crawls. Any tutorials you'd recommend?

For celery itself the official docs are the place to start; celery + celery-beat is only used to fire the scheduled asynchronous requests. To call Scrapyd from Python, you can try the python-scrapyd-api library.
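A minimal sketch of that setup, assuming a local Redis broker and placeholder project/spider names demo_project and demo_spider (the module is assumed to be saved as tasks.py):

from celery import Celery
from scrapyd_api import ScrapydAPI  # pip install python-scrapyd-api

app = Celery('crawler', broker='redis://localhost:6379/0')
scrapyd = ScrapydAPI('http://localhost:6800')

@app.task
def run_spider():
    # asks Scrapyd to start a crawl and returns the Scrapyd job id
    return scrapyd.schedule('demo_project', 'demo_spider')

# celery-beat entry: fire the task every hour
app.conf.beat_schedule = {
    'hourly-crawl': {'task': 'tasks.run_spider', 'schedule': 3600.0},
}

For quick testing, celery -A tasks worker -B runs the worker with an embedded beat scheduler.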

@Ericliu68

OK, thanks.
