Request keeps retrying after dedup, and logs say it was added to the queue successfully, but it was never actually queued #207

Open
hijack911 opened this issue Mar 23, 2023 · 2 comments

@hijack911 (Contributor)

Prerequisites

Upgrade feapder and make sure you are on the latest version; if the bug still exists, describe the problem in detail.

pip install --upgrade feapder

Problem
feapder version: 1.8.5
Logs from the process that enqueues tasks:

2023-03-23 10:26:49.636 | INFO     | feapder.core.spiders.task_spider:get_task:line:234 | 无待做任务,尝试取丢失的任务
2023-03-23 10:26:49.658 | DEBUG    | feapder.buffer.request_buffer:is_exist_request:line:46 | request已存在  url = https://httpbin.org/headers
2023-03-23 10:26:49.658 | INFO     | feapder.core.spiders.task_spider:start_monitor_task:line:210 | 添加任务到redis成功 共1条
2023-03-23 10:26:54.673 | INFO     | feapder.core.spiders.task_spider:get_task:line:229 | redis 中剩余任务0 数量过小 从mysql中取任务追加
2023-03-23 10:26:54.689 | INFO     | feapder.core.spiders.task_spider:get_task:line:234 | 无待做任务,尝试取丢失的任务
2023-03-23 10:26:54.712 | DEBUG    | feapder.buffer.request_buffer:is_exist_request:line:46 | request已存在  url = https://httpbin.org/headers
2023-03-23 10:26:54.713 | INFO     | feapder.core.spiders.task_spider:start_monitor_task:line:210 | 添加任务到redis成功 共1条
2023-03-23 10:26:59.728 | INFO     | feapder.core.spiders.task_spider:get_task:line:229 | redis 中剩余任务0 数量过小 从mysql中取任务追加
2023-03-23 10:26:59.744 | INFO     | feapder.core.spiders.task_spider:get_task:line:234 | 无待做任务,尝试取丢失的任务
2023-03-23 10:26:59.767 | DEBUG    | feapder.buffer.request_buffer:is_exist_request:line:46 | request已存在  url = https://httpbin.org/headers
2023-03-23 10:26:59.767 | INFO     | feapder.core.spiders.task_spider:start_monitor_task:line:210 | 添加任务到redis成功 共1条
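
In English, these logs loop every five seconds: "no pending tasks, trying to recover lost ones" → "request already exists, url = https://httpbin.org/headers" → "added tasks to redis successfully, 1 total". The request buffer drops the seed because it is already in the dedup filter (the is_exist_request DEBUG line), yet the monitor still reports success, so the task's state in MySQL never advances. A minimal sketch of that failure mode, with hypothetical names rather than feapder's internals:

seen = set()

def buffer_put(url):
    # Silently drop duplicates, the way a dedup filter does.
    if url in seen:
        return False  # dropped: the request never reaches the redis queue
    seen.add(url)
    return True

def monitor_tick(urls):
    for url in urls:
        buffer_put(url)  # return value ignored...
    print("added to redis OK, %d total" % len(urls))  # ...so this can lie

monitor_tick(["https://httpbin.org/headers"])  # first run: really queued
monitor_tick(["https://httpbin.org/headers"])  # re-run: dropped, yet "OK"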

Screenshot
[MySQL task table screenshot omitted]

Code

Only the main functions are copied here.

import feapder
from feapder import Item


class EntTestSpider(feapder.TaskSpider):
    # Class name is assumed; the issue only copied the main functions.
    __custom_setting__ = dict(
        ITEM_FILTER_ENABLE=True,  # item dedup
        REQUEST_FILTER_ENABLE=True,  # request dedup
        ITEM_FILTER_SETTING=dict(
            # filter_type: 1 = permanent (BloomFilter), 2 = in-memory (MemoryFilter),
            # 3 = expiring (ExpireFilter), 4 = lightweight (LiteFilter)
            filter_type=3,
            expire_time=-1,
            redis_url="redis://127.0.0.1:6379/2",
        ),
        REQUEST_FILTER_SETTING=dict(
            absolute_name="",
            filter_type=3,  # same filter_type codes as above
            expire_time=60 * 60 * 24 * 7,  # expires after 7 days
            redis_url="redis://127.0.0.1:6379/2",
        ),
    )

    def start_requests(self, task):
        task_id = task.id
        url = task.url

        yield feapder.Request(url, task_id=task_id)

    def parse(self, request, response):
        # Print the page content
        print(response.content)
        # print(response.xpath("//title/text()").extract_first())  # page title
        # print(response.xpath("//meta[@name='description']/@content").extract_first())  # page description
        print("URL:", response.url)

        item = Item()  # declare an item
        item.table_name = "ent_test"  # table to store into
        item.content = response.text
        # item.update(**response.json)
        # Mark the MySQL task as done, i.e. state=1
        yield self.update_task_batch(request.task_id)
        # yield item
@Boris-code (Owner)

Seeds for TaskSpider and BatchSpider should not go through request dedup; handle dedup with a unique index in the database instead.
Otherwise, when you re-run the spider, seeds that were already crawled can never be dispatched again.
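
A sketch of how that advice maps onto the code above. feapder's Request accepts a filter_repeat flag that lets a single request bypass dedup when REQUEST_FILTER_ENABLE is on; the MySQL table and index names below are assumptions, not taken from this issue:

import feapder

class EntTestSpider(feapder.TaskSpider):
    def start_requests(self, task):
        # Let a unique index in MySQL dedup the seed rows, and exempt
        # the seed request itself from the redis request filter.
        yield feapder.Request(task.url, task_id=task.id, filter_repeat=False)

# In MySQL, with hypothetical table/column names:
#   ALTER TABLE ent_task ADD UNIQUE KEY uniq_url (url);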

@hijack911 (Contributor, Author)
Quoting @Boris-code: "Seeds for TaskSpider and BatchSpider should not go through request dedup; handle dedup with a unique index in the database instead. Otherwise, when you re-run the spider, seeds that were already crawled can never be dispatched again."

What does a "TaskSpider seed" refer to?
Isn't the id in the database already unique?
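
For context, the "seed" here is the task row (id plus url) that start_requests turns into a Request, and a unique id does not help because the request filter keys on a fingerprint of the request itself, not the row id. Roughly, as a simplification of how such fingerprints work (not feapder's exact code):

import hashlib

def request_fingerprint(url, data=""):
    # Dedup keys on a hash of the request (url + body), so two task
    # rows with different MySQL ids but the same url still collide.
    return hashlib.md5((url + str(data)).encode()).hexdigest()

print(request_fingerprint("https://httpbin.org/headers"))
print(request_fingerprint("https://httpbin.org/headers"))  # same digest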
