
Massive-scale data deduplication - dedup - feapder-document #10

Open
Boris-code opened this issue Mar 7, 2021 · 8 comments

Comments

@Boris-code
Owner

https://boris.org.cn/feapder/#/source_code/dedup

Description

@CZW-1122

So slick~

@Boris-code
Owner Author

@CZW-1122
So slick~

Muah~

@calior

calior commented Aug 12, 2021

I inserted the data into the database once. After clearing the database I wanted to insert it again, but it keeps reporting the data as duplicates. I cleared the Redis dedup key, but that didn't help.
Where is the insert-dedup info cached, and how do I clear it?

@Boris-code
Owner Author

@calior
I inserted the data into the database once. After clearing the database I wanted to insert it again, but it keeps reporting the data as duplicates. I cleared the Redis dedup key, but that didn't help.
Where is the insert-dedup info cached, and how do I clear it?

It's only stored in Redis.
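For anyone hitting the same problem, here is a minimal sketch of wiping matching dedup keys from Redis with redis-py. The `dedup:*` pattern is an assumption for illustration; inspect your Redis instance (e.g. `SCAN 0 MATCH * COUNT 100`) to find the real key prefix your setup uses before deleting anything.

```python
# Minimal sketch: delete all keys matching a pattern without blocking Redis.
# The "dedup:*" pattern is an ASSUMPTION -- check the actual key prefix in
# your Redis instance first; deleting the wrong pattern loses data.

def clear_keys(client, pattern="dedup:*"):
    """Delete every key matching `pattern` using SCAN (safe on large DBs,
    unlike KEYS). `client` is any object exposing redis-py's
    scan_iter()/delete() interface."""
    deleted = 0
    for key in client.scan_iter(match=pattern, count=500):
        deleted += client.delete(key)
    return deleted

# Usage (needs a running Redis and the redis-py package):
#   import redis
#   r = redis.Redis(host="localhost", port=6379, db=0)
#   clear_keys(r, "dedup:*")
```

SCAN-based iteration is used instead of `KEYS *` so the deletion does not block a production Redis instance while it walks the keyspace.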

@calior

calior commented Aug 12, 2021

@calior
I inserted the data into the database once. After clearing the database I wanted to insert it again, but it keeps reporting the data as duplicates. I cleared the Redis dedup key, but that didn't help.
Where is the insert-dedup info cached, and how do I clear it?

It's only stored in Redis.

That's strange then. Whenever items are written it keeps reporting duplicates and writes 0 rows. I switched to a different Redis and deleted the table too; the only thing left to try is rebooting the machine.

@xiaoyueinfo

The BloomFilter has a bug.

@dream2333

After enabling dedup, the corresponding keys have to be deleted manually; using delete_keys="*" has no effect.

@Boris-code
Owner Author

@dream2333
After enabling dedup, the corresponding keys have to be deleted manually; using delete_keys="*" has no effect.

That's because the dedup store is shared by default: multiple spiders deduplicate in one shared pool, which saves space.
Permanent dedup has to pre-allocate a fixed amount of space (285 MB) up front, whether it deduplicates one record or a hundred million, so giving every project its own space would waste a lot of memory.
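The ~285 MB figure is consistent with the standard Bloom filter sizing formula, m = -n·ln(p) / (ln 2)², assuming a capacity of 100 million items and a false-positive rate around 1e-5 (those two parameters are assumptions here; check your feapder dedup configuration for the actual defaults):

```python
import math

def bloom_filter_size_mb(n, p):
    """Megabytes needed for a Bloom filter holding n items at
    false-positive rate p: m = -n * ln(p) / (ln 2)^2 bits."""
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return m_bits / 8 / 1024 / 1024

# Assumed parameters: 100 million items, 0.001% false-positive rate.
size = bloom_filter_size_mb(100_000_000, 1e-5)  # roughly 285.7 MB
```

Because the bit array must be sized for the target capacity before the first item is inserted, the allocation is the same whether the filter ends up holding one key or a hundred million, which is exactly why sharing one pool across spiders saves memory.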
