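Keyword search returns very few results / results unrelated to the keyword / duplicated content / results outside the specified time range #312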
Comments
Hello, have you managed to solve this problem?
Hello, no, I was not able to solve it.
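@pxaklbe Reading keywords from a file works fine. The problem of tweets from other dates being mixed into the search results has been fixed; please pull the latest code. The cause was that, for reposted tweets, the original tweet was saved as well, and the original tweet's date may fall outside the configured search window. I tested with the keyword “一个” and the date 2017-2-26, using the following settings:
keywords = ['一个']
start_time = datetime.datetime(year=2017, month=2, day=26, hour=0)
end_time = datetime.datetime(year=2017, month=2, day=27, hour=0)
is_split_by_hour = True
This collected 19 tweets, all dated 2017-2-26 and all containing the keyword “一个”.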
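To illustrate the cause described above, here is a minimal sketch of a date check that drops items whose timestamp falls outside the search window. It is not the repository's actual fix; the function name `in_search_window` and the item fields `id` and `created_at` are assumptions made for this example.

```python
import datetime


def in_search_window(created_at, start_time, end_time):
    """Return True when created_at lies in the half-open window [start_time, end_time)."""
    return start_time <= created_at < end_time


start_time = datetime.datetime(year=2017, month=2, day=26, hour=0)
end_time = datetime.datetime(year=2017, month=2, day=27, hour=0)

# Hypothetical scraped items: a repost inside the window and the
# original tweet it points to, which was published weeks earlier.
items = [
    {"id": "repost", "created_at": datetime.datetime(2017, 2, 26, 9, 30)},
    {"id": "original", "created_at": datetime.datetime(2017, 1, 3, 18, 0)},
]

kept = [
    item for item in items
    if in_search_window(item["created_at"], start_time, end_time)
]
print([item["id"] for item in kept])  # ['repost']
```

The half-open window keeps a tweet posted exactly at midnight on 2017-2-27 out of the 2017-2-26 results.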
nghuyong added a commit that referenced this issue on Feb 13, 2024
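Thank you to the author for the detailed explanation! Is there any way to collect more tweets beyond this? (I have already seen the hourly collection mode.)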
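For readers unfamiliar with the hourly mode mentioned here, the idea behind `is_split_by_hour` can be pictured as cutting the search window into one-hour slices and issuing one query per slice. The sketch below only shows that slicing; `split_by_hour` is a hypothetical helper, not the project's code, and whether narrower slices yield more results depends on how the search endpoint caps each query.

```python
import datetime


def split_by_hour(start_time, end_time):
    """Split [start_time, end_time) into consecutive windows of at most one hour."""
    step = datetime.timedelta(hours=1)
    windows = []
    cursor = start_time
    while cursor < end_time:
        # The last window is clipped so it never runs past end_time.
        windows.append((cursor, min(cursor + step, end_time)))
        cursor += step
    return windows


windows = split_by_hour(
    datetime.datetime(year=2017, month=2, day=26, hour=0),
    datetime.datetime(year=2017, month=2, day=27, hour=0),
)
print(len(windows))  # 24
```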
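Dear developer, hello! I want to randomly pick one day in 2017 and retrieve all Weibo posts from that day that contain the keyword “一个”. To do this, I made the following changes to start_request in tweet_by_keyword: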
![randtime-search](https://private-user-images.githubusercontent.com/126714177/292507059-a39c4761-b46d-464d-807d-7b796abda6ea.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA3MDU5LWEzOWM0NzYxLWI0NmQtNDY0ZC04MDdkLTdiNzk2YWJkYTZlYS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mODdkNzdiZDRiN2ZiMTA2ZTZlOTY5YTlkNjgyNzI0ZTFhZjlmYWE3YmJkYzYwOTc0ZGExNmRiYzlhMDI0YmU0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.mvZC_JCSSpUuywOvYFSEHV_nEgq23644SOnj1JTo2HY)
![fixedtime-test1-search](https://private-user-images.githubusercontent.com/126714177/292508216-9053a839-1cca-44d3-ab90-6b1b8e24bbce.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA4MjE2LTkwNTNhODM5LTFjY2EtNDRkMy1hYjkwLTZiMWI4ZTI0YmJjZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02MDQ4NzQ5MTcyZDFjYWI3NDllMWJkODU5MjBiYmZjMjI5Mjc1NmM0MmRlMGNlZDI2YjM3Y2E1NmFjYmE0NDI2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.YP1Q8c5mQnDJFJNHSFJbkCTYoKPHMt_9h2IF21ghZpY)
![randtimetest1-1](https://private-user-images.githubusercontent.com/126714177/292507742-61a22e04-7084-4dec-87ee-1ae8346e8296.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA3NzQyLTYxYTIyZTA0LTcwODQtNGRlYy04N2VlLTFhZTgzNDZlODI5Ni5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT00OGJlN2U1ODIwOWIyYWE5NjI0M2Y5OTA1MjhjMThiZmIzYTNmYjExNjQxZjQ5YjgwNmMzZDVjNjdkMGIxNWRkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.s2R0oPFj_QEEwdZcvlUIlvfoo2MY4pvIyYlybrJ3rwc)
![randtimetest1-2](https://private-user-images.githubusercontent.com/126714177/292507759-ba30c975-b4c9-453f-9f5e-4530f2304468.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA3NzU5LWJhMzBjOTc1LWI0YzktNDUzZi05ZjVlLTQ1MzBmMjMwNDQ2OC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02OGE3ZDMyYjg3ZTNlNmZhNTMxMDA1MzUxZDYxMTg1OGE3MjVkNTQzYjFjZTQ1ZGFmMTJjNzA2MzE1YmI3NTBiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.amJ2qcrSYiQfeIWhcg9O15hLR6yLBm72MkzsM0weauM)
![fixedtime-test1](https://private-user-images.githubusercontent.com/126714177/292508166-3d80fd7f-23d7-4a24-9706-7edd9c5358f3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA4MTY2LTNkODBmZDdmLTIzZDctNGEyNC05NzA2LTdlZGQ5YzUzNThmMy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT05MGU0YzMwNTc1ODQ5YjA0ZmYwODQ3NjI3NzEzOTQxMDRiMDE4ZTgwZmNiZWMxNGUyZjMyOWQ2ZWMyZTk0MzgzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.KZyl0Um7lzCHJOUJSOHaXJZjwneZ0lDlRNyL7p9rmmQ)
![fixedtime-test2-search](https://private-user-images.githubusercontent.com/126714177/292508867-e8022757-191b-4d9f-a568-a11c2ce6a8a4.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA4ODY3LWU4MDIyNzU3LTE5MWItNGQ5Zi1hNTY4LWExMWMyY2U2YThhNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0wYmVmY2M1NWRmZmE3Y2ZlMTcwNTZmZjQ5N2VmMGE4YjIxYWQ1YmMyYTBkOWY5Mjk5NDhmZTNmYjhmZWNlY2IyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.YS-IXsrkV_LPUWut1HjFSlCj5nkWQfP2ptOibbTv7QU)
![fixedtime-test2](https://private-user-images.githubusercontent.com/126714177/292508828-d5a07202-fe2b-42b0-9309-3f353348eabe.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA1OTE1NDUsIm5iZiI6MTcyMDU5MTI0NSwicGF0aCI6Ii8xMjY3MTQxNzcvMjkyNTA4ODI4LWQ1YTA3MjAyLWZlMmItNDJiMC05MzA5LTNmMzUzMzQ4ZWFiZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzEwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcxMFQwNjAwNDVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03OGUwMzFlNzQxNWRjNGUyMjQ2NjY1ZjBhMTFhMjY3YzY4ZTdiMjNmOGVjZTkwYTQ2NDNhZTk5NzEzY2IzM2NkJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.6ivds6XvnQ_BkPKXA3v0w0Xb61fw1nzoOaolfYao_NI)
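Since the modified code survives only as screenshots, the random-day selection described above can be sketched as follows. This is an assumption about the intent, not a transcription of the screenshots; `random_day_window` is a hypothetical helper that returns a `start_time` and `end_time` pair in the same form as the spider's settings.

```python
import datetime
import random


def random_day_window(year, rng=random):
    """Pick a random calendar day in `year` and return (start_time, end_time) spanning it."""
    first = datetime.date(year, 1, 1)
    days_in_year = (datetime.date(year + 1, 1, 1) - first).days
    day = first + datetime.timedelta(days=rng.randrange(days_in_year))
    start_time = datetime.datetime(day.year, day.month, day.day, hour=0)
    end_time = start_time + datetime.timedelta(days=1)
    return start_time, end_time


# A seeded generator makes the draw reproducible between runs.
start_time, end_time = random_day_window(2017, random.Random(42))
print(start_time, end_time)
```

Passing a seeded `random.Random` instance lets the same day be re-queried later, which helps when comparing results across runs.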
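The keywords are stored in a txt file and loaded via 'filepath'.
With this approach, I found that the search results are not limited to the date the program randomly picked (for example, February 26, 2017), and that the results do not contain the keyword. Some of the results are shown in the screenshots.
I then kept the 'filepath' way of loading keywords but ran with the program's original fixed date, and the same problems remained.
Finally, I used the original inline keyword list (['关键词']) in start_request, together with the program's original fixed date. Results from other dates still appeared, and the content retrieved for those dates was unrelated to the keyword. Results with the correct date did contain the keyword, but the program shut down on its own after retrieving only 24 results.
Could the author advise how to search a randomly chosen day while ensuring that the results contain the keyword and that more than 200 results are collected? Any pointers would be much appreciated. In addition, the results of the first two approaches contain many posts with identical content, reposted at different times by accounts with different ids. Is there a way to deduplicate such results? Thank you, and happy holidays in advance!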
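One possible way to handle the deduplication asked about above is to post-process the collected items and keep only the first item for each distinct text. The sketch below is an illustration rather than a feature of the project; the item field names `id` and `content` are assumptions, and removing whitespace before hashing is a design choice that treats posts differing only in spacing as the same.

```python
import hashlib
import re


def content_key(text):
    """Build a stable key for a post's text, ignoring all whitespace."""
    normalized = re.sub(r"\s+", "", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_by_content(items):
    """Keep the first item for each distinct content key, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = content_key(item["content"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical results: two accounts reposting the same text, plus one distinct post.
items = [
    {"id": "1", "content": "这是 一个 测试"},
    {"id": "2", "content": "这是一个测试"},
    {"id": "3", "content": "另一个帖子"},
]
print([item["id"] for item in dedupe_by_content(items)])  # ['1', '3']
```

Exact-match hashing will not catch reposts that add their own comment in front of the shared text; those would need a fuzzier comparison.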