seedjyh/hupubbs

hupubbs

A Scrapy-based crawler that scrapes the Hupu forum.

Deployment

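  • Clone the project to your local machine.
  • In hupubbs/pipelines.py, edit the MySQL connection parameters assigned to self.db in MySQLPipeline.open_spider so that they point to your own MySQL server.
  • From the command line, change into the project directory and run scrapy crawl hupubbs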
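
The second step above can be sketched roughly as follows. This is a hypothetical outline, not the repository's actual code: it assumes the pipeline connects through the PyMySQL driver, and the values in MYSQL_PARAMS (an illustrative name) are placeholders to replace with your own server's settings.

```python
# Hypothetical sketch: the real MySQLPipeline in hupubbs/pipelines.py may differ.
# Assumption: the PyMySQL driver is used; the repository may use another one.

# Placeholder connection parameters: replace with your own server's values.
MYSQL_PARAMS = {
    "host": "127.0.0.1",
    "port": 3306,
    "user": "hupu",
    "password": "change-me",
    "database": "hupubbs",
    "charset": "utf8mb4",
}


class MySQLPipeline:
    def open_spider(self, spider):
        # Imported lazily so this module loads even without PyMySQL installed.
        import pymysql

        # This is the assignment the deployment step asks you to adjust.
        self.db = pymysql.connect(**MYSQL_PARAMS)

    def close_spider(self, spider):
        # Release the connection when the crawl ends.
        self.db.close()

    def process_item(self, item, spider):
        # Item persistence is omitted in this sketch.
        return item
```

Keeping the parameters in one dictionary means only that dictionary changes when the crawler moves to a different server.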

Design document

Usage example

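Hupu lets users hide their activity feed, which makes it impossible to see on the site which boards a user mostly replies in. After crawling, run the following query in the database: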

select plate.url, count(*)
from reply
    left join thread on reply.thread_id = thread.id
    left join plate on thread.plate_id = plate.id
    left join user on reply.user_id = user.id
where user.url_id = 245307700327195 # `https://my.hupu.com/245307700327195`
group by plate.id;

This returns the number of replies that user has posted in each board.
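
To show the shape of the result, here is a minimal sketch that runs the same query from Python. It is an illustration under stated assumptions: it uses an in-memory SQLite database rather than MySQL, the schema contains only the columns the query references (the real tables likely have more), and the board URLs and rows are made up. The helper name replies_per_plate is hypothetical.

```python
import sqlite3

# Assumed minimal schema: only the columns used by the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plate (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE thread (id INTEGER PRIMARY KEY, plate_id INTEGER);
CREATE TABLE user (id INTEGER PRIMARY KEY, url_id INTEGER);
CREATE TABLE reply (id INTEGER PRIMARY KEY, thread_id INTEGER, user_id INTEGER);
-- Made-up sample data for illustration only.
INSERT INTO plate VALUES (1, '/board-a'), (2, '/board-b');
INSERT INTO thread VALUES (10, 1), (11, 2);
INSERT INTO user VALUES (100, 245307700327195), (101, 1);
INSERT INTO reply VALUES (1, 10, 100), (2, 10, 100), (3, 11, 100), (4, 11, 101);
""")


def replies_per_plate(conn, url_id):
    """Return (plate url, reply count) pairs for the user with this url_id."""
    sql = """
        SELECT plate.url, COUNT(*)
        FROM reply
            LEFT JOIN thread ON reply.thread_id = thread.id
            LEFT JOIN plate ON thread.plate_id = plate.id
            LEFT JOIN user ON reply.user_id = user.id
        WHERE user.url_id = ?
        GROUP BY plate.id
        ORDER BY plate.id
    """
    # The url_id is passed as a bound parameter instead of being inlined.
    return conn.execute(sql, (url_id,)).fetchall()


rows = replies_per_plate(conn, 245307700327195)
print(rows)  # [('/board-a', 2), ('/board-b', 1)]
```

Against the real MySQL database the SQL text stays the same apart from the placeholder style, which depends on the driver (PyMySQL, for example, uses %s instead of ?).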
