Data Analysis in Practice: Managing Community Homework with a Web Crawler

Tags: web crawler

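Recently, some friends and I organized the Crazy Data Analysis group (疯狂数据分析小组). The plan is to study on a schedule, write one data-related article every week, submit it to the Crazy Data Analysis collection (疯狂数据分析专题), and keep this up for a year; for the full plan, see 零基础入门数据分析成员的新年计划 (the New Year plans of the beginner data analysis group members). But how do we keep track of the homework? Counting submissions one by one is tedious. So I decided to crawl everyone's submissions and then filter them by time each week. I tried it today and it was indeed much more convenient. How does it work? Pour yourself a glass of water and I will walk you through it step by step.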

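Reading route:

1. Crawl target
2. Fetching the index page
3. Parsing the index page
4. Ajax asynchronous loading
5. Fetching and parsing the detail pages
6. Saving the results to MySQL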

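Note: because members keep submitting homework, the Crazy Data Analysis collection page you see will differ somewhat from the one described here.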

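1. Goal

To manage the group's homework, we need four things for each submission: the author's Jianshu username, the article title, the submission time, and the word count (to discourage submitting just for the sake of submitting).
First, take a quick look at the home page of the Crazy Data Analysis collection.

Figure: the index page

Looking closely, this page gives us only the author's Jianshu username and the title. What about the other two fields? Let's click into an article.

Figure: a detail page

After opening it, we find that the Jianshu username, article title, submission time, and word count are all there. With that, we can settle on a crawling plan. This is how I do it:

Start from the Crazy Data Analysis collection page, which we call the index page, and collect each article's title and the link to its detail page (the page we just opened). Then, on each detail page, collect the Jianshu username, the submission time, and the word count.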

2. Fetching the index page

First, look at the URL of our index page.

Let's get started.

import requests  # library used to send requests to Jianshu

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}
url = "https://www.jianshu.com/c/af12635a5aa3"
response = requests.get(url, headers=headers)
print(response.text)

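For how to use the requests library, see the Requests Quickstart.
Below is the HTML the crawler gets back.

It helps to know some HTML; see the HTML Tutorial.
To see a page's HTML in the browser, right-click and choose Inspect, which opens a panel like this.

As shown in the figure, to see the HTML for a particular part of the page: [1] click the button on the right side of the picture, [2] move the mouse over any position on the left, and [3] the matching HTML is shown.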

3. Parsing the index page

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}
url = "https://www.jianshu.com/c/af12635a5aa3"
response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, 'html.parser')
# The article list is the first <ul class="note-list"> on the page
note_list = soup.find_all("ul", class_="note-list")[0]
content_li = note_list.find_all("li")
for item in content_li:
    # Each <li> contains an <a class="title"> holding the title and link
    title_link = item.find_all("a", class_="title")[0]
    print(title_link)

For how to use the BeautifulSoup parsing library, see the [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). For now it is enough to know find_all(), get(), and the contents attribute.
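To make find_all(), get(), and contents concrete, here is a minimal sketch that parses a hand-written HTML snippet instead of the live page. The snippet, its two links, and the titles are made up for illustration; they only mimic the note-list structure used in the code above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure of the index page
SAMPLE_HTML = """
<ul class="note-list">
  <li><a class="title" href="/p/aaa111">First article</a></li>
  <li><a class="title" href="/p/bbb222">Second article</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns every matching tag; take the first <ul class="note-list">
note_list = soup.find_all("ul", class_="note-list")[0]

links = {}
for a in note_list.find_all("a", class_="title"):
    href = a.get("href")          # get() reads an attribute of the tag
    title = str(a.contents[0])    # contents is the list of the tag's children
    links["https://www.jianshu.com" + href] = title

print(links)
```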

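Let's look at the result of the parsing.

This is actually the complete output, yet it contains only 10 entries, while the collection already has 30 articles. Why?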

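4. Ajax asynchronous loading

Put simply: when you open certain pages you cannot see all of the content at first, but as you scroll down, more content appears. Each time only part of the content is loaded, and the whole page is not reloaded. That is Ajax asynchronous loading, and it is why we got only part of the results just now. Open our collection, the Crazy Data Analysis collection, and see whether you get the same effect.

When the page first opens it looks like this:

Now scroll down with the mouse over the left side and watch.

You will notice that links like these appear: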

https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=2
https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=3
https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=4
https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=5

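These extra links are generated as we scroll down, and the only thing that changes between them is the value of the page parameter; each additional page loads 10 more articles. There are now 42 articles, so requesting pages 1 through 5 is enough. (When we looked earlier there were 30; members have since started handing in homework, which is due every Sunday at 12:00.)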
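The page arithmetic can be sketched as follows: with 10 articles per page, the number of pages is the article count divided by 10, rounded up. The helper names pages_needed and page_urls are hypothetical, and the page size of 10 is the value observed above, not a documented constant.

```python
import math

BASE_URL = "https://www.jianshu.com/c/af12635a5aa3"  # collection URL from the article
PER_PAGE = 10  # articles loaded per page, as observed above


def pages_needed(total_articles, per_page=PER_PAGE):
    """Number of pages required to cover total_articles (rounded up)."""
    return math.ceil(total_articles / per_page)


def page_urls(total_articles):
    """Build one paginated index URL per page, starting at page=1."""
    last_page = pages_needed(total_articles)
    return [
        f"{BASE_URL}?order_by=added_at&page={n}"
        for n in range(1, last_page + 1)
    ]


urls = page_urls(42)
print(len(urls))   # 5
print(urls[-1])
```

Computing the page count from the article total avoids editing a hard-coded range every time new homework arrives.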

Now let's see what this looks like in code.

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}

def get_page_index(number):
    """Fetch one page of the index; return its HTML, or None on failure."""
    url = "https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=%s" % number
    try:
        response = requests.get(url, headers=headers)  # GET request
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request failed")
        return None

def parse_index_page(html):
    """Return a dict mapping each article link to its title."""
    soup = BeautifulSoup(html, 'html.parser')
    note_list = soup.find_all("ul", class_="note-list")[0]
    content_li = note_list.find_all("li")
    articles = {}
    for item in content_li:
        title_link = item.find_all("a", class_="title")[0]
        title = title_link.contents[0]
        link = "https://www.jianshu.com" + title_link.get("href")
        # Titles can repeat but links cannot, so the link is the key
        articles[link] = title
    return articles

def main():
    for number in range(1, 6):
        html = get_page_index(number)
        if html is None:  # skip pages that failed to download
            continue
        articles = parse_index_page(html)
        print(articles)

if __name__ == "__main__":
    main()

Now all the results come out, as follows:

5. Fetching and parsing the detail pages

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}

def get_page_index(number):
    """Fetch one page of the index; return its HTML, or None on failure."""
    url = "https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=%s" % number
    try:
        response = requests.get(url, headers=headers)  # GET request
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request failed")
        return None

def parse_index_page(html):
    """Return a dict mapping each article link to its title."""
    soup = BeautifulSoup(html, 'html.parser')
    note_list = soup.find_all("ul", class_="note-list")[0]
    content_li = note_list.find_all("li")
    articles = {}
    for item in content_li:
        title_link = item.find_all("a", class_="title")[0]
        title = title_link.contents[0]
        link = "https://www.jianshu.com" + title_link.get("href")
        # Titles can repeat but links cannot, so the link is the key
        articles[link] = title
    return articles

def get_page_detail(url):
    """Fetch one article page; return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=headers)  # GET request
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request for the detail page failed")
        return None

def parse_detail_page(title, html):
    """Extract the author name, publish time, and word count from a detail page."""
    soup = BeautifulSoup(html, 'html.parser')
    info = soup.find_all("div", class_="info")[0]
    name = info.find_all("span", class_="name")[0].find_all("a")[0].contents[0]
    # The spans inside <div class="meta"> hold the publish time and word count
    content_detail = info.find_all("div", class_="meta")[0].find_all("span")
    content_detail = [span.contents[0] for span in content_detail]
    publish_time = content_detail[0]
    word_age = content_detail[1]
    return title, name, publish_time, word_age

def main():
    for number in range(1, 6):
        html = get_page_index(number)
        if html is None:  # skip index pages that failed to download
            continue
        articles = parse_index_page(html)
        for link, title in articles.items():
            detail_html = get_page_detail(link)
            if detail_html is None:  # skip articles that failed to download
                continue
            title, name, publish_time, word_age = parse_detail_page(title, detail_html)
            print(title, name, publish_time, word_age)

if __name__ == "__main__":
    main()

Here is the result:
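The publish time and word count come back as raw strings, which are awkward to filter or compare. Below is a minimal sketch of converting them to a datetime and an int. It assumes the strings look like "2018.12.09 21:33*" and "字数 1532"; these formats are an assumption about the page at the time, not something shown above, and the function names are hypothetical.

```python
import re
from datetime import datetime


def parse_publish_time(raw):
    """Convert a string such as '2018.12.09 21:33*' into a datetime.

    Assumed format: 'YYYY.MM.DD HH:MM', optionally followed by '*'.
    """
    cleaned = raw.strip().rstrip("*").strip()
    return datetime.strptime(cleaned, "%Y.%m.%d %H:%M")


def parse_word_count(raw):
    """Pull the first run of digits out of a string such as '字数 1532'."""
    match = re.search(r"\d+", raw)
    if match is None:
        raise ValueError(f"no digits found in {raw!r}")
    return int(match.group())


# Made-up sample values in the assumed formats
published = parse_publish_time("2018.12.09 21:33*")
words = parse_word_count("字数 1532")
print(published, words)
```

If the real strings differ, strptime raises ValueError, which makes a format change on the site easy to notice.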

6. Saving the results to MySQL

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
import pymysql

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"}

# The exercise table must already exist in the crazydata database
conn = pymysql.connect(host="localhost", user="root", password='123456', db='crazydata', port=3306, charset="utf8")

def get_page_index(number):
    """Fetch one page of the index; return its HTML, or None on failure."""
    url = "https://www.jianshu.com/c/af12635a5aa3?order_by=added_at&page=%s" % number
    try:
        response = requests.get(url, headers=headers)  # GET request
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request failed")
        return None

def parse_index_page(html):
    """Return a dict mapping each article link to its title."""
    soup = BeautifulSoup(html, 'html.parser')
    note_list = soup.find_all("ul", class_="note-list")[0]
    content_li = note_list.find_all("li")
    articles = {}
    for item in content_li:
        title_link = item.find_all("a", class_="title")[0]
        title = title_link.contents[0]
        link = "https://www.jianshu.com" + title_link.get("href")
        # Titles can repeat but links cannot, so the link is the key
        articles[link] = title
    return articles

def get_page_detail(url):
    """Fetch one article page; return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=headers)  # GET request
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print("Request for the detail page failed")
        return None

def parse_detail_page(title, html):
    """Extract the author name, publish time, and word count from a detail page."""
    soup = BeautifulSoup(html, 'html.parser')
    info = soup.find_all("div", class_="info")[0]
    name = info.find_all("span", class_="name")[0].find_all("a")[0].contents[0]
    # The spans inside <div class="meta"> hold the publish time and word count
    content_detail = info.find_all("div", class_="meta")[0].find_all("span")
    content_detail = [span.contents[0] for span in content_detail]
    publish_time = content_detail[0]
    word_age = content_detail[1]
    return title, name, publish_time, word_age

def save_to_mysql(title, name, publish_time, word_age):
    """Insert one submission into the exercise table."""
    cur = conn.cursor()
    insert_data = "INSERT INTO exercise(name,title,publish_time,word_age) VALUES(%s,%s,%s,%s)"
    # Convert the parsed values to plain strings before passing them to the driver
    val = (str(name), str(title), str(publish_time), str(word_age))
    cur.execute(insert_data, val)
    conn.commit()

def main():
    for number in range(1, 6):
        html = get_page_index(number)
        if html is None:  # skip index pages that failed to download
            continue
        articles = parse_index_page(html)
        for link, title in articles.items():
            detail_html = get_page_detail(link)
            if detail_html is None:  # skip articles that failed to download
                continue
            title, name, publish_time, word_age = parse_detail_page(title, detail_html)
            save_to_mysql(title, name, publish_time, word_age)
    conn.close()

if __name__ == "__main__":
    main()

Let's check the result in MySQL.
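The introduction mentions filtering the submissions by time each week. A minimal sketch of that step, assuming the publish times have already been converted to datetime objects and that the group keeps its own member list: the function names, the member names, and the sample records are all hypothetical, and the 7-day window ending at the Sunday 12:00 deadline is one possible reading of the weekly rule.

```python
from datetime import datetime, timedelta


def week_window(deadline):
    """Return (start, end) for the 7 days leading up to the deadline."""
    return deadline - timedelta(days=7), deadline


def submitted_in_window(records, start, end):
    """Keep records whose publish time falls in (start, end].

    Each record is a (name, title, publish_time) tuple.
    """
    return [r for r in records if start < r[2] <= end]


def missing_members(members, records, start, end):
    """Members with no submission inside the window, sorted by name."""
    submitted = {name for name, _, _ in submitted_in_window(records, start, end)}
    return sorted(set(members) - submitted)


# Hypothetical data: 2018-12-16 was a Sunday
deadline = datetime(2018, 12, 16, 12, 0)
start, end = week_window(deadline)
records = [
    ("alice", "Post A", datetime(2018, 12, 12, 20, 0)),  # inside the window
    ("bob", "Post B", datetime(2018, 12, 8, 9, 0)),      # previous week
]
members = ["alice", "bob", "carol"]

print(missing_members(members, records, start, end))  # ['bob', 'carol']
```

The same window could also be applied in a SQL WHERE clause once publish_time is stored as a proper datetime column.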

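I suggest you try this yourself. On one hand it is good practice for your crawling skills; on the other, you can see clearly how many people are working hard together and what each of them has done. Come back after a while to see who is still at it and how those who persisted have changed. The new year, 2019, is almost here: let's write down our plans, carry them out step by step, and witness each other's growth.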

Author: 凡人求索
Link: https://www.jianshu.com/p/d1c356df040a
