My First Web Crawler in Python

Posted on 2015-02-17, updated on 2019-08-06

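Requirements:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
BeautifulSoup (install with: pip install beautifulsoup4, which provides the bs4 module imported below) (documentation)

First, let's get a taste of Python's magic by fetching our first page.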


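# Python 2: urlopen lives directly in the urllib module
from urllib import urlopen

html = urlopen("http://m.lssdjt.com/?date=2015-1-1").read()  # raw bytes of the page
html = html.decode("utf-8")  # the page is UTF-8 encoded
print(html)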

This is the "Today in History" (历史上的今天) page, and we can see that it loaded correctly. Now let's go through how it works.

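Import the urllib library.
urlopen fetches the data at the URL, and read() returns it as a string.
Decode it: the page is UTF-8 encoded, so we decode it to keep the output from being garbled.
Print it to the screen. That's Python: it really is that simple.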
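The decode step is the one most likely to go wrong, so here is a minimal sketch of it in isolation. It is written for Python 3 (the post itself targets Python 2.7), the helper name decode_page is hypothetical, and the sample bytes stand in for what read() would return; nothing is fetched from the network.

```python
def decode_page(raw, encoding="utf-8"):
    """Turn the raw bytes returned by read() into text.

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    """
    return raw.decode(encoding, errors="replace")


# Stand-in for urlopen(...).read(): the UTF-8 bytes of a short Chinese title.
raw = "历史上的今天".encode("utf-8")
text = decode_page(raw)

# Six characters, but three bytes each once encoded as UTF-8.
print(len(text), len(raw))
```

Decoding with the wrong encoding is what produces garbled text, which is why the encoding has to match the page.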

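Now that we have the page data, it's time to analyze it. If you look at the page http://m.lssdjt.com/?date=2015-1-1, you'll see that all we need to grab are the links. So let's bring out the big guns and extract them.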


from urllib import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://m.lssdjt.com/?date=2015-1-1").read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    if link.string is not None:  # skip links without a single text node
        title = link.get_text()
        href = link['href']
        print href, title

The output now looks very neat.

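Analysis:

for link in soup.find_all('a'): finds every a tag, that is, a link whose text looks like 2010年-中国－东盟自贸区正式建成, and loops over the results as link
if link.string is not None: only if it is not empty
title = link.get_text() gets the text, that is, the text between the opening and closing tags
href = link['href'] gets the link target, that is, href="xxx"
Print the result.

Now that the links are sorted out too, we can simply download them and save them to disk.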
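To make the "find every a tag, keep its text and href" idea concrete without touching the network, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup. This is a swapped-in technique, not what the post uses; the class name LinkCollector and the sample HTML are made up, and the code is written for Python 3.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text:  # skip links that have no text
                self.links.append((self._href, text))
            self._href = None


# Made-up sample: one link with text, one empty link.
sample = ('<ul><li><a href="/d/1.html">First event</a></li>'
          '<li><a href="/d/2.html"></a></li></ul>')
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The empty link is dropped, which roughly mirrors the link.string is not None check, although the two filters are not identical for nested tags.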

To make them easy to look up, I decided to save each title and href to a CSV file (a spreadsheet).


# coding: utf-8
__author__ = 'ypw'

from urllib import urlopen, urlretrieve
from bs4 import BeautifulSoup

html = urlopen("http://m.lssdjt.com/?date=2015-1-1").read()
soup = BeautifulSoup(html, "html.parser")
f = open('data.csv', 'w')
for link in soup.find_all('a'):
    if link.string is not None:
        title = link.get_text()
        print title
        title = title.encode("gbk")
        # Encode href too so unicode and byte strings are not mixed below
        href = link['href'].encode("gbk")
        urlretrieve("http://m.lssdjt.com/" + href, href)
        f.write(href + "," + title + "\n")  # "\n" ends the CSV row

f.close()

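Explanation:

f = open('data.csv', 'w') opens data.csv in write mode ('w')
title = title.encode("gbk") on Windows we encode as gbk so that Excel does not show garbled text
urlretrieve("http://m.lssdjt.com/" + href, href) urlretrieve is the function that downloads a file; we simply call it as urlretrieve(url, filename). For more detailed usage, please look it up yourself.
f.write(href + "," + title + "\n") writes one row of data to the CSV file
f.close() closes the file

To run the code above, create a new folder and save the code there as a .py file, then create a folder named d inside it (because the links look like /d/xxx.html). Only then run the .py file you just saved. Do not run the code directly in cmd: if you do, you will find a lot of html files in your system directory. Don't ask me how I know.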
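One thing the hand-built rows do not handle is a title that itself contains a comma. As a hedged alternative, the standard csv module quotes such fields automatically. The sketch below is written for Python 3, uses made-up rows, and writes to an in-memory buffer rather than data.csv so it can be run anywhere.

```python
import csv
import io

# Made-up rows; the second title contains a comma.
rows = [
    ("/d/1.html", "Title without comma"),
    ("/d/2.html", "Title, with a comma"),
]

# io.StringIO stands in for a real file opened with newline="".
buf = io.StringIO(newline="")
writer = csv.writer(buf)
writer.writerows(rows)

# Reading the data back shows the comma stayed inside one field.
parsed = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))
print(parsed)
```

With a real file, opening it with an encoding of gbk would serve the same purpose as the encode("gbk") call in the post.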

Run it inside the folder you just created and something magical happens:

All the pages have been downloaded.
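The manual step of creating the d folder could also be done by the script. The following is only a sketch under assumptions: it is written for Python 3, the helper name local_path_for is hypothetical, and it assumes links shaped like /d/xxx.html as described above. It uses a temporary folder so nothing is written next to your script.

```python
import os
import tempfile


def local_path_for(href, base_dir):
    """Map a link such as '/d/1.html' to a path under base_dir.

    Any missing parent folder (here 'd') is created on the way.
    """
    path = os.path.join(base_dir, *href.strip("/").split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


base = tempfile.mkdtemp()  # throwaway folder for the demo
target = local_path_for("/d/1.html", base)
print(os.path.isdir(os.path.join(base, "d")))
```

Passing the returned path as the filename argument of urlretrieve would keep every download inside the chosen folder, whatever the current directory is.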

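Next we add another loop, so that once the first day is downloaded the crawler moves on to the next day. So how do we find the next page? Let's look at the link to the next day: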

<li class="r" onClick="location.href='?date=2015-1-2'">后一天>></li>

Its key features are li, class=r and onClick=xxx, so we can write a statement like this to grab the onClick value:

soup.find("li", "r")['onclick']

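Why is onclick lowercase here? Because HTML attribute names are case-insensitive, the parser converts them to lowercase, so looking up the mixed-case name fails with an error and we have to use lowercase. The output is location.href='?date=2015-1-2'. We then use some string processing to get the part inside the quotes.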


r = soup.find("li", "r")['onclick'].split("'")[1]
print r

Look at what we added: we took the extracted location.href='?date=2015-1-2' and called split("'") on it, which turns it into ['location.href=', '?date=2015-1-2', '']. What we need is the second piece, so we take [1], and the final result is ?date=2015-1-2.
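The split step can be tried on its own with a plain string. The sketch below (Python 3) reproduces it and, for comparison, shows a regular expression that pulls out the same quoted part; the regex variant is an alternative suggested here, not something the post uses.

```python
import re

# The attribute value as extracted from the "next day" button.
onclick = "location.href='?date=2015-1-2'"

# Approach from the post: split on the quote and take the middle piece.
parts = onclick.split("'")
next_query = parts[1]

# Alternative: capture whatever sits between the first pair of quotes.
match = re.search(r"'([^']*)'", onclick)
next_query_re = match.group(1) if match else None

print(parts)
print(next_query, next_query_re)
```

The regex version returns None instead of raising IndexError when the value contains no quotes, which may be easier to handle in a long-running loop.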

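Then we write a loop to crawl a whole year. To guard against errors we wrap the download in a try statement, and to save time we set a 3-second socket timeout.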


# coding: utf-8
__author__ = 'ypw'

from urllib import urlopen, urlretrieve
from bs4 import BeautifulSoup
import socket

socket.setdefaulttimeout(3)  # give up on any socket operation after 3 seconds

f = open('data.csv', 'w')
r = "?date=2015-1-1"
t = 0
while True:
    t += 1
    html = urlopen("http://m.lssdjt.com/"+r).read()
    soup = BeautifulSoup(html, "html.parser")
    # Query string of the next day, taken from the "next day" button
    r = soup.find("li", "r")['onclick'].split("'")[1]
    for link in soup.find_all('a'):
        if link.string is not None:
            title = link.get_text()
            print title
            title = title.encode("gbk", "ignore")
            # Encode href too so unicode and byte strings are not mixed below
            href = link['href'].encode("gbk")
            if href.find("html") > 0:
                done = False
                while not done:  # keep retrying until the download succeeds
                    try:
                        print "getting " + href
                        urlretrieve("http://m.lssdjt.com/" + href, href)
                        done = True
                    except IOError:
                        print "timed out"

            f.write(href + "," + title + "\n")  # "\n" ends the CSV row
    print "This is day {0}".format(t)
    if t > 365:
        break

f.close()

The above is the program I am currently running.

It has now finished running, having crawled nearly 14,000 pages.
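The download loop in the program retries forever, so a page that never responds would stall the whole crawl. A bounded retry is one possible alternative. The sketch below is written for Python 3, the helper name retry and its parameters are hypothetical, and a fake download function stands in for urlretrieve so the example runs offline.

```python
import time


def retry(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or the attempts run out.

    Returns True on success, False if every attempt raised IOError.
    """
    for attempt in range(1, attempts + 1):
        try:
            action()
            return True
        except IOError:
            print("attempt {0} failed".format(attempt))
            if attempt < attempts:
                time.sleep(delay)  # optional pause before the next try
    return False


# Fake download that times out twice, then succeeds.
calls = []


def flaky_download():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("timed out")


ok = retry(flaky_download)
print(ok, len(calls))
```

When retry returns False the crawler could log the link and move on, rather than blocking on it.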

Here is the table of crawled href and title values: data.csv