Using the three parsing approaches: regular expressions, bs4, and XPath
  • The data-scraping workflow (a minimal sketch follows this list)
  •   Specify the URL
  •   Send the request with the requests module
  •   Get the data out of the response
  •   Parse the data (regex parsing, bs4 parsing, XPath parsing)
  •   Persist the results to storage
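A minimal sketch of that workflow, assuming a placeholder URL (example.com) and scraping nothing more than the page's h1 headings:

import requests
from bs4 import BeautifulSoup

# 1. specify the URL (example.com is a placeholder)
url = 'http://www.example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}

# 2. send the request with the requests module
page_text = requests.get(url=url, headers=headers).text

# 3./4. get the response data and parse it (bs4 here; regex or XPath work the same way)
soup = BeautifulSoup(page_text, 'lxml')
titles = [h1.text for h1 in soup.find_all('h1')]

# 5. persist the results
with open('titles.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(titles))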

bs4(BeautifulSoup)

  • Installation
pip install bs4 
pip install lxml
  • How it works
  •   1. Load the source code to be parsed into a BeautifulSoup object
  •   2. Call the object's methods or attributes to locate the target tags in the source
  •   3. Extract the text or attribute values contained in the located tags
  • Basic usage
Workflow:
    - Import: from bs4 import BeautifulSoup
    - Usage: turn an HTML document into a BeautifulSoup object, then use the object's methods and attributes to find the nodes you want (see the sketch after this list)
        (1) From a local file:
             - soup = BeautifulSoup(open('local file'), 'lxml')
        (2) From network content:
             - soup = BeautifulSoup('a str or bytes object', 'lxml')
        (3) Printing the soup object displays the content of the HTML document
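A minimal sketch of both constructors; the file name page.html and the inline HTML string are placeholders:

from bs4 import BeautifulSoup

# (1) from a local file (page.html is a hypothetical file name)
# soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'lxml')

# (2) from a string fetched over the network
html = '<html><body><a href="/a" id="feng">first</a></body></html>'
soup = BeautifulSoup(html, 'lxml')
print(soup)   # (3) printing the object shows the HTML document's content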

Core usage (a short demo follows this list):
    (1) Find by tag name
        - soup.a   returns only the first matching tag
    (2) Get attributes
        - soup.a.attrs  returns all of the a tag's attributes and values as a dict
        - soup.a.attrs['href']   gets the href attribute
        - soup.a['href']   shorthand for the same thing
    (3) Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()
       [Note] If the tag contains other tags, string returns None, while the other two still return the text content
    (4) find: returns the first matching tag
        - soup.find('a')  first match
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
    (5) find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b']) finds all a and b tags
        - soup.find_all('a', limit=2)  only the first two matches
    (6) Select with a CSS selector
                select: soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
            - Hierarchy selectors:
                div .dudu #lala .meme .xixi  any number of levels below
                div > p > a > .lala          exactly one level below
        [Note] select always returns a list; use an index to pull out the element you want
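A short demo of those calls against a hand-written HTML snippet (the snippet and its class/id names are made up for illustration):

from bs4 import BeautifulSoup

html = '''
<div class="tang">
    <ul>
        <li><a href="http://www.example.com/1" title="one">first <b>poem</b></a></li>
        <li><a href="http://www.example.com/2" id="feng">second poem</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a)                                 # first a tag in the document
print(soup.a['href'])                         # -> http://www.example.com/1
print(soup.a.string)                          # -> None (the tag contains a nested <b> tag)
print(soup.a.text)                            # -> 'first poem'
print(soup.find('a', id='feng').text)         # -> 'second poem'
print(len(soup.find_all('a')))                # -> 2
print(soup.select('.tang > ul > li > a')[0])  # select always returns a list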

Requirement: use bs4 to scrape every chapter of the novel Romance of the Three Kingdoms from the shicimingju site and save it to local disk     http://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# parse the catalog page and grab every chapter link
soup = BeautifulSoup(page_text, 'lxml')

a_list = soup.select('.book-mulu > ul > li > a')

fp = open('sanguo.txt', 'w', encoding='utf-8')
for a in a_list:
    # chapter title and the URL of its detail page
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers).text

    # parse the detail page and pull out the chapter text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text

    fp.write(title + '\n' + content)
    print(title, 'downloaded')
print('over')
fp.close()

XPath parsing

  • Installation
pip install lxml
  • How it works (a minimal sketch follows this list)
  1. Get the page source data
  2. Instantiate an etree object and load the page source into it
  3. Call the object's xpath method to locate the target tags
  4. Note: the xpath method must be combined with an XPath expression to do the tag location and content capture
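A minimal sketch of steps 2 and 3 above; the file name page.html and the sample HTML string are placeholders:

from lxml import etree

# from a local file (page.html is a hypothetical file name)
# tree = etree.parse('page.html')

# from page source fetched over the network
tree = etree.HTML('<html><body><div class="song">hello</div></body></html>')

# step 3: the xpath method takes an XPath expression and returns a list
print(tree.xpath('//div[@class="song"]/text()'))   # -> ['hello']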
  • Basic usage (a runnable demo follows this list)
Locating by attribute:
    # find the div tag whose class attribute is "song"
    //div[@class="song"] 
Locating by hierarchy & index:
    # find the direct child a under the second li of the ul that is a direct child of the div with class "tang"
    //div[@class="tang"]/ul/li[2]/a
Logical operators:
    # find the a tag whose href attribute is empty and whose class attribute is "du"
    //a[@href="" and @class="du"]
Fuzzy matching:
    //div[contains(@class, "ng")]
    //div[starts-with(@class, "ta")]
Getting text:
    # /text() returns the text directly inside a tag
    # //text() returns the text inside the tag and all of its descendants
    //div[@class="song"]/p[1]/text()
    //div[@class="tang"]//text()
Getting attributes:
    //div[@class="tang"]//li[2]/a/@href

Install the XPath plugin in the browser to verify XPath expressions: the plugin lets you run an XPath expression directly against the page

  Drag the xpath plugin onto Chrome's extensions page (More tools -> Extensions) to install it

  Toggle the plugin on and off with ctrl + shift + x

Examples:

  Parse second-hand housing listings from 58.com: https://bj.58.com/shahe/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0047-e4e6-f587-683307ca570e&ClickID=1

import requests
from lxml import etree

url = 'https://bj.58.com/shahe/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0047-e4e6-f587-683307ca570e&ClickID=1'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
page_text = requests.get(url=url,headers=headers).text

tree = etree.HTML(page_text)
# one li per listing
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
fp = open('58.csv', 'w', encoding='utf-8')
for li in li_list:
    title = li.xpath('./div[2]/h2/a/text()')[0]
    # the price is split across several child nodes, so join the pieces
    price = li.xpath('./div[3]//text()')
    price = ''.join(price)
    fp.write(title + ":" + price + '\n')
fp.close()
print('over')

Parse image data from: http://pic.netbian.com/4kmeinv/

import requests
from lxml import etree
import os
import urllib.request

url = 'http://pic.netbian.com/4kmeinv/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
response = requests.get(url=url,headers=headers)
#response.encoding = 'utf-8'
if not os.path.exists('./imgs'):
    os.mkdir('./imgs')
page_text = response.text

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')
for li in li_list:
    img_name = li.xpath('./a/b/text()')[0]
    # the page is GBK-encoded but was decoded as ISO-8859-1, so re-encode/decode to fix the garbled Chinese
    img_name = img_name.encode('iso-8859-1').decode('gbk')
    img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    img_path = './imgs/' + img_name + '.jpg'
    urllib.request.urlretrieve(url=img_url, filename=img_path)
    print(img_path, 'downloaded')
print('over!!!')

Download the images from jandan.net: http://jandan.net/ooxx         the image data is encoded (an anti-scraping measure)

import requests
from lxml import etree
import base64
import urllib.request

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://jandan.net/ooxx'
page_text = requests.get(url=url,headers=headers).text

tree = etree.HTML(page_text)
# the image URLs are stored base64-encoded in span.img-hash (the site's anti-scraping measure)
img_hash_list = tree.xpath('//span[@class="img-hash"]/text()')
for img_hash in img_hash_list:
    img_url = 'http:' + base64.b64decode(img_hash).decode()
    img_name = img_url.split('/')[-1]
    urllib.request.urlretrieve(url=img_url, filename=img_name)

Crawl the resume templates from sc.chinaz.com: http://sc.chinaz.com/jianli/free_%d.html

import requests
import random
from lxml import etree
headers = {
    'Connection':'close', # close the connection as soon as the request finishes (releases the pooled connection promptly)
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
for page in range(1,4):
    # page 1 has its own URL; later pages follow the free_%d.html pattern
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % page
    
    response = requests.get(url=new_url,headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text

    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]

        # the detail page lists several mirror download links; pick one at random
        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        download_url = random.choice(download_list)
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
            print(fileName, 'downloaded')

Parse all the city names from https://www.aqistudy.cn/historydata

import requests
from lxml import etree
headers = {
    'Connection':'close', # close the connection as soon as the request finishes (releases the pooled connection promptly)
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url,headers=headers).text

tree = etree.HTML(page_text)
# the | operator unions two XPath expressions: hot cities and the full city list in one pass
li_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
for li in li_list:
    city_name = li.xpath('./a/text()')[0]
    print(city_name)


Don't be afraid of goodbyes; believe that the turning of the seasons and the blooming and falling of flowers are all meant to be.