[toc]

0. Scraping Target

Starting from the poem snippets listed on the target page, scrape the complete text of each ancient poem.

1. Target Analysis

Both target pages are returned as complete, server-rendered HTML documents; there is no separate API request to intercept. So we fetch the pages themselves and parse the poems out of the HTML.
Request details:

  1. The request method is GET
  2. The response body is the HTML page itself

A quick way to confirm this is shown in the sketch below.

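A minimal sketch to verify that assumption: a plain GET with a browser-like User-Agent should already contain the full HTML listing (the User-Agent string is abbreviated here; the full one from section 3 can be substituted):

    import requests

    # Sketch: confirm the listing page is served as complete HTML on a plain GET
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get("https://www.gushiwen.org/shiju/", headers=headers, timeout=10)

    print(resp.status_code)                      # expect 200
    print(resp.headers.get("Content-Type"))      # expect something like text/html
    print("<html" in resp.text.lower())          # True if the body is a full HTML document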

2. Create the Output Directory

    # 0. Create the directory where scraped poems will be saved
    # ---------------------------------------------
    save_shici_url = "./shici/"
    if not os.path.exists(save_shici_url):
        os.mkdir(save_shici_url)
    # ---------------------------------------------
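
An equivalent, slightly more robust variant (a sketch, not the original code) uses os.makedirs with exist_ok=True, which also creates intermediate directories and does not fail if the folder already exists:

    import os

    save_shici_url = "./shici/"
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(save_shici_url, exist_ok=True)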

3. Assemble the Request

    # 1. Assemble the request
    # ---------------------------------------------
    # 1.1 Target URL
    url = "https://www.gushiwen.org/shiju/"

    # 1.2 User-Agent spoofing: make the crawler look like a regular browser to get past basic anti-crawling checks
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    # ---------------------------------------------
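
Since the same headers are reused for the second-level requests in section 6, an alternative (a sketch of a possible refactor, not the structure used in this post) is to carry them in a requests.Session, which also reuses the underlying connection:

    # Sketch: a Session applies the headers to every request made through it
    session = requests.Session()
    session.headers.update(headers)
    # later: response = session.get(url)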

4. Send the Request

    # 2. Send the request
    # ---------------------------------------------
    response = requests.get(url=url, headers=headers)
    # ---------------------------------------------
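
In practice it is worth adding a timeout and a status check, so that a blocked or failed request neither hangs the script nor silently hands an error page to the parser (a hedged sketch; the original code omits both):

    # Sketch: fail fast on network problems or non-2xx responses
    response = requests.get(url=url, headers=headers, timeout=10)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx status codes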

5. Process the Response Body

    # 3. Parse the response
    # ---------------------------------------------
    resp_cont = response.content
    soup_content = BeautifulSoup(resp_cont, "lxml")
    # Select every <a target="_blank"> that is a direct child of a .cont block (the poem links)
    a_list = soup_content.select('.cont > a[target="_blank"]')
    # ---------------------------------------------
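
To see why the persistence step below walks a_list with a step of 2, it helps to print what the selector matched; taking every second element suggests the links come in pairs per .cont block, and a quick inspection (a sketch to run against soup_content from above) makes the pattern visible:

    # Inspection sketch: dump the first few matched links and their targets
    print(len(a_list))
    for a in a_list[:6]:
        print(a.get_text(strip=True), "->", a.get("href"))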

6. Persist the Poems

Using the extracted links, scrape a second time to fetch each full poem.

    # 4. Persist
    # ---------------------------------------------
    # The matched links come in pairs; take every second one (the full-poem detail link)
    for index in range(1, len(a_list), 2):

        # 4.1 Assemble the request
        # ---------------------------------------------
        gushi_url = a_list[index]['href']
        # ---------------------------------------------

        # 4.2 Send the request
        # ---------------------------------------------
        gushi_response = requests.get(url=gushi_url, headers=headers)
        # ---------------------------------------------

        # 4.3 Parse the response
        # ---------------------------------------------
        gushi_resp_cont = gushi_response.content
        gushi_soup_content = BeautifulSoup(gushi_resp_cont, "lxml")
        gushi_title = gushi_soup_content.select('#sonsyuanwen > .cont > h1')
        gushi_author_source = gushi_soup_content.select('#sonsyuanwen > .cont > .source > a')
        gushi_content = gushi_soup_content.select('#sonsyuanwen > .cont > .contson')
        # ---------------------------------------------

        # 4.4 Persist to disk
        # ---------------------------------------------
        # Title, then the source line (both source links concatenated), then the body with one sentence per line
        save_gushi_content = gushi_title[0].text + "\n" + \
                             gushi_author_source[0].text + gushi_author_source[1].text + "\n" + \
                             gushi_content[0].text.replace("\n", "").replace(" ", "").replace("。", "。\n")
        shici_name = gushi_title[0].text.replace(" ", "").replace("/", "") + ".txt"
        shici_url = save_shici_url + shici_name
        with open(shici_url, "w", encoding="utf-8") as fs:
            fs.write(save_gushi_content)
        print(shici_name + " scraped successfully ^v^")
        # ---------------------------------------------
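
Two robustness tweaks worth considering inside the loop body (a sketch under the assumption that hrefs might be relative and that titles can contain other characters that are illegal in file names; neither tweak is in the original code): resolve each href against the listing URL with urljoin, sanitize the file name with a regular expression instead of chained replace calls, and pause briefly between detail-page requests to be polite to the site.

    import re
    import time
    from urllib.parse import urljoin

    # Resolve a possibly relative href against the listing page URL
    gushi_url = urljoin(url, a_list[index]['href'])

    # Strip whitespace and characters that are illegal in file names on common filesystems
    shici_name = re.sub(r'[\\/:*?"<>|\s]', "", gushi_title[0].text) + ".txt"

    time.sleep(1)  # small delay between requests to the detail pages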

7. Complete Code

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import os
    import requests
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        # 0. Create the directory where scraped poems will be saved
        # ---------------------------------------------
        save_shici_url = "./shici/"
        if not os.path.exists(save_shici_url):
            os.mkdir(save_shici_url)
        # ---------------------------------------------


        # 1. Assemble the request
        # ---------------------------------------------
        # 1.1 Target URL
        url = "https://www.gushiwen.org/shiju/"

        # 1.2 User-Agent spoofing: make the crawler look like a regular browser to get past basic anti-crawling checks
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
        }
        # ---------------------------------------------

        # 2. Send the request
        # ---------------------------------------------
        response = requests.get(url=url, headers=headers)
        # ---------------------------------------------

        # 3. Parse the response
        # ---------------------------------------------
        resp_cont = response.content
        soup_content = BeautifulSoup(resp_cont, "lxml")
        a_list = soup_content.select('.cont > a[target="_blank"]')
        # ---------------------------------------------

        # 4. Persist
        # ---------------------------------------------
        for index in range(1, len(a_list), 2):

            # 4.1 Assemble the request
            # ---------------------------------------------
            gushi_url = a_list[index]['href']
            # ---------------------------------------------

            # 4.2 Send the request
            # ---------------------------------------------
            gushi_response = requests.get(url=gushi_url, headers=headers)
            # ---------------------------------------------

            # 4.3 Parse the response
            # ---------------------------------------------
            gushi_resp_cont = gushi_response.content
            gushi_soup_content = BeautifulSoup(gushi_resp_cont, "lxml")
            gushi_title = gushi_soup_content.select('#sonsyuanwen > .cont > h1')
            gushi_author_source = gushi_soup_content.select('#sonsyuanwen > .cont > .source > a')
            gushi_content = gushi_soup_content.select('#sonsyuanwen > .cont > .contson')
            # ---------------------------------------------

            # 4.4 Persist to disk
            # ---------------------------------------------
            save_gushi_content = gushi_title[0].text + "\n" + \
                                 gushi_author_source[0].text + gushi_author_source[1].text + "\n" + \
                                 gushi_content[0].text.replace("\n", "").replace(" ", "").replace("。", "。\n")
            shici_name = gushi_title[0].text.replace(" ", "").replace("/", "") + ".txt"
            shici_url = save_shici_url + shici_name
            with open(shici_url, "w", encoding="utf-8") as fs:
                fs.write(save_gushi_content)
            print(shici_name + " scraped successfully ^v^")
            # ---------------------------------------------
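
The script relies on the third-party packages requests, beautifulsoup4 and lxml (the parser name passed to BeautifulSoup); all three need to be installed in the Python environment before running it.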



8. Run and Test

Running the script prints a "scraped successfully ^v^" message for each poem and writes one .txt file per poem into the ./shici/ directory.
