[toc]

0、Crawl target

Crawl the images from the Qiushibaike (糗事百科) hot-images board.

1、Target analysis

1.1 Link analysis

The URL of the first page is: https://www.qiushibaike.com/imgrank/page/1/
Clicking through to the second page gives: https://www.qiushibaike.com/imgrank/page/2/
So pagination is driven entirely by the trailing page number (1, 2, ...).
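
The pattern is easy to confirm in code; a minimal sketch (the range covering the first two pages is just an example):

    base_url = "https://www.qiushibaike.com/imgrank/page/"
    # build each page URL by appending the page number and a trailing slash
    page_urls = [base_url + str(page) + "/" for page in range(1, 3)]
    # -> ['https://www.qiushibaike.com/imgrank/page/1/',
    #     'https://www.qiushibaike.com/imgrank/page/2/']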

1.2 Image analysis

Press F12 to open the developer tools and use the element picker to click one of the images. In the markup, each image sits inside a div element with class "thumb", so the image URL can be pinned down with a regular expression:

    <div class="thumb">.*?<img src="(.*?)" alt=".*?</div>
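
To sanity-check the pattern, run it against a stripped-down, hand-written stand-in for the page markup (the HTML and image URL below are made up for illustration, not copied from the site):

    import re

    sample_html = '''
    <div class="thumb">
    <a href="/article/1" target="_blank">
    <img src="//pic.example.com/imgs/demo.jpg" alt="demo">
    </a>
    </div>
    '''

    pattern = '<div class="thumb">.*?<img src="(.*?)" alt=".*?</div>'
    # re.S lets "." also match newlines, so the pattern can span several lines
    print(re.findall(pattern, sample_html, re.S))
    # -> ['//pic.example.com/imgs/demo.jpg']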

1.3 Saving the images

Once the image URLs have been located, request each one to get the actual image data and save it to disk.
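
In isolation, saving a single image is just a GET request plus a binary write; a tiny sketch (the URL is a placeholder, not a real address):

    import requests

    img_url = "https://pic.example.com/imgs/demo.jpg"  # placeholder address
    img_bytes = requests.get(img_url).content          # raw bytes of the image
    with open("demo.jpg", "wb") as f:                  # "wb": binary mode, images are not text
        f.write(img_bytes)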

2、Create the output folder

    # Step 0: create the output folder if it does not exist yet
    # ---------------------------------------------
    save_dir_path = "./qiu_img"
    if not os.path.exists(save_dir_path):
        os.mkdir(save_dir_path)
    # ---------------------------------------------
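
An equivalent variant is os.makedirs with exist_ok=True, which also creates any missing parent directories and does not raise if the folder already exists:

    os.makedirs(save_dir_path, exist_ok=True)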

3、Assemble the request

    # Step 1: assemble the request
    # ---------------------------------------------
    # 1. Base URL; the page number is appended inside the loop
    url = "https://www.qiushibaike.com/imgrank/page/"

    # 2. UA spoofing: make the crawler look like a regular browser to get past basic anti-bot checks
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    # ---------------------------------------------
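
Before looping over the pages, it can help to fire one test request and confirm that this UA gets through; a quick check (the 10-second timeout is an arbitrary choice):

    test = requests.get(url + "1/", headers=headers, timeout=10)
    print(test.status_code)  # 200 means the request was accepted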

4、Send the request

        # Step 2: send the request for the current page
        # ---------------------------------------------
        # Build the page URL in a new variable; reassigning url itself would
        # make the page numbers pile up across iterations ("/page/1/2/")
        page_url = url + str(index) + "/"
        response = requests.get(url=page_url, headers=headers)
        # ---------------------------------------------
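
If you would rather fail loudly on a bad page than silently parse an error response, a hedged variant adds a timeout and a status check:

        response = requests.get(page_url, headers=headers, timeout=10)
        response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses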

5、Process the response

        # Step 3: process the response body and extract the image URLs with a regex
        # ---------------------------------------------
        response_text = response.text
        rep_url = '<div class="thumb">.*?<img src="(.*?)" alt=".*?</div>'
        img_url_list = re.findall(rep_url, response_text, re.S)
        # ---------------------------------------------
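
The regex does the job, but it breaks as soon as the markup changes slightly. An alternative sketch using BeautifulSoup instead of re (not what this article's code uses; it assumes the beautifulsoup4 package is installed and that every image really sits inside a div with class "thumb"):

        from bs4 import BeautifulSoup

        soup = BeautifulSoup(response_text, "html.parser")
        # collect the src attribute of every <img> nested under <div class="thumb">
        img_url_list = [img.get("src") for img in soup.select("div.thumb img")]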

6、Persist the images

        # Step 4: persist the images to disk
        # ---------------------------------------------
        for img in img_url_list:
            img_name = img.split("/")[-1]
            img_path = save_dir_path + "/" + img_name
            with open(img_path, "wb") as fs:
                # the matched src is protocol-relative ("//..."), so prepend the scheme
                img_full_url = "https:" + img
                img_response = requests.get(img_full_url, headers=headers)
                img_content = img_response.content
                fs.write(img_content)
                print(img_name + " downloaded successfully ^v^")
        # ---------------------------------------------
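
If you later widen the page range, consider pausing briefly between downloads so the image requests are not fired back to back (the 0.5-second delay is an arbitrary choice):

        import time  # in the full script this import belongs at the top with the others

        for img in img_url_list:
            img_name = img.split("/")[-1]
            with open(save_dir_path + "/" + img_name, "wb") as fs:
                fs.write(requests.get("https:" + img, headers=headers, timeout=10).content)
            time.sleep(0.5)  # arbitrary pause between image downloads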

7、Complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import re
import os

if __name__ == '__main__':
    # Step 0: create the output folder if it does not exist yet
    # ---------------------------------------------
    save_dir_path = "./qiu_img"
    if not os.path.exists(save_dir_path):
        os.mkdir(save_dir_path)
    # ---------------------------------------------

    # Step 1: assemble the request
    # ---------------------------------------------
    # 1. Base URL; the page number is appended inside the loop
    url = "https://www.qiushibaike.com/imgrank/page/"

    # 2. UA spoofing: make the crawler look like a regular browser to get past basic anti-bot checks
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    # ---------------------------------------------

    # Crawl the first two pages of images
    # ---------------------------------------------
    for index in range(1, 3):
        # Step 2: send the request for the current page
        # ---------------------------------------------
        # Build the page URL in a new variable; reassigning url itself would
        # make the page numbers pile up across iterations ("/page/1/2/")
        page_url = url + str(index) + "/"
        response = requests.get(url=page_url, headers=headers)
        # ---------------------------------------------

        # Step 3: process the response body and extract the image URLs with a regex
        # ---------------------------------------------
        response_text = response.text
        rep_url = '<div class="thumb">.*?<img src="(.*?)" alt=".*?</div>'
        img_url_list = re.findall(rep_url, response_text, re.S)
        # ---------------------------------------------

        # Step 4: persist the images to disk
        # ---------------------------------------------
        for img in img_url_list:
            img_name = img.split("/")[-1]
            img_path = save_dir_path + "/" + img_name
            with open(img_path, "wb") as fs:
                # the matched src is protocol-relative ("//..."), so prepend the scheme
                img_full_url = "https:" + img
                img_response = requests.get(img_full_url, headers=headers)
                img_content = img_response.content
                fs.write(img_content)
                print(img_name + " downloaded successfully ^v^")
        # ---------------------------------------------

8、Running it

Run the script: a success message is printed for each downloaded file, and the images land in the ./qiu_img folder.
