[toc]

0、Scraping target

Scrape the listing page shown below and extract the details of each property.
image.png

1、Target analysis

Filtering the browser's network panel to XHR and refreshing the page turns up no API endpoint that returns the listing data, so the plan is to fetch the rendered page itself and parse the 58.com listings out of the HTML.
Request details:

  1. Method: GET
  2. Response body: an HTML page

image.png

Pagination is certainly possible, but this walkthrough keeps things simple and scrapes a single page.
First page: https://bj.58.com/xinfang/loupan/all/p1/
Second page: https://bj.58.com/xinfang/loupan/all/p2/
Changing the number after p at the end of the URL selects the page, so paginated scraping only needs a loop over that value.
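
The pagination rule above can be sketched as a small URL builder (the page count of 3 is just an illustration, not a limit from the site):

```python
# Build the paginated listing URLs described above.
# The template string matches the p1/p2/... pattern from the analysis.
BASE = "https://bj.58.com/xinfang/loupan/all/p{}/"

def page_urls(n_pages):
    """Return the listing URLs for pages 1..n_pages."""
    return [BASE.format(i) for i in range(1, n_pages + 1)]

for u in page_urls(3):
    print(u)
```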

2、Assembling the request

    # 1. Assemble the request
    # ---------------------------------------
    # Target URL: the first listing page identified in the analysis above
    url = "https://bj.58.com/xinfang/loupan/all/p1/"

    # Browser-like headers so the request is not rejected as a bot
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    # ---------------------------------------
    # ---------------------------------------

3、Sending the request

    # 2. Send the request
    # ---------------------------------------
    response = requests.get(url=url, headers=headers)
    # ---------------------------------------
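
A slightly more defensive variant of this step would add a timeout, surface HTTP errors, and pin the text encoding. The helper below is a sketch; the name `fetch` and the 10-second timeout are assumptions, not from the original:

```python
import requests

def fetch(url, headers, timeout=10):
    """GET a page, fail loudly on HTTP errors, and decode as UTF-8."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()   # raise on 4xx/5xx instead of parsing an error page
    response.encoding = "utf-8"   # avoid mojibake from requests' encoding guess
    return response.text
```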

4、Handling the response

    # 3. Parse the response
    # ---------------------------------------
    response_text = response.text
    response_tree = etree.HTML(response_text)
    # One <div> per property card; the trailing space in 'item-mod '
    # is part of the class attribute on the page
    list_div = response_tree.xpath("//div[@class='key-list imglazyload']/div[@class='item-mod ']")
    house_info = ""
    for ht in list_div:
        # Anchors holding the textual details (name, address, tags, ...)
        item_list = ht.xpath(".//div[@class='infos']/a")
        house_info += "----------------\n"
        for item in item_list:
            house_info += "".join(item.xpath(".//span/text()")) + "\n"
        house_info += "=============\n"
        # Price block of the card
        house_info += "".join(ht.xpath(".//a[@class='favor-pos']/p//text()")) + "\n"
        house_info += "=============\n"
    # ---------------------------------------
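
The XPath expressions used here can be checked offline against a tiny HTML fragment that imitates the listing markup (the fragment below is a made-up stand-in, not real 58.com output):

```python
from lxml import etree

# Minimal fake listing markup mirroring the classes the XPaths target,
# including the trailing space in 'item-mod '.
SAMPLE = """
<div class="key-list imglazyload">
  <div class="item-mod ">
    <div class="infos"><a><span>Sunshine Court</span></a></div>
    <a class="favor-pos"><p>25000 yuan/m2</p></a>
  </div>
</div>
"""

tree = etree.HTML(SAMPLE)
items = tree.xpath("//div[@class='key-list imglazyload']/div[@class='item-mod ']")
for card in items:
    name = "".join(card.xpath(".//div[@class='infos']/a//span/text()"))
    price = "".join(card.xpath(".//a[@class='favor-pos']/p//text()"))
    print(name, price)
```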

5、Persisting the data

The detail-page addresses on each card could also be used for a second-level crawl; here the collected text is simply written to a file.

    # 4. Persist the data
    # ---------------------------------------
    with open("./58.txt", "w", encoding="utf-8") as fs:
        # Normalize the full-width / non-breaking spaces the page uses
        fs.write(house_info.replace("\u3000", " ").replace("\xa0", " "))
    print("Scrape finished ^v^")
    # ---------------------------------------
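
The second-level crawl hinted at above would start from each card's detail-page link. Collecting those links could look like this; the href values below are placeholders, and targeting `@href` on the `infos` anchor is an assumption about the page markup:

```python
from lxml import etree

# Fake listing markup with placeholder detail-page URLs.
LIST_HTML = """
<div class="key-list imglazyload">
  <div class="item-mod ">
    <div class="infos"><a href="https://bj.58.com/xinfang/loupan/demo1/"><span>A</span></a></div>
  </div>
  <div class="item-mod ">
    <div class="infos"><a href="https://bj.58.com/xinfang/loupan/demo2/"><span>B</span></a></div>
  </div>
</div>
"""

def detail_links(html):
    """Collect unique detail-page URLs, preserving their page order."""
    tree = etree.HTML(html)
    hrefs = tree.xpath("//div[@class='item-mod ']//div[@class='infos']/a/@href")
    return list(dict.fromkeys(hrefs))  # dedupe while keeping order

# Each collected link could then be fetched with the same
# requests.get(url, headers=headers) pattern used above.
print(detail_links(LIST_HTML))
```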

6、Complete code

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import requests
    from lxml import etree

    if __name__ == '__main__':
        # 1. Assemble the request
        # ---------------------------------------
        # Target URL: the first listing page identified in the analysis above
        url = "https://bj.58.com/xinfang/loupan/all/p1/"

        # Browser-like headers so the request is not rejected as a bot
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
        }
        # ---------------------------------------

        # 2. Send the request
        # ---------------------------------------
        response = requests.get(url=url, headers=headers)
        # ---------------------------------------

        # 3. Parse the response
        # ---------------------------------------
        response_text = response.text
        response_tree = etree.HTML(response_text)
        # One <div> per property card; the trailing space in 'item-mod '
        # is part of the class attribute on the page
        list_div = response_tree.xpath("//div[@class='key-list imglazyload']/div[@class='item-mod ']")
        house_info = ""
        for ht in list_div:
            # Anchors holding the textual details (name, address, tags, ...)
            item_list = ht.xpath(".//div[@class='infos']/a")
            house_info += "----------------\n"
            for item in item_list:
                house_info += "".join(item.xpath(".//span/text()")) + "\n"
            house_info += "=============\n"
            # Price block of the card
            house_info += "".join(ht.xpath(".//a[@class='favor-pos']/p//text()")) + "\n"
            house_info += "=============\n"
        # ---------------------------------------

        # 4. Persist the data
        # ---------------------------------------
        with open("./58.txt", "w", encoding="utf-8") as fs:
            # Normalize the full-width / non-breaking spaces the page uses
            fs.write(house_info.replace("\u3000", " ").replace("\xa0", " "))
        print("Scrape finished ^v^")
        # ---------------------------------------


7、Test run

image.png

image.png
image.png

Q.E.D.


Only creation is true enjoyment; only striving makes for a full life.