Using Python + Clash to Scrape Data from Overseas Sites

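Sometimes you want to scrape data from overseas sites, but with a proxy tool such as Clash running, requests fails with requests.exceptions.ProxyError. The following code works around the problem: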

import requests
import urllib3


# Passing verify=False (further down) triggers a warning; silence it
urllib3.disable_warnings()

url = "https://www.google.com"

# Clash's default proxy port is 7890
proxies = {
    'http': 'http://127.0.0.1:7890/',
    'https': 'http://127.0.0.1:7890/'
}

response = requests.get(url=url, verify=False, proxies=proxies)
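Before sending the request, it can help to confirm that something is actually listening on the proxy port, since a proxy that is not running is one common reason for connection failures through a proxy. The following is only a minimal sketch using the standard library; the helper name proxy_is_listening is hypothetical, and the host and port simply reuse the Clash defaults from the code above.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=7890, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper; it only checks that the port is open,
    not that the listener is really an HTTP proxy.
    """
    try:
        # create_connection raises OSError on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumes Clash is on its default port 7890
    print(proxy_is_listening())
```

If this prints False, the proxy is probably not running or is listening on a different port, so check the port configured in Clash first.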

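Hard-coding the proxy's IP address and port is not very elegant. Instead, getproxies from urllib.request can read the system's web proxy settings. It returns an empty dict when no proxy is enabled, and a dict like the following when one is: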

{
    'http': 'http://127.0.0.1:7890',
    'https': 'https://127.0.0.1:7890',
    'ftp': 'ftp://127.0.0.1:7890'
}
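Because the result may be empty, and because the 'https' entry above carries an https:// scheme that a local plain-HTTP proxy port may not accept, it can be convenient to reduce the dict to a single proxy URL. The sketch below is one possible way to do that. The helper name pick_proxy and its fallback parameter are hypothetical, and forcing the http:// scheme assumes the local proxy speaks plain HTTP, which you should verify for your own setup.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import getproxies


def pick_proxy(system_proxies, fallback=None):
    """Return one proxy URL from a getproxies()-style dict, or `fallback`."""
    # Prefer the "http" entry; use "https" only if "http" is missing.
    raw = system_proxies.get("http") or system_proxies.get("https")
    if not raw:
        return fallback
    parts = urlsplit(raw)
    if not parts.netloc:
        # No usable scheme in the value (e.g. "127.0.0.1:7890"): prepend one.
        return "http://" + raw
    # Assumption: the local proxy speaks plain HTTP, so force http://.
    return urlunsplit(("http", parts.netloc, parts.path, parts.query, parts.fragment))


# None when no system proxy is configured and no fallback is given
print(pick_proxy(getproxies()))
```

Passing fallback="http://127.0.0.1:7890" would restore the hard-coded Clash default only when the system reports no proxy at all.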

Calling urllib3.disable_warnings() is not elegant either. With httpx, neither urllib3.disable_warnings() nor verify=False is needed.

from urllib.request import getproxies
import httpx


response = httpx.get(
    "https://www.google.com/",
    proxy=getproxies().get("http")
)

You can also use a Client, which works much like a requests Session:

from urllib.request import getproxies
import httpx


client = httpx.Client(
    proxy=getproxies().get("http")
)

response = client.get("https://www.google.com/")

