urllib.error.HTTPError: HTTP Error 403: Forbidden

created at 03-17-2021

Error message

urllib.error.HTTPError: HTTP Error 403: Forbidden

Reason

When urllib.request.urlopen opens a URL, the target server receives only a bare request for the page; it gets no information about the browser, operating system, or hardware platform behind the request. Requests missing this information are often treated as abnormal traffic, such as crawlers.

To block such abnormal access, some websites verify the User-Agent in the request headers (which typically identifies the hardware platform, operating system, and application software). If the User-Agent is unusual or missing, the request is rejected, producing the error message shown above.
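By default, urllib identifies itself with a generic Python User-Agent, which is easy for servers to flag. As a quick check (a minimal sketch; the exact version string depends on your Python installation), you can inspect the default headers an opener would send:

```python
from urllib import request

# build_opener() seeds addheaders with urllib's default User-Agent,
# something like [('User-agent', 'Python-urllib/3.x')]
opener = request.build_opener()
print(opener.addheaders)
```

This is the value a server sees when no custom header is supplied, and it is exactly what User-Agent checks are designed to reject.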

Solution

We can solve this problem by adding browser-like headers to the request.

For example, consider this snippet from Web Scraping with Python, Second Edition by Ryan Mitchell (O'Reilly), Copyright 2018 Ryan Mitchell, 978-1-491-998557-1:

from urllib.request import urlopen
html = urlopen('http://target_example.com/page1.html')
print(html.read())

We would change this to:

from urllib.request import urlopen, Request

# Send a typical browser User-Agent so the server accepts the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'}
target_url = 'http://target_example.com/page1.html'
req = Request(target_url, headers=headers)
html = urlopen(req, timeout=10)
print(html.read())
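You can confirm the header is actually attached to the Request without making a network call (a small sketch; note that Request normalizes header names by capitalizing them, so the stored key is 'User-agent'):

```python
from urllib.request import Request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'}
req = Request('http://target_example.com/page1.html', headers=headers)

# Request capitalizes header keys, so query it as 'User-agent'
print(req.get_header('User-agent'))
```

If get_header returns your browser string instead of None, the spoofed User-Agent will be sent with the request.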

urllib.request.urlretrieve

If you are using urllib.request.urlretrieve, you can solve this problem by installing a global opener that carries the same header:

from urllib import request

opener = request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36')]
request.install_opener(opener)

# savedir and get_domain_url are assumed to be defined elsewhere
request.urlretrieve(url=url, filename='%s/%s.txt' % (savedir, get_domain_url(url=url)))