Python requests vs. urllib.urlretrieve
Python’s urllib.request.urlretrieve doesn’t have a way to handle connection timeouts. This can lead to user complaints that your program is hanging, when really it’s a bad internet connection: urlretrieve will hang for many minutes.
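The only standard-library knob that reaches urlretrieve is the process-wide default socket timeout, which affects every socket the program opens afterwards. A minimal sketch, assuming a global timeout is acceptable for your program (the 10-second value and the URL are placeholders):

import socket
import urllib.request

# applies to every socket created from now on in this process,
# including the ones urlretrieve opens internally
socket.setdefaulttimeout(10)  # seconds; arbitrary example value

urllib.request.urlretrieve("https://example.com/file.bin", "file.bin")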
Python requests download files
Using requests is a more robust way to download files in Python, with a timeout. I name the function url_retrieve to remind myself not to use the old one.
from pathlib import Path
import requests


def url_retrieve(url: str, outfile: Path, timeout: float = 10.0):
    # timeout (seconds) makes a dead connection raise instead of hanging; 10 is an arbitrary default
    R = requests.get(url, allow_redirects=True, timeout=timeout)
    if R.status_code != 200:
        raise ConnectionError('could not download {}\nerror code: {}'.format(url, R.status_code))

    outfile.write_bytes(R.content)
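A quick usage sketch (the URL and output filename are placeholders):

url_retrieve("https://example.com/data.csv", Path("data.csv"))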
Why isn’t this in requests? Because the Requests BDFL doesn’t want it.
pure Python download files
If you can’t or don’t want to use requests, here is how to download files in Python using only built-in modules:
from pathlib import Path
import urllib.request
import urllib.error
import socket


def url_retrieve(
    url: str,
    outfile: Path,
    overwrite: bool = False,
):
    """
    Parameters
    ----------
    url: str
        URL to download from
    outfile: pathlib.Path
        output filepath (including name)
    overwrite: bool
        overwrite if file exists
    """
    outfile = Path(outfile).expanduser().resolve()
    if outfile.is_dir():
        raise ValueError("Please specify full filepath, including filename")
    # need .resolve() in case intermediate relative dir doesn't exist
    if overwrite or not outfile.is_file():
        outfile.parent.mkdir(parents=True, exist_ok=True)
        try:
            urllib.request.urlretrieve(url, str(outfile))
        except (socket.gaierror, urllib.error.URLError) as err:
            raise ConnectionError(
                "could not download {} due to {}".format(url, err)
            )
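Note that this still inherits urlretrieve’s lack of a per-call timeout. If you want an explicit timeout using only the standard library, urllib.request.urlopen accepts one; a minimal sketch (the function name and the 10-second default are mine, not from the post):

from pathlib import Path
import urllib.request


def url_retrieve_timeout(url: str, outfile: Path, timeout: float = 10.0):
    # urlopen takes a per-call timeout, unlike urlretrieve
    with urllib.request.urlopen(url, timeout=timeout) as response:
        Path(outfile).write_bytes(response.read())

Like the requests version above, this reads the whole file into memory before writing it, so it is best suited to small and medium downloads.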