Python requests vs. urllib.urlretrieve

Python’s urllib.request.urlretrieve has no way to set a connection timeout. This can lead to user complaints that your program is hanging, when the real problem is a bad internet connection: urlretrieve can stall for many minutes before failing.
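
The only built-in mitigation for the legacy function is the process-wide default socket timeout, which urlretrieve’s connections inherit. A minimal sketch, assuming a 10-second limit is acceptable and using a hypothetical URL (note this changes the timeout for every socket in the process):

import socket
import urllib.request

# urlretrieve has no timeout argument, but it honors the global socket default,
# so setting it at least bounds how long a dead connection can hang
socket.setdefaulttimeout(10)  # seconds; arbitrary example value

urllib.request.urlretrieve("https://example.com/file.zip", "file.zip")  # hypothetical URL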

Python requests download files

This is a robust way to download files in Python, with a timeout. I name it url_retrieve to remind myself not to use the legacy urlretrieve.

from pathlib import Path
import requests

def url_retrieve(url: str, outfile: Path, timeout: float = 10.0):
    # timeout covers both connecting and each read, so a dead link fails fast
    R = requests.get(url, allow_redirects=True, timeout=timeout)
    if R.status_code != 200:
        raise ConnectionError("could not download {}\nerror code: {}".format(url, R.status_code))

    outfile.write_bytes(R.content)
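
A minimal usage sketch, assuming a hypothetical URL and output filename:

url_retrieve("https://example.com/data.csv", Path("data.csv"))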

Why isn’t this in requests itself? Because the Requests BDFL doesn’t want it.

pure Python download files

If you can’t or don’t want to use requests, here is how to download files with a timeout using only built-in modules:

from pathlib import Path
import urllib.request
import urllib.error
import socket


def url_retrieve(
    url: str,
    outfile: Path,
    overwrite: bool = False,
    timeout: float = 10.0,
):
    """
    Parameters
    ----------
    url: str
        URL to download from
    outfile: pathlib.Path
        output filepath (including name)
    overwrite: bool
        overwrite if file exists
    timeout: float
        seconds to wait on the connection before giving up
    """
    # need .resolve() in case an intermediate relative directory doesn't exist yet
    outfile = Path(outfile).expanduser().resolve()
    if outfile.is_dir():
        raise ValueError("Please specify full filepath, including filename")

    if overwrite or not outfile.is_file():
        outfile.parent.mkdir(parents=True, exist_ok=True)
        try:
            # urlopen (unlike urlretrieve) takes a timeout, so a dead
            # connection raises instead of hanging for many minutes
            with urllib.request.urlopen(url, timeout=timeout) as response:
                outfile.write_bytes(response.read())
        except (socket.timeout, socket.gaierror, urllib.error.URLError) as err:
            raise ConnectionError(
                "could not download {} due to {}".format(url, err)
            ) from err
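
A minimal usage sketch of this version, again with a hypothetical URL and output path; ConnectionError is what the function raises on any network failure:

if __name__ == "__main__":
    try:
        url_retrieve("https://example.com/data.csv", Path("~/downloads/data.csv"))
    except ConnectionError as e:
        print(e)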