urlutils - Structured URL
urlutils is a module dedicated to one of software’s most versatile, well-aged, and beloved data structures: the URL, also known as the Uniform Resource Locator.
Among other things, this module is a full reimplementation of URLs, without any reliance on the urlparse or urllib standard library modules. The centerpiece and top-level interface of urlutils is the URL type. Also featured is the find_all_links() convenience function. Some low-level functions and constants are also below.
The implementations in this module are based heavily on RFC 3986 and RFC 3987, and incorporate details from several other RFCs and W3C documents.
New in version 17.2.
The URL type
- class boltons.urlutils.URL(url='')
The URL is one of the most ubiquitous data structures in the virtual and physical landscape. From blogs to billboards, URLs are so common that it’s easy to overlook their complexity and power.
There are 8 parts of a URL, each with its own semantics and special characters: scheme, username, password, host, port, path, query_params, and fragment.
Each is exposed as an attribute on the URL object. RFC 3986 offers this brief structural summary of the main URL components:
    foo://user:pass@example.com:8042/over/there?name=ferret#nose
    \_/   \_______/ \_________/ \__/\_________/ \_________/ \__/
     |        |          |        |      |           |        |
  scheme  userinfo      host    port    path       query  fragment
And here’s how that example can be manipulated with the URL type:
    >>> url = URL('foo://example.com:8042/over/there?name=ferret#nose')
    >>> print(url.host)
    example.com
    >>> print(url.get_authority())
    example.com:8042
    >>> print(url.qp['name'])  # qp is a synonym for query_params
    ferret
URL’s approach to encoding is that inputs are decoded as much as possible, and data remains in this decoded state until re-encoded using the to_text() method. In this way, it’s similar to Python’s current approach of encouraging immediate decoding of bytes to text.
Note that URL instances are mutable objects. If an immutable representation of the URL is desired, the string from to_text() may be used. For an immutable, but almost-as-featureful, URL object, check out the hyperlink package.
- scheme
The scheme is an ASCII string, normally lowercase, which specifies the semantics for the rest of the URL, as well as network protocol in many cases. For example, “http” in “http://hatnote.com”.
- username
The username is a string used by some schemes for authentication. For example, “public” in “ftp://public@example.com”.
- password
The password is a string also used for authentication. Although passwords in URLs are technically deprecated by RFC 3986 Section 7.5, they are still used in cases when the URL is private or the password is public. For example, “password” in “db://private:password@127.0.0.1”.
- host
The host is a string used to resolve the network location of the resource, either empty, a domain, or IP address (v4 or v6). “example.com”, “127.0.0.1”, and “::1” are all good examples of host strings.
Per spec, fully-encoded output from to_text() is IDNA encoded for compatibility with DNS.
- port
The port is an integer used, along with host, in connecting to network locations. 8080 is the port in “http://localhost:8080/index.html”.
Note
As with 80 for HTTP and 22 for SSH, many schemes have default ports, and Section 3.2.3 of RFC 3986 states that when a URL’s port is the same as its scheme’s default port, the port should not be emitted:
    >>> URL(u'https://github.com:443/mahmoud/boltons').to_text()
    u'https://github.com/mahmoud/boltons'
Custom schemes can register their port with register_scheme(). See URL.default_port for more info.
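The port-elision rule can be sketched in a few lines. This is only an illustration of the rule from RFC 3986 Section 3.2.3, not the boltons implementation: the DEFAULT_PORTS table and the render_authority() helper are hypothetical names, and the table holds just three well-known defaults.

```python
# Hypothetical, deliberately tiny table of scheme defaults (illustration only).
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ssh': 22}


def render_authority(scheme, host, port=None):
    """Return host[:port], omitting the port when it is the scheme default."""
    if port is None or port == DEFAULT_PORTS.get(scheme.lower()):
        return host
    return '{}:{}'.format(host, port)


print(render_authority('https', 'github.com', 443))   # github.com
print(render_authority('https', 'github.com', 8443))  # github.com:8443
```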
- path
The string starting with the first leading slash after the authority part of the URL, ending with the first question mark. Often percent-quoted for network use. “/a/b/c” is the path of “http://example.com/a/b/c?d=e”.
- path_parts
The tuple form of path, split on slashes. Empty slash segments are preserved, including that of the leading slash:
    >>> url = URL(u'http://example.com/a/b/c')
    >>> url.path_parts
    (u'', u'a', u'b', u'c')
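One reason to keep a path as a tuple of segments is that splitting must happen before percent-decoding, otherwise an encoded slash inside a segment would be mistaken for a separator. The sketch below shows that ordering using the standard library's quote() and unquote(); the helper names are hypothetical and this is not how boltons itself is written.

```python
from urllib.parse import quote, unquote


def path_to_parts(path):
    """Split on '/' first, then percent-decode each segment."""
    return tuple(unquote(segment) for segment in path.split('/'))


def parts_to_path(parts):
    """Re-encode each segment (including any '/') and join with slashes."""
    return '/'.join(quote(segment, safe='') for segment in parts)


parts = path_to_parts('/docs/a%2Fb/')
print(parts)                 # ('', 'docs', 'a/b', '')
print(parts_to_path(parts))  # /docs/a%2Fb/
```

The empty first and last segments stand for the leading and trailing slashes, which is why they are preserved.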
- query_params
An instance of QueryParamDict, an OrderedMultiDict subtype, mapping textual keys and values which follow the first question mark after the path. Also available as the handy alias qp:
    >>> url = URL('http://boltons.readthedocs.io/en/latest/?utm_source=docs&sphinx=ok')
    >>> url.qp.keys()
    [u'utm_source', u'sphinx']
Also percent-encoded for network use cases.
- fragment
The string following the first ‘#’ after the query_params until the end of the URL. It has no inherent internal structure, and is percent-quoted.
- classmethod from_parts(scheme=None, host=None, path_parts=(), query_params=(), fragment='', port=None, username=None, password=None)
Build a new URL from parts. Note that the respective arguments are not in the order they would appear in a URL:
- Parameters:
scheme (str) – The scheme of a URL, e.g., ‘http’
host (str) – The host string, e.g., ‘hatnote.com’
path_parts (tuple) – The individual text segments of the path, e.g., (‘post’, ‘123’)
query_params (dict) – An OMD, dict, or list of (key, value) pairs representing the keys and values of the URL’s query parameters.
fragment (str) – The fragment of the URL, e.g., ‘anchor1’
port (int) – The integer port of the URL. Automatic defaults are available for registered schemes.
username (str) – The username for the userinfo part of the URL.
password (str) – The password for the userinfo part of the URL.
Note that this method does relatively little validation. URL.to_text() should be used to check if any errors are produced while composing the final textual URL.
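To make the idea of building a URL from parts concrete, here is a rough, standalone sketch of how such pieces could be assembled into text. It is not the boltons code path: compose_url() is a hypothetical helper, it assumes a scheme that uses "://", it ignores userinfo and default ports, and it leans on the standard library's quote() for encoding.

```python
from urllib.parse import quote


def compose_url(scheme, host, path_parts=(), query_params=(), fragment='', port=None):
    """Assemble URL text from already-decoded parts (simplified sketch)."""
    text = scheme + '://' + host
    if port is not None:
        text += ':{}'.format(port)
    if path_parts:
        # Each segment is encoded on its own so a '/' inside one stays data.
        text += '/' + '/'.join(quote(part, safe='') for part in path_parts)
    # Accept either a dict or an iterable of (key, value) pairs.
    pairs = query_params.items() if hasattr(query_params, 'items') else query_params
    query = '&'.join(quote(k, safe='') + '=' + quote(v, safe='') for k, v in pairs)
    if query:
        text += '?' + query
    if fragment:
        text += '#' + quote(fragment, safe='')
    return text


print(compose_url('http', 'hatnote.com', ('post', '123'), [('a', '1')], 'anchor1'))
# http://hatnote.com/post/123?a=1#anchor1
```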
- to_text(full_quote=False)
Render a string representing the current state of the URL object.
    >>> url = URL('http://listen.hatnote.com')
    >>> url.fragment = 'en'
    >>> print(url.to_text())
    http://listen.hatnote.com#en
The full_quote flag selects whether the URL is fully quoted or minimally quoted. The most common characteristic of an encoded URL is the presence of percent-encoded text (e.g., %60). Unquoted URLs are more readable and suitable for display, whereas fully-quoted URLs are more conservative and generally necessary for sending over the network.
- default_port
Return the default port for the currently-set scheme. Returns None if the scheme is unrecognized. See register_scheme() above. If port matches this value, no port is emitted in the output of to_text().
Applies the same ‘+’ heuristic detailed in URL.uses_netloc().
- uses_netloc
Whether or not a URL uses : or :// to separate the scheme from the rest of the URL depends on the scheme’s own standard definition. There is no way to infer this behavior from other parts of the URL. A scheme either supports network locations or it does not.
The URL type’s approach to this is to check for explicitly registered schemes, with common schemes like HTTP preregistered. This is the same approach taken by urlparse.
URL adds two additional heuristics if the scheme as a whole is not registered. First, it attempts to check the subpart of the scheme after the last + character. This adds intuitive behavior for schemes like git+ssh. Second, if a URL with an unrecognized scheme is loaded, it will maintain the separator it sees.
    >>> print(URL('fakescheme://test.com').to_text())
    fakescheme://test.com
    >>> print(URL('mockscheme:hello:world').to_text())
    mockscheme:hello:world
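The lookup order described here (registry first, then the part after the last +, then the separator that was actually seen) might be sketched as follows. The registry contents, the function name, and the seen_separator parameter are all assumptions made for illustration; boltons keeps its own, much larger registry.

```python
# Hypothetical mini-registry: True means the scheme uses '://' (a netloc).
NETLOC_REGISTRY = {'http': True, 'https': True, 'ssh': True,
                   'mailto': False, 'urn': False}


def guess_uses_netloc(scheme, seen_separator=None):
    """Return True/False, or None when nothing is known about the scheme."""
    scheme = scheme.lower()
    if scheme in NETLOC_REGISTRY:
        return NETLOC_REGISTRY[scheme]
    # Heuristic 1: fall back to the subpart after the last '+'.
    subscheme = scheme.rpartition('+')[2]
    if subscheme in NETLOC_REGISTRY:
        return NETLOC_REGISTRY[subscheme]
    # Heuristic 2: keep whatever separator the parsed text used.
    if seen_separator is not None:
        return seen_separator == '://'
    return None


print(guess_uses_netloc('git+ssh'))            # True
print(guess_uses_netloc('mockscheme', ':'))    # False
```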
- get_authority(full_quote=False, with_userinfo=False)
Used by URL schemes that have a network location, get_authority() combines username, password, host, and port into one string, the authority, that is used for connecting to a network-accessible resource.
Used internally by to_text() and can be useful for labeling connections.
    >>> url = URL('ftp://user@ftp.debian.org:2121/debian/README')
    >>> print(url.get_authority())
    ftp.debian.org:2121
    >>> print(url.get_authority(with_userinfo=True))
    user@ftp.debian.org:2121
- normalize(with_case=True)
Resolve any “.” and “..” references in the path, as well as normalize scheme and host casing. To turn off case normalization, pass with_case=False.
More information can be found in Section 6.2.2 of RFC 3986.
- navigate(dest)
Factory method that returns a new URL based on a given destination, dest. Useful for navigating those relative links with ease.
The newly created URL is normalized before being returned.
    >>> url = URL('http://boltons.readthedocs.io')
    >>> url.navigate('en/latest/')
    URL(u'http://boltons.readthedocs.io/en/latest/')
- Parameters:
dest (str) – A string or URL object representing the destination
More information can be found in Section 5 of RFC 3986.
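For a feel of what reference resolution per RFC 3986 Section 5 produces, the standard library's urljoin() can serve as a point of comparison. This is an analogous stdlib function, not the navigate() method itself, and example.com is just a placeholder host.

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/c'

# A relative path replaces the last segment of the base path.
print(urljoin(base, 'd'))                  # http://example.com/a/b/d
# Dot segments climb the hierarchy before the new segment is appended.
print(urljoin(base, '../d'))               # http://example.com/a/d
# An absolute path keeps only the scheme and authority of the base.
print(urljoin(base, '/x?y=1'))             # http://example.com/x?y=1
# A network-path reference keeps only the scheme.
print(urljoin(base, '//other.example/z'))  # http://other.example/z
```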
Low-level functions
A slew of functions used internally by URL.
- boltons.urlutils.parse_url(url_text)
Used to parse the text for a single URL into a dictionary, used internally by the URL type.
Note that “URL” has a very narrow, standards-based definition. While parse_url() may raise URLParseError under a very limited number of conditions, such as a non-integer port, a surprising number of strings are technically valid URLs. For instance, the text "url" is a valid URL, because it is a relative path.
In short, do not expect this function to validate form inputs or other more colloquial usages of URLs.
    >>> res = parse_url('http://127.0.0.1:3000/?a=1')
    >>> sorted(res.keys())  # res is a basic dictionary
    ['_netloc_sep', 'authority', 'family', 'fragment', 'host', 'password', 'path', 'port', 'query', 'scheme', 'username']
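The permissiveness described above follows from the grammar itself. As an illustration, the component-splitting regular expression given in RFC 3986 Appendix B matches any string at all. The sketch below wraps that regex in a hypothetical split_url() helper; the authority handling is a simplification and the returned keys are not meant to mirror parse_url() exactly.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_url(text):
    """Split URL text into a plain dict of components (no decoding)."""
    match = URI_RE.match(text)  # every group is optional, so this always matches
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    userinfo = host = port = None
    if authority is not None:
        hostport = authority
        if '@' in authority:
            userinfo, _, hostport = authority.rpartition('@')
        host, sep, port_text = hostport.rpartition(':')
        if sep and port_text.isdigit():
            port = int(port_text)
        else:
            host = hostport  # no usable port, e.g. 'example.com' or '[::1]'
    return {'scheme': scheme, 'userinfo': userinfo, 'host': host, 'port': port,
            'path': path, 'query': query, 'fragment': fragment}


print(split_url('foo://user:pass@example.com:8042/over/there?name=ferret#nose'))
print(split_url('url'))  # only 'path' is set: a relative path reference
```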
- boltons.urlutils.parse_host(host)
Low-level function used to parse the host portion of a URL.
Returns a tuple of (family, host) where family is a socket module constant or None, and host is a string.
    >>> parse_host('googlewebsite.com') == (None, 'googlewebsite.com')
    True
    >>> parse_host('[::1]') == (socket.AF_INET6, '::1')
    True
    >>> parse_host('192.168.1.1') == (socket.AF_INET, '192.168.1.1')
    True
Odd doctest formatting above due to py3’s switch from int to enums for socket constants.
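A comparable (family, host) classification can be approximated with the standard library's ipaddress module. The classify_host() helper below is hypothetical and is not necessarily how parse_host() works internally; it only mirrors the three cases shown in the doctest.

```python
import ipaddress
import socket


def classify_host(host):
    """Return (family, host): AF_INET6, AF_INET, or None for a name."""
    if host.startswith('[') and host.endswith(']'):
        literal = host[1:-1]
        ipaddress.IPv6Address(literal)  # raises ValueError if malformed
        return socket.AF_INET6, literal
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # Not a dotted-quad address, so treat it as a registered name.
        return None, host
    return socket.AF_INET, host


print(classify_host('[::1]') == (socket.AF_INET6, '::1'))  # True
```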
- boltons.urlutils.parse_qsl(qs, keep_blank_values=True, encoding='utf8')
Converts a query string into a list of (key, value) pairs.
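As a rough sketch of what such a conversion involves, the hypothetical parse_pairs() below splits on &, separates each key from its value at the first =, and percent-decodes both. Treating + as a space (as HTML form encoding does) and relying on the standard library's unquote() are assumptions of this example, not statements about the boltons function.

```python
from urllib.parse import unquote


def parse_pairs(qs, keep_blank_values=True):
    """Turn 'a=1&b=2' into [('a', '1'), ('b', '2')], preserving order."""
    pairs = []
    for chunk in qs.split('&'):
        if not chunk:
            continue  # skip empty pieces such as a trailing '&'
        key, _, value = chunk.partition('=')
        if not value and not keep_blank_values:
            continue
        pairs.append((unquote(key.replace('+', ' ')),
                      unquote(value.replace('+', ' '))))
    return pairs


print(parse_pairs('a=1&a=2&b=&c=x+y%21'))
# [('a', '1'), ('a', '2'), ('b', ''), ('c', 'x y!')]
```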
- boltons.urlutils.resolve_path_parts(path_parts)
Normalize the URL path by resolving segments of ‘.’ and ‘..’, resulting in a dot-free path. See RFC 3986 section 5.2.4, Remove Dot Segments.
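Dot-segment removal on a list of path segments can be sketched as below. The remove_dots() name is hypothetical, and the handling of edge cases (never popping the empty root segment, keeping a trailing slash when the path ends in a dot segment) reflects one reasonable reading of RFC 3986 Section 5.2.4 rather than a copy of the library code.

```python
def remove_dots(path_parts):
    """Resolve '.' and '..' in a sequence of path segments."""
    resolved = []
    for part in path_parts:
        if part == '.':
            continue
        if part == '..':
            # Pop the previous segment, but never the leading '' (the root).
            if resolved and (len(resolved) > 1 or resolved[0]):
                resolved.pop()
            continue
        resolved.append(part)
    # '/a/b/..' means the directory '/a/', so keep a trailing empty segment.
    if list(path_parts[-1:]) in (['.'], ['..']):
        resolved.append('')
    return resolved


print(remove_dots(['', 'a', 'b', '..', 'c']))  # ['', 'a', 'c']
print(remove_dots(['', 'a', 'b', '..']))       # ['', 'a', '']
```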
- class boltons.urlutils.QueryParamDict(*a, **kw)
A subclass of OrderedMultiDict specialized for representing query string values. Everything is fully unquoted on load and all parsed keys and values are strings by default.
As the name suggests, multiple values are supported and insertion order is preserved.
    >>> qp = QueryParamDict.from_text(u'key=val1&key=val2&utm_source=rtd')
    >>> qp.getlist('key')
    [u'val1', u'val2']
    >>> qp['key']
    u'val2'
    >>> qp.add('key', 'val3')
    >>> qp.to_text()
    'key=val1&key=val2&utm_source=rtd&key=val3'
See OrderedMultiDict for more API features.
- classmethod from_text(query_string)
Parse query_string and return a new QueryParamDict.
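The multi-value, order-preserving behavior can be imitated with a plain list of pairs. The TinyQueryParams class below is a hypothetical stand-in written for this explanation, far smaller than OrderedMultiDict, and it uses the standard library's quote() and unquote() for encoding.

```python
from urllib.parse import quote, unquote


class TinyQueryParams:
    """Minimal ordered multi-mapping backed by a list of (key, value) pairs."""

    def __init__(self, pairs=()):
        self._pairs = list(pairs)

    @classmethod
    def from_text(cls, query_string):
        pairs = []
        for chunk in query_string.split('&'):
            if chunk:
                key, _, value = chunk.partition('=')
                pairs.append((unquote(key), unquote(value)))
        return cls(pairs)

    def add(self, key, value):
        self._pairs.append((key, value))  # never overwrites earlier values

    def getlist(self, key):
        return [v for k, v in self._pairs if k == key]

    def __getitem__(self, key):
        values = self.getlist(key)
        if not values:
            raise KeyError(key)
        return values[-1]  # the most recently added value wins

    def to_text(self):
        return '&'.join(quote(k, safe='') + '=' + quote(v, safe='')
                        for k, v in self._pairs)


qp = TinyQueryParams.from_text('key=val1&key=val2&utm_source=rtd')
qp.add('key', 'val3')
print(qp.to_text())  # key=val1&key=val2&utm_source=rtd&key=val3
```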
Quoting
URLs have many parts, and almost as many individual “quoting” (encoding) strategies.
- boltons.urlutils.quote_userinfo_part(text, full_quote=True)
Quote special characters in either the username or password section of the URL. Note that userinfo in URLs is considered deprecated in many circles (especially browsers), and support for percent-encoded userinfo can be spotty.
- boltons.urlutils.quote_path_part(text, full_quote=True)
Percent-encode a single segment of a URL path.
- boltons.urlutils.quote_query_part(text, full_quote=True)
Percent-encode a single query string key or value.
- boltons.urlutils.quote_fragment_part(text, full_quote=True)
Quote the fragment part of the URL. Fragments don’t have subdelimiters, so the whole URL fragment can be passed.
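The difference between full and minimal quoting can be illustrated for a single path segment. In this sketch the quote_segment() helper and the SEGMENT_DELIMS set are hypothetical; the real functions use their own per-part safe-character sets, which may differ from the choices made here.

```python
from urllib.parse import quote

# Hypothetical set of characters that would change how the URL parses if
# left raw inside a path segment ('%' is included to keep output unambiguous).
SEGMENT_DELIMS = frozenset('/?#%')


def quote_segment(text, full_quote=True):
    """Percent-encode one path segment, fully or minimally."""
    if full_quote:
        # Encode everything except RFC 3986 unreserved characters,
        # including spaces and non-ASCII text (as UTF-8 bytes).
        return quote(text, safe='')
    # Minimal mode: escape only structural delimiters, keep the rest readable.
    return ''.join('%{:02X}'.format(ord(ch)) if ch in SEGMENT_DELIMS else ch
                   for ch in text)


print(quote_segment('a b/é'))                    # a%20b%2F%C3%A9
print(quote_segment('a b/é', full_quote=False))  # a b%2Fé
```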
There is, however, only one unquoting strategy:
- boltons.urlutils.unquote(string, encoding='utf-8', errors='replace')
Percent-decode a string, by replacing %xx escapes with their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method. By default, percent-encoded sequences are decoded with UTF-8, and invalid sequences are replaced by a placeholder character.
    >>> unquote(u'abc%20def')
    u'abc def'
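The decoding steps can be sketched independently of the library: turn each %xx escape into one byte, then decode the whole byte string with the chosen encoding and error handler. The percent_decode() helper is hypothetical, and it assumes the input text itself is encodable in the target encoding.

```python
import re

# Matches one percent-escape: '%' followed by exactly two hex digits.
_PCT_ESCAPE = re.compile(rb'%([0-9A-Fa-f]{2})')


def percent_decode(text, encoding='utf-8', errors='replace'):
    """Replace %xx escapes with bytes, then decode the result to text."""
    raw = text.encode(encoding)
    decoded = _PCT_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode(encoding, errors)


print(percent_decode('abc%20def'))  # abc def
print(percent_decode('caf%C3%A9'))  # café
```

With the default errors='replace', a sequence that is not valid UTF-8, such as %FF, comes back as the U+FFFD replacement character rather than raising an exception.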
Useful constants
- boltons.urlutils.SCHEME_PORT_MAP
A mapping of URL schemes to their protocols’ default ports. Painstakingly assembled from the IANA scheme registry, port registry, and independent research.
Keys are lowercase strings, values are integers or None, with None indicating that the scheme does not have a default port (or may not support ports at all):
    >>> boltons.urlutils.SCHEME_PORT_MAP['http']
    80
    >>> print(boltons.urlutils.SCHEME_PORT_MAP['file'])
    None
See URL.port for more info on how it is used. See NO_NETLOC_SCHEMES for more scheme info.
Also available in JSON.