pywb.utils package

Submodules

pywb.utils.binsearch module

Utility functions for performing binary search over a sorted text file

pywb.utils.binsearch.binsearch(reader, key, compare_func=<function cmp>, block_size=8192)[source]

Perform a binary search for a specified key to within a ‘block_size’ (default 8192) granularity, and return first full line found.

pywb.utils.binsearch.binsearch_offset(reader, key, compare_func=<function cmp>, block_size=8192)[source]

Find the offset of the line which matches a given ‘key’ using binary search. If the key is not found, the offset is that of the line after the key.

The file is subdivided into block_size (default 8192) sized blocks. An optional compare_func may be specified.
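The block-granularity search described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the names block_offset_sketch and search_sketch are hypothetical, plain byte ordering stands in for compare_func, and lines are assumed to be much shorter than block_size.

```python
import io


def block_offset_sketch(reader, key, block_size=8192):
    """Return the offset of the block where a linear scan for key should start."""
    reader.seek(0, io.SEEK_END)
    lo, hi = 0, reader.tell() // block_size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        reader.seek(mid * block_size)
        reader.readline()          # discard the partial line at the block boundary
        line = reader.readline()   # first full line of the block
        if key > line:
            lo = mid
        else:
            hi = mid
    return lo * block_size


def search_sketch(reader, key, block_size=8192):
    """Binary search to a block, then scan linearly for the first line >= key."""
    offset = block_offset_sketch(reader, key, block_size)
    reader.seek(offset)
    if offset > 0:
        reader.readline()          # same partial-line skip as in the binary search
    for line in iter(reader.readline, b''):
        if line >= key:
            return line
    return None


# Hypothetical sorted input: 100 lines with even-numbered keys.
data = b''.join(b'key%04d value\n' % i for i in range(0, 200, 2))
reader = io.BytesIO(data)
first_match = search_sketch(reader, b'key0101', block_size=64)
```

Because only the first full line of each probed block is compared, the binary phase narrows the search to one block and the linear phase does the rest.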

pywb.utils.binsearch.cmp(a, b)[source]
pywb.utils.binsearch.iter_exact(reader, key, token=b' ')[source]

Create an iterator which iterates over lines where the first field matches the ‘key’, equivalent to matching the prefix ‘key’ + token. The default field terminator/separator (token) is a single space.

pywb.utils.binsearch.iter_prefix(reader, key)[source]

Creates an iterator which iterates over lines that start with prefix ‘key’ in a sorted text file.

pywb.utils.binsearch.iter_range(reader, start, end, prev_size=0)[source]

Creates an iterator which iterates over lines where start <= line < end (end exclusive)
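The prefix and range semantics above can be illustrated on an in-memory sorted list. This is only a sketch under stated assumptions: iter_range_sketch and iter_prefix_sketch are hypothetical names, and the start position is located by a linear scan here rather than by binary search over a file.

```python
from itertools import dropwhile, takewhile


def iter_range_sketch(lines, start, end):
    """Yield lines where start <= line < end (end exclusive)."""
    at_or_after_start = dropwhile(lambda line: line < start, lines)
    return takewhile(lambda line: line < end, at_or_after_start)


def iter_prefix_sketch(lines, prefix):
    """Yield lines that start with prefix, assuming lines are sorted."""
    at_or_after = dropwhile(lambda line: line < prefix, lines)
    return takewhile(lambda line: line.startswith(prefix), at_or_after)


# Hypothetical sorted index lines.
lines = [
    b'com,example)/a 2014\n',
    b'com,example)/b 2015\n',
    b'com,example)/c 2016\n',
    b'org,example)/ 2017\n',
]
in_range = list(iter_range_sketch(lines, b'com,example)/b', b'com,example)/c'))
with_prefix = list(iter_prefix_sketch(lines, b'com,example)/'))
```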

pywb.utils.binsearch.linearsearch(iter_, key, prev_size=0, compare_func=<function cmp>)[source]

Perform a linear search over iterator until current_line >= key

Optionally also tracks up to N (prev_size) previous lines, which are returned before the first matched line.

If the end of the stream is reached before a match is found, nothing is returned (the previous lines are also discarded).
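A bounded deque is one natural way to keep the last N lines while scanning. The sketch below assumes plain ordering instead of compare_func, and linear_search_sketch is a hypothetical name.

```python
from collections import deque
from itertools import chain


def linear_search_sketch(lines, key, prev_size=0):
    """Scan until line >= key, keeping up to prev_size earlier lines."""
    it = iter(lines)
    # One extra slot holds the matching line itself.
    window = deque(maxlen=prev_size + 1)
    for line in it:
        window.append(line)
        if line >= key:
            # Previous lines, the match, then the rest of the stream.
            return chain(window, it)
    # End of stream without a match: previous lines are discarded too.
    return iter([])


found = list(linear_search_sketch([b'a', b'b', b'c', b'd'], b'c', prev_size=1))
```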

pywb.utils.binsearch.search(reader, key, prev_size=0, compare_func=<function cmp>, block_size=8192)[source]

Perform a binary search for a specified key to within a ‘block_size’ (default 8192) sized block followed by linear search within the block to find first matching line.

When performing the linear search, keep track of up to N previous lines before the first matching line.

pywb.utils.canonicalize module

Standard url-canonicalization, surt and non-surt

exception pywb.utils.canonicalize.UrlCanonicalizeException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.BadRequestException

class pywb.utils.canonicalize.UrlCanonicalizer(surt_ordered=True)[source]

Bases: object

pywb.utils.canonicalize.calc_search_range(url, match_type, surt_ordered=True, url_canon=None)[source]

Canonicalize a url (either with custom canonicalizer or standard canonicalizer with or without surt)

Then, compute the start and end urls of a search range for a given match type.

Supported match types:
  • exact
  • prefix
  • host
  • domain (only available with surt ordering)

Examples below:

# surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact')
('com,example)/path/file.html', 'com,example)/path/file.html!')

>>> calc_search_range('http://example.com/path/file.html', 'prefix')
('com,example)/path/file.html', 'com,example)/path/file.htmm')

# slash and ?
>>> calc_search_range('http://example.com/path/', 'prefix')
('com,example)/path/', 'com,example)/path0')

>>> calc_search_range('http://example.com/path?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/?', 'prefix')
('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/file.html', 'host')
('com,example)/', 'com,example*')
>>> calc_search_range('http://example.com/path/file.html', 'domain')
('com,example)/', 'com,example-')

# special case for tld domain range
>>> calc_search_range('com', 'domain')
('com,', 'com-')

# non-surt ranges
>>> calc_search_range('http://example.com/path/file.html', 'exact', False)
('example.com/path/file.html', 'example.com/path/file.html!')

>>> calc_search_range('http://example.com/path/file.html', 'prefix', False)
('example.com/path/file.html', 'example.com/path/file.htmm')
>>> calc_search_range('http://example.com/path/file.html', 'host', False)
('example.com/', 'example.com0')

# errors: domain range not supported
>>> calc_search_range('http://example.com/path/file.html', 'domain', False)   # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: matchType=domain unsupported for non-surt

>>> calc_search_range('http://example.com/path/file.html', 'blah', False)   # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
UrlCanonicalizeException: Invalid match_type: blah
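
The exact and prefix end keys in the examples above follow a simple pattern: append ‘!’ (the ASCII character right after the space separator) for exact, or bump the last character by one code point for prefix. A minimal sketch of that pattern, operating on an already canonicalized key; the function names are hypothetical and the host and domain cases are not covered:

```python
def inc_last_char(value):
    """Return value with its final character replaced by the next code point."""
    return value[:-1] + chr(ord(value[-1]) + 1)


def exact_range_sketch(key):
    """Range covering only lines whose first field equals key."""
    return key, key + '!'


def prefix_range_sketch(key):
    """Range covering every line that starts with key."""
    return key, inc_last_char(key)


exact = exact_range_sketch('com,example)/path/file.html')
prefix = prefix_range_sketch('com,example)/path/file.html')
```
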
pywb.utils.canonicalize.canonicalize(url, surt_ordered=True)[source]

Canonicalize url and convert to surt. If not in surt ordered mode, convert back to url form, as surt conversion is currently part of canonicalization.

>>> canonicalize('http://example.com/path/file.html', surt_ordered=True)
'com,example)/path/file.html'
>>> canonicalize('http://example.com/path/file.html', surt_ordered=False)
'example.com/path/file.html'
>>> canonicalize('urn:some:id')
'urn:some:id'
pywb.utils.canonicalize.unsurt(surt)[source]

# Simple surt
>>> unsurt('com,example)/')
'example.com/'

# Broken surt
>>> unsurt('com,example)')
'com,example)'

# Long surt
>>> unsurt('suffix,domain,sub,subsub,another,subdomain)/path/file/index.html?a=b?c=)/')
'subdomain.another.subsub.sub.domain.suffix/path/file/index.html?a=b?c=)/'
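One way to get the behaviour shown in the three examples above is to split at the first ‘)/’, reverse the comma-separated host parts, and leave the input untouched when no ‘)/’ is present. The sketch below is written to match those examples only; unsurt_sketch is a hypothetical name.

```python
def unsurt_sketch(surt):
    """Convert 'com,example)/path' back to 'example.com/path'."""
    index = surt.find(')/')
    if index < 0:
        # Broken surt: no host terminator, return the input unchanged.
        return surt
    host_parts = surt[:index].split(',')
    host_parts.reverse()
    # Skip the ')' but keep the '/' and everything after it.
    return '.'.join(host_parts) + surt[index + 1:]


simple = unsurt_sketch('com,example)/')
```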

pywb.utils.format module

class pywb.utils.format.ParamFormatter(params, name='', prefix='param.')[source]

Bases: string.Formatter

get_value(key, args, kwargs)[source]
pywb.utils.format.query_to_dict(query_str, multi=None)[source]
pywb.utils.format.res_template(template, params, **extra_params)[source]
pywb.utils.format.to_bool(val)[source]

pywb.utils.geventserver module

class pywb.utils.geventserver.GeventServer(app, port=0, hostname='localhost', handler_class=None, direct=False)[source]

Bases: object

Class for optionally running a WSGI application in a greenlet

join()[source]

Joins the greenlet spawned for running the server if it was started in non-direct mode

make_server(app, port, hostname, handler_class, direct=False)[source]

Creates and starts the server. If direct is true the server is run in the current thread otherwise in a greenlet.

Parameters:
  • app – The WSGI application instance to be used
  • port (int) – The port the server is to listen on
  • hostname (str) – The hostname the server is to use
  • handler_class – The class to be used for handling WSGI requests
  • direct (bool) – Whether the server should be run in the current thread (True) or in a greenlet (False)

stop()[source]

Stops the running server if it was started

class pywb.utils.geventserver.RequestURIWSGIHandler(sock, address, server, rfile=None)[source]

Bases: gevent.pywsgi.WSGIHandler

A specific WSGIHandler subclass that adds REQUEST_URI to the environ dictionary for every request

get_environ()[source]

Returns the WSGI environ dictionary with the REQUEST_URI added to it

Returns:The WSGI environ dictionary for the request
Return type:dict

pywb.utils.io module

class pywb.utils.io.OffsetLimitReader(stream, offset, length)[source]

Bases: warcio.limitreader.LimitReader

read(length=None)[source]
readline(length=None)[source]
class pywb.utils.io.StreamClosingReader(stream)[source]

Bases: object

close()[source]
read(length=None)[source]
readline(length=None)[source]
pywb.utils.io.StreamIter(stream, header1=None, header2=None, size=16384, closer=<class 'contextlib.closing'>)[source]
pywb.utils.io.buffer_iter(status_headers, iterator, buff_size=65536)[source]
pywb.utils.io.call_release_conn(stream)[source]
pywb.utils.io.chunk_encode_iter(orig_iter)[source]
pywb.utils.io.compress_gzip_iter(orig_iter)[source]
pywb.utils.io.no_except_close(closable)[source]

Attempts to call the close method of the supplied object, catching all exceptions. Also tries to call release_conn() in case the object is a requests raw stream.

Parameters:closable – The object to be closed
Return type:None
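
The behaviour described above can be sketched as below. close_quietly and the Flaky test double are hypothetical names used for illustration; the real function may differ in detail.

```python
def close_quietly(closable):
    """Call close() and, if present, release_conn(), ignoring any exception."""
    if closable is None:
        return
    for name in ('close', 'release_conn'):
        method = getattr(closable, name, None)
        if callable(method):
            try:
                method()
            except Exception:
                # Cleanup is best effort: a failure here must not propagate.
                pass


class Flaky:
    """Hypothetical stream whose close() always fails."""

    def __init__(self):
        self.released = False

    def close(self):
        raise IOError('already closed')

    def release_conn(self):
        self.released = True


stream = Flaky()
close_quietly(stream)
```
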

pywb.utils.loaders module

class pywb.utils.loaders.BaseLoader(**kwargs)[source]

Bases: object

load(url, offset=0, length=-1)[source]
class pywb.utils.loaders.BlockLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

A loader which can stream blocks of content given a uri, offset and optional length. Currently supports: http/https, file/local file system, pkg, WebHDFS, S3

static init_default_loaders()[source]
load(url, offset=0, length=-1)[source]
loaders = {'file': <class 'pywb.utils.loaders.LocalFileLoader'>, 'http': <class 'pywb.utils.loaders.HttpLoader'>, 'https': <class 'pywb.utils.loaders.HttpLoader'>, 'pkg': <class 'pywb.utils.loaders.PackageLoader'>, 's3': <class 'pywb.utils.loaders.S3Loader'>, 'webhdfs': <class 'pywb.utils.loaders.WebHDFSLoader'>}
profile_loader = None
static set_profile_loader(src)[source]
class pywb.utils.loaders.HMACCookieMaker(key, name, duration=10)[source]

Bases: object

Utility class to produce signed HMAC digest cookies to be used with each http request

make(extra_id='')[source]
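
As an illustration of a signed, expiring cookie, the sketch below signs an expiry timestamp with HMAC. The cookie layout, the SHA-256 digest, the function name and the now parameter are all assumptions made for this example, not a description of what HMACCookieMaker actually emits.

```python
import hashlib
import hmac
import time


def make_signed_cookie(key, name, duration=10, extra_id='', now=None):
    """Build 'name=expire-digest', where digest is an HMAC over the expiry."""
    now = time.time() if now is None else now
    expire = str(int(now + duration))
    # Assumed message layout: optional id, then the expiry timestamp.
    msg = extra_id + '-' + expire if extra_id else expire
    digest = hmac.new(key.encode('utf-8'), msg.encode('utf-8'),
                      hashlib.sha256).hexdigest()
    if extra_id:
        return '{0}-{1}={2}-{3}'.format(name, extra_id, expire, digest)
    return '{0}={1}-{2}'.format(name, expire, digest)


cookie = make_signed_cookie('secret', 'auth', duration=10, now=1000)
```

A receiver holding the same key can recompute the digest over the expiry value and compare it with hmac.compare_digest before trusting the cookie.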
class pywb.utils.loaders.HttpLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset, length)[source]

Load a file-like reader over http using range requests and an optional cookie created via a cookie_maker
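
For reference, an offset and length map onto an HTTP Range header as sketched below. HTTP byte ranges are inclusive, so the last byte is offset + length - 1. The helper name range_header_sketch is hypothetical, and treating a non-positive length as "read to the end" is an assumption based on the length=-1 default used elsewhere in this module.

```python
def range_header_sketch(offset, length=-1):
    """Return the Range header value for a block starting at offset."""
    if length > 0:
        # Inclusive end position.
        return 'bytes={0}-{1}'.format(offset, offset + length - 1)
    # Open-ended range: everything from offset onward.
    return 'bytes={0}-'.format(offset)


headers = {'Range': range_header_sketch(100, 50)}
```
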

class pywb.utils.loaders.LocalFileLoader(**kwargs)[source]

Bases: pywb.utils.loaders.PackageLoader

load(url, offset=0, length=-1)[source]

Load a file-like reader from the local file system

class pywb.utils.loaders.PackageLoader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset=0, length=-1)[source]
class pywb.utils.loaders.S3Loader(**kwargs)[source]

Bases: pywb.utils.loaders.BaseLoader

load(url, offset, length)[source]
class pywb.utils.loaders.WebHDFSLoader(**kwargs)[source]

Bases: pywb.utils.loaders.HttpLoader

Loader class specifically for loading webhdfs content

HTTP_URL = 'http://{host}/webhdfs/v1{path}?'
load(url, offset, length)[source]

Loads the supplied web hdfs content

Parameters:
  • url (str) – The URL to the web hdfs content to be loaded
  • offset (int|float|double) – The offset of the content to be loaded
  • length (int|float|double) – The length of the content to be loaded
Returns:

The raw response content

pywb.utils.loaders.from_file_url(url)[source]

Convert from file:// url to file path

pywb.utils.loaders.init_yaml_env_vars()[source]

Initializes the yaml parser to be able to set the value of fields from environment variables

Return type:None
pywb.utils.loaders.is_http(filename)[source]
pywb.utils.loaders.load(filename)[source]
pywb.utils.loaders.load_overlay_config(main_env_var, main_default_file='', overlay_env_var='', overlay_file='')[source]
pywb.utils.loaders.load_py_name(string)[source]
pywb.utils.loaders.load_yaml_config(config_file)[source]
pywb.utils.loaders.read_last_line(fh, offset=256)[source]

Read the last line from a seekable file. Start reading from offset bytes before the end of the file, and double the backwards seek until a line break is found. If the beginning of the file is reached (no line breaks), just return the whole file.
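
The doubling backwards seek can be sketched as follows, assuming a binary-mode seekable file. read_last_line_sketch is a hypothetical name, and returning b'' for an empty file is a choice made for this example.

```python
import io


def read_last_line_sketch(fh, offset=256):
    """Return the last line of a seekable binary file."""
    fh.seek(0, io.SEEK_END)
    size = fh.tell()
    while offset < size:
        fh.seek(-offset, io.SEEK_END)
        lines = fh.readlines()
        if len(lines) > 1:
            # At least one line break was seen, so the last entry is complete.
            return lines[-1]
        offset *= 2
    # The window now covers the whole file: read it from the start.
    fh.seek(0)
    lines = fh.readlines()
    return lines[-1] if lines else b''


last = read_last_line_sketch(io.BytesIO(b'first\nsecond\nthird\n'), offset=4)
```
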

pywb.utils.loaders.to_file_url(filename)[source]

Convert a filename to a file:// url
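
A round trip between file paths and file:// urls can be sketched with the standard library alone. This is not necessarily how to_file_url and from_file_url are implemented; the _sketch names are hypothetical.

```python
import os
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def to_file_url_sketch(filename):
    """Convert a (possibly relative) filename to an absolute file:// url."""
    return Path(os.path.abspath(filename)).as_uri()


def from_file_url_sketch(url):
    """Convert a file:// url back to a local file path."""
    return url2pathname(urlparse(url).path)


path = os.path.join('example dir', 'file.warc.gz')
url = to_file_url_sketch(path)
round_trip = from_file_url_sketch(url)
```
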

pywb.utils.memento module

exception pywb.utils.memento.MementoException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.BadRequestException

class pywb.utils.memento.MementoUtils[source]

Bases: object

Creates a memento link string

Parameters:
  • url (str) – A URL
  • type (str) – The rel type
  • dt (str) – The datetime of the URL
  • coll (str|None) – Optional name of a collection
  • memento_format (str|None) – Optional string used to format the supplied URL
Returns:

A memento link string

Return type:

str
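
For orientation, a memento link string in the Link header format of RFC 7089 looks like the output of the sketch below. make_link_sketch is a hypothetical helper; it ignores the coll and memento_format parameters listed above and expects dt to be an already formatted HTTP date string.

```python
def make_link_sketch(url, rel, dt=None):
    """Build '<url>; rel="..."' with an optional datetime attribute."""
    link = '<{0}>; rel="{1}"'.format(url, rel)
    if dt:
        link += '; datetime="{0}"'.format(dt)
    return link


link = make_link_sketch('http://example.com/', 'memento',
                        'Tue, 20 Jun 2000 18:02:59 GMT')
```
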

classmethod make_timemap(cdx_iter, params)[source]

Creates a memento link string for a timemap

Parameters:
  • cdx (dict) – The cdx object
  • datetime (str|None) – The datetime
  • rel (str) – The rel type
  • end (str) – Optional string appended to the end of the created link string
  • memento_format (str|None) – Optional string used to format the URL
Returns:

A memento link string

Return type:

str

classmethod wrap_timemap_header(url, timegate_url, timemap_url, timemap)[source]

pywb.utils.merge module

pywb.utils.wbexception module

exception pywb.utils.wbexception.AccessException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate an access control violation

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (451)
Return type:int
exception pywb.utils.wbexception.AppPageNotFound(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that a page was not found

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.BadRequestException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that the request was bad

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.LiveResourceException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that an error was encountered during the retrieval of a live web resource

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (400)
Return type:int
exception pywb.utils.wbexception.NotFoundException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that a resource was not found

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (404)
Return type:int
exception pywb.utils.wbexception.UpstreamException(status_code, url, details)[source]

Bases: pywb.utils.wbexception.WbException

An Exception used to indicate that an error was encountered from an upstream endpoint

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response
Return type:int
exception pywb.utils.wbexception.WbException(msg=None, url=None)[source]

Bases: Exception

Base class for exceptions raised by Pywb

status()[source]

Returns the HTTP status line for the error response

Returns:The HTTP status line for the error response
Return type:str
status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (500)
Return type:int
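
The pattern used throughout this module, a base exception with an overridable status_code property and a status() method that derives the HTTP status line from it, can be sketched as below. The class names are hypothetical, and taking the reason phrase from http.HTTPStatus is an assumption made for this example.

```python
from http import HTTPStatus


class SketchException(Exception):
    """Hypothetical base class: defaults to a 500 response."""

    @property
    def status_code(self):
        return 500

    def status(self):
        # Status line is the numeric code followed by its reason phrase.
        return '{0} {1}'.format(self.status_code,
                                HTTPStatus(self.status_code).phrase)


class SketchNotFound(SketchException):
    """Hypothetical subclass: only the status code is overridden."""

    @property
    def status_code(self):
        return 404


status_line = SketchNotFound('no such resource').status()
```
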

Module contents