pywb.utils package¶
Submodules¶
pywb.utils.binsearch module¶
Utility functions for performing binary search over a sorted text file
-
pywb.utils.binsearch.
binsearch
(reader, key, compare_func=<function cmp>, block_size=8192)[source]¶ Perform a binary search for a specified key to within a ‘block_size’ (default 8192) granularity, and return first full line found.
-
pywb.utils.binsearch.
binsearch_offset
(reader, key, compare_func=<function cmp>, block_size=8192)[source]¶ Find the offset of the line which matches a given ‘key’ using binary search. If the key is not found, the offset is that of the line after the key.
The file is subdivided into block_size (default 8192) sized blocks. An optional compare_func may be specified.
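The block-granularity search described above can be sketched in plain Python. This is a minimal illustration, not pywb's implementation: the names `block_offset` and `first_line_at_or_after` are hypothetical, lines are compared as raw bytes rather than through a `compare_func`, and the input is assumed to be a seekable binary file of sorted, newline-terminated lines.

```python
import io

def block_offset(reader, key, block_size=8192):
    """Return the offset of a block that starts at or before the first line >= key."""
    reader.seek(0, io.SEEK_END)
    lo, hi = 0, reader.tell() // block_size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        reader.seek(mid * block_size)
        reader.readline()          # discard the line containing the block boundary
        line = reader.readline()   # first full line after the boundary
        if line and line < key:
            lo = mid               # the match is in this block or a later one
        else:
            hi = mid               # the match is before this block (or we hit EOF)
    return lo * block_size

def first_line_at_or_after(reader, key, block_size=8192):
    """Binary search to a block, then scan linearly for the first line >= key."""
    offset = block_offset(reader, key, block_size)
    reader.seek(offset)
    if offset > 0:
        reader.readline()          # skip the partial line at the block start
    for line in iter(reader.readline, b""):
        if line >= key:
            return line
    return None
```

Only one line per probed block is compared, so the number of seeks grows with the logarithm of the file size, and the final linear scan covers the remaining distance to the first matching line.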
-
pywb.utils.binsearch.
iter_exact
(reader, key, token=b' ')[source]¶ Create an iterator which iterates over lines where the first field matches the ‘key’, equivalent to a prefix match on key + token. The default field terminator/separator token is ' '.
-
pywb.utils.binsearch.
iter_prefix
(reader, key)[source]¶ Creates an iterator which iterates over lines that start with prefix ‘key’ in a sorted text file.
-
pywb.utils.binsearch.
iter_range
(reader, start, end, prev_size=0)[source]¶ Creates an iterator which iterates over lines where start <= line < end (end exclusive)
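The half-open range semantics (start <= line < end) can be illustrated without a file at all. The sketch below is an assumption-laden stand-in: `iter_range_sketch` is a hypothetical name, it scans an already sorted iterable linearly instead of binary-searching a reader, and it ignores `prev_size`.

```python
from itertools import dropwhile, takewhile

def iter_range_sketch(lines, start, end):
    """Yield lines from a sorted iterable where start <= line < end."""
    # Skip everything that sorts before the start key...
    at_or_after_start = dropwhile(lambda line: line < start, lines)
    # ...then stop at the first line that reaches the (exclusive) end key.
    return takewhile(lambda line: line < end, at_or_after_start)
```

A prefix match is then a special case: any end key that sorts just after every line sharing the prefix gives the same result as filtering on the prefix.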
-
pywb.utils.binsearch.
linearsearch
(iter_, key, prev_size=0, compare_func=<function cmp>)[source]¶ Perform a linear search over an iterator until current_line >= key,
optionally also tracking up to N (prev_size) previous lines, which are returned before the first matched line.
If the end of the stream is reached before a match is found, nothing is returned (the previous lines are discarded as well).
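One way to picture the "previous lines" behaviour is a bounded deque that always holds the current line plus at most `prev_size` earlier ones. The following is a sketch under that assumption, with a hypothetical name and plain `>=` comparison in place of `compare_func`:

```python
from collections import deque
from itertools import chain

def linear_search_sketch(lines, key, prev_size=0):
    """Return an iterator starting up to prev_size lines before the first line >= key."""
    lines = iter(lines)
    window = deque(maxlen=prev_size + 1)  # current line plus prev_size earlier lines
    for line in lines:
        window.append(line)
        if line >= key:
            # Emit the buffered lines first, then the rest of the stream.
            return chain(window, lines)
    # End of stream without a match: nothing is returned, buffered lines are dropped.
    return iter(())
```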
-
pywb.utils.binsearch.
search
(reader, key, prev_size=0, compare_func=<function cmp>, block_size=8192)[source]¶ Perform a binary search for a specified key to within a ‘block_size’ (default 8192) sized block followed by linear search within the block to find first matching line.
When performing the linear search, keep track of up to N previous lines before the first matching line.
pywb.utils.canonicalize module¶
Standard URL canonicalization, surt and non-surt
-
pywb.utils.canonicalize.
calc_search_range
(url, match_type, surt_ordered=True, url_canon=None)[source]¶ Canonicalize a url (either with custom canonicalizer or standard canonicalizer with or without surt)
Then, compute the start and end keys of a search range for a given match type.
Supported match types:
* exact
* prefix
* host
* domain (only available with surt ordering)
Examples below:
# surt ranges >>> calc_search_range('http://example.com/path/file.html', 'exact') ('com,example)/path/file.html', 'com,example)/path/file.html!')
>>> calc_search_range('http://example.com/path/file.html', 'prefix') ('com,example)/path/file.html', 'com,example)/path/file.htmm')
# slash and ? >>> calc_search_range('http://example.com/path/', 'prefix') ('com,example)/path/', 'com,example)/path0')
>>> calc_search_range('http://example.com/path?', 'prefix') ('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/?', 'prefix') ('com,example)/path?', 'com,example)/path@')
>>> calc_search_range('http://example.com/path/file.html', 'host') ('com,example)/', 'com,example*')
>>> calc_search_range('http://example.com/path/file.html', 'domain') ('com,example)/', 'com,example-')
special case for tld domain range >>> calc_search_range('com', 'domain') ('com,', 'com-')
# non-surt ranges >>> calc_search_range('http://example.com/path/file.html', 'exact', False) ('example.com/path/file.html', 'example.com/path/file.html!')
>>> calc_search_range('http://example.com/path/file.html', 'prefix', False) ('example.com/path/file.html', 'example.com/path/file.htmm')
>>> calc_search_range('http://example.com/path/file.html', 'host', False) ('example.com/', 'example.com0')
# errors: domain range not supported >>> calc_search_range('http://example.com/path/file.html', 'domain', False) # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): UrlCanonicalizeException: matchType=domain unsupported for non-surt
>>> calc_search_range('http://example.com/path/file.html', 'blah', False) # doctest: +IGNORE_EXCEPTION_DETAIL Traceback (most recent call last): UrlCanonicalizeException: Invalid match_type: blah
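The end keys in the exact and prefix examples above follow a simple pattern: the exact range appends '!' to the key, and the prefix range increments the final character ('l' to 'm', '/' to '0', '?' to '@'). The sketch below reproduces only that pattern on an already canonicalized key. The function names are hypothetical, and it does not perform canonicalization or handle the host and domain match types.

```python
def exact_range_sketch(key):
    """End key is the key followed by '!', the character right after a space."""
    return key, key + "!"

def prefix_range_sketch(key):
    """End key is the key with its last character incremented by one code point."""
    return key, key[:-1] + chr(ord(key[-1]) + 1)
```

Because '!' sorts immediately after the space character, the exact range plausibly covers every index line of the form "key <more fields>" and nothing that merely starts with the key.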
-
pywb.utils.canonicalize.
canonicalize
(url, surt_ordered=True)[source]¶ Canonicalize url and convert to surt. If not in surt-ordered mode, convert back to url form, as surt conversion is currently part of canonicalization.
>>> canonicalize('http://example.com/path/file.html', surt_ordered=True) 'com,example)/path/file.html'
>>> canonicalize('http://example.com/path/file.html', surt_ordered=False) 'example.com/path/file.html'
>>> canonicalize('urn:some:id') 'urn:some:id'
-
pywb.utils.canonicalize.
unsurt
(surt)[source]¶ # Simple surt >>> unsurt('com,example)/') 'example.com/'
# Broken surt >>> unsurt('com,example)') 'com,example)'
# Long surt >>> unsurt('suffix,domain,sub,subsub,another,subdomain)/path/file/index.html?a=b?c=)/') 'subdomain.another.subsub.sub.domain.suffix/path/file/index.html?a=b?c=)/'
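The three doctests above are consistent with a straightforward reading: split at the first ")/", reverse the comma-separated host labels, and leave anything without ")/" untouched. A minimal sketch under that reading (the name `unsurt_sketch` is hypothetical, and this is not claimed to be the library's code):

```python
def unsurt_sketch(surt):
    """Convert 'com,example)/path' back to 'example.com/path'."""
    host, sep, rest = surt.partition(")/")
    if not sep:
        # No ')/' marker: treat the input as a broken surt and return it unchanged.
        return surt
    # Reverse the host labels and rejoin them with dots.
    return ".".join(reversed(host.split(","))) + "/" + rest
```

Splitting only at the first ")/" matters for the long example, where ")/" appears again inside the query string.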
pywb.utils.format module¶
-
class
pywb.utils.format.
ParamFormatter
(params, name='', prefix='param.')[source]¶ Bases:
string.Formatter
pywb.utils.geventserver module¶
-
class
pywb.utils.geventserver.
GeventServer
(app, port=0, hostname='localhost', handler_class=None, direct=False)[source]¶ Bases:
object
Class for optionally running a WSGI application in a greenlet
-
join
()[source]¶ Joins the greenlet spawned for running the server if it was started in non-direct mode
-
pywb.utils.io module¶
-
class
pywb.utils.io.
OffsetLimitReader
(stream, offset, length)[source]¶ Bases:
warcio.limitreader.LimitReader
pywb.utils.loaders module¶
-
class
pywb.utils.loaders.
BlockLoader
(**kwargs)[source]¶ Bases:
pywb.utils.loaders.BaseLoader
A loader which can stream blocks of content given a URI, an offset and an optional length. Currently supports: http/https, file/local file system, pkg, WebHDFS, S3
-
loaders
= {'file': <class 'pywb.utils.loaders.LocalFileLoader'>, 'http': <class 'pywb.utils.loaders.HttpLoader'>, 'https': <class 'pywb.utils.loaders.HttpLoader'>, 'pkg': <class 'pywb.utils.loaders.PackageLoader'>, 's3': <class 'pywb.utils.loaders.S3Loader'>, 'webhdfs': <class 'pywb.utils.loaders.WebHDFSLoader'>}¶
-
profile_loader
= None¶
-
-
class
pywb.utils.loaders.
HMACCookieMaker
(key, name, duration=10)[source]¶ Bases:
object
Utility class to produce signed HMAC digest cookies to be used with each http request
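To make the idea of a signed, short-lived cookie concrete, here is a minimal standard-library sketch. Everything in it is an assumption for illustration: the function names, the `name=expiry-digest` layout, and the choice of SHA-256 are not taken from the class above, whose actual cookie format is not documented here.

```python
import hashlib
import hmac
import time

def _sign(key, message):
    """Hex HMAC digest of message under key (SHA-256 is an assumed choice)."""
    return hmac.new(key.encode("utf-8"), message.encode("utf-8"), hashlib.sha256).hexdigest()

def make_signed_cookie(key, name, duration=10, now=None):
    """Build 'name=<expiry>-<digest>' where the digest signs the expiry time."""
    now = time.time() if now is None else now
    expire = str(int(now + duration))
    return "{0}={1}-{2}".format(name, expire, _sign(key, expire))

def verify_signed_cookie(key, cookie, now=None):
    """Accept the cookie only if the digest matches and it has not expired."""
    now = time.time() if now is None else now
    _, _, value = cookie.partition("=")
    expire, _, digest = value.partition("-")
    expected = _sign(key, expire)
    # compare_digest avoids leaking how many leading characters matched.
    signature_ok = hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8"))
    return signature_ok and expire.isdigit() and int(expire) >= now
```

The receiving server needs the same secret key, so the scheme authenticates requests between cooperating services without a shared session store.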
-
class
pywb.utils.loaders.
WebHDFSLoader
(**kwargs)[source]¶ Bases:
pywb.utils.loaders.HttpLoader
Loader class specifically for loading webhdfs content
-
HTTP_URL
= 'http://{host}/webhdfs/v1{path}?'¶
-
load
(url, offset, length)[source]¶ Loads the supplied web hdfs content
Parameters: - url (str) – The URL to the web hdfs content to be loaded
- offset (int|float|double) – The offset of the content to be loaded
- length (int|float|double) – The length of the content to be loaded
Returns: The raw response content
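The HTTP_URL template above ends in '?', which suggests that query parameters are appended to it. The sketch below shows one way such a request URL could be assembled, assuming the WebHDFS REST API's OPEN operation with its `offset` and `length` parameters. The function name and the example host are hypothetical, and the real loader may add further parameters that are not shown here.

```python
from urllib.parse import urlencode, urlsplit

# Template copied from the HTTP_URL class attribute documented above.
HTTP_URL = "http://{host}/webhdfs/v1{path}?"

def build_webhdfs_url(url, offset=0, length=-1):
    """Translate a webhdfs:// URL plus a byte range into an HTTP request URL."""
    parts = urlsplit(url)
    params = {"op": "OPEN", "offset": str(offset)}
    if length > 0:
        # Omit the length to read to the end of the file.
        params["length"] = str(length)
    return HTTP_URL.format(host=parts.netloc, path=parts.path) + urlencode(params)
```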
-
-
pywb.utils.loaders.
init_yaml_env_vars
()[source]¶ Initializes the yaml parser to be able to set the value of fields from environment variables
Return type: None
-
pywb.utils.loaders.
load_overlay_config
(main_env_var, main_default_file='', overlay_env_var='', overlay_file='')[source]¶
pywb.utils.memento module¶
-
class
pywb.utils.memento.
MementoUtils
[source]¶ Bases:
object
-
classmethod
make_memento_link
(url, type, dt, coll=None, memento_format=None)[source]¶ Creates a memento link string
Returns: A memento link string
pywb.utils.merge module¶
pywb.utils.wbexception module¶
-
exception
pywb.utils.wbexception.
AccessException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate an access control violation
-
exception
pywb.utils.wbexception.
AppPageNotFound
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a page was not found
-
exception
pywb.utils.wbexception.
BadRequestException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a request was bad
-
exception
pywb.utils.wbexception.
LiveResourceException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that an error was encountered during the retrieval of a live web resource
-
exception
pywb.utils.wbexception.
NotFoundException
(msg=None, url=None)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that a resource was not found
-
exception
pywb.utils.wbexception.
UpstreamException
(status_code, url, details)[source]¶ Bases:
pywb.utils.wbexception.WbException
An Exception used to indicate that an error was encountered from an upstream endpoint