pywb.warcserver.index package

Submodules

pywb.warcserver.index.aggregator module

class pywb.warcserver.index.aggregator.BaseAggregator[source]

Bases: object

get_source_list(params)[source]
load_child_source(name, source, params)[source]
load_index(params)[source]
class pywb.warcserver.index.aggregator.BaseDirectoryIndexSource(base_prefix, base_dir='', name='', config=None)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator

INDEX_SOURCES = [(('.cdx', '.cdxj'), <class 'pywb.warcserver.index.indexsource.FileIndexSource'>), (('.idx', '.summary'), <class 'pywb.warcserver.index.zipnum.ZipNumIndexSource'>)]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
class pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator, pywb.warcserver.index.indexsource.RedisIndexSource

class pywb.warcserver.index.aggregator.BaseSourceListAggregator(sources, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.BaseAggregator

get_all_sources(params)[source]
yield_invert_sources(sel_sources, params)[source]
yield_sources(sel_sources, params)[source]
class pywb.warcserver.index.aggregator.CacheDirectoryIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.CacheDirectoryMixin, pywb.warcserver.index.aggregator.DirectoryIndexSource

class pywb.warcserver.index.aggregator.CacheDirectoryMixin(*args, **kwargs)[source]

Bases: object

class pywb.warcserver.index.aggregator.DirectoryIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseDirectoryIndexSource

class pywb.warcserver.index.aggregator.GeventMixin(*args, **kwargs)[source]

Bases: object

DEFAULT_TIMEOUT = 5.0
class pywb.warcserver.index.aggregator.GeventTimeoutAggregator(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.TimeoutMixin, pywb.warcserver.index.aggregator.GeventMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator

class pywb.warcserver.index.aggregator.RedisMultiKeyIndexSource(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource

class pywb.warcserver.index.aggregator.SeqAggMixin(*args, **kwargs)[source]

Bases: object

class pywb.warcserver.index.aggregator.SimpleAggregator(*args, **kwargs)[source]

Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator

class pywb.warcserver.index.aggregator.TimeoutMixin(*args, **kwargs)[source]

Bases: object

is_timed_out(name)[source]

pywb.warcserver.index.cdxobject module

exception pywb.warcserver.index.cdxobject.CDXException(msg=None, url=None)[source]

Bases: pywb.utils.wbexception.WbException

status_code

Returns the status code to be used for the error response

Returns:The status code for the error response (500)
Return type:int
class pywb.warcserver.index.cdxobject.CDXObject(cdxline=b'')[source]

Bases: collections.OrderedDict

Dictionary object representing a parsed CDX line.

CDX_ALT_FIELDS = {'d': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length', 'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url', 's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url'}
CDX_FORMATS = [['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename']]
static conv_to_json(obj, fields=None)[source]

Return cdx as a JSON dictionary string. If fields is None, the output includes all fields in the order stored; otherwise only the specified fields are included.

Parameters:fields – list of field names to output
is_revisit()[source]

Return True if this record is a revisit record.

classmethod json_decode(string)[source]
to_cdxj(fields=None)[source]
to_json(fields=None)[source]
to_text(fields=None)[source]

Return a plaintext CDX record (includes newline). If fields is None, the output has all fields in the order they are stored.

Parameters:fields – list of field names to output.
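As a rough illustration of the line layout that to_cdxj() targets, the sketch below parses and re-serializes a CDXJ line of the form "urlkey timestamp {json}". It does not use pywb: parse_cdxj_line and to_cdxj_line are hypothetical helpers, the sample record is made up, and the real CDXObject may differ in detail.

```python
import json
from collections import OrderedDict


def parse_cdxj_line(line):
    """Split 'urlkey timestamp {json}' into an ordered dict (hypothetical helper)."""
    urlkey, timestamp, json_part = line.rstrip("\n").split(" ", 2)
    record = OrderedDict(urlkey=urlkey, timestamp=timestamp)
    record.update(json.loads(json_part, object_pairs_hook=OrderedDict))
    return record


def to_cdxj_line(record, fields=None):
    """Serialize back to CDXJ; if fields is given, keep only those JSON fields."""
    body = OrderedDict(
        (k, v) for k, v in record.items()
        if k not in ("urlkey", "timestamp") and (fields is None or k in fields)
    )
    return "{} {} {}\n".format(record["urlkey"], record["timestamp"], json.dumps(body))


# Made-up sample record, for illustration only.
line = ('com,example)/ 20200101000000 '
        '{"url": "http://example.com/", "mime": "text/html", "status": "200"}\n')
cdx = parse_cdxj_line(line)
```

With fields=None the sketch round-trips the line unchanged; passing a list such as ["url"] restricts the JSON body to those fields.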
class pywb.warcserver.index.cdxobject.IDXObject(idxline)[source]

Bases: collections.OrderedDict

FORMAT = ['urlkey', 'part', 'offset', 'length', 'lineno']
NUM_REQ_FIELDS = 4
to_json(fields=None)[source]
to_text(fields=None)[source]

Return a plaintext IDX record (including newline).

Parameters:fields – list of field names to output (currently ignored)
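The FORMAT and NUM_REQ_FIELDS attributes above suggest a five-column line in which the last column is optional. A minimal sketch of that idea follows. It assumes tab-delimited input, which is not stated in this reference; parse_idx_line and the sample lines are hypothetical.

```python
from collections import OrderedDict

FORMAT = ["urlkey", "part", "offset", "length", "lineno"]
NUM_REQ_FIELDS = 4  # in this sketch, the trailing lineno column is optional


def parse_idx_line(idxline):
    """Parse a tab-delimited index line into an ordered dict (hypothetical helper)."""
    fields = idxline.rstrip("\n").split("\t")
    if len(fields) < NUM_REQ_FIELDS:
        raise ValueError(
            "expected at least %d fields, got %d" % (NUM_REQ_FIELDS, len(fields))
        )
    # zip() stops at the shorter sequence, so a missing lineno is simply absent.
    return OrderedDict(zip(FORMAT, fields))


idx = parse_idx_line("com,example)/ 20200101000000\tpart-0.cdx.gz\t0\t1024\t1\n")
short = parse_idx_line("com,example)/ 20200101000000\tpart-0.cdx.gz\t0\t1024\n")
```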

pywb.warcserver.index.cdxops module

class pywb.warcserver.index.cdxops.CDXFilter(string)[source]

Bases: object

contains(val)[source]
exact(val)[source]
rx_match(val)[source]
pywb.warcserver.index.cdxops.cdx_clamp(cdx_iter, from_ts, to_ts)[source]

Clamp the CDX records to the range between the start (from_ts) and end (to_ts) timestamps.
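Clamping can be pictured as a lazy range filter on the timestamp field. The sketch below is not pywb's implementation: clamp is a hypothetical helper, records are plain dicts, and partial timestamps are padded with "0" and "9" purely so that string comparison works.

```python
def clamp(records, from_ts=None, to_ts=None):
    """Yield only records whose 14-digit timestamp lies in [from_ts, to_ts]."""
    # Simplified padding (an assumption of this sketch): extend partial
    # timestamps so lexicographic comparison covers the whole period.
    lo = from_ts.ljust(14, "0") if from_ts else None
    hi = to_ts.ljust(14, "9") if to_ts else None
    for rec in records:
        ts = rec["timestamp"]
        if lo and ts < lo:
            continue
        if hi and ts > hi:
            continue
        yield rec


records = [
    {"timestamp": "20190615120000"},
    {"timestamp": "20200101000000"},
    {"timestamp": "20210301083000"},
]
clamped = list(clamp(records, from_ts="2020", to_ts="2020"))
```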

pywb.warcserver.index.cdxops.cdx_collapse_time_status(cdx_iter, timelen=10)[source]

Collapse records by timestamp (compared on its first timelen characters) and status code.
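One way to read the timelen parameter is as the number of leading timestamp digits that take part in the comparison. The sketch below follows that reading and drops consecutive records that share the same (timestamp prefix, status) pair. collapse_time_status is a hypothetical stand-in operating on plain dicts, not the library function.

```python
def collapse_time_status(records, timelen=10):
    """Yield a record only when its (timestamp prefix, status) pair changes."""
    timelen = int(timelen)
    last_token = None
    for rec in records:
        token = (rec["timestamp"][:timelen], rec.get("status", ""))
        if token != last_token:
            last_token = token
            yield rec


records = [
    {"timestamp": "20200101000000", "status": "200"},
    {"timestamp": "20200101003000", "status": "200"},  # same hour, same status
    {"timestamp": "20200101010000", "status": "200"},  # next hour
    {"timestamp": "20200101011500", "status": "404"},  # same hour, new status
]
collapsed = list(collapse_time_status(records))
```

With timelen=10 the prefix is YYYYMMDDhh, so the second record is dropped as a duplicate of the first.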

pywb.warcserver.index.cdxops.cdx_filter(cdx_iter, filter_strings)[source]

Filter CDX by regex. If each filter is in field:regex form, apply the filter to cdx[field].
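The field:regex form can be sketched as follows. This covers only the plain regex case and assumes that a record must satisfy every filter and that the regex is anchored at the start of the value. Any modifier syntax handled by CDXFilter (contains, exact) is out of scope. filter_records is a hypothetical helper.

```python
import re


def filter_records(records, filter_strings):
    """Keep records for which every 'field:regex' filter matches rec[field]."""
    compiled = []
    for filter_str in filter_strings:
        field, _, pattern = filter_str.partition(":")
        compiled.append((field, re.compile(pattern)))
    for rec in records:
        if all(rx.match(rec.get(field, "")) for field, rx in compiled):
            yield rec


records = [
    {"status": "200", "mime": "text/html"},
    {"status": "404", "mime": "text/html"},
    {"status": "204", "mime": "image/png"},
]
html_ok = list(filter_records(records, ["status:2\\d\\d", "mime:text/html"]))
any_ok = list(filter_records(records, ["status:2\\d\\d"]))
```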

pywb.warcserver.index.cdxops.cdx_limit(cdx_iter, limit)[source]

Limit the CDX iterator to at most limit records.
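Limiting an iterator without draining it is commonly done with itertools.islice. The sketch below shows that approach; whether cdx_limit is implemented this way is not stated here. limit_records and counting_source are hypothetical names.

```python
from itertools import islice


def limit_records(records, limit):
    """Return an iterator over at most `limit` records, pulled lazily."""
    return islice(records, int(limit))


pulled = []


def counting_source():
    """Hypothetical source that logs each record it hands out."""
    for i in range(100):
        pulled.append(i)
        yield {"n": i}


first_two = list(limit_records(counting_source(), 2))
```

Only two items are pulled from the source, which matters when the underlying index is large or remote.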

pywb.warcserver.index.cdxops.cdx_load(sources, query, process=True)[source]

Merge text CDX lines from sources and return an iterator over the filtered and access-checked sequence of CDX objects.

Parameters:
  • sources – iterable for text CDX sources.
  • process – bool; whether to perform the processing (sorting/filtering/grouping) ops
pywb.warcserver.index.cdxops.cdx_resolve_revisits(cdx_iter)[source]

Resolve revisits.

This filter adds three fields to each CDX record: orig.length, orig.offset, and orig.filename. For revisit records, these fields hold the corresponding values from the previous non-revisit (original) CDX record. They are all "-" for non-revisit records.
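The effect of revisit resolution can be sketched on plain dicts. Two assumptions are made that this reference does not confirm: a revisit is recognized by the mime value "warc/revisit", and the original is the earliest non-revisit record seen with the same digest. resolve_revisits is a hypothetical helper, and the sample records are made up.

```python
def resolve_revisits(records):
    """Add orig.length, orig.offset and orig.filename to each record."""
    originals = {}  # digest -> first non-revisit record seen (assumption)
    for rec in records:
        rec = dict(rec)  # copy so the caller's records are left untouched
        digest = rec.get("digest")
        is_revisit = rec.get("mime") == "warc/revisit"  # assumption
        orig = originals.get(digest) if is_revisit else None
        if not is_revisit:
            originals.setdefault(digest, rec)
        for field in ("length", "offset", "filename"):
            rec["orig." + field] = orig[field] if orig else "-"
        yield rec


records = [
    {"mime": "text/html", "digest": "ABC", "length": "1024",
     "offset": "0", "filename": "a.warc.gz"},
    {"mime": "warc/revisit", "digest": "ABC", "length": "300",
     "offset": "2048", "filename": "b.warc.gz"},
]
resolved = list(resolve_revisits(records))
```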

pywb.warcserver.index.cdxops.cdx_reverse(cdx_iter, limit)[source]

Return CDX records in reverse order.

pywb.warcserver.index.cdxops.cdx_sort_closest(closest, cdx_iter, limit=10)[source]

Sort CDXCaptureResult objects by proximity to the closest timestamp.
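A closest-first ordering can be sketched with heapq.nsmallest keyed on the absolute time difference. This is an illustration under assumptions, not the library code: sort_closest and parse_ts are hypothetical, and timestamps are assumed to be full 14-digit strings.

```python
import heapq
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_closest(closest, records, limit=10):
    """Return up to `limit` records, nearest to `closest` first."""
    target = parse_ts(closest)

    def distance(rec):
        return abs((parse_ts(rec["timestamp"]) - target).total_seconds())

    return heapq.nsmallest(limit, records, key=distance)


records = [
    {"timestamp": "20200101000000"},
    {"timestamp": "20200601000000"},
    {"timestamp": "20201231000000"},
]
nearest = sort_closest("20200515000000", records, limit=2)
```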

pywb.warcserver.index.cdxops.cdx_to_json(cdx_iter, fields)[source]
pywb.warcserver.index.cdxops.cdx_to_text(cdx_iter, fields)[source]
pywb.warcserver.index.cdxops.create_merged_cdx_gen(sources, query)[source]

Create a generator which loads and merges CDX streams, ensuring that the CDX sources are lazily loaded.
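Lazy merging of sorted streams can be demonstrated with heapq.merge, which pulls from its inputs only when the merged iterator is consumed. The sketch below illustrates that property. open_source and merged_lines are hypothetical, and the sample lines are made up.

```python
import heapq


def open_source(name, lines, opened):
    """Hypothetical source: records when it is first read, then yields lines."""
    opened.append(name)
    for line in lines:
        yield line


def merged_lines(sources):
    """Merge already-sorted line iterators into one sorted, lazy stream."""
    return heapq.merge(*sources)


opened = []
merged = merged_lines([
    open_source("a", ["com,example)/ 2019", "com,example)/ 2021"], opened),
    open_source("b", ["com,example)/ 2020"], opened),
])
opened_before = list(opened)  # nothing has been read yet
result = list(merged)
```

Each input must already be sorted; the merge itself never buffers more than one pending line per source.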

pywb.warcserver.index.cdxops.make_obj_iter(text_iter, query)[source]

Convert a text CDX stream to CDXObject/IDXObject instances.

pywb.warcserver.index.cdxops.process_cdx(cdx_iter, query)[source]

pywb.warcserver.index.fuzzymatcher module

class pywb.warcserver.index.fuzzymatcher.FuzzyMatcher(filename=None)[source]

Bases: object

DEFAULT_FILTER = ['urlkey:{0}']
DEFAULT_MATCH_TYPE = 'prefix'
DEFAULT_REPLACE_AFTER = '?'
FUZZY_SKIP_PARAMS = ('alt_url', 'reverse', 'closest', 'end_key', 'url', 'matchType', 'filter')
get_ext(url)[source]
get_fuzzy_iter(cdx_iter, index_source, params)[source]
get_fuzzy_match(urlkey, url, params)[source]
make_query_match_regex(params_list)[source]
make_regex(config)[source]
match_general_fuzzy_query(url, urlkey, cdx, rx_cache)[source]
parse_fuzzy_rule(rule)[source]

Parse rules using all the different supported forms

class pywb.warcserver.index.fuzzymatcher.FuzzyRule(url_prefix, regex, replace_after, filter_str, match_type, find_all)

Bases: tuple

filter_str

Alias for field number 3

find_all

Alias for field number 5

match_type

Alias for field number 4

regex

Alias for field number 1

replace_after

Alias for field number 2

url_prefix

Alias for field number 0
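The "Alias for field number N" entries are what collections.namedtuple generates for each field. The sketch below rebuilds an equivalent tuple type to show that attribute and index access are interchangeable. The field values are placeholders borrowed from the FuzzyMatcher defaults listed above, and the regex is made up.

```python
import re
from collections import namedtuple

# Same field order as documented: index 0 is url_prefix, index 5 is find_all.
FuzzyRule = namedtuple(
    "FuzzyRule",
    "url_prefix, regex, replace_after, filter_str, match_type, find_all",
)

rule = FuzzyRule(
    url_prefix="com,example)/",            # placeholder value
    regex=re.compile(r"[?&](id=[^&]+)"),   # made-up pattern
    replace_after="?",
    filter_str=["urlkey:{0}"],
    match_type="prefix",
    find_all=False,
)
```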

pywb.warcserver.index.indexsource module

class pywb.warcserver.index.indexsource.BaseIndexSource[source]

Bases: object

WAYBACK_ORIG_SUFFIX = '{timestamp}id_/{url}'
load_index(params)[source]
logger = <Logger warcserver (WARNING)>
class pywb.warcserver.index.indexsource.FileIndexSource(filename, config=None)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

CDX_EXT = ('.cdx', '.cdxj')
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.LiveIndexSource[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

get_load_url(params)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.MementoIndexSource(timegate_url, timemap_url, replay_url)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

classmethod from_timegate_url(timegate_url, path='link')[source]
handle_timegate(params, timestamp)[source]
handle_timemap(params)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.RedisIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
load_key_index(key_template, params)[source]
static parse_redis_url(redis_url, redis_=None)[source]
scan_keys(match_templ, params, member_key=None)[source]
class pywb.warcserver.index.indexsource.RemoteIndexSource(api_url, replay_url, url_field='load_url', closest_limit=100)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

CDX_MATCH_RX = re.compile('^cdxj?\\+(?P<url>https?\\:.*)')
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_index(params)[source]
class pywb.warcserver.index.indexsource.WBMementoIndexSource(timegate_url, timemap_url, replay_url)[source]

Bases: pywb.warcserver.index.indexsource.MementoIndexSource

WAYBACK_ORIG_SUFFIX = '{timestamp}im_/{url}'
WBURL_MATCH = re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')
handle_timegate(params, timestamp)[source]
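The WBURL_MATCH pattern listed above can be exercised on its own to see what its two groups capture. The sketch copies the regex verbatim from this page. The sample inputs are made up, and how handle_timegate uses the result is not shown here.

```python
import re

# Pattern copied from the WBURL_MATCH attribute above.
WBURL_MATCH = re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')

# Timestamp plus a modifier such as "id_", followed by the target URL.
m = WBURL_MATCH.match("20200101000000id_/http://example.com/page")
timestamp, url = m.group(1), m.group(2)

# A bare URL: no timestamp digits are captured.
m2 = WBURL_MATCH.match("http://example.com/")
bare_timestamp, bare_url = m2.group(1), m2.group(2)
```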
class pywb.warcserver.index.indexsource.XmlQueryIndexSource(query_api_url)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

An index source class for XML files

EXACT_QUERY = 'type:urlquery url:'
PREFIX_QUERY = 'type:prefixquery url:'
convert_to_cdx(item)[source]

Converts the etree element to a CDX object

Parameters:item – The etree element to be converted
Returns:The CDXObject representing the supplied etree element object
Return type:CDXObject
gettext(item, name)[source]

Returns the value of the supplied name

Parameters:
  • item – The etree element to be converted
  • name – The name of the field to get its value for
Returns:The value of the field
Return type:str

classmethod init_from_config(config)[source]

Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied dictionary contains the type key equal to xmlquery

Parameters:config (dict[str, str]) –
Returns:The initialized XmlQueryIndexSource or None
Return type:XmlQueryIndexSource|None
classmethod init_from_string(value)[source]

Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied value starts with xmlquery+

Parameters:value (str) – The string by which to initialize the XmlQueryIndexSource
Returns:The initialized XmlQueryIndexSource or None
Return type:XmlQueryIndexSource|None
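The "instance or None" contract described for init_from_config and init_from_string can be sketched with a stand-in class. SketchQuerySource is hypothetical, and the "api_url" config key is an assumption; only the "xmlquery" type value and the "xmlquery+" prefix come from the descriptions above.

```python
class SketchQuerySource:
    """Hypothetical stand-in illustrating the 'instance or None' contract."""

    PREFIX = "xmlquery+"

    def __init__(self, query_api_url):
        self.query_api_url = query_api_url

    @classmethod
    def init_from_string(cls, value):
        # Claim the value only if it carries the expected prefix.
        if value.startswith(cls.PREFIX):
            return cls(value[len(cls.PREFIX):])
        return None

    @classmethod
    def init_from_config(cls, config):
        # Claim the config only if its type key equals "xmlquery".
        if config.get("type") != "xmlquery":
            return None
        return cls(config["api_url"])  # key name is an assumption


from_str = SketchQuerySource.init_from_string("xmlquery+http://localhost:8080/query")
not_ours = SketchQuerySource.init_from_string("cdx+http://localhost:8080/cdx")
from_cfg = SketchQuerySource.init_from_config(
    {"type": "xmlquery", "api_url": "http://localhost:8080/query"}
)
```

Returning None rather than raising lets a caller try several source classes in turn until one accepts the value.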
load_index(params)[source]

Loads the xml query index based on the supplied params

Parameters:params (dict[str, str]) – The query params
Returns:A list or generator of cdx objects
Raises:
  • NotFoundException – If the query url is not found or the query returns no cdx entries
  • BadRequestException – If the match type is not exact or prefix

prefix_query_iter(items)[source]

Returns an iterator yielding the results of performing a prefix query

Parameters:items – The xml entry elements representing a query
Returns:An iterator yielding the results of the query

pywb.warcserver.index.query module

class pywb.warcserver.index.query.CDXQuery(params)[source]

Bases: object

allow_fuzzy
closest
collapse_time
custom_ops
end_key
fields
filters
from_ts
is_exact
key
limit
match_type
output
page
page_count
page_size
resolve_revisits
reverse
secondary_index_only
set_key(key, end_key)[source]
to_ts
url
urlencode()[source]

pywb.warcserver.index.zipnum module

class pywb.warcserver.index.zipnum.AlwaysJsonResponse[source]

Bases: dict

to_cdxj(*args)[source]
to_json(*args)[source]
to_text(*args)[source]
class pywb.warcserver.index.zipnum.LocMapResolver(loc_summary, loc_filename)[source]

Bases: object

Lookup shards based on a file mapping shard name to one or more paths. The entries are tab delimited.

load_loc()[source]
class pywb.warcserver.index.zipnum.LocPrefixResolver(loc_summary, loc_config)[source]

Bases: object

Use a prefix lookup, where the prefix can either be a fixed string or can be a regex replacement of the index summary path

load_loc()[source]
class pywb.warcserver.index.zipnum.ZipBlocks(part, offset, length, count)[source]

Bases: object

class pywb.warcserver.index.zipnum.ZipNumIndexSource(summary, config=None)[source]

Bases: pywb.warcserver.index.indexsource.BaseIndexSource

DEFAULT_MAX_BLOCKS = 10
DEFAULT_RELOAD_INTERVAL = 10
IDX_EXT = ('.idx', '.summary')
block_to_cdx_iter(blocks, ranges, query)[source]
compute_page_range(reader, query)[source]
idx_to_cdx(idx_iter, query)[source]
classmethod init_from_config(config)[source]
classmethod init_from_string(value)[source]
load_blocks(location, blocks, ranges, query)[source]

Load one or more blocks of compressed CDX lines. Return a line iterator which decompresses and returns one line at a time, bounded by query.key and query.end_key.
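The block-loading idea can be sketched with independently gzip-compressed chunks addressed by (offset, length). This is a simplified illustration, not the library code: make_blocks and iter_block_lines are hypothetical, each block is decompressed whole rather than streamed, and the bounds are assumed to be key inclusive and end_key exclusive.

```python
import gzip
import io


def make_blocks(chunks):
    """Compress each chunk separately; return the data and (offset, length) pairs."""
    data, ranges = b"", []
    for chunk in chunks:
        block = gzip.compress(chunk)
        ranges.append((len(data), len(block)))
        data += block
    return data, ranges


def iter_block_lines(fileobj, ranges, key, end_key):
    """Yield decompressed lines with key <= line < end_key (assumed bounds)."""
    for offset, length in ranges:
        fileobj.seek(offset)
        text = gzip.decompress(fileobj.read(length))
        for line in text.splitlines(keepends=True):
            if line < key:
                continue
            if line >= end_key:
                return  # input is sorted, so nothing later can match
            yield line


data, ranges = make_blocks([
    b"com,example)/a 2020\ncom,example)/b 2020\n",
    b"com,example)/b 2021\ncom,example)/c 2020\n",
])
lines = list(iter_block_lines(io.BytesIO(data), ranges,
                              b"com,example)/b", b"com,example)/c"))
```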

load_index(params)[source]
search_by_line_num(reader, line)[source]

Module contents