pywb.warcserver.index package¶
Submodules¶
pywb.warcserver.index.aggregator module¶
-
class
pywb.warcserver.index.aggregator.
BaseDirectoryIndexSource
(base_prefix, base_dir='', name='', config=None)[source]¶ Bases:
pywb.warcserver.index.aggregator.BaseAggregator
-
INDEX_SOURCES
= [(('.cdx', '.cdxj'), <class 'pywb.warcserver.index.indexsource.FileIndexSource'>), (('.idx', '.summary'), <class 'pywb.warcserver.index.zipnum.ZipNumIndexSource'>)]¶
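The INDEX_SOURCES table above pairs file extensions with index source classes. A minimal sketch of how such a table could be used to choose a source by file name follows. The string labels stand in for the real classes, and `pick_source` is a hypothetical helper, not part of pywb.

```python
# Hypothetical stand-ins: strings replace the real source classes so the
# sketch runs without pywb. Only the extensions come from INDEX_SOURCES.
INDEX_SOURCES = [
    (('.cdx', '.cdxj'), 'FileIndexSource'),
    (('.idx', '.summary'), 'ZipNumIndexSource'),
]


def pick_source(filename):
    """Return the label of the first entry whose extensions match, else None."""
    for extensions, source in INDEX_SOURCES:
        # str.endswith accepts a tuple of suffixes.
        if filename.endswith(extensions):
            return source
    return None


print(pick_source('example.cdxj'))
print(pick_source('cluster.idx'))
print(pick_source('notes.txt'))
```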
-
-
class
pywb.warcserver.index.aggregator.
BaseRedisMultiKeyIndexSource
(redis_url=None, redis=None, key_template=None, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.BaseAggregator
,pywb.warcserver.index.indexsource.RedisIndexSource
-
class
pywb.warcserver.index.aggregator.
CacheDirectoryIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.CacheDirectoryMixin
,pywb.warcserver.index.aggregator.DirectoryIndexSource
-
class
pywb.warcserver.index.aggregator.
DirectoryIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseDirectoryIndexSource
-
class
pywb.warcserver.index.aggregator.
GeventMixin
(*args, **kwargs)[source]¶ Bases:
object
-
DEFAULT_TIMEOUT
= 5.0¶
-
-
class
pywb.warcserver.index.aggregator.
GeventTimeoutAggregator
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.TimeoutMixin
,pywb.warcserver.index.aggregator.GeventMixin
,pywb.warcserver.index.aggregator.BaseSourceListAggregator
-
class
pywb.warcserver.index.aggregator.
RedisMultiKeyIndexSource
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource
-
class
pywb.warcserver.index.aggregator.
SimpleAggregator
(*args, **kwargs)[source]¶ Bases:
pywb.warcserver.index.aggregator.SeqAggMixin
,pywb.warcserver.index.aggregator.BaseSourceListAggregator
pywb.warcserver.index.cdxobject module¶
-
class
pywb.warcserver.index.cdxobject.
CDXObject
(cdxline=b'')[source]¶ Bases:
collections.OrderedDict
dictionary object representing parsed CDX line.
-
CDX_ALT_FIELDS
= {'d': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length', 'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url', 's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url'}¶
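CDX_ALT_FIELDS maps alternate field names to the canonical ones. The sketch below shows one way such a mapping can normalize the keys of a record. `normalize_field_names` is a hypothetical helper, and the assumption that the mapping is applied key by key is ours.

```python
# Mapping copied from CDX_ALT_FIELDS above.
CDX_ALT_FIELDS = {
    'd': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length',
    'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url',
    's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url',
}


def normalize_field_names(fields):
    """Rename alternate keys to canonical ones; unknown keys are kept as-is."""
    return {CDX_ALT_FIELDS.get(name, name): value
            for name, value in fields.items()}


normalized = normalize_field_names({
    'original': 'http://example.com/',
    'statuscode': '200',
    'mimetype': 'text/html',
    'digest': 'ABC',
})
print(normalized)
```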
-
CDX_FORMATS
= [['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'], ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename']]¶
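Every layout in CDX_FORMATS has a distinct number of fields, so a space-delimited line can be matched to a layout by counting its fields. The following is a minimal sketch of that idea, using two of the listed layouts. `parse_cdx_fields` and the sample line are illustrative assumptions, not the CDXObject implementation.

```python
# Two of the layouts listed in CDX_FORMATS (7 and 11 fields).
CDX_FORMATS = [
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest',
     'redirect', 'robotflags', 'length', 'offset', 'filename'],
]


def parse_cdx_fields(line):
    """Split a space-delimited CDX line and label fields by matching count."""
    fields = line.strip().split(' ')
    for names in CDX_FORMATS:
        if len(names) == len(fields):
            return dict(zip(names, fields))
    raise ValueError('unsupported number of CDX fields: %d' % len(fields))


# Made-up sample line with 7 fields.
record = parse_cdx_fields(
    'com,example)/ 20140101000000 http://example.com/ text/html 200 ABCDEFG 1043'
)
print(record['timestamp'], record['status'])
```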
-
-
class
pywb.warcserver.index.cdxobject.
IDXObject
(idxline)[source]¶ Bases:
collections.OrderedDict
-
FORMAT
= ['urlkey', 'part', 'offset', 'length', 'lineno']¶
-
NUM_REQ_FIELDS
= 4¶
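FORMAT names five fields while NUM_REQ_FIELDS is 4, which suggests the last field (lineno) is optional. A minimal sketch of parsing an index line under that reading follows. The tab delimiter and the `parse_idx_line` helper are assumptions made for illustration.

```python
FORMAT = ['urlkey', 'part', 'offset', 'length', 'lineno']
NUM_REQ_FIELDS = 4


def parse_idx_line(line):
    """Label the tab-delimited fields of an index line (assumed layout)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) < NUM_REQ_FIELDS:
        raise ValueError('expected at least %d fields, found %d'
                         % (NUM_REQ_FIELDS, len(fields)))
    # zip stops at the shorter sequence, so a missing lineno is simply absent.
    return dict(zip(FORMAT, fields))


# Made-up sample line with only the four required fields.
idx = parse_idx_line('com,example)/ 20140101000000\tpart-0\t0\t1000\n')
print(idx)
```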
-
pywb.warcserver.index.cdxops module¶
-
pywb.warcserver.index.cdxops.
cdx_collapse_time_status
(cdx_iter, timelen=10)[source]¶ collapse by timestamp and status code.
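One plausible reading of "collapse by timestamp and status code" is to drop a record when both its timestamp prefix (the first timelen digits) and its status equal those of the record before it. The sketch below implements that reading on plain dicts. It is an illustration under that assumption, not the pywb source.

```python
def collapse_time_status(cdx_iter, timelen=10):
    """Yield a record only when (timestamp prefix, status) changes."""
    last_token = None
    for cdx in cdx_iter:
        token = (cdx['timestamp'][:timelen], cdx.get('status', ''))
        if token != last_token:
            last_token = token
            yield cdx


# Made-up records; the first two share the same hour and status.
records = [
    {'timestamp': '20140101000000', 'status': '200'},
    {'timestamp': '20140101000005', 'status': '200'},
    {'timestamp': '20140101000009', 'status': '404'},
    {'timestamp': '20140101010000', 'status': '200'},
]
collapsed = list(collapse_time_status(records))
print([r['timestamp'] for r in collapsed])
```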
-
pywb.warcserver.index.cdxops.
cdx_filter
(cdx_iter, filter_strings)[source]¶ Filter CDX records by regex. If a filter has the form field:regex, the regex is applied to cdx[field].
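The field:regex form can be sketched on plain dicts as follows. This simplified version assumes every filter names a field, anchors the regex at the start of the value with `re.match`, and keeps a record only when all filters match. Any other filter syntax that pywb may support is left out.

```python
import re


def filter_cdx(cdx_iter, filter_strings):
    """Keep records whose named fields match every field:regex filter."""
    compiled = []
    for filter_str in filter_strings:
        # Split on the first colon only, so the regex may contain colons.
        field, _, pattern = filter_str.partition(':')
        compiled.append((field, re.compile(pattern)))

    for cdx in cdx_iter:
        if all(rx.match(cdx.get(field, '')) for field, rx in compiled):
            yield cdx


# Made-up records for illustration.
records = [
    {'url': 'http://example.com/', 'status': '200', 'mime': 'text/html'},
    {'url': 'http://example.com/a.png', 'status': '200', 'mime': 'image/png'},
    {'url': 'http://example.com/gone', 'status': '404', 'mime': 'text/html'},
]
matches = list(filter_cdx(records, ['status:2..', 'mime:text/.*']))
print([r['url'] for r in matches])
```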
-
pywb.warcserver.index.cdxops.
cdx_load
(sources, query, process=True)[source]¶ Merge text CDX lines from sources and return an iterator over the filtered, access-checked sequence of CDX objects.
Parameters: - sources – iterable of text CDX sources.
- process – bool; whether to perform the sorting, filtering, and grouping operations.
-
pywb.warcserver.index.cdxops.
cdx_resolve_revisits
(cdx_iter)[source]¶ resolve revisits.
This filter adds three fields to each CDX record: orig.length, orig.offset, and orig.filename. For revisit records, these fields hold the corresponding values from the previous non-revisit (original) CDX record. They are all "-" for non-revisit records.
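The behaviour described above can be sketched on plain dicts. The sketch makes two assumptions that the text does not state: a revisit is recognised by the mime value `warc/revisit`, and a revisit is paired with its original by an equal digest.

```python
ORIG_FIELDS = ('length', 'offset', 'filename')


def resolve_revisits(cdx_iter):
    """Add orig.* fields, copied from the original record for revisits."""
    originals = {}  # digest -> first non-revisit record seen
    for cdx in cdx_iter:
        cdx = dict(cdx)  # work on a copy, leave the input untouched
        is_revisit = cdx.get('mime') == 'warc/revisit'  # assumed marker
        original = originals.get(cdx.get('digest'))

        if original is None and not is_revisit:
            originals[cdx.get('digest')] = cdx

        for field in ORIG_FIELDS:
            if is_revisit and original is not None:
                cdx['orig.' + field] = original.get(field, '-')
            else:
                cdx['orig.' + field] = '-'
        yield cdx


# Made-up records: an original capture followed by a revisit of it.
records = [
    {'digest': 'AAA', 'mime': 'text/html',
     'length': '100', 'offset': '0', 'filename': 'a.warc.gz'},
    {'digest': 'AAA', 'mime': 'warc/revisit',
     'length': '50', 'offset': '500', 'filename': 'b.warc.gz'},
]
resolved = list(resolve_revisits(records))
print(resolved[1]['orig.filename'])
```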
-
pywb.warcserver.index.cdxops.
cdx_reverse
(cdx_iter, limit)[source]¶ return cdx records in reverse order.
-
pywb.warcserver.index.cdxops.
cdx_sort_closest
(closest, cdx_iter, limit=10)[source]¶ sort CDXCaptureResult by closest to timestamp.
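Sorting by closeness to a timestamp can be sketched with `heapq.nsmallest`, keyed on the absolute time difference. The sketch assumes full 14-digit timestamps and plain dict records, and is not the pywb implementation.

```python
import heapq
from datetime import datetime


def to_datetime(timestamp):
    """Parse a 14-digit timestamp such as 20140101000000."""
    return datetime.strptime(timestamp, '%Y%m%d%H%M%S')


def sort_closest(closest, cdx_iter, limit=10):
    """Return up to `limit` records, nearest to `closest` first."""
    target = to_datetime(closest)
    return heapq.nsmallest(
        limit,
        cdx_iter,
        key=lambda cdx: abs(to_datetime(cdx['timestamp']) - target),
    )


# Made-up records for illustration.
records = [
    {'timestamp': '20130101000000'},
    {'timestamp': '20140101000000'},
    {'timestamp': '20150101000000'},
]
nearest = sort_closest('20140102000000', records, limit=2)
print([r['timestamp'] for r in nearest])
```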
-
pywb.warcserver.index.cdxops.
create_merged_cdx_gen
(sources, query)[source]¶ Create a generator that loads and merges CDX streams, ensuring that the CDX sources are loaded lazily.
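Lazy loading plus merging of sorted streams can be illustrated with a generator wrapped around `heapq.merge`. In this sketch the sources are hypothetical zero-argument loader callables, and nothing is loaded until the first item is requested.

```python
import heapq


def create_merged_gen(source_loaders):
    """Return a generator that opens and merges sources on first use."""
    def gen():
        # The loaders run only when iteration starts.
        streams = [load() for load in source_loaders]
        for line in heapq.merge(*streams):
            yield line
    return gen()


loaded = []  # records which hypothetical sources have been opened


def make_loader(name, lines):
    def load():
        loaded.append(name)
        return iter(lines)
    return load


merged = create_merged_gen([
    make_loader('a', ['com,a)/ 2013', 'com,c)/ 2015']),
    make_loader('b', ['com,b)/ 2014']),
])
opened_before = list(loaded)  # still empty: nothing has been loaded yet
merged_lines = list(merged)
print(opened_before, merged_lines)
```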
pywb.warcserver.index.fuzzymatcher module¶
-
class
pywb.warcserver.index.fuzzymatcher.
FuzzyMatcher
(filename=None)[source]¶ Bases:
object
-
DEFAULT_FILTER
= ['urlkey:{0}']¶
-
DEFAULT_MATCH_TYPE
= 'prefix'¶
-
DEFAULT_REPLACE_AFTER
= '?'¶
-
FUZZY_SKIP_PARAMS
= ('alt_url', 'reverse', 'closest', 'end_key', 'url', 'matchType', 'filter')¶
-
-
class
pywb.warcserver.index.fuzzymatcher.
FuzzyRule
(url_prefix, regex, replace_after, filter_str, match_type, find_all)¶ Bases:
tuple
-
filter_str
¶ Alias for field number 3
-
find_all
¶ Alias for field number 5
-
match_type
¶ Alias for field number 4
-
regex
¶ Alias for field number 1
-
replace_after
¶ Alias for field number 2
-
url_prefix
¶ Alias for field number 0
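The "Alias for field number N" entries are what a named tuple produces: each attribute is an alias for a position in the tuple. A short sketch with the same field names follows. The values are illustrative, taken partly from the FuzzyMatcher defaults listed above.

```python
from collections import namedtuple

# Same field names and order as FuzzyRule above.
FuzzyRule = namedtuple('FuzzyRule', [
    'url_prefix', 'regex', 'replace_after',
    'filter_str', 'match_type', 'find_all',
])

# Illustrative values only.
rule = FuzzyRule(
    url_prefix='com,example)/',
    regex=None,
    replace_after='?',
    filter_str=['urlkey:{0}'],
    match_type='prefix',
    find_all=False,
)

# Attribute access and positional access refer to the same slot.
print(rule.url_prefix == rule[0], rule.find_all == rule[5])
```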
-
pywb.warcserver.index.indexsource module¶
-
class
pywb.warcserver.index.indexsource.
BaseIndexSource
[source]¶ Bases:
object
-
WAYBACK_ORIG_SUFFIX
= '{timestamp}id_/{url}'¶
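WAYBACK_ORIG_SUFFIX is a format template with timestamp and url placeholders. The sketch below only shows how the template expands. The replay prefix is a hypothetical value, and how pywb combines the suffix with a replay URL is not stated here.

```python
WAYBACK_ORIG_SUFFIX = '{timestamp}id_/{url}'

suffix = WAYBACK_ORIG_SUFFIX.format(
    timestamp='20140101000000',
    url='http://example.com/',
)

# Hypothetical replay prefix, used only to show a complete URL.
replay_prefix = 'http://archive.example/web/'
print(replay_prefix + suffix)
```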
-
logger
= <Logger warcserver (WARNING)>¶
-
-
class
pywb.warcserver.index.indexsource.
FileIndexSource
(filename, config=None)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
CDX_EXT
= ('.cdx', '.cdxj')¶
-
-
class
pywb.warcserver.index.indexsource.
MementoIndexSource
(timegate_url, timemap_url, replay_url)[source]¶
-
class
pywb.warcserver.index.indexsource.
RedisIndexSource
(redis_url=None, redis=None, key_template=None, **kwargs)[source]¶
-
class
pywb.warcserver.index.indexsource.
RemoteIndexSource
(api_url, replay_url, url_field='load_url', closest_limit=100)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
CDX_MATCH_RX
= re.compile('^cdxj?\\+(?P<url>https?\\:.*)')¶
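CDX_MATCH_RX accepts strings that begin with cdx+ or cdxj+ followed by an http or https URL, and captures that URL in the group named url. A short check of the pattern against made-up inputs:

```python
import re

# Same pattern as CDX_MATCH_RX above, written as a raw string.
CDX_MATCH_RX = re.compile(r'^cdxj?\+(?P<url>https?\:.*)')

# Made-up inputs for illustration.
for value in ('cdx+https://example.com/cdx-server',
              'cdxj+http://example.com/index',
              'https://example.com/cdx-server'):
    m = CDX_MATCH_RX.match(value)
    print(value, '->', m.group('url') if m else None)
```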
-
-
class
pywb.warcserver.index.indexsource.
WBMementoIndexSource
(timegate_url, timemap_url, replay_url)[source]¶ Bases:
pywb.warcserver.index.indexsource.MementoIndexSource
-
WAYBACK_ORIG_SUFFIX
= '{timestamp}im_/{url}'¶
-
WBURL_MATCH
= re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')¶
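WBURL_MATCH splits a wayback-style path into an optional timestamp of up to 14 digits, an optional modifier ending in an underscore (such as id_ or im_), and the remaining URL. A short check against a made-up path:

```python
import re

# Same pattern as WBURL_MATCH above, written as a raw string.
WBURL_MATCH = re.compile(r'([0-9]{0,14})?(?:\w+_)?/{0,3}(.*)')

m = WBURL_MATCH.match('20140101000000id_/http://example.com/')
timestamp, url = m.group(1), m.group(2)
print(timestamp, url)
```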
-
-
class
pywb.warcserver.index.indexsource.
XmlQueryIndexSource
(query_api_url)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
An index source class for XML files
-
EXACT_QUERY
= 'type:urlquery url:'¶
-
PREFIX_QUERY
= 'type:prefixquery url:'¶
-
convert_to_cdx
(item)[source]¶ Converts the etree element to a CDX object
Parameters: item – The etree element to be converted Returns: The CDXObject representing the supplied etree element object Return type: CDXObject
-
gettext
(item, name)[source]¶ Returns the value of the supplied name
Parameters: - item – The etree element to be converted
- name – The name of the field to get its value for
Returns: The value of the field
Return type:
-
classmethod
init_from_config
(config)[source]¶ Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied dictionary contains the type key equal to xmlquery
Parameters: config (dict[str, str]) –
Returns: The initialized XmlQueryIndexSource or None
Return type: XmlQueryIndexSource|None
-
classmethod
init_from_string
(value)[source]¶ Creates and initializes a new instance of XmlQueryIndexSource IFF the supplied value starts with xmlquery+
Parameters: value (str) – The string by which to initialize the XmlQueryIndexSource Returns: The initialized XmlQueryIndexSource or None Return type: XmlQueryIndexSource|None
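The "starts with xmlquery+" check can be sketched as a prefix test. To stay independent of pywb, this hypothetical helper returns the remainder of the string (presumably the query API URL) instead of an XmlQueryIndexSource instance.

```python
PREFIX = 'xmlquery+'


def query_url_from_string(value):
    """Return the text after 'xmlquery+', or None if the prefix is absent."""
    if value.startswith(PREFIX):
        return value[len(PREFIX):]
    return None


# Made-up inputs for illustration.
print(query_url_from_string('xmlquery+http://example.com/query'))
print(query_url_from_string('http://example.com/query'))
```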
-
load_index
(params)[source]¶ Loads the xml query index based on the supplied params
Parameters: params (dict[str, str]) – The query params
Returns: A list or generator of cdx objects
Raises: - NotFoundException – If the query url is not found or the query returns no cdx entries
- BadRequestException – If the match type is not exact or prefix
-
pywb.warcserver.index.query module¶
pywb.warcserver.index.zipnum module¶
-
class
pywb.warcserver.index.zipnum.
LocMapResolver
(loc_summary, loc_filename)[source]¶ Bases:
object
Lookup shards based on a file mapping shard name to one or more paths. The entries are tab delimited.
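A tab-delimited file that maps a shard name to one or more paths can be read into a dictionary as sketched below. `load_loc_map` is a hypothetical helper, and the choice to skip lines without a path is an assumption.

```python
def load_loc_map(lines):
    """Map each shard name to the list of paths that follow it on the line."""
    loc_map = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue  # no path given; skipped in this sketch
        loc_map[parts[0]] = parts[1:]
    return loc_map


# Made-up location file contents.
sample = [
    'part-0\t/data/part-0.gz\n',
    'part-1\t/data/part-1.gz\t/backup/part-1.gz\n',
    'broken-line\n',
]
loc_map = load_loc_map(sample)
print(loc_map)
```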
-
class
pywb.warcserver.index.zipnum.
LocPrefixResolver
(loc_summary, loc_config)[source]¶ Bases:
object
Use a prefix lookup, where the prefix can either be a fixed string or can be a regex replacement of the index summary path
-
class
pywb.warcserver.index.zipnum.
ZipNumIndexSource
(summary, config=None)[source]¶ Bases:
pywb.warcserver.index.indexsource.BaseIndexSource
-
DEFAULT_MAX_BLOCKS
= 10¶
-
DEFAULT_RELOAD_INTERVAL
= 10¶
-
IDX_EXT
= ('.idx', '.summary')¶
-