pywb.warcserver.index package
Submodules
pywb.warcserver.index.aggregator module
class pywb.warcserver.index.aggregator.BaseDirectoryIndexSource(base_prefix, base_dir='', name='', config=None)
Bases: pywb.warcserver.index.aggregator.BaseAggregator
INDEX_SOURCES = [(('.cdx', '.cdxj'), <class 'pywb.warcserver.index.indexsource.FileIndexSource'>), (('.idx', '.summary'), <class 'pywb.warcserver.index.zipnum.ZipNumIndexSource'>)]

class pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)
Bases: pywb.warcserver.index.aggregator.BaseAggregator, pywb.warcserver.index.indexsource.RedisIndexSource

class pywb.warcserver.index.aggregator.CacheDirectoryIndexSource(*args, **kwargs)
Bases: pywb.warcserver.index.aggregator.CacheDirectoryMixin, pywb.warcserver.index.aggregator.DirectoryIndexSource

class pywb.warcserver.index.aggregator.DirectoryIndexSource(*args, **kwargs)
Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseDirectoryIndexSource
class pywb.warcserver.index.aggregator.GeventMixin(*args, **kwargs)
Bases: object
DEFAULT_TIMEOUT = 5.0

class pywb.warcserver.index.aggregator.GeventTimeoutAggregator(*args, **kwargs)
Bases: pywb.warcserver.index.aggregator.TimeoutMixin, pywb.warcserver.index.aggregator.GeventMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator

class pywb.warcserver.index.aggregator.RedisMultiKeyIndexSource(*args, **kwargs)
Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseRedisMultiKeyIndexSource

class pywb.warcserver.index.aggregator.SimpleAggregator(*args, **kwargs)
Bases: pywb.warcserver.index.aggregator.SeqAggMixin, pywb.warcserver.index.aggregator.BaseSourceListAggregator
pywb.warcserver.index.cdxobject module

class pywb.warcserver.index.cdxobject.CDXObject(cdxline=b'')
Bases: collections.OrderedDict
Dictionary object representing a parsed CDX line.
CDX_ALT_FIELDS = {'d': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length', 'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url', 's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url'}
CDX_FORMATS = [
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'length'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'length', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'robotflags', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'],
    ['urlkey', 'timestamp', 'url', 'mime', 'status', 'digest', 'redirect', 'offset', 'filename', 'orig.length', 'orig.offset', 'orig.filename'],
]
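One way to read CDX_ALT_FIELDS is as an alias table from alternate field names to canonical ones. The sketch below illustrates that reading only; normalize_fields is a hypothetical helper and not part of pywb, and how CDXObject itself applies the table is not shown on this page.

```python
# Alias table copied from CDXObject.CDX_ALT_FIELDS above.
CDX_ALT_FIELDS = {
    'd': 'digest', 'f': 'filename', 'k': 'urlkey', 'l': 'length',
    'm': 'mime', 'mimetype': 'mime', 'o': 'offset', 'original': 'url',
    's': 'length', 'statuscode': 'status', 't': 'timestamp', 'u': 'url',
}


def normalize_fields(record):
    """Hypothetical helper: rename alternate keys to their canonical names."""
    # Keys without an alias (e.g. 'mime') are kept unchanged.
    return {CDX_ALT_FIELDS.get(key, key): value for key, value in record.items()}


normalized = normalize_fields(
    {'u': 'http://example.com/', 'statuscode': '200', 'mime': 'text/html'}
)
```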

class pywb.warcserver.index.cdxobject.IDXObject(idxline)
Bases: collections.OrderedDict
FORMAT = ['urlkey', 'part', 'offset', 'length', 'lineno']
NUM_REQ_FIELDS = 4
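FORMAT lists five field names while NUM_REQ_FIELDS is 4, which suggests the last field (lineno) is optional. The following is a minimal sketch of that interpretation, assuming tab-separated text input; parse_idx_line is a hypothetical function, and the separator and error type are assumptions rather than documented behavior.

```python
from collections import OrderedDict

# Values copied from IDXObject above.
FORMAT = ['urlkey', 'part', 'offset', 'length', 'lineno']
NUM_REQ_FIELDS = 4


def parse_idx_line(idxline, sep='\t'):
    """Hypothetical parser: map separated values onto the FORMAT names."""
    fields = idxline.rstrip('\n').split(sep)
    if len(fields) < NUM_REQ_FIELDS:
        raise ValueError(
            'expected at least %d fields, got %d' % (NUM_REQ_FIELDS, len(fields))
        )
    # zip() stops at the shorter sequence, so a missing lineno is simply absent.
    return OrderedDict(zip(FORMAT, fields))


entry = parse_idx_line('com,example)/ 20200101000000\tpart-0\t0\t1024\n')
```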

pywb.warcserver.index.cdxops module

pywb.warcserver.index.cdxops.cdx_collapse_time_status(cdx_iter, timelen=10)
Collapse records by timestamp and status code.
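A plausible reading of this function is that a record is dropped when its timestamp, truncated to timelen characters, and its status both repeat those of the preceding record. The sketch below implements that reading on plain dicts; it is an assumption about the behavior, not the library's code.

```python
def collapse_time_status(cdx_iter, timelen=10):
    """Sketch: skip a record whose (timestamp[:timelen], status) repeats the previous one."""
    last_key = None
    for cdx in cdx_iter:
        key = (cdx['timestamp'][:timelen], cdx.get('status', ''))
        if key != last_key:
            last_key = key
            yield cdx


records = [
    {'timestamp': '20200101120000', 'status': '200'},
    {'timestamp': '20200101125959', 'status': '200'},  # same hour, same status
    {'timestamp': '20200101130000', 'status': '200'},
    {'timestamp': '20200101130500', 'status': '404'},  # same hour, new status
]
collapsed = list(collapse_time_status(records))
```

With the default timelen of 10, the first ten characters cover year through hour, so captures in the same hour with the same status collapse to the first one.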

pywb.warcserver.index.cdxops.cdx_filter(cdx_iter, filter_strings)
Filter CDX records by regex. If a filter is in field:regex form, the regex is applied to cdx[field].
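The field:regex form can be illustrated with a small stand-alone filter. This is a sketch under stated assumptions: records are plain dicts, every filter must match for a record to be kept, and the regex is anchored at the start of the value (re.match). The real function may support additional filter syntax not shown here.

```python
import re


def filter_cdx(cdx_iter, filter_strings):
    """Sketch: keep records whose cdx[field] matches every 'field:regex' filter."""
    compiled = []
    for filter_string in filter_strings:
        # Split on the first ':' only, so the regex itself may contain colons.
        field, _, pattern = filter_string.partition(':')
        compiled.append((field, re.compile(pattern)))
    for cdx in cdx_iter:
        if all(rx.match(cdx.get(field, '')) for field, rx in compiled):
            yield cdx


records = [
    {'url': 'http://example.com/a', 'status': '200', 'mime': 'text/html'},
    {'url': 'http://example.com/b', 'status': '404', 'mime': 'text/html'},
    {'url': 'http://example.com/c', 'status': '204', 'mime': 'image/png'},
]
kept = list(filter_cdx(records, ['status:2..', 'mime:text/html']))
```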

pywb.warcserver.index.cdxops.cdx_load(sources, query, process=True)
Merge text CDX lines from sources and return an iterator over the filtered and access-checked sequence of CDX objects.
Parameters:
- sources – iterable of text CDX sources.
- process – bool; whether to perform the processing (sorting/filtering/grouping) ops.

pywb.warcserver.index.cdxops.cdx_resolve_revisits(cdx_iter)
Resolve revisits.
This filter adds three fields to each CDX record: orig.length, orig.offset, and orig.filename. For revisit records, these fields hold the corresponding values from the previous non-revisit (original) CDX record. They are all "-" for non-revisit records.
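The orig.* bookkeeping described above can be sketched on plain dicts. Two details here are assumptions, since this page does not state them: a revisit is recognized by mime == 'warc/revisit', and it is paired with the earlier original that has the same digest.

```python
def resolve_revisits(cdx_iter):
    """Sketch: add orig.* fields, copying them from the earlier original for revisits."""
    originals = {}  # digest -> first non-revisit record seen with that digest
    for cdx in cdx_iter:
        cdx = dict(cdx)  # work on a copy so the input records stay untouched
        is_revisit = cdx.get('mime') == 'warc/revisit'  # assumed revisit marker
        original = originals.get(cdx.get('digest')) if is_revisit else None
        if not is_revisit:
            originals.setdefault(cdx.get('digest'), cdx)
        for field in ('length', 'offset', 'filename'):
            cdx['orig.' + field] = original.get(field, '-') if original else '-'
        yield cdx


records = [
    {'digest': 'ABC', 'mime': 'text/html',
     'length': '100', 'offset': '0', 'filename': 'a.warc.gz'},
    {'digest': 'ABC', 'mime': 'warc/revisit',
     'length': '50', 'offset': '700', 'filename': 'b.warc.gz'},
]
resolved = list(resolve_revisits(records))
```

In this sketch a revisit with no earlier original also gets "-" in all three fields.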

pywb.warcserver.index.cdxops.cdx_reverse(cdx_iter, limit)
Return CDX records in reverse order.

pywb.warcserver.index.cdxops.cdx_sort_closest(closest, cdx_iter, limit=10)
Sort CDXCaptureResult records by closeness to the given timestamp.
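Sorting by closeness can be sketched as ordering records by the absolute time difference from the target and keeping the first limit results. The version below assumes full 14-digit timestamps (YYYYMMDDhhmmss) and plain dict records; it illustrates the idea and is not the library's implementation.

```python
import heapq
from datetime import datetime

TS_FORMAT = '%Y%m%d%H%M%S'


def sort_closest(closest, cdx_iter, limit=10):
    """Sketch: return up to `limit` records, nearest to `closest` first."""
    target = datetime.strptime(closest, TS_FORMAT)

    def distance(cdx):
        return abs(datetime.strptime(cdx['timestamp'], TS_FORMAT) - target)

    # nsmallest returns the results sorted from smallest to largest distance.
    return heapq.nsmallest(limit, cdx_iter, key=distance)


records = [
    {'timestamp': '20190101000000'},
    {'timestamp': '20200101000000'},
    {'timestamp': '20200301000000'},
]
nearest = sort_closest('20200201000000', records, limit=2)
```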

pywb.warcserver.index.cdxops.create_merged_cdx_gen(sources, query)
Create a generator which loads and merges CDX streams, ensuring that the CDX sources are lazily loaded.
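Lazy merging of sorted streams can be illustrated with heapq.merge from the standard library. The sketch assumes each source yields lines that are already sorted; lazy_source and merged_cdx_gen are hypothetical names, and the merge strategy pywb actually uses is not documented on this page.

```python
import heapq


def lazy_source(name, lines, opened):
    """Hypothetical source: notes when it is first read, then yields sorted lines."""
    opened.append(name)  # runs only when the first line is requested
    yield from lines


def merged_cdx_gen(sources):
    """Sketch: merge already-sorted line streams into one sorted stream."""
    return heapq.merge(*sources)


opened = []
merged = merged_cdx_gen([
    lazy_source('a', ['com,example)/ 2019', 'com,example)/ 2021'], opened),
    lazy_source('b', ['com,example)/ 2020'], opened),
])
opened_before_read = list(opened)  # nothing has been read yet
merged_lines = list(merged)
```

Because both the sources and the merge are generators, no source is touched until the caller starts iterating.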
pywb.warcserver.index.fuzzymatcher module

class pywb.warcserver.index.fuzzymatcher.FuzzyMatcher(filename=None)
Bases: object
DEFAULT_FILTER = ['urlkey:{0}']
DEFAULT_MATCH_TYPE = 'prefix'
DEFAULT_REPLACE_AFTER = '?'
DEFAULT_RE_TYPE = 'search'
FUZZY_SKIP_PARAMS = ('alt_url', 'reverse', 'closest', 'end_key', 'url', 'matchType', 'filter')

class pywb.warcserver.index.fuzzymatcher.FuzzyRule(url_prefix, regex, replace_after, filter_str, match_type, re_type)
Bases: tuple
Fields, by position:
- url_prefix – alias for field number 0
- regex – alias for field number 1
- replace_after – alias for field number 2
- filter_str – alias for field number 3
- match_type – alias for field number 4
- re_type – alias for field number 5
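The "Alias for field number N" entries are what collections.namedtuple generates, so an equivalent tuple type can be rebuilt for illustration. The url_prefix and regex values below are made up for the example; the remaining values reuse the FuzzyMatcher defaults listed above.

```python
from collections import namedtuple

# Stand-alone equivalent of the documented tuple layout (not imported from pywb).
FuzzyRule = namedtuple(
    'FuzzyRule',
    ['url_prefix', 'regex', 'replace_after', 'filter_str', 'match_type', 're_type'],
)

rule = FuzzyRule(
    url_prefix='com,example)/',   # illustrative value
    regex=r'[?&](id=[^&]+)',      # illustrative value
    replace_after='?',            # FuzzyMatcher.DEFAULT_REPLACE_AFTER
    filter_str=['urlkey:{0}'],    # FuzzyMatcher.DEFAULT_FILTER
    match_type='prefix',          # FuzzyMatcher.DEFAULT_MATCH_TYPE
    re_type='search',             # FuzzyMatcher.DEFAULT_RE_TYPE
)
```

Each field is reachable by name or by position, for example rule.filter_str and rule[3].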

pywb.warcserver.index.indexsource module

class pywb.warcserver.index.indexsource.BaseIndexSource
Bases: object
WAYBACK_ORIG_SUFFIX = '{timestamp}id_/{url}'
logger = <Logger warcserver (WARNING)>

class pywb.warcserver.index.indexsource.FileIndexSource(filename, config=None)
Bases: pywb.warcserver.index.indexsource.BaseIndexSource
CDX_EXT = ('.cdx', '.cdxj')

class pywb.warcserver.index.indexsource.MementoIndexSource(timegate_url, timemap_url, replay_url)

class pywb.warcserver.index.indexsource.RedisIndexSource(redis_url=None, redis=None, key_template=None, **kwargs)

class pywb.warcserver.index.indexsource.RemoteIndexSource(api_url, replay_url, url_field='load_url', closest_limit=100)
Bases: pywb.warcserver.index.indexsource.BaseIndexSource
CDX_MATCH_RX = re.compile('^cdxj?\\+(?P<url>https?\\:.*)')

class pywb.warcserver.index.indexsource.WBMementoIndexSource(timegate_url, timemap_url, replay_url)
Bases: pywb.warcserver.index.indexsource.MementoIndexSource
WAYBACK_ORIG_SUFFIX = '{timestamp}im_/{url}'
WBURL_MATCH = re.compile('([0-9]{0,14})?(?:\\w+_)?/{0,3}(.*)')
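The WBURL_MATCH pattern can be tried in isolation to see which parts it captures: an optional timestamp of up to 14 digits, an optional modifier ending in an underscore, up to three slashes, and the remaining URL. The sketch recompiles the same pattern locally; split_wburl is a hypothetical helper, and how the class applies the pattern is not shown on this page.

```python
import re

# Same pattern as WBMementoIndexSource.WBURL_MATCH, written as a raw string.
WBURL_MATCH = re.compile(r'([0-9]{0,14})?(?:\w+_)?/{0,3}(.*)')


def split_wburl(value):
    """Hypothetical helper: return (timestamp, url) captured by WBURL_MATCH."""
    # Every part of the pattern is optional, so match() always succeeds.
    match = WBURL_MATCH.match(value)
    return match.group(1), match.group(2)


timestamp, url = split_wburl('20200101000000id_/http://example.com/')
```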

class pywb.warcserver.index.indexsource.XmlQueryIndexSource(query_api_url)
Bases: pywb.warcserver.index.indexsource.BaseIndexSource
An index source class for XML files.
EXACT_QUERY = 'type:urlquery url:'
PREFIX_QUERY = 'type:prefixquery url:'

convert_to_cdx(item)
Converts the etree element to a CDX object.
Parameters: item – The etree element to be converted
Returns: The CDXObject representing the supplied etree element object
Return type: CDXObject

gettext(item, name)
Returns the value of the supplied name.
Parameters:
- item – The etree element to be converted
- name – The name of the field to get its value for
Returns: The value of the field
Return type:

classmethod init_from_config(config)
Creates and initializes a new instance of XmlQueryIndexSource if and only if the supplied dictionary contains the type key equal to xmlquery.
Parameters: config (dict[str, str])
Returns: The initialized XmlQueryIndexSource or None
Return type: XmlQueryIndexSource|None

classmethod init_from_string(value)
Creates and initializes a new instance of XmlQueryIndexSource if and only if the supplied value starts with xmlquery+.
Parameters: value (str) – The string by which to initialize the XmlQueryIndexSource
Returns: The initialized XmlQueryIndexSource or None
Return type: XmlQueryIndexSource|None
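The two classmethods above share a "construct or return None" contract. The sketch below shows only the documented conditions as stand-alone checks; both function names are hypothetical, and treating the text after xmlquery+ as the query API URL is an assumption, not something this page states.

```python
PREFIX = 'xmlquery+'


def query_url_from_string(value):
    """Hypothetical helper: return the text after 'xmlquery+', or None."""
    if not value.startswith(PREFIX):
        return None
    # Assumption: the remainder is what would be passed as query_api_url.
    return value[len(PREFIX):]


def is_xmlquery_config(config):
    """Hypothetical helper: True when the config's 'type' key equals 'xmlquery'."""
    return config.get('type') == 'xmlquery'


query_url = query_url_from_string('xmlquery+http://localhost:8080/xmlquery')
```

Returning None for non-matching input lets a caller try several index source types in turn and use the first one that accepts the value.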

load_index(params)
Loads the XML query index based on the supplied params.
Parameters: params (dict[str, str]) – The query params
Returns: A list or generator of CDX objects
Raises:
- NotFoundException – If the query URL is not found or the query returns no CDX entries
- BadRequestException – If the match type is not exact or prefix
pywb.warcserver.index.query module
pywb.warcserver.index.zipnum module

class pywb.warcserver.index.zipnum.LocMapResolver(loc_summary, loc_filename)
Bases: object
Look up shards based on a file mapping each shard name to one or more paths. The entries are tab-delimited.
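A tab-delimited shard map of this kind could be parsed as follows. This is a minimal sketch assuming one shard per line, with the shard name in the first column and one or more paths in the remaining columns; load_loc_map and the sample paths are hypothetical.

```python
def load_loc_map(lines):
    """Sketch: parse 'shard<TAB>path[<TAB>path...]' lines into a dict of lists."""
    loc_map = {}
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue  # assumption: skip blank lines and lines without a path
        loc_map[parts[0]] = parts[1:]
    return loc_map


loc_map = load_loc_map([
    'part-0\t/data/part-0.gz\t/mirror/part-0.gz\n',
    'part-1\t/data/part-1.gz\n',
    '\n',
])
```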

class pywb.warcserver.index.zipnum.LocPrefixResolver(loc_summary, loc_config)
Bases: object
Use a prefix lookup, where the prefix can be either a fixed string or a regex replacement of the index summary path.

class pywb.warcserver.index.zipnum.ZipNumIndexSource(summary, config=None)
Bases: pywb.warcserver.index.indexsource.BaseIndexSource
DEFAULT_MAX_BLOCKS = 10
DEFAULT_RELOAD_INTERVAL = 10
IDX_EXT = ('.idx', '.summary')