Recorder
The recorder component acts as a proxy, intercepting requests to and responses from the Warcserver and recording them to a WARC file on disk.
The recorder uses the pywb.recorder.multifilewarcwriter.MultiFileWARCWriter, which extends the base warcio.warcwriter.WARCWriter from warcio and provides support for:
- appending to multiple WARC files at once
- WARC ‘rollover’ based on maximum size or idle time
- indexing (CDXJ) on write
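The rollover rule in the list above can be illustrated with a minimal, self-contained sketch. It does not use pywb itself: `RolloverPolicy`, its parameters, and the convention that `0` disables a limit are all assumptions made for illustration, not the actual MultiFileWARCWriter API.

```python
import time


class RolloverPolicy:
    """Hypothetical helper (not part of pywb): decides when to start a new WARC."""

    def __init__(self, max_size=0, max_idle_secs=0):
        # Assumption for this sketch: a value of 0 disables that limit.
        self.max_size = max_size
        self.max_idle_secs = max_idle_secs

    def should_rollover(self, current_size, last_write, now=None):
        """Return True if the current file is too large or has been idle too long."""
        now = time.time() if now is None else now
        if self.max_size and current_size >= self.max_size:
            return True
        if self.max_idle_secs and (now - last_write) > self.max_idle_secs:
            return True
        return False
```

A writer would consult such a policy before each write, closing the current file and opening a new one whenever it returns `True`.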
Many of the Recorder's features were created for use with the Webrecorder project, although the core recorder is also used to provide basic recording via the /record/ endpoint. (See: Recording Mode)
Deduplication Filters
The core recorder class provides optional deduplication using the pywb.recorder.redisindexer.WritableRedisIndexer class, which requires Redis to store the index. When a duplicate is detected, the recorder can be configured to do one of the following:
- write duplicate responses.
- write revisit records.
- ignore duplicates and don’t write them to the WARC.
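The three options above can be sketched as follows. This is an illustrative, self-contained example only: it replaces the Redis-backed index with an in-memory set, and `DedupIndex`, the `WRITE`/`REVISIT`/`SKIP` constants, and the choice of keying on URL plus SHA-1 payload digest are assumptions, not pywb's actual classes or lookup logic.

```python
import hashlib

# Hypothetical outcome labels for this sketch.
WRITE, REVISIT, SKIP = "write", "revisit", "skip"


class DedupIndex:
    """Hypothetical in-memory stand-in for a Redis-backed dedup index."""

    def __init__(self, on_duplicate=REVISIT):
        if on_duplicate not in (WRITE, REVISIT, SKIP):
            raise ValueError("unknown duplicate policy: %r" % (on_duplicate,))
        self.on_duplicate = on_duplicate
        self._seen = set()

    def decide(self, url, payload):
        """Return what to do with a response body captured for url."""
        digest = hashlib.sha1(payload).hexdigest()
        key = (url, digest)
        if key not in self._seen:
            # First time this exact payload is seen for this URL.
            self._seen.add(key)
            return WRITE
        # Duplicate: apply the configured policy.
        return self.on_duplicate
```

With `on_duplicate=REVISIT`, the first capture of a payload returns `WRITE` and an identical later capture returns `REVISIT`, which corresponds to writing a small revisit record instead of the full response.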
Custom Filtering
The recorder also includes a filtering system that prevents certain requests and responses from being written. Filters include:
- Skipping by regex applied to the source (Warcserver-Source-Coll header from Warcserver)
- Skipping if a Recorder-Skip: 1 header is provided
- Skipping if a Range request header is provided
- Filtering out certain HTTP headers, for example, http-only cookies
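The filter rules above can be sketched in plain Python. The header names come from the list; the function names `should_skip` and `drop_http_only_cookies`, their signatures, and the assumption that Recorder-Skip arrives on the request are hypothetical and do not reflect pywb's own filter classes.

```python
import re


def should_skip(req_headers, resp_headers, source_regex=None):
    """Hypothetical check: return True if this exchange should not be recorded."""
    req = {k.lower(): v for k, v in req_headers.items()}
    resp = {k.lower(): v for k, v in resp_headers.items()}

    # Explicit opt-out header (assumed here to be sent with the request).
    if req.get("recorder-skip") == "1":
        return True

    # Range requests return partial content, so they are skipped.
    if "range" in req:
        return True

    # Skip when the source collection reported by Warcserver matches the regex.
    source = resp.get("warcserver-source-coll", "")
    if source_regex and re.search(source_regex, source):
        return True

    return False


def drop_http_only_cookies(headers):
    """Hypothetical header filter: remove Set-Cookie headers marked HttpOnly."""
    kept = []
    for name, value in headers:
        attrs = [part.strip().lower() for part in value.split(";")]
        if name.lower() == "set-cookie" and "httponly" in attrs:
            continue
        kept.append((name, value))
    return kept
```

For example, `should_skip({"Range": "bytes=0-99"}, {})` returns `True`, while an ordinary request with no matching headers returns `False` and would be recorded.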
The additional recorder functionality will be enhanced in a future version.
For more detailed examples, please consult the tests in pywb.recorder.test.test_recorder