Recorder

The recorder component acts a proxy component, intercepting requests to and response from the Warcserver and recording them to a WARC file on disk.

The recorder uses the pywb.recorder.multifilewarcwriter.MultiFileWARCWriter which extends the base warcio.warcwriter.WARCWriter from warcio and provides support for:

  • appending to multiple WARC files at once
  • WARC ‘rollover’ based on maximum size idle time
  • indexing (CDXJ) on write

Many of the features of the Recorder are created for use with Webrecorder project, although the core recorder is used to provide a basic recording via /record/ endpoint. (See: Recording Mode)

Deduplication Filters

The core recorder class provides for optional deduplication using the pywb.recorder.redisindexer.WritableRedisIndexer class which requires Redis to store the index, and can be used to either:

  • write duplicates responses.
  • write revisit records.
  • ignore duplicates and don’t write to WARC.

Custom Filtering

The recorder filter system also includes a filtering system to allow for not writing certain requests and responses. Filters include:

  • Skipping by regex applied to source (Warcserver-Source-Coll header from Warcserver)
  • Skipping if Recorder-Skip: 1 header is provided
  • Skipping if Range request header is provided
  • Filtering out certain HTTP headers, for example, http-only cookies

The additional recorder functionality will be enhanced in a future version.

For a more detailed examples, please consult the tests in pywb.recorder.test.test_recorder