Using OutbackCDX with pywb
The recommended setup is to run OutbackCDX alongside pywb. OutbackCDX provides an index (CDX) server and can efficiently store and look up web archive data by URL.
Adding CDX to OutbackCDX
To set up OutbackCDX, follow the instructions in the OutbackCDX README.
Since pywb also uses port 8080 by default, be sure to run OutbackCDX on a different port, e.g.:
java -jar outbackcdx*.jar -p 8084
OutbackCDX can generally ingest existing CDX used in OpenWayback simply by POSTing to OutbackCDX at a new index endpoint.
For example, assuming OutbackCDX is running on port 8084, to add the CDX files index1.cdx and index2.cdx, run:
curl -X POST --data-binary @index1.cdx http://localhost:8084/mycoll
curl -X POST --data-binary @index2.cdx http://localhost:8084/mycoll
The contents of each CDX file are added to the mycoll OutbackCDX index, which can correspond to the web archive collection mycoll.
The index is created automatically if it does not exist.
See the OutbackCDX Docs for more info on ingesting CDX.
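When there are many CDX files, the POST step above can be wrapped in a small loop. The following is a minimal sketch and not part of OutbackCDX or pywb: the function name ingest_cdx and the DRY_RUN variable are illustrative assumptions. With DRY_RUN=1 it only prints the curl commands it would run, which is a cheap way to check the endpoint and file list before sending anything.

```shell
#!/bin/sh
# Hypothetical helper (not provided by OutbackCDX or pywb):
# POST each CDX file to an OutbackCDX index endpoint.
# Usage: ingest_cdx ENDPOINT FILE...
ingest_cdx() {
    endpoint=$1
    shift
    for cdx in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: only show the command that would be executed.
            echo "curl -X POST --data-binary @$cdx $endpoint"
        else
            # --fail makes curl exit non-zero on an HTTP error status,
            # so the loop stops at the first file that is rejected.
            curl --fail -X POST --data-binary "@$cdx" "$endpoint" || return 1
        fi
    done
}

# Example: ingest_cdx http://localhost:8084/mycoll index1.cdx index2.cdx
```

Stopping at the first failure keeps it clear which file was not ingested, so the run can be resumed from that file.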
(Re)generating CDX from WARCs
In some cases it is useful to regenerate the CDX with pywb from existing WARCs instead of reusing existing CDX:
- If your CDX is 9-field and does not include the compressed length, regenerating the CDX will result in more efficient HTTP range requests
- If you want to replay pages with POST requests, pywb-generated CDX will soon be supported in OutbackCDX (see: Issue #585, Issue #91)
To generate the CDX, run the cdx-indexer command (adding the -p flag if POST request handling is needed) for each WARC or set of WARCs you wish to index:
cdx-indexer /path/to/mywarcs/my.warc.gz > ./index1.cdx
cdx-indexer /path/to/all_warcs/*warc.gz > ./index2.cdx
Then, run the POST command as shown above to ingest the resulting CDX into OutbackCDX.
The above can be repeated for each WARC file, or for a set of WARCs using the *warc.gz wildcard.
If a CDX index is too big, OutbackCDX may fail to ingest it; in that case, generating and ingesting one index per WARC may be needed.
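The per-WARC approach can be sketched as a loop that indexes one WARC at a time and POSTs each resulting CDX file on its own. This is only an illustration under stated assumptions: the function name index_and_ingest, the DRY_RUN variable, and the choice to name each CDX file after its WARC are hypothetical and not part of pywb or OutbackCDX.

```shell
#!/bin/sh
# Hypothetical helper (not provided by pywb or OutbackCDX):
# index each WARC separately, then POST its CDX to OutbackCDX.
# Usage: index_and_ingest ENDPOINT WARC...
index_and_ingest() {
    endpoint=$1
    shift
    for warc in "$@"; do
        # Write the CDX to the current directory, named after the WARC,
        # e.g. /path/to/my.warc.gz -> my.warc.gz.cdx
        cdx="$(basename "$warc").cdx"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: only show the commands that would be executed.
            echo "cdx-indexer $warc > $cdx"
            echo "curl -X POST --data-binary @$cdx $endpoint"
            continue
        fi
        # Add -p to the cdx-indexer call if POST request handling is needed.
        cdx-indexer "$warc" > "$cdx" || return 1
        curl --fail -X POST --data-binary "@$cdx" "$endpoint" || return 1
    done
}

# Example: index_and_ingest http://localhost:8084/mycoll /path/to/all_warcs/*warc.gz
```

Because each request carries the index of a single WARC, a failure points directly at the WARC that caused it.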
Configure pywb with OutbackCDX
The pywb config.yaml should be configured to point to OutbackCDX.
Assuming a collection named mycoll, config.yaml can be configured as follows:
collections:
  mycoll:
    index_paths: cdx+http://localhost:8084/mycoll
    archive_paths: /path/to/mywarcs/
archive_paths can be configured to point to a directory of WARCs or a path index.
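Before starting pywb, it can help to confirm that the index named in config.yaml actually returns records. The sketch below assumes that OutbackCDX answers GET requests with a url query parameter on the index endpoint, as described in its README; the function name lookup_cdx and the example URL are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: ask an OutbackCDX index for the records of one URL.
# Assumes the index endpoint accepts a "url" query parameter.
# Usage: lookup_cdx ENDPOINT URL
lookup_cdx() {
    endpoint=$1
    target=$2
    # -G sends the data as a query string instead of a request body;
    # --data-urlencode percent-encodes the URL value.
    curl --fail -G --data-urlencode "url=$target" "$endpoint"
}

# Example: lookup_cdx http://localhost:8084/mycoll http://example.com/
```

If the lookup prints CDX lines for a URL known to be in the WARCs, the same endpoint should work as the index_paths value in config.yaml.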