Using OutbackCDX with pywb
The recommended setup is to run OutbackCDX alongside pywb. OutbackCDX provides an index (CDX) server and can efficiently store and look up web archive data by URL.
Adding CDX to OutbackCDX
To set up OutbackCDX, follow the instructions in the OutbackCDX README.
Since pywb also uses port 8080 by default, be sure to run OutbackCDX on a different port, for example port 8084: java -jar outbackcdx*.jar -p 8084
OutbackCDX can generally ingest existing CDX used in OpenWayback simply by POSTing it to a new OutbackCDX index endpoint.
For example, assuming OutbackCDX is running on port 8084, to add the CDX files index1.cdx and index2.cdx, run:
curl -X POST --data-binary @index1.cdx http://localhost:8084/mycoll
curl -X POST --data-binary @index2.cdx http://localhost:8084/mycoll
The contents of each CDX file are added to the mycoll OutbackCDX index, which can correspond to the web archive collection mycoll.
The index is created automatically if it does not exist.
See the OutbackCDX Docs for more info on ingesting CDX.
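When there are many CDX files, the POST step can be wrapped in a small shell function. The following is a minimal sketch, not part of pywb or OutbackCDX: the function name ingest_cdx and the OUTBACKCDX_URL variable are hypothetical, and the default URL assumes OutbackCDX is listening on port 8084 as above.

```shell
# Hypothetical helper: POST each CDX file to one OutbackCDX index.
# Usage: ingest_cdx <index-name> <file.cdx>...
ingest_cdx() {
  # OUTBACKCDX_URL is an assumed override; defaults to the port used in this guide
  index_url="${OUTBACKCDX_URL:-http://localhost:8084}/$1"
  shift
  for cdx in "$@"; do
    # --fail makes curl exit non-zero on an HTTP error status,
    # so the loop stops at the first file that is rejected
    curl --fail -sS -X POST --data-binary "@$cdx" "$index_url" || return 1
  done
}
```

With the function defined, ingest_cdx mycoll index1.cdx index2.cdx should be equivalent to the two curl commands shown above.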
(Re)generating CDX from WARCs
Existing CDX can usually be ingested as-is, but in some cases it may be useful to regenerate the CDX with pywb for existing WARCs:
- If your CDX is 9-field and does not include the compressed length, regenerating the CDX will result in more efficient HTTP range requests
- If you want to replay pages with POST requests; pywb-generated CDX will soon be supported in OutbackCDX (see Issue #585, Issue #91)
To generate the CDX, run the cdx-indexer command (with the -p flag for POST request handling) for each WARC or set of WARCs you wish to index:
cdx-indexer /path/to/mywarcs/my.warc.gz > ./index1.cdx
cdx-indexer /path/to/all_warcs/*.warc.gz > ./index2.cdx
Then, run the POST command as shown above to ingest to OutbackCDX.
The above can be repeated for each WARC file, or for a set of WARCs using the *.warc.gz wildcard.
If a CDX index is too large, OutbackCDX may fail to ingest it; in that case, generating and ingesting a separate index per WARC may be necessary.
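The per-WARC approach can be sketched as a loop that indexes one WARC, POSTs the result, and moves on. This is an illustrative sketch only: the function name ingest_per_warc is hypothetical, and it assumes cdx-indexer and curl are on the PATH and that the index URL follows the form used earlier.

```shell
# Hypothetical helper: index and ingest one WARC at a time so each POST stays small.
# Usage: ingest_per_warc <index-url> <file.warc.gz>...
ingest_per_warc() {
  index_url="$1"
  shift
  for warc in "$@"; do
    cdx="$(mktemp)" || return 1
    # Generate a CDX for this WARC only, then POST it to OutbackCDX
    cdx-indexer "$warc" > "$cdx" &&
      curl --fail -sS -X POST --data-binary "@$cdx" "$index_url"
    status=$?
    # Remove the temporary CDX whether or not the ingest succeeded
    rm -f "$cdx"
    [ "$status" -eq 0 ] || return "$status"
  done
}
```

For example: ingest_per_warc http://localhost:8084/mycoll /path/to/all_warcs/*.warc.gz. The loop stops at the first WARC that fails to index or ingest, which makes it easier to see which file caused the problem.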
Configure pywb with OutbackCDX
The config.yaml should be configured to point to OutbackCDX.
Assuming a collection named mycoll, the config.yaml can be configured as follows to use OutbackCDX:
collections:
  mycoll:
    index_paths: cdx+http://localhost:8084/mycoll
    archive_paths: /path/to/mywarcs/
The archive_paths can be configured to point to a directory of WARCs or a path index.
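Before starting pywb, it can help to confirm that the index actually returns records for a known URL. The sketch below assumes that OutbackCDX answers GET queries with a url parameter on the index endpoint, as described in its README; the function name check_index is hypothetical and example.org is a placeholder for a URL you expect to be in the index.

```shell
# Hypothetical helper: query an OutbackCDX index for captures of one URL.
# Usage: check_index <index-url> <url-to-look-up>
check_index() {
  # -G sends the URL-encoded data as a query string instead of a request body
  curl --fail -sS -G --data-urlencode "url=$2" "$1"
}
```

For example: check_index http://localhost:8084/mycoll example.org. If this prints no CDX lines, the index is likely empty for that URL, and pywb would have nothing to replay for it either.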