Warcserver¶
The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.
The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
The warcserver can be started directly installing pywb simply by running warcserver
(default port is 8070).
Note: when running wayback
, an instance of warcserver
is also started automatically.
Warcserver API¶
The Warcserver API encompasses the CDXJ Server API and provides a per collection endpoint, using a list of collections
defined in a YAML config file (default config.yaml
). It’s also possible to use Warcserver without the YAML config (see: Custom Warcserver Deployments). The endpoints are as follows:
/
- Home Page, JSON list of available endpoints.
For each collection <coll>
:
/<coll>/index
– Direct Index (compatible with CDXJ Server API)/<coll>/resource
– Direct Resource/<coll>/postreq/index
– POST request Index/<coll>/postreq/resource
– POST request Resource (most flexible for integration with downstream tools)
All endpoints accept the CDXJ Server API query arguments, although the “direct index” route is usually most useful for index lookup. while the “post request resource” route is most useful for integration with other downstream client tools.
POSTing vs Direct Input¶
The Warcserver is designed to map input requests to output responses, and it is possible to send input requests “directly”, eg:
GET /coll/resource?url=http://example.com/
Connection: close
or by “wrapping” the entire request in a POST request:
POST /coll/postreq/resource?url=http://example.com/
Content-Length: ...
...
GET /
Host: example.com
Connection: close
The “post request” (/postreq
endpoint) approach allows more accurately transmitting any HTTP request and headers in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The “wrapped HTTP request” is thus unwrapped and processed, allowing hop-by-hop headers like Connection: close
to be processed unaltered.
Index vs Resource Output¶
For any query, the Warcserver can return a matching index result, or the first available WARC record.
Within each collection and input type, the following endpoints are available:
/index
- perform index lookup/resource
- return a single WARC record for the first match of the index list.
For example, an index query might return the CDXJ index:
=> curl "http://localhost:8070/pywb/index?url=iana.org"
org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
While switching to resource
, the result might be:
=> curl "http://localhost:8070/pywb/index?url=iana.org
WARC/1.0
WARC-Type: response
...
The resource lookup attempts to load the first available record (eg. by loading from specified WARC). If the record indicated by first line CDXJ line is not available, the next CDXJ line is tried in succession, and so on, until one succeeds.
If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.
WARC Record HTTP Response¶
When using Warcserver, the entire WARC record is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
The following example illustrates what is transmitted when retrieving curl
-ing http://localhost:8070/pywb/index?url=iana.org
:
> GET /pywb/resource?url=iana.org HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Warcserver-Cdx: org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
< Link: <http://www.iana.org/>; rel="original"
< WARC-Target-URI: http://www.iana.org/
< Warcserver-Source-Coll: pywb:iana.cdx
< Content-Type: application/warc-record
< Memento-Datetime: Sun, 26 Jan 2014 20:06:24 GMT
< Content-Length: 6357
< Warcserver-Type: warc
< Date: Tue, 17 Oct 2017 00:32:12 GMT
< WARC/1.0
< WARC-Type: response
< WARC-Date: 2014-01-26T20:06:24Z
< WARC-Target-URI: http://www.iana.org/
< WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
...
The HTTP payload is the WARC record itself but HTTP headers returned “surface” additional information about the WARC record to make it easier for client to use the data.
- Memento Headers
Memento-Datetime
andLink
– The datetime is read from the WARC record, and the WARC record it itself a valid “memento” although full Memento compliance is not yet included. Warcserver-Cdx
header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in theindex
query)Warcserver-Source-Coll
header includes the source from which this record was loaded, corresponding tosource
field in the CDXJWarcserver-Type: warc
indicates that this is a Warcserver WARC record (may be removed in the future)
In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it. The Recorder component uses the source to determine if recording is necessary or should be skipped.
Warcserver Index Configuration¶
Warcserver supports several index source types, allow users to mix local and remote sources into a single collection or across multiple collections:
The sources include:
- Local File
- Local ZipNum File
- Live Web Proxy (implicit index)
- Redis sorted-set key
- Memento TimeGate Endpoint
- CDX Server API Endpoint
The index types can be defined using either shorthand sourcename+<url> notation or a long-form full property declaration
The following is an example of defining different special collections:
collections:
# Live Index
live: $live
# rhizome via memento (shorthand)
rhiz: memento+http://webenact.rhizome.org/all/
# rhizome via memento (equivalent full properties)
rhiz_long:
index:
type: memento
timegate_url: http://webenact.rhizome.org/all/{url}
timemap_url: http://webenact.rhizome.org/all/timemap/link/{url}
replay_url: http://webenact.rhizome.org/all/{timestamp}id_/{url}
Warcserver Index Aggregators¶
In addition to individual index types, Warcserver supports ‘index aggregators’, which represent not a single source but multiple index sources, explicit or implicit.
Some explicit aggregators are:
- Local Directory
- Redis Key Template (scan/lookup of multiple redis keys)
- A generic group of index sources looked up in parallel (best match)
The aggregators allow for a complex lookup chains to lookup of resources in dynamic directory structures, using Redis keys, and external web archives.
Note: Warcserver automatically includes a Local Directory aggregator pointing to the collections
directory, as
explained in the Configuring the Web Archive
Sample “Memento” Aggregator¶
For example, the following config defines the collection endpoint many_archives
to
lookup three remote archives, two using memento, and one using CDX Server API:
collections:
# many archives
many_archives:
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
timeout: 10
This allows Warcserver to serve as a “Memento Aggregator”, aggregating results from multiple existing archives (using the Memento API and other APIs).
An optional timeout
property configures how many seconds to wait for each source before
it is considered to have ‘timed out’. (If unspecified, the default value is 5 seconds).
Sequential Fallback Collections¶
It is also possible to define a “sequential” collection, where if one source/aggregator fails to produce a result, a “fallback” aggregator is tried, until there is a result:
collections:
# Sequence
web:
sequence:
-
index: ./local/indexes
resource: ./local/data
name: local
-
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
-
index: $live
name: live
In the above example, first the local archive is tried, if the resource could not be successfully loaded, then the group of 3 archives is tried, if they all fail to produce a successful response, the live web is tried. Note that successful response includes a successful index lookup + successful resource fetch – if an index contains results, but they can not be fetched, the next group in the sequence is tried.
The name
of each item is include in the CDXJ index in the source
field to allow the caller to identify
which archive source was used.
Adding Custom Index Sources¶
It should be easy to add a custom index source, by extending pywb.warcserver.index.indexsource.BaseIndexSource
class MyIndexSource(BaseIndexSource):
def load_index(self, params):
... lookup index data as needed to fill CDXObject
cdx = CDXObject()
cdx['url'] = ...
...
yield cdx
@classmethod
def init_from_string(cls, value):
if value == 'my-index-src':
return cls()
...
@classmethod
def init_from_config(cls, config):
if config['type'] != 'my-index-src':
return
# Register Index with Warcserver
register_source(MyIndexSource)
You can then use the index in a config.yaml
:
collections:
my-coll: my-index-src
For more information and definition of existing indexes, see pywb.warcserver.index.indexsource
Custom Warcserver Deployments¶
It is also possible to use Warcserver directly without the use of a config.yaml
file, for more complex
deployment scenarios. (Webrecorder uses a customized deployment).
For example, the following config.yaml
config:
collections:
live: $live
memento:
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: memento+http://web.archive.org/web/
local: ./collections/
could be initialized explicitly, using the pywb.warcserver.basewarcserver.BaseWarcServer
class
which does not use a YAML config
app = BaseWarcServer()
# /live endpoint
live_agg = SimpleAggregator({'live': LiveIndexSource()})
app.add_route('/live', DefaultResourceHandler(live_agg))
# /memento endpoint
sources = {'rhiz': MementoIndexSource.from_timegate_url('http://webenact.rhizome.org/vvork/'),
'ia': MementoIndexSource.from_timegate_url('http://web.archive.org/web/'),
'local': DirectoryIndexSource('./collections')
}
multi_agg = GeventTimeoutAggregator(sources)
app.add_route('/memento', DefaultResourceHandler(multi_agg))
For more examples on custom Warcserver usage, consult the Warcserver tests, such as those in pywb.warcserver.test.test_handlers.py