pywb.apps package

Submodules

pywb.apps.cli module

class pywb.apps.cli.BaseCli(args=None, default_port=8080, desc='')[source]

Bases: object

Base CLI class that provides the initial arg parser setup, calls load to receive the application to be started and starts the application.

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

run()[source]

Start the application

run_gevent()[source]

Creates the server that runs the application supplied by a subclass.
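The division of labour described for BaseCli (the base class sets up the argument parser, a subclass's load() supplies the application) can be sketched without pywb installed. Everything below is a hypothetical stand-in: SketchBaseCli, HelloCli, hello_app and the -p/--port flag are assumptions made for illustration, not pywb's actual code, and no server is started.

```python
import argparse


class SketchBaseCli:
    """Hypothetical stand-in for the BaseCli pattern; not pywb's code."""

    def __init__(self, args=None, default_port=8080, desc=''):
        parser = argparse.ArgumentParser(description=desc)
        # The -p/--port flag is an assumption made for this sketch.
        parser.add_argument('-p', '--port', type=int, default=default_port)
        parsed = parser.parse_args(args)
        self.port = parsed.port
        # The base class asks the subclass for the application to serve.
        self.application = self.load()

    def load(self):
        raise NotImplementedError('subclasses must return an application')


def hello_app(environ, start_response):
    """A minimal WSGI application used as the loaded app."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


class HelloCli(SketchBaseCli):
    def load(self):
        return hello_app


cli = HelloCli(args=['--port', '9000'])
```

A real run() would hand cli.application and cli.port to a server; that step is left out here so the sketch has no side effects.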

class pywb.apps.cli.LiveCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class for starting pywb’s replay server in live mode

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.ReplayCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class that adds the cli functionality specific to starting pywb’s Wayback Machine implementation

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.WarcServerCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.BaseCli

CLI class for starting a WarcServer

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

class pywb.apps.cli.WaybackCli(args=None, default_port=8080, desc='')[source]

Bases: pywb.apps.cli.ReplayCli

CLI class for starting pywb’s implementation of the Wayback Machine

load()[source]

This method is called to load the application. Subclasses must return an application that can be used by pywb.utils.geventserver.GeventServer.

pywb.apps.cli.get_version()[source]

Get the version of pywb

pywb.apps.cli.live_rewrite_server(args=None)[source]

Utility function for starting pywb’s Wayback Machine implementation in live mode

pywb.apps.cli.warcserver(args=None)[source]

Utility function for starting pywb’s WarcServer

pywb.apps.cli.wayback(args=None)[source]

Utility function for starting pywb’s Wayback Machine implementation

pywb.apps.frontendapp module

class pywb.apps.frontendapp.FrontEndApp(config_file=None, custom_config=None)[source]

Bases: object

Orchestrates pywb’s core Wayback Machine functionality and is comprised of 2 core sub-apps and 3 optional apps.

Sub-apps:
  • WarcServer: Serves the archive content (WARC/ARC and index), as well as content from the live web in record/proxy mode
  • RewriterApp: Rewrites the content served by pywb (if it is to be rewritten)
  • WSGIProxMiddleware (Optional): If proxy mode is enabled, performs pywb’s HTTP(s) proxy functionality
  • AutoIndexer (Optional): If auto-indexing is enabled for the collections, it is started here
  • RecorderApp (Optional): Recording functionality, available when recording mode is enabled

The RewriterApp class is configurable via the class variable REWRITER_APP_CLS, which defaults to RewriterApp.

ALL_DIGITS = re.compile('^\\d+$')
CDX_API = 'http://localhost:%s/{coll}/index'
PROXY_CA_NAME = 'pywb HTTPS Proxy CA'
PROXY_CA_PATH = 'proxy-certs/pywb-ca.pem'
RECORD_API = 'http://localhost:%s/%s/resource/postreq?param.recorder.coll={coll}'
RECORD_ROUTE = '/record'
RECORD_SERVER = 'http://localhost:%s'
REPLAY_API = 'http://localhost:%s/{coll}/resource/postreq'
REWRITER_APP_CLS

alias of pywb.apps.rewriterapp.RewriterApp
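The class-variable hook behind REWRITER_APP_CLS can be illustrated with stand-in classes. This is a minimal sketch of the general pattern, assuming the front-end instantiates whatever class the variable names; SketchRewriterApp, SketchFrontEndApp and the two subclasses are hypothetical and do not reproduce pywb's constructors.

```python
class SketchRewriterApp:
    """Hypothetical stand-in for the default rewriter class."""

    def __init__(self, framed_replay=False):
        self.framed_replay = framed_replay


class SketchFrontEndApp:
    # Subclasses override this class variable to swap the rewriter.
    REWRITER_APP_CLS = SketchRewriterApp

    def __init__(self):
        # Look the class up through self so subclass overrides are honoured.
        self.rewriterapp = self.REWRITER_APP_CLS(framed_replay=True)


class CustomRewriterApp(SketchRewriterApp):
    """A customised rewriter supplied by the embedding project."""


class CustomFrontEndApp(SketchFrontEndApp):
    REWRITER_APP_CLS = CustomRewriterApp


default_app = SketchFrontEndApp()
custom_app = CustomFrontEndApp()
```

The design choice is that no constructor argument is needed: overriding one class attribute in a subclass is enough to change which rewriter gets built.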

classmethod create_app(port)[source]

Create a new instance of FrontEndApp that listens on port with a hostname of 0.0.0.0

Parameters: port (int) – The port FrontEndApp is to listen on
Returns: A new instance of FrontEndApp wrapped in GeventServer
Return type: GeventServer

get_coll_config(coll)[source]

Retrieve the collection config, including metadata, associated with a collection

Parameters: coll (str) – The name of the collection to retrieve config info for
Returns: The collection’s config
Return type: dict

get_upstream_paths(port)[source]

Retrieve a dictionary containing the full URLs of the upstream apps

Parameters: port (int) – The port used by the replay and cdx servers
Returns: A dictionary containing the upstream paths (replay, cdx-server, record [if enabled])
Return type: dict[str, str]
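One plausible reading of how the URL templates listed above (CDX_API, REPLAY_API) turn into upstream paths is sketched here. The two-stage substitution (%s for the port first, {coll} for the collection later) is inferred from the template strings themselves; the function name sketch_upstream_paths and the collection name are assumptions, and the optional record entry is omitted.

```python
# Template strings copied from the class attributes documented above.
CDX_API = 'http://localhost:%s/{coll}/index'
REPLAY_API = 'http://localhost:%s/{coll}/resource/postreq'


def sketch_upstream_paths(port):
    """Fill in the port; the {coll} placeholder is left for later."""
    return {
        'replay': REPLAY_API % port,
        'cdx-server': CDX_API % port,
    }


paths = sketch_upstream_paths(8080)
# A per-request step could then substitute the collection name.
cdx_url = paths['cdx-server'].format(coll='my-coll')
```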
handle_request(environ, start_response)[source]

Retrieves the route handler and calls it, returning its response

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • start_response
Returns:

The WbResponse for the request

Return type:

WbResponse
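The look-up-then-call flow of handle_request can be sketched as a plain WSGI callable. This is an illustration only: the dict-based ROUTES table, the tuple returned by handlers, and the names sketch_handle_request and call_app are assumptions for the sketch and not how pywb resolves routes internally.

```python
from wsgiref.util import setup_testing_defaults


def serve_home(environ):
    # In this sketch a handler returns (status, headers, body iterable).
    return '200 OK', [('Content-Type', 'text/plain; charset=utf-8')], [b'home']


# Hypothetical route table; a plain dict keeps the sketch dependency-free.
ROUTES = {'/': serve_home}


def sketch_handle_request(environ, start_response):
    handler = ROUTES.get(environ.get('PATH_INFO', '/'))
    if handler is None:
        status = '404 Not Found'
        headers = [('Content-Type', 'text/plain; charset=utf-8')]
        body = [b'not found']
    else:
        status, headers, body = handler(environ)
    start_response(status, headers)
    return body


def call_app(path):
    """Drive the WSGI callable once and collect what it produced."""
    environ = {'PATH_INFO': path}
    setup_testing_defaults(environ)  # fills in the other required keys
    seen = {}

    def start_response(status, headers):
        seen['status'] = status

    body = b''.join(sketch_handle_request(environ, start_response))
    return seen['status'], body
```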

init_autoindex(auto_interval)[source]

Initialize and start the auto-indexing of the collections. If auto_interval is None, this is a no-op.

Parameters: auto_interval (str|int) – The auto-indexing interval from the configuration file or CLI argument

init_proxy(config)[source]

Initialize and start proxy mode. If the proxy configuration entry is not contained in the config, this is a no-op. Causes handler to become an instance of WSGIProxMiddleware.

Parameters: config (dict) – The configuration object used to configure this instance of FrontEndApp

init_recorder(recorder_config)[source]

Initialize the recording functionality of pywb. If recorder_config is None, this function is a no-op.

Parameters: recorder_config (str|dict|None) – The configuration for the recorder app
Return type: None

is_proxy_enabled(environ)[source]

Returns True or False, indicating whether proxy mode is enabled

Parameters: environ (dict) – The WSGI environment dictionary for the request
Returns: True if proxy mode is enabled, False otherwise
Return type: bool

is_valid_coll(coll)[source]

Determines if the collection name for a request is valid (exists)

Parameters: coll (str) – The name of the collection to check
Returns: True if the collection is valid, False otherwise
Return type: bool

proxy_fetch(env, url)[source]

Proxy-mode-only endpoint that handles OPTIONS requests and CORS fetches for Preservation Worker.

Due to normal cross-origin browser restrictions in proxy mode, the auto fetch worker cannot access the CSS rules of cross-origin style sheets and must re-fetch them in a manner that is CORS safe. This endpoint facilitates that by fetching the style sheets for the auto fetch worker and then responding with their contents.

Parameters:
  • env (dict) – The WSGI environment dictionary
  • url (str) – The URL of the resource to be fetched
Returns:

WbResponse that is either the response to an OPTIONS request or the result of fetching url

Return type:

WbResponse

proxy_route_request(url, environ)[source]

Return the full URL that this proxy request will be routed to. The ‘environ’ PATH_INFO and REQUEST_URI will be modified based on the returned URL.

Default is to use the ‘proxy_prefix’ to point to the proxy collection

put_custom_record(environ, coll='$root')[source]

When recording, PUT a custom WARC record to the specified collection (Available only when recording)

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from

raise_not_found(environ, err_type, url)[source]

Utility function for raising a werkzeug.exceptions.NotFound exception with the supplied WSGI environment and message.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • err_type (str) – The identifier for the type of error that occurred
  • url (str) – The URL of the archived page that was requested

serve_cdx(environ, coll='$root')[source]

Make the upstream CDX query for a collection and respond with the results of the query

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection this CDX query is for
Returns:

The WbResponse containing the results of the CDX query

Return type:

WbResponse

serve_coll_page(environ, coll='$root')[source]

Render and serve a collection’s search page (search.html).

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection to serve the search page for
Returns:

The WbResponse containing the collection’s search page

Return type:

WbResponse

serve_content(environ, coll='$root', url='', timemap_output='', record=False)[source]

Serve the contents of a URL/record, rewriting the contents of the response when applicable.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • url (str) – The URL for the corresponding record to be served if it exists
  • timemap_output (str) – The contents of the timemap included in the link header of the response
  • record (bool) – Whether the content being served should be recorded (saved to a WARC). Only valid in record mode
Returns:

WbResponse containing the contents of the record/URL

Return type:

WbResponse

serve_home(environ)[source]

Serves the home (/) view of pywb (not a collection)

Parameters: environ (dict) – The WSGI environment dictionary for the request
Returns: The WbResponse for serving the home (/) path
Return type: WbResponse

serve_listing(environ)[source]

Serves the response for WARCServer fixed and dynamic listing (paths)

Parameters: environ (dict) – The WSGI environment dictionary for the request
Returns: WbResponse containing the frontend app’s WARCServer URL paths
Return type: WbResponse

serve_record(environ, coll='$root', url='')[source]

Serve a URL’s content from a WARC/ARC record in replay mode or from the live web in live, proxy, and record mode.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • url (str) – The URL for the corresponding record to be served if it exists
Returns:

WbResponse containing the contents of the record/URL

Return type:

WbResponse

serve_static(environ, coll='', filepath='')[source]

Serve a static file associated with a specific collection or one of pywb’s own static assets

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The collection the static file is associated with
  • filepath (str) – The file path (relative to the collection) for the static asset
Returns:

The WbResponse for the static asset

Return type:

WbResponse

setup_paths(environ, coll, record=False)[source]

Populates the WSGI environment dictionary with the path information needed to build a response for content or record.

Parameters:
  • environ (dict) – The WSGI environment dictionary for the request
  • coll (str) – The name of the collection the record is to be served from
  • record (bool) – Whether the content being served should be recorded (saved to a WARC). Only valid in record mode

class pywb.apps.frontendapp.MetadataCache(template_str)[source]

Bases: object

This class holds the collection metadata template string and caches the metadata for a collection once it has been rendered. Cached metadata is updated if its corresponding file has been modified since the last cache time (based on the file’s mtime).

get_all(routes)[source]

Load the metadata for all routes (collections) and populate the cache

Parameters: routes (list[str]) – List of collection names
Returns: A dictionary containing each collection’s metadata
Return type: dict

load(coll)[source]

Load and return the metadata associated with a collection.

If the metadata for the collection is not cached yet, its metadata file is read in and stored. If the cache has seen the collection before, the mtime of the metadata file is checked: if it is more recent than the cached time, the cache is updated and the new metadata returned; otherwise the cached version is returned.

Parameters: coll (str) – Name of a collection
Returns: The cached metadata for a collection
Return type: dict
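The mtime check that load describes can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: SketchMetadataCache is hypothetical, it takes the file path as an explicit argument, and it reads JSON rather than whatever format pywb's metadata files use.

```python
import json
import os
import tempfile


class SketchMetadataCache:
    """Hypothetical mtime-based cache; not pywb's MetadataCache."""

    def __init__(self):
        self.cache = {}  # coll -> (mtime at cache time, parsed metadata)

    def load(self, coll, path):
        mtime = os.path.getmtime(path)
        cached = self.cache.get(coll)
        if cached is not None and mtime <= cached[0]:
            return cached[1]  # file unchanged: reuse the cached value
        return self.store_new(coll, path, mtime)

    def store_new(self, coll, path, mtime):
        with open(path, encoding='utf-8') as fh:
            metadata = json.load(fh)
        self.cache[coll] = (mtime, metadata)
        return metadata


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'metadata.json')
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump({'title': 'first'}, fh)

    cache = SketchMetadataCache()
    first = cache.load('demo', path)
    again = cache.load('demo', path)  # served from the cache

    with open(path, 'w', encoding='utf-8') as fh:
        json.dump({'title': 'second'}, fh)
    # Push the mtime forward explicitly so the change is visible even on
    # filesystems with coarse timestamp resolution.
    newer = os.path.getmtime(path) + 10
    os.utime(path, (newer, newer))
    updated = cache.load('demo', path)
```

Comparing mtimes costs one stat call per lookup, which is usually far cheaper than re-reading and re-parsing the file on every request.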
store_new(coll, path, mtime)[source]

Load a collection’s metadata file and store it

Parameters:
  • coll (str) – The name of the collection the metadata is for
  • path (str) – The path to the collection’s metadata file
  • mtime (float) – The current mtime of the collection’s metadata file
Returns:

The collection’s metadata

Return type:

dict

pywb.apps.live module

pywb.apps.rewriterapp module

class pywb.apps.rewriterapp.RewriterApp(framed_replay=False, jinja_env=None, config=None, paths=None)[source]

Bases: object

Primary application for rewriting the content served by pywb (if it is to be rewritten).

This class is also responsible for rendering the archive’s templates

DEFAULT_CSP = "default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'"
VIDEO_INFO_CONTENT_TYPE = 'application/vnd.youtube-dl_formats+json'
add_csp_header(wb_url, status_headers)[source]

Adds Content-Security-Policy headers to the supplied StatusAndHeaders instance if the wb_url’s mod is equal to the replay mod

Parameters:
  • wb_url (WbUrl) – The WbUrl for the URL being operated on
  • status_headers (warcio.StatusAndHeaders) – The status and headers instance for the reply to the URL
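The conditional described for add_csp_header can be sketched on a plain list of header tuples. The DEFAULT_CSP value is copied from the class attribute above; the function name sketch_add_csp_header, the use of a list instead of a StatusAndHeaders instance, and the modifier values 'mp_' and 'id_' are assumptions made for illustration.

```python
# Value copied from RewriterApp.DEFAULT_CSP as documented above.
DEFAULT_CSP = ("default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: "
               "mediastream: ws: wss: ; form-action 'self'")


def sketch_add_csp_header(mod, headers, replay_mod='mp_', csp=DEFAULT_CSP):
    """Append the CSP header only when the URL modifier is the replay one.

    The 'mp_' default for replay_mod is an assumption of this sketch.
    """
    if csp and mod == replay_mod:
        headers.append(('Content-Security-Policy', csp))
    return headers


replay_headers = sketch_add_csp_header('mp_', [('Content-Type', 'text/html')])
other_headers = sketch_add_csp_header('id_', [('Content-Type', 'text/html')])
```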

do_query(wb_url, kwargs)[source]

Performs the timemap query request for the supplied WbUrl returning the response

Parameters:
  • wb_url (WbUrl) – The WbUrl to be queried
  • kwargs (dict) – Optional keyword arguments
Returns:

The query’s response

Return type:

requests.Response

format_response(response, wb_url, full_prefix, is_timegate, is_proxy, timegate_closest_ts=None)[source]
get_base_url(wb_url, kwargs)[source]
get_full_prefix(environ)[source]
get_host_prefix(environ)[source]
get_rel_prefix(environ)[source]
get_top_frame_params(wb_url, kwargs)[source]
get_top_url(full_prefix, wb_url, cdx, kwargs)[source]
get_upstream_url(wb_url, kwargs, params)[source]
handle_custom_response(environ, wb_url, full_prefix, host_prefix, kwargs)[source]
handle_error(environ, wbe)[source]
handle_query(environ, wb_url, kwargs, full_prefix)[source]
handle_timemap(wb_url, kwargs, full_prefix)[source]
is_ajax(environ)[source]
is_framed_replay(wb_url)[source]

Returns True or False, indicating whether the rewriter app is configured to operate in framed replay mode and the supplied WbUrl is also operating in framed replay mode

Parameters: wb_url (WbUrl) – The WbUrl instance to check
Returns: True if in framed replay mode, False otherwise
Return type: bool

is_preflight(environ)[source]
make_timemap(wb_url, res, full_prefix, output)[source]
prepare_env(environ)[source]

Set up the environ path prefixes and scheme

render_content(wb_url, kwargs, environ)[source]
send_redirect(new_path, url_parts, urlrewriter)[source]
unrewrite_referrer(environ, full_prefix)[source]

pywb.apps.static_handler module

class pywb.apps.static_handler.StaticHandler(static_path)[source]

Bases: object

pywb.apps.warcserverapp module

pywb.apps.wayback module

pywb.apps.wbrequestresponse module

class pywb.apps.wbrequestresponse.WbResponse(status_headers, value=None, **kwargs)[source]

Bases: object

Represents a pywb WSGI response object.

Holds a status_headers object and a response iterator, to be returned to the WSGI container.
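The idea of an object that carries a status, headers and a body iterable, and hands them to the WSGI container when called, can be sketched as follows. SketchResponse is a hypothetical stand-in that stores a status string and a header list directly; it is not WbResponse and does not use warcio.

```python
class SketchResponse:
    """Hypothetical WSGI response holder; not pywb's WbResponse."""

    def __init__(self, status, headers, body=None):
        self.status = status
        self.headers = headers
        self.body = body if body is not None else []

    @staticmethod
    def text_response(text, status='200 OK',
                      content_type='text/plain; charset=utf-8'):
        encoded = text.encode('utf-8')
        headers = [('Content-Type', content_type),
                   ('Content-Length', str(len(encoded)))]
        return SketchResponse(status, headers, [encoded])

    def __call__(self, environ, start_response):
        # Hand status and headers to the container, then return the body.
        start_response(self.status, self.headers)
        return self.body


resp = SketchResponse.text_response('hi')
captured = []
body = resp({}, lambda status, headers: captured.append((status, headers)))
```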

add_access_control_headers(env=None)[source]

Adds Access-Control* HTTP headers to this WbResponse’s HTTP headers.

Parameters: env (dict) – The WSGI environment dictionary
Returns: The same WbResponse but with the values for the Access-Control* HTTP headers added
Return type: WbResponse
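As a rough illustration of deriving Access-Control* headers from a WSGI environment, the sketch below echoes the request's Origin and any requested headers. Which headers and values pywb actually sets is not stated here; the function name, the fixed method list and the '*' fallback are assumptions, chosen from standard CORS response headers.

```python
def sketch_access_control_headers(env=None):
    """Build a list of CORS response headers from a WSGI environ dict."""
    env = env or {}
    headers = [
        # Echo the caller's origin when present, otherwise allow any origin.
        ('Access-Control-Allow-Origin', env.get('HTTP_ORIGIN', '*')),
        # The method list is an assumption made for this sketch.
        ('Access-Control-Allow-Methods', 'GET, POST, OPTIONS'),
    ]
    requested = env.get('HTTP_ACCESS_CONTROL_REQUEST_HEADERS')
    if requested is not None:
        headers.append(('Access-Control-Allow-Headers', requested))
    return headers


anonymous = sketch_access_control_headers()
preflight = sketch_access_control_headers({
    'HTTP_ORIGIN': 'http://example.com',
    'HTTP_ACCESS_CONTROL_REQUEST_HEADERS': 'X-Custom',
})
```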
add_range(*args)[source]

Add HTTP range header values to this response

Parameters: args (int) – The values for the range HTTP header
Returns: The same WbResponse but with the values for the range HTTP header added
Return type: WbResponse
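The documentation does not say what the integer arguments mean. One common arrangement, assumed here purely for illustration, is (start, part_len, total_len), from which a standard Content-Range value can be computed. sketch_add_range is a hypothetical helper working on a plain header list.

```python
def sketch_add_range(headers, start, part_len, total_len):
    """Return a 206 status and headers describing a byte range.

    The (start, part_len, total_len) argument order is an assumption.
    """
    end = start + part_len - 1  # Content-Range uses an inclusive end offset
    headers.append(('Content-Range',
                    'bytes {0}-{1}/{2}'.format(start, end, total_len)))
    headers.append(('Accept-Ranges', 'bytes'))
    return '206 Partial Content', headers


status, range_headers = sketch_add_range([], 0, 100, 1000)
```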
static bin_stream(stream, content_type, status='200 OK', headers=None)[source]

Utility method for constructing a binary response.

Parameters:
  • stream (Any) – The response body stream
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
  • headers (list[tuple[str, str]]) – Additional headers for this response
Returns:

WbResponse that is a binary stream

Return type:

WbResponse

static encode_stream(stream)[source]

Utility method to encode a stream using utf-8.

Parameters: stream (Any) – The stream to be encoded using utf-8
Returns: A generator that yields the contents of the stream encoded as utf-8
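A generator of this shape is short enough to sketch in full. The version below assumes the stream is an iterable of str chunks; sketch_encode_stream is a hypothetical name, not the library function.

```python
def sketch_encode_stream(stream):
    """Lazily yield each text chunk of the stream encoded as UTF-8 bytes."""
    for chunk in stream:
        yield chunk.encode('utf-8')


encoded = list(sketch_encode_stream(['a', '\u00e9']))
```

Because it is a generator, nothing is encoded until the WSGI container starts iterating over the body.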
static json_response(obj, status='200 OK', content_type='application/json; charset=utf-8')[source]

Utility method for constructing a JSON response.

Parameters:
  • obj (dict) – The dictionary to be serialized in JSON format
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse JSON response

Return type:

WbResponse
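What a JSON response constructor typically has to do (serialize, encode, set the content headers) can be sketched as below. The defaults mirror the signature documented above; the function name and the returned (status, headers, body) tuple are assumptions standing in for a WbResponse.

```python
import json


def sketch_json_response(obj, status='200 OK',
                         content_type='application/json; charset=utf-8'):
    """Serialize obj and return (status, headers, body) for a WSGI reply."""
    encoded = json.dumps(obj).encode('utf-8')
    headers = [('Content-Type', content_type),
               ('Content-Length', str(len(encoded)))]
    return status, headers, [encoded]


json_status, json_headers, json_body = sketch_json_response({'coll': 'demo'})
```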

static options_response(env)[source]

Construct a WbResponse for an OPTIONS request based on the WSGI env dictionary

Parameters: env (dict) – The WSGI environment dictionary
Returns: The WbResponse for the OPTIONS request
Return type: WbResponse

static redir_response(location, status='302 Redirect', headers=None)[source]

Utility method for constructing a redirection response.

Parameters:
  • location (str) – The location of the resource being redirected to
  • status (str) – The HTTP status line
  • headers (list[tuple[str, str]]) – Additional headers for this response
Returns:

WbResponse redirection response

Return type:

WbResponse
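A redirect helper with this signature plausibly reduces to setting a Location header and an empty body. The sketch below assumes exactly that; sketch_redir_response and its tuple return value are hypothetical, while the '302 Redirect' default is taken from the signature above.

```python
def sketch_redir_response(location, status='302 Redirect', headers=None):
    """Return (status, headers, body) for a redirect to location."""
    redir_headers = [('Location', location), ('Content-Length', '0')]
    if headers:
        # Caller-supplied headers are appended after the redirect headers.
        redir_headers += headers
    return status, redir_headers, []


redir_status, redir_headers, redir_body = sketch_redir_response(
    '/coll/2020/http://example.com/', headers=[('Vary', 'accept-datetime')])
```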

static text_response(text, status='200 OK', content_type='text/plain; charset=utf-8')[source]

Utility method for constructing a text response.

Parameters:
  • text (str) – The text response body
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse text response

Return type:

WbResponse

static text_stream(stream, content_type='text/plain; charset=utf-8', status='200 OK')[source]

Utility method for constructing a streaming text response.

Parameters:
  • stream (Any) – The response body stream
  • content_type (str) – The content-type of the response
  • status (str) – The HTTP status line
Returns:

WbResponse that is a text stream

Return type:

WbResponse

try_fix_errors()[source]

Utility method to try to remove faulty headers from the response.

Return type: None

Module contents