Access Control System
The access control system allows flexible configuration of rules to allow, block, or exclude access to individual URLs by longest-prefix match.
Access Control Files (.aclj)
Access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order. To determine the best match, a binary search is used (similar to a CDXJ lookup), and then the best match is found by scanning forward.
An .aclj file may look as follows:
org,httpbin)/anything/something - {"access": "allow", "url": "http://httpbin.org/anything/something"}
org,httpbin)/anything - {"access": "exclude", "url": "http://httpbin.org/anything"}
org,httpbin)/ - {"access": "block", "url": "httpbin.org/"}
com, - {"access": "allow", "url": "com,"}
Each JSON entry contains an access field and the original url field that was used to convert to the SURT (if any).
The prefix consists of a SURT key and a - (currently reserved for a timestamp/date range field to be added later).
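To make the line format concrete, here is a minimal sketch of how a URL could be turned into a SURT-style key and an .aclj line. The helper names simple_surt and aclj_line are hypothetical, and the conversion is deliberately simplified (lowercasing, reversing the host labels, keeping the path); pywb's real canonicalization handles many more cases.

```python
import json
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key (hypothetical helper, not pywb's canonicalizer)."""
    # Add a scheme when missing so urlsplit can locate the host.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels: httpbin.org -> org,httpbin
    key = ",".join(reversed(host.split("."))) + ")"
    # Append the path, defaulting to "/" when the URL has none.
    return key + (parts.path or "/").lower()


def aclj_line(url, access):
    """Build one line in the layout shown above: <SURT key> - <JSON entry>."""
    entry = {"access": access, "url": url}
    return simple_surt(url) + " - " + json.dumps(entry)


print(aclj_line("http://httpbin.org/anything/something", "allow"))
```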
Given these rules, a user would:
* be allowed to visit http://httpbin.org/anything/something (allow)
* receive an ‘access blocked’ error message when viewing http://httpbin.org/ (block)
* receive a 404 not found error when viewing http://httpbin.org/anything (exclude)
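The lookup described above (reverse-sorted keys, a binary search, then a forward scan) can be sketched as follows. This is an illustration under simplifying assumptions, not pywb's implementation: the rules live in an in-memory list, the function names are hypothetical, and the fallback value of "allow" is passed in explicitly.

```python
# Rules in reverse alphabetical order, mirroring the example file above.
RULES = [
    ("org,httpbin)/anything/something", "allow"),
    ("org,httpbin)/anything", "exclude"),
    ("org,httpbin)/", "block"),
    ("com,", "allow"),
]


def first_not_greater(rules, target):
    """Binary search a descending list for the first key <= target."""
    lo, hi = 0, len(rules)
    while lo < hi:
        mid = (lo + hi) // 2
        if rules[mid][0] > target:
            lo = mid + 1  # still among keys that sort after the target
        else:
            hi = mid
    return lo


def find_access(rules, target, default="allow"):
    """Return the access value of the longest rule key that prefixes target."""
    start = first_not_greater(rules, target)
    # In descending order a longer prefix sorts before a shorter one,
    # so the first prefix found while scanning forward is the longest.
    for key, access in rules[start:]:
        if target.startswith(key):
            return access
    return default


for surt_key in ("org,httpbin)/anything/something",
                 "org,httpbin)/anything",
                 "org,httpbin)/"):
    print(surt_key, "->", find_access(RULES, surt_key))
```

The forward scan here runs to the end of the list for simplicity; a real implementation would stop as soon as no further prefix can match.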
Access Types: allow, block, exclude
The available access types are as follows:
exclude - when matched, results are excluded from the index, as if they do not exist. The user will receive a 404.
block - when matched, results are not excluded from the index and are marked with access: block, but access to the actual content is blocked. The user will see a 451.
allow - full access to the index and the resource.
The difference between exclude and block is that when blocked, the user can be notified that access is blocked, while with exclude, no trace of the resource is presented to the user.
The allow type is useful for providing access to more specific resources within a broader block/exclude rule.
Access Error Messages
The special error code 451 is used to indicate that a resource has been blocked (access setting block).
The error.html template contains a special message for this access and can be customized further.
By design, resources that are excluded (access setting exclude) simply appear as 404 not found, and no special error is provided.
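As a small illustration of the mapping described above, the following sketch translates an access value into the error status the user would see. The function name is hypothetical, and returning None for allow is an assumption of this sketch, meaning only that no access error is raised.

```python
def access_error_status(access):
    """Map an access value to an HTTP error status, or None when access is allowed."""
    if access == "exclude":
        return 404  # appears as an ordinary 'not found'
    if access == "block":
        return 451  # user is told that access is blocked
    if access == "allow":
        return None  # no access error; the resource is served normally
    raise ValueError("unknown access type: " + repr(access))


print(access_error_status("block"))
```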
Managing Access Lists via Command-Line
The .aclj files need not ever be added or edited manually.
The pywb wb-manager utility has been extended to provide tools for adding, removing and checking access control rules.
For automatic collections, the access rules for a given collection <collection> are written to <collection>/acl/access-rules.aclj.
For example, to add an exclude rule for the URL from the first line of the example above, one could run:
wb-manager acl add <collection> http://httpbin.org/anything/something exclude
The value supplied can be a URL or a SURT prefix. If a SURT is supplied, it is used as is:
wb-manager acl add <collection> com, allow
By default, access control rules apply to a prefix of a given URL or SURT.
To have the rule apply only to the exact match, use:
wb-manager acl add <collection> http://httpbin.org/anything/something allow --exact-match
Rules added with and without the --exact-match flag are considered distinct rules, and can be added and removed separately.
With the above rules, http://httpbin.org/anything/something would be allowed, but http://httpbin.org/anything/something/subpath would be excluded for any subpath.
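The interaction between prefix rules and exact-match rules can be sketched as follows. The (key, exact, access) tuples are purely illustrative and are not how pywb stores exact-match rules; the tie-break (an exact rule wins over a prefix rule for the same key) is an assumption chosen to reproduce the behavior described above. A linear scan is used for brevity.

```python
def resolve(rules, target, default="allow"):
    """Pick the best rule for target from (key, exact, access) tuples."""
    best = None
    for key, exact, access in rules:
        # Exact rules apply only to the identical key; prefix rules to any extension.
        matches = (target == key) if exact else target.startswith(key)
        if not matches:
            continue
        # Prefer the longest key; at equal length, prefer the exact rule.
        rank = (len(key), exact)
        if best is None or rank > best[0]:
            best = (rank, access)
    return best[1] if best else default


rules = [
    ("org,httpbin)/anything/something", False, "exclude"),  # prefix rule
    ("org,httpbin)/anything/something", True, "allow"),     # exact-match rule
]
print(resolve(rules, "org,httpbin)/anything/something"))
print(resolve(rules, "org,httpbin)/anything/something/subpath"))
```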
To remove a rule, one can run:
wb-manager acl remove <collection> http://httpbin.org/anything/something
To import rules in bulk, such as from an OpenWayback-style excludes.txt, and mark them as exclude:
wb-manager acl importtxt <collection> ./excludes.txt exclude
See wb-manager acl -h for a list of additional commands, such as for validating rules files and running a match against an existing rule set.
Access Controls for Custom Collections
For manually configured collections, there are additional options for configuring access controls.
The access control files can be specified explicitly using the acl_paths key, which allows specifying multiple ACL files and sharing access control files between different collections.
Single ACLJ:
collections:
  test:
    acl_paths: ./path/to/file.aclj
    default_access: block
Multiple ACLJ:
collections:
  test:
    acl_paths:
      - ./path/to/allows.aclj
      - ./path/to/blocks.aclj
      - ./path/to/other.aclj
      - ./path/to/directory
    default_access: block
The acl_paths value can be a single entry or a list, and can also include directories. If a directory is specified, all .aclj files in the directory are checked.
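One way to picture how such an acl_paths value expands into a list of files is the sketch below. The function name is hypothetical, and treating directories non-recursively is an assumption of this sketch rather than documented behavior.

```python
from pathlib import Path


def expand_acl_paths(acl_paths):
    """Expand a single path or a list of paths into individual ACL file paths."""
    # A single string entry is treated as a one-element list.
    if isinstance(acl_paths, str):
        acl_paths = [acl_paths]
    files = []
    for entry in acl_paths:
        path = Path(entry)
        if path.is_dir():
            # Assumption: only .aclj files directly inside the directory are used.
            files.extend(sorted(path.glob("*.aclj")))
        else:
            files.append(path)
    return files


print(expand_acl_paths("./path/to/file.aclj"))
```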
When finding the best rule from multiple .aclj files, each file is binary searched and the result sets are merge-sorted to find the best match (very similar to the CDXJ index lookup).
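A minimal sketch of this multi-file lookup, assuming each file has already been loaded into a descending list of (key, access) tuples: each list is binary searched, and the remaining entries are merged with heapq.merge before scanning for the first matching prefix. The function names and in-memory representation are hypothetical, not pywb's implementation.

```python
import heapq


def tail_from(entries, target):
    """Binary search a descending list; return the entries whose key <= target."""
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid][0] > target:
            lo = mid + 1
        else:
            hi = mid
    return entries[lo:]


def best_match(sources, target, default="allow"):
    """Merge the per-file results in descending order and return the first prefix match."""
    tails = [tail_from(entries, target) for entries in sources]
    for key, access in heapq.merge(*tails, reverse=True):
        if target.startswith(key):
            return access
    return default


# Two hypothetical rule sets, each sorted in reverse alphabetical order.
allows = [("org,httpbin)/anything/something", "allow"), ("com,", "allow")]
blocks = [("org,httpbin)/anything", "exclude"), ("org,httpbin)/", "block")]

print(best_match([allows, blocks], "org,httpbin)/anything/else"))
```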
Note: It might make sense to separate allows.aclj and blocks.aclj into individual files for organizational reasons, but there is no specific need to keep more than one access control file.
Default Access
An additional default_access setting can be added to specify the default rule for custom collections if no other rules match. If omitted, this setting is default_access: allow, which is usually the desired default.
Setting default_access: block and providing a list of allow rules provides a flexible way to allow access to only a limited set of resources, and block access to anything out of scope by default.
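The allow-list pattern can be illustrated with a short sketch: only keys covered by an allow rule are accessible, and everything else falls back to the default. The function name and the dictionary representation of rules are assumptions made for this example.

```python
def effective_access(rules, surt_key, default_access="allow"):
    """Return the access of the longest matching prefix rule, else the default."""
    matches = [key for key in rules if surt_key.startswith(key)]
    if not matches:
        # No rule applies, so the collection-wide default decides.
        return default_access
    return rules[max(matches, key=len)]


# A hypothetical allow-list: only this subtree is in scope.
allow_list = {"org,httpbin)/anything": "allow"}

print(effective_access(allow_list, "org,httpbin)/anything/x", default_access="block"))
print(effective_access(allow_list, "org,example)/", default_access="block"))
```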