From Adpnwiki
Revision as of 14:29, 29 October 2013

LOCKSS Software


File System

Primary file system locations for LOCKSS content.

  • /etc/lockss
  • /home/lockss
  • /usr/share/lockss
  • /var/log/lockss

In the file /etc/lockss/config.dat, the LOCKSS_DISK_PATHS variable enumerates the available disks.

  • /cache0/gamma
  • /cache1/gamma
  • /cache2/gamma
  • /cache3/gamma
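
A config.dat fragment enumerating those disks might look like the following; the exact value is an assumption, so consult the config.dat on your own node.

```shell
# Hypothetical /etc/lockss/config.dat fragment; the real file on your node may
# list different mount points. Disk paths are separated by semicolons.
LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma;/cache2/gamma;/cache3/gamma"
```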

File System Deep Dive

Access URL http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif

Cache storage /cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/

Where are the bytes of this tif file?


The bytes live in the #content subdirectory of that cache path, in a file named current. current is a raw, byte-for-byte copy of the harvested content: if the content is binary, the file is binary; if the content is text, the file is text. To understand what type of file current is, examine current.props.

/cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/#content/current.props describes the bytes (HTTP headers and LOCKSS-specific headers).
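
As a sketch, current.props is a Java properties file of header values; the key names and values below are illustrative assumptions, not copied from a real node.

```
# Illustrative current.props contents (hypothetical keys and values)
content-type=image/tiff
last-modified=Mon, 29 Oct 2012 14:29:00 GMT
content-length=52430
```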



Each revision is stored with a full copy of the bytes. When a newer version of the content is fetched, the existing current and current.props are renamed to 1 and 1.props (then 2 and 2.props, and so on). Any revision can always be retrieved.
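
The renaming can be sketched in the shell (scratch paths only; this mimics the layout, it is not LOCKSS code):

```shell
# Simulate the repository layout before and after a second revision arrives.
dir="$(mktemp -d)/#content"
mkdir -p "$dir"
echo "old bytes"   > "$dir/current"
echo "old headers" > "$dir/current.props"
# A changed copy is fetched: the old pair is renamed to version 1 ...
mv "$dir/current"       "$dir/1"
mv "$dir/current.props" "$dir/1.props"
# ... and the new bytes become current.
echo "new bytes"   > "$dir/current"
echo "new headers" > "$dir/current.props"
```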


See #Content Crawler and Revisions to understand how the LOCKSS Web crawler handles files that are identical but differ in metadata.

Starting and Stopping the LOCKSS daemon

  • /etc/init.d/lockss start
  • /etc/init.d/lockss stop
  • /etc/init.d/lockss restart

Log Rotate

Logs are in /var/log/lockss

logrotate /etc/logrotate.d/lockss

Configuration Files

Primary Configuration files

  • http://props.lockss.org:8001/adpn/lockss.xml
  • /etc/lockss/config.dat
  • /home/lockss/local.txt
  • /cache0/gamma/config/expert_config.txt
  • /cache0/gamma/config/au.txt

Additional Configuration files

  • /cache0/gamma/config/ui_ip_access.txt
  • /cache0/gamma/config/proxy_ip_access.txt
  • /cache0/gamma/config/content_servers_config.txt

Notable configuration parameters are listed below.

Debug options from Logger.java

  • info -- the default level, generally
  • debug -- debugging messages, sparse
  • debug1 -- debugging messages
  • debug2 -- detailed debugging that would not produce an unreasonable amount of output if it were enabled system-wide
  • debug3 -- debugging messages that produce more output than would be reasonable if this level were enabled system-wide (e.g. messages in inner loops, or per-file, per-hash step, etc.)

Examples of log-level settings in the Expert Config file

org.lockss.log.BaseCrawler.level = info
org.lockss.log.CrawlerImpl.level = info
org.lockss.log.BlockTally.level = info
org.lockss.log.V3PollerStatus.level = debug
org.lockss.log.PlatformInfo.level = debug2

User Configuration

Enable accounts administration to add multiple users with specific permissions.

org.lockss.accounts.enabled = [false]

User Roles

  • User Admin Role -- user may configure admin access (add/delete/modify users, set admin access list)
  • Content Admin Role -- user may configure content access (set content access list)
  • AU Admin Role -- user may change AU configuration (add/delete content)
  • Access Content Role -- user may access content
  • Debug Role

A user with no explicitly assigned roles has read access to the daemon status tables.

The user authentication type parameter chooses between Basic (Web server HTTP auth) and Form authentication. Form authentication provides a sign-out button.

org.lockss.accounts.policy = [(null)] [Basic,Form,SSL,LC]

Title List

The default title list of available AUs is provided by LOCKSS in http://props.lockss.org:8001/adpn/lockss.xml. The Add/Remove title lists in the Web admin are populated from titledb.xml.

<property name="titleDbs">

titledb.xml uses the same element name at different nesting levels.

 <property name="org.lockss.titleSet">
  <property name="Birmingham Public Library">
   <property name="name" value="All Birmingham Public Library AUs" />
   <property name="class" value="xpath" />
   <property name="xpath" value="[attributes/publisher='Birmingham Public Library']" />
  </property>
 </property>
 <property name="org.lockss.title">
  <property name="BirminghamPublicLibraryBasePluginBirminghamPublicLibraryCartographyCollectionMaps000400000599">
   <property name="attributes.publisher" value="Birmingham Public Library" />
   <property name="journalTitle" value="Birmingham Public Library Cartography Collection" />
   <property name="type" value="journal" />
   <property name="title" value="Birmingham Public Library Cartography Collection: Maps (000400-000599)" />
   <property name="plugin" value="org.bplonline.adpn.BirminghamPublicLibraryBasePlugin" />
   <property name="param.1">
    <property name="key" value="base_url" />
    <property name="value" value="http://bpldb.bplonline.org/adpn/load/" />
   </property>
   <property name="param.2">
    <property name="key" value="group" />
    <property name="value" value="Cartography" />
   </property>
   <property name="param.3">
    <property name="key" value="collection" />
    <property name="value" value="000400-000599" />
   </property>
  </property>
 </property>

Partitioning Cache Data

It is possible to override the default configuration using the local parameter org.lockss.titleDbs. This enables networks to create a modified title list customized to each node. Intelligent data partitions can be defined and redefined centrally.

org.lockss.titleDbs = http://bpldb.bplonline.org/etc/adpn/titledb-local.xml

See Partitioning Cache Data for a more complete discussion. In brief, data partitioning requires parsing of the LOCKSS title list, persistent storage for definitions of responsibility for the distributed data, and mechanisms to redefine and modify responsibilities when the network topology changes.

Other Parameters

Parameters to examine further

org.lockss.poll.v3.enableDiscovery   [true] 
Comment:	If true, enable the discovery mechanism that attempts to invite peers from outside our Initial Peer List into polls.
Used in:	 org.lockss.poller.v3.V3Poller

org.lockss.poll.v3.enableSymmetricPolls   [false] 
Comment:	If true, can request a symmetric poll
Used in:	 org.lockss.poller.v3.V3Voter

org.lockss.poll.v3.keepUrlLists   [false] 
Comment:	If true, lists of AGREE/DISAGREE/VOTER_ONLY/POLLER_ONLY URLs will be kept.
Used in:	 org.lockss.poller.v3.VoteBlocksTallier

org.lockss.poll.v3.minPercentAgreementForRepairs   [0.5] 
Comment:	The minimum percent agreement required before we're willing to serve repairs, if using per-AU agreement.
Used in:	 org.lockss.poller.v3.V3Voter

org.lockss.poll.v3.repairFromPublisherWhenTooClose   [false] 
Used in:	 org.lockss.poller.v3.V3Poller

org.lockss.subscription.enabled   [false] 
Comment:	Indication of whether the subscription subsystem should be enabled. Defaults to false. Changes require daemon restart.
Used in:	 org.lockss.subscription.SubscriptionManager

PostgreSQL options for replacing the default Derby datasource

org.lockss.dbManager.datasource.className   [org.apache.derby.jdbc.EmbeddedDataSource] 
Comment:	Name of the database datasource class. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.createDatabase   [create] 
Comment:	Name of the database create. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.databaseName   [db/DbManager] 
Comment:	Name of the database with the relative path to the DB directory. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.password   [insecure] 
Comment:	Name of the existing database password. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.portNumber   [1527] 
Comment:	Port number of the database. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.serverName   [localhost] 
Comment:	Name of the server. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

org.lockss.dbManager.datasource.user   [LOCKSS] 
Comment:	Name of the database user. Changes require daemon restart.
Used in:	 org.lockss.db.DbManager

Web Crawler

There are two distinct crawlers in LOCKSS: the new content crawler and the repair crawler. The new content crawler is the primary method of inserting data into the cache; it will only start at the publisher's content staging area, as defined in the AU start URL settings. The repair crawler, depending on configuration parameters, can repair from the publisher and/or peer nodes. The Web crawler supports the HTTP and HTTPS URL protocols.


Plugins can be retrieved from the LOCKSS node or from SourceForge: http://lockss.cvs.sourceforge.net/viewvc/lockss/lockss-daemon/plugins/src/

Example plugin: BirminghamPublicLibraryBasePlugin.xml

New Content Crawler


The new content crawler will only retrieve data from the publisher's content staging area. It will follow links and discover new content.

Repair Crawler


Repair crawler modes

  1. repair from other caches only
  2. repair from publisher only
  3. repair from other caches first; if repair fails after trying a certain number of caches, it will try to repair from the publisher
  4. repair from publisher first; if repair fails, it will try to repair from other caches

Parameters that select among these modes:

org.lockss.crawler.fetch_from_other_caches_only   [false] 
  Set this to true in properties and repairs will be fetched from other caches only 
org.lockss.crawler.fetch_from_publisher_only   [true] 
  Set this to true in properties and repairs will be fetched from the publisher only
org.lockss.crawler.num_retries_from_caches   [5] 
  Set this in properties to limit the number of caches the crawler will try in repair modes 1, 3 and 4.
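
For example, an Expert Config fragment restricting repairs to peer caches (mode 1) would combine the parameters above as follows; the values are chosen for illustration only.

```
org.lockss.crawler.fetch_from_other_caches_only = true
org.lockss.crawler.fetch_from_publisher_only = false
org.lockss.crawler.num_retries_from_caches = 5
```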

The repair crawler must be called with a list of specific URLs. The crawler does not discover new content by following links.

The repair crawler is called from the V3 poller after tallying votes and identifying repairable URLs. The repair crawler can repair from the publisher or from a peer. A publisher repair follows the standard Web request, as in a new content crawl.

Peer repair utilizes the V3 LCAP messaging protocol (see getRepairDataInputStream() in V3LcapMessage.java and the V3 poller event handleReceiveRepair() in PollerActions.java). Repairs are queued in an object of class RepairQueue, which is instantiated after vote tallying in a V3 poll. See requestRepair() and repairIfNeeded(), generally called by methods such as tallyVoterUrl() or tallyPollerUrl() in V3Poller.java.

Repairs can be made with the V3 LCAP messaging protocol, but this seems to be in play only when a V3 poll has completed on the peer; the peer needs the AU to start and complete a poll. I don't think that the repair crawler, in default usage, populates a new node with peer cache data via the V3 poll and V3 LCAP messaging protocol.

Content Crawler and Revisions


Depending on AU and plugin definitions, the repository manager will initiate a new content crawl for an existing AU. When examining previously crawled URLs, the LOCKSS crawler does not request a full copy of the remote data. Instead, it reads the last modified date of the cached content from stored metadata (current.props) and sends that date in an If-Modified-Since HTTP header to the remote server. The remote server checks the file's last modified date against the header and responds with 304 (Not Modified) when the file has not changed. If the remote server determines the file's last modified date is newer than the date in the request header, it sends the content.
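
The conditional-fetch decision can be mimicked with file modification times (a sketch using GNU touch on scratch files; the real check compares the Last-Modified header stored in current.props):

```shell
# Decide refetch vs 304 by comparing modification times of two scratch files.
tmp=$(mktemp -d)
touch -d '2013-10-01' "$tmp/cached"   # date recorded at the last crawl
touch -d '2013-10-01' "$tmp/remote"   # publisher copy, unchanged
if [ "$tmp/remote" -nt "$tmp/cached" ]; then
  echo "200: content re-fetched"
else
  echo "304: not modified, cached copy kept"
fi
```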

According to RepositoryNodeImpl.java (see sealNewVersion()), the LOCKSS repository manager only saves crawled content in the cache as a new revision if the files are not identical. If the files are identical but the last modified time differs, the repository manager renames the existing current.props to current.props-[LAST MODIFIED DATE] and saves the most recent file metadata into current.props. current itself isn't touched.

File Equality

LOCKSS uses a custom function to determine file equality. isContentEqual is called in the sealNewVersion method.

FileUtil.isContentEqual(currentCacheFile, tempCacheFile)

Below is a snippet of the file stream comparison in FileUtil.java. (The braces closing the loop are restored here; fis1 and fis2 are input streams over the two files, whose lengths have already been found equal.)

byte[] bytes1 = new byte[FILE_CHUNK_SIZE];
byte[] bytes2 = new byte[FILE_CHUNK_SIZE];
while (true) {
 int bytesRead1 = fis1.read(bytes1);
 int bytesRead2 = fis2.read(bytes2);

 if (bytesRead1 != bytesRead2) {
   // shouldn't really happen, since lengths are equal
   return false;
 }
 // ... the chunks are then compared byte for byte; true is returned
 // once both streams are exhausted without a mismatch
}
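
The same check can be rehearsed by hand with cmp, for example against two cache files (scratch files shown here):

```shell
# Compare files byte for byte, as isContentEqual does.
tmp=$(mktemp -d)
printf 'same bytes' > "$tmp/a"
printf 'same bytes' > "$tmp/b"
printf 'other'      > "$tmp/c"
cmp -s "$tmp/a" "$tmp/b" && echo "a and b are content-equal"
cmp -s "$tmp/a" "$tmp/c" || echo "a and c differ"
```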

ICP Server

Internet Cache Protocol (RFC 2186: http://www.ietf.org/rfc/rfc2186.txt). See also Application of Internet Cache Protocol (RFC 2187: http://www.ietf.org/rfc/rfc2187.txt).

Manipulating the Cache

Exporting Content

The LOCKSS repository manager provides a simple HTTP-based download of any content type on the LOCKSS server: link, copy or move the data to /cache0/gamma/tmp/export/ and visit the Web admin page http://bpl-adpnet.jclc.org:8081/ExportContent

/cache0/gamma/tmp/export/ is emptied on daemon restart, unless file ownership is changed from the lockss user.

Deleting Stale AUs

Repositories marked as 'Deleted' in the RepositoryTable can be removed from the cache by deleting the appropriate directory (e.g. /cache0/gamma/cache/f). AUs not marked as deleted should first be removed through the Web UI or, untested, by removing the appropriate lines from /cache0/gamma/config/au.txt. Then remove the directory in the cache.

Restart the daemon.

Moving AUs, Same Disk

The daemon fills in missing directories when adding AUs, so renaming a directory is not strictly necessary. Nevertheless, cache directories are incremented 'a'-'z', then 'a[a-z]', 'b[a-z]', etc. An AU is not referenced by a specific cache directory location; an AU is referenced by the disk repository indicated in au.txt (i.e. local\:/cache0/gamma, NOT local\:/cache0/gamma/f). To move an AU to a new location on the same disk, simply rename the directory.

Consider the following scenario: AUs exist at 'a', 'b', 'c', 'd', 'e'. After deleting AUs 'c' and 'd', cache/gamma looks like 'a', 'b', 'e'. I can rename directory 'e' to 'c', and the next AU added will be appropriately named 'd'.
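
The scenario, rehearsed as shell commands against a throwaway directory tree (never against a live cache while the daemon is running):

```shell
# AUs at a, b, e after deleting c and d; rename e to c so the next AU gets d.
cache=$(mktemp -d)
mkdir "$cache/a" "$cache/b" "$cache/e"
mv "$cache/e" "$cache/c"
```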

Restart the daemon.

n.b. Wait until active crawls have completed before renaming directories; otherwise the crawler will continue to use the original directories.

Moving AUs, Different Disk

Cache directories are incremented 'a'-'z', then 'a[a-z]', 'b[a-z]', etc. First, identify the next directory name on the new disk.

Locate and modify the appropriate lines defining the AU in /cache0/gamma/config/au.txt (e.g. repository=local\:/cache1/gamma changes to local\:/cache2/gamma). Move the directory from the original disk to the new one (e.g. mv /cache1/gamma/cache/g /cache2/gamma/cache/gj)
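
The au.txt edit can be scripted with sed; note the escaped colon in the repository value. The AU key below is hypothetical, and the rewrite is rehearsed on a scratch copy (always back up the real au.txt first):

```shell
# Rewrite a repository= line from /cache1 to /cache2 on a scratch au.txt.
tmp=$(mktemp -d)
printf 'org.lockss.au.example.repository=local\\:/cache1/gamma\n' > "$tmp/au.txt"
cp "$tmp/au.txt" "$tmp/au.txt.bak"   # keep a backup before editing
sed -i 's|local\\:/cache1/gamma|local\\:/cache2/gamma|' "$tmp/au.txt"
```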

Restart the daemon.

Copying AUs, Different Nodes

The repository manager of a node will not initiate a new content crawl to a peer node, and new content crawls are generally recognized as the only way to populate a node's cache. It is, however, possible to copy the file system from one node and transfer it to another. Once the node is populated with data and the repository manager is coerced into believing a new content crawl was completed successfully, the repository manager will engage in normal peer polling for the archival unit. Modification or addition of data to the archival unit still requires the use of the publisher's content staging area with the original URL structure.

Understand File System Organization

File system organization in the cache is system dependent only to a depth of 4 directories from the root. Beyond that point, the organization depends on the original access URL for the archival unit data.

It is possible to populate a new node using a copy of the file system of a peer node. The peer node should have indicated high reliability for the cached data (100% agreement in a recent poll would be good). (I am shortcutting some steps under the assumption that the archival unit is in the add-titles list of the LOCKSS Web UI; otherwise one would have to manually edit the au.txt file and create directories in the cache.)


When the archival unit is added to the node through the Web UI, the repository manager creates the next available directory based on the volume selected. (The base directory for the archival unit truncates at /cache0/gamma/cache/m/; the extra path in the example is to show the URL structure dependence.) The repository manager will then initiate a new content crawl to the start URL defined in the AU. Since the publisher has vacated the content staging area, the crawl will result in a 404 or another HTTP error. The changes to au.txt and the creation of the /m/ directory in the file system are not rolled back.

Pack the AU

A tarred (compressed) package of an archival unit should be made on the peer node with content. For this example, the peer node had the archival unit in the file system at /cache2/gamma/cache/bj/.

tar -zpcvf /cache0/gamma/tmp/export/au_to_transfer.tar.gz /cache2/gamma/cache/bj/ 

There are a number of methods of exporting the content from the LOCKSS node, including FTP, but for this example I am utilizing the built-in HTTP export. After this command is executed, there will be a link called "au_to_transfer.tar.gz" in the ExportContent page of the LOCKSS Web UI.

Unpack the AU

The present working directory at this point should be /cache0/gamma/cache/. The appropriate command:

tar -xvzf /path/to/au_to_transfer.tar.gz -C m --strip-components=4 --exclude \#agreement --exclude \#no_au_peers --exclude \#id_agreement.xml --exclude \#node_props --exclude \#au_id_file
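
Before extracting onto a live node, list the archive to confirm the four leading path components that --strip-components=4 assumes; the round trip can be rehearsed in a scratch directory:

```shell
# Rehearse the pack/unpack round trip with a dummy AU tree.
work=$(mktemp -d)
mkdir -p "$work/cache2/gamma/cache/bj/bpldb.bplonline.org"
echo data > "$work/cache2/gamma/cache/bj/bpldb.bplonline.org/file"
tar -zpcf "$work/au.tar.gz" -C "$work" cache2/gamma/cache/bj/
tar -tzf "$work/au.tar.gz"      # every entry starts cache2/gamma/cache/bj/
mkdir -p "$work/dest/m"
tar -xzf "$work/au.tar.gz" -C "$work/dest/m" --strip-components=4
```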

Finish Up

The #node_props files will be regenerated at the LOCKSS daemon restart, which should be done at this time. Because the files #au_state.xml and #nodestate.xml were preserved, the repository manager thinks that a new content crawl has already been completed. A manually initiated V3 poll on the archival unit should return appropriate URL agreement levels. The new LOCKSS node should now engage in normal content polling with the other peers.

Voting and Polling

Hashed UrlSets

AU released by LOCKSS with current content staging

# Block hashes from bpl-adpnet.jclc.org, 12:00:19 11/07/12
# AU: Birmingham Public Library Cartography Collection: Maps (000400-000599)
# Hash algorithm: SHA-1
# Encoding: Base64

IpncVSUBDZaqglSsjkOp49OS4KE=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/
M7+kc1l4Nl/Jwp3Hp0XRgK8dK94=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc
lMeB8SlN+CGI7+1LxRhE+btzGmo=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc/000404.mrc
266QzyI+r7sZVcRZGId2/roOLNI=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif
Tv2dnIhIPXEllk95fqAeA8uB67s=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
6lm8iDE+/gu5ZWcJxTFjU5PP6OI=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt
b5Ta9uvYyS0GSZ7srapk2YZGWJM=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.csv
iuzvdBZG0oft3zxao5Sn5r7m+dA=   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.txt
# end

Manually created AU with limited copy of current content

# Block hashes from bpl-adpnet.jclc.org, 11:41:58 11/07/12
# AU: Birmingham Public Library Base Plugin, Base URL http://bpldb.bplonline.org/adpn/load/, Group test, Collection 000404-000404
# Hash algorithm: SHA-1
# Encoding: Base64

GZzfEqeNHw8WdjuZbX1g/VrJyaI=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/
JkkpA1r7E548EOiiu0KeBuQYfts=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/mrc
lMeB8SlN+CGI7+1LxRhE+btzGmo=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/mrc/000404.mrc
uVD1FzG1WN43ROniD/VfQb/P6pU=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/tif
Tv2dnIhIPXEllk95fqAeA8uB67s=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/tif/000404.tif
0bG54sCMTETu5qKnBA3RXO80v/0=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt
b5Ta9uvYyS0GSZ7srapk2YZGWJM=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt/000404.csv
iuzvdBZG0oft3zxao5Sn5r7m+dA=   http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt/000404.txt
# end



VoteBlock Versions (vn) decoded from Base64. Notice the ph (Plain Hash) for item 000404.tif matches both instances in the Hashed UrlSet.

#Wed Nov 07 03:58:56 CST 2012


BlockTally.java enumerates poll results including TOO_CLOSE and NOQUORUM.


Configuration parameter

org.lockss.poll.v3.voteMargin [75]

Need to investigate this further, particularly conflicts with quorum.

Does this mean a poll of 5 voters with 3 votes in agreement is invalid because (3/5)*100 = 60?

The voteMargin input parameter comes from configuration or the default (75).

boolean isWithinMargin(int voteMargin) {
 int numAgree = agreeVoters.size();
 int numDisagree = disagreeVoters.size();
 double numVotes = numVotes();
 double actualMargin;

 if (numAgree > numDisagree) {
  actualMargin = (double) numAgree / numVotes;
 } else {
  actualMargin = (double) numDisagree / numVotes;
 }

 if (actualMargin * 100 < voteMargin) {
  return false;
 }
 return true;
}
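
Working the question above through this check: with 3 of 5 voters in the majority, actualMargin * 100 is 60, below the default voteMargin of 75, so the poll is indeed too close to call. The arithmetic, reimplemented as a sketch (not LOCKSS code):

```shell
# Margin check for 3 agree / 2 disagree out of 5 votes, voteMargin 75.
awk 'BEGIN {
  numAgree = 3; numDisagree = 2; numVotes = 5; voteMargin = 75
  m = (numAgree > numDisagree ? numAgree : numDisagree) / numVotes
  result = (m * 100 < voteMargin) ? "TOO_CLOSE" : "within margin"
  print result    # prints "TOO_CLOSE"
}'
```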


Configuration parameter

org.lockss.poll.v3.quorum   [5]

ADPNet V3Poll quorum configuration is 3.

Polling minimum is 3, not 3 + caller.

11:15:56.170: Debug: 10-V3Poller: [InvitationCallback] Enough peers are participating (3+0)

Web Services


Web services are available starting with LOCKSS daemon 1.61.5.

WSDL Service Reference : http://bpl-adpnet.jclc.org:8081/ws/DaemonStatusService?wsdl


Client Request

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:stat="http://status.ws.lockss.org/">
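
The request envelopes in this section survive only as their opening tags. A complete isDaemonReady request, reconstructed from the element names in the responses (a sketch, not captured traffic), would look roughly like:

```xml
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:stat="http://status.ws.lockss.org/">
  <soapenv:Header/>
  <soapenv:Body>
    <stat:isDaemonReady/>
  </soapenv:Body>
</soapenv:Envelope>
```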

DaemonStatusService Response

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <ns2:isDaemonReadyResponse xmlns:ns2="http://status.ws.lockss.org/">


Client Request

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:stat="http://status.ws.lockss.org/">

DaemonStatusService Response

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <ns2:getAuIdsResponse xmlns:ns2="http://status.ws.lockss.org/">
            <name>Birmingham Public Library Cartography Collection: Maps (000200-000399)</name>
            <name>Birmingham Public Library Cartography Collection: Maps (000400-000599)</name>


Client Request

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:stat="http://status.ws.lockss.org/">

DaemonStatusService Response

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <ns2:getAuStatusResponse xmlns:ns2="http://status.ws.lockss.org/">
            <journalTitle>Birmingham Public Library Cartography Collection</journalTitle>
            <pluginName>Birmingham Public Library Base Plugin</pluginName>
            <publisher>Birmingham Public Library</publisher>
            <status>100.00% Agreement</status>
            <volume>Birmingham Public Library Cartography Collection: Maps (000400-000599)</volume>