LOCKSS Software
Source
- Source : http://sourceforge.net/scm/?type=cvs&group_id=47774 (SourceForge)
- Svn : http://sourceforge.net/p/lockss/svn/HEAD/tree/ (view it online)
- Release Notes : http://www.lockss.org/support/use-a-lockss-box/daemon-release-notes/
- LOCKSS Doc : http://www.lockss.org/lockssdoc/gamma/daemon/overview-summary.html
Private LOCKSS Network
- LOCKSS Technical Manual : http://plnwiki.lockss.org/wiki/index.php/LOCKSS_Technical_Manual
CLOCKSS
- CLOCKSS Program Documents : http://documents.clockss.org/index.php/Main_Page
File System
Primary file system locations used by a LOCKSS installation.
- /etc/lockss
- /home/lockss
- /usr/share/lockss
- /var/log/lockss
In the file /etc/lockss/config.dat, the LOCKSS_DISK_PATHS variable enumerates the available disks.
- /cache0/gamma
- /cache1/gamma
- /cache2/gamma
- /cache3/gamma
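In config.dat this is a single semicolon-separated assignment. A hypothetical fragment for the four disks above (the variable name is from the source; the separator and quoting are assumed):

```
LOCKSS_DISK_PATHS="/cache0/gamma;/cache1/gamma;/cache2/gamma;/cache3/gamma"
```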
File System Deep Dive
Access URL http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
Cache storage /cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/
Where are the bytes of this tif file?
/cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/#content/current
current is a raw byte file: if the content is binary, the file is binary; if the content is text, the file is text. To determine what type of file current is, examine current.props.
/cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/#content/current.props describes the bytes (HTTP headers, LOCKSS-specific headers)
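current.props is stored in the java.util.Properties format (key = value lines). A minimal sketch for dumping one, assuming only that format; the path in the comment is illustrative and the actual keys vary by daemon version:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class PropsDump {
    // Load a LOCKSS #content props file; it is a plain java.util.Properties file.
    public static Properties load(File f) throws IOException {
        Properties p = new Properties();
        try (InputStream in = new FileInputStream(f)) {
            p.load(in);
        }
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical invocation:
        //   java PropsDump .../000404.tif/#content/current.props
        load(new File(args[0])).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```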
/cache0/gamma/cache/m
    #au_id_file
    #au_state.xml
    #id_agreement.xml
    #no_au_peers
    #node_props
    #nodestate.xml
    bpldb.bplonline.org/
        http/
            #node_props
            adpn/
                #node_props
                load/
                    #node_props
                    Cartography/
                        #node_props
                        000400-000599/
                            #agreement
                            #node_props
                            #content/
                            manifest.html/
                            000404/
                                #agreement
                                #node_props
                                #content/
                                tif/
                                    #agreement
                                    #node_props
                                    #content/
                                    000404.tif/
                                        #agreement
                                        #node_props
                                        #content/
                                            current
                                            current.props
Revisions
Each revision is stored with a full copy of the bytes. current and current.props are renamed to 1 and 1.props. Each revision can always be retrieved. Only the `current` file participates in preservation activities.
/cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/
    #content/
        1
        1.props
        2
        2.props
        current
        current.props
See #Content Crawler and Revisions to understand how the LOCKSS Web crawler handles files that differ in metadata but are identical.
Starting and Stopping the LOCKSS daemon
- /etc/init.d/lockss start
- /etc/init.d/lockss stop
- /etc/init.d/lockss restart
CentOS 7 Version
- systemctl enable lockss
- systemctl start lockss
- systemctl restart lockss
- systemctl stop lockss
Log Rotate
Logs are in /var/log/lockss
logrotate /etc/logrotate.d/lockss
Configuration Files
Primary Configuration files
- http://props.lockss.org:8001/adpn/lockss.xml
- /etc/lockss/config.dat
- /home/lockss/local.txt
- /cache0/gamma/config/expert_config.txt
- /cache0/gamma/config/au.txt
Additional Configuration files
- /cache0/gamma/config/ui_ip_access.txt
- /cache0/gamma/config/proxy_ip_access.txt
- /cache0/gamma/config/content_servers_config.txt
Configuration parameters are listed
- http://www.lockss.org/lockssdoc/gamma/daemon/paramdoc.html
- http://www.lockss.org/lockssdoc/gamma/daemon/paramdoc.txt
Debug options from Logger.java
- info -- default generally,
- debug -- debugging messages, sparse
- debug1 -- debugging messages
- debug2 -- Detailed debugging that would not produce a ridiculous amount of output if it were enabled system-wide.
- debug3 -- Debugging messages that produce more output than would be reasonable if this level were enabled system-wide. (e.g. messages in inner loops, or per-file, per-hash step, etc.)
Examples of use of log level in Expert Config
org.lockss.log.BaseCrawler.level = info
org.lockss.log.CrawlerImpl.level = info
org.lockss.log.BlockTally.level = info
org.lockss.log.V3PollerStatus.level = debug
org.lockss.log.PlatformInfo.level = debug2
User Configuration
Enable accounts administration to add multiple users with specific permissions.
org.lockss.accounts.enabled = [false]
User Roles
- User Admin Role User may configure admin access (add/delete/modify users, set admin access list)
- Content Admin Role User may configure content access (set content access list)
- AU Admin Role User may change AU configuration (add/delete content)
- Access Content Role User may access content
- Debug Role
A user with no explicitly assigned roles has read access to the daemon status tables.
The user authentication type parameter chooses between Basic (Web server HTTP auth) and Form. Form authentication provides a sign-out button.
org.lockss.accounts.policy = [(null)] [Basic,Form,SSL,LC]
Title List
The default title list for available AUs is provided by LOCKSS in http://props.lockss.org:8001/adpn/lockss.xml. The Add/Remove title lists in the Web admin are populated from titledb.xml.
<property name="titleDbs">
  <list>
    <value>http://props.lockss.org:8001/adpn/titledb/titledb.xml</value>
  </list>
</property>
titledb.xml uses the same element name at different nesting levels.
<lockss-config>
  <property name="org.lockss.titleSet">
    <property name="Birmingham Public Library">
      <property name="name" value="All Birmingham Public Library AUs" />
      <property name="class" value="xpath" />
      <property name="xpath" value="[attributes/publisher='Birmingham Public Library']" />
    </property>
  </property>
  <property name="org.lockss.title">
    <property name="BirminghamPublicLibraryBasePluginBirminghamPublicLibraryCartographyCollectionMaps000400000599">
      <property name="attributes.publisher" value="Birmingham Public Library" />
      <property name="journalTitle" value="Birmingham Public Library Cartography Collection" />
      <property name="type" value="journal" />
      <property name="title" value="Birmingham Public Library Cartography Collection: Maps (000400-000599)" />
      <property name="plugin" value="org.bplonline.adpn.BirminghamPublicLibraryBasePlugin" />
      <property name="param.1">
        <property name="key" value="base_url" />
        <property name="value" value="http://bpldb.bplonline.org/adpn/load/" />
      </property>
      <property name="param.2">
        <property name="key" value="group" />
        <property name="value" value="Cartography" />
      </property>
      <property name="param.3">
        <property name="key" value="collection" />
        <property name="value" value="000400-000599" />
      </property>
    </property>
  </property>
</lockss-config>
Partitioning Cache Data
It is possible to override the default configuration using the local parameter org.lockss.titleDbs. This enables networks to create a modified title list that can be customized to each node. Intelligent data partitions can be defined and redefined centrally.
org.lockss.titleDbs = http://bpldb.bplonline.org/etc/adpn/titledb-local.xml
or can be set in union with the centralized title list with local parameter
org.lockss.userTitleDbs = http://bpldb.bplonline.org/etc/adpn/titledb-local.xml
See Partitioning Cache Data for a more complete discussion. In brief, data partitioning requires parsing of the LOCKSS title list, persistent storage for definitions of responsibility for distributed data, and mechanisms to redefine and modify responsibilities when the network topology changes.
Title URL Awareness
A partitioned cache data set could introduce problems when serving content from the LOCKSS node. The direct proxy method (either content_proxy or audit_proxy) queries the LOCKSS node directly. It is feasible to introduce an intermediary service such as Squid Proxy and enable #ICP_Server on the LOCKSS nodes. The Squid Proxy would then serve as the single access point (or points if request load is heavy) for all of the caches.
Squid Proxy
To do. http://wiki.squid-cache.org/SquidFaq/InnerWorkings
LOCKSS recommendations for proxy configuration : http://www.lockss.org/support/use-a-lockss-box/view-your-preserved-content/proxy-integration/
Title List Parameters
See #Crawl_Proxy for specifying a per-AU proxy host in the title list.
Other Parameters
To examine
minReplicasForNoQuorumPeerRepair (plugin? See in 1.65)

org.lockss.poll.v3.enableDiscovery [true]
    Comment: If true, enable the discovery mechanism that attempts to invite peers from outside our Initial Peer List into polls.
    Used in: org.lockss.poller.v3.V3Poller

org.lockss.poll.v3.enableSymmetricPolls [false]
    Comment: If true, can request a symmetric poll
    Used in: org.lockss.poller.v3.V3Voter

org.lockss.poll.v3.keepUrlLists [false]
    Comment: If true, lists of AGREE/DISAGREE/VOTER_ONLY/POLLER_ONLY URLs will be kept.
    Used in: org.lockss.poller.v3.VoteBlocksTallier

org.lockss.poll.v3.minPercentAgreementForRepairs [0.5]
    Comment: The minimum percent agreement required before we're willing to serve repairs, if using per-AU agreement.
    Used in: org.lockss.poller.v3.V3Voter

org.lockss.poll.v3.repairFromPublisherWhenTooClose [false]
    Comment:
    Used in: org.lockss.poller.v3.V3Poller

org.lockss.subscription.enabled [false]
    Comment: Indication of whether the subscription subsystem should be enabled. Defaults to false. Changes require daemon restart.
    Used in: org.lockss.subscription.SubscriptionManager
Postgres replacement options for Derby
org.lockss.dbManager.datasource.className [org.apache.derby.jdbc.EmbeddedDataSource]
    Comment: Name of the database datasource class. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.createDatabase [create]
    Comment: Name of the database create. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.databaseName [db/DbManager]
    Comment: Name of the database with the relative path to the DB directory. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.password [insecure]
    Comment: Name of the existing database password. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.portNumber [1527]
    Comment: Port number of the database. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.serverName [localhost]
    Comment: Name of the server. Changes require daemon restart.
    Used in: org.lockss.db.DbManager

org.lockss.dbManager.datasource.user [LOCKSS]
    Comment: Name of the database user. Changes require daemon restart.
    Used in: org.lockss.db.DbManager
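A hypothetical local-config fragment pointing these parameters at a PostgreSQL server instead of embedded Derby. The class name is from the PostgreSQL JDBC driver; the server, database, user and password values are illustrative and untested:

```
org.lockss.dbManager.datasource.className=org.postgresql.ds.PGSimpleDataSource
org.lockss.dbManager.datasource.serverName=localhost
org.lockss.dbManager.datasource.portNumber=5432
org.lockss.dbManager.datasource.databaseName=lockss
org.lockss.dbManager.datasource.user=lockss
org.lockss.dbManager.datasource.password=changeme
```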
Web Crawler
There are two distinct crawlers in LOCKSS: the new content crawler and the repair crawler. The new content crawler is the primary method of inserting data into the cache, and it will only start at the publisher's content staging area as defined in the AU start URL settings. The repair crawler, depending on configuration parameters, can repair from the publisher and/or peer nodes. The Web crawler supports the HTTP and HTTPS URL protocols.
Plugins
Plugins can be retrieved from the LOCKSS node or SourceForge http://lockss.cvs.sourceforge.net/viewvc/lockss/lockss-daemon/plugins/src/
- Plugin Tool Download (0.12.1 latest from SourceForge)
- LOCKSS Plugin Tool Tutorial (Way Back Machine IA)
- Plugin XML Format from the PLN Wiki
Plugins can be extended by creating references to Java classes in the plugin definition. Some examples are below. See Highwire Press plugin definition for a more complete demonstration.
<entry>
  <string>text/html_filter_factory</string>
  <string>org.bplonline.adpn.BaseHtmlFilterFactory</string>
</entry>
<entry>
  <string>au_url_normalizer</string>
  <string>org.bplonline.adpn.BaseUrlNormalizer</string>
</entry>
Example plugin : BirminghamPublicLibraryBasePlugin.xml
Plugin API
Plugin Specification (limited document).
Interface to extend generic plugin-based functions (UrlCacher, UrlComparator) for PLN specific activities.
Signing Plugins
A plugin is typically a signed jar that contains the XML plugin definition and additional Java class files that extend the base functionality.
Helpful tools are packaged with the LOCKSS source under test/scripts.
genkey generates a user Java keystore. jarplugin packages multiple paths into a single tar and jars the resulting file. signplugin signs the jar with the supplied keystore. Untested, but it might be possible to use genplugin to iterate all of these actions from one call.
Keystores can be made available to the daemon by setting the parameter org.lockss.plugin.userKeystore.location.
Crawl Windows
Crawl windows can be set to control when the LOCKSS daemon should start and end crawls. The LOCKSS Daemon will disallow crawls when outside of the crawl window. The LOCKSS Daemon will also abort in-progress crawls that overrun the window. For tight windows, examine the org.lockss.crawler.startCrawlsInterval parameter and the au_def_pause_time in the plugin definition.
<entry>
  <string>au_crawlwindow_ser</string>
  <org.lockss.daemon.CrawlWindows-Interval>
    <start>
      <timezone>America/Chicago</timezone>
    </start>
    <end>
      <timezone>America/Chicago</timezone>
    </end>
    <fieldMask>3</fieldMask>
    <timeZoneId>US/Central</timeZoneId>
  </org.lockss.daemon.CrawlWindows-Interval>
</entry>
Crawl Proxy
Title list parameters specifying an alternative crawl proxy.
<property name="param.98">
  <property name="key" value="crawl_proxy" />
  <property name="value" value="reingest1.clockss.org:8082" />
</property>
This method proxies an AU start URL through the crawl_proxy parameter and enables FollowLinkCrawler to gather data from a node, provided the proxy node is serving content. The serving proxy recognizes the incoming request as an audit proxy request and returns cached URLs.
A factor to consider: this should probably be a one-time new content crawl for an uncached AU; otherwise there is potential for throwing V3 Poll state out of quorum. Also, modifying the title list to add the crawl_proxy parameter should be limited to the peer node needing content. Adding the crawl_proxy parameter to a title list that is consumed by all nodes could introduce quorum and polling errors through unintended node-to-node new content crawling.
New Content Crawler
The new content crawler will only retrieve data from the publisher's content staging area. It will follow links and discover new content.
Repair Crawler
Repair crawler modes
- repair from other caches only
- repair from publisher only
- repair from other caches first. If repair fails in trying a certain number of caches, it will try repair from the publisher
- repair from publisher first. If repair fails, it will try repair from other caches
Parameters to set modes,
org.lockss.crawler.fetch_from_other_caches_only [false]
    Set this to true in properties and repair will be done from other caches only.
org.lockss.crawler.fetch_from_publisher_only [true]
    Set this to true in properties and repair will be done from publisher only.
org.lockss.crawler.num_retries_from_caches [5]
    Set this in properties to limit the number of caches it will try in repair modes 1, 3 and 4.
The repair crawler must be called with a list of specific URLs. The crawler does not discover new content by following links.
The repair crawler is called from the V3 Poller after tallying votes and identifying repairable URLs. The repair crawler can repair from the publisher or a peer. Publisher repair follows the standard Web request as in a new content crawl.
Peer repair utilizes the V3 LCAP messaging protocol (see getRepairDataInputStream() in V3LcapMessage.java and the V3 Poller event handleReceiveRepair() in PollerActions.java). Repairs are queued in an object of class RepairQueue, which is instantiated after vote tallying in a V3 Poll. See requestRepair() and repairIfNeeded(), generally called by methods such as tallyVoterUrl() or tallyPollerUrl() in V3Poller.java.
Repairs can be made with the V3 LCAP messaging protocol but it seems this is only in play when a V3 Poll is completed on the peer. The peer needs the AU to start and complete a poll. I don't think that the repair crawler, in default usage, populates a new node with peer cache data utilizing the V3 Poll and V3 LCAP messaging protocol.
Content Crawler and Revisions
Depending on AU and plugin definitions, the repository manager will initiate a new content crawl for an existing AU. When examining previously crawled URLs, the LOCKSS crawler does not request a full copy of the remote data. The LOCKSS crawler retrieves the last modified date of the cached content in stored metadata (current.props) and sends that date as an HTTP header (If-Modified-Since) to the remote server. The remote server checks file last modified date against the LOCKSS request HTTP header data and responds with 304 (not-modified) when the file has not changed. In the event the remote server determines the file last modified date is newer than LOCKSS request HTTP header date, the remote server sends the content.
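The conditional re-fetch described above can be sketched with a plain HttpURLConnection; this is not the LOCKSS crawler code itself, and the URL passed in would be the cached node's access URL:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class ConditionalGet {
    // HTTP dates (RFC 1123 form) are always expressed in GMT.
    public static String httpDate(long epochMillis) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(epochMillis));
    }

    // True when the server sends new content; false on 304 Not Modified,
    // meaning the cached copy is still current.
    public static boolean remoteIsNewer(URL url, long cachedLastModified)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("If-Modified-Since", httpDate(cachedLastModified));
        int code = conn.getResponseCode();
        conn.disconnect();
        return code != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```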
According to RepositoryNodeImpl.java (see sealNewVersion()), the LOCKSS repository manager only saves crawler content in the cache as a revision if the files are not identical. If files are identical but last modified time is different, the repository manager renames the existing current.props to current.props-[LAST MODIFIED DATE] and the most recent file metadata is saved into current.props. current isn't touched.
File Equality
LOCKSS uses a custom function to determine file equality. isContentEqual is called in the sealNewVersion method.
FileUtil.isContentEqual(currentCacheFile, tempCacheFile)
Below is a snippet of the file stream compare in FileUtil.java
byte[] bytes1 = new byte[FILE_CHUNK_SIZE];
byte[] bytes2 = new byte[FILE_CHUNK_SIZE];
while (true) {
  int bytesRead1 = fis1.read(bytes1);
  int bytesRead2 = fis2.read(bytes2);
  if (bytesRead1 != bytesRead2) {
    // shouldn't really happen, since lengths are equal
    return false;
  }
  ...
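A self-contained sketch of the same chunked comparison, assuming (as the snippet's comment implies) that file lengths are checked before the byte loop. It is not the LOCKSS class itself; readNBytes stands in for the snippet's read calls so that short reads cannot cause false negatives:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ContentEqual {
    private static final int FILE_CHUNK_SIZE = 4096; // stand-in for the LOCKSS constant

    public static boolean isContentEqual(File f1, File f2) throws IOException {
        if (f1.length() != f2.length()) {
            return false; // cheap length check first
        }
        try (InputStream in1 = new BufferedInputStream(new FileInputStream(f1));
             InputStream in2 = new BufferedInputStream(new FileInputStream(f2))) {
            byte[] bytes1 = new byte[FILE_CHUNK_SIZE];
            byte[] bytes2 = new byte[FILE_CHUNK_SIZE];
            while (true) {
                // readNBytes fills the buffer up to EOF, avoiding short-read mismatches
                int n1 = in1.readNBytes(bytes1, 0, FILE_CHUNK_SIZE);
                int n2 = in2.readNBytes(bytes2, 0, FILE_CHUNK_SIZE);
                if (n1 != n2 || !Arrays.equals(bytes1, 0, n1, bytes2, 0, n2)) {
                    return false;
                }
                if (n1 < FILE_CHUNK_SIZE) {
                    return true; // both streams exhausted with no mismatch
                }
            }
        }
    }
}
```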
File Validation (Staged)
The publisher staging area has privileged rights to write new content to the LOCKSS cache nodes when the LOCKSS Web crawler discovers new or different content. In the case of a PLN, the publisher staging area could simply be a commodity storage device with a Web interface and no revision control or bit-checking. While the LOCKSS repository manager keeps previous versions of documents, the polling mechanism only actively maintains the current file.
A proposal to maintain file authority and mitigate the ramifications of corruption or accidental modification of files on the content staging area would be to include checksums with the staged file. The Plugin/AU definition could inform the repository manager whether to look for authenticity checksums and what hashing algorithm was used (SHA1, MD5). In the event a checksum does not match, the repository manager should throw a crawl error. In addition, the Plugin/AU definition should be extended to send alerts to the publisher on crawl errors. Hash checks could be referenced in RepositoryNodeImpl sealNewVersion.
http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/ http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc/000404.mrc http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc/000404.mrc.md5 http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif.md5 http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.csv http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.csv.md5 http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.txt http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.txt.md5
AU definition keys example:
<au_hash_check> true </au_hash_check> (default false)
<au_hash_check_file_types> tif, jpg, pdf </au_hash_check_file_types> (iterated file types)
<au_hash_check_algorithm> MD5 </au_hash_check_algorithm> [SHA1,MD5]
<au_hash_check_extension> md5 </au_hash_check_extension> (string concatenated to the node URL)
The 000404.tif.md5 could itself become a node in the cache, but the check should be made to the content staging area or the .md5 file should be cached first. Mismatched hash checks should result in a logged crawl error and the new revision discarded.
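A minimal sketch of the proposed verification step, assuming the sidecar file simply contains the hex digest of the content. SidecarCheck and its method names are hypothetical, not LOCKSS API:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SidecarCheck {
    // Hex digest of the fetched content bytes (algorithm = "MD5" or "SHA-1").
    public static String hexDigest(byte[] content, String algorithm) {
        try {
            StringBuilder sb = new StringBuilder();
            for (byte b : MessageDigest.getInstance(algorithm).digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 and SHA-1 are guaranteed by the JRE spec
            throw new IllegalStateException(algorithm + " unavailable", e);
        }
    }

    // Compare fetched bytes against the body of the sidecar file (e.g. 000404.tif.md5).
    // A mismatch would be surfaced as a crawl error under the proposal above.
    public static boolean matches(byte[] content, String sidecarBody, String algorithm) {
        return hexDigest(content, algorithm).equalsIgnoreCase(sidecarBody.trim());
    }
}
```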
http://en.wikipedia.org/wiki/File_Fixity
Alerter
Alert.java contains an auAlert() method. It should be fairly trivial to extend this method to pull additional AU parameters such as a publisher alert email (BaseArchivalUnit.java).
// Factories
/**
 * Create an AU-specific alert.
 * @param prototype the prototype Alert
 * @param au the au
 * @return a new Alert instance
 */
public static Alert auAlert(Alert prototype, ArchivalUnit au) {
  Alert res = new Alert(prototype);
  res.setAttribute(ATTR_IS_CONTENT, true);
  if (au != null) {
    res.setAttribute(ATTR_AUID, au.getAuId());
    res.setAttribute(ATTR_AU_NAME, au.getName());
    // res.setAttribute(ATTR_AU_TITLE, au.getJournalTitle());
  }
  return res;
}
Hasher
A remote system can pull hash data for entire AUs or specific URLs. This would require a list of URLs to check.
# Block hashes from bpl-adpnet.jclc.org, 15:22:10 03/27/14
# AU: Birmingham Public Library Cartography Collection: Maps (000400-000599)
# Hash algorithm: SHA-1
# Encoding: Hex
4EFD9D9C88483D7125964F797EA01E03CB81EBBB   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
# end
Hashing an entire AU can take a while.
# Block hashes from bpl-adpnet.jclc.org, 15:11:51 03/27/14
# AU: Birmingham Public Library Cartography Collection: Maps (000400-000599)
# Hash algorithm: SHA-1
# Encoding: Hex
A36FE52C03437F63900ECA41332AF65DE5DC269F   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599
2299DC5525010D96AA8254AC8E43A9E3D392E0A1   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404
33BFA4735978365FC9C29DC7A745D180AF1D2BDE   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc
94C781F1294DF82188EFED4BC51844F9BB731A6A   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc/000404.mrc
DBAE90CF223EAFBB1955C459188776FEBA0E2CD2   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif
4EFD9D9C88483D7125964F797EA01E03CB81EBBB   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
EA59BC88313EFE0BB9656709C531635393CFE8E2   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt
6F94DAF6EBD8C92D06499EECADAA64D986465893   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.csv
8AECEF741646D287EDDF3C5AA394A7E6BEE6F9D0   http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.txt
ICP Server
Internet Cache Protocol (RFC 2186). See also Application of Internet Cache Protocol (RFC 2187).
Page 2 of RFC 2187
The essential difference between a parent and sibling is that a "neighbor hit" may be fetched from either one, but a "neighbor miss" may NOT be fetched from a sibling. In other words, in a sibling relationship, a cache can only ask to retrieve objects that the sibling already has cached, whereas the same cache can ask a parent to retrieve any object regardless of whether or not it is cached.
Establishing a trusted node to elevate to parent? On a per-AU basis? Global elevation would be problematic.
ICP parameters
org.lockss.icp.enabled []
    Comment: The ICP enabled flag parameter.
    Used in: org.lockss.proxy.icp.IcpManager
org.lockss.icp.interval [3600000 (1h0m0s)]
    Comment: The ICP watchdog interval parameter.
    Used in: org.lockss.proxy.icp.IcpManager
org.lockss.icp.port []
    Comment: The ICP port parameter.
    Used in: org.lockss.proxy.icp.IcpManager
org.lockss.icp.rate [400/100]
    Comment: The ICP rate-limiting string parameter.
    Used in: org.lockss.proxy.icp.IcpManager
org.lockss.icp.slow [true]
    Comment: The slow ICP string parameter.
    Used in: org.lockss.proxy.icp.IcpManager
"The ICP server responds to queries sent by other proxies and caches to let them know if this LOCKSS box has the content they are looking for. This is useful if you are integrating this box into an existing network structure with other proxies and caches that support ICP."
ICP Manager
RFC 2187
5.2.3. ICP_OP_HIT

If the cache reaches this point without already matching one of the previous opcodes, it means the request is allowed and we must determine if it will be HIT or MISS, so we check if the URL exists in the local cache. If so, and if the cached entry is fresh for at least the next 30 seconds, we can return an ICP_OP_HIT message. The stale/fresh determination uses the local refresh (or TTL) rules.

Note that a race condition exists for ICP_OP_HIT replies to sibling peers. The ICP_OP_HIT means that a subsequent HTTP request for the named URL would result in a cache hit. We assume that the HTTP request will come very quickly after the ICP_OP_HIT. However, there is a slight chance that the object might be purged from this cache before the HTTP request is received. If this happens, and the replying peer has applied Squid's `miss_access' configuration then the user will receive a very confusing access denied message.

5.2.3.1. ICP_OP_HIT_OBJ

Before returning the ICP_OP_HIT message, we see if we can send an ICP_OP_HIT_OBJ message instead. We can use ICP_OP_HIT_OBJ if:

o The ICP_OP_QUERY message had the ICP_FLAG_HIT_OBJ flag set.
o The entire object (plus URL) will fit in an ICP message. The maximum ICP message size is 16 Kbytes, but an application may choose to set a smaller maximum value for ICP_OP_HIT_OBJ replies.

Normally ICP replies are sent immediately after the query is received, but the ICP_OP_HIT_OBJ message cannot be sent until the object data is available to copy into the reply message. For Squid and Harvest this means the object must be "swapped in" from disk if it is not already in memory. Therefore, on average, an ICP_OP_HIT_OBJ reply will have higher latency than ICP_OP_HIT.
/**
 * <p>Processes an incoming ICP message.</p>
 * @param message An incoming message.
 */
protected void processMessage(IcpMessage message) {
  if (message.getOpcode() == IcpMessage.ICP_OP_QUERY) {
    IcpMessage response;
    try {
      // Prepare response
      if (!proxyManager.isIpAuthorized(message.getUdpAddress().getHostAddress())) {
        logger.debug2("processMessage: DENIED");
        response = message.makeDenied();
      } else {
        String urlString = message.getPayloadUrl();
        CachedUrl cu = pluginManager.findCachedUrl(urlString);
        if (cu != null && cu.hasContent() && !isClockssUnsubscribed(cu)) {
          logger.debug2("processMessage: HIT");
          response = message.makeHit();
        } else {
          logger.debug2("processMessage: MISS_NOFETCH");
          response = message.makeMissNoFetch();
        }
        AuUtil.safeRelease(cu);
      }
    } catch (IcpException ipe) {
      logger.debug2("processMessage: ERR", ipe);
      try {
        response = message.makeError();
      } catch (IcpException ipe2) {
        logger.debug2("processMessage: double exception", ipe2);
        return; // abort
      }
    }
Manipulating the Cache
Exporting Content
The LOCKSS repository manager provides a simple HTTP-based download of any content type on the LOCKSS server: link, copy or move the data to /cache0/gamma/tmp/export/ and visit the Web admin page http://bpl-adpnet.jclc.org:8081/ExportContent.
/cache0/gamma/tmp/export/ is emptied on daemon restart, unless file ownership is changed from LOCKSS.
Deleting Stale AUs
Repositories marked as 'Deleted' in the RepositoryTable can be removed from the cache by deleting the appropriate directory (i.e. /cache0/gamma/cache/f). AUs not marked as deleted should be removed through the Web UI first or, untested, by removing the appropriate lines from /cache0/gamma/config/au.txt. Then remove the directory in the cache.
Restart the daemon.
Moving AUs, Same Disk
The daemon fills in missing directories when adding AUs, so renaming a directory is not strictly necessary. Nevertheless, cache directories are incremented 'a-z', then 'a[a-z]', 'b[a-z]', etc. An AU is not referenced by a specific cache directory location; an AU is referenced by indicating the disk repository in au.txt (i.e. local\:/cache0/gamma, NOT local\:/cache0/gamma/f). To move an AU to a new location on the same disk, simply rename the directory.
Consider the following scenario, AUs exist at 'a', 'b', 'c', 'd', 'e'. After deleting AUs 'c' and 'd', cache/gamma looks like 'a', 'b', 'e'. I can rename directory 'e' to 'c' and the next AU added will be appropriately named 'd'.
Restart the daemon.
n.b. Wait until active crawls are completed before renaming directories; otherwise the crawler will continue to use the initial directories.
Moving AUs, Different Disk
Cache directories are incremented "a-z" then "a[a-z]", "b[a-z]", etc. First identify the next directory name on the new disk.
Locate and modify the appropriate lines defining the AU in /cache0/gamma/config/au.txt (e.g. repository=local\:/cache1/gamma changes to local\:/cache2/gamma). Move the directory from the original to the new disk (e.g. mv /cache1/gamma/cache/g /cache2/gamma/cache/gj)
Restart the daemon.
Copying AUs, Different Nodes
The repository manager of a node will not initiate a new content crawl to a peer node, and new content crawls are generally recognized as the only way to populate a node's cache. It is possible to copy a file system from one node and transfer it to another. Once the node is populated with data and the repository manager is coerced into believing a new content crawl was completed successfully, the repository manager will engage in normal peer polling for the archival unit. Modification or addition of data to the archival unit still requires the use of the publisher's content staging area with the original URL structure.
Understand File System Organization
File system organization in the cache is system dependent only to a depth of 4 from the root. Beyond that point, the file system organization depends on the original access URL of the archival unit data.
It is possible to populate a new node using a copy of the file system of a peer node. The peer node should have indicated a high-reliability factor of the cached data (100% agreement in a recent poll would be good). (I am shortcutting some steps under the assumption that the archival unit is in the add titles list of the LOCKSS Web UI, otherwise one would have to manually edit the au.txt file and create directories in the cache.)
/cache0/gamma/cache/m/bpldb.bplonline.org/http/adpn/load/Cartography/000400-000599/000404/tif/000404.tif/
When the archival unit is added to the node through the Web UI, the repository manager creates the next available directory based on the volume selected. (The base directory for the archival unit truncates at /cache0/gamma/cache/m/, the extra path in the example is to show URL structure dependence). The repository manager will then initiate a new content crawl to the start URL defined in the AU. Since the publisher has vacated the content staging area, the crawl will result in a 404 or other HTTP error. The changes to au.txt and creation of the /m/ directory in the file system are not rolled back.
Pack the AU
A tarred (compressed) package of an archival unit should be made on the peer node with content. For this example, the peer node had the archival unit in the file system at /cache2/gamma/cache/bj/.
tar -zpcvf /cache0/gamma/tmp/export/au_to_transfer.tar.gz /cache2/gamma/cache/bj/
There are a number of methods of exporting the content from the LOCKSS node, including FTP, but for this example I am utilizing the built-in HTTP export. After this command is executed, there will be a link called "au_to_transfer.tar.gz" in the ExportContent page of the LOCKSS Web UI.
Unpack the AU
The present working directory at this point should be /cache0/gamma/cache/. The appropriate command
tar -xvzf /path/to/au_to_transfer.tar.gz -C m --strip-components=4 --exclude \#agreement --exclude \#no_au_peers --exclude \#id_agreement.xml --exclude \#node_props --exclude \#au_id_file
Finish Up
The #node_props files will be generated at the LOCKSS daemon restart, which should be done at this time. By preserving the files #au_state.xml and #nodestate.xml, the repository manager thinks that a new content crawl has already been completed. A manually initiated V3 Poll on the archival unit should return appropriate URL agreement levels. The new LOCKSS node should now engage in normal content polling with the other peers.
Voting and Polling
Hashed UrlSets
AU released by LOCKSS with current content staging
# Block hashes from bpl-adpnet.jclc.org, 12:00:19 11/07/12
# AU: Birmingham Public Library Cartography Collection: Maps (000400-000599)
# Hash algorithm: SHA-1
# Encoding: Base64
IpncVSUBDZaqglSsjkOp49OS4KE=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/
M7+kc1l4Nl/Jwp3Hp0XRgK8dK94=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc
lMeB8SlN+CGI7+1LxRhE+btzGmo=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/mrc/000404.mrc
266QzyI+r7sZVcRZGId2/roOLNI=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif
Tv2dnIhIPXEllk95fqAeA8uB67s=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
6lm8iDE+/gu5ZWcJxTFjU5PP6OI=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt
b5Ta9uvYyS0GSZ7srapk2YZGWJM=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.csv
iuzvdBZG0oft3zxao5Sn5r7m+dA=  http://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/txt/000404.txt
# end
Manually created AU with limited copy of current content
# Block hashes from bpl-adpnet.jclc.org, 11:41:58 11/07/12
# AU: Birmingham Public Library Base Plugin, Base URL http://bpldb.bplonline.org/adpn/load/, Group test, Collection 000404-000404
# Hash algorithm: SHA-1
# Encoding: Base64
GZzfEqeNHw8WdjuZbX1g/VrJyaI=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/
JkkpA1r7E548EOiiu0KeBuQYfts=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/mrc
lMeB8SlN+CGI7+1LxRhE+btzGmo=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/mrc/000404.mrc
uVD1FzG1WN43ROniD/VfQb/P6pU=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/tif
Tv2dnIhIPXEllk95fqAeA8uB67s=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/tif/000404.tif
0bG54sCMTETu5qKnBA3RXO80v/0=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt
b5Ta9uvYyS0GSZ7srapk2YZGWJM=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt/000404.csv
iuzvdBZG0oft3zxao5Sn5r7m+dA=  http://bpldb.bplonline.org/adpn/load/test/000404-000404/000404/txt/000404.txt
# end
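The hash lines above can be reproduced outside the daemon. A minimal Java sketch, assuming the plain hash is simply SHA-1 over the raw content bytes (as the report header states); the real daemon drives its hashing through its own hasher classes, and the input here is illustrative only:

```java
import java.security.MessageDigest;
import java.util.Base64;

public class BlockHash {
    // SHA-1 the content bytes, then Base64-encode the 20-byte digest,
    // matching the "Hash algorithm: SHA-1 / Encoding: Base64" header.
    static String hash(byte[] content) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        return Base64.getEncoder().encodeToString(sha1.digest(content));
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input only; a real check would read the bytes of
        // the #content/current file for the URL in question.
        System.out.println(hash("example content".getBytes("UTF-8")));
    }
}
```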
VoteBlock
url=http\://bpldb.bplonline.org/adpn/load/Cartography/000400-000599/000404/tif/000404.tif
vn=I1dlZCBOb3YgMDcgMDM6NTg6NTYgQ1NUIDIwMTIKdW89MApmbz0wCnVsPTQ0NzkxMTEwCm5oPWw1S2NNcnFOdS9McVlobWE3RFNQcm13ZEhab1w9CmVycj1mYWxzZQpmbD00NDc5MTExMApwaD1UdjJkbkloSVBYRWxsazk1ZnFBZUE4dUI2N3NcPQo\=
vt=0
VoteBlock Versions (vn) decoded from Base64. Notice the ph (Plain Hash) for item 000404.tif matches both instances in the Hashed UrlSet.
#Wed Nov 07 03:58:56 CST 2012
uo=0
fo=0
ul=44791110
nh=l5KcMrqNu/LqYhma7DSPrmwdHZo\=
err=false
fl=44791110
ph=Tv2dnIhIPXEllk95fqAeA8uB67s\=
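The decoding step is just standard Base64 plus removal of the properties-file escape character. A small Java sketch, with the vn value copied from the VoteBlock above:

```java
import java.util.Base64;

public class DecodeVoteBlock {
    public static void main(String[] args) {
        // vn value from the VoteBlock above. The trailing "\=" is a
        // java.util.Properties escape, so strip the backslash before decoding.
        String vn = "I1dlZCBOb3YgMDcgMDM6NTg6NTYgQ1NUIDIwMTIKdW89MApmbz0wCnVsPTQ0"
                  + "NzkxMTEwCm5oPWw1S2NNcnFOdS9McVlobWE3RFNQcm13ZEhab1w9CmVycj1m"
                  + "YWxzZQpmbD00NDc5MTExMApwaD1UdjJkbkloSVBYRWxsazk1ZnFBZUE4dUI2"
                  + "N3NcPQo\\=";
        String decoded = new String(Base64.getDecoder().decode(vn.replace("\\", "")));
        System.out.println(decoded);  // the key=value lines shown above
    }
}
```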
Poll Variants
- Proof of Retrievability -- POR (default)
- Proof of Possession -- POP
- Local -- Local
Symmetric variants
- Symmetric Proof of Retrievability -- Symmetric_POR
- Symmetric Proof of Possession -- Symmetric_POP
Parameters controlling actions
org.lockss.poll.v3.modulus [0]
  Comment: Override default setting of modulus to force PoP polls for testing
  Used in: org.lockss.poller.v3.V3Poller
org.lockss.poll.v3.enableLocalPolls [false]
  Comment: If true, enable local polls (i.e. polls that do not invite any voters but depend on local hashes)
  Used in: org.lockss.poller.v3.V3Poller
org.lockss.poll.v3.enablePoPPolls [false]
  Comment: If true, enable sampled (Proof of Possession) polls
  Used in: org.lockss.poller.v3.V3Poller
org.lockss.poll.v3.enablePoPVoting [true]
  Comment: If true, can vote in a Proof of Possession poll - for testing
  Used in: org.lockss.poller.v3.V3Voter
org.lockss.poll.v3.enableSymmetricPolls [false]
  Comment: If true, can request a symmetric poll
  Used in: org.lockss.poller.v3.V3Voter
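If PoP or symmetric polls were ever to be enabled on a PLN, it would be a matter of setting these parameters in the network or local daemon configuration. A sketch only; the values below are illustrative and are not this network's settings:

```properties
# Illustrative values, not the ADPNet configuration
org.lockss.poll.v3.enablePoPPolls=true
org.lockss.poll.v3.enablePoPVoting=true
org.lockss.poll.v3.enableSymmetricPolls=true
org.lockss.poll.v3.modulus=0
```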
Polling introductions by version Release Notes
LOCKSS Daemon 1.62.4
- Initial implementation of "local" polls, which use local hashes to more quickly detect corruption and focus full polls where they are most needed.
- Initial implementation of Proof of Possession (PoP) polls, which establish rights to receive repairs from peers with less work than full polls (now called Proof of Retrievability polls).
LOCKSS Daemon 1.61.5
- Symmetric polling: The poller now exchanges hashes bidirectionally with each voter, potentially doubling the rate at which nodes may prove possession and thus become willing repairers for each other.
Proof of Retrievability
The default V3 polling logic. Votes consist of an AU's entire hashed UrlSet.
Proof of Possession
POP polls are intended to be a quicker verification of cached content by requesting only a sample of the URLs in an AU. The time required for the voter to hash and return the data is reduced, and the poll tallier can return a result much more quickly.
POP polls are not enabled by default in this PLN.
See SampleBlockHasher.java and V3Poller.java for more detail.
Local Polls
Local polls do not invite voters; local hashes are used instead.
BlockTally
BlockTally.java enumerates poll results including TOO_CLOSE and NOQUORUM.
TOO_CLOSE
Configuration parameter
org.lockss.poll.v3.voteMargin [75]
Need to investigate this further, particularly conflicts with quorum.
Does this mean a poll of 5 voters with 3 votes in agreement is invalid because (3/5)*100 = 60?
The voteMargin input parameter comes from configuration or the default (75).
boolean isWithinMargin(int voteMargin) {
    int numAgree = agreeVoters.size();
    int numDisagree = disagreeVoters.size();
    double numVotes = numVotes();
    double actualMargin;
    if (numAgree > numDisagree) {
        actualMargin = (double) numAgree / numVotes;
    } else {
        actualMargin = (double) numDisagree / numVotes;
    }
    if (actualMargin * 100 < voteMargin) {
        return false;
    }
    return true;
}
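Plugging the numbers from the question into the same arithmetic suggests the answer is yes: with 5 voters and 3 in agreement, the larger side holds only 60%, which is below the default margin of 75, so isWithinMargin() returns false. A quick standalone sketch of that calculation:

```java
public class MarginCheck {
    public static void main(String[] args) {
        // Same arithmetic as isWithinMargin() above: 5 voters, 3 in agreement
        int numAgree = 3;
        int numDisagree = 2;
        double numVotes = numAgree + numDisagree;
        double actualMargin = (double) Math.max(numAgree, numDisagree) / numVotes;
        System.out.println(actualMargin * 100);       // 60.0
        System.out.println(actualMargin * 100 < 75);  // true -> below margin, TOO_CLOSE
    }
}
```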
NOQUORUM
Configuration parameter
org.lockss.poll.v3.quorum [5]
ADPNet V3Poll quorum configuration is 3.
The polling minimum is 3 voters, not 3 plus the caller.
11:15:56.170: Debug: 10-V3Poller: [InvitationCallback] Enough peers are participating (3+0)
Proxies
LOCKSS Proxy Servers
- Content Proxy - Serve order : publisher, cache
- Audit Proxy - Serve order : cache
Proxy Services
- crawl_proxy AU title list parameter
- Squid Proxy
Proxy Headers
A LOCKSS node with an audit proxy can serve as a crawl proxy host. The audit proxy service returns the original header information (current.props) as well as LOCKSS-specific headers. When a crawl proxy is specified to populate a new node, the additional information in current.props does not affect polling agreement.
Web Services
DaemonStatusService
Starting with LOCKSS Daemon 1.61.5.
A new LOCKSS user should be created for SOAP calls. This user needs no explicit permissions; a user without explicit permissions has read-only access to the status tables. LOCKSS should be set to Basic authentication; Forms authentication using SOAP headers does not work as-is.
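Since the service expects Basic authentication, a SOAP client only needs to send an Authorization header alongside the envelope. A minimal Java sketch of building that header value; the user name and password here are placeholders, not real credentials:

```java
import java.util.Base64;

public class BasicAuthHeader {
    // Build the HTTP Basic Authorization header value: "Basic " followed by
    // the Base64 encoding of "user:password".
    static String authHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(credentials.getBytes());
    }

    public static void main(String[] args) {
        // Placeholder credentials; substitute the read-only SOAP user
        System.out.println(authHeader("debug", "example"));
    }
}
```

The resulting value goes in the Authorization header of the HTTP POST to the /ws/DaemonStatusService endpoint, with Content-Type: text/xml.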
WSDL Service Reference : http://bpl-adpnet.jclc.org:8081/ws/DaemonStatusService?wsdl
isDaemonReady()
Client Request. This call can throw a server exception if the service has not started; it is best to implement an async callback that waits for the server to be ready.
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:stat="http://status.ws.lockss.org/">
  <soapenv:Header/>
  <soapenv:Body>
    <stat:isDaemonReady/>
  </soapenv:Body>
</soapenv:Envelope>
DaemonStatusService Response
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:isDaemonReadyResponse xmlns:ns2="http://status.ws.lockss.org/">
      <return>true</return>
    </ns2:isDaemonReadyResponse>
  </soap:Body>
</soap:Envelope>
getAuIds()
Client Request
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:stat="http://status.ws.lockss.org/">
  <soapenv:Header/>
  <soapenv:Body>
    <stat:getAuIds/>
  </soapenv:Body>
</soapenv:Envelope>
DaemonStatusService Response
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getAuIdsResponse xmlns:ns2="http://status.ws.lockss.org/">
      <return>
        <id>org|bplonline|adpn|BirminghamPublicLibraryBasePlugin&base_url~http%3A%2F%2Fbpldb%2Ebplonline%2Eorg%2Fadpn%2Fload%2F&collection~000200-000399&group~Cartography</id>
        <name>Birmingham Public Library Cartography Collection: Maps (000200-000399)</name>
      </return>
      <return>
        <id>org|bplonline|adpn|BirminghamPublicLibraryBasePlugin&base_url~http%3A%2F%2Fbpldb%2Ebplonline%2Eorg%2Fadpn%2Fload%2F&collection~000400-000599&group~Cartography</id>
        <name>Birmingham Public Library Cartography Collection: Maps (000400-000599)</name>
      </return>
    </ns2:getAuIdsResponse>
  </soap:Body>
</soap:Envelope>
getAuStatus()
Client Request
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:stat="http://status.ws.lockss.org/">
  <soapenv:Header/>
  <soapenv:Body>
    <stat:getAuStatus>
      <!--Optional:-->
      <auId>org|bplonline|adpn|BirminghamPublicLibraryBasePlugin&base_url~http%3A%2F%2Fbpldb%2Ebplonline%2Eorg%2Fadpn%2Fload%2F&collection~000400-000599&group~Cartography</auId>
    </stat:getAuStatus>
  </soapenv:Body>
</soapenv:Envelope>
DaemonStatusService Response
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns2:getAuStatusResponse xmlns:ns2="http://status.ws.lockss.org/">
      <return>
        <accessType>Subscription</accessType>
        <availableFromPublisher>true</availableFromPublisher>
        <contentSize>9956083306</contentSize>
        <crawlPool>org.bplonline.adpn.BirminghamPublicLibraryBasePlugin</crawlPool>
        <creationTime>1347306830000</creationTime>
        <currentlyCrawling>false</currentlyCrawling>
        <currentlyPolling>false</currentlyPolling>
        <diskUsage>9992478720</diskUsage>
        <journalTitle>Birmingham Public Library Cartography Collection</journalTitle>
        <lastCompletedCrawl>1370972419167</lastCompletedCrawl>
        <lastCompletedPoll>1367421686137</lastCompletedPoll>
        <lastCrawl>1370969136912</lastCrawl>
        <lastCrawlResult>Successful</lastCrawlResult>
        <lastPoll>1367421131743</lastPoll>
        <lastPollResult>Complete</lastPollResult>
        <pluginName>Birmingham Public Library Base Plugin</pluginName>
        <publisher>Birmingham Public Library</publisher>
        <recentPollAgreement>0.9841897487640381</recentPollAgreement>
        <repository>/cache0/gamma/cache/m/</repository>
        <status>100.00% Agreement</status>
        <substanceState>Unknown</substanceState>
        <volume>Birmingham Public Library Cartography Collection: Maps (000400-000599)</volume>
      </return>
    </ns2:getAuStatusResponse>
  </soap:Body>
</soap:Envelope>
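Client code can pull individual fields out of this response with any XML parser. A Java DOM sketch against an abbreviated copy of the getAuStatus() response; a real client would parse the bytes of the HTTP response instead of an inline string:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseAuStatus {
    public static void main(String[] args) throws Exception {
        // Abbreviated copy of the getAuStatus() response shown above
        String xml =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soap:Body>"
          + "<ns2:getAuStatusResponse xmlns:ns2=\"http://status.ws.lockss.org/\">"
          + "<return>"
          + "<recentPollAgreement>0.9841897487640381</recentPollAgreement>"
          + "<status>100.00% Agreement</status>"
          + "</return>"
          + "</ns2:getAuStatusResponse>"
          + "</soap:Body></soap:Envelope>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Grab the agreement value by element name
        double agreement = Double.parseDouble(
            doc.getElementsByTagName("recentPollAgreement").item(0).getTextContent());
        System.out.println(agreement);
    }
}
```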
Content Service
Content Service utilizes MTOM encoding to transfer binary data from the LOCKSS node to the client.
.NET Client Configuration
MTOM encoding is a breaking change for .NET 2.0 Web Service models. .NET clients should instead utilize the WCF Service Model.
Issues with the BasicHttpBinding
- KeepAlive is not configurable, so multiple targets cannot be accessed within the same application instance
Issues with WsHttpBinding
- Basic transport, no SSL
CustomBinding solution
<bindings>
  <customBinding>
    <binding name="ContentServiceCustomBinding" ...>
      <mtomMessageEncoding messageVersion="Soap11" />
      <httpTransport ... authenticationScheme="Basic" keepAliveEnabled="false" />
    </binding>
  </customBinding>
</bindings>
Versioning
Starting with LOCKSS Daemon 1.68, the Content Service includes methods to access versioning information.
- List<FileWsResult> getVersions(String url, String auid) : Provides a collection of objects, one per existing version, containing the version number, the file size, and the collection date.
- ContentResult fetchVersionedFile(String url, String auid, Integer version) : Similar to the existing fetchFile(String url, String auid), but returns the specified version instead of the latest one.