Difference between revisions of "HOWTO: Package files for staging on the Drop Server"

 
(13 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
Here are some quick practical suggestions on how to package up files into a LOCKSS [[Archival Unit]], and how to get the [[AU]] staged on the [[drop server]] for harvest and consumption by LOCKSS.
 
Here are some quick practical suggestions on how to package up files into a LOCKSS [[Archival Unit]], and how to get the [[AU]] staged on the [[drop server]] for harvest and consumption by LOCKSS.
  
=== Start with a Folder ===
+
=== First, Start with a Folder ===
  
'''Step 1.''' The materials you will be preserving '''SHOULD''' start out as files packaged into a directory tree under one top-level folder.
+
'''STEP 1.''' The materials you will be preserving '''SHOULD''' start out as files packaged into a directory tree under one top-level folder.
  
 
* '''FORMAT:''' The directory can be organized into any hierarchical structure of folders and subfolders you want, so long as it’s all stored underneath one top-level folder.
 
* '''FORMAT:''' The directory can be organized into any hierarchical structure of folders and subfolders you want, so long as it’s all stored underneath one top-level folder.
Line 19: Line 19:
 
   - Q0000150500.tif
 
   - Q0000150500.tif
  
* '''EXCEPTION:''' If any of the files in your subdirectories happen to be named <code>index.html</code> or <code>index.htm</code>, this is a special case that requires certain precautions to make sure that all your content gets correctly harvested. Check with me for details on how to deal with this.
+
* '''EXCEPTION:''' If any of the files in your subdirectories happen to be named <code>index.html</code> or <code>index.htm</code>, this is a special case that requires certain precautions to make sure that all your content gets correctly harvested. ''Check with the ADPNet TPC for details on how to deal with this.''
* '''RECOMMENDATION:''' The number and size of files is up to you, but there are some practical constraints based on network capacity. It’s probably best practice to divvy up assets into AUs that will contain LESS THAN 1,000,000 or so individual files, and LESS THAN 1 TiB of data per AU. (The LOCKSS network can in fact handle very, very large AUs, and ADPNet is currently preserving AUs that are larger than these suggested, fairly arbitrary limits. But (1) nodes like that take forever to upload and crawl, which means that it’s a much slower turnaround time for you before we can confirm that they are preserved in the network; and (2) nodes like that also make some practical systems administration tasks more of a pain in the neck for the people who run the [[preservation nodes]].
+
* '''RECOMMENDATION:''' The number and size of files is up to you, but there are some practical constraints based on network capacity. Best practice is to divvy up assets into AUs that will contain '''LESS THAN 1,000,000 or so individual files,''' and '''LESS THAN 1 TiB of data''' per AU. The LOCKSS network can in fact handle very, very large AUs, and ADPNet is currently preserving AUs that are larger than these suggested, fairly arbitrary limits. But (1) nodes like that take forever to upload and crawl, which means that it’s a much slower turnaround time for you before we can confirm that they are preserved in the network; and (2) nodes like that also make some practical systems administration tasks more of a pain in the neck for the people who run the [[preservation nodes]].
  
=== Name the Folder ===
+
=== Second, Name the Folder ===
  
'''Step 2.''' Your top-level folder can have almost any name, but the name '''MUST''' be unique among all the AUs you will ever upload.
+
'''STEP 2.''' Your top-level folder can have almost any name, but the name '''MUST''' be unique among all the AUs you will ever upload.
  
 
* Once you ingest an AU, you '''SHOULD NOT''' re-use that directory name unless you actually intend to replace the old materials with the new materials. You need a new name to ingest new AUs.
 
* Once you ingest an AU, you '''SHOULD NOT''' re-use that directory name unless you actually intend to replace the old materials with the new materials. You need a new name to ingest new AUs.
Line 30: Line 30:
 
* '''EXAMPLE:''' ADAH has a bunch of Q-Number Masters files to stage for ingest, so we give the directory a name unique to those contents, for example: Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (so, next time we upload materials, we upload them under a new directory with the next numbers in the sequence, Digitization-Masters-Q-numbers-Master-Q0000105501_Q0000106000m).
 
* '''EXAMPLE:''' ADAH has a bunch of Q-Number Masters files to stage for ingest, so we give the directory a name unique to those contents, for example: Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (so, next time we upload materials, we upload them under a new directory with the next numbers in the sequence, Digitization-Masters-Q-numbers-Master-Q0000105501_Q0000106000m).
  
=== Bag the Folder ===
+
=== Third, Bag the Folder ===
  
'''Step 3.''' Once you have your top-level folder prepared and named, you '''SHOULD''' enclose the folder in a BagIt formatted directory.
+
'''STEP 3.''' Once you have your top-level folder prepared and named, you '''SHOULD''' enclose the folder in a BagIt formatted directory.
  
 
* You can do this easily using an open-source Python script ([https://github.com/LibraryOfCongress/bagit-python BagIt-Python]) or using the Bagger application ([https://github.com/LibraryOfCongress/bagger Bagger]).
 
* You can do this easily using an open-source Python script ([https://github.com/LibraryOfCongress/bagit-python BagIt-Python]) or using the Bagger application ([https://github.com/LibraryOfCongress/bagger Bagger]).
Line 45: Line 45:
 
   python bagit.py --validate ${DIRNAME}
 
   python bagit.py --validate ${DIRNAME}
  
=== Prepare a LOCKSS Manifest and Drop It Into the Top Level of the Bag ===
+
=== Fourth, Prepare a LOCKSS Manifest and Drop It Into the Top Level of the Bag ===
  
'''Step 4.''' Once you have your top-level folder prepared, named, and bagged, you '''MUST''' create a small HTML file named <code>manifest.html</code> and drop it into the top-level directory alongside <code>baginfo.txt</code>, <code>bagit.txt</code>, etc. The required format for this manifest file is very simple, but the best way to create it is still to use a tool like the [https://archives.alabama.gov/Services/ADPnet/MakeManifest/ ADAH Make Manifest web form] or the [https://gitlab.com/adpnet/adpn-cli adpn-cli command-line tool].
+
'''STEP 4.''' Once you have your top-level folder prepared, named, and bagged, you '''MUST''' create a small HTML file named <code>manifest.html</code> and drop it into the top-level directory alongside <code>baginfo.txt</code>, <code>bagit.txt</code>, etc. The required format for this manifest file is very simple, but the best way to create it is still to use a tool like the [http://adpn.org/services/MakeManifest/ ADAH Make Manifest web form] or the [https://gitlab.com/adpnet/adpn-cli adpn-cli command-line tool].
  
 
* '''FORMAT:''' The <code>manifest.html</code> file needs to include a link to your AU’s location on the drop server, some boilerplate HTML, and some boilerplate language that gives the LOCKSS daemon permission to harvest content. This is a bit of a pain in the neck and the format is under-documented, but LOCKSS won’t ingest your AU unless it includes a file like this with the correct URL and the correct boilerplate language.
 
* '''FORMAT:''' The <code>manifest.html</code> file needs to include a link to your AU’s location on the drop server, some boilerplate HTML, and some boilerplate language that gives the LOCKSS daemon permission to harvest content. This is a bit of a pain in the neck and the format is under-documented, but LOCKSS won’t ingest your AU unless it includes a file like this with the correct URL and the correct boilerplate language.
Line 69: Line 69:
 
* '''RECOMMENDATION:''' You can access a version of the  templates I use if you go to this URL:
 
* '''RECOMMENDATION:''' You can access a version of the  templates I use if you go to this URL:
  
   [https://archives.alabama.gov/Services/ADPnet/MakeManifest/ archives.alabama.gov/Services/ADPnet/MakeManifest/]
+
   [http://adpn.org/services/MakeManifest/ adpn.org/services/MakeManifest/]
  
 
* Fill in the form fields with your own information. Some notes:
 
* Fill in the form fields with your own information. Some notes:
 
** The "Institution Code" is the username you’ve been assigned (for example, <code>adah</code> or <code>tsk</code>)
 
** The "Institution Code" is the username you’ve been assigned (for example, <code>adah</code> or <code>tsk</code>)
 +
** The "Institution Name" is the human-readable name for your institution, possibly followed by an alphanumeric code used by ADPNet (for example "Alabama Department of Archives and History (ADAH)"). If your institution is not listed as a recognized publisher of AUs, contact ADPNet TPC to be added.
 
** The "Directory Name" is the unique name you chose in step 2.
 
** The "Directory Name" is the unique name you chose in step 2.
 
** The "File Size" field should contain the human-readable summary information that you would get from <code>File > Properties</code> or from a command-line tool like <code>du</code>. Ideally, this should include both the total cumulative size of all the files in the AU (in units like MiB, GiB, TiB, ...) and also the total count of files in the AU.
 
** The "File Size" field should contain the human-readable summary information that you would get from <code>File > Properties</code> or from a command-line tool like <code>du</code>. Ideally, this should include both the total cumulative size of all the files in the AU (in units like MiB, GiB, TiB, ...) and also the total count of files in the AU.
 
** The "Staging Area Base URL" and the "LOCKSS Plugin" fields are pre-filled with standard default values, which should not need to be changed as long as you are working with our drop server.
 
** The "Staging Area Base URL" and the "LOCKSS Plugin" fields are pre-filled with standard default values, which should not need to be changed as long as you are working with our drop server.
* Use the **CREATE FILE** button to generate HTML for your manifest file. You can save the results using <code>File > Save</code> or by copying and pasting the HTML source code into a text editor. Make sure the results are saved as <code>manifest.html</code> and then place that file in the top level of the BagIt-formatted directory, alongside <code>baginfo.txt</code>, <code>bagit.txt</code>, etc.
+
* Use the '''CREATE FILE''' button to generate HTML for your manifest file. You can save the results using <code>File > Save</code> or by copying and pasting the HTML source code into a text editor. Make sure the results are saved as <code>manifest.html</code> and then place that file in the top level of the BagIt-formatted directory, alongside <code>baginfo.txt</code>, <code>bagit.txt</code>, etc.
  
=== Upload Your Archival Unit to Your Drop Server Staging Area ===
+
=== Fifth, Upload Your Archival Unit to Your Drop Server Staging Area ===
  
 
'''STEP 5.''' At this point your AU should be packaged and ready to be considered as an [[Archival Unit]] (AU) by the LOCKSS daemon.
 
'''STEP 5.''' At this point your AU should be packaged and ready to be considered as an [[Archival Unit]] (AU) by the LOCKSS daemon.

Latest revision as of 15:30, 19 May 2022

Here is a checklist based on a write-up by Charles Johnson on 19 February 2021.

Check List

Here are some quick practical suggestions on how to package up files into a LOCKSS Archival Unit, and how to get the AU staged on the drop server for harvest and consumption by LOCKSS.

First, Start with a Folder

STEP 1. The materials you will be preserving SHOULD start out as files packaged into a directory tree under one top-level folder.

  • FORMAT: The directory can be organized into any hierarchical structure of folders and subfolders you want, so long as it’s all stored underneath one top-level folder.
  • EXAMPLE: ADAH is preserving high-quality digital masters from our ongoing scanning projects in a set of directories called Q-Numbers Masters files. Each directory contains 500 TIFF files (with poetic and evocative names like Q0000150001.tif), along with some additional files for metadata pertaining to that package of files. When we start out, the directory that we're going to package for upload looks like this:
  Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m\
  - Q0000150001.tif
  - Q0000150002.tif
  - Q0000150003.tif
  […]
  - Q0000150500.tif
  • EXCEPTION: If any of the files in your subdirectories happen to be named index.html or index.htm, this is a special case that requires certain precautions to make sure that all your content gets correctly harvested. Check with the ADPNet TPC for details on how to deal with this.
  • RECOMMENDATION: The number and size of files are up to you, but there are some practical constraints based on network capacity. Best practice is to divvy up assets into AUs that will contain LESS THAN 1,000,000 or so individual files, and LESS THAN 1 TiB of data per AU. The LOCKSS network can in fact handle very, very large AUs, and ADPNet is currently preserving AUs that are larger than these suggested, fairly arbitrary limits. But (1) AUs that large take a very long time to upload and crawl, which means a much slower turnaround for you before we can confirm that they are preserved in the network; and (2) AUs that large also make some practical systems administration tasks more of a pain in the neck for the people who run the preservation nodes.
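One way to check a folder against the suggested limits before packaging it is to walk the tree and tally files and bytes. The following is a minimal sketch using only the Python standard library; the function names and constants are illustrative and are not part of any ADPNet tooling.

```python
import os

# Suggested (not hard) ceilings taken from the recommendation above.
MAX_FILES = 1_000_000
MAX_BYTES = 1024 ** 4  # 1 TiB


def summarize_tree(top):
    """Return (file_count, total_bytes) for every file under `top`."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes


def within_suggested_limits(file_count, total_bytes):
    """True when both tallies fall below the suggested ceilings."""
    return file_count < MAX_FILES and total_bytes < MAX_BYTES
```

Calling `summarize_tree` on the top-level folder and passing the two results to `within_suggested_limits` gives a quick yes/no answer. A folder that exceeds the limits can still be ingested, but splitting it into several AUs will probably shorten the turnaround.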

Second, Name the Folder

STEP 2. Your top-level folder can have almost any name, but the name MUST be unique among all the AUs you will ever upload.

  • Once you ingest an AU, you SHOULD NOT re-use that directory name unless you actually intend to replace the old materials with the new materials. You need a new name to ingest new AUs.
  • EXAMPLE: ADAH has a bunch of Q-Numbers Masters files to stage for ingest, so we give the directory a name unique to those contents, for example: Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (so, next time we upload materials, we upload them under a new directory with the next numbers in the sequence, Digitization-Masters-Q-numbers-Master-Q0000150501_Q0000151000m).
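If your directory names follow a numbered-range convention like the ADAH example, the next name in the sequence can be derived mechanically rather than typed by hand. This is only a sketch: the name pattern (a prefix, then `Q<digits>_Q<digits>m`) and the batch size of 500 are assumptions drawn from the example above, and `next_directory_name` is a hypothetical helper, not an existing tool.

```python
import re

# Assumed naming convention: "<prefix>Q<digits>_Q<digits>m"
RANGE_PATTERN = re.compile(r"^(?P<prefix>.*)Q(?P<start>\d+)_Q(?P<end>\d+)m$")


def next_directory_name(current, batch_size=500):
    """Build the name for the batch that follows `current`."""
    match = RANGE_PATTERN.match(current)
    if match is None:
        raise ValueError(f"name does not follow the expected pattern: {current!r}")
    width = len(match.group("end"))          # keep the zero padding
    new_start = int(match.group("end")) + 1  # first number after the old range
    new_end = new_start + batch_size - 1
    prefix = match.group("prefix")
    return f"{prefix}Q{new_start:0{width}d}_Q{new_end:0{width}d}m"


def is_unused(name, previously_used):
    """True when `name` has not already been used for an AU."""
    return name not in set(previously_used)
```

Keeping your own list of names that have already been ingested, and checking each new name against it with something like `is_unused`, is a simple guard against accidentally replacing an existing AU.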

Third, Bag the Folder

STEP 3. Once you have your top-level folder prepared and named, you SHOULD enclose the folder in a BagIt-formatted directory.

  • You can do this easily using an open-source Python script (BagIt-Python) or using the Bagger application (Bagger).
  • EXAMPLE: When a Q-numbers directory is ready to be bagged at ADAH, I open Windows PowerShell, then I run:
 python bagit.py ${DIRNAME}
  
  • This encloses the directory with BagIt preservation data. As a result, Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m (for example) is now reorganized so that the top-level folder contains a single “payload” subdirectory, called data, that contains the 500+ TIFFs and associated metadata files, and then a set of small text files (bag-info.txt, bagit.txt, manifest-sha256.txt, tagmanifest-sha256.txt, etc.) that provide a manifest and checksums for those payload files, along with some metadata about the packaging process.
  • If I want to validate the contents of the preservation package before I upload it, I can do that by running this on the same directory:
 python bagit.py --validate ${DIRNAME}
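To make it clearer what the checksum part of validation is doing, here is a rough standard-library sketch that re-computes the SHA-256 of each payload file listed in manifest-sha256.txt and reports any mismatches. It is not a substitute for `bagit.py --validate`: it assumes the usual "checksum, whitespace, relative path" manifest lines, it does not handle percent-encoded path characters, and it does not look for payload files that are missing from the manifest. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large TIFFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_payload_mismatches(bag_dir):
    """Return the manifest paths whose file is missing or has a different hash."""
    bag_dir = Path(bag_dir)
    mismatches = []
    with open(bag_dir / "manifest-sha256.txt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            expected, relative_path = line.split(None, 1)
            target = bag_dir / relative_path
            if not target.is_file() or sha256_of(target) != expected.lower():
                mismatches.append(relative_path)
    return mismatches
```

An empty list from `find_payload_mismatches` means every listed payload file is present and matches its recorded checksum.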

Fourth, Prepare a LOCKSS Manifest and Drop It Into the Top Level of the Bag

STEP 4. Once you have your top-level folder prepared, named, and bagged, you MUST create a small HTML file named manifest.html and drop it into the top-level directory alongside bag-info.txt, bagit.txt, etc. The required format for this manifest file is very simple, but the best way to create it is still to use a tool like the ADAH Make Manifest web form or the adpn-cli command-line tool.

  • FORMAT: The manifest.html file needs to include a link to your AU’s location on the drop server, some boilerplate HTML, and some boilerplate language that gives the LOCKSS daemon permission to harvest content. This is a bit of a pain in the neck and the format is under-documented, but LOCKSS won’t ingest your AU unless it includes a file like this with the correct URL and the correct boilerplate language.
  • EXAMPLE: After I’ve bagged a Q-Numbers directory, I generate a manifest.html file using a script to fill in a standard template with information about the AU I’m about to upload. I place the file in the top level of the BagIt-formatted directory, alongside bag-info.txt, bagit.txt, etc. So now my directory looks like:
 Digitization-Masters-Q-numbers-Master-Q0000150001_Q0000150500m\
 - bagit.txt
 - bag-info.txt
 - manifest.html
 - manifest-sha256.txt
 - manifest-sha512.txt
 - tagmanifest-sha256.txt
 - tagmanifest-sha512.txt
 - data\
       - Q0000150001.tif
       - Q0000150002.tif
       - Q0000150003.tif
       […]
       - Q0000150500.tif
  • RECOMMENDATION: You can access a version of the templates I use if you go to this URL:
 adpn.org/services/MakeManifest/
  • Fill in the form fields with your own information. Some notes:
    • The "Institution Code" is the username you’ve been assigned (for example, adah or tsk).
    • The "Institution Name" is the human-readable name for your institution, possibly followed by an alphanumeric code used by ADPNet (for example "Alabama Department of Archives and History (ADAH)"). If your institution is not listed as a recognized publisher of AUs, contact ADPNet TPC to be added.
    • The "Directory Name" is the unique name you chose in step 2.
    • The "File Size" field should contain the human-readable summary information that you would get from File > Properties or from a command-line tool like du. Ideally, this should include both the total cumulative size of all the files in the AU (in units like MiB, GiB, TiB, ...) and also the total count of files in the AU.
    • The "Staging Area Base URL" and the "LOCKSS Plugin" fields are pre-filled with standard default values, which should not need to be changed as long as you are working with our drop server.
  • Use the CREATE FILE button to generate HTML for your manifest file. You can save the results using File > Save or by copying and pasting the HTML source code into a text editor. Make sure the results are saved as manifest.html and then place that file in the top level of the BagIt-formatted directory, alongside bag-info.txt, bagit.txt, etc.
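For the "File Size" field, a short helper can turn a byte total and a file count into a human-readable summary. This is a sketch only: the exact wording of the summary string is an assumption, since the form accepts free text, and the function names are illustrative.

```python
def human_readable_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == units[-1]:
            break
        size /= 1024
    if unit == "B":
        return f"{int(size)} B"
    return f"{size:.1f} {unit}"


def file_size_summary(num_bytes, file_count):
    """Combine total size and file count into one line for the form field."""
    return f"{human_readable_size(num_bytes)} ({file_count:,} files)"
```

For example, `file_size_summary(5 * 1024 ** 3, 501)` produces `5.0 GiB (501 files)`, which covers both pieces of information the field ideally contains.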

Fifth, Upload Your Archival Unit to Your Drop Server Staging Area

STEP 5. At this point your directory should be packaged and ready to be treated as an Archival Unit (AU) by the LOCKSS daemon.

  • Use WinSCP (or any other SFTP tool that you like) to upload the whole packaged-up directory to drop.adpn.org, storing it under the drop_au_content_in_here subdirectory of your staging area.
  • Notify ADPNet TPC to let us know you’re ready to go ahead with the ingest.
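Before uploading, a quick local check of the package layout can catch a missing manifest.html or a folder that was never bagged. The sketch below only looks for the items shown in the directory listing in step 4 (bagit.txt, manifest.html, a data subdirectory, and at least one manifest-*.txt file); it does not inspect the contents of manifest.html, and `missing_bag_items` is a hypothetical helper name.

```python
from pathlib import Path


def missing_bag_items(bag_dir):
    """Return the expected top-level items that are absent from `bag_dir`."""
    bag_dir = Path(bag_dir)
    missing = []
    # Files that must sit directly in the top level of the bag.
    for name in ("bagit.txt", "manifest.html"):
        if not (bag_dir / name).is_file():
            missing.append(name)
    # The payload directory created by the bagging step.
    if not (bag_dir / "data").is_dir():
        missing.append("data/")
    # At least one payload manifest, for example manifest-sha256.txt.
    if not any(bag_dir.glob("manifest-*.txt")):
        missing.append("manifest-<algorithm>.txt")
    return missing
```

An empty list means the package has the expected top-level shape; anything else names what still needs to be added before the upload.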