Partitioning Cache Data

Latest revision as of 08:13, 14 October 2013

Why Partition

Partitioned Cost Reductions

Quorum in the network is 3. (This is down from the previous value of 4... find out why).

Any partitioning strategy would need to be implemented at the titledb.xml level. Title AUs can be assigned to peers centrally, and each peer should receive a custom titledb.xml file. If LOCKSS is unwilling to support that, there is an alternative using local parameters:

org.lockss.titleDbs = http://bpldb.bplonline.org/etc/adpn/titledb-local.xml

n.b. An online interface exists but access control isn't defined yet. Contact Tobin for details.

PLNs can reduce the current burden on storage capacity, and incorporate new members more easily, by establishing a baseline for how many copies of the data in the network are sufficient. Data should then be partitioned so that each AU is held on exactly that number of nodes. For this exercise, I have looked at implementing a 1 + 5 baseline: the AU owner plus 5 other network nodes. This strategy results in 6 copies of the data in an 8 node network, and the average storage burden per node is reduced by 25% with no new nodes added to the network. Adding 2 nodes to the network and implementing 1 + 5 reduces the burden by 40% per node on average. Adding 4 nodes to the network and implementing 1 + 5 reduces the burden by 50% per node on average.
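The averages quoted above follow from simple arithmetic if AUs are spread evenly: each node holds copies / nodes of the collection, so the saving relative to every node holding everything is 1 - copies / nodes. A minimal sketch of that calculation (the class and method names are illustrative, and the even spread is an idealization; the per-node figures in the table below vary because AU sizes differ):

```csharp
using System;

public static class PartitionMath
{
    // Fraction of storage saved per node, on average, compared with
    // full replication (every node holding every AU).
    // Assumes AUs are spread evenly across nodes.
    public static double AverageReduction(int copiesPerAu, int nodeCount)
    {
        if (copiesPerAu < 1 || nodeCount < copiesPerAu)
            throw new ArgumentException("need 1 <= copiesPerAu <= nodeCount");
        return 1.0 - (double)copiesPerAu / nodeCount;
    }
}
```

With 6 copies (1 + 5), AverageReduction(6, 8) gives 0.25, AverageReduction(6, 10) gives 0.40, and AverageReduction(6, 12) gives 0.50, matching the 25%, 40% and 50% figures.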

Extant Storage Burden in TB

Node   Base   1 + 5, by number of additional nodes
              +0     +1     +2     +3     +4
ADAH   5.87   3.79   3.42   2.64   2.47   2.37
AUB    5.87   4.66   4.52   3.82   3.77   3.44
BPL    5.87   3.79   3.36   3.28   3.14   1.89
SHC    5.87   4.79   4.06   3.57   2.81   2.74
TROY   5.87   3.80   3.32   3.20   3.11   2.81
UAB    5.87   4.96   3.64   3.14   2.70   2.61
UAT    5.87   5.26   5.16   5.11   4.98   4.97
UNA    5.87   4.18   4.02   3.64   3.05   2.87

Limiting the copies of an AU and increasing the number of nodes increases network capacity. The nodes added to the network also reduce the storage burden of the 8 original nodes by reshuffling AU responsibilities in the 1 + 5 assignments.

Graph : Cost Reduction per Node in TB : http://bpldb.bplonline.org/images/adpn/CostReductionTB.png

Graph : Cost Reduction per Node % Change : http://bpldb.bplonline.org/images/adpn/CostReductionPercent.png

Partition Implementation

Title List

The title list is the most crucial component for any partitioning. With close to 1000 AUs, managing partition responsibilities at the node level would be cumbersome.

<lockss-config>
 <property name="org.lockss.titleSet">
  <property name="Birmingham Public Library">
   <property name="name" value="All Birmingham Public Library AUs" />
   <property name="class" value="xpath" />
   <property name="xpath" value="[attributes/publisher='Birmingham Public Library']" />
  </property>
 </property> 
 <property name="org.lockss.title">
  <property name="BirminghamPublicLibraryBasePluginBirminghamPublicLibraryCartographyCollectionMaps000400000599">
   <property name="attributes.publisher" value="Birmingham Public Library" />
   <property name="journalTitle" value="Birmingham Public Library Cartography Collection" />
   <property name="type" value="journal" />
   <property name="title" value="Birmingham Public Library Cartography Collection: Maps (000400-000599)" />
   <property name="plugin" value="org.bplonline.adpn.BirminghamPublicLibraryBasePlugin" />
   <property name="param.1">
    <property name="key" value="base_url" />
    <property name="value" value="http://bpldb.bplonline.org/adpn/load/" />
   </property>
   <property name="param.2">
    <property name="key" value="group" />
    <property name="value" value="Cartography" />
   </property>
   <property name="param.3">
    <property name="key" value="collection" />
    <property name="value" value="000400-000599" />
   </property>
  </property>
 </property>
</lockss-config>

Comprehending LOCKSS Title List

Every element in the title list is named property, and the elements are nested to different depths, which makes the file awkward to process with standard tools. Standard deserialization techniques won't work because the first group of property elements (org.lockss.titleSet) has a different nesting depth than the second group (org.lockss.title).

using System.IO;
using System.Xml;
using System.Xml.Serialization;
...
protected void DeserializeXml() {
 using (FileStream _f = new FileStream(Server.MapPath(@"titledb.xml"), FileMode.Open, FileAccess.Read))
 {
  XmlReaderSettings _sets = new XmlReaderSettings();
  _sets.IgnoreWhitespace = true;
  // allow a DOCTYPE declaration; on .NET 4.0 and later the non-obsolete
  // equivalent is _sets.DtdProcessing = DtdProcessing.Parse;
  _sets.ProhibitDtd = false;

  XmlReader _xml = XmlReader.Create(_f, _sets);
  XmlSerializer _xs = new XmlSerializer(typeof(LockssTitleDb));
  LockssTitleDb _titles = (LockssTitleDb)_xs.Deserialize(_xml);
 
  for (int i = 0; i < _titles.LockssTitleSet.Length; i++)
  {
    if (i % 2 == 0)
    {
      // first group property (has attribute name=org.lockss.titleSet)
      // titleSet displays publisher detail
      // this assumes group of 2 for each publisher and AU list
    } 
    else 
    {
      // second group property (has attribute name=org.lockss.title)
      // each iteration is a new AU for group 1 publisher
      for (int j = 0; j < _titles.LockssTitleSet[i].ChildNodes.Count; j++)
      {
        // property  name = normalized AU string
        // outer AU definition
        foreach (XmlNode _node in _titles.LockssTitleSet[i].ChildNodes[j].ChildNodes)
        {
          // loop through property tags          
        }
      }
    }
  }
  // no explicit _f.Close() needed: the using block disposes the stream
 }
}
...
[XmlRoot("lockss-config")]
public class LockssTitleDb 
{
    // cannot create a single interface for all nesting depths with a single name
    [XmlAnyElement()]
    public XmlElement[] LockssTitleSet;
}
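
The loop above relies on the two groups alternating (i % 2). As an alternative, a minimal sketch that selects the AU group by its name attribute using LINQ to XML, so the order of groups does not matter. The class name, method names, and the "au_name" and "param:" dictionary key conventions are assumptions made for this illustration, not part of the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TitleListReader
{
    // Returns one dictionary per AU found under <property name="org.lockss.title">.
    // Flat properties are keyed by their name; param.N groups are keyed
    // as "param:" + the value of their nested "key" property.
    public static List<Dictionary<string, string>> ReadAus(XDocument doc)
    {
        var aus = new List<Dictionary<string, string>>();
        IEnumerable<XElement> titleGroups = doc.Root.Elements("property")
            .Where(p => (string)p.Attribute("name") == "org.lockss.title");

        foreach (XElement au in titleGroups.Elements("property"))
        {
            var fields = new Dictionary<string, string>();
            fields["au_name"] = (string)au.Attribute("name");
            foreach (XElement prop in au.Elements("property"))
            {
                XAttribute value = prop.Attribute("value");
                if (value != null)
                {
                    // simple name/value property
                    fields[(string)prop.Attribute("name")] = value.Value;
                    continue;
                }
                // param.N: nested key/value pair one level deeper
                string key = ChildValue(prop, "key");
                if (key != null) fields["param:" + key] = ChildValue(prop, "value");
            }
            aus.Add(fields);
        }
        return aus;
    }

    // Value of the child property with the given name, or null if absent.
    private static string ChildValue(XElement parent, string name)
    {
        return parent.Elements("property")
            .Where(p => (string)p.Attribute("name") == name)
            .Select(p => (string)p.Attribute("value"))
            .FirstOrDefault();
    }
}
```

For the sample title list shown earlier, ReadAus(XDocument.Load("titledb.xml")) would return a single entry whose "param:group" value is "Cartography".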

Local Data Store

http://bpldb.bplonline.org/images/adpn/datastore.png

XML Generation

ExpertConfig option org.lockss.titleDbs does not examine the content type of the file it loads. It only checks whether the URL ends in .xml; otherwise it assumes the file is a .txt configuration file. See the BaseConfigFile.java constructor.
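
A per-peer title list in the format shown under Title List could be generated along these lines. This is a trimmed sketch, not the generator used for the network: the class and method names are hypothetical, only the org.lockss.title group is written, and the titleSet group and the journalTitle and type properties are left out for brevity. The resulting file should be served from a URL ending in .xml, per the note above.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class TitleDbWriter
{
    // <property name="..." value="..." />
    private static XElement Prop(string name, string value)
    {
        return new XElement("property",
            new XAttribute("name", name), new XAttribute("value", value));
    }

    // Builds one AU entry; parameters become param.1, param.2, ... key/value groups.
    public static XElement BuildAu(string auName, string publisher, string title,
        string plugin, IList<KeyValuePair<string, string>> parameters)
    {
        var au = new XElement("property", new XAttribute("name", auName),
            Prop("attributes.publisher", publisher),
            Prop("title", title),
            Prop("plugin", plugin));
        for (int i = 0; i < parameters.Count; i++)
        {
            au.Add(new XElement("property",
                new XAttribute("name", "param." + (i + 1)),
                Prop("key", parameters[i].Key),
                Prop("value", parameters[i].Value)));
        }
        return au;
    }

    // Wraps the AU entries assigned to one peer in the org.lockss.title group.
    public static XDocument BuildTitleDb(IEnumerable<XElement> aus)
    {
        return new XDocument(new XElement("lockss-config",
            new XElement("property", new XAttribute("name", "org.lockss.title"), aus)));
    }
}
```

Calling BuildTitleDb(...).Save(path) with the AUs selected for one peer from adpn_peer_titles would write that peer's file.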

Title Distribution

Base Distribution

In the local data store, au_ids are auto-incremented. After the initial distribution, new AU releases can therefore be isolated by selecting rows with an au_id greater than the last known au_id. A row timestamp could also be used.

string[] _peers = GetPeerArray();
foreach (DataRow _row in _titles.Rows)
{
 Shuffle(_peers);

 int _counter = 0;
 bool _isPeer = IsPeer(_row["au_pub_id"].ToString());
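 // not every publisher is a peer: a peer-owned AU goes to 5 other peers
 // plus the owner, any other AU goes to 6 peers
 int _maxCount = _isPeer ? 5 : 6;

 // insert 5 or 6 rows, skipping the AU owner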
 foreach (string _peer in _peers)
 {
  if (_counter == _maxCount) break;
  if (_peer.Equals(_row["au_pub_id"].ToString())) continue;

  _connect.Command.Parameters.Clear();
  _connect.Command.CommandText = "INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`) VALUES (?,?)";
  _connect.Command.Parameters.AddWithValue("?", _peer);
  _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
  _connect.Command.ExecuteNonQuery();
  _counter++;
 }

 if (_isPeer)
 {
  // insert 1
  _connect.Command.Parameters.Clear();
  _connect.Command.CommandText = "INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`) VALUES (?,?)";
  _connect.Command.Parameters.AddWithValue("?", _row["au_pub_id"].ToString());
  _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
  _connect.Command.ExecuteNonQuery();
 }
}
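
The Shuffle helper is not shown on this page; an earlier revision describes the sample runs as using a Fisher-Yates shuffle. A minimal sketch of such a helper, assuming an in-place shuffle over a string array with a caller-supplied Random (the class name and the extra rng parameter are assumptions; the call above passes only the array):

```csharp
using System;

public static class PeerShuffle
{
    // In-place Fisher-Yates shuffle: walks from the last index down,
    // swapping each element with a randomly chosen element at or before it.
    public static void Shuffle(string[] items, Random rng)
    {
        for (int i = items.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            string tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }
    }
}
```

Reusing one Random instance for the whole run, rather than creating a new one per AU, avoids repeated seeds on older .NET Framework versions where the default seed is time-based.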

A resultant table set could look like the following:

peer_id   count(*)   % burden
ADAH      499        55%
AUB       549        61%
BPL       520        57%
SHC       504        56%
SHC1      490        54%
TROY      492        54%
UAB       532        59%
UAT       852        94%
UAT1      482        53%
UNA       510        56%

905 titles
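
The % burden column is consistent with each peer's AU count divided by the 905 titles, rounded to a whole percent. A small sketch of that calculation (the class and method names are illustrative):

```csharp
using System;

public static class BurdenMath
{
    // Share of all titles held by one peer, as a whole percent.
    public static int PercentBurden(int ausHeld, int totalTitles)
    {
        return (int)Math.Round(100.0 * ausHeld / totalTitles);
    }
}
```

For example, PercentBurden(499, 905) gives 55 and PercentBurden(852, 905) gives 94, matching the ADAH and UAT rows.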

New Node Reshuffle

The original distribution algorithm can be used with a slight modification: rows in the adpn_peer_titles table are only updated when the new node comes up within the selected positions after the shuffle. This approach assumes the new node is not to be considered an AU owner for existing AUs.

string[] _peers = GetPeerArray();
foreach (DataRow _row in _titles.Rows)
{
 Shuffle(_peers);

 int _counter = 0;
 bool _isPeer = IsPeer(_row["au_pub_id"].ToString());
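 // not every publisher is a peer: a peer-owned AU goes to 5 other peers
 // plus the owner, any other AU goes to 6 peers
 int _maxCount = _isPeer ? 5 : 6;

 // walk the first 5 or 6 shuffled peers, skipping the AU owner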
 foreach (string _peer in _peers)
 {
  if (_counter == _maxCount) break;
  if (_peer.Equals(_row["au_pub_id"].ToString())) continue;

  // ONLY modify existing AU map when new node is up 
  if (_peer.Equals(newNodePeerId))
  {
    _connect.Command.Parameters.Clear();
    _connect.Command.CommandText = @"
UPDATE `adpn_peer_titles` AS `a`, 
      (SELECT  `b2`.`peer_id` FROM `adpn_peer_titles` AS `b2`  
       WHERE `b2`.`au_id` = ?  AND `b2`.`peer_id` != ? ORDER BY RAND() LIMIT 1) AS `b` 
SET `a`.`peer_id` = ?
WHERE `a`.`peer_id` = `b`.`peer_id`
AND `a`.`au_id` = ? ";

    _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
    _connect.Command.Parameters.AddWithValue("?", _row["au_pub_id"].ToString());
    _connect.Command.Parameters.AddWithValue("?", _peer);
    _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
    _connect.Command.ExecuteNonQuery();
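  }
  // count every non-owner peer examined, so the new node only takes over
  // an AU when it lands in the first _maxCount shuffled positions
  _counter++;
 }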
}

Using the original distribution and reshuffling for a new node resulted in:

peer_id   count(*)   % burden
ADAH      456        50%
AUB       492        54%
BPL       467        52%
SHC       456        50%
SHC1      438        48%
TROY      445        49%
UAB       466        51%
UAT       845        93%
UAT1      441        49%
UNA       467        52%
BPL1      457        50%

905 titles

Dead Node Reshuffle

When a node is no longer a part of the network:

  1. Delete the dead node's peer_id rows from adpn_peer_titles
  2. Select the au_ids (and their remaining peer_ids) whose copy count is less than the maximum count
  3. For each such AU, insert a randomly chosen peer_id that is not already in that AU's list of peer_ids

The entire method:

private void DeadNodeShuffle()
{
  _connect.Command.Parameters.Clear();
  _connect.Command.CommandText = @"SELECT `au_id` FROM `adpn_peer_titles` GROUP BY `au_id` HAVING COUNT(*) < 6";
  DataTable _titles = new DataTable();
  using (OdbcDataAdapter _adapter = new OdbcDataAdapter(_connect.Command))
  {
    _adapter.Fill(_titles);
  }

  foreach (DataRow _row in _titles.Rows)
  {
    _connect.Command.Parameters.Clear();
    _connect.Command.CommandText = @"INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`) 
      (SELECT `adpn_peers`.`peer_id`, (SELECT `au_id` FROM `au_titlelist` WHERE `au_id` = ?) AS 'au_id'
       FROM `adpn_peers` 
       WHERE `adpn_peers`.`active` = 'y'
       AND  `adpn_peers`.`peer_id` NOT IN (SELECT `peer_id` FROM `adpn_peer_titles` WHERE `au_id` = ?)       
       ORDER BY RAND() LIMIT 1)";
    _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
    _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
    _connect.Command.ExecuteNonQuery();
  }
}
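
The method above covers steps 2 and 3; step 1, deleting the dead node's rows, is presumably run separately. The three steps together can be sketched in memory, without the database. Everything here is an illustration under stated assumptions: the dictionary stands in for adpn_peer_titles, the list of active peers for adpn_peers, and the class and parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeadNodeSketch
{
    // holdings maps au_id -> set of peer_ids holding that AU.
    public static void Reshuffle(Dictionary<int, HashSet<string>> holdings,
                                 List<string> activePeers, string deadPeer,
                                 int maxCount, Random rng)
    {
        foreach (HashSet<string> peers in holdings.Values)
        {
            // Step 1: drop the dead peer's row for this AU.
            peers.Remove(deadPeer);

            // Step 2: only AUs below the target copy count need a new holder.
            if (peers.Count >= maxCount) continue;

            // Step 3: pick one random active peer that does not already hold the AU.
            List<string> candidates = activePeers
                .Where(p => p != deadPeer && !peers.Contains(p)).ToList();
            if (candidates.Count > 0)
                peers.Add(candidates[rng.Next(candidates.Count)]);
        }
    }
}
```

Like the SQL version, this adds at most one holder per AU per run, which is enough when a single node has dropped out.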

Dropping a node (SHC1) and reshuffling AUs without full holdings results in the following table.

peer_id   count(*)   % burden
ADAH      496        55%
AUB       528        58%
BPL       512        57%
BPL1      514        57%
SHC       500        55%
TROY      503        56%
UAB       520        57%
UAT       850        94%
UAT1      496        55%
UNA       511        56%

905 titles