Partitioning Cache Data
Latest revision as of 09:13, 14 October 2013
Why Partition
Partitioned Cost Reductions
Quorum in the network is 3. (This is down from the previous value of 4... find out why).
Any partitioning strategy would need to be implemented at the titledb.xml level. Title AUs can be assigned to peers centrally, with each peer receiving a custom titledb.xml file. If LOCKSS is unwilling to support that, there is an alternative using local parameters:
org.lockss.titleDbs = http://bpldb.bplonline.org/etc/adpn/titledb-local.xml
n.b. An online interface exists but access control isn't defined yet. Contact Tobin for details.
PLNs can reduce the current burden on storage capacity, and incorporate new members more readily, by establishing a baseline for how many copies of the data in the network are sufficient. Data should then be partitioned so that each AU is held on exactly that number of nodes. For this exercise, I looked at implementing a 1 + 5 baseline: the AU owner plus 5 other network nodes. This strategy results in 6 copies of the data in an 8-node network, and the average storage burden per node is reduced by 25% with no new nodes added to the network. Adding 2 nodes and implementing 1 + 5 reduces the average burden by 40% per node; adding 4 nodes reduces it by 50%.
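The percentages quoted here follow from simple arithmetic: under full replication each node carries all of the content, while with a fixed number of copies each node carries, on average, copies / nodes of it. A minimal C# sketch of that calculation follows; the class and method names are illustrative only, and it assumes copies are spread evenly, which the per-node figures show is only approximately true.

```csharp
using System;

public static class BurdenMath
{
    // Average per-node storage reduction, in percent, compared with
    // every node holding every AU. Assumes an even spread of copies.
    public static double AverageReductionPercent(int copies, int nodes)
    {
        if (copies < 1 || nodes < copies)
            throw new ArgumentException("expected 1 <= copies <= nodes");
        return (1.0 - (double)copies / nodes) * 100.0;
    }
}
```

Calling `BurdenMath.AverageReductionPercent(6, 8)`, `(6, 10)` and `(6, 12)` reproduces the 25%, 40% and 50% figures, up to floating-point rounding.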
Extant Storage Burden in TB

Node | Base | 1 + 5, +0 nodes | 1 + 5, +1 node | 1 + 5, +2 nodes | 1 + 5, +3 nodes | 1 + 5, +4 nodes
---|---|---|---|---|---|---
ADAH | 5.87 | 3.79 | 3.42 | 2.64 | 2.47 | 2.37
AUB | 5.87 | 4.66 | 4.52 | 3.82 | 3.77 | 3.44
BPL | 5.87 | 3.79 | 3.36 | 3.28 | 3.14 | 1.89
SHC | 5.87 | 4.79 | 4.06 | 3.57 | 2.81 | 2.74
TROY | 5.87 | 3.80 | 3.32 | 3.20 | 3.11 | 2.81
UAB | 5.87 | 4.96 | 3.64 | 3.14 | 2.70 | 2.61
UAT | 5.87 | 5.26 | 5.16 | 5.11 | 4.98 | 4.97
UNA | 5.87 | 4.18 | 4.02 | 3.64 | 3.05 | 2.87
Limiting the copies of an AU and increasing the number of nodes increases network capacity. The nodes added to the network also reduce the storage burden of the 8 original nodes by reshuffling AU responsibilities in the 1 + 5 assignments.
Graph : Cost Reduction per Node in TB : http://bpldb.bplonline.org/images/adpn/CostReductionTB.png
Graph : Cost Reduction per Node % Change : http://bpldb.bplonline.org/images/adpn/CostReductionPercent.png
Partition Implementation
Title List
The title list is the most crucial component for any partitioning. With close to 1000 AUs, managing partition responsibilities at the node level would be cumbersome.
    <lockss-config>
      <property name="org.lockss.titleSet">
        <property name="Birmingham Public Library">
          <property name="name" value="All Birmingham Public Library AUs" />
          <property name="class" value="xpath" />
          <property name="xpath" value="[attributes/publisher='Birmingham Public Library']" />
        </property>
      </property>
      <property name="org.lockss.title">
        <property name="BirminghamPublicLibraryBasePluginBirminghamPublicLibraryCartographyCollectionMaps000400000599">
          <property name="attributes.publisher" value="Birmingham Public Library" />
          <property name="journalTitle" value="Birmingham Public Library Cartography Collection" />
          <property name="type" value="journal" />
          <property name="title" value="Birmingham Public Library Cartography Collection: Maps (000400-000599)" />
          <property name="plugin" value="org.bplonline.adpn.BirminghamPublicLibraryBasePlugin" />
          <property name="param.1">
            <property name="key" value="base_url" />
            <property name="value" value="http://bpldb.bplonline.org/adpn/load/" />
          </property>
          <property name="param.2">
            <property name="key" value="group" />
            <property name="value" value="Cartography" />
          </property>
          <property name="param.3">
            <property name="key" value="collection" />
            <property name="value" value="000400-000599" />
          </property>
        </property>
      </property>
    </lockss-config>
Comprehending LOCKSS Title List
Every element in the file is named property, and these same-named elements are nested to different depths, which makes the file hard to process with standard tools. Standard deserialization techniques won't work because the first group of property elements (org.lockss.titleSet) has a different nesting depth than the second group (org.lockss.title).
    protected void DeserializeXml()
    {
        using (FileStream _f = new FileStream(Server.MapPath(@"titledb.xml"), FileMode.Open, FileAccess.Read))
        {
            XmlReaderSettings _sets = new XmlReaderSettings();
            _sets.IgnoreWhitespace = true;
            _sets.ProhibitDtd = false;
            XmlReader _xml = XmlReader.Create(_f, _sets);
            XmlSerializer _xs = new XmlSerializer(typeof(LockssTitleDb));
            LockssTitleDb _titles = (LockssTitleDb)_xs.Deserialize(_xml);

            for (int i = 0; i < _titles.LockssTitleSet.Length; i++)
            {
                if (i % 2 == 0)
                {
                    // first group property (has attribute name=org.lockss.titleSet)
                    // titleSet displays publisher detail
                    // this assumes group of 2 for each publisher and AU list
                }
                else
                {
                    // second group property (has attribute name=org.lockss.title)
                    // each iteration is a new AU for group 1 publisher
                    for (int j = 0; j < _titles.LockssTitleSet[i].ChildNodes.Count; j++)
                    {
                        // property name = normalized AU string
                        // outer AU definition
                        foreach (XmlNode _node in _titles.LockssTitleSet[i].ChildNodes[j].ChildNodes)
                        {
                            // loop through property tags
                        }
                    }
                }
            }
            // no explicit Close() call is needed: the using block disposes the stream
        }
    }

    ...

    [XmlRoot("lockss-config")]
    public class LockssTitleDb
    {
        // cannot create a single interface for all nesting depths with a single name
        [XmlAnyElement()]
        public XmlElement[] LockssTitleSet;
    }
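As an alternative to XmlSerializer, the same structure can be walked with LINQ to XML, selecting only direct children at each level so the nesting depths stay separate. The following is a sketch under assumptions rather than the approach used above: the class name, the method names and the "param:" key prefix are invented for illustration, and it assumes each param.N group carries one key property and one value property, as in the sample file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TitleDbReader
{
    // Returns, for each AU under org.lockss.title, its simple name/value
    // properties plus the key/value pair of every param.N group.
    public static Dictionary<string, Dictionary<string, string>> ReadTitles(string xml)
    {
        XElement root = XDocument.Parse(xml).Root;
        var result = new Dictionary<string, Dictionary<string, string>>();

        // Direct children of the root only, so deeper <property> elements are not mixed in.
        IEnumerable<XElement> titleGroups = root.Elements("property")
            .Where(p => (string)p.Attribute("name") == "org.lockss.title");

        foreach (XElement au in titleGroups.Elements("property"))
        {
            string auName = (string)au.Attribute("name");
            if (auName == null) continue;

            var props = new Dictionary<string, string>();
            foreach (XElement p in au.Elements("property"))
            {
                string name = (string)p.Attribute("name");
                if (name == null) continue;

                XAttribute value = p.Attribute("value");
                if (value != null)
                {
                    props[name] = value.Value;              // e.g. plugin, title
                }
                else
                {
                    // param.N group: children are named "key" and "value"
                    string key = ChildValue(p, "key");
                    if (key != null) props["param:" + key] = ChildValue(p, "value");
                }
            }
            result[auName] = props;
        }
        return result;
    }

    // Value attribute of the direct child <property name="..."/>, or null.
    private static string ChildValue(XElement parent, string name)
    {
        return parent.Elements("property")
            .Where(c => (string)c.Attribute("name") == name)
            .Select(c => (string)c.Attribute("value"))
            .FirstOrDefault();
    }
}
```

For the sample file, the returned dictionary would have one entry keyed by the normalized AU string, with entries such as plugin and param:base_url.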
Local Data Store
http://bpldb.bplonline.org/images/adpn/datastore.png
XML Generation
The ExpertConfig option org.lockss.titleDbs does not examine the content type. It only checks whether the URL ends in .xml; otherwise it assumes the target is a .txt configuration file. See the BaseConfigFile.java constructor.
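In practice this means a generated title database has to be served from a URL that ends in .xml. The rule as described can be paraphrased in C# as below. This is an illustration of the stated behavior only, not LOCKSS's actual code; the names are hypothetical and the case-sensitive comparison is an assumption.

```csharp
using System;

public static class ConfigUrlRule
{
    // True when a URL would be read as an XML config file under the rule
    // described above; anything else would be treated as a .txt config file.
    public static bool TreatedAsXml(string url)
    {
        return url != null && url.EndsWith(".xml", StringComparison.Ordinal);
    }
}
```

Under this rule, http://bpldb.bplonline.org/etc/adpn/titledb-local.xml qualifies, while a hypothetical dynamic endpoint such as http://example.org/titledb.aspx?peer=BPL would not, even if it returned XML.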
Title Distribution
Base Distribution
Using the local data store, au_ids are auto-incremented. After the initial distribution, new AU releases can be isolated by selecting rows above the last known au_id. The row timestamp could also be used.
    string[] _peers = GetPeerArray();
    foreach (DataRow _row in _titles.Rows)
    {
        Shuffle(_peers);
        int _counter = 0;
        bool _isPeer = IsPeer(_row["au_pub_id"].ToString());
        int _maxCount = _isPeer ? 5 : 6; // not every publisher is a peer

        // insert 5 non-owner copies (6 when the publisher is not a peer)
        foreach (string _peer in _peers)
        {
            if (_counter == _maxCount) break;
            if (_peer.Equals(_row["au_pub_id"].ToString())) continue; // owner is inserted below

            _connect.Command.Parameters.Clear();
            _connect.Command.CommandText = "INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`) VALUES (?,?)";
            _connect.Command.Parameters.AddWithValue("?", _peer);
            _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
            _connect.Command.ExecuteNonQuery();
            _counter++;
        }

        if (_isPeer)
        {
            // insert 1: the AU owner always holds its own AU
            _connect.Command.Parameters.Clear();
            _connect.Command.CommandText = "INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`) VALUES (?,?)";
            _connect.Command.Parameters.AddWithValue("?", _row["au_pub_id"].ToString());
            _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
            _connect.Command.ExecuteNonQuery();
        }
    }
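The loop above relies on helpers that are not shown (GetPeerArray, Shuffle, IsPeer). The sketch below is one possible in-memory reading of the same selection logic, useful for trying the distribution without a database. Everything in it is an assumption for illustration: the class and method names are invented, the shuffle is a standard Fisher-Yates shuffle rather than whatever the original Shuffle does, and a publisher counts as a peer simply when its id appears in the peer array.

```csharp
using System;
using System.Collections.Generic;

public static class PartitionSketch
{
    // Fisher-Yates shuffle, in place.
    public static void Shuffle<T>(T[] items, Random rng)
    {
        for (int i = items.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);          // 0 <= j <= i
            T tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }
    }

    // Picks the holders of one AU: up to 5 random non-owner peers plus the
    // owner, or up to 6 random peers when the publisher is not itself a peer.
    public static List<string> AssignHolders(string[] peers, string publisherId, Random rng)
    {
        string[] shuffled = (string[])peers.Clone();
        Shuffle(shuffled, rng);

        bool isPeer = Array.IndexOf(peers, publisherId) >= 0;
        int maxCount = isPeer ? 5 : 6;

        var holders = new List<string>();
        foreach (string peer in shuffled)
        {
            if (holders.Count == maxCount) break;
            if (peer == publisherId) continue;    // the owner is added separately
            holders.Add(peer);
        }
        if (isPeer) holders.Add(publisherId);
        return holders;
    }
}
```

With ten peers, `PartitionSketch.AssignHolders(peers, "BPL", new Random())` returns six distinct peer ids that always include BPL; for a publisher that is not a peer it returns six ids drawn from the peers alone.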
A resultant table set could look like the following:
    peer_id   count(*)   % burden
    ADAH      499        55%
    AUB       549        61%
    BPL       520        57%
    SHC       504        56%
    SHC1      490        54%
    TROY      492        54%
    UAB       532        59%
    UAT       852        94%
    UAT1      482        53%
    UNA       510        56%

    905 Titles
New Node Reshuffle
The original distribution algorithm can be reused with a slight modification: rows in the adpn_peer_titles table are updated only when the new node comes up in the shuffle. This approach assumes the new node is not to be considered an AU owner for existing AUs.
    string[] _peers = GetPeerArray();
    foreach (DataRow _row in _titles.Rows)
    {
        Shuffle(_peers);
        int _counter = 0;
        bool _isPeer = IsPeer(_row["au_pub_id"].ToString());
        int _maxCount = _isPeer ? 5 : 6; // not every publisher is a peer

        foreach (string _peer in _peers)
        {
            if (_counter == _maxCount) break;
            if (_peer.Equals(_row["au_pub_id"].ToString())) continue;

            // ONLY modify existing AU map when new node is up
            if (_peer.Equals(newNodePeerId))
            {
                // hand one randomly chosen non-owner copy of this AU to the new node
                _connect.Command.Parameters.Clear();
                _connect.Command.CommandText = @"
                    UPDATE `adpn_peer_titles` AS `a`,
                      (SELECT `b2`.`peer_id` FROM `adpn_peer_titles` AS `b2`
                       WHERE `b2`.`au_id` = ? AND `b2`.`peer_id` != ? ORDER BY RAND() LIMIT 1) AS `b`
                    SET `a`.`peer_id` = ?
                    WHERE `a`.`peer_id` = `b`.`peer_id`
                    AND `a`.`au_id` = ? ";

                _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
                _connect.Command.Parameters.AddWithValue("?", _row["au_pub_id"].ToString());
                _connect.Command.Parameters.AddWithValue("?", _peer);
                _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
                _connect.Command.ExecuteNonQuery();
            }

            // count the slot (absent from the listing as published) so that only
            // the first _maxCount shuffled peers are considered
            _counter++;
        }
    }
Using the original distribution and reshuffling for a new node resulted in :
    peer_id   count(*)   % burden
    ADAH      456        50%
    AUB       492        54%
    BPL       467        52%
    SHC       456        50%
    SHC1      438        48%
    TROY      445        49%
    UAB       466        51%
    UAT       845        93%
    UAT1      441        49%
    UNA       467        52%
    BPL1      457        50%

    905 titles
Dead Node Reshuffle
When a node is no longer a part of the network:
- Delete peer_id from adpn_peer_titles
- Select a list of peer_ids and au_ids where count is less than maxcount
- Insert a shuffled peer id where not in peer_ids select list
The entire method :
    private void DeadNodeShuffle()
    {
        // find every AU that is held by fewer than 6 peers
        _connect.Command.Parameters.Clear();
        _connect.Command.CommandText = @"SELECT `au_id` FROM `adpn_peer_titles` GROUP BY `au_id` HAVING COUNT(*) < 6";
        DataTable _titles = new DataTable();
        using (OdbcDataAdapter _adapter = new OdbcDataAdapter(_connect.Command))
        {
            _adapter.Fill(_titles);
        }

        foreach (DataRow _row in _titles.Rows)
        {
            // add one random active peer that does not already hold this AU
            _connect.Command.Parameters.Clear();
            _connect.Command.CommandText = @"INSERT INTO `adpn_peer_titles` (`peer_id`, `au_id`)
                (SELECT `adpn_peers`.`peer_id`, (SELECT `au_id` FROM `au_titlelist` WHERE `au_id` = ?) AS 'au_id'
                 FROM `adpn_peers`
                 WHERE `adpn_peers`.`active` = 'y'
                 AND `adpn_peers`.`peer_id` NOT IN (SELECT `peer_id` FROM `adpn_peer_titles` WHERE `au_id` = ?)
                 ORDER BY RAND() LIMIT 1)";
            _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
            _connect.Command.Parameters.AddWithValue("?", _row["au_id"].ToString());
            _connect.Command.ExecuteNonQuery();
        }
    }
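The same three steps can be tried without a database. The sketch below is an in-memory approximation, not the method above: the names are invented, holdings are modeled as a dictionary from au_id to the set of holding peer_ids, and it keeps adding peers until each AU reaches the target count, whereas the SQL version adds one peer per AU on each run.

```csharp
using System;
using System.Collections.Generic;

public static class DeadNodeSketch
{
    // holdings maps au_id -> set of peer_ids holding that AU.
    // Removes deadPeer everywhere, then tops each AU back up to targetCopies
    // using randomly chosen active peers that do not already hold it.
    public static void Reshuffle(Dictionary<int, HashSet<string>> holdings,
                                 List<string> activePeers, string deadPeer,
                                 int targetCopies, Random rng)
    {
        foreach (HashSet<string> holders in holdings.Values)
        {
            holders.Remove(deadPeer);                         // step 1: delete the dead peer

            // step 2: candidates are active peers not already holding this AU
            var candidates = new List<string>();
            foreach (string p in activePeers)
            {
                if (p != deadPeer && !holders.Contains(p)) candidates.Add(p);
            }

            // step 3: insert random candidates until the AU is fully held again
            while (holders.Count < targetCopies && candidates.Count > 0)
            {
                int pick = rng.Next(candidates.Count);
                holders.Add(candidates[pick]);
                candidates.RemoveAt(pick);
            }
        }
    }
}
```

If there are fewer eligible peers than missing copies, an AU simply stays below the target, which matches the behavior of the NOT IN filter in the SQL.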
Dropping a node (SHC1) and reshuffling AUs without full holdings results in the following table.
    peer_id   count(*)   % burden
    ADAH      496        55%
    AUB       528        58%
    BPL       512        57%
    BPL1      514        57%
    SHC       500        55%
    TROY      503        56%
    UAB       520        57%
    UAT       850        94%
    UAT1      496        55%
    UNA       511        56%

    905 titles