Nested Sets, PHP, Verb Objects and Noun Objects
Introduction
Controversy persists to this day about the relative merits of dynamic languages such as PHP and Python versus static languages such as C# and Java. We’re finding more and more that the difference isn’t so much about static or dynamic typing, but more about the cultures of different languages. In this article, I discuss an efficient representations of SQL trees in a database, an algorithm for creating that representation, and a PHP implementation. The PHP implementations uses objects in a way foreign to many developers: rather than using objects to represent nouns (data), it uses a class to represent a verb (an algorithm.) I make the case that programmers shouldn’t feel compelled to create new classes to represent every data item: that verb objects often provide the right level of abstraction for many tasks.
The Presenting Problem
Lately I’ve been collecting pictures of animals, and decided that incorporating the taxonomic database from ITIS would be a big help. I’m interested in asking questions like “What are all the species underneath the Tapiridae family?” The ITIS database uses the adjacency list representation, where each row contains a column that references the primary key of a parent row. Algorithms for the adjacency list are well known, but are awkward to implement in SQL since it takes multiple SQL statements to traverse a tree.
Nested sets are an alternative representation that makes it simple to write fast queries on trees. Like the parts explosion diagram below, components of the hierarchy are represented with contiguous numbers (parts 1-3 form one end piece of the steering knuckle.)
This article discusses the adjacency list and nested set models and presents a simple algorithm for converting an adjacency list into nested sets.
Adjacency lists
A common representation of a tree in a relational database is like this:
[01] create table obviousTree ( [02] id varchar(255), [03] parent varchar(255), [04] primary key(id), [05] foreign key(parent) references tree(id) [06] )
The parent column is allowed to be null, so the root of the tree has a null parent. A typical tree in this representation might look like
[07] sql> SELECT * FROM obviousTree; [08] +--------------+ [09] | id | parent | [10] |-----|--------| [11] | 'a' | null | [12] | 'b' | 'a' | [13] | 'c' | 'a' | [14] | 'd' | 'b' | [15] | 'e' | 'b' | [16] | 'f' | 'b' | [17] +--------------+
It’s simple in this representatation to find out what the parent of a node is,
[18] SELECT parent FROM obviousTree WHERE id=@Child
or to find the direct children of a node,
[19] SELECT id FROM obviousTree WHERE parent=@Parent
It’s not possible, however, to write a pure SQL statement that traverses all of the descendants of a node (at least not in standard SQL.) You need to either write a stored procedure or write a program in a language like C# or PHP to implements a breadth-first or depth-first traversal of the tree. This isn’t conceptually hard, but it’s inefficient and doesn’t take full advantage of the querying power of SQL.
Nested Sets
Joe Celko has promoted the Nested Set representation of trees in several of his books. In the nested set representation, we represent the position of each node in the tree with two numbers: the left value and the right value. We’ll call them lft and rgt in our code, since LEFT and RIGHT are reserved words in SQL.
The left and right values of a parent node enclose the left and right values of the children, so it’s easy to ask for the the descendants of a node
[20] SELECT * FROM fastTree WHERE [21] lft>@AncestorLft [22] AND lft<@AncestorRgt
to count the children
[23] SELECT (rgt-lft-1)/2 FROM fastTree WHERE lft=@AncestorLeft
or to find all the ancestors of a node
[24] SELECT * FROM fastTree WHERE [25] lft<@DescendentLft [26] AND rgt>@DescendentRgt
Granted, update operations are slower and more complex in the nested set representation, but for my application, where I’m accessing a nearly 500,000-node tree that never changes, nested sets are the clear choice.
The Conversion Algorithm in PHP
Looking at the tree above, you can see a straightforward algorithm for creating the nested set representation. We traverse the tree depth-first, keeping a counter, which we’ll call cnt as we go — it works a lot like a thumb operated tally counter:
When we first encounter a node (going down), we write cnt into the lft field of that node, then we increment cnt. When we encounter it again (going up), we write cnt into the rgt field of the node an increment cnt.
Joe Celko makes a good case for implementing mutating operations on nested sets in stored procedures. (see his book on Trees and Hiearchies and SQL) I think he’s particularly right when it comes to adding, removing and moving nodes, where the operation requires several statements that should be done in a transaction. Transactional integrity is less important, however, in a one-time import procedure which is done in a batch; although Mysql 5.0 (which I’m using) supports stored procedures, it’s got less of a culture of stored procedure use than other databases, so I felt comfortable writing the conversion script in PHP. This script converted a tree with 477,797 nodes in less than a minute, so performance was adequate.
I’m going to store the nested set in a table that looks like
[27] create table fastTree ( [28] lft integer not null, [29] rgt integer not null, [30] id varchar(255), [31] primary key(lft), [32] unique(rgt), [33] unique(id) [34] )
I include “_Config.php”, which initializes the database connection, $conn, using the third-party ADODB library. After that I retrieve the list of parent-child relationships from the database:
[35] require_once("./_Config.php"); [36] $rawLink=$conn->Execute("SELECT id,parent FROM obviousTree"); [37] [38] $link=array(); [39] foreach($rawLink as $k=>$row) { [40] $parent=$row["parent"]; [41] $child=$row["id"]; [42] if (!array_key_exists($parent,$link)) { [43] $link[$parent]=array(); [44] } [45] $link[$parent][]=$child; [46] }
Note that we’re building a complete copy of the adjacency list in RAM: the value of $link[$parent] is an array that contains the id’s of all the children of the $parent. I’m doing this for two reasons: (i) the adjacency list is small enough to fit in RAM, and (ii) minimize the cost of I/O: if you expect to access all rows in a table, it’s a lot cheaper to do a full table scan than it is to do hundreds of thousands of index queries.
Next we define an object, TreeTransformer, that represents the algorithm. In particular, it provides a scope for the $cnt variable, which has a lifespan apart from the recursive function that represents the tree:
[47] class TreeTransformer { [48] function __construct($link) { [49] $this->count=1; [50] $this->link=$link; [51] } [52] [53] function traverse($id) { [54] $lft=$this->count; [55] $this->count++; [56] [57] $kid=$this->getChildren($id); [58] if ($kid) { [59] foreach($kid as $c) { [60] $this->traverse($c); [61] } [62] } [63] $rgt=$this->count; [64] $this->count++; [65] $this->write($lft,$rgt,$id); [66] }
Traverse() is the core of the algorithm: it works in three phases: (i) it assigns the $lft value, (ii) it loops over all children, calling itself recursively for each child, and (iii) assigns the $rgt value and writes an output record. The scope of $this->count is exactly the scope we want for the variable, and saves the need of passing $count back and forth between different levels of traverse(). Traverse calls two functions:
[67] function getChildren($id) { [68] return $this->link[$id]; [69] } [70] [71] function write($lft,$rgt,$id) { [72] global $conn; [74] $conn->Execute(" [75] INSERT INTO fastTree [76] (lft,rgt,id) [77] VALUES [78] (?,?,?) [79] ",array($lft,$rgt,$id)); [80] } [81] }
Noun Objects or Verb Objects?
This script uses a single class: instead of using classes to represent data, it uses a class to represent an algorithm, a verb. Rather than create objects to create data structures, I reuse data structures that come with PHP.
This is an extensible design.
TreeTransformer provides two extension points: getChildren() and write(). Most of the objections a person could have to this implementation could be addressed here: for instance, getChildren() could be modified to support a different data structure for the adjacency list, or even to operate in a streaming mode that does an SQL query for each node. write(), on the other hand, could be modified to avoid the limitations of the global $conn, or to change the output format. If TreeTransformer were to evolve in the future, it would make sense to push traverse() up to an AbstractTreeWriter and define getChildren(), write() and $this->link in a subclass.
Noun objects (that represent things) can be useful, but a compulsion to create noun objects can lead to an entanglement between algorithms and data structures, poor code reuse, and a proliferation of artifacts that makes for a high defect count and expensive maintainance. Even if were using noun objects in this program, it still makes sense to implement this algorithm as a verb object: by creating interfaces for the adjacancy list and for the output, I could keep the algorithm reusable while keeping the noun objects simple.
Conclusion: Write Less Code
Many of the fashionable languages that people claim improve productivity, such as Lisp, Python and Ruby, put powerful data structures at the programmer fingertips: they encourage you to reuse data structures provided by the language rather than to create a new object in response to every situation. Functional languages such as CAML and F# show the power of programming by composition, enabled by the use of standard data structures (and interfaces.)
What’s really exciting is that these methods are becoming increasingly available and popular in mainstream languages, such as C# and PHP.
Generic classes and methods, as available in Java and C#, let you have both reusable data structures and type safety. Although there are trade offs, you seriously consider using a List<Thing> rather than creating a CollectionOfThings. Yes, a collection of things that you create can present exactly the interface you need, however, without a lot of care, you might find that another member of your team wrote an incompatible JohnsCollectionOfThings, and you end up writing more glue code because you’re interfacing with a third party vendor that uses a MagicBoxFullOfThings instead.
Although there advantages to encapsulation, and advantages to creating noun objects (sometimes they do provide the right places for extension points,) programmers need to be careful when they create new classes. Every class you write, every method, and every line of code is like a new puppy: you not only need to write it, but you’ll need to maintain it in the future.
Image sources: Parts Explosion Diagram of Hummer Steering Knuckle from CHEMAXX, Handheld Tally Counter from Different Roads To Learning.
Paul Houle on November 4th 2008 in Asynchronous Communications