Author Topic: Representing disjoint sets (Read 4222 times)

harold · « **on:** April 02, 2014, 06:56:38 am »

None of this is new, but I think it deserves more attention.

Parent pointers, as discussed in wiki:disjoint-set_data_structure. Every element points to its parent, and the root identifies the set. A singleton set is an element that points to itself. This is a relatively well-known data structure because every CS curriculum covers it, usually in the context of amortized analysis.
With union by rank, it supports merging two sets in O(log n) and identifying which set an item belongs to also in O(log n). But remember the amortized analysis that makes this data structure so good: with path compression added, a sequence of operations costs nearly constant time per operation.
But here are some things you don't get (without extensions), or not nicely:

Seeing how big a set is. But you can fix that: take an extra array "size", initialize it with 1s, and on a union, do size[newroot] += size[otherroot]. Getting the size of the set containing x is just size[find(x)].
Removing an element from a set. Other elements might be pointing to it, and you have to find them and fix them. The pointers go the wrong way, so you have to scan everything to find them.
A good way to enumerate some set. Again the pointers go the wrong way. You can scan through all items and see if they belong to the set, and in some ways that is not so bad (in the worst case, the set contains everything, and you'd have to visit every item anyway), but it could be better.
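For reference, here is a minimal sketch of the parent-pointer structure with the "size" extension described above. The class and method names are made up for illustration, and it uses union by size rather than union by rank, which is my choice for this sketch and not something the text prescribes.

```python
class DisjointSets:
    """Parent-pointer forest with path compression and a size array."""

    def __init__(self, n):
        self.parent = list(range(n))  # every item starts as its own root
        self.size = [1] * n           # only meaningful at roots

    def find(self, x):
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: path compression, point the whole path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return a  # already in the same set
        if self.size[a] < self.size[b]:
            a, b = b, a  # attach the smaller tree under the larger one
        self.parent[b] = a
        self.size[a] += self.size[b]  # size[newroot] += size[otherroot]
        return a

    def set_size(self, x):
        return self.size[self.find(x)]
```

Note that there is still no cheap way to list the members of a set or to pull one item out; that is what the rest of the post is about.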

So here's a way that initially looks related, but is actually very different: cycles. It looks the same in that you start out with an array [0, 1, 2, 3, ...], but the similarities pretty much end there.
A set is represented by a cycle. You follow a cycle by doing x = array[x]. To merge two sets, take any item x from the first set and any item y from the second set, then swap array[x] and array[y]. Note that x and y really must be in different sets: doing the same swap on two items of one cycle splits that cycle in two.
Here is a four-item example to clarify, with the array contents after each step:
Begin situation. Every element is in its own cycle/set: [0, 1, 2, 3].

Merge the set {1} with the set {2}: [0, 2, 1, 3].

Merge the set {0} with the set {3}: [3, 2, 1, 0].

Merge the set {0,3} with the set {1,2}. This could be done in several ways; swapping array[2] with array[3] gives [3, 2, 0, 1], which is the single cycle 0 → 3 → 1 → 2 → 0.

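It should be obvious that this is reversible: swapping the same two entries again splits the cycle back into the two it came from. So you can easily "undo" merges, most recent first, if you remember which two items you used in each merge.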
You can also remove an item from its cycle, but that requires knowing the element which points to it. When you're iterating over a cycle, you can remember the previous element, so you can unlink elements from their set while enumerating that set.
Unlike with the first data structure, finding out which set an element is in is hard, and there's not even an indisputable representative of a set anyway. You could enumerate a set and take the biggest or smallest item as representative, though. Another trick is to add a bunch of items that are only used to "give names to sets".
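The cycle operations described above can be sketched roughly like this. The function names are invented for the example, and merge() assumes the caller already knows x and y are in different sets, since the cycle array alone cannot check that cheaply.

```python
def make_cycles(n):
    # Every item starts as a cycle of length 1.
    return list(range(n))

def merge(cycles, x, y):
    # Assumption: x and y are in different cycles. Calling this on two
    # items of the same cycle splits it instead, which is also how a
    # merge is undone.
    cycles[x], cycles[y] = cycles[y], cycles[x]

def members(cycles, x):
    # Follow the cycle from x until we come back around: O(|S|).
    items = [x]
    i = cycles[x]
    while i != x:
        items.append(i)
        i = cycles[i]
    return items

def unlink_matching(cycles, start, should_remove):
    # Unlink items while enumerating, using the remembered previous item.
    # 'start' itself is never removed in this sketch.
    prev, cur = start, cycles[start]
    while cur != start:
        nxt = cycles[cur]
        if should_remove(cur):
            # Swapping with the predecessor turns cur into a singleton.
            cycles[prev], cycles[cur] = cycles[cur], cycles[prev]
        else:
            prev = cur
        cur = nxt

cycles = make_cycles(4)
merge(cycles, 1, 2)   # {1} + {2}
merge(cycles, 0, 3)   # {0} + {3}
merge(cycles, 2, 3)   # {0,3} + {1,2}
everything = members(cycles, 0)
merge(cycles, 2, 3)   # same swap again undoes the last merge
```

The undo only restores the earlier state exactly when merges are undone in reverse order, which is why the pair of items used has to be remembered per merge.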

But it gets better. What if you combine those two data structures?
Merging becomes O(log n), inherited from the union-find structure. There is now an indisputable representative of a set, namely the root in the union-find structure. And now you can also remove an item from its set in a more reasonable O(|S|), where S is the set containing the item (vs O(n) before), with a very simple algorithm. If the array containing the cycles is called "cycles" and the item we're removing is "x", iterate over the cycle starting at cycles[x], setting every parent pointer (including that of cycles[x] itself) to cycles[x], which becomes the new root. When you reach the item y such that cycles[y] = x (i.e., you've completed the cycle), swap cycles[x] and cycles[y] to unlink the item, and make x its own parent. You can still keep track of the set sizes with no significant overhead.
So you can:

Merge the set containing x and the set containing y in O(log n), or O(1) if you already know that x and y are both roots
Determine the set containing x in O(log n)
Enumerate the set containing x in O(|S|)
Remove an item from its set (making it a singleton set) in O(|S|)
Get the size of the set containing x in O(log n), or O(1) if you already know that x is a root

Of course the amortized analysis of union/find with path compression still applies, so it's actually better than those O(log n)'s make it look (that already looks pretty good though).
The algorithm for removing an item from a set guarantees that find operations on any of the items in the set that you just removed an item from will run in O(1) the next time, partially offsetting the bad-looking O(|S|) time. Essentially it performs path compression on every item in the set, and in a way that's faster than usual.
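Putting the two structures together might look something like the sketch below. The class name, method names, and the choice of union by size are assumptions made for the example; the remove() method follows the algorithm described above, counting the remaining items as it goes so the size array stays correct.

```python
class CycleUnionFind:
    """Union-find forest plus a cycle array over the same items."""

    def __init__(self, n):
        self.parent = list(range(n))  # union-find parent pointers
        self.cycles = list(range(n))  # one cycle per set
        self.size = [1] * n           # only meaningful at roots

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return a  # must not swap inside one cycle, that would split it
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        # Splice the two cycles; any item from each set would do.
        self.cycles[a], self.cycles[b] = self.cycles[b], self.cycles[a]
        return a

    def members(self, x):
        items = [x]
        i = self.cycles[x]
        while i != x:
            items.append(i)
            i = self.cycles[i]
        return items

    def set_size(self, x):
        return self.size[self.find(x)]

    def remove(self, x):
        """Make x a singleton in O(|S|); the rest of S gets a new root."""
        new_root = self.cycles[x]
        if new_root == x:
            return  # already a singleton
        count = 0
        y = new_root
        while True:
            self.parent[y] = new_root  # flattens the whole remaining set
            count += 1
            if self.cycles[y] == x:
                break  # y is the item that points at x
            y = self.cycles[y]
        # Unlink x from the cycle by swapping with its predecessor.
        self.cycles[x], self.cycles[y] = self.cycles[y], self.cycles[x]
        self.size[new_root] = count
        self.parent[x] = x
        self.size[x] = 1
```

After remove(x), every remaining item points directly at the new root, which is the "path compression on every item" effect mentioned above.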

As a variant of the cycle structure, you can use two arrays, containing the same cycles but with one going "backwards", essentially emulating a doubly linked list instead of a singly linked one. Visualize it with double arrows. In this data structure, you can unlink items from their cycle in O(1) (at all times, not just if you remember the previous node), and here's another fun trick: instead of making the unlinked item into a singleton cycle, you can keep its pointers intact. That way it remembers its original position in the cycle it came from, and you can undo a sequence of unlinkings by relinking them in reverse order. This is a reasonably well-known trick on cyclic doubly linked lists, most famously used in the Dancing Links algorithm.
There's not much point in combining this with the union-find structure, but it works well together with a plain old "array of integers saying which set an item belongs to". That's bad for merging two sets, but it becomes interesting if you're only ever shuffling single items between sets.
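A rough sketch of the two-array variant, with the unlink/relink trick, could look like this. The names "next" and "prev" for the two arrays and the method names are my own; merge() again assumes x and y are in different cycles.

```python
class DoublyLinkedCycles:
    """Cycles stored in two arrays, one per direction."""

    def __init__(self, n):
        self.next = list(range(n))
        self.prev = list(range(n))

    def merge(self, x, y):
        # Assumption: x and y are in different cycles.
        nx, ny = self.next[x], self.next[y]
        self.next[x], self.next[y] = ny, nx
        self.prev[ny], self.prev[nx] = x, y

    def unlink(self, x):
        # O(1): the neighbours skip over x, but x keeps its own pointers,
        # so it still remembers where it used to be.
        p, n = self.prev[x], self.next[x]
        self.next[p] = n
        self.prev[n] = p

    def relink(self, x):
        # Undo an unlink. Only valid in reverse order of the unlinks.
        self.next[self.prev[x]] = x
        self.prev[self.next[x]] = x

    def members(self, x):
        # Only meaningful for items that are currently linked in.
        items = [x]
        i = self.next[x]
        while i != x:
            items.append(i)
            i = self.next[i]
        return items

d = DoublyLinkedCycles(5)
d.merge(0, 1)
d.merge(0, 2)
d.merge(0, 3)
before = d.members(0)
d.unlink(3)
d.unlink(2)
d.relink(2)   # most recently unlinked item goes back first
d.relink(3)
after = d.members(0)   # same order as 'before'
```

Because an unlinked item keeps its stale pointers, something else (for example the per-item set number mentioned above) has to record that it is currently not part of the cycle.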

Here's another trick: all the arrays can start out as all zeroes, with only small modifications to the code (and no conceptual changes to the algorithms). Instead of treating each entry as the index of the item it refers to, treat it as an offset. You're indexing into the array, so you always know the current index, and the value that you're conceptually storing is just i + array[i]. The ranks already start at 0, and making the set sizes zero just means offsetting them by 1.
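As an illustration of the offset encoding, here is a small sketch where every array really does start as all zeroes. The function names are hypothetical, the size array stores "size minus one", and union by size stands in for union by rank.

```python
def make_sets(n):
    parent = [0] * n  # offset to the parent; 0 means "I am a root"
    extra = [0] * n   # set size minus one, only meaningful at roots
    cycles = [0] * n  # offset to the next item in the cycle
    return parent, extra, cycles

def find(parent, x):
    # The conceptual parent of i is i + parent[i].
    root = x
    while parent[root] != 0:
        root += parent[root]
    while parent[x] != 0:  # path compression, storing offsets
        nxt = x + parent[x]
        parent[x] = root - x
        x = nxt
    return root

def union(parent, extra, x, y):
    a, b = find(parent, x), find(parent, y)
    if a == b:
        return a
    if extra[a] < extra[b]:
        a, b = b, a
    parent[b] = a - b           # store the offset, not the index
    extra[a] += extra[b] + 1    # (sa - 1) + (sb - 1) + 1 == sa + sb - 1
    return a

def set_size(parent, extra, x):
    return extra[find(parent, x)] + 1

def splice(cycles, x, y):
    # The cycle swap, done on conceptual values and stored back as offsets.
    nx, ny = x + cycles[x], y + cycles[y]
    cycles[x], cycles[y] = ny - x, nx - y
```

Offsets can be negative here, which is harmless in Python because they are only ever added to the current index. In a language with unsigned indices, wrapping arithmetic would be needed instead.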
