Global Rebuilding, Coroutines, and Defunctionalization

Introduction

In 1983, Mark Overmars described global rebuilding in The Design of Dynamic Data Structures. The problem it was aimed at solving was turning the amortized time complexity bounds of batched rebuilding into worst-case bounds. In batched rebuilding we perform a series of updates to a data structure which may cause the performance of operations to degrade, but occasionally we expensively rebuild the data structure back into an optimal arrangement. If the updates don’t degrade performance too much before we rebuild, then we can achieve our target time complexity bounds in an amortized sense. An update that doesn’t degrade performance too much is called a weak update.

Taking an example from Okasaki’s Purely Functional Data Structures, we can consider a binary search tree where deletions occur by simply marking the deleted nodes as deleted. Then, once about half the tree is marked as deleted, we rebuild the tree into a balanced binary search tree and clean out the nodes marked as deleted at that time. In this case, the deletions count as weak updates because leaving the deleted nodes in the tree even when it corresponds to up to half the tree can only mildly impact the time complexity of other operations. Specifically, assuming the tree was balanced at the start, then deleting half the nodes could only reduce the tree’s depth by about 1. On the other hand, naive inserts are not weak updates as they can quickly increase the tree’s depth.

The idea of global rebuilding is relatively straightforward, though how you would actually realize it in any particular example is not. The overall idea is simply that instead of waiting until the last moment and then rebuilding the data structure all at once, we’ll start the rebuild sooner and work at it incrementally as we perform other operations. If we update the new version faster than we update the original version, we’ll finish it by the time we would have wanted to perform a batched rebuild, and we can just switch to this new version.

More concretely, though still quite vaguely, global rebuilding involves, when a threshold is reached, rebuilding by creating a new “empty” version of the data structure called the shadow copy. The original version is the working copy. Work on rebuilding happens incrementally as operations are performed on the data structure. During this period, we service queries from the working copy and continue to update it as usual. Each update needs to make more progress on building the shadow copy than it worsens the working copy. For example, an insert should insert more nodes into the shadow copy than the working copy. Once the shadow copy is built, we may still have more work to do to incorporate changes that occurred after we started the rebuild. To this end, we can maintain a queue of update operations performed on the working copy since the start of a rebuild, and then apply these updates, also incrementally, to the shadow copy. Again, we need to apply the updates from the queue at a fast enough rate so that we will eventually catch up. Of course, all of this needs to happen fast enough so that 1) the working copy doesn’t get too degraded before the shadow copy is ready, and 2) we don’t end up needing to rebuild the shadow copy before it’s ready to do any work.

Coroutines

Okasaki passingly mentions that global rebuilding “can be usefully viewed as running the rebuilding transformation as a coroutine”. Also, the situation described above is quite reminiscent of garbage collection. There the classic half-space stop-the-world copying collector is naturally the batched rebuilding version. More incremental versions often have read or write barriers and break the garbage collection into incremental steps. Garbage collection is also often viewed as two processes coroutining.

The goal of this article is to derive global rebuilding-based data structures from an expression of them as two coroutining processes. Ideally, we should be able to take a data structure implemented via batched rebuilding and simply run the batch rebuilding step as a coroutine. Modifying the data structure’s operations and the rebuilding step should, in theory, just be a matter of inserting appropriate yield statements. Of course, it won’t be that easy since the batched version of rebuilding doesn’t need to worry about concurrent updates to the original data structure.

In theory, such a representation would be a perfectly effective way of articulating the global rebuilding version of the data structure. That said, I will be using the standard power move of CPS transforming and defunctionalizing to get a more data structure-like result.

I’ll implement coroutines as a very simplified case of modeling cooperative concurrency with continuations. In that context, a “process” written in continuation-passing style “yields” to the scheduler by passing its continuation to a scheduling function. Normally, the scheduler would place that continuation at the end of a work queue and then pick up a continuation from the front of the work queue and invoke it resuming the previously suspended “process”. In our case, we only have two “processes” so our “work queue” can just be a single mutable cell. When one “process” yields, it just swaps its continuation into the cell and the other “process’” out and invokes the continuation it read.

Since the rebuilding process is always driven by the main process, the pattern is a bit more like generators. This has the benefit that only the rebuilding process needs to be written in continuation-passing style. The following is a very quick and dirty set of functions for this.

module Coroutine ( YieldFn, spawn ) where
import Control.Monad ( join )
import Data.IORef ( IORef, newIORef, readIORef, writeIORef )

type YieldFn = IO () -> IO ()

yield :: IORef (IO ()) -> IO () -> IO ()
yield = writeIORef

resume :: IORef (IO ()) -> IO ()
resume = join . readIORef

terminate :: IORef (IO ()) -> IO ()
terminate yieldRef = writeIORef yieldRef (ioError $ userError "Subprocess completed")

spawn :: (YieldFn -> IO () -> IO ()) -> IO (IO ())
spawn process = do
    yieldRef <- newIORef undefined
    writeIORef yieldRef $ process (yield yieldRef) (terminate yieldRef)
    return (resume yieldRef)

A simple example of usage is:

process :: YieldFn -> Int -> IO () -> IO ()
process     _ 0 k = k
process yield i k = do
    putStrLn $ "Subprocess: " ++ show i
    yield $ process yield (i-1) k

example :: IO ()
example = do
    resume <- spawn $ \yield -> process yield 10
    forM_ [(1 :: Int) .. 10] $ \i -> do
        putStrLn $ "Main process: " ++ show i
        resume
    putStrLn "Main process done"

with output:

Main process: 1
Subprocess: 10
Main process: 2
Subprocess: 9
Main process: 3
Subprocess: 8
Main process: 4
Subprocess: 7
Main process: 5
Subprocess: 6
Main process: 6
Subprocess: 5
Main process: 7
Subprocess: 4
Main process: 8
Subprocess: 3
Main process: 9
Subprocess: 2
Main process: 10
Subprocess: 1
Main process done

Queues

I’ll use queues since they are very simple and Purely Functional Data Structures describes Hood-Melville Real-Time Queues in Figure 8.1 as an example of global rebuilding. We’ll end up with something quite similar which could be made more similar by changing the rebuilding code. Indeed, the differences are just an artifact of specific, easily changed details of the rebuilding coroutine, as we’ll see.

The examples I’ll present are mostly imperative, not purely functional. There are two reasons for this. First, I’m not focused on purely functional data structures and the technique works fine for imperative data structures. Second, it is arguably more natural to talk about coroutines in an imperative context. In this case, it’s easy to adapt the code to a purely functional version since it’s not much more than a purely functional data structure stuck in an IORef.

For a more imperative structure with mutable linked structure and/or in-place array updates, it would be more challenging to produce a purely functional version. The techniques here could still be used, though there are more “concurrency” concerns. While I don’t include the code here, I did a similar exercise for a random-access stack (a fancy way of saying a growable array). There the “concurrency” concern is that the elements you are copying to the new array may be popped and potentially overwritten before you switch to the new array. In this case, it’s easy to solve, since if the head pointer of the live version reaches the source offset for copy, you can just switch to the new array immediately.

Nevertheless, I can easily imagine scenarios where it may be beneficial, if not necessary, for the coroutines to communicate more and/or for there to be multiple “rebuild” processes. The approach used here could be easily adapted to that. It’s also worth mentioning that even in simpler cases, non-constant-time operations will either need to invoke resume multiple times or need more coordination with the “rebuild” process to know when it can do more than a constant amount of work. This could be accomplished by “rebuild” process simply recognizing this from the data structure state, or some state could be explicitly set to indicate this, or the techniques described earlier could be used, e.g. a different process for non-constant-time operations.

The code below uses the extensions BangPatterns, RecordWildCards, and GADTs.

Batched Rebuilding Implementation

We start with the straightforward, amortized constant-time queues where we push to a stack representing the back of the queue and pop from a stack representing the front. When the front stack is empty, we need to expensively reverse the back stack to make a new front stack.

I intentionally separate out the reverse step as an explicit rebuild function.

module BatchedRebuildingQueue ( Queue, new, enqueue, dequeue ) where
import Data.IORef ( IORef, newIORef, readIORef, writeIORef, modifyIORef )

data Queue a = Queue {
    queueRef :: IORef ([a], [a])
}

new :: IO (Queue a)
new = do
    queueRef <- newIORef ([], [])
    return Queue { .. }

dequeue :: Queue a -> IO (Maybe a)
dequeue q@(Queue { .. }) = do
    (front, back) <- readIORef queueRef
    case front of
        (x:front') -> do
            writeIORef queueRef (front', back)
            return (Just x)
        [] -> case back of
                [] -> return Nothing
                _ -> rebuild q >> dequeue q

enqueue :: a -> Queue a -> IO ()
enqueue x (Queue { .. }) =
    modifyIORef queueRef (\(front, back) -> (front, x:back))

rebuild :: Queue a -> IO ()
rebuild (Queue { .. }) =
    modifyIORef queueRef (\([], back) -> (reverse back, []))

Global Rebuilding Implementation

This step is where a modicum of thought is needed. We need to make the rebuild step from the batched version incremental. This is straightforward, if tedious, given the coroutine infrastructure. In this case, we incrementalize the reverse by reimplementing reverse in CPS with some yield calls inserted. Then we need to incrementalize append. Since we’re not waiting until front is empty, we’re actually computing front ++ reverse back. Incrementalizing append is hard, so we actually reverse front and then use an incremental reverseAppend (which is basically what the incremental reverse does anyway1).

One of first thing to note about this code is that the actual operations are largely unchanged other than inserting calls to resume. In fact, dequeue is even simpler than in the batched version as we can just assume that front is always populated when the queue is not empty. dequeue is freed from the responsibility of deciding when to trigger a rebuild. Most of the bulk of this code is from reimplementing a reverseAppend function (twice).

The parts of this code that require some deeper though are 1) knowing when a rebuild should begin, 2) knowing how “fast” the incremental operations should go2 (e.g. incrementalReverse does two steps at a time and the Hood-Melville implementation has an explicit exec2 that does two steps at a time), and 3) dealing with “concurrent” changes.

For the last, Overmars describes a queue of deferred operations to perform on the shadow copy once it finishes rebuilding. This kind of suggests a situation where the “rebuild” process can reference some “snapshot” of the data structure. In our case, that is the situation we’re in, since our data structures are essentially immutable data structures in an IORef. However, it can easily not be the case, e.g. the random-access stack. Also, this operation queue approach can easily be inefficient and inelegant. None of the implementations below will have this queue of deferred operations. It is easier, more efficient, and more elegant to just not copy over parts of the queue that have been dequeued, rather than have an extra phase of the rebuilding that just pops off the elements of the front stack that we just pushed. A similar situation happens for the random-access stack.

The use of drop could probably be easily eliminated. (I’m not even sure it’s still necessary.) It is mostly an artifact of (not) dealing with off-by-one issues.

module GlobalRebuildingQueue ( Queue, new, dequeue, enqueue ) where
import Data.IORef ( IORef, newIORef, readIORef, writeIORef, modifyIORef, modifyIORef' )
import Coroutine ( YieldFn, spawn )

data Queue a = Queue {
    resume :: IO (),
    frontRef :: IORef [a],
    backRef :: IORef [a],
    frontCountRef :: IORef Int,
    backCountRef :: IORef Int
}

new :: IO (Queue a)
new = do
    frontRef <- newIORef []
    backRef <- newIORef []
    frontCountRef <- newIORef 0
    backCountRef <- newIORef 0
    resume <- spawn $ const . rebuild frontRef backRef frontCountRef backCountRef
    return Queue { .. }

dequeue :: Queue a -> IO (Maybe a)
dequeue q = do
    resume q
    front <- readIORef (frontRef q)
    case front of
        [] -> return Nothing
        (x:front') -> do
            modifyIORef' (frontCountRef q) pred
            writeIORef (frontRef q) front'
            return (Just x)

enqueue :: a -> Queue a -> IO ()
enqueue x q = do
    modifyIORef (backRef q) (x:)
    modifyIORef' (backCountRef q) succ
    resume q

rebuild :: IORef [a] -> IORef [a] -> IORef Int -> IORef Int -> YieldFn -> IO ()
rebuild frontRef backRef frontCountRef backCountRef yield = let k = go k in go k where
  go k = do
    frontCount <- readIORef frontCountRef
    backCount <- readIORef backCountRef
    if backCount > frontCount then do
        back <- readIORef backRef
        front <- readIORef frontRef
        writeIORef backRef []
        writeIORef backCountRef 0
        incrementalReverse back [] $ \rback ->
            incrementalReverse front [] $ \rfront ->
                incrementalRevAppend rfront rback 0 backCount k
      else do
        yield k

  incrementalReverse [] acc k = k acc
  incrementalReverse [x] acc k = k (x:acc)
  incrementalReverse (x:y:xs) acc k = yield $ incrementalReverse xs (y:x:acc) k

  incrementalRevAppend [] front !movedCount backCount' k = do
    writeIORef frontRef front
    writeIORef frontCountRef $! movedCount + backCount'
    yield k
  incrementalRevAppend (x:rfront) acc !movedCount backCount' k = do
    currentFrontCount <- readIORef frontCountRef
    if currentFrontCount <= movedCount then do
        -- This drop count should be bounded by a constant.
        writeIORef frontRef $! drop (movedCount - currentFrontCount) acc
        writeIORef frontCountRef $! currentFrontCount + backCount'
        yield k
      else if null rfront then
        incrementalRevAppend [] (x:acc) (movedCount + 1) backCount' k
      else
        yield $! incrementalRevAppend rfront (x:acc) (movedCount + 1) backCount' k

Defunctionalized Global Rebuilding Implementation

This step is completely mechanical.

There’s arguably no reason to defunctionalize. It produces a result that is more data-structure-like, but, unless you need the code to work in a first-order language, there’s nothing really gained by doing this. It does lead to a result that is more directly comparable to other implementations.

For some data structures, having the continuation be analyzable would provide a simple means for the coroutines to communicate. The main process could directly look at the continuation to determine its state, e.g. if a rebuild is in-progress at all. The main process could also directly manipulate the stored continutation to change the “rebuild” process’ behavior. That said, doing this would mean that we’re not deriving the implementation. Still, the opportunity for additional optimizations and simplifications is nice.

As a minor aside, while it is, of course, obvious from looking at the previous version of the code, it’s neat how the Kont data type implies that the call stack is bounded and that most calls are tail calls. REVERSE_STEP is the only constructor that contains a Kont argument, but its type means that that argument can’t itself be a REVERSE_STEP. Again, I just find it neat how defunctionalization makes this concrete and explicit.

module DefunctionalizedQueue ( Queue, new, dequeue, enqueue ) where
import Data.IORef ( IORef, newIORef, readIORef, writeIORef, modifyIORef, modifyIORef' )

data Kont a r where
  IDLE :: Kont a ()
  REVERSE_STEP :: [a] -> [a] -> Kont a [a] -> Kont a ()
  REVERSE_FRONT :: [a] -> !Int -> Kont a [a]
  REV_APPEND_START :: [a] -> !Int -> Kont a [a]
  REV_APPEND_STEP :: [a] -> [a] -> !Int -> !Int -> Kont a ()

applyKont :: Queue a -> Kont a r -> r -> IO ()
applyKont q IDLE _ = rebuildLoop q
applyKont q (REVERSE_STEP xs acc k) _ = incrementalReverse q xs acc k
applyKont q (REVERSE_FRONT front backCount) rback =
    incrementalReverse q front [] $ REV_APPEND_START rback backCount
applyKont q (REV_APPEND_START rback backCount) rfront =
    incrementalRevAppend q rfront rback 0 backCount
applyKont q (REV_APPEND_STEP rfront acc movedCount backCount) _ =
    incrementalRevAppend q rfront acc movedCount backCount

rebuildLoop :: Queue a -> IO ()
rebuildLoop q@(Queue { .. }) = do
    frontCount <- readIORef frontCountRef
    backCount <- readIORef backCountRef
    if backCount > frontCount then do
        back <- readIORef backRef
        front <- readIORef frontRef
        writeIORef backRef []
        writeIORef backCountRef 0
        incrementalReverse q back [] $ REVERSE_FRONT front backCount
      else do
        writeIORef resumeRef IDLE

incrementalReverse :: Queue a -> [a] -> [a] -> Kont a [a] -> IO ()
incrementalReverse q [] acc k = applyKont q k acc
incrementalReverse q [x] acc k = applyKont q k (x:acc)
incrementalReverse q (x:y:xs) acc k = writeIORef (resumeRef q) $ REVERSE_STEP xs (y:x:acc) k

incrementalRevAppend :: Queue a -> [a] -> [a] -> Int -> Int -> IO ()
incrementalRevAppend (Queue { .. }) [] front !movedCount backCount' = do
    writeIORef frontRef front
    writeIORef frontCountRef $! movedCount + backCount'
    writeIORef resumeRef IDLE
incrementalRevAppend q@(Queue { .. }) (x:rfront) acc !movedCount backCount' = do
    currentFrontCount <- readIORef frontCountRef
    if currentFrontCount <= movedCount then do
        -- This drop count should be bounded by a constant.
        writeIORef frontRef $! drop (movedCount - currentFrontCount) acc
        writeIORef frontCountRef $! currentFrontCount + backCount'
        writeIORef resumeRef IDLE
      else if null rfront then
        incrementalRevAppend q [] (x:acc) (movedCount + 1) backCount'
      else
        writeIORef resumeRef $! REV_APPEND_STEP rfront (x:acc) (movedCount + 1) backCount'

resume :: Queue a -> IO ()
resume q = do
    kont <- readIORef (resumeRef q)
    applyKont q kont ()

data Queue a = Queue {
    resumeRef :: IORef (Kont a ()),
    frontRef :: IORef [a],
    backRef :: IORef [a],
    frontCountRef :: IORef Int,
    backCountRef :: IORef Int
}

new :: IO (Queue a)
new = do
    frontRef <- newIORef []
    backRef <- newIORef []
    frontCountRef <- newIORef 0
    backCountRef <- newIORef 0
    resumeRef <- newIORef IDLE
    return Queue { .. }

dequeue :: Queue a -> IO (Maybe a)
dequeue q  = do
    resume q
    front <- readIORef (frontRef q)
    case front of
        [] -> return Nothing
        (x:front') -> do
            modifyIORef' (frontCountRef q) pred
            writeIORef (frontRef q) front'
            return (Just x)

enqueue :: a -> Queue a -> IO ()
enqueue x q = do
    modifyIORef (backRef q) (x:)
    modifyIORef' (backCountRef q) succ
    resume q

Functional Defunctionalized Global Rebuilding Implementation

This is just a straightforward reorganization of the previous code into purely functional code. This produces a persistent queue with worst-case constant time operations.

It is, of course, far uglier and more ad-hoc than Okasaki’s extremely elegant real-time queues, but the methodology to derive it was simple-minded. The result is also quite similar to the Hood-Melville Queues even though I did not set out to achieve that. That said, I’m pretty confident you could derive pretty much exactly the Hood-Melville queues with just minor modifications to Global Rebuilding Implementation.

module FunctionalQueue ( Queue, empty, dequeue, enqueue ) where

data Kont a r where
  IDLE :: Kont a ()
  REVERSE_STEP :: [a] -> [a] -> Kont a [a] -> Kont a ()
  REVERSE_FRONT :: [a] -> !Int -> Kont a [a]
  REV_APPEND_START :: [a] -> !Int -> Kont a [a]
  REV_APPEND_STEP :: [a] -> [a] -> !Int -> !Int -> Kont a ()

applyKont :: Queue a -> Kont a r -> r -> Queue a
applyKont q IDLE _ = rebuildLoop q
applyKont q (REVERSE_STEP xs acc k) _ = incrementalReverse q xs acc k
applyKont q (REVERSE_FRONT front backCount) rback =
    incrementalReverse q front [] $ REV_APPEND_START rback backCount
applyKont q (REV_APPEND_START rback backCount) rfront =
    incrementalRevAppend q rfront rback 0 backCount
applyKont q (REV_APPEND_STEP rfront acc movedCount backCount) _ =
    incrementalRevAppend q rfront acc movedCount backCount

rebuildLoop :: Queue a -> Queue a
rebuildLoop q@(Queue { .. }) =
    if backCount > frontCount then
        let q' = q { back = [], backCount = 0 } in
        incrementalReverse q' back [] $ REVERSE_FRONT front backCount
      else
        q { resumeKont = IDLE }

incrementalReverse :: Queue a -> [a] -> [a] -> Kont a [a] -> Queue a
incrementalReverse q [] acc k = applyKont q k acc
incrementalReverse q [x] acc k = applyKont q k (x:acc)
incrementalReverse q (x:y:xs) acc k = q { resumeKont = REVERSE_STEP xs (y:x:acc) k }

incrementalRevAppend :: Queue a -> [a] -> [a] -> Int -> Int -> Queue a
incrementalRevAppend q [] front' !movedCount backCount' =
    q { front = front', frontCount = movedCount + backCount', resumeKont = IDLE }
incrementalRevAppend q (x:rfront) acc !movedCount backCount' =
    if frontCount q <= movedCount then
        -- This drop count should be bounded by a constant.
        let !front = drop (movedCount - frontCount q) acc in
        q { front = front, frontCount = frontCount q + backCount', resumeKont = IDLE }
      else if null rfront then
        incrementalRevAppend q [] (x:acc) (movedCount + 1) backCount'
      else
        q { resumeKont = REV_APPEND_STEP rfront (x:acc) (movedCount + 1) backCount' }

resume :: Queue a -> Queue a
resume q = applyKont q (resumeKont q) ()

data Queue a = Queue {
    resumeKont :: !(Kont a ()),
    front :: [a],
    back :: [a],
    frontCount :: !Int,
    backCount :: !Int
}

empty :: Queue a
empty = Queue { resumeKont = IDLE, front = [], back = [], frontCount = 0, backCount = 0 }

dequeue :: Queue a -> (Maybe a, Queue a)
dequeue q =
    case front of
        [] -> (Nothing, q)
        (x:front') ->
            (Just x, q' { front = front', frontCount = frontCount - 1 })
  where q'@(Queue { .. }) = resume q

enqueue :: a -> Queue a -> Queue a
enqueue x q@(Queue { .. }) = resume (q { back = x:back, backCount = backCount + 1 })

Hood-Melville Implementation

This is just the Haskell code from Purely Functional Data Structures adapted to the interface of the other examples.

This code is mostly to compare. The biggest difference, other than some code structuring differences, is the front and back lists are reversed in parallel while my code does them sequentially. As mentioned before, to get a structure like that would simply be a matter of defining a parallel incremental reverse back in the Global Rebuilding Implementation.

Again, Okasaki’s real-time queue that can be seen as an application of the lazy rebuilding and scheduling techniques, described in his thesis and book, is a better implementation than this in pretty much every way.

module HoodMelvilleQueue (Queue, empty, dequeue, enqueue) where

data RotationState a
  = Idle
  | Reversing !Int [a] [a] [a] [a]
  | Appending !Int [a] [a]
  | Done [a]

data Queue a = Queue !Int [a] (RotationState a) !Int [a]

exec :: RotationState a -> RotationState a
exec (Reversing ok (x:f) f' (y:r) r') = Reversing (ok+1) f (x:f') r (y:r')
exec (Reversing ok [] f' [y] r') = Appending ok f' (y:r')
exec (Appending 0 f' r') = Done r'
exec (Appending ok (x:f') r') = Appending (ok-1) f' (x:r')
exec state = state

invalidate :: RotationState a -> RotationState a
invalidate (Reversing ok f f' r r') = Reversing (ok-1) f f' r r'
invalidate (Appending 0 f' (x:r')) = Done r'
invalidate (Appending ok f' r') = Appending (ok-1) f' r'
invalidate state = state

exec2 :: Int -> [a] -> RotationState a -> Int -> [a] -> Queue a
exec2 !lenf f state lenr r =
    case exec (exec state) of
        Done newf -> Queue lenf newf Idle lenr r
        newstate -> Queue lenf f newstate lenr r

check :: Int -> [a] -> RotationState a -> Int -> [a] -> Queue a
check !lenf f state !lenr r =
    if lenr <= lenf then exec2 lenf f state lenr r
    else let newstate = Reversing 0 f [] r []
         in exec2 (lenf+lenr) f newstate 0 []

empty :: Queue a
empty = Queue 0 [] Idle 0 []

dequeue :: Queue a -> (Maybe a, Queue a)
dequeue q@(Queue _ [] _ _ _) = (Nothing, q)
dequeue (Queue lenf (x:f') state lenr r) =
    let !q' = check (lenf-1) f' (invalidate state) lenr r in
    (Just x, q')

enqueue :: a -> Queue a -> Queue a
enqueue x (Queue lenf f state lenr r) = check lenf f state (lenr+1) (x:r)

Okasaki’s Real-Time Queues

Just for completeness. This implementation crucially relies on lazy evaluation. Our queues are of the form Queue f r s. If you look carefully, you’ll notice that the only place we consume s is in the first clause of exec, and there we discard its elements. In other words, we only care about the length of s. s gets “decremented” each time we enqueue until it’s empty at which point we rotate r to f in the second clause of exec. The key thing is that f and s are initialized to the same value in that clause. That means each time we “decrement” s we are also forcing a bit of f. Forcing a bit of f/s means computing a bit of rotate. rotate xs ys a is an incremental version of xs ++ reverse ys ++ a (where we use the invariant length ys = 1 + length xs for the base case).

Using Okasaki’s terminology, rotate illustrates a simple form of lazy rebuilding where we use lazy evaluation rather than explicit or implicit coroutines to perform work “in parallel”. Here, we interleave the evaluation of rotate with enqueue and dequeue via forcing the conses of f/s. However, lazy rebuilding itself may not lead to worst-case optimal times (assuming it is amortized optimal). We need to use Okasaki’s other technique of scheduling to strategically force the thunks incrementally rather than all at once. Here s is a schedule telling us when to force parts of f. (As mentioned, s also serves as a counter telling us when to perform a rebuild.)

module OkasakiQueue ( Queue, empty, dequeue, enqueue ) where

data Queue a = Queue [a] ![a] [a]

empty :: Queue a
empty = Queue [] [] []

dequeue :: Queue a -> (Maybe a, Queue a)
dequeue q@(Queue [] _ _) = (Nothing, q)
dequeue (Queue (x:f) r s) = (Just x, exec f r s)

rotate :: [a] -> [a] -> [a] -> [a]
rotate     [] (y: _) a = y:a
rotate (x:xs) (y:ys) a = x:rotate xs ys (y:a)

exec :: [a] -> [a] -> [a] -> Queue a
exec f !r (_:s) = Queue f r s
exec f !r [] = let f' = rotate f r [] in Queue f' [] f'

enqueue :: a -> Queue a -> Queue a
enqueue x (Queue f r s) = exec f (x:r) s 

It’s instructive to compare the above to the following implementation which doesn’t use a schedule. This implementation is essentially the Banker’s Queue from Okasaki’s book, except we use lazy rebuilding to spread the xs ++ reverse ys (particularly the reverse part) over multiple dequeues via rotate. The following implementation performs extremely well in my benchmark, but the operations are subtly not constant-time. Specifically, after a long series of enqueues, a dequeue will do work proportional to the logarithm of the number of enqueues. Essentially, f will be a nested series of rotate calls, one for every doubling of the length of the queue. Even if we change let f' to let !f', that will only make the first dequeue cheap. The second will still be expensive.

module UnscheduledOkasakiQueue ( Queue, empty, dequeue, enqueue ) where

data Queue a = Queue [a] !Int [a] !Int

empty :: Queue a
empty = Queue [] 0 [] 0

dequeue :: Queue a -> (Maybe a, Queue a)
dequeue q@(Queue [] _ _ _) = (Nothing, q)
dequeue (Queue (x:f) lenf r lenr) = (Just x, exec f (lenf - 1) r lenr)

rotate :: [a] -> [a] -> [a] -> [a]
rotate     [] (y: _) a = y:a
rotate (x:xs) (y:ys) a = x:rotate xs ys (y:a)

exec :: [a] -> Int -> [a] -> Int -> Queue a
exec f !lenf !r !lenr | lenf >= lenr = Queue f lenf r lenr
exec f !lenf !r !lenr = let f' = rotate f r [] in Queue f' (lenf + lenr) [] 0

enqueue :: a -> Queue a -> Queue a
enqueue x (Queue f lenf r lenr) = exec f lenf (x:r) (lenr + 1) 

Empirical Evaluation

I won’t reproduce the evaluation code as it’s not very sophisticated or interesting. It randomly generated a sequence of enqueues and dequeues with an 80% chance to produce an enqueue over a dequeue so that the queues would grow. It measured the average time of an enqueue and a dequeue, as well as the maximum time of any single dequeue.

The main thing I wanted to see was relatively stable average enqueue and dequeue times with only the batched implementation having a growing maximum dequeue time. This is indeed what I saw, though it took about 1,000,000 operations (or really a queue of a couple hundred thousand elements) for the numbers to stabilize.

The results were mostly unsurprising. Unsurprisingly, in overall time, the batched implementation won. Its enqueue is also, obviously, the fastest. (Indeed, there’s a good chance my measurement of its average enqueue time was largely a measurement of the timer’s resolution.) The operations’ average times were stable illustrating their constant (amortized) time. At large enough sizes, the ratio of the maximum dequeue time versus the average stabilized around 7000 to 1, except, of course, for the batched version which grew linearly to millions to 1 ratios at queue sizes of tens of millions of elements. This illustrates the worst-case time complexity of all the other implementations, and the merely amortized time complexity of the batched one.

While the batched version was best in overall time, the difference wasn’t that great. The worst implementations were still less 1.4x slower. All the worst-case optimal implementations performed roughly the same, but there were still some clear winners and losers. Okasaki’s real-time queue is almost on-par with the batched implementation in overall time and handily beats the other implementations in average enqueue and dequeue times. The main surprise for me was that the loser was the Hood-Melville queue. My guess is this is due to invalidate which seems like it would do more work and produce more garbage than the approach taken in my functional version.

Conclusion

The point of this article was to illustrate the process of deriving a deamortized data structure from an amortized one utilizing batched rebuilding by explicitly modeling global rebuilding as a coroutine.

The point wasn’t to produce the fastest queue implementation, though I am pretty happy with the results. While this is an extremely simple example, it was still nice that each step was very easy and natural. It’s especially nice that this derivation approach produced a better result than the Hood-Melville queue.

Of course, my advice is to use Okasaki’s real-time queue if you need a purely functional queue with worst-case constant-time operations.


  1. This code could definitely be refactored to leverage this similarity to reduce code. Alternatively, one could refunctionalize the Hood-Melville implementation at the end.↩︎

  2. Going “too fast”, so long as it’s still a constant amount of work for each step, isn’t really an issue asymptotically, so you can just crank the knobs if you don’t want to think too hard about it. That said, going faster than you need to will likely give you worse worst-case constant factors. In some cases, going faster than necessary could reduce constant factors, e.g. by better utilizing caches and disk I/O buffers.↩︎