Copy-on-write, wandering trees, and data recovery

Many filesystems use copy-on-write, and many use B-trees to store their metadata. Combining these two techniques results in a peculiar on-disk metadata pattern called a "wandering tree".

Atomic writes

Let's start with the concept of atomic writes. An atomic write can't fail partway through: it either completes fully or leaves the data unchanged, as if it never started. It is called atomic to indicate that it is indivisible.

Single-sector writes are considered atomic; multi-sector writes are not. If a multi-sector write fails partway through, some sectors may contain old data while others contain new data. Since most filesystem metadata structures are larger than a sector, there is an obvious problem - a power failure may leave a metadata structure inconsistent, with one part containing the old version and the other containing the new version.

Older filesystems (like FAT) didn't care. Later ones introduced journaling. NTFS, for example, uses journaling and additionally stores a generation number in each sector of every multi-sector structure. On read, it verifies that generation numbers in all the sectors match; this allows for the detection of inconsistencies. Modern filesystems often use copy-on-write.
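A minimal sketch of the kind of check that per-sector generation numbers enable. The record layout and field names here are hypothetical, not the real NTFS on-disk format; the point is only that a torn multi-sector write is detectable because its sectors disagree:

```python
# Sketch: detecting a torn multi-sector write via per-sector generation
# numbers, in the spirit of NTFS's scheme. Layout is illustrative only.

def is_consistent(sectors):
    """A multi-sector record is consistent only if every sector
    carries the same generation number."""
    generations = [s["generation"] for s in sectors]
    return all(g == generations[0] for g in generations)

# A record written in one piece: all four sectors share generation 7.
intact = [{"generation": 7, "data": b"..."}] * 4

# A torn write: power failed after updating only two of four sectors.
torn = [{"generation": 8}, {"generation": 8},
        {"generation": 7}, {"generation": 7}]

print(is_consistent(intact))  # True
print(is_consistent(torn))    # False
```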

Copy-on-write

Copy-on-write, often abbreviated as CoW, is another method of maintaining filesystem consistency. Let's say, for example, that we have some multi-sector metadata structure and a pointer to it. The pointer, unlike the structure itself, fits into a single sector and can therefore be updated atomically. Instead of updating the metadata structure in place by overwriting it, we write a new version of the metadata to free space and then update the pointer. If the power fails while the new metadata is being written, the pointer still points to the valid old version. Since the pointer update is atomic, the filesystem always remains consistent - it uses either the old version or the new version of the metadata, but never something in between.
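The write-elsewhere-then-flip-the-pointer sequence can be sketched as follows. The storage model here is a made-up toy, not any real filesystem's allocator:

```python
# Sketch of a copy-on-write update. The "disk" is a dict of
# block address -> contents; nothing is ever overwritten in place.

storage = {}
next_free = [0]

def write_new_block(contents):
    """Write contents to free space and return the new address."""
    addr = next_free[0]
    next_free[0] += 1
    storage[addr] = contents
    return addr

# Initial state: the single-sector pointer refers to version 0.
pointer = write_new_block("metadata v0")

# CoW update: first write the new version elsewhere...
new_addr = write_new_block("metadata v1")
# ...then flip the pointer in one atomic step. A crash before this
# line leaves the pointer at the valid old version.
pointer = new_addr

print(storage[pointer])  # metadata v1
# Both versions are still on "disk"; only reachability changed.
```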

B-Trees

A B-Tree is, generally speaking, a method of arranging data so that you can search through it quickly. Let's look at a simplified example:

B-Tree example.

Figure 1. A simplified B-Tree.

At the bottom is the data. One level up, there are direct pointers that point to the data; one more level up are indirect pointers to the direct pointers; finally, at the top, the tree root (circled) points to the indirect pointers.

Letters A through D stand for data blocks. The first number in each box indicates the level in the tree (direct pointers are level 0, and the root is level 2). The second number (after the dash) indicates the version of the filesystem, which will be useful later.

Wandering trees

If you apply copy-on-write to a B-tree, the B-tree starts to wander and thus becomes a wandering tree.

Let's take the same example from Figure 1 as the initial state of the filesystem with four data blocks, A-0 through D-0. The -0 indicates the version of the particular block of data.

The initial state of the metadata B-tree.

Figure 2. The initial state of the filesystem B-Tree (version 0).

Note that squares are multi-sector metadata items, while the circle at the tree's root is single-sector. Only the tree root can be updated atomically.

Now, we need to update block D-0. So we write a new version, D-1, into free space. This change makes the pointer 0-0 invalid, and it needs to be updated. So, we write a new pointer, 0-1. The change at level 0, in turn, makes the pointer 1-0 invalid, so we write a new version of it, 1-1. The change at level 1, in turn, invalidates the tree's root, so we write a new root, 2-1. Since the root can be written atomically, the entire operation is atomic. The old version (version 0) of the B-Tree remains valid throughout and is replaced by the new valid version (version 1) in a single atomic operation.

This results in the following arrangement:

Version 1 of the metadata B-tree.

Figure 3. Version 1 of the filesystem B-Tree. Lighter shade denotes unreachable objects.

After the filesystem updates the root node, it can reuse the space occupied by the now-unreachable blocks 2-0, 1-0, 0-0, and D-0. Let's pretend the free space is infinite and the overwrite never happens.
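The leaf-to-root copying walk described above can be sketched in a few lines of Python. The tree representation is a hypothetical stand-in for the figures, not any real filesystem's on-disk format; note how untouched subtrees are shared between versions rather than copied:

```python
# Sketch of the wandering-tree update from Figures 2-3: changing one
# data block forces a new copy of every node on the path to the root.

def leaf(name):
    return {"name": name, "children": []}

def node(name, children):
    return {"name": name, "children": children}

# Version 0 of the tree from Figure 2 (one indirect node, for brevity).
v0 = node("2-0", [node("1-0", [node("0-0",
        [leaf("A-0"), leaf("B-0"), leaf("C-0"), leaf("D-0")])])])

def cow_replace_leaf(n, old, new, version):
    """Copy-on-write update: replace leaf `old` with `new`, copying
    only the nodes on the path back to the root."""
    if n["name"] == old:
        return leaf(new)
    children = [cow_replace_leaf(c, old, new, version) for c in n["children"]]
    if all(c is orig for c, orig in zip(children, n["children"])):
        return n                          # untouched subtree: share it
    level = n["name"].split("-")[0]       # keep the level, bump the version
    return node(f"{level}-{version}", children)

v1 = cow_replace_leaf(v0, "D-0", "D-1", 1)
print(v1["name"])   # 2-1 -- the new root
print(v0["name"])   # 2-0 -- the old version is still fully intact
```

Blocks A through C are the same objects in both versions, which mirrors how real wandering trees share most blocks across versions.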

In a similar manner, changing block A-0 to A-2 (we can't use A-1 because version 1 was already spent modifying block D) results in the following arrangement:

Version 2 of the metadata B-tree.

Figure 4. Version 2 of the filesystem B-Tree. Lighter shade again denotes unreachable objects.

You can see how the tree sort of wanders over the disk space - this is where the name comes from.

A wandering tree.

Figure 5. Some useless AI-generated art.

Impact on data recovery

The short version is that wandering trees make recovery more complex.

There is one positive side. In the hypothetical case where disk space is infinite and previous versions of data never need to be overwritten, we can recover every previous version of the data. In practice, space is always limited, so this does not happen. Nonetheless, many previous versions of the data are often recoverable.

Then, there is the negative side. Let's take a relatively small ZFS pool. Greatly simplifying, such a pool may have 100,000 to 500,000 versions of its metadata, consisting of about 100,000 data blocks. So, sorting 100K data blocks into 500K versions naively takes about 50 billion operations. Obviously, there aren't 50 billion blocks stored on disk - most blocks are shared across many versions of the metadata (see Figure 4, blocks B and C). Still, the computational effort required to sort out what's what is very expensive.
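To make the arithmetic concrete, here is the naive cost implied by the figures above. This is a rough upper bound on a brute-force approach, not a description of any real recovery tool:

```python
# Worked numbers from the ZFS example: a naive recovery pass tests
# every recovered block against every candidate tree version.

blocks = 100_000       # approximate data blocks in the pool
versions = 500_000     # upper estimate of metadata versions

naive_operations = blocks * versions
print(f"{naive_operations:,} operations")  # 50,000,000,000 operations
```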

This situation is further complicated because ZFS overwrites many older blocks as it reuses free space, leaving many parts of the older trees disconnected. Older tree versions then can't be fully traversed, which requires even more expensive workarounds.

The overall impact of wandering tree filesystem layouts on data recovery is that the recovery becomes more complex and takes more time. The effect applies to many filesystems, most notably to ZFS, BTRFS, and APFS.

Created Sunday, June 16, 2024