Simplified theory of copy-on-write filesystems
WARNING: for illustration only; not a technical reference.
This simplified model does not map well onto real implementations, which are completely different.
The only reason I decided to put this explanation out there is that I think it is easy to understand.
The model applies to all the CoW filesystems, including ZFS, BTRFS, ReFS, F2FS, and others.
The principal difference between a CoW filesystem and a traditional one is that a CoW filesystem never overwrites data in place.
If you change a file, a traditional filesystem will change the blocks already belonging to that file.
A CoW filesystem will
- allocate new space for the changed blocks,
- write the changed blocks,
- change the references accordingly, and, finally,
- mark the original blocks as free.
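The four steps above can be sketched in a few lines of Python. This is a toy model only: the class `CowStore` and its methods are hypothetical names invented for this illustration, a "file" is just a list of block numbers, and nothing here corresponds to the on-disk structures of any real filesystem.

```python
class CowStore:
    """Toy block store: a file is a list of block numbers (references)."""

    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks    # block contents
        self.free = set(range(n_blocks))   # block numbers available for allocation
        self.files = {}                    # file name -> list of block numbers

    def _allocate(self):
        # Any free block will do; raises KeyError when the store is full.
        return self.free.pop()

    def create(self, name, chunks):
        refs = []
        for chunk in chunks:
            block = self._allocate()
            self.blocks[block] = chunk
            refs.append(block)
        self.files[name] = refs

    def cow_write(self, name, index, data):
        old = self.files[name][index]
        new = self._allocate()            # 1. allocate new space
        self.blocks[new] = data           # 2. write the changed block
        self.files[name][index] = new     # 3. change the reference
        self.free.add(old)                # 4. mark the original block as free
        return old, new

    def read(self, name):
        return [self.blocks[b] for b in self.files[name]]


store = CowStore(8)
store.create("a.txt", ["hello", "world"])
old, new = store.cow_write("a.txt", 1, "there")
print(store.read("a.txt"))   # the file now sees the new block
print(store.blocks[old])     # the old data is still physically present
```

Note that step 4 only adds the old block number to the free set. The old contents stay where they were until something else is written over them.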
Let's say you have a disk with four slots for data on it and a single dataset (collection of files and directories) that starts at Version 0.
Each slot holds exactly one version of the dataset.
I know this could have been more realistic, but bear with me.
So the initial disk state is as follows:
Slot | Content | Marked
1 | Version 0 | Active
2 | Blank | Empty
3 | Blank | Empty
4 | Blank | Empty
Now you change something in the dataset.
Instead of modifying the existing dataset, the CoW filesystem writes the changed copy (Version 1) into one of the free slots and marks the old slot as empty, even though Version 0 is still physically there.
So we get
Slot | Content | Marked
1 | Version 0 | Empty
2 | Version 1 | Active
3 | Blank | Empty
4 | Blank | Empty
And then you make another change, thus producing Version 2, and we arrive at the following:
Slot | Content | Marked
1 | Version 0 | Empty
2 | Version 1 | Empty
3 | Version 2 | Active
4 | Blank | Empty
The slots marked Empty will eventually be reused.
However, now you want a snapshot of Version 2.
The system responds by adding a mark to the slot; no data is copied:
Slot | Content | Marked
1 | Version 0 | Empty
2 | Version 1 | Empty
3 | Version 2 | Active, Snapshot of Version 2
4 | Blank | Empty
Then the next change, producing Version 3, will result in
Slot | Content | Marked
1 | Version 0 | Empty
2 | Version 1 | Empty
3 | Version 2 | Snapshot of Version 2
4 | Version 3 | Active
Now the next change, producing Version 4, will overwrite one of the empty slots, chosen arbitrarily (here, slot 2). The slot held by the snapshot is never a candidate.
Slot | Content | Marked
1 | Version 0 | Empty
2 | Version 4 | Active
3 | Version 2 | Snapshot of Version 2
4 | Version 3 | Empty
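The whole walkthrough can be replayed with a small simulation. This is only a sketch of the slot model described above; `SlotDisk` and its methods are hypothetical names made up for the illustration. Because the target slot is picked at random among the empty ones, the resulting layout may differ from the tables, but the snapshot slot always survives.

```python
import random


class SlotDisk:
    """Toy disk from the walkthrough: each slot holds one dataset version."""

    def __init__(self, n_slots=4, seed=None):
        self.content = ["Version 0"] + ["Blank"] * (n_slots - 1)
        # An empty set of flags means the slot is marked Empty.
        self.flags = [{"Active"}] + [set() for _ in range(n_slots - 1)]
        self.version = 0
        self.rng = random.Random(seed)

    def _active(self):
        return next(i for i, f in enumerate(self.flags) if "Active" in f)

    def change(self):
        """Write the next version into a randomly chosen empty slot."""
        empty = [i for i, f in enumerate(self.flags) if not f]
        if not empty:
            raise RuntimeError("no empty slot: all are active or snapshots")
        current = self._active()
        target = self.rng.choice(empty)
        self.version += 1
        self.content[target] = f"Version {self.version}"
        self.flags[target] = {"Active"}
        self.flags[current].discard("Active")  # old slot keeps its data
        return target

    def snapshot(self):
        """Mark the active slot as a snapshot; no data is copied."""
        self.flags[self._active()].add("Snapshot")

    def table(self):
        return [
            (i + 1, c, ", ".join(sorted(f)) or "Empty")
            for i, (c, f) in enumerate(zip(self.content, self.flags))
        ]


disk = SlotDisk(seed=1)
disk.change()     # Version 1
disk.change()     # Version 2
disk.snapshot()   # snapshot of Version 2
disk.change()     # Version 3
disk.change()     # Version 4
for row in disk.table():
    print(*row, sep=" | ")
```

A slot is eligible for overwriting only when it carries no flags at all, which is why the slot holding Version 2 is left alone once the snapshot is taken.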
Difference between theory and practice
This was a very simplified theory.
In practice, there are no slots, and multiple versions share common data (only changes are written),
not to mention a myriad of other technicalities, but the general idea holds.
The system will overwrite any place declared free and will not overwrite any place assigned to an active dataset or a snapshot.
Also, there is no circular rule to overwrite the oldest data first.
The system does not track the age of free blocks.
All free blocks go into a single uniform pool.
The filesystem then draws blocks from the free pool without regard for what it overwrites.
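To make the points about shared data and the free pool more concrete, here is a rough sketch, again a toy and not how any real filesystem lays out its metadata. `SharedBlockStore`, its reference counts, and its method names are assumptions made for this illustration. A version is a list of block numbers, a snapshot is a second list pointing at the same blocks, and the free pool is a plain set with no notion of age.

```python
class SharedBlockStore:
    """Toy model: versions are lists of block numbers that may share blocks."""

    def __init__(self, n_blocks):
        self.data = {}                     # block number -> contents
        self.free = set(range(n_blocks))   # uniform pool: no ages, no order
        self.refcount = {}                 # block number -> how many versions use it

    def _put(self, chunk):
        block = self.free.pop()            # arbitrary free block
        self.data[block] = chunk           # overwrites whatever was there before
        self.refcount[block] = 1
        return block

    def _release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.add(block)           # stale contents remain until reused

    def new_dataset(self, chunks):
        return [self._put(c) for c in chunks]

    def snapshot(self, version):
        # A snapshot copies references only, never the data itself.
        for block in version:
            self.refcount[block] += 1
        return list(version)

    def modify(self, version, index, chunk):
        # Only the changed block is written; the others stay shared.
        old = version[index]
        version[index] = self._put(chunk)
        self._release(old)

    def read(self, version):
        return [self.data[b] for b in version]


store = SharedBlockStore(8)
live = store.new_dataset(["a", "b", "c"])
snap = store.snapshot(live)
store.modify(live, 1, "B")
print(store.read(live))   # current state of the dataset
print(store.read(snap))   # state at the time of the snapshot
```

After the modification, the live dataset and the snapshot still share two of their three blocks. The block that differs stays out of the free pool for as long as the snapshot references it.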
Klennet ZFS Recovery and snapshots
Klennet ZFS Recovery is designed to look through the Content column, mostly ignoring the Marked column.
During the scan, it identifies "Version X" and sorts through all intact versions.
Technically, working out which snapshot belongs to which dataset on a damaged filesystem is a tricky task.
It is often impossible due to damage or relevant metadata sections being overwritten over time.
So, ZFS Recovery is specifically designed not to care.
If there is a snapshot name, good, it will read the name and try to associate it with data;
but it is not required for recovery.
As a side effect, you may have to look through many unnamed datasets to sort out what's what.
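In terms of the toy slot model, the idea looks roughly like the sketch below. This is emphatically not the actual algorithm used by Klennet ZFS Recovery; the function `scan` and the slot list are hypothetical and only show what "reading the Content column and ignoring the Marked column" means in the simplified theory.

```python
import re

# Final disk state from the walkthrough: (content, marked) per slot.
slots = [
    ("Version 0", "Empty"),
    ("Version 4", "Active"),
    ("Version 2", "Snapshot"),
    ("Version 3", "Empty"),
]


def scan(slots):
    """Return (version, slot number) pairs found by content alone."""
    found = []
    for number, (content, _marked) in enumerate(slots, start=1):
        match = re.fullmatch(r"Version (\d+)", content)
        if match:  # blank or unrecognizable slots are skipped
            found.append((int(match.group(1)), number))
    return sorted(found)


print(scan(slots))  # [(0, 1), (2, 3), (3, 4), (4, 2)]
```

All four versions turn up, including the two sitting in slots marked Empty, and nothing in the result says which one is the snapshot. That is the "many unnamed datasets" effect in miniature.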
Filed under: ZFS.
Created Wednesday, April 19, 2023