"Linux Gazette...making Linux just a little more fun!"

Journalling Filesystems for Linux

By Matteo Dell'Omodarme

Introduction

A filesystem is the software used to organize and manage the data stored on disk drives; it ensures the integrity of the data, so that data written to disk is identical when it is read back. In addition to storing the data contained in files, a filesystem also stores and manages important information about the files and about the filesystem itself (e.g. date and time stamps, ownership, access permissions, the file's size and the storage location or locations on disk). This information is commonly referred to as metadata.

Since a filesystem tries to work as asynchronously as possible, in order to avoid a hard-disk bottleneck, a sudden interruption of its work could result in a loss of data. As an example, let's consider the following scenario: what happens if your machine crashes while you are working on a document residing on a standard Linux ext2 filesystem?
There are several possible answers:

The machine crashes after you saved the file. This is the best scenario: you haven't lost anything. Just reboot the machine and continue working on the document.
The machine crashes before you saved the file. You have lost all your changes but your old version is still intact.
The machine crashes at the exact moment when the file is being written. This is the worst case: the new version of the file is physically overwriting the old version. You end up with a file that is partially new and partially old. If the file was written in a binary form you may not be able to reopen it, because the internal format of its data is inconsistent with what the application expects.

In this last scenario things can be even worse if the drive was writing the metadata areas, such as the directory itself. Now instead of one corrupted file, you have one corrupted filesystem and you can lose an entire directory or all the data on an entire disk partition.

The standard Linux filesystem (ext2fs) attempts to prevent and recover from metadata corruption by performing an extensive filesystem analysis (fsck) during bootup. Since ext2fs incorporates redundant copies of critical metadata, it is extremely unlikely for that data to be completely lost. The system figures out where the corrupt metadata is, and then either repairs the damage by copying from the redundant version or simply deletes the file or files whose metadata is affected.

Obviously, the larger the filesystem, the longer the check takes. On a partition of several gigabytes it may take a great deal of time to check the metadata during bootup.
As Linux begins to take on more complex applications, on larger servers, and with less tolerance for downtime, there is a need for more sophisticated filesystems that do an even better job of protecting data and metadata.

The journalling filesystems available for Linux are the answer to this need.

What is a journalling filesystem?

This section gives only a general introduction to journalling. For more specific and technical notes please see Juan I. Santos Florido's article in Linux Gazette 55. Other information can be obtained from freshmeat.net/articles/view/212/.

Most modern filesystems use journalling techniques borrowed from the database world to improve crash recovery. Disk transactions are written sequentially to an area of disk called the journal or log before being written to their final locations within the filesystem.
Implementations vary in terms of what data is written to the log. Some implementations write only the filesystem metadata, while others record all writes to the journal.

Now, if a crash happens before the journal entry is committed, then the original data is still on the disk and you lose only your new changes. If the crash happens during the actual disk update (i.e. after the journal entry was committed), the journal entry shows what was supposed to have happened. So when the system reboots, it can simply replay the journal entries and complete the update that was interrupted.

In either case, you have valid data and not a trashed partition. And since the recovery time associated with this log-based approach is much shorter, the system is back online in a few seconds.
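The two crash cases above can be played out with a toy model. The sketch below is only an illustration in plain shell, not how any real filesystem stores its journal: the directory layout and the function names (journal_write, journal_commit, journal_replay) are hypothetical.

```shell
#!/bin/sh
# Toy write-ahead journal: every name and path here is hypothetical.
JOURNAL_DIR=$(mktemp -d)   # stands in for the journal area
DATA_DIR=$(mktemp -d)      # stands in for the final on-disk locations

# Record the intended new content of a file in the journal.
journal_write() {          # usage: journal_write <name> <content>
    printf '%s' "$2" > "$JOURNAL_DIR/$1.entry"
}

# Mark a journal entry as committed.
journal_commit() {         # usage: journal_commit <name>
    : > "$JOURNAL_DIR/$1.committed"
}

# Recovery at boot: replay committed entries, discard uncommitted ones.
journal_replay() {
    for entry in "$JOURNAL_DIR"/*.entry; do
        [ -e "$entry" ] || continue
        name=$(basename "$entry" .entry)
        if [ -e "$JOURNAL_DIR/$name.committed" ]; then
            cp "$entry" "$DATA_DIR/$name"      # complete the interrupted update
            rm -f "$JOURNAL_DIR/$name.committed"
        fi
        rm -f "$entry"
    done
}

printf 'old' > "$DATA_DIR/a"
printf 'old' > "$DATA_DIR/b"
journal_write a new; journal_commit a   # "crash" after the commit
journal_write b new                     # "crash" before the commit
journal_replay
echo "a=$(cat "$DATA_DIR/a") b=$(cat "$DATA_DIR/b")"   # prints "a=new b=old"
```

File a ends up with the new content because its entry was committed; file b keeps the old content because its entry never was.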

It is also important to note that using a journalling filesystem does not make filesystem checking programs (fsck) entirely obsolete. Hardware and software errors that corrupt random blocks in the filesystem are not generally recoverable with the transaction log.

Available journalling filesystems

In the rest of this article I consider three journalling filesystems.

The first one is ext3. Developed by Stephen Tweedie, a leading Linux kernel developer, ext3 adds journalling into ext2. It is available in alpha form at ftp.linux.org.uk/pub/linux/sct/fs/jfs/.

Namesys has a journalling filesystem under development called ReiserFS. It is available at www.namesys.com.

SGI released version 1.0 of its XFS filesystem for Linux on May 1, 2001. You can find it at oss.sgi.com/projects/xfs/.

In this article these three solutions are tested and benchmarked using two different programs.

Installing ext3

For technical notes about the ext3 filesystem please refer to Dr. Stephen Tweedie's paper and to his talk.

The ext3 filesystem is directly derived from its ancestor, ext2. It has the valuable characteristic of being fully backward compatible with ext2, since it is just an ext2 filesystem with journalling. The obvious drawback is that ext3 doesn't implement any of the modern filesystem features which increase data manipulation speed and packing.

ext3 comes as a patch against the 2.2.19 kernel, so first of all get a linux-2.2.19 kernel from ftp.kernel.org or from one of its mirrors. The patch is available at ftp.linux.org.uk/pub/linux/sct/fs/jfs or ftp.kernel.org/pub/linux/kernel/people/sct/ext3, or from a mirror of these sites.
From one of these sites you need to get the following files:

ext3-0.0.7a.tar.bz2: the kernel patch.
e2fsprogs-1.21-WIP-0601.tar.bz2: the e2fsprogs suite with ext3 support.

Copy the Linux kernel linux-2.2.19.tar.bz2 and the ext3-0.0.7a.tar.bz2 files to the /usr/src directory and extract them:

mv linux linux-old
tar -Ixvf linux-2.2.19.tar.bz2
tar -Ixvf ext3-0.0.7a.tar.bz2
cd linux
cat ../ext3-0.0.7a/linux-2.2.19.kdb.diff | patch -sp1
cat ../ext3-0.0.7a/linux-2.2.19.ext3.diff | patch -sp1

The first diff is a copy of SGI's kdb kernel debugger patches. The second one is the ext3 filesystem itself.
Now configure the kernel, saying YES to "Enable Second extended fs development code" in the filesystem section, and build it.

After the kernel is compiled and installed you should make and install the e2fsprogs:

tar -Ixvf e2fsprogs-1.21-WIP-0601.tar.bz2
cd e2fsprogs-1.21
./configure
make
make check
make install

That's all. The next step is to make an ext3 filesystem in a partition. Reboot with the new kernel. Now you have two options: make a new journalling filesystem or journal an existing one.

Making a new ext3 filesystem. Just use the mke2fs from the installed e2fsprogs, and use the "-j" option when running mke2fs:
```
mke2fs -j /dev/xxx
```
where /dev/xxx is the device on which you want to create the ext3 filesystem. The "-j" flag tells mke2fs to create an ext3 filesystem with a hidden journal. You can control the size of the journal using the optional flag -Jsize=<n> (n is the preferred size of the journal in MB).
Upgrading an existing ext2 filesystem to ext3. Just use tune2fs:
```
tune2fs -j /dev/xxx
```
You can do this on either a mounted or an unmounted filesystem. If the filesystem is mounted, a file .journal is created in the top-level directory of the filesystem; if it is unmounted, a hidden system inode is used for the journal. Either way, all the data in the filesystem is preserved.
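After either step it can be useful to confirm that the journal is really there. A minimal sketch, assuming an e2fsprogs whose `tune2fs -l` output contains a "Filesystem features:" line listing has_journal; the helper name and the sample line are made up for illustration:

```shell
#!/bin/sh
# has_journal is a hypothetical helper: it succeeds if the `tune2fs -l`
# output read from stdin lists the has_journal feature.
has_journal() {
    grep -q '^Filesystem features:.*has_journal'
}

# Real use (needs root and an existing filesystem on /dev/xxx):
#   tune2fs -l /dev/xxx | has_journal && echo "journal present"

# Self-contained check against an illustrative line of sample output:
printf 'Filesystem features:      has_journal filetype sparse_super\n' |
    has_journal && echo "journal present"
```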

You can mount the ext3 filesystem using the command:

mount -t ext3 /dev/xxx /mount_dir

Since ext3 is basically ext2 with journalling, a cleanly unmounted ext3 filesystem could be remounted as ext2 without any other commands.
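Because the same partition can be mounted either way, it helps to check which type is actually in use. The sketch below reads lines in /proc/mounts format (device, mount point, type, options); the helper name fs_type_of and the sample lines are assumptions made for illustration.

```shell
#!/bin/sh
# fs_type_of is a hypothetical helper: print the filesystem type recorded
# for a mount point, reading /proc/mounts-style lines from stdin.
fs_type_of() {             # usage: fs_type_of <mount_dir> < /proc/mounts
    awk -v dir="$1" '$2 == dir { type = $3 } END { if (type != "") print type }'
}

# Real use:  fs_type_of /mount_dir < /proc/mounts
# Self-contained demonstration with sample lines:
printf '%s\n' '/dev/hda1 / ext2 rw 0 0' '/dev/hda3 /work1 ext3 rw 0 0' |
    fs_type_of /work1      # prints "ext3"
```

If several lines mention the same mount point, the last one wins, since that is the mount currently visible there.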

Installing XFS

For a technical overview of the XFS filesystem refer to the SGI Linux XFS page and to the SGI publications page.
Also see the FAQ page.

XFS is a journalling filesystem for Linux available from SGI. It is a mature technology that has been proven on IRIX systems as the default filesystem for all SGI customers. XFS is licensed under GPL.
XFS Linux 1.0 is released for the Linux 2.4 kernel; I tried the patch against 2.4.2. So the first step is to get a linux-2.4.2 kernel from a mirror of kernel.org.
The patches are at oss.sgi.com/projects/xfs/download/Release-1.0/patches. From this directory download:

Copy the Linux kernel linux-2.4.2.tar.bz2 to the /usr/src directory, rename the existing linux directory to linux-old and extract the new kernel:

mv linux linux-old
tar -Ixf linux-2.4.2.tar.bz2

Copy each patch to the top directory of your Linux source tree (i.e. /usr/src/linux) and apply it:

zcat patchfile.gz | patch -p1

Then configure the kernel, enabling the options "XFS filesystem support" (CONFIG_XFS_FS) and "Page Buffer support" (CONFIG_PAGE_BUF) in the filesystem section. Note that you will also need to upgrade the following system utilities to these versions or later:

Install the new kernel and reboot.
Now download the xfsprogs tools. This tarball contains a set of commands for working with the XFS filesystem, such as mkfs.xfs. To build them:

tar -zxf  xfsprogs-1.2.0.src.tar.gz
cd xfsprogs-1.2.0
make configure 
make 
make install

After installing this set of commands you can create a new XFS filesystem with the command:

mkfs -t xfs /dev/xxx

One important option that you may need is "-f", which forces the creation of a new filesystem even if a filesystem already exists on that partition. Note that this will destroy all data currently on that partition:

mkfs -t xfs -f /dev/xxx

You can then mount the new filesystem with the command:

mount -t xfs /dev/xxx /mount_dir

Installing ReiserFS

For technical notes about ReiserFS refer to the NAMESYS home page and to the FAQ page.

ReiserFS has been in the official Linux kernel since 2.4.1-pre4. You always need to get the utils (e.g. mkreiserfs to create ReiserFS on an empty partition, the resizer, etc.).
The up-to-date ReiserFS version is available as a patch against both 2.2.x and 2.4.x kernels. I tested the patch against the 2.2.19 Linux kernel.

The first step, as usual, is to get a standard linux-2.2.19.tar.bz2 kernel from a mirror of kernel.org. Then get the reiserfs 2.2.19 patch. At the time of writing the latest patch is 3.5.33.
Please note that, if you choose the patch against a 2.4.x kernel, you should also get the utils tarball reiserfsprogs-3.x.0j.tar.gz.
Now unpack the kernel and the patch. Copy the tarballs to /usr/src and move the linux directory to linux-old; then run the commands:

tar -Ixf linux-2.2.19.tar.bz2
bzcat linux-2.2.19-reiserfs-3.5.33-patch.bz2 | patch -p0

Compile the Linux kernel, enabling reiserfs support in the filesystem section.
Then compile and install the reiserfs utils:

cd /usr/src/linux/fs/reiserfs/utils 
make
make install

Install the new kernel and reboot. Now you can create a new reiserfs filesystem with the command:

mkreiserfs /dev/xxx

and mount it:

mount -t reiserfs /dev/xxx /mount_dir

Filesystems benchmark

For the test I used a Pentium III with 16 MB of RAM and a 2 GB hard disk, running Red Hat Linux 6.2.
All the filesystems worked fine for me, so I started a little benchmark analysis to compare their performance. As a first test I simulated a crash by turning off the power, in order to check the journal recovery process. All the filesystems passed this phase successfully, and with each of them the machine was back online in a few seconds.

The next step is a benchmark analysis using the bonnie++ program, available at www.coker.com.au/bonnie++. The program tests database-type access to a single file, and it tests creation, reading, and deleting of small files, which can simulate the usage of programs such as Squid, INN, or Maildir-format programs (qmail).
The benchmark command was:

bonnie++ -d/work1 -s10 -r4 -u0

which executes the test using 10 MB (-s10) in the filesystem mounted on the /work1 directory. So, before launching the benchmark, you must create the requested filesystem on a partition and mount it on /work1. The other flags specify the amount of RAM in MB (-r4) and the user (-u0, i.e. run as root).
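The preparation for each run can be summarized as a small plan. The sketch below only prints the commands rather than running them, since formatting a partition destroys its data; the function name bench_plan is hypothetical, and the plain `mke2fs` call for ext2 and the final `umount` are my additions to the commands shown in this article.

```shell
#!/bin/sh
# bench_plan is a hypothetical helper: print (do not run) the commands
# needed to prepare /work1 and benchmark one filesystem.
bench_plan() {             # usage: bench_plan <ext2|ext3|xfs|reiserfs> <device>
    case "$1" in
        ext2)     echo "mke2fs $2" ;;
        ext3)     echo "mke2fs -j $2" ;;
        xfs)      echo "mkfs -t xfs -f $2" ;;
        reiserfs) echo "mkreiserfs $2" ;;
        *)        echo "unknown filesystem: $1" >&2; return 1 ;;
    esac
    echo "mount -t $1 $2 /work1"
    echo "bonnie++ -d/work1 -s10 -r4 -u0"
    echo "umount /work1"
}

bench_plan ext3 /dev/xxx
```

Review the printed commands and run them by hand as root, replacing /dev/xxx with a partition you can afford to erase.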

The results are shown in the following table.

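Sequential Output, Sequential Input and Random Seeks:

| Filesystem | Size | Output Per Char K/sec | % CPU | Output Block K/sec | % CPU | Output Rewrite K/sec | % CPU | Input Per Char K/sec | % CPU | Input Block K/sec | % CPU | Random Seeks /sec | % CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ext2 | 10M | 1471 | 97 | 14813 | 67 | 1309 | 14 | 1506 | 94 | 4889 | 15 | 309.8 | 10 |
| ext3 | 10M | 1366 | 98 | 2361 | 38 | 1824 | 22 | 1482 | 94 | 4935 | 14 | 317.8 | 10 |
| xfs | 10M | 1206 | 94 | 9512 | 77 | 1351 | 33 | 1299 | 98 | 4779 | 80 | 229.1 | 11 |
| reiserfs | 10M | 1455 | 99 | 4253 | 31 | 2340 | 26 | 1477 | 93 | 5593 | 26 | 174.3 | 5 |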

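Sequential Create and Random Create:

| Filesystem | Num Files | Seq. Create /sec | % CPU | Seq. Read /sec | % CPU | Seq. Delete /sec | % CPU | Random Create /sec | % CPU | Random Read /sec | % CPU | Random Delete /sec | % CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ext2 | 16 | 94 | 99 | 278 | 99 | 492 | 97 | 95 | 99 | 284 | 100 | 93 | 41 |
| ext3 | 16 | 89 | 98 | 274 | 100 | 458 | 96 | 93 | 99 | 288 | 99 | 97 | 45 |
| xfs | 16 | 92 | 99 | 251 | 96 | 436 | 98 | 91 | 99 | 311 | 99 | 90 | 41 |
| reiserfs | 16 | 1307 | 100 | 8963 | 100 | 1914 | 99 | 1245 | 99 | 9316 | 100 | 1725 | 100 |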

Two figures are shown for each test: the speed of the filesystem (in K/sec, or in operations per second for seeks and file operations) and the CPU usage (in %). The higher the speed, the better the filesystem. The opposite is true for the CPU usage.
As you can see, reiserFS scores a hands-down victory in managing files (sections Sequential Create and Random Create), beating its opponents by a factor higher than 10 in most of these tests (sequential delete is the exception, at roughly a factor of 4). In addition, it is almost as good as the other filesystems in the Sequential Output and Sequential Input tests. There isn't any significant difference among the other filesystems. XFS speed is similar to that of ext2, and ext3 is, as expected, a little slower than ext2 (it is basically the same thing, and it spends some extra time in the journalling calls).
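The size of the gap is easy to verify from the table. A small sketch, where ratio is a hypothetical helper name and the numbers are the reiserfs and ext2 rates from the Sequential Create section:

```shell
#!/bin/sh
# ratio is a hypothetical helper: print how many times larger $1 is than $2.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 1307 94     # Create: reiserfs vs ext2, prints 13.9
ratio 8963 278    # Read:   reiserfs vs ext2, prints 32.2
ratio 1914 492    # Delete: reiserfs vs ext2, prints 3.9
```

Create and read are well above a factor of 10, while the sequential delete advantage is closer to a factor of 4.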

As a last test I took the mongo benchmark program, available on the reiserFS benchmark page at www.namesys.com, and modified it in order to test the three journalling filesystems. I added to the mongo.pl Perl script the commands to format and mount the xfs and ext3 filesystems. Then I started a benchmark analysis.
The script formats the partition /dev/xxx, mounts it and runs the given number of processes during each phase: Create, Copy, Symlinks, Read, Stats, Rename and Delete. The program also calculates fragmentation after the Create and Copy phases:

Fragm = number_of_fragments / number_of_files

The results are stored in the results directory, in the following files:

log       - raw results
log.tbl   - results for compare program
log_table - results in table form

The tests were executed as in the following example:

mongo.pl ext3 /dev/hda3 /work1 logext3 1

where ext3 must be replaced by reiserfs or xfs in order to test the other filesystems. The other arguments are the device holding the filesystem to test, the mount directory, the name of the file where the results are stored and the number of processes to start.
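The three runs can be listed with a short loop. This sketch only prints the command lines; mongo_plan is a hypothetical helper, and the log names logxfs and logreiserfs are assumed by analogy with logext3.

```shell
#!/bin/sh
# mongo_plan is a hypothetical helper: print one mongo.pl command line per
# journalling filesystem for the given device and mount directory.
mongo_plan() {             # usage: mongo_plan <device> <mount_dir>
    for fs in ext3 xfs reiserfs; do
        echo "mongo.pl $fs $1 $2 log$fs 1"
    done
}

mongo_plan /dev/hda3 /work1
```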

The following tables show the results of this analysis. The data reported is time (in seconds): the lower the value, the better the filesystem. In the first table the median size of the files managed is 100 bytes, in the second one it is 1000 bytes and in the last one 10000 bytes.

| Phase | ext3 (files=68952, size=100 bytes, dirs=242) | XFS (files=68952, size=100 bytes, dirs=241) | reiserFS (files=68952, size=100 bytes, dirs=241) |
|---|---|---|---|
| Create | 90.07 | 267.86 | 53.05 |
| Fragm. (after Create) | 1.32 | 1.02 | 1.00 |
| Copy | 239.02 | 744.51 | 126.97 |
| Fragm. (after Copy) | 1.32 | 1.03 | 1.80 |
| Slinks | 0 | 203.54 | 105.71 |
| Read | 782.75 | 1543.93 | 562.53 |
| Stats | 108.65 | 262.25 | 225.32 |
| Rename | 67.26 | 205.18 | 70.72 |
| Delete | 23.80 | 389.79 | 85.51 |

| Phase | ext3 (files=11248, size=1000 bytes, dirs=44) | XFS (files=11616, size=1000 bytes, dirs=43) | ReiserFS (files=11616, size=1000 bytes, dirs=43) |
|---|---|---|---|
| Create | 30.68 | 57.94 | 36.38 |
| Fragm. (after Create) | 1.38 | 1.01 | 1.03 |
| Copy | 75.21 | 149.49 | 84.02 |
| Fragm. (after Copy) | 1.38 | 1.01 | 1.43 |
| Slinks | 16.68 | 29.59 | 19.29 |
| Read | 225.74 | 348.99 | 409.45 |
| Stats | 25.60 | 46.41 | 89.23 |
| Rename | 16.11 | 33.57 | 20.69 |
| Delete | 6.04 | 64.90 | 18.21 |

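| Phase | ext3 (files=2274, size=10000 bytes, dirs=32) | XFS (files=2292, size=10000 bytes, dirs=31) | reiserFS (files=2292, size=10000 bytes, dirs=31) |
|---|---|---|---|
| Create | 27.13 | 25.99 | 22.27 |
| Fragm. (after Create) | 1.44 | 1.02 | 1.05 |
| Copy | 55.27 | 55.73 | 43.24 |
| Fragm. (after Copy) | 1.44 | 1.02 | 1.12 |
| Slinks | 1.33 | 2.51 | 1.43 |
| Read | 40.51 | 50.20 | 56.34 |
| Stats | 2.34 | 1.99 | 3.52 |
| Rename | 0.99 | 1.10 | 1.25 |
| Delete | 3.40 | 8.99 | 1.84 |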

From these tables you can see that ext3 is usually faster in Stats, Delete and Rename, while reiserFS wins in Create and Copy. Also note that the performance of reiserFS is better in the first case (small files), as expected from its technical documentation.
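Picking the fastest filesystem per phase can be automated. A minimal sketch, assuming input rows of the form "phase ext3 XFS reiserFS" with times in seconds; the helper name fastest is hypothetical, and the sample rows are taken from the 100-byte table:

```shell
#!/bin/sh
# fastest is a hypothetical helper: for each input row
# "<phase> <ext3> <XFS> <reiserFS>", print the phase and the filesystem
# with the lowest time.
fastest() {
    awk '{
        best = "ext3"; min = $2 + 0
        if ($3 + 0 < min) { best = "XFS";      min = $3 + 0 }
        if ($4 + 0 < min) { best = "reiserFS"; min = $4 + 0 }
        print $1, best
    }'
}

printf '%s\n' \
    'Create 90.07 267.86 53.05' \
    'Copy 239.02 744.51 126.97' \
    'Stats 108.65 262.25 225.32' \
    'Rename 67.26 205.18 70.72' \
    'Delete 23.80 389.79 85.51' | fastest
```

For these rows it reports reiserFS for Create and Copy, and ext3 for Stats, Rename and Delete.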

Conclusions

At present there are at least two robust and reliable journalling filesystems for Linux (i.e. XFS and reiserFS) which can be used without fear.
ext3 is still an alpha release and may still fail in various ways. I had some problems using bonnie++ on this filesystem: the system reported some VM errors and killed the shell I was using.

Considering the benchmark results, my advice is to install a reiserFS filesystem in the future (I'll surely do it).

Matteo Dell'Omodarme

I'm a student at the University of Pisa and a Linux user since 1994. I now work on the administration of Linux boxes at the Astronomy section of the Department of Physics, with a special interest in security. My primary email address is matt@martine2.difi.unipi.it.
