ZFS Record Size

Marsell Kukuljevic of Joyent wrote me to say (paraphrasing):

"I thought ZFS record size is variable: by default it's 128K, but write 2KB of data (assuming nothing else writes), then only 2KB writes to disk (excluding metadata). What does record size actually enforce?

I assume this is affected by transaction groups, so if I write 6 2K files, it'll write a 12K record, but if I write 6 32K files, it'll write two records: 128K and 64K. That causes a problem with read and write amplification in future writes though, so I'm not sure if such behaviour makes sense. Maybe recordsize only affects writes within a file?

I'm asking this in the context of one of the recommendations in the evil tuning guide: use a recordsize of 8K to match Postgres' buffer size. Fair enough, I presume this means that records written to disk are then always at most 8KB (ignoring any headers and footers), but how does compression factor into this?

I've noticed that Postgres compresses quite well. With LZJB it still gets ~3x. Assuming a recordsize of 8K, then it'd be about 3KB written to disk for that record (again, excluding all the metadata), right?"

The recordsize parameter enforces the size of the largest block written to a ZFS file system or volume. There is an excellent blog post about the ZFS recordsize by Roch Bourbonnais. Note that ZFS does not always read/write recordsize bytes. For instance, a write of 2K to a file will typically result in at least one 2KB write (and maybe more than one, for metadata). The recordsize is the largest block that ZFS will read/write. The interested reader can verify this by using DTrace on bdev_strategy(); that is left as an exercise. Also note that because of the way ZFS maintains information about allocated/free space on disk (i.e., spacemaps), a smaller recordsize should not result in more space or time being used to maintain that information.
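If you want to try that exercise, a one-liner along the following lines should do it (a sketch, assuming an illumos/SmartOS kernel on which the fbt provider can instrument bdev_strategy()). It aggregates the size of every buffer handed to the block device driver, reads as well as writes, so you can watch the I/O sizes ZFS actually issues while you create files:

# dtrace -n 'fbt::bdev_strategy:entry { @[args[0]->b_bcount] = count(); }'

Run it in one terminal while writing files in another.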

Instead of repeating the blog post, let's do some experimenting.

To make things easy (i.e., we don't want to sift through tens of thousands of lines of zdb(1M) output), we'll create a small pool and work with that. I'm assuming you are on a system that supports ZFS and has zdb. SmartOS would be an excellent choice...

# mkfile 100m /var/tmp/poolfile
# zpool create testpool /var/tmp/poolfile
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      default
testpool  compression  off       default
#

An alternative to using files (/var/tmp/poolfile) is to create a child dataset using the zfs command, and run zdb on the child dataset. This also cuts down on the amount of data displayed by zdb.
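For example (a sketch; the dataset name and test file are arbitrary):

# zfs create testpool/child
# dd if=/dev/zero of=/testpool/child/foo bs=2k count=1
# sync
# zdb -dddddddd testpool/child

zdb accepts a dataset name as well as a pool name, so only the objects in that dataset are dumped.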

We'll start with the simplest case:

# dd if=/dev/zero of=/testpool/foo bs=128k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        21    1    16K   128K   128K   128K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 03:50:24 2013
        mtime   Thu Mar 21 03:50:24 2013
        ctime   Thu Mar 21 03:50:24 2013
        crtime  Thu Mar 21 03:50:24 2013
        gen     2462
        mode    100644
        size    131072
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:1b4800:20000 20000L/20000P F=1 B=2462/2462

                segment [0000000000000000, 0000000000020000) size  128K
...
#

From the above output, we can see that the "foo" file has one block. It is on vdev 0 (the only vdev in the pool), at offset 0x1b4800 (relative to the 4MB label at the beginning of every disk), and the size is 0x20000 (=128K). Note that if you're following along and don't see the "/foo" file in your output, run sync, or wait a few seconds. Generally, it can take up to 5 seconds before the data is on disk. This implies that zdb reads from disk, bypassing the ARC (which is what you want for a file system debugger).

Now let's do the same for a 2KB file.

# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=2k count=1
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        22    1    16K     2K     2K     2K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:21:25 2013
        mtime   Thu Mar 21 04:21:25 2013
        ctime   Thu Mar 21 04:21:25 2013
        crtime  Thu Mar 21 04:21:25 2013
        gen     2839
        mode    100644
        size    2048
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:180000:800 800L/800P F=1 B=2839/2839

                segment [0000000000000000, 0000000000000800) size    2K
...
#

So, as Marsell notes, the block size is variable. Here, the foo file is at offset 0x180000 and the size is 0x800 (=2K). What if we use a block size larger than 128KB with dd?

# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=256k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        23    2    16K   128K   258K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:23:32 2013
        mtime   Thu Mar 21 04:23:32 2013
        ctime   Thu Mar 21 04:23:32 2013
        crtime  Thu Mar 21 04:23:32 2013
        gen     2868
        mode    100644
        size    262144
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:1f0c00:400 0:12b5a00:400 4000L/400P F=2 B=2868/2868
               0  L0 0:1b3800:20000 20000L/20000P F=1 B=2868/2868
           20000  L0 0:180000:20000 20000L/20000P F=1 B=2868/2868

                segment [0000000000000000, 0000000000040000) size  256K
...
#

This time, the file has 2 blocks, each 128KB in size. Because the data does not fit into 1 block, there is 1 indirect block (a block containing block pointers) at 0x1f0c00, and it is 0x400 (1KB) on disk. The indirect block is compressed; decompressed, it is 0x4000 bytes (=16KB). The "4000L/400P" refers to the logical size (4000L) and the physical size (400P): logical is the size after decompression, physical is the size as compressed on disk. Note that the compression property does not affect indirect blocks; they are compressed even though compression is off on this pool. Other metadata is always compressed as well (always lzjb??).

Now we'll try creating 6 2KB files, and see what that gives us. (Note that most of the output has been omitted.)

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; done
...
# sync
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:80000:800 800L/800P F=1 B=4484/4484

        path    /f2
Indirect blocks:
               0 L0 0:81200:800 800L/800P F=1 B=4484/4484

        path    /f3
Indirect blocks:
               0 L0 0:81a00:800 800L/800P F=1 B=4484/4484

        path    /f4
Indirect blocks:
               0 L0 0:82200:800 800L/800P F=1 B=4484/4484

        path    /f5
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4484/4484

        path    /f6
Indirect blocks:
               0 L0 0:87200:800 800L/800P F=1 B=4484/4484
...

So, they all fit within the same 128KB region of the disk (between 0x80000 and 0xa0000), and they are all in the same transaction group (4484). There is a gap between the space used for file f1 and file f2, f2 through f5 are contiguous, and there is another gap before f6. Does this result in one write to the disk? Hard to say, as it is difficult to correlate writes to disk with writes to ZFS files, and also because the "disk" here is actually a file. It should be possible to determine whether it is one write or multiple by using DTrace and a child dataset in a pool with real disks.
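A coarser check that doesn't need DTrace (a sketch; resolution is limited to the one-second samples, and it counts operations rather than correlating them with files) is to watch the pool's write operation counts in another terminal while the loop runs:

# zpool iostat -v testpool 1

The earlier bdev_strategy() one-liner is the more precise approach, since it shows the size of each individual I/O.

Would we get the same behavior if the writes were in separate transaction groups? Let's try to find out.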

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; sleep 6; done
...
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:ef400:800 800L/800P F=1 B=4827/4827

        path    /f2
Indirect blocks:
               0 L0 0:fd800:800 800L/800P F=1 B=4828/4828

        path    /f3
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4829/4829

        path    /f4
Indirect blocks:
               0 L0 0:88000:800 800L/800P F=1 B=4831/4831

        path    /f5
Indirect blocks:
               0 L0 0:8b200:800 800L/800P F=1 B=4832/4832

        path    /f6
Indirect blocks:
               0 L0 0:8ca00:800 800L/800P F=1 B=4833/4833
...

Each write is in a different transaction group, and the blocks are not contiguous; some are in different 128KB regions of the disk.

So, back to Marsell's questions... Marsell says that if he writes 6 2KB files, it will be a 12KB write. That is not clear from the above output. In fact, it may be six 2KB writes, one 128KB write, or even one 12KB write. I ran the first 6x2KB test a second time, and all of the files were contiguous on disk. It is also possible that even if the writes are in different transaction groups, they could all be contiguous.

Let's write 6 32KB files.

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=32k count=1; done
# zdb -dddddddd testpool
...
        path    /f1
Indirect blocks:
               0 L0 0:8da00:8000 8000L/8000P F=1 B=5108/5108

        path    /f2
Indirect blocks:
               0 L0 0:a8e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f3
Indirect blocks:
               0 L0 0:b0e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f4
Indirect blocks:
               0 L0 0:b8e00:8000 8000L/8000P F=1 B=5108/5108

        path    /f5
Indirect blocks:
               0 L0 0:efc00:8000 8000L/8000P F=1 B=5108/5108

        path    /f6
Indirect blocks:
               0 L0 0:da400:8000 8000L/8000P F=1 B=5108/5108
...

The writes are all in the same transaction group, but not all in the same 128KB region. In fact, as the next example shows, a single write may be spread across transaction groups. Note that this implies there can be data loss, i.e., not all data written in one write call ends up on disk if there is a power failure. ZFS guarantees consistency of the file system, i.e., a transaction is all or none. But if a write spans multiple transactions, some of those transactions may not make it to disk. Applications concerned about this should either use synchronous writes or have some other recovery mechanism. Note that synchronous writes use the ZFS intent log (ZIL), so performance may not be compromised.
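One way to get synchronous semantics without changing the application (a sketch; the property applies to the whole dataset, and the cost depends on your log device configuration) is to set the sync property:

# zfs set sync=always testpool
# zfs get sync testpool

Per-file alternatives are opening the file with O_DSYNC/O_SYNC, or calling fsync() at appropriate points.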

Here is a single write of 4MB.

# dd if=/dev/zero of=/testpool/big bs=4096k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
        path    /big
Indirect blocks:
               0 L1  0:620000:400 0:1300000:400 4000L/400P F=32 B=5410/5410
               0  L0 0:200000:20000 20000L/20000P F=1 B=5409/5409
           20000  L0 0:220000:20000 20000L/20000P F=1 B=5409/5409
           40000  L0 0:240000:20000 20000L/20000P F=1 B=5409/5409
           60000  L0 0:260000:20000 20000L/20000P F=1 B=5409/5409
           80000  L0 0:280000:20000 20000L/20000P F=1 B=5409/5409
           a0000  L0 0:2a0000:20000 20000L/20000P F=1 B=5409/5409
           c0000  L0 0:2c0000:20000 20000L/20000P F=1 B=5409/5409
           e0000  L0 0:2e0000:20000 20000L/20000P F=1 B=5409/5409
          100000  L0 0:300000:20000 20000L/20000P F=1 B=5409/5409
          120000  L0 0:320000:20000 20000L/20000P F=1 B=5409/5409
          140000  L0 0:340000:20000 20000L/20000P F=1 B=5409/5409
          160000  L0 0:360000:20000 20000L/20000P F=1 B=5409/5409
          180000  L0 0:380000:20000 20000L/20000P F=1 B=5409/5409
          1a0000  L0 0:3a0000:20000 20000L/20000P F=1 B=5409/5409
          1c0000  L0 0:3c0000:20000 20000L/20000P F=1 B=5409/5409
          1e0000  L0 0:3e0000:20000 20000L/20000P F=1 B=5409/5409
          200000  L0 0:400000:20000 20000L/20000P F=1 B=5409/5409
          220000  L0 0:420000:20000 20000L/20000P F=1 B=5409/5409
          240000  L0 0:440000:20000 20000L/20000P F=1 B=5409/5409
          260000  L0 0:460000:20000 20000L/20000P F=1 B=5409/5409
          280000  L0 0:485e00:20000 20000L/20000P F=1 B=5410/5410
          2a0000  L0 0:4a5e00:20000 20000L/20000P F=1 B=5410/5410
          2c0000  L0 0:4c5e00:20000 20000L/20000P F=1 B=5410/5410
          2e0000  L0 0:500000:20000 20000L/20000P F=1 B=5410/5410
          300000  L0 0:520000:20000 20000L/20000P F=1 B=5410/5410
          320000  L0 0:540000:20000 20000L/20000P F=1 B=5410/5410
          340000  L0 0:560000:20000 20000L/20000P F=1 B=5410/5410
          360000  L0 0:580000:20000 20000L/20000P F=1 B=5410/5410
          380000  L0 0:5a0000:20000 20000L/20000P F=1 B=5410/5410
          3a0000  L0 0:5c0000:20000 20000L/20000P F=1 B=5410/5410
          3c0000  L0 0:5e0000:20000 20000L/20000P F=1 B=5410/5410
          3e0000  L0 0:600000:20000 20000L/20000P F=1 B=5410/5410

The write is spread across 2 transaction groups (5409 and 5410). Examining the code in zfs_write(), you can see that each write is broken into recordsize-sized chunks, and each chunk results in a separate transaction (see the calls to dmu_tx_create() in that code). Those transactions can be spread across multiple transaction groups (see the calls to dmu_tx_assign()).

If the recordsize is set to 8K, the maximum size of a block will be 8KB. Let's give that a try and look at the results. Blocks that are already allocated are not affected.

# zfs set recordsize=8192 testpool
# zfs get recordsize testpool
NAME      PROPERTY    VALUE    SOURCE
testpool  recordsize  8K       local
# dd if=/dev/zero of=/testpool/smallblock bs=128k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
        path    /smallblock
Indirect blocks:
               0 L1  0:64e800:400 0:1312c00:400 4000L/400P F=16 B=5653/5653
               0  L0 0:624800:2000 2000L/2000P F=1 B=5653/5653
            2000  L0 0:627c00:2000 2000L/2000P F=1 B=5653/5653
            4000  L0 0:632800:2000 2000L/2000P F=1 B=5653/5653
            6000  L0 0:634800:2000 2000L/2000P F=1 B=5653/5653
            8000  L0 0:636800:2000 2000L/2000P F=1 B=5653/5653
            a000  L0 0:638800:2000 2000L/2000P F=1 B=5653/5653
            c000  L0 0:63a800:2000 2000L/2000P F=1 B=5653/5653
            e000  L0 0:63c800:2000 2000L/2000P F=1 B=5653/5653
           10000  L0 0:63e800:2000 2000L/2000P F=1 B=5653/5653
           12000  L0 0:640800:2000 2000L/2000P F=1 B=5653/5653
           14000  L0 0:642800:2000 2000L/2000P F=1 B=5653/5653
           16000  L0 0:644800:2000 2000L/2000P F=1 B=5653/5653
           18000  L0 0:646800:2000 2000L/2000P F=1 B=5653/5653
           1a000  L0 0:648800:2000 2000L/2000P F=1 B=5653/5653
           1c000  L0 0:64a800:2000 2000L/2000P F=1 B=5653/5653
           1e000  L0 0:64c800:2000 2000L/2000P F=1 B=5653/5653

Basically, the behavior is the same as with the default 128KB recordsize, except that the maximum size of a block is 8KB. This should hold for all blocks (data and metadata). Any modified metadata (due to copy-on-write) will also use the smaller block size. As for performance implications, I'll leave that to the Roch Bourbonnais blog referenced at the beginning.

For compression, nothing really changes. The maximum size of a compressed block is the recordsize. We'll reset the recordsize to the default and turn on lzjb compression.

# zfs set recordsize=128k testpool
# zfs set compression=lzjb testpool
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      local
testpool  compression  lzjb      local
#

And write 256KB...

# dd if=/dev/zero of=/testpool/zero bs=128k count=2
2+0 records in
2+0 records out
# zdb -dddddddd testpool
...
        path    /zero
Indirect blocks:

Good, so compression of all zeros resulted in no blocks at all. Let's write some real data.

# dd if=/usr/dict/words of=/testpool/foo.compressed bs=128k count=2
1+1 records in
1+1 records out
# zdb -dddddddd testpool
...
        path    /foo.compressed
Indirect blocks:
               0 L1  0:6b2200:400 0:1390800:400 4000L/400P F=2 B=5830/5830
               0  L0 0:690600:15200 20000L/15200P F=1 B=5830/5830
           20000  L0 0:6a5800:ca00 20000L/ca00P F=1 B=5830/5830
...

So, 1 indirect block and 2 blocks of compressed data. The first block of compressed data is 0x15200 bytes in size, the second is 0xca00. The two blocks are contiguous (0x690600 + 0x15200 = 0x6a5800, which is where the second block starts), so it is possible they are written in one write to the disk.

To conclude, recordsize is handled at the block level. It is the maximum size of a block that may be written by ZFS. Existing data/metadata is not changed if the recordsize is changed, and/or if compression is used. As for performance tuning, I would be careful of putting too much faith in the ZFS evil tuning guide. It is dated, some of the descriptions are not accurate, and there are things missing.

I'll have another ZFS-related blog post soon. Currently waiting for a bug to be fixed in zdb.




Post written by rachelbalik