Linux Software RAID With Western Digital WD20EARS Green Drives

In a previous post I mentioned that I had a bad experience setting up WD20EARS drives for use in mdadm software RAID under Linux. I thought it’d be a good idea to explain what the problems were and what my current configuration is, as there seems to be precious little obvious information about this online.

WD20EARS/EARX drives are “advanced format”, meaning they use a 4K sector layout rather than the standard 512-byte sectors found on most older hard drives. You can generally tell a drive like this by the “AF” or “advanced format” wording on the label. I hadn’t read this information and was blissfully unaware, so when I installed Fedora 17 and made a new partition table on each of the drives, I just picked the default options. Unfortunately, the default (at least in fdisk’s standard mode) is to start at sector 63 and use the full size of the drive.
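
As an aside, you can usually check whether a drive is advanced format before partitioning it by asking the kernel and the drive itself - a quick sketch, assuming the drive is sda:

cat /sys/block/sda/queue/physical_block_size
hdparm -I /dev/sda | grep -i 'sector size'

A drive that honestly reports its layout will show 4096 for the physical size. The catch with these first-generation EARS drives is that they report 512 bytes for both logical and physical sectors (as the fdisk output further down shows), which is exactly why the tools won’t align things for you automatically.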

What I found previously was that when I initially built my RAID array, the sync process would take around 48 hours to reach 99.2% done, then spurious errors would appear in my syslog and the rebuild would fail as a drive dropped out, leaving the array with too few working devices. This can be seen in /var/log/messages:

Jul 29 12:00:42 zeus kernel: [211732.938234] ata3.00: exception Emask 0x0 SAct 0x7f81c000 SErr 0x0 action 0x6 frozen
Jul 29 12:00:42 zeus kernel: [211732.940747] ata3.00: failed command: WRITE FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.943208] ata3.00: cmd 61/08:70:4f:00:00/00:00:00:00:00/40 tag 14 ncq 4096 out
Jul 29 12:00:42 zeus kernel: [211732.943212]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211732.948087] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211732.950566] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.953098] ata3.00: cmd 60/00:78:27:d0:99/04:00:e7:00:00/40 tag 15 ncq 524288 in
Jul 29 12:00:42 zeus kernel: [211732.953101]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211732.958229] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211732.960785] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.963505] ata3.00: cmd 60/08:80:27:d4:99/02:00:e7:00:00/40 tag 16 ncq 266240 in
Jul 29 12:00:42 zeus kernel: [211732.963508]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211732.968986] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211732.971698] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.974376] ata3.00: cmd 60/08:b8:e7:cf:99/00:00:e7:00:00/40 tag 23 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211732.974379]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211732.979570] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211732.982215] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.984698] ata3.00: cmd 60/08:c0:ef:cf:99/00:00:e7:00:00/40 tag 24 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211732.984701]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211732.990083] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211732.992825] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211732.995608] ata3.00: cmd 60/08:c8:f7:cf:99/00:00:e7:00:00/40 tag 25 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211732.995612]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.001215] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.004021] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211733.006885] ata3.00: cmd 60/08:d0:ff:cf:99/00:00:e7:00:00/40 tag 26 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211733.006888]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.012528] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.015117] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211733.017697] ata3.00: cmd 60/08:d8:07:d0:99/00:00:e7:00:00/40 tag 27 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211733.017700]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.023182] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.025894] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211733.028646] ata3.00: cmd 60/08:e0:1f:d0:99/00:00:e7:00:00/40 tag 28 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211733.028649]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.034257] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.037113] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211733.040008] ata3.00: cmd 60/08:e8:0f:d0:99/00:00:e7:00:00/40 tag 29 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211733.040012]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.045785] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.048709] ata3.00: failed command: READ FPDMA QUEUED
Jul 29 12:00:42 zeus kernel: [211733.051644] ata3.00: cmd 60/08:f0:17:d0:99/00:00:e7:00:00/40 tag 30 ncq 4096 in
Jul 29 12:00:42 zeus kernel: [211733.051648]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 29 12:00:42 zeus kernel: [211733.057550] ata3.00: status: { DRDY }
Jul 29 12:00:42 zeus kernel: [211733.060518] ata3: hard resetting link
Jul 29 12:00:42 zeus kernel: [211733.519527] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 29 12:00:42 zeus kernel: [211733.527772] ata3.00: configured for UDMA/133
Jul 29 12:00:42 zeus kernel: [211733.527787] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527799] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527807] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527813] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527819] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527824] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527829] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527834] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527840] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527845] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527850] ata3.00: device reported invalid CHS sector 0
Jul 29 12:00:42 zeus kernel: [211733.527869] sd 2:0:0:0: [sdc] Unhandled error code
Jul 29 12:00:42 zeus kernel: [211733.527874] sd 2:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jul 29 12:00:42 zeus kernel: [211733.527882] sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 e7 99 cf 6f 00 00 08 00
Jul 29 12:00:42 zeus kernel: [211733.527901] end_request: I/O error, dev sdc, sector 3885617007
Jul 29 12:00:42 zeus kernel: [211733.530703] md/raid:md0: read error not correctable (sector 3885616944 on sdc1).
Jul 29 12:00:42 zeus kernel: [211733.530712] md/raid:md0: Disk failure on sdc1, disabling device.
Jul 29 12:00:42 zeus kernel: [211733.530715] md/raid:md0: Operation continuing on 2 devices.

Not really knowing what happened, I realised after a few panicked minutes that I could use the “--force” parameter to tell mdadm to clear the errors and restart the array. I did this, only for it to fail rebuilding again at the same point. Convinced that my third WD20EARS drive must be faulty, I bought the closest thing I could find to a replacement - a WD20EARX drive - and installed this instead. I copied the old partition layout from the other working drives and rebuilt the array from scratch with this replacement drive, thinking everything would be fine.
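
For the record, forcing things back together went roughly along these lines (device names as in my setup; --force tells mdadm to assemble the array despite the recorded failure):

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[abcde]1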

Unfortunately, this was not the case. A mere 0.2% further into the resync process (at 99.4% this time), I got the same errors on a different drive - sda rather than sdc as before. Thinking that I couldn’t possibly be this unlucky, I did some more digging online and discovered that this is a symptom that can occur when a 4K sector drive is partitioned off its sector boundaries (for example starting at sector 63, as in my layout at the time).

So, bearing all this in mind - when you make partitions on 4K sector drives, it’s very important to start them on a 4K boundary, which with 512-byte logical sectors means a starting sector that’s a multiple of 8. After realising my error, I reran the partitioning and picked 2048 as a reasonable starting point. I also picked 3907029128 as the ending point, as it’s also a multiple of 8 and a little bit smaller than the whole drive. The idea of keeping partitions a tad smaller is to make it possible to replace a failed drive with one that has slightly fewer sectors, if you ever need to.
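
If you’d rather script the partitioning than answer fdisk’s prompts, something along these lines with parted should produce the same layout (it may complain about the end sector not being on its idea of an optimal boundary - tell it to ignore that - and you’ll still want to set the partition type afterwards, as described below):

parted /dev/sda mklabel msdos
parted /dev/sda mkpart primary 2048s 3907029128s

Either way, here’s the finished layout as fdisk sees it: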

[root@zeus ~]# fdisk -u -l /dev/sda

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
81 heads, 24 sectors/track, 2009788 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x19cac868

   Device Boot      Start         End      Blocks   Id  System
   /dev/sda1            2048  3907029128  1953513540+  da  Non-FS data

All five drives in my RAID array have been partitioned the same way. You can achieve this more easily with:

for X in sdb sdc sdd sde; do
  sfdisk -d /dev/sda | sfdisk /dev/$X
done
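
It’s then worth a quick sanity check that every partition really does start on a 4K boundary - the start sector of each partition is exposed in sysfs, and anything not divisible by 8 is misaligned:

for X in sda sdb sdc sdd sde; do
  echo "$X: starts at sector $(cat /sys/block/$X/${X}1/start)"
done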

The recommendation is that you use type “da” for “non-fs data” rather than “83” for Linux or “fd” for “RAID autodetect”. This stops distribution startup scripts from searching for superblocks on drives marked as “fd” and trying to assemble them in a funny order.
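
If your partitions already exist with the wrong type, it can be changed in place without touching the data. Recent versions of sfdisk spell this --part-type (older ones used --id), so something like this should do it:

for X in sda sdb sdc sdd sde; do
  sfdisk --part-type /dev/$X 1 da
done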

Once your drives are partitioned, you can create the RAID array. The default chunk size in recent versions of mdadm is 512k, but as the purpose of this NAS is mostly storing large files, my belief is that a 2048k chunk will be more efficient. Again, bear in mind the alignment logic - the chunk size should be a multiple of 4k.

mdadm --verbose --create /dev/md0 --level=5 --chunk=2048 --raid-devices=5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

The mdraid superblocks will be written to the drives and the array will start syncing. Bearing in mind that even when I built the array with the incorrect partition boundaries I was getting about 30MB/sec resync speed at the absolute maximum, I was initially a little disappointed with the figures I saw.

[root@zeus ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[5] sdb1[1] sdc1[2] sda1[0] sdd1[4]
      7813521408 blocks super 1.2 level 5, 2048k chunk, algorithm 2 [5/5] [UUUUU]
      [=>...................]  recovery = 1.8% (2007428/7813521408) finish=1152.7min speed=32732K/sec

unused devices: <none>

I’d hoped for better than this after the reports I’d read online, so wondered if the effort had all been worth it. Once the array had been syncing for about half an hour, however, the speed suddenly skyrocketed:

[root@zeus ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[5] sdb1[1] sdc1[2] sda1[0] sdd1[4]
      7813521408 blocks super 1.2 level 5, 2048k chunk, algorithm 2 [5/5] [UUUUU]
      [=>...................]  recovery = 2.5% (3007416/7813521408) finish=352.1min speed=112418K/sec

unused devices: <none>

It stayed around this speed until the end, then my array was finally synced and ready to go.
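
With the array built, it’s also worth recording it in /etc/mdadm.conf so it reassembles cleanly at boot:

mdadm --detail --scan >> /etc/mdadm.conf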

My policy is to use LVM for storage devices as it makes for easier migrations in future if necessary, as well as keeping the management tidier. I’m telling vgcreate to use a 2048k physical extent size to match mdadm’s chunk size.

# pvcreate /dev/md0
# vgcreate -s 2048k vg_raid /dev/md0
# lvcreate -n lv_nas -l 100%FREE vg_raid
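
You can confirm the extent size took effect with vgdisplay:

# vgdisplay vg_raid | grep 'PE Size'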

Once the logical volume is created, we can format it. A good policy is to tell the filesystem about the RAID geometry using the stride and stripe_width options. The stride is the RAID chunk size expressed in filesystem blocks, and the stripe_width is the stride multiplied by the number of data drives in the array. As this is a 5-disk RAID5 array, we have four “data drives” (n-1) and one drive’s worth of parity, although obviously the parity is striped across all the disks in RAID5 rather than living on a dedicated disk like RAID3/RAID4.

So, with a 2048k chunk, the default 4k ext4 block size and four “data” drives:

stride = (2048 / 4) = 512    stripe_width = (512 * 4) = 2048
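
Or, to avoid doing the sums by hand, a trivial shell sketch (the variable names are just mine):

CHUNK_KB=2048    # mdadm chunk size in KiB
BLOCK_KB=4       # ext4 block size in KiB
DATA_DISKS=4     # drives in the array minus one for RAID5 parity
echo "stride=$((CHUNK_KB / BLOCK_KB)) stripe_width=$((CHUNK_KB / BLOCK_KB * DATA_DISKS))"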

# mkfs.ext4 -m0 -E stride=512,stripe_width=2048 /dev/vg_raid/lv_nas

This step completes quickly. The -m0 says we don’t want any space reserved for the superuser (normally 5% is held back, but 5% of 7.3TB is about 370GB and certainly not space I’d be willing to surrender).
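
If you only decide this after running mkfs, the reserved percentage can also be dropped later with tune2fs:

# tune2fs -m 0 /dev/vg_raid/lv_nas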

After this, we can add the array to /etc/fstab, mount the partition and do some testing.
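
The fstab entry itself is nothing special - something like this (the noatime is just my preference):

/dev/vg_raid/lv_nas  /mnt/nas  ext4  defaults,noatime  0 2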

[root@zeus ~]# mkdir /mnt/nas
[root@zeus ~]# mount /dev/vg_raid/lv_nas /mnt/nas

[root@zeus ~]# for X in {1..4}; do hdparm -t /dev/md0; done
/dev/md0:
 Timing buffered disk reads: 1034 MB in  3.12 seconds = 331.10 MB/sec

/dev/md0:
 Timing buffered disk reads: 1066 MB in  3.00 seconds = 355.19 MB/sec

/dev/md0:
 Timing buffered disk reads: 1198 MB in  3.00 seconds = 399.27 MB/sec

/dev/md0:
 Timing buffered disk reads: 1034 MB in  3.08 seconds = 335.96 MB/sec

An average of 355MB/sec for reads from the array - not bad at all. Given that the disks can manage about 80MB/sec for sequential reads individually, this combined speed is about in line with what I’d expect. It’s quite easy to read from lots of drives at once, though - RAID5 generally struggles with writes. How quickly can we manage those?

[root@zeus ~]# dd if=/dev/zero of=/mnt/nas/bigfile bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 38.4125 s, 280 MB/s
[root@zeus ~]# dd if=/dev/zero of=/mnt/nas/bigfile1 bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 39.0001 s, 275 MB/s

278MB/sec or so writing to 5 disks in RAID5 is pretty good going. I wish I had the numbers saved from before the rebuild so I could show just how much better this configuration is than before.
