====== Replacing faulty disks in a ZFS RAID System ======

Here are my experiences with a faulty hard disk drive on a freshly installed FreeBSD 11.0 system, using the auto ZFS option during the install to create a raidz2 pool with 4 disks.

The initial symptoms were that Xorg had frozen and some other services had crashed. I was able to ssh into the machine and saw a lot of disk I/O errors for ada0 in /var/log/messages, so I decided to scrub the pool:

  # zpool scrub zroot
  # zpool status
    pool: zroot
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
     see: http://illumos.org/msg/ZFS-8000-9P
    scan: scrub in progress since Fri Apr 14 13:34:02 2017
          662G scanned out of 1.66T at 124M/s, 2h22m to go
          1.06M repaired, 38.96% done
  config:
          NAME        STATE     READ WRITE CKSUM
          zroot       ONLINE       0     0     0
            raidz2-0  ONLINE       0     0     0
              ada0p3  ONLINE       3    10    19  (repairing)
              ada1p3  ONLINE       0     0     0
              ada2p3  ONLINE       0     0     0
              ada3p3  ONLINE       0     0     0

After the scrub completed I was still getting a lot of I/O errors in /var/log/messages, so I decided to replace the SATA cable on the ada0 drive. The motherboard manual came in handy here to determine which physical drive was ada0, or you could use something like:

  # dmesg | grep -B1 'ada0: Serial Number'

After replacing the SATA cable there were still I/O errors in /var/log/messages. Worse still, after a reboot I got the following error and the system would not boot at all:

  gptzfsboot: error 16 lba XXXXXXXXX

Decimal 16 is hex 0x10. The following table of status codes is referenced from [[http://www.ctyme.com/intr/rb-0606.htm#Table234|External Link]]:

  00h  successful completion
  01h  invalid function in AH or invalid parameter
  02h  address mark not found
  03h  disk write-protected
  04h  sector not found/read error
  05h  reset failed (hard disk)
  05h  data did not verify correctly (TI Professional PC)
  06h  disk changed (floppy)
  07h  drive parameter activity failed (hard disk)
  08h  DMA overrun
  09h  data boundary error (attempted DMA across 64K boundary or >80h sectors)
  0Ah  bad sector detected (hard disk)
  0Bh  bad track detected (hard disk)
  0Ch  unsupported track or invalid media
  0Dh  invalid number of sectors on format (PS/2 hard disk)
  0Eh  control data address mark detected (hard disk)
  0Fh  DMA arbitration level out of range (hard disk)
  10h  uncorrectable CRC or ECC error on read
  11h  data ECC corrected (hard disk)
  20h  controller failure
  31h  no media in drive (IBM/MS INT 13 extensions)
  32h  incorrect drive type stored in CMOS (Compaq)
  40h  seek failed
  80h  timeout (not ready)
  AAh  drive not ready (hard disk)
  B0h  volume not locked in drive (INT 13 extensions)
  B1h  volume locked in drive (INT 13 extensions)
  B2h  volume not removable (INT 13 extensions)
  B3h  volume in use (INT 13 extensions)
  B4h  lock count exceeded (INT 13 extensions)
  B5h  valid eject request failed (INT 13 extensions)
  B6h  volume present but read protected (INT 13 extensions)
  BBh  undefined error (hard disk)
  CCh  write fault (hard disk)
  E0h  status register error (hard disk)
  FFh  sense operation failed (hard disk)

So the error is:

  10h  uncorrectable CRC or ECC error on read

Time to try a replacement hard disk!
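The decimal-to-hex step above (error 16 is 10h) can be done in the shell instead of by hand. This is only a small sketch: ''bios_err_hex'' is a made-up helper name, not part of FreeBSD, and it does nothing more than reformat the number so it can be looked up in the status table.

```shell
#!/bin/sh
# bios_err_hex is a hypothetical helper (not a FreeBSD tool).
# It converts the decimal error number printed by gptzfsboot into the
# two-digit "NNh" form used in the status table above.
bios_err_hex() {
    printf '%02Xh\n' "$1"
}

# Example: the "error 16" reported at boot
bios_err_hex 16
```

The printed value is the key to look up in the table.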
Going back to my original zpool status output:

  config:
          NAME        STATE     READ WRITE CKSUM
          zroot       ONLINE       0     0     0
            raidz2-0  ONLINE       0     0     0
              ada0p3  ONLINE       3    10    19  (repairing)
              ada1p3  ONLINE       0     0     0
              ada2p3  ONLINE       0     0     0
              ada3p3  ONLINE       0     0     0

If the faulty disk is still running, take it offline first (this will allow you to easily reattach the drive again if something goes wrong later):

  # zpool offline zroot ada0p3

Shut down the system, replace the faulty disk drive, and power the system back on:

  # zpool status
  config:
          NAME                      STATE     READ WRITE CKSUM
          zroot                     DEGRADED     0     0     0
            raidz2-0                DEGRADED     0     0     0
              15788859347225537330  UNAVAIL      0     0     0  was ada0
              ada1p3                ONLINE       0     0     0
              ada2p3                ONLINE       0     0     0
              ada3p3                ONLINE       0     0     0

  # zpool online zroot 15788859347225537330
  # zpool replace zroot 15788859347225537330 ada0p3

The replace complains about missing labels because the new disk has no partitions yet, so create some. First take the device offline again and check how the existing disks are laid out:

  # zpool offline zroot 15788859347225537330
  # ls /dev/ada*
  /dev/ada0    /dev/ada1p3  /dev/ada2p3  /dev/ada3p3
  /dev/ada1    /dev/ada2    /dev/ada3
  /dev/ada1p1  /dev/ada2p1  /dev/ada3p1
  /dev/ada1p2  /dev/ada2p2  /dev/ada3p2

  # gpart show
  =>       40  976773088  ada1  GPT  (466G)
           40       1024     1  freebsd-boot  (512K)
         1064        984        - free -  (492K)
         2048    4194303     2  freebsd-swap  (2.0G)
      4196352  972576768     3  freebsd-zfs  (464G)
    976773120          8        - free -  (4.0K)
  =>       40  976773088  ada2  GPT  (466G)
           40       1024     1  freebsd-boot  (512K)
         1064        984        - free -  (492K)
         2048    4194303     2  freebsd-swap  (2.0G)
      4196352  972576768     3  freebsd-zfs  (464G)
    976773120          8        - free -  (4.0K)
  =>       40  976773088  ada3  GPT  (466G)
           40       1024     1  freebsd-boot  (512K)
         1064        984        - free -  (492K)
         2048    4194303     2  freebsd-swap  (2.0G)
      4196352  972576768     3  freebsd-zfs  (464G)
    976773120          8        - free -  (4.0K)
  =>       40  976773088  ada0  GPT  (466G)

  # gpart show -l ada1
  =>       40  976773088  ada1  GPT  (466G)
           40       1024     1  gptboot1  (512K)
         1064        984        - free -  (492K)
         2048    4194303     2  swap1  (2.0G)
      4196352  972576768     3  zfs1  (464G)
    976773120          8        - free -  (4.0K)

Create matching partitions on the new disk and install the boot code:

  # gpart add -a 4k -s 512k -l gptboot0 -t freebsd-boot ada0
  # gpart add -b 1m -s 2g -l swap0 -t freebsd-swap ada0
  # gpart add -a 4k -l zfs0 -t freebsd-zfs ada0
  # gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

Then bring the device online and replace it with the new partition:

  # zpool online zroot 15788859347225537330
  # zpool replace zroot 15788859347225537330 ada0p3
  # zpool status

zpool status should now show replacing-0 and ada0p3 resilvering.

Note: if you stuff up the 'gpart add' partitioning, you can delete the partitions as follows:

  # gpart delete -i 3 ada0
  # gpart delete -i 2 ada0
  # gpart delete -i 1 ada0
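The partitioning commands above follow the same label pattern as the existing disks (gptboot1, swap1, zfs1 on ada1). If a different disk fails, the same four commands apply with the disk name and label suffix changed. The sketch below only prints the commands so they can be reviewed before being run by hand; ''gen_gpart_cmds'' is a hypothetical helper name, and it assumes the layout shown in this article (512k boot, 2g swap, the rest ZFS).

```shell
#!/bin/sh
# Dry run only: prints the commands, does not execute them.
# gen_gpart_cmds is a hypothetical helper, not a FreeBSD tool.
#   $1 = disk device name (for example ada0)
#   $2 = label suffix     (for example 0, giving gptboot0/swap0/zfs0)
gen_gpart_cmds() {
    disk=$1
    n=$2
    echo "gpart add -a 4k -s 512k -l gptboot${n} -t freebsd-boot ${disk}"
    echo "gpart add -b 1m -s 2g -l swap${n} -t freebsd-swap ${disk}"
    echo "gpart add -a 4k -l zfs${n} -t freebsd-zfs ${disk}"
    echo "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ${disk}"
}

# Example: the replacement for ada0 used in this article
gen_gpart_cmds ada0 0
```

Printing rather than executing is deliberate: a typo in the disk name here would repartition a healthy pool member.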