Replacing faulty disks in a ZFS RAID System

Here are my experiences with a faulty hard disk drive on a freshly installed FreeBSD 11.0 system. The auto ZFS option was used during the install to create a raidz2 pool with 4 disks.

The initial symptoms were a frozen Xorg and several other services that had crashed.

I was able to ssh into the machine and saw a lot of disk I/O errors for ada0 in /var/log/messages.

So I decided to scrub the pool and check its status:

# zpool scrub zroot

# zpool status
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Apr 14 13:34:02 2017
        662G scanned out of 1.66T at 124M/s, 2h22m to go
        1.06M repaired, 38.96% done
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    ada0p3  ONLINE       3    10    19  (repairing)
	    ada1p3  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0
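
On a pool with many devices it can help to filter the status output down to the devices that actually show errors. The following is a minimal sketch, assuming the column layout shown above (NAME, STATE, READ, WRITE, CKSUM); the function name faulty_vdevs is my own and not part of ZFS.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print every device whose
# READ, WRITE or CKSUM counter is non-zero, as: NAME READ WRITE CKSUM
faulty_vdevs() {
    awk '
        # The counter table starts right after the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the "errors:" footer ends the table.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Inside the table, report rows with any non-zero counter.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
        }
    '
}

# Intended use on the affected machine:
#   zpool status zroot | faulty_vdevs
```

With the output above this would print only the ada0p3 line.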

After the scrub completed I was still getting a lot of I/O errors in /var/log/messages, so I decided to replace the SATA cable on my ada0 drive.

The motherboard manual came in handy here to determine which physical drive was ada0. Alternatively, you could look up the drive's serial number with something like:

# dmesg | grep -B1 'ada0: Serial Number'
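
To see the serial numbers of all drives at once, the same dmesg lines can be turned into a short device-to-serial list. This is only a sketch, assuming the "adaN: Serial Number ..." line format that the grep above relies on; disk_serials is a made-up helper name.

```shell
#!/bin/sh
# Read dmesg output on stdin and print "device serial" pairs,
# taken from lines such as "ada0: Serial Number <serial>".
disk_serials() {
    awk '$2 == "Serial" && $3 == "Number" {
        sub(/:$/, "", $1)   # strip the trailing colon from "ada0:"
        print $1, $4
    }'
}

# Intended use:
#   dmesg | disk_serials
```

The serial number printed for ada0 can then be matched against the label on the physical drive.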

After replacing the SATA cable there were still I/O errors in /var/log/messages. Worse still, after a reboot the system didn't want to boot at all and I got the following error:

gptzfsboot: error 16 lba XXXXXXXXX

Decimal 16 is 0x10 in hex, which is written as 10h in the list below.
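
The conversion can be checked from the shell with printf. This is a small sketch; the function name bios_hex is my own.

```shell
#!/bin/sh
# Convert a decimal error number, as printed by gptzfsboot, into the
# two-digit "NNh" notation used in BIOS status code lists.
bios_hex() {
    printf '%02Xh\n' "$1"
}

bios_hex 16    # prints 10h
```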

The following list of BIOS INT 13h status codes is referenced from an external link:

00h    successful completion
01h    invalid function in AH or invalid parameter
02h    address mark not found
03h    disk write-protected
04h    sector not found/read error
05h    reset failed (hard disk)
05h    data did not verify correctly (TI Professional PC)
06h    disk changed (floppy)
07h    drive parameter activity failed (hard disk)
08h    DMA overrun
09h    data boundary error (attempted DMA across 64K boundary or >80h sectors)
0Ah    bad sector detected (hard disk)
0Bh    bad track detected (hard disk)
0Ch    unsupported track or invalid media
0Dh    invalid number of sectors on format (PS/2 hard disk)
0Eh    control data address mark detected (hard disk)
0Fh    DMA arbitration level out of range (hard disk)
10h    uncorrectable CRC or ECC error on read
11h    data ECC corrected (hard disk)
20h    controller failure
31h    no media in drive (IBM/MS INT 13 extensions)
32h    incorrect drive type stored in CMOS (Compaq)
40h    seek failed
80h    timeout (not ready)
AAh    drive not ready (hard disk)
B0h    volume not locked in drive (INT 13 extensions)
B1h    volume locked in drive (INT 13 extensions)
B2h    volume not removable (INT 13 extensions)
B3h    volume in use (INT 13 extensions)
B4h    lock count exceeded (INT 13 extensions)
B5h    valid eject request failed (INT 13 extensions)
B6h    volume present but read protected (INT 13 extensions)
BBh    undefined error (hard disk)
CCh    write fault (hard disk)
E0h    status register error (hard disk)
FFh    sense operation failed (hard disk)

So error 10h is an uncorrectable CRC or ECC error on read.

Time to try a replacement hard disk!

So, going back to my original zpool status output:

config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    ada0p3  ONLINE       3    10    19  (repairing)
	    ada1p3  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0

If the faulty disk is still running then take it offline first. This will allow you to easily reattach the drive again if something goes wrong later.

# zpool offline zroot ada0p3

Shut down the system, replace the faulty disk drive, and power the system back on.

# zpool status

config:

	NAME        STATE     READ WRITE CKSUM
	zroot       DEGRADED       0     0     0
	  raidz2-0  DEGRADED       0     0     0
	    15788859347225537330  UNAVAIL       0    0    0  was ada0 
	    ada1p3  ONLINE       0     0     0
	    ada2p3  ONLINE       0     0     0
	    ada3p3  ONLINE       0     0     0
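
The long number is the GUID that ZFS uses for the missing device, and it is easy to mistype. Below is a minimal sketch for pulling it out of the status output, assuming the missing device is listed by a purely numeric name as above; unavail_guids is a hypothetical helper name.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print the numeric GUID of
# every device that is listed by GUID and is not available.
unavail_guids() {
    awk '($2 == "UNAVAIL" || $2 == "OFFLINE" || $2 == "REMOVED" || $2 == "FAULTED") \
         && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Intended use:
#   guid=$(zpool status zroot | unavail_guids)
#   echo "$guid"
```

The value in $guid can then be used in place of the typed-out number in the zpool commands that follow.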

# zpool online zroot 15788859347225537330

# zpool replace zroot 15788859347225537330 ada0p3

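The replace complains about missing labels. The new disk has no partitions yet (there is no /dev/ada0p3 in the listing below), so take the device offline again and create them to match the other disks.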

# zpool offline zroot 15788859347225537330

# ls /dev/ada*
/dev/ada0      /dev/ada1p3    /dev/ada2p3    /dev/ada3p3
/dev/ada1      /dev/ada2      /dev/ada3
/dev/ada1p1    /dev/ada2p1    /dev/ada3p1
/dev/ada1p2    /dev/ada2p2    /dev/ada3p2

# gpart show
=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)
  
=>       40  976773088  ada2  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)

=>       40  976773088  ada3  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)

=>       40  976773088  ada0  GPT  (466G)

# gpart show -l ada1
=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  gptboot1  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  swap1  (2.0G)
    4196352  972576768     3  zfs1  (464G)
  976773120          8        - free -  (4.0K)
  
  
# gpart add -a 4k -s 512k -l gptboot0 -t freebsd-boot ada0
# gpart add -b 1m -s 2g -l swap0 -t freebsd-swap ada0
# gpart add -a 4k -l zfs0 -t freebsd-zfs ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
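
Since the four commands differ only in the disk name and the label suffix, they can be generated for review before anything is written to the disk. This is a dry-run sketch using exactly the commands above; gpart_plan is a made-up function name, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not run) the partitioning commands for a replacement disk.
#   $1 = disk name, e.g. ada0
#   $2 = label suffix, e.g. 0 (gives gptboot0, swap0, zfs0)
gpart_plan() {
    disk=$1
    n=$2
    printf '%s\n' \
        "gpart add -a 4k -s 512k -l gptboot${n} -t freebsd-boot ${disk}" \
        "gpart add -b 1m -s 2g -l swap${n} -t freebsd-swap ${disk}" \
        "gpart add -a 4k -l zfs${n} -t freebsd-zfs ${disk}" \
        "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ${disk}"
}

gpart_plan ada0 0
```

Once the printed commands look right for the disk in question they can be pasted into a root shell one at a time.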

# zpool online zroot 15788859347225537330

# zpool replace zroot 15788859347225537330 ada0p3

# zpool status

The output should now show replacing-0, with ada0p3 resilvering.
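
To keep an eye on the resilver without reading the whole status output each time, the percentage can be extracted from the scan section. This is a rough sketch, assuming the progress line ends in "NN.NN% done" as in the scrub output earlier; other ZFS versions may format this line differently, and scan_percent is a hypothetical name.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print the scrub/resilver
# progress percentage from a line such as "... repaired, 38.96% done".
scan_percent() {
    sed -n 's/.*, \([0-9.]*\)% done.*/\1/p'
}

# Intended use:
#   zpool status zroot | scan_percent
```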

Note: if you make a mistake with the 'gpart add' partitioning, you can delete the partitions again as follows:

# gpart delete -i 3 ada0
# gpart delete -i 2 ada0
# gpart delete -i 1 ada0

zfsvdevs.txt · Last modified: 2017/04/15 05:29 by matti.k
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International