April 6th 2018

linux, fsck

Checking all hard drives for errors


First of all, I fired up my trusty SysRescueCD USB stick.

Start with getting some SMART information:

smartctl -a /dev/sda

Perform a short test - takes a few minutes only:

smartctl -t short /dev/sda

Check results:

smartctl -l selftest /dev/sda

General health report:

smartctl -H /dev/sda

My external USB 3.0 hard drive was not recognized automatically; I guessed that it was SCSI and passed the device type explicitly with -d:

smartctl -d scsi -a /dev/sdc
smartctl -t short -d scsi /dev/sdc
smartctl -l selftest -d scsi /dev/sdc
smartctl -H -d scsi /dev/sdc

Next, a scan for bad blocks (read-only by default). This took the longest; around 1.5h for a 250GB hard drive:

badblocks -b 4096 -c 4096 -s -v /dev/sda
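To put the flags and the run time into perspective: -b is the block size in bytes and -c the number of blocks tested at a time. A quick back-of-the-envelope calculation, assuming that "250GB" means 250 * 10^9 bytes:

```shell
# Rough numbers for the badblocks run above.
block_size=4096        # -b: block size in bytes
blocks_per_batch=4096  # -c: number of blocks tested at a time

batch_mib=$(( block_size * blocks_per_batch / 1024 / 1024 ))
echo "batch size: ${batch_mib} MiB"

# Assumption: 250GB = 250000 MB (decimal), 1.5h = 5400 seconds.
disk_mb=250000
seconds=5400
throughput=$(( disk_mb / seconds ))
echo "average read rate: ~${throughput} MB/s"
```

So each batch covers 16 MiB, and the whole scan averaged roughly 46 MB/s.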

An additional filesystem check - make sure that the partitions are not mounted. With the -f switch (which forces a check even if the filesystem looks clean), this took a minute or so per partition.

fdisk -l /dev/sda
fsck -fV /dev/sda1
fsck -fV /dev/sda3
fsck -fV /dev/sda4
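Since running fsck on a mounted partition is risky, a small guard might help. This is only a sketch: is_mounted and safe_fsck are names I made up, and the check simply looks for the device in /proc/mounts, so it will miss partitions that are mounted under another name (e.g. via /dev/mapper) or used as swap.

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Returns 0 if DEVICE appears as a mount source in MOUNTS_FILE
# (default: /proc/mounts), non-zero otherwise.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# safe_fsck DEVICE (hypothetical helper)
# Runs the forced check only if DEVICE is not listed as mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is mounted, skipping" >&2
        return 1
    fi
    fsck -fV "$1"
}
```

Usage would then be safe_fsck /dev/sda1 instead of calling fsck directly.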

It did find some errors on one partition, which it asked me to fix, and I answered yes. Alarmed by this, I re-ran the SMART test on that hard drive, a bit more thorough this time. First, let's see how long this will take:

smartctl -c /dev/sdb

Over an hour, ugh. Nevertheless:

smartctl -t long /dev/sdb

But after only a few minutes, smartctl -l selftest /dev/sdb told me that the extended test had completed without error. Hmmm :-|
I guess this will have to do for now.
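The time estimate is buried somewhere in the smartctl -c output. To pull out just that number, something like the following might work. It is a sketch that assumes the usual ATA layout, where a line "Extended self-test routine" is followed by a "recommended polling time" line; other device types may print this differently. The function name extended_test_minutes is my own.

```shell
# extended_test_minutes
# Reads `smartctl -c` output on stdin and prints the recommended polling
# time (in minutes) of the extended self-test.
extended_test_minutes() {
    awk '
        /Extended self-test routine/ { in_ext = 1; next }
        in_ext && /recommended polling time/ {
            gsub(/[^0-9]/, "", $0)   # keep only the digits
            print
            exit
        }
    '
}
```

Usage: smartctl -c /dev/sdb | extended_test_minutes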

Rinse and repeat for all hard drives; make sure they're not mounted.
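The "rinse and repeat" part could be scripted roughly like this. A sketch only: list_disks and health_all are made-up names, and it only runs the quick health report. Drives that are not recognized automatically (like my USB drive) would still need the -d scsi option added by hand.

```shell
# list_disks
# Prints the /dev paths of whole disks (no partitions, no optical drives).
list_disks() {
    lsblk -d -n -o NAME,TYPE | awk '$2 == "disk" { print "/dev/" $1 }'
}

# health_all
# Runs the general SMART health report on every disk found.
health_all() {
    for disk in $(list_disks); do
        echo "== $disk =="
        smartctl -H "$disk"
    done
}
```

Calling health_all as root then prints one health report per disk.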

Some helpful links:

https://askubuntu.com/questions/539184/how-do-i-check-the-integrity-of-a-storage-medium-hard-disk-or-flash-drive
https://www.thomas-krenn.com/en/wiki/SMART_tests_with_smartctl
https://blog.shadypixel.com/monitoring-hard-drive-health-on-linux-with-smartmontools/
https://www.maketecheasier.com/check-repair-filesystem-fsck-linux/

What if Problems Are Found?

# fsck -fV /dev/sdc1
fsck from util-linux 2.32
[/usr/bin/fsck.ext4 (1) -- /home/backup] fsck.ext4 -f /dev/sdc1 
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create<y>? yes
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(131366912--131371007)
Fix<y>? yes
Free blocks count wrong for group #4009 (10240, counted=32768).
Fix<y>? yes
Free blocks count wrong for group #4046 (28672, counted=32768).
Fix<y>? yes
Free blocks count wrong (123602180, counted=123628804).
Fix<y>? yes

recovery+backup: ***** FILE SYSTEM WAS MODIFIED *****
recovery+backup: 205354/61046784 files (1.8% non-contiguous), 120553212/244182016 blocks
# fsck -fV /dev/sdc1
fsck from util-linux 2.32
[/usr/bin/fsck.ext4 (1) -- /home/backup] fsck.ext4 -f /dev/sdc1 
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
recovery+backup: 205354/61046784 files (1.8% non-contiguous), 120553212/244182016 blocks

I had a look in lost+found, and it was empty. I assume that means no data loss, and this chapter of another useful article seems to confirm the assumption.
Nevertheless, this partition hosts my backups, so I want to be very sure:

borg check --info --verify-data /path/to/borgbackupdir
Starting repository check
Starting repository index check
Completed repository check, no problems found.
Starting archive consistency check...
Starting cryptographic data integrity verification...
Finished cryptographic data integrity verification, verified 74519 chunks with 0 integrity errors.
Analyzing archive 201709030041 (1/9)
Analyzing archive 201709031032 (2/9)
Analyzing archive 201709031221 (3/9)
Analyzing archive 201709091605 (4/9)
Analyzing archive 201709161743 (5/9)
Analyzing archive 201709240050 (6/9)
Analyzing archive 201709301658 (7/9)
Analyzing archive 201710071610 (8/9)
Analyzing archive 201710141636 (9/9)
Archive consistency check complete, no problems found.

This is very slow; for a daily or weekly check, one might want to remove --verify-data.
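One way to keep both variants handy is a tiny wrapper with a quick and a full mode. This is just a sketch; check_backup is a made-up name, and the repository path is whatever you pass in.

```shell
# check_backup MODE REPO
#   quick: repository and archive consistency only
#   full:  additionally verify the data itself (slow)
check_backup() {
    mode=$1
    repo=$2
    case "$mode" in
        quick) borg check --info "$repo" ;;
        full)  borg check --info --verify-data "$repo" ;;
        *)     echo "usage: check_backup quick|full REPO" >&2; return 2 ;;
    esac
}
```

For example, check_backup quick /path/to/borgbackupdir for the regular run, and check_backup full /path/to/borgbackupdir once in a while.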