Once in a while my Linux system won’t boot and gives filesystem errors. I can “fix” them by booting with a LiveCD and running:
sudo fsck -y /dev/sda1
The command says it finds bad blocks and fixes them, then the system will boot again. Does the fact that they keep happening indicate hardware failure, or could there be something else wrong?
I note that when I instead run:
sudo fsck -y /dev/sda
I get these errors:
fsck from util-linux 2.34
[/usr/sbin/fsck.ext2 (1) -- /dev/sda] fsck.ext2 /dev/sda
e2fsck 1.45.5 (07-Jan-2020)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sda
The superblock could not be read or does not describe a valid ext2/ext3/ext4 filesystem. If the device is valid and it really contains an ext2/ext3/ext4 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device> or
e2fsck -b 32768 <device>
Found a dos partition table in /dev/sda
Is this because it’s invalid to run fsck on the whole disk instead of just one partition, or is there something corrupt on my drive? I’ve seen many places on the internet giving instructions that run fsck on the whole disk. My disk has only one partition, a Linux ext4 one.
Here is a picture of the Disks application Smart Data & Tests window.
The result of grep -i FPDMA /var/log/syslog* is:
adam>grep -i FPDMA /var/log/syslog*
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [ 728.921941] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [ 729.213899] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:20 adam-gregs-better-computer kernel: [ 729.373884] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:42:40 adam-gregs-better-computer kernel: [ 870.000879] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:42:40 adam-gregs-better-computer kernel: [ 870.000904] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:05 adam-gregs-better-computer kernel: [ 895.312734] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:05 adam-gregs-better-computer kernel: [ 895.312760] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:06 adam-gregs-better-computer kernel: [ 895.476760] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:06 adam-gregs-better-computer kernel: [ 895.640724] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [ 938.924872] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [ 938.924901] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [ 938.924924] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:49 adam-gregs-better-computer kernel: [ 938.924945] ata3.00: failed command: WRITE FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:53 adam-gregs-better-computer kernel: [ 942.878558] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:43:53 adam-gregs-better-computer kernel: [ 942.878583] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog.1:Sep 18 08:30:43 adam-gregs-better-computer kernel: [ 33.579255] ata3.00: failed command: READ FPDMA QUEUED
I would suggest that with your system constantly needing to run a file system check, your disk might be failing, especially when you get bad block notices every single fsck. I would start backing up your data to another drive and prepare for a reinstallation soon to a new disk, since a dying disk is a fast way to lose your important data.
– Thomas Ward♦ Sep 21, 2021 at 14:52
- Edit your question and show me screenshots of the Disks application SMART Data & Tests data window. Resize the window to capture all of the data for the screenshot. Start comments to me with @heynnema or I’ll miss them. – heynnema Sep 21, 2021 at 15:34
- Is this a SSD or HDD? How old is it? – heynnema Sep 21, 2021 at 17:50
- Edit your question and show me grep -i FPDMA /var/log/syslog*. – heynnema Sep 21, 2021 at 18:19
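To answer your last question first: fsck is a file system check, not a disk check, so it has to be pointed at a device that holds a file system. On your drive that is the partition /dev/sda1. /dev/sda is the whole disk, which starts with the partition table rather than an ext4 superblock; that is why fsck reports a bad magic number and then notes that it found a DOS partition table. You can of course check everything on your disk, but fsck will check and possibly repair each file system separately, possibly in parallel.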
Encountering bad blocks at each run of fsck does indicate a hardware failure. The contents of a bad block are copied to an available good block, and then the block is marked as “bad”, meaning the file system software will no longer use it. So the number of bad blocks on your disk seems to increase. You may want to verify that you have proper backups.
OP has a SSD. SSD possibly needs a firmware update, or a GRUB tweak. Please see “NCQ errors” in my answer.
– heynnema Sep 21, 2021 at 21:11
fsck
Let’s repair your file system (again)…
- boot to a Ubuntu Live DVD/USB in “Try Ubuntu” mode
- open a terminal window by pressing Ctrl+Alt+T
- type sudo fdisk -l
- identify the /dev/sdXX device name for your “Linux Filesystem”
- type sudo fsck -f /dev/sdXX, replacing sdXX with the device name you found earlier
- repeat the fsck command if there were errors
- type reboot
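To decide whether the fsck step needs repeating, it can help to look at the exit status fsck returns, which is a sum of the flag values listed in the fsck(8) man page. Here is a minimal sketch of a decoder; the function name describe_fsck_status is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: decode the exit status of fsck.
# The status is a bitwise sum of the flag values documented in fsck(8).
describe_fsck_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no errors"
    return 0
  fi
  [ $(( status & 1 )) -ne 0 ]   && echo "filesystem errors corrected"
  [ $(( status & 2 )) -ne 0 ]   && echo "system should be rebooted"
  [ $(( status & 4 )) -ne 0 ]   && echo "filesystem errors left uncorrected"
  [ $(( status & 8 )) -ne 0 ]   && echo "operational error"
  [ $(( status & 16 )) -ne 0 ]  && echo "usage or syntax error"
  [ $(( status & 32 )) -ne 0 ]  && echo "checking canceled by user request"
  [ $(( status & 128 )) -ne 0 ] && echo "shared-library error"
  return 0
}

# Usage from the live session (sdXX is the placeholder from the steps above):
#   sudo fsck -f /dev/sdXX; describe_fsck_status $?
```

If the result mentions errors that were corrected or left uncorrected, running fsck again until it reports no errors matches the "repeat" step above.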
Bad blocks and SMART Data
The SMART Data indicates what would normally be a failing HDD. However, we have an SSD that’s not too old. We’ll look at solving NCQ errors first.
Note: Determine the manufacturer and model # of the SSD, and then visit their web site to check for updated firmware.
Note: Maintain good backups, just in case the SSD is failing.
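If you prefer a terminal to the Disks window, the same attributes can be read with smartctl -A, assuming the smartmontools package is installed. The sketch below filters that output down to the error-related attributes with a non-zero raw value. The function name is made up, and the attribute IDs (5, 187, 196, 199) are the common assignments for reallocated sectors, reported uncorrectable errors, reallocation events and UDMA CRC errors; a particular SSD may number or name them differently.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the name
# and raw value of attributes 5, 187, 196 and 199 when the raw value is not 0.
# Assumes the usual 10-column attribute table with RAW_VALUE in column 10.
nonzero_smart_errors() {
  awk '($1 == 5 || $1 == 187 || $1 == 196 || $1 == 199) && ($10 + 0) != 0 { print $2, $10 }'
}

# Usage:
#   sudo smartctl -A /dev/sda | nonzero_smart_errors
```

An empty result would mean none of those counters has moved; any line printed is worth watching over time.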
NCQ errors
grep -i FPDMA /var/log/syslog*
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [ 728.921941] ata3.00: failed command: READ FPDMA QUEUED
/var/log/syslog:Sep 21 13:40:19 adam-gregs-better-computer kernel: [ 729.213899] ata3.00: failed command: READ FPDMA QUEUED
Native Command Queuing (NCQ) is an extension of the Serial ATA protocol allowing hard disk drives to internally optimize the order in which received read and write commands are executed.
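To see whether these errors are getting more or less frequent, the grep above can be summarised per day and per ATA port. This is only a rough sketch: the function name count_fpdma_errors is made up, and it assumes the traditional syslog timestamp format shown in the log excerpt.

```shell
# Hypothetical helper: count "failed command: ... FPDMA QUEUED" kernel
# messages per day and ATA port in the given log files.
# Assumes lines start with "Mon DD HH:MM:SS host kernel: ...".
count_fpdma_errors() {
  grep -ih 'failed command: .* FPDMA QUEUED' "$@" |
    awk '{
      port = "unknown"
      # Find the field that looks like "ata3.00:" and drop the colon.
      for (i = 1; i <= NF; i++) if ($i ~ /^ata[0-9.]+:$/) port = $i
      sub(/:$/, "", port)
      count[$1 " " $2 " " port]++
    }
    END { for (k in count) print count[k], k }' |
    sort -k2
}

# Usage:
#   count_fpdma_errors /var/log/syslog /var/log/syslog.1
```

Each output line is a count followed by the month, day and port, so a day with many failures stands out at a glance.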
Run sudo -H gedit /etc/default/grub and change the following line to include this extra parameter, which disables NCQ. Then do sudo update-grub to write the changes to disk. Reboot. Monitor hangs/etc., and watch grep -i FPDMA /var/log/syslog* or dmesg for continued error messages.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"
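If you would rather not edit the file by hand, the same change can be scripted. The following is a sketch only: add_kernel_param is a made-up helper, it relies on GNU sed's -i option, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes as in the line above.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in a GRUB defaults file, unless the parameter is already on that line.
add_kernel_param() {
  file=$1
  param=$2
  if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
    return 0   # already present, nothing to do
  fi
  # Insert " <param>" just before the closing double quote of the line.
  sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Usage, from a root shell (keep a backup of the original file):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub libata.force=noncq
#   update-grub
# After rebooting, `cat /proc/cmdline` should list libata.force=noncq.
```

Running it a second time leaves the file unchanged, so it is safe to repeat.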
The drive is ADATA SU635. I couldn’t find a firmware update on their website. Also the Amazon page said it was first available in January 2020, so maybe it’s actually newer than I thought (I must have started using it sometime in 2020). In the process of opening the computer to check its model, I also discovered that it was at a slant due to missing some screws that would keep it in its enclosure, which must have made it move when I tilted the computer at some point. I wonder if that was causing the problem? I screwed it in and we’ll see if the issues keep happening.
– user2596667 Sep 23, 2021 at 0:56
@user2596667 Go ahead and do my answer to try and solve the problem.
– heynnema Sep 23, 2021 at 1:51
I’d rather wait to see if screwing in the drive fixed things. So far no NCQ errors have appeared since then. If some do or if it fails again then I’ll try your suggested steps.
– user2596667 Sep 23, 2021 at 13:23
Could you also elaborate on why it’s needed to repair the filesystem again with fsck, since I just did run it and fixed errors? Is it because the -f option is important, or because it’s necessary to keep re-running it until there are no errors? Also what specifically in my screenshot indicates a failing drive, and what is different about an SSD that makes it potentially fixable where a mechanical drive wouldn’t be?
– user2596667 Sep 23, 2021 at 13:23
@user2596667 You need to run fsck again because that’s been the primary fix, and because it’s finding errors. The -f just forces the check to occur, even if the file system is marked clean. If you look at the SMART Data, the Reallocated Sector Count, Reported Uncorrectable Errors, Reallocation Count, UDMA CRC Error Rate, and Read Error Retry Rate are all non-zero values. A SSD failure is an electronic failure; a HDD failure is usually a physical media error.
– heynnema Sep 23, 2021 at 13:33
Source: https://askubuntu.com/questions/1364966/recurring-need-to-run-fsck-because-system-wont-boot