Two separate Hard Drives corrupted in as many days... User error?

archomrade [he/him]@midwest.social · edit-2 1 year ago

Atemu@lemmy.ml · 1 year ago

The CIFS errors and logs inside the VMs are rather uninteresting as they're just passing through the underlying HW's issue.

These logs presented here definitely indicate an issue between CPU and drives. Could also be RAM but I'd check SATA cables and controllers first.

archomrade [he/him]@midwest.social · edit-2 1 year ago

Yup, after scrubbing the log file, the problem port is ONLY ATA port 4.00. No other ports have thrown errors, BUT I just did a block check on all the boot drive partitions, and it looks like they all have bad superblocks… Not sure whether the issue is with that specific SATA port, whether it originates in memory, or whether the bad blocks get propagated to the other drives. Unclear.

Oct 19 09:59:17 pve1 kernel: ata4.00: cmd 35/00:08:00:08:c4/00:00:e8:00:00/e0 tag 15 dma 4096 out
Oct 19 09:59:17 pve1 kernel: ata4.00: status: { DRDY ERR }
Oct 19 09:59:17 pve1 kernel: ata4.00: error: { ABRT }
Oct 19 09:59:17 pve1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 09:59:17 pve1 kernel: ata4.00: irq_stat 0x40000001
Oct 19 09:59:17 pve1 kernel: ata4.00: failed command: WRITE DMA EXT
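
For anyone wanting to repeat the "which port is throwing errors" check, here is a minimal sketch of how the log could be tallied per ATA device. The regex and the function name `count_ata_errors` are my own assumptions based on the line format shown above, not a standard tool, and the sample is just the excerpt from this post.

```python
import re
from collections import Counter

# Matches kernel lines shaped like the excerpt above, e.g.
#   Oct 19 09:59:17 pve1 kernel: ata4.00: failed command: WRITE DMA EXT
# The pattern is an assumption based on that excerpt only.
ATA_LINE = re.compile(r"kernel: (ata\d+(?:\.\d+)?): (.*)$")


def count_ata_errors(lines):
    """Count 'failed command' events per ATA device (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and match.group(2).startswith("failed command"):
            counts[match.group(1)] += 1
    return counts


# The log excerpt from this post, used as sample input.
SAMPLE = [
    "Oct 19 09:59:17 pve1 kernel: ata4.00: cmd 35/00:08:00:08:c4/00:00:e8:00:00/e0 tag 15 dma 4096 out",
    "Oct 19 09:59:17 pve1 kernel: ata4.00: status: { DRDY ERR }",
    "Oct 19 09:59:17 pve1 kernel: ata4.00: error: { ABRT }",
    "Oct 19 09:59:17 pve1 kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    "Oct 19 09:59:17 pve1 kernel: ata4.00: irq_stat 0x40000001",
    "Oct 19 09:59:17 pve1 kernel: ata4.00: failed command: WRITE DMA EXT",
]

if __name__ == "__main__":
    for device, n in sorted(count_ata_errors(SAMPLE).items()):
        print(f"{device}: {n}")
```

To run it against a real log, pass the lines of a saved kernel log (for example the output of `journalctl -k` written to a file) instead of `SAMPLE`. If more than one `ataN` device shows up in the tally, that points away from a single bad port or cable.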

I'll do:

1. a memory test
2. swap the ports of the HDs to specifically avoid port 4.00
3. do a read-write test to make sure the issue doesn't re-appear.

If none of the above solves the mystery, I suppose I can splurge on another junker and see if I have better luck with the next one. I just have to decide whether I wait for ddrescue to finish or just start it now… Probably start it now, on the off-chance I'm just creating more bad blocks on the backup.

Atemu@lemmy.ml · 1 year ago

do a read-write test to make sure the issue doesn’t re-appear

I can recommend https://wiki.archlinux.org/title/Badblocks
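
If you just want a quick sanity check before committing to a full badblocks run, the basic idea of a read-write test can be sketched at the file level: write a known byte pattern, flush it, read it back, and compare. This is only a rough illustration, not how badblocks itself works (it operates on the block device), and the function name and parameters are made up for the example. Note that the read-back may be served from the page cache rather than the disk, so a clean result here proves little; an `OSError` during the write or `fsync`, on the other hand, is a strong hint.

```python
import os
import tempfile


def write_read_verify(path, size_mib=16, pattern=0xAA, chunk=1024 * 1024):
    """Write size_mib MiB of one byte pattern to path, read it back,
    and return the number of 1 MiB chunks that do not match.

    Hypothetical helper for illustration; I/O errors surface as OSError.
    """
    block = bytes([pattern]) * chunk
    with open(path, "wb") as f:
        for _ in range(size_mib):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    mismatches = 0
    with open(path, "rb") as f:
        for _ in range(size_mib):
            if f.read(chunk) != block:
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Demo on a temporary file; point `path` at the suspect drive's
    # mount point to exercise that drive instead.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rw_test.bin")
        print("mismatching chunks:", write_read_verify(path, size_mib=4))
```

Watching the kernel log while this runs is the useful part: if the same `failed command: WRITE DMA EXT` lines show up during the write phase, the problem reproduces under write load.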

archomrade [he/him]@midwest.social · 1 year ago

Fuck me, that test was damning. The read test was fine, but starting a read-write test revealed all the same I/O errors as before, this time on a different port.

mortrek@lemmy.ml · 1 year ago

Have you tried running a different distro live from USB or something like that? Doesn't seem likely that it would help, but who knows…

It's unlikely the kernel or other low-level code is the problem on 10-year-old Intel hardware, though. I've run numerous distros on numerous different machines, many of which were Intel-based, over the last couple of decades, and never had this kind of basic, low-level problem with SATA without it being the cable or controller. Oh, I just remembered: check the PSU as well if you can. A faulty PSU could have a bad rail or wire or something that leads to these problems. If you have a known-good one lying around, depending on the motherboard, you could try temporarily hooking it up to the board and drive and see if it changes anything.

To eliminate Linux as a potential culprit, you could try to install Windows (7, 8, 10, whatever) and see if it exhibits similar problems.

archomrade [he/him]@midwest.social · 1 year ago

Well shit. Looks like the other SATA ports are having the same problem.

Trying to get a hardware probe running, but what are the chances I need to replace the motherboard/the machine? It's looking likely the problem is upstream of the SATA drives themselves; I just don't know if it's worth trying to swap the CPU before ditching the machine entirely. I don't have a CPU lying around to test with. memtester came back clean after 5 passes.

Atemu@lemmy.ml · 1 year ago

I'd honestly just abandon the hardware. It's not worth your time to deal with that.