Mailing List Archive

Hard drive error from SMART
Howdy,

As some know, I recently moved a LOT of data around.  Seems to have
stressed one of my drives.  I got an email from SMART reporting an error. 
Here's the info:


The following warning/error was logged by the smartd daemon:

Device: /dev/sdd [SAT], 1 Currently unreadable (pending) sectors


The following warning/error was logged by the smartd daemon:

Device: /dev/sdd [SAT], 1 Offline uncorrectable sectors


This is from smartctl. 


ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail Always       -       23544426
  3 Spin_Up_Time            0x0003   087   086   000    Pre-fail Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always       -       50
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail Always       -       4
  7 Seek_Error_Rate         0x000f   094   060   045    Pre-fail Always       -       2694155454
  9 Power_On_Hours          0x0032   073   073   000    Old_age  Always       -       24299 (121 195 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always       -       35
184 End-to-End_Error        0x0032   100   100   099    Old_age  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always       -       0
188 Command_Timeout         0x0032   100   086   000    Old_age  Always       -       14 14 14
189 High_Fly_Writes         0x003a   100   100   000    Old_age  Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   059   040    Old_age  Always       -       39 (Min/Max 30/41)
191 G-Sense_Error_Rate      0x0032   092   092   000    Old_age  Always       -       17952
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age  Always       -       498
193 Load_Cycle_Count        0x0032   100   100   000    Old_age  Always       -       1044
194 Temperature_Celsius     0x0022   039   041   000    Old_age  Always       -       39 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   031   001   000    Old_age  Always       -       23544426
197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always       -       0
203 Run_Out_Cancel          0x00b3   100   100   099    Pre-fail Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age  Offline      -       24215h+54m+57.249s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age  Offline      -       18070332014
242 Total_LBAs_Read         0x0000   100   253   000    Old_age  Offline      -       18343277504



In a nutshell, the one to watch is #5 up there.  #198 was an issue until
I ran the long selftest.  It moved to #5, plus it added 3 or 4 more it
seems.  According to google results, it should be fine for now.  Still, a
replacement drive is on the way and I've unmounted the drives for that
LVM.  They're still spinning and running a selftest, but nothing else
should be accessing them.  This is also from the selftest. 


SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%       24299         -
# 2  Short offline       Completed without error       00%       24298         -
# 3  Extended offline    Completed without error       00%       24291         -
# 4  Extended offline    Aborted by host               10%       24266         -
# 5  Short offline       Completed without error       00%       24218         -
# 6  Short offline       Completed without error       00%       24194         -
# 7  Short offline       Completed without error       00%       24171         -
# 8  Short offline       Completed without error       00%       24146         -

The one I aborted had been stuck at 10% for well over a day.  The whole
test doesn't take that long, or shouldn't anyway.  I restarted it shortly
after that.  I might add, the test did take many hours longer than it
estimated, which from my past experience is quite odd.  It's usually
pretty accurate.  Still, it completed and shows it passed, it just has a
boo boo on it.  I also did a file system check; it fixed a couple of
problems and a bunch of little things I see corrected often on bootup. 
Something about the length of something.  Seems trivial. 
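
For anyone wanting to poke at their own drive, the tests and logs above
come from roughly these commands (substitute your own device, of course):

# kick off the self-tests; they run inside the drive in the background
smartctl -t short /dev/sdd
smartctl -t long /dev/sdd

# check on things later
smartctl -l selftest /dev/sdd   # the self-test log shown above
smartctl -A /dev/sdd            # the attribute table shown above
smartctl -H /dev/sdd            # overall health verdict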

Given the low number and it showing it corrected that error, and then
passed a short and long test, is this drive "safe enough" to keep in
service?  I have backups just in case but just curious what others know
from experience.  At least this isn't one of those nasty messages that
the drive will die within 24 hours.  I got one of those ages ago and it
didn't miss it by much.  A little over 30 hours or so later, it was a
door stop.  It would spin but it couldn't even be seen by the BIOS. 
Maybe drives are getting better and SMART is getting better as well. 

Thoughts.  Replace as soon as drive arrives or wait and see?

Dale

:-)  :-)
Re: Hard drive error from SMART [ In reply to ]
On 12/04/2022 02:27, Dale wrote:
> The one I aborted was because it was stuck on 10% for well over a day.
> The whole test doesn't take that long, or shouldn't anyway.  I restarted
> it shortly after that.  I might add, the test did take many hours longer
> than it estimated which from my past experience is quite odd.  It's
> usually pretty accurate.  Still, it completed and shows it passed, just
> has a boo boo on it.  I also did a file system check it fixed a couple
> problems and a bunch of little things I see corrected often on bootup.
> Something about length of something.  Seems trivial.

Given that the firmware SOMETIMES gets its knickers in a twist,
especially on consumer drives (not sure what yours are?), and read errors
are a dime a dozen, I wouldn't worry that much about ONE error.

Do another SMART test after your next reboot. Any NEW errors will be a
red flag, but just this one again? Don't worry.
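
If you want smartd to keep nagging you automatically, a schedule
directive in /etc/smartd.conf does it; something along these lines
(device and times are only an example) runs a short test every night, a
long test on Saturdays, and mails root about anything new:

/dev/sdd -a -d sat -s (S/../.././02|L/../../6/03) -m root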
>
> Given the low number and it showing it corrected that error, and then
> passed a short and long test, is this drive "safe enough" to keep in
> service?  I have backups just in case but just curious what others know
> from experience.  At least this isn't one of those nasty messages that
> the drive will die within 24 hours.  I got one of those ages ago and it
> didn't miss it by much.  A little over 30 hours or so later, it was a
> door stop.  It would spin but it couldn't even be seen by the BIOS.
> Maybe drives are getting better and SMART is getting better as well.

SMART is a lot better than it was, but remember, it only picks up wear
and tear. Mechanical failure is just as deadly, and usually strikes out
of the blue. I saw some stats somewhere that the split is something like
1/3 to 2/3 between wear and tear picked up by SMART and mechanical
failure undetectable by SMART. Can't remember which fraction was which.
>
> Thoughts.  Replace as soon as drive arrives or wait and see?

If you get a couple of errors, then no more for months, the drive is
probably fine. If you get new errors every time you test, ditch it ASAP.

Either way, make sure it's backed up!

Cheers,
Wol
Re: Hard drive error from SMART [ In reply to ]
Wols Lists wrote:
> On 12/04/2022 02:27, Dale wrote:
>> The one I aborted was because it was stuck on 10% for well over a day.
>> The whole test doesn't take that long, or shouldn't anyway.  I restarted
>> it shortly after that.  I might add, the test did take many hours longer
>> than it estimated which from my past experience is quite odd.  It's
>> usually pretty accurate.  Still, it completed and shows it passed, just
>> has a boo boo on it.  I also did a file system check it fixed a couple
>> problems and a bunch of little things I see corrected often on bootup.
>> Something about length of something.  Seems trivial.
>
> Given that the firmware SOMETIMES gets its knickers in a twist,
> especially consumer drives (not sure what yours are?), and read errors
> are a dime a dozen, I wouldn't worry that much about ONE error.
>
> Do another SMART test after your next reboot. Any NEW errors will be a
> red flag, but just this one again? Don't worry.


That seems to be what my google searches revealed.  After all, nothing
is perfect.  I'm sometimes surprised that drives aren't shipped with a
couple of these.  I'll keep my backups up to date as usual tho.  ;-)


>>
>> Given the low number and it showing it corrected that error, and then
>> passed a short and long test, is this drive "safe enough" to keep in
>> service?  I have backups just in case but just curious what others know
>> from experience.  At least this isn't one of those nasty messages that
>> the drive will die within 24 hours.  I got one of those ages ago and it
>> didn't miss it by much.  A little over 30 hours or so later, it was a
>> door stop.  It would spin but it couldn't even be seen by the BIOS.
>> Maybe drives are getting better and SMART is getting better as well.
>
> SMART is a lot better than it was, but remember, it only picks up wear
> and tear. Mechanical failure is just as deadly, and usually strikes
> out of the blue. I saw some stats somewhere it's something like 1/3,
> 2/3 wear and tear picked up by SMART, and mechanical failure
> undetectable by smart. Can't remember which stat was which.

My understanding is that SMART detects media problems and sometimes even
when an electronic component is getting out of spec.  However, it is
unlikely to detect that the spindle motor or the mechanism that moves
the heads is about to go out.  It can detect some things but not
everything.  From my understanding, it is mostly about monitoring the
magnetic media itself.  It is, however, better than nothing at all. 


>>
>> Thoughts.  Replace as soon as drive arrives or wait and see?
>
> If you get a couple of errors, then no more for months, the drive is
> probably fine. If you get new errors every time you test, ditch it ASAP.
>
> Either way, make sure it's backed up!
>
> Cheers,
> Wol
>
>


Sounds like a plan.  The drive should be here Friday.  I'll keep an eye on
it.  It's down to 10% on the long selftest and no errors reported yet.  I'll
keep the drive unmounted until Friday tho, just in case. 

Thanks for the opinions. 

Dale

:-)  :-) 
RE: Hard drive error from SMART [ In reply to ]
> -----Original Message-----
> From: Dale <rdalek1967@gmail.com>
> Sent: Monday, April 11, 2022 6:28 PM
> To: gentoo-user@lists.gentoo.org
> Subject: [gentoo-user] Hard drive error from SMART
>
> Given the low number and it showing it corrected that error, and then passed a short and long test, is this drive "safe enough" to keep in service? I have backups just in case but just curious what others know from experience. At least this isn't one of those nasty messages that the drive will die within 24 hours. I got one of those ages ago and it didn't miss it by much. A little over 30 hours or so later, it was a door stop. It would spin but it couldn't even be seen by the BIOS.
> Maybe drives are getting better and SMART is getting better as well.
>
> Thoughts. Replace as soon as drive arrives or wait and see?
>
> Dale
>
> :-) :-)
>
When it's just one or two errors like that and they don't keep going up I tend to treat it as an isolated incident, but the drive still goes into the pool I use with RAID just in case.

Preferably a setup where you can lose more than one disk without losing the data.

Note that, depending on where the bad sector is, when it gets remapped the extra seek necessary to read that logical address could slow the drive down substantially. Make sure your filesystem's root inode or something doesn't end up on top of it.

Sometimes I miss the old drives where all this was handled by the OS and so you knew exactly what sector was bad and your filesystem could be told to just not use it. Made scanning for bad sectors more annoying, but deciding how bad the drive was rather easier.

LMP
Re: Hard drive error from SMART [ In reply to ]
On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
>
> Thoughts. Replace as soon as drive arrives or wait and see?
>

So, first of all just about all my hard drives are in a RAID at this
point, so I have a higher tolerance for issues.

If a drive is under warranty I'll usually try to see if they will RMA
it. More often than not they will, and in that case there is really
no reason not to. I'll do advance shipping and replace the drive
before sending the old one back so that I mostly have redundancy the
whole time.

If it isn't under warranty then I'll scrub it and see what happens.
I'll of course do SMART self-tests, but usually an error like this
won't actually clear until you overwrite the offline sector so that
the drive can reallocate it. A RAID scrub/resilver/etc will overwrite
the sector with the correct contents which will allow this to happen.
(Otherwise there is no way for the drive to recover - if it knew what
was stored there it wouldn't have an error in the first place.)
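
For reference, kicking off that kind of scrub is a one-liner on the
common setups (array and pool names here are just examples):

# md raid: rewrite anything that fails to read or mismatches
echo repair > /sys/block/md0/md/sync_action
cat /proc/mdstat          # watch progress

# zfs / btrfs equivalents
zpool scrub tank
btrfs scrub start /mnt/data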

If an error comes back then I'll replace the drive. My drives are
pretty large at this point so I don't like keeping unreliable drives
around. It just increases the risk of double failures, given that a
large hard drive can take more than a day to replace. Write speeds
just don't keep pace with capacities. I do have offline backups but I
shudder at the thought of how long one of those would take to restore.

--
Rich
Re: Hard drive error from SMART [ In reply to ]
Rich Freeman wrote:
> On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
>> Thoughts. Replace as soon as drive arrives or wait and see?
>>
> So, first of all just about all my hard drives are in a RAID at this
> point, so I have a higher tolerance for issues.
>
> If a drive is under warranty I'll usually try to see if they will RMA
> it. More often than not they will, and in that case there is really
> no reason not to. I'll do advance shipping and replace the drive
> before sending the old one back so that I mostly have redundancy the
> whole time.
>
> If it isn't under warranty then I'll scrub it and see what happens.
> I'll of course do SMART self-tests, but usually an error like this
> won't actually clear until you overwrite the offline sector so that
> the drive can reallocate it. A RAID scrub/resilver/etc will overwrite
> the sector with the correct contents which will allow this to happen.
> (Otherwise there is no way for the drive to recover - if it knew what
> was stored there it wouldn't have an error in the first place.)
>
> If an error comes back then I'll replace the drive. My drives are
> pretty large at this point so I don't like keeping unreliable drives
> around. It just increases the risk of double failures, given that a
> large hard drive can take more than a day to replace. Write speeds
> just don't keep pace with capacities. I do have offline backups but I
> shudder at the thought of how long one of those would take to restore.
>


Sadly, I don't have RAID here but to be honest, I really need to have it
given the data and my recent luck with hard drives.  Drives used to get
dumped because they were just too small to use anymore.  Nowadays, they
seem to break in some fashion long before they outlive their
usefulness. 

I remounted the drives and did a backup.  For anyone running into this,
just in case one of the files got corrupted, I used a little trick to
see if I could figure out which one, if any, may be bad.  I took my rsync
commands from my little script and ran them one at a time with --dry-run
added.  If a file was to be updated on the backup that I hadn't changed
or added, I was going to check into it before updating my backups.  It
could be that the backup file was still good and the file on the drive
reporting problems was bad.  In that case, I would determine which was
good and either restore it from backups or allow it to be updated if
needed.  Either way, I should have a good file since the drive claims to
have fixed the problem.  Now let us pray.  :-D 
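
Roughly, each run looked something like this (paths here are placeholders,
not my real script):

# one rsync line from the script, with a dry run and itemized output first
rsync -av --dry-run --itemize-changes /home/dale/Documents/ /mnt/backup/Documents/

# anything listed that I know I didn't touch gets checked by hand
# before repeating the command without --dry-run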

The drive isn't under warranty.  I may have to start buying new drives
from dealers.  Sometimes I find drives that are pulled from systems and
have very few hours on them.  Still, the warranty may not last long. 
Saves a lot of money tho. 

USPS claims the drive is on the way.  It left a distribution point and
should update again when it gets close.  First it said Saturday, then
Friday.  I think Friday is about right but if the wind blows right,
maybe Thursday. 

I hope I have another port and a power cable plug for the swap out.  At
least now, I can unmount it and swap without a lot of rebooting.  Since
it's on LVM, that part is easy.  Regretfully, I have experience with that
process.  :/

Thanks to all. 

Dale

:-)  :-) 
RE: Hard drive error from SMART [ In reply to ]
> -----Original Message-----
> From: Dale <rdalek1967@gmail.com>
> Sent: Tuesday, April 12, 2022 10:08 AM
> To: gentoo-user@lists.gentoo.org
> Subject: Re: [gentoo-user] Hard drive error from SMART
>
> Rich Freeman wrote:
> > On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
> >> Thoughts. Replace as soon as drive arrives or wait and see?
> >>
> > So, first of all just about all my hard drives are in a RAID at this
> > point, so I have a higher tolerance for issues.
> >
> > If a drive is under warranty I'll usually try to see if they will RMA
> > it. More often than not they will, and in that case there is really
> > no reason not to. I'll do advance shipping and replace the drive
> > before sending the old one back so that I mostly have redundancy the
> > whole time.
> >
> > If it isn't under warranty then I'll scrub it and see what happens.
> > I'll of course do SMART self-tests, but usually an error like this
> > won't actually clear until you overwrite the offline sector so that
> > the drive can reallocate it. A RAID scrub/resilver/etc will overwrite
> > the sector with the correct contents which will allow this to happen.
> > (Otherwise there is no way for the drive to recover - if it knew what
> > was stored there it wouldn't have an error in the first place.)
> >
> > If an error comes back then I'll replace the drive. My drives are
> > pretty large at this point so I don't like keeping unreliable drives
> > around. It just increases the risk of double failures, given that a
> > large hard drive can take more than a day to replace. Write speeds
> > just don't keep pace with capacities. I do have offline backups but I
> > shudder at the thought of how long one of those would take to restore.
> >
>
>
> Sadly, I don't have RAID here but to be honest, I really need to have it given the data and my recent luck with hard drives. Drives used to get dumped because they were just to small to use anymore. Nowadays, they seem to break in some fashion long before their usefulness ends their lives.
>
> I remounted the drives and did a backup. For anyone running up on this, just in case one of the files got corrupted, I used a little trick to see if I can figure out which one may be bad if any. I took my rsync commands from my little script and ran them one at a time with --dry-run added. If a file was to be updated on the backup that I hadn't changed or added, I was going to check into it before updating my backups. It could be that the backup file was still good and the file on my drive reporting problems was bad. In that case, I would determine which was good and either restore it from backups or allow it to be updated if needed. Either way, I should have a good file since the drive claims to have fixed the problem. Now let us pray. :-D
>
> Drive isn't under warranty. I may have to start buying new drives from dealers. Sometimes I find drives that are pulled from systems and have very few hours on them. Still, warranty may not last long. Saves a lot of money tho.
>
> USPS claims drive is on the way. Left a distribution point and should update again when it gets close. First said Saturday, then said Friday. I think Friday is about right but if the wind blows right, maybe Thursday.
>
> I hope I have another port and power cable plug for the swap out. At least now, I can unmount it and swap without a lot of rebooting. Since it's on LVM, that part is easy. Regretfully I have experience on that process. :/
>
> Thanks to all.
>
> Dale
>
> :-) :-)
>
>
You can get SATA PCIe cards with up to 16 ports these days for pretty cheap. So as long as you have the power to run another drive or two there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port expanders, which are also pretty cheap.

One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID, but have excess space you can tell it to keep two copies of everything. Doesn't help with total drive failure, but does protect against the occasional failed sector. If you don't mind writes taking twice as long anyway.
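
For what it's worth, that "two copies of everything" mode is just a profile choice; a sketch with made-up device and mount point:

# new single-drive filesystem that duplicates both data and metadata
mkfs.btrfs -d dup -m dup /dev/sdb1

# or convert an existing filesystem in place
btrfs balance start -dconvert=dup -mconvert=dup /mnt/data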

LMP
Re: Hard drive error from SMART [ In reply to ]
Am Tue, Apr 12, 2022 at 12:08:24PM -0500 schrieb Dale:
> Rich Freeman wrote:
> > On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
> >> Thoughts. Replace as soon as drive arrives or wait and see?
> >>
> > So, first of all just about all my hard drives are in a RAID at this
> > point, so I have a higher tolerance for issues.

> Sadly, I don't have RAID here but to be honest, I really need to have it
> given the data and my recent luck with hard drives. 

Plus, if you do a Raid 5 or Raid-Z1, you use your capacity more efficiently
with just three drives. However, when I was building my NAS 5½ years ago,
there was already an article about Raid-5 becoming obsolete due to ever-rising
drive capacity. Because if you have a failed drive and need to replace and
rebuild, the chance that another drive fails during the rebuild rises with
the drive capacity.

> Drives used to get dumped because they were just to small to use anymore. 
> Nowadays, they seem to break in some fashion long before their usefulness
> ends their lives. 

I recently bought a passive mini-pc (zotac zbox) and just for the fun of it
installed a 160 GB HDD that maxes out at around 40 MiB/s. You do NOT want to
run a modern Linux desktop on such a drive. :D

> I remounted the drives and did a backup.  For anyone running up on this,
> just in case one of the files got corrupted, I used a little trick to
> see if I can figure out which one may be bad if any.  I took my rsync
> commands from my little script and ran them one at a time with --dry-run
> added.

I actually developed a tool for that. It creates and checks md5 checksums
recursively and *per directory*. Whenever I copy stuff from somewhere, like
a music album, I do an immediate md5 run on that directory. And when I later
copy that stuff around, I simply run the tool again on the copy (after the
FS cache was flushed, for example by unmounting and remounting) to see
whether the checksums are still valid.
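
Stripped down to the bare idea, it is roughly this in plain shell (the real
tool linked below does more, and the checksum file name is just an example):

# create one Checksums.md5 per directory, covering only that directory's files
find . -type d | while read -r dir; do
    (cd "$dir" && find . -maxdepth 1 -type f ! -name Checksums.md5 -print0 \
        | xargs -0 -r md5sum > Checksums.md5)
done

# later, verify a copied directory on its own
(cd /path/to/copy && md5sum -c Checksums.md5)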

You can find it on github: https://github.com/felf/dh
It’s a single-file python application, because I couldn’t be bothered with
the myriad ways of creating a python package. ;-)

--
Grüße | Greetings | Salut | Qapla’
Please do not share anything from, with or about me on any social network.

A horse comes into a bar.
Barkeep: “Hey!”
Horse: “Sure.”
RE: Hard drive error from SMART [ In reply to ]
> -----Original Message-----
> From: Frank Steinmetzger <Warp_7@gmx.de>
> Sent: Tuesday, April 12, 2022 10:39 AM
> To: gentoo-user@lists.gentoo.org
> Subject: Re: [gentoo-user] Hard drive error from SMART
>
>
> I actually developed a tool for that. It creates and checks md5 checksums recursively and *per directory*. Whenever I copy stuff from somewhere, like a music album, I do an immediate md5 run on that directory. And when I later copy that stuff around, I simply run the tool again on the copy (after the FS cache was flushed, for example by unmounting and remounting) to see whether the checksums are still valid.
>
> You can find it on github: https://github.com/felf/dh It’s a single-file python application, because I couldn’t be bothered with the myriad ways of creating a python package. ;-)
>
> --
> Grüße | Greetings | Salut | Qapla’
> Please do not share anything from, with or about me on any social network.
>
> A horse comes into a bar.
> Barkeep: “Hey!”
> Horse: “Sure.”
>
There's also app-crypt/md5deep

Does a number of hashes, is threaded, has options for piecewise hashing and a matching mode for using the hashes to find duplicates. Also a number of input and output filters for those cases where you don't want to hash everything.

Also can output a number of formats, but reformatting is generally trivial.
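
A typical round trip with it looks roughly like this (file names are just examples):

# record hashes of everything below the current directory, with relative paths
md5deep -r -l . > hashes.md5

# later: print any file whose current hash is not in the stored set
# (i.e. changed or new files)
md5deep -r -l -x hashes.md5 .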

LMP
Re: Hard drive error from SMART [ In reply to ]
Laurence Perkins wrote:
>> -----Original Message-----
>> From: Dale <rdalek1967@gmail.com>
>> Sent: Tuesday, April 12, 2022 10:08 AM
>> To: gentoo-user@lists.gentoo.org
>> Subject: Re: [gentoo-user] Hard drive error from SMART
>>
>> Rich Freeman wrote:
>>> On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
>>>> Thoughts. Replace as soon as drive arrives or wait and see?
>>>>
>>> So, first of all just about all my hard drives are in a RAID at this
>>> point, so I have a higher tolerance for issues.
>>>
>>> If a drive is under warranty I'll usually try to see if they will RMA
>>> it. More often than not they will, and in that case there is really
>>> no reason not to. I'll do advance shipping and replace the drive
>>> before sending the old one back so that I mostly have redundancy the
>>> whole time.
>>>
>>> If it isn't under warranty then I'll scrub it and see what happens.
>>> I'll of course do SMART self-tests, but usually an error like this
>>> won't actually clear until you overwrite the offline sector so that
>>> the drive can reallocate it. A RAID scrub/resilver/etc will overwrite
>>> the sector with the correct contents which will allow this to happen.
>>> (Otherwise there is no way for the drive to recover - if it knew what
>>> was stored there it wouldn't have an error in the first place.)
>>>
>>> If an error comes back then I'll replace the drive. My drives are
>>> pretty large at this point so I don't like keeping unreliable drives
>>> around. It just increases the risk of double failures, given that a
>>> large hard drive can take more than a day to replace. Write speeds
>>> just don't keep pace with capacities. I do have offline backups but I
>>> shudder at the thought of how long one of those would take to restore.
>>>
>>
>> Sadly, I don't have RAID here but to be honest, I really need to have it given the data and my recent luck with hard drives. Drives used to get dumped because they were just to small to use anymore. Nowadays, they seem to break in some fashion long before their usefulness ends their lives.
>>
>> I remounted the drives and did a backup. For anyone running up on this, just in case one of the files got corrupted, I used a little trick to see if I can figure out which one may be bad if any. I took my rsync commands from my little script and ran them one at a time with --dry-run added. If a file was to be updated on the backup that I hadn't changed or added, I was going to check into it before updating my backups. It could be that the backup file was still good and the file on my drive reporting problems was bad. In that case, I would determine which was good and either restore it from backups or allow it to be updated if needed. Either way, I should have a good file since the drive claims to have fixed the problem. Now let us pray. :-D
>>
>> Drive isn't under warranty. I may have to start buying new drives from dealers. Sometimes I find drives that are pulled from systems and have very few hours on them. Still, warranty may not last long. Saves a lot of money tho.
>>
>> USPS claims drive is on the way. Left a distribution point and should update again when it gets close. First said Saturday, then said Friday. I think Friday is about right but if the wind blows right, maybe Thursday.
>>
>> I hope I have another port and power cable plug for the swap out. At least now, I can unmount it and swap without a lot of rebooting. Since it's on LVM, that part is easy. Regretfully I have experience on that process. :/
>>
>> Thanks to all.
>>
>> Dale
>>
>> :-) :-)
>>
>>
> You can get up to 16X SATA PCI-e cards these days for pretty cheap. So as long as you have the power to run another drive or two there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port expanders, which are also pretty cheap.
>
> One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID, but have excess space you can tell it to keep two copies of everything. Doesn't help with total drive failure, but does protect against the occasional failed sector. If you don't mind writes taking twice as long anyway.
>
> LMP


I looked into a card a good while back and they were pretty pricey at
the time.  Do you happen to have some search terms I can use on ebay,
Amazon, etc.?  I know some chipsets work better on Linux out of the
box.  I don't need to buy one that doesn't work or only works with the
threat of a sledge hammer.  lol  I've also looked into that other thing,
SAS? or something.  It's been a while tho. 

I'm pretty good at doing backups.  I do Gentoo updates on Saturday, and
sometimes Sunday.  While the updates are downloading, I update my
backups.  It's almost like a religion for me.  I was just more cautious
earlier.  I suspect a file could be corrupted somewhere but wanted to be
sure it wasn't something important.  I have some files that, if lost, I
may not be able to download again.  They just don't exist anymore.  A few
I got from some Govt archive that are really old but have since been
removed, or at least I can't find them anymore. 

I've given serious thought to switching to BTRFS.  Thing is, I'm still
trying to get LVM figured out.  Plus, LVM is well maintained and should
be for a good long while, plus it works for me.  Still, if I could
afford to have several new drives all at once, I'd certainly play with
it.  It could very well be better.  The one thing I wish is that LVM had
a GUI where you could do everything from it.  During my recent
rearrangement of drives, I learned that you can't do a lot of things
within webmin.  It does some things but not everything.  Plus, you have
to have a running GUI to use it.  In that case, I had to unmount /home
which meant no KDE, so no Webmin either.  Still, that could cause trouble
too.  I dunno. 

Thanks.

Dale

:-)  :-)
Re: Hard drive error from SMART [ In reply to ]
Am Tue, Apr 12, 2022 at 06:09:13PM +0000 schrieb Laurence Perkins:

> > I actually developed a tool for that. It creates and checks md5
> > checksums recursively and *per directory*. Whenever I copy stuff from
> > somewhere, like a music album, I do an immediate md5 run on that
> > directory. And when I later copy that stuff around, I simply run the
> > tool again on the copy (after the FS cache was flushed, for example by
> > unmounting and remounting) to see whether the checksums are still valid.
> >
> There's also app-crypt/md5deep
>
> Does a number of hashes, is threaded, has options for piecewise hashing and a matching mode for using the hashes to find duplicates. Also a number of input and output filters for those cases where you don't want to hash everything.

I knew about md5deep when I started with my own tool (as can be read in the
readme ;-) ). But md5deep used one single md5 file at a tree’s root, whereas
I wanted one file per directory in a tree. The reason being that I wanted to
be able to copy individual directories and still check their hashes without
editing checksum files.

--
Grüße | Greetings | Salut | Qapla’
Please do not share anything from, with or about me on any social network.

If you were born feet-first, then, for a short moment,
you wore your mother as a hat.
Re: Hard drive error from SMART [ In reply to ]
On 12/04/2022 18:21, Laurence Perkins wrote:
> You can get up to 16X SATA PCI-e cards these days for pretty cheap. So as long as you have the power to run another drive or two there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port expanders, which are also pretty cheap.
>
> One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID, but have excess space you can tell it to keep two copies of everything. Doesn't help with total drive failure, but does protect against the occasional failed sector. If you don't mind writes taking twice as long anyway.

https://raid.wiki.kernel.org/index.php/Linux_Raid

https://raid.wiki.kernel.org/index.php/System2020

That system in the second link is the system being used to type this
message ...

Cheers,
Wol
Re: Hard drive error from SMART [ In reply to ]
On Tue, Apr 12, 2022 at 1:08 PM Dale <rdalek1967@gmail.com> wrote:
>
> I remounted the drives and did a backup. For anyone running up on this,
> just in case one of the files got corrupted, I used a little trick to
> see if I can figure out which one may be bad if any. I took my rsync
> commands from my little script and ran them one at a time with --dry-run
> added. If a file was to be updated on the backup that I hadn't changed
> or added, I was going to check into it before updating my backups.

Unless you're using the --checksum option on rsync this isn't likely
to be effective. By default rsync only looks at size and mtime, so it
isn't going to back up a file unless you intentionally changed it. If
data was silently corrupted this wouldn't detect a change at all
without the --checksum option.
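
Concretely the difference is one flag; something like this (paths are only placeholders):

# default: compares size and mtime only, so silent corruption goes unnoticed
rsync -avn /data/ /backup/data/

# reads and checksums every file on both sides - much slower, but catches bit rot
rsync -avn --checksum /data/ /backup/data/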

Ultimately if you care about silent corruptions you're best off using
a solution that actually achieves this. btrfs, zfs, or something
whipped up with dm-integrity would be best. At a file level you could
store multiple files and hashes, or use a solution like PAR2. Plain
mdadm raid1 will fix issues if the drive detects and reports errors
(the drive typically has a checksum to do this, but it is a black box
and may not always work). The other solutions will reliably detect
and possibly recover errors even if the drive fails to detect them (a
so-called silent error).
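
As a rough sketch of the PAR2 route at the file level (file name made up):
create recovery blocks next to a file, then verify or repair against them later.

# create ~10% recovery data alongside the file
par2 create -r10 important.mkv.par2 important.mkv

# later
par2 verify important.mkv.par2
par2 repair important.mkv.par2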

Just about all my linux data these days is on a solution that detects
silent errors - zfs or lizardfs. On ssd-based systems where I don't
want to invest in mirroring I still run zfs to detect errors and just
use frequent backups (ssds are small anyway so they're cheap to
frequently back up, especially if they're on zfs where there are
send-based backup scripts for this, and typically this is for OS
drives where things don't change much anyway).

--
Rich
RE: Hard drive error from SMART [ In reply to ]
>-----Original Message-----
>From: Dale <rdalek1967@gmail.com>
>Sent: Tuesday, April 12, 2022 11:22 AM
>To: gentoo-user@lists.gentoo.org
>Subject: Re: [gentoo-user] Hard drive error from SMART
>
>Laurence Perkins wrote:
>>> -----Original Message-----
>>> From: Dale <rdalek1967@gmail.com>
>>> Sent: Tuesday, April 12, 2022 10:08 AM
>>> To: gentoo-user@lists.gentoo.org
>>> Subject: Re: [gentoo-user] Hard drive error from SMART
>>>
>>> Rich Freeman wrote:
>>>> On Mon, Apr 11, 2022 at 9:27 PM Dale <rdalek1967@gmail.com> wrote:
>>>>> Thoughts. Replace as soon as drive arrives or wait and see?
>>>>>
>>>> So, first of all just about all my hard drives are in a RAID at this
>>>> point, so I have a higher tolerance for issues.
>>>>
>>>> If a drive is under warranty I'll usually try to see if they will
>>>> RMA it. More often than not they will, and in that case there is
>>>> really no reason not to. I'll do advance shipping and replace the
>>>> drive before sending the old one back so that I mostly have
>>>> redundancy the whole time.
>>>>
>>>> If it isn't under warranty then I'll scrub it and see what happens.
>>>> I'll of course do SMART self-tests, but usually an error like this
>>>> won't actually clear until you overwrite the offline sector so that
>>>> the drive can reallocate it. A RAID scrub/resilver/etc will
>>>> overwrite the sector with the correct contents which will allow this to happen.
>>>> (Otherwise there is no way for the drive to recover - if it knew
>>>> what was stored there it wouldn't have an error in the first place.)
>>>>
>>>> If an error comes back then I'll replace the drive. My drives are
>>>> pretty large at this point so I don't like keeping unreliable drives
>>>> around. It just increases the risk of double failures, given that a
>>>> large hard drive can take more than a day to replace. Write speeds
>>>> just don't keep pace with capacities. I do have offline backups but
>>>> I shudder at the thought of how long one of those would take to restore.
>>>>
>>>
>>> Sadly, I don't have RAID here but to be honest, I really need to have it given the data and my recent luck with hard drives. Drives used to get dumped because they were just to small to use anymore. Nowadays, they seem to break in some fashion long before their usefulness ends their lives.
>>>
>>> I remounted the drives and did a backup. For anyone running up on
>>> this, just in case one of the files got corrupted, I used a little
>>> trick to see if I can figure out which one may be bad if any. I took
>>> my rsync commands from my little script and ran them one at a time
>>> with --dry-run added. If a file was to be updated on the backup that
>>> I hadn't changed or added, I was going to check into it before
>>> updating my backups. It could be that the backup file was still good
>>> and the file on my drive reporting problems was bad. In that case, I
>>> would determine which was good and either restore it from backups or
>>> allow it to be updated if needed. Either way, I should have a good
>>> file since the drive claims to have fixed the problem. Now let us
>>> pray. :-D
>>>
>>> Drive isn't under warranty. I may have to start buying new drives from dealers. Sometimes I find drives that are pulled from systems and have very few hours on them. Still, warranty may not last long. Saves a lot of money tho.
>>>
>>> USPS claims drive is on the way. Left a distribution point and should update again when it gets close. First said Saturday, then said Friday. I think Friday is about right but if the wind blows right, maybe Thursday.
>>>
>>> I hope I have another port and power cable plug for the swap out. At
>>> least now, I can unmount it and swap without a lot of rebooting.
>>> Since it's on LVM, that part is easy. Regretfully I have experience
>>> on that process. :/
>>>
>>> Thanks to all.
>>>
>>> Dale
>>>
>>> :-) :-)
>>>
>>>
>> You can get up to 16X SATA PCI-e cards these days for pretty cheap. So as long as you have the power to run another drive or two there's not much reason not to do RAID on the important stuff. Also, the SATA protocol allows for port expanders, which are also pretty cheap.
>>
>> One of my favorite things about BTRFS is the data checksums. If the drive returns garbage, it turns into a read error. Also, if you can't do real RAID, but have excess space you can tell it to keep two copies of everything. Doesn't help with total drive failure, but does protect against the occasional failed sector. If you don't mind writes taking twice as long anyway.
>>
>> LMP
>
>
>I looked into a card a good while back and they were pretty pricey at the time. You happen to have some search terms I can search for on ebay, Amazon etc? I know some chipsets work better on Linux out of the box. I don't need to buy one that doesn't work or only works with the threat of a sledge hammer. lol I've also looked into that other thing, SAS? or something. It's been a while tho.
>
>I'm pretty good at doing backups. I do Gentoo updates on Saturday, and sometimes Sunday. While the updates are downloading, I update my backups. It's almost like a religion for me. I was just more cautious earlier. I suspect a file could be corrupted somewhere but wanted to be sure it wasn't something important. I have some files that if lost, I may not can download again. They don't exist. A few I got from some Govt archive that are really old but since removed, or at least I can't find them anymore.
>
>I've given serious thought to switching to BTRFS. Thing is, I'm still trying to get LVM figured out. Plus, LVM is well maintained and should be for a good long while, plus it works for me. Still, if I could afford to have several new drives all at once, I'd certainly play with it. It could very well be better. The one thing I wish, LVM had a GUI where you could do everything from it. During my recent rearrangement of drives, I learned that you can't do a lot of things within webmin. It does some things but not everything. Plus, you have to have a running GUI to use it. In that case, I had to unmount /home which meant no KDE, so no Webmin either. Still, that could cause trouble too. I dunno.
>
>Thanks.
>
>Dale
>
>:-) :-)
>
>

I went with a couple of https://www.amazon.com/MZHOU-Profile-Bracket-Support-Converter/dp/B08L7W8QFT/ in a couple different sizes for two of my mass storage systems and they seem to be doing OK.

The difference between the cheap vendors and the expensive vendors these days tends to be quality control. So plug it in, load it up, run it hard for a few hours. If it doesn't die relatively quickly you're usually good.

Especially if you have RAID with checksums it's difficult for a controller to mangle things too badly even if it does have an issue.

Remember: Data does not exist if it doesn't exist in at least three places. So you still want off-site backups in case your house burns down. Especially for irreplaceable things.

If you have friends who also want off-site backups and you leave your machines running all the time then tahoe-lafs is pretty decent. For that matter they don't even have to really be friends, you really only have to be able to trust them to not selfishly hog all the space.

I use BTRFS RAID1 for a lot of stuff. So far it's been pretty good at catching dropped bits and recovering from failures. It has a bit of the RAID issue where a drive could fail while you're doing a recovery since it only guarantees integrity with one dud drive regardless of the number of drives in the pool. But since each chunk is only written to two drives instead of spread across all of them the rebuild time stays relatively short and even if another drive does fail you'll only lose some of the data instead of all of it. This also means that the wasted space when your drives aren't all the same size is kept to a minimum.

ZFS and similar are arguably better for larger arrays, but are also more hassle to set up.

LVM is good for being able to swap out drives easily but with the modern, huge drives you really want data checksums if you can get them. Otherwise all it takes is a flipped bit somewhere to wreck your data and drive firmware doesn't always notice. I think you can do that with LVM, but I've never looked into it for certain.

LMP
Re: Hard drive error from SMART [ In reply to ]
On 12/04/2022 20:41, Laurence Perkins wrote:
> LVM is good for being able to swap out drives easily but with the modern, huge drives you really want data checksums if you can get them. Otherwise all it takes is a flipped bit somewhere to wreck your data and drive firmware doesn't always notice. I think you can do that with LVM, but I've never looked into it for certain.

Look at that link for my system that I posted. I use dm-integrity, so a
flipped bit will trigger a failure at the raid-5 level and recover.
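
The shape of that stack is roughly the following (device names made up; the
wiki pages linked earlier have the real details):

# put a dm-integrity layer on each member...
integritysetup format /dev/sdb1
integritysetup open   /dev/sdb1 int-sdb1
# ...repeat for the other members...

# ...then build the array on the integrity devices instead of the raw partitions
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/mapper/int-sdb1 /dev/mapper/int-sdc1 /dev/mapper/int-sdd1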

For those people looking at btrfs - note that parity-raid (5 or 6) is
not a wise idea at the moment so you don't get two-failure protection ...

Cheers,
Wol
Re: Hard drive error from SMART [ In reply to ]
Laurence Perkins wrote:
> I went with a couple of https://www.amazon.com/MZHOU-Profile-Bracket-Support-Converter/dp/B08L7W8QFT/ in a couple different sizes for two of my mass storage systems and they seem to be doing OK.
>
> The difference between the cheap vendors and the expensive vendors these days tends to be quality control. So plug it in, load it up, run it hard for a few hours. If it doesn't die relatively quickly you're usually good.
>
> Especially if you have RAID with checksums it's difficult for a controller to mangle things too badly even if it does have an issue.
>
> Remember: Data does not exist if it doesn't exist in at least three places. So you still want off-site backups in case your house burns down. Especially for irreplaceable things.
>
> If you have friends who also want off-site backups and you leave your machines running all the time then tahoe-lafs is pretty decent. For that matter they don't even have to really be friends, you really only have to be able to trust them to not selfishly hog all the space.
>
> I use BTRFS RAID1 for a lot of stuff. So far it's been pretty good at catching dropped bits and recovering from failures. It has a bit of the RAID issue where a drive could fail while you're doing a recovery since it only guarantees integrity with one dud drive regardless of the number of drives in the pool. But since each chunk is only written to two drives instead of spread across all of them the rebuild time stays relatively short and even if another drive does fail you'll only lose some of the data instead of all of it. This also means that the wasted space when your drives aren't all the same size is kept to a minimum.
>
> ZFS and similar are arguably better for larger arrays, but are also more hassle to set up.
>
> LVM is good for being able to swap out drives easily but with the modern, huge drives you really want data checksums if you can get them. Otherwise all it takes is a flipped bit somewhere to wreck your data and drive firmware doesn't always notice. I think you can do that with LVM, but I've never looked into it for certain.
>
> LMP

I looked at that card and read some of the reviews.  Some claim they had
issues but I suspect a driver problem.  Can you do an lspci -k and see
what driver it uses for that card on your system?  If yours works fine,
I'd want to use the same driver. 

That is a lot of drives tho.  I need to build a NAS thingy.  lol

Dale

:-)  :-) 
Re: Hard drive error from SMART [ In reply to ]
Wols Lists wrote:
> On 12/04/2022 18:21, Laurence Perkins wrote:
>> You can get up to 16X SATA PCI-e cards these days for pretty cheap. 
>> So as long as you have the power to run another drive or two there's
>> not much reason not to do RAID on the important stuff.  Also, the
>> SATA protocol allows for port expanders, which are also pretty cheap.
>>
>> One of my favorite things about BTRFS is the data checksums.  If the
>> drive returns garbage, it turns into a read error.  Also, if you
>> can't do real RAID, but have excess space you can tell it to keep two
>> copies of everything.  Doesn't help with total drive failure, but
>> does protect against the occasional failed sector.  If you don't mind
>> writes taking twice as long anyway.
>
> https://raid.wiki.kernel.org/index.php/Linux_Raid
>
> https://raid.wiki.kernel.org/index.php/System2020
>
> That system in the second link is the system being used to type this
> message ...
>
> Cheers,
> Wol
>
>


Neat setup.  I need something similar for a NAS setup thingy.  Just got
way too much going on right now. 

Dale

:-)  :-) 
Re: Hard drive error from SMART [ In reply to ]
Rich Freeman wrote:
> On Tue, Apr 12, 2022 at 1:08 PM Dale <rdalek1967@gmail.com> wrote:
>> I remounted the drives and did a backup. For anyone running up on this,
>> just in case one of the files got corrupted, I used a little trick to
>> see if I can figure out which one may be bad if any. I took my rsync
>> commands from my little script and ran them one at a time with --dry-run
>> added. If a file was to be updated on the backup that I hadn't changed
>> or added, I was going to check into it before updating my backups.
> Unless you're using the --checksum option on rsync this isn't likely
> to be effective. By default rsync only looks at size and mtime, so it
> isn't going to back up a file unless you intentionally changed it. If
> data was silently corrupted this wouldn't detect a change at all
> without the --checksum option.
>
> Ultimately if you care about silent corruptions you're best off using
> a solution that actually achieves this. btrfs, zfs, or something
> whipped up with dm-integrity would be best. At a file level you could
> store multiple files and hashes, or use a solution like PAR2. Plain
> mdadm raid1 will fix issues if the drive detects and reports errors
> (the drive typically has a checksum to do this, but it is a black box
> and may not always work). The other solutions will reliably detect
> and possibly recover errors even if the drive fails to detect them (a
> so-called silent error).
>
> Just about all my linux data these days is on a solution that detects
> silent errors - zfs or lizardfs. On ssd-based systems where I don't
> want to invest in mirroring I still run zfs to detect errors and just
> use frequent backups (ssds are small anyway so they're cheap to
> frequently back up, especially if they're on zfs where there are
> send-based backup scripts for this, and typically this is for OS
> drives where things don't change much anyway).
>


My hope was that if it was corrupted and something changed, then I'd see
it in the list.  If nothing changed, then rsync wouldn't change anything
on the backups either.  I'll look into that option tho.  May be something
for the future.  ;-)  I suspect it would slow things down quite a bit tho. 

Dale

:-)  :-)
Re: Hard drive error from SMART [ In reply to ]
On Tue, Apr 12, 2022 at 3:01 PM Dale <rdalek1967@gmail.com> wrote:
<SNIP>
> Neat setup. I need something similar for a NAS setup thingy. Just got
> way to much going on right now.
>
> Dale
>
> :-) :-)
>

LOL. Watching this thread made me start a round of backups to my NAS
thingy Dale. ;-)

Mark
RE: Hard drive error from SMART [ In reply to ]
>-----Original Message-----
>From: Wol <antlists@youngman.org.uk>
>Sent: Tuesday, April 12, 2022 2:51 PM
>To: gentoo-user@lists.gentoo.org
>Subject: Re: [gentoo-user] Hard drive error from SMART
>
>On 12/04/2022 20:41, Laurence Perkins wrote:
>> LVM is good for being able to swap out drives easily but with the modern, huge drives you really want data checksums if you can get them. Otherwise all it takes is a flipped bit somewhere to wreck your data and drive firmware doesn't always notice. I think you can do that with LVM, but I've never looked into it for certain.
>
>Look at that link for my system that I posted. I use dm-integrity, so a flipped bit will trigger a failure at the raid-5 level and recover.
>
>For those people looking at btrfs - note that parity-raid (5 or 6) is not a wise idea at the moment so you don't get two-failure protection ...

Specifically if the system crashes or has a power failure there may be some data left hanging until it can complete a scrub. Disk failures during that period may lose some of said data.

How much of a risk that is depends on the stability of your power and kernel and how much data turnover you have. I only use it on systems with UPS power and additional backups. Needs careful monitoring of the drives too since system crashes due to drive failures can leave you in rather a sticky mess.

>
>Cheers,
>Wol
>
>
Re: Hard drive error from SMART [ In reply to ]
> For those people looking at btrfs - note that parity-raid (5 or 6) is not a wise idea at the moment so you don't get two-failure protection ...
>
> Cheers,
> Wol
>
I've been reading that this is less and less true. The write-hole issue is rather old now (first reported around 2016, I think). From what I read from various sources, the developers have made some progress and the problem is getting harder and harder to reproduce; see, for instance, [1].
Although some people recommend using RAID1 for the metadata and RAID5/6 for the data, just in case.
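
That split is chosen at mkfs time; roughly like this, with made-up devices:

mkfs.btrfs -m raid1 -d raid5 /dev/sdb /dev/sdc /dev/sdd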


Julien
[1] https://unixsheikh.com/articles/battle-testing-zfs-btrfs-and-mdadm-dm.html#btrfs-raid-5
Re: Hard drive error from SMART [ In reply to ]
Am Tue, Apr 12, 2022 at 05:03:01PM -0500 schrieb Dale:
> Rich Freeman wrote:
> > On Tue, Apr 12, 2022 at 1:08 PM Dale <rdalek1967@gmail.com> wrote:
> >> I remounted the drives and did a backup. For anyone running up on this,
> >> just in case one of the files got corrupted, I used a little trick to
> >> see if I can figure out which one may be bad if any. I took my rsync
> >> commands from my little script and ran them one at a time with --dry-run
> >> added. If a file was to be updated on the backup that I hadn't changed
> >> or added, I was going to check into it before updating my backups.
> > Unless you're using the --checksum option on rsync this isn't likely
> > to be effective.

> My hope was if it was corrupted and something changed then I'd see it in
> the list.  If nothing changed then rsync wouldn't change anything on the
> backups either.  I'll look into that option tho.  May be something for
> the future.  ;-)  I suspect it would slow things down quite a bit tho. 

The advantage of an integrity scheme (like ZFS or comparing with a checksum
file) over your rsync approach is that you only need to read all the datas™
from one drive instead of two. Plus: if rsync actually detects a change, it
doesn’t know which of the two drives introduced the error. You need to find
out yourself after the fact (which probably won’t be hard, but still, it’s
one more manual step).

--
Grüße | Greetings | Salut | Qapla’
Please do not share anything from, with or about me on any social network.

“An itching nose must be scratched.” … Kosh (Star Wreck)
Re: Hard drive error from SMART [ In reply to ]
Frank Steinmetzger wrote:
> Am Tue, Apr 12, 2022 at 05:03:01PM -0500 schrieb Dale:
>> Rich Freeman wrote:
>>> On Tue, Apr 12, 2022 at 1:08 PM Dale <rdalek1967@gmail.com> wrote:
>>>> I remounted the drives and did a backup. For anyone running up on this,
>>>> just in case one of the files got corrupted, I used a little trick to
>>>> see if I can figure out which one may be bad if any. I took my rsync
>>>> commands from my little script and ran them one at a time with --dry-run
>>>> added. If a file was to be updated on the backup that I hadn't changed
>>>> or added, I was going to check into it before updating my backups.
>>> Unless you're using the --checksum option on rsync this isn't likely
>>> to be effective.
>> My hope was if it was corrupted and something changed then I'd see it in
>> the list.  If nothing changed then rsync wouldn't change anything on the
>> backups either.  I'll look into that option tho.  May be something for
>> the future.  ;-)  I suspect it would slow things down quite a bit tho. 
> The advantage of an integrity scheme (like ZFS or comparing with a checksum
> file) over your rsync approach is that you only need to read all the datas™
> from one drive instead of two. Plus: if rsync actually detects a change, it
> doesn’t know which of the two drives introduced the error. You need to find
> out yourself after the fact (which probably won’t be hard, but still, it’s
> one more manual step).
>


In this case, if something had changed, I'd have no problem manually
checking the file to be sure which was good and which was bad.  Given
that the error on my drive is recent, I'd expect the backup to still be
a good file and therefore not one that should be overwritten.  I was
trying to avoid a bad file replacing a good file on the backup, which
would destroy all the good copies and leave only bad ones.  This is why
I like that SMART at least let me know there is a problem. 

Sometimes things have to be done manually, which is often the best way. 
Just depends on the situation I guess. 

Dale

:-)  :-) 
Re: Hard drive error from SMART [ In reply to ]
Am Tue, Apr 12, 2022 at 06:01:11PM -0500 schrieb Dale:

> > The advantage of an integrity scheme (like ZFS or comparing with a checksum
> > file) over your rsync approach is that you only need to read all the datas™
> > from one drive instead of two. Plus: if rsync actually detects a change, it
> > doesn’t know which of the two drives introduced the error. You need to find
> > out yourself after the fact (which probably won’t be hard, but still, it’s
> > one more manual step).
>
> In this case, if something had changed, I'd have no problem manually
> checking the file to be sure which was good and which was bad.

Consider a big video file, which I know you like to accumulate from youtube
and the likes. How do you find out the broken one? By watching it and trying
to find the one image or audio frame that is garbled? The drive might return
zeros or other garbage (bit flip) instead of actual content without SMART
noticing it (uncorrectable error).

> Given
> the error is recent on my drive, I'd suspect the backups to still be a
> good file.  For that reason, I'd suspect the backup file to be good
> therefore not to be overwritten.  I was trying to avoid a bad file
> replacing a good file on the backup which then destroys all good files
> and leaves only bad ones.  This is why I like that SMART at least let me
> know there is a problem. 

I also tend to rely on SMART, but it’s not all-knowing and probably not
infallible.

> Sometimes things has to be done manually which is often the best way. 
> Just depends on the situation I guess. 

--
Grüße | Greetings | Salut | Qapla’
Please do not share anything from, with or about me on any social network.

The only thing still keeping me here is Earth’s gravity.
Re: Hard drive error from SMART [ In reply to ]
Frank Steinmetzger wrote:
> Am Tue, Apr 12, 2022 at 06:01:11PM -0500 schrieb Dale:
>
>>> The advantage of an integrity scheme (like ZFS or comparing with a checksum
>>> file) over your rsync approach is that you only need to read all the datas™
>>> from one drive instead of two. Plus: if rsync actually detects a change, it
>>> doesn’t know which of the two drives introduced the error. You need to find
>>> out yourself after the fact (which probably won’t be hard, but still, it’s
>>> one more manual step).
>> In this case, if something had changed, I'd have no problem manually
>> checking the file to be sure which was good and which was bad.
> Consider a big video file, which I know you like to accumulate from youtube
> and the likes. How do you find out the broken one? By watching it and trying
> to find the one image or audio frame that is garbled? The drive might return
> zeros or other garbage (bit flip) instead of actual content without SMART
> noticing it (uncorrectable error).
>

In this case, I'd likely rename one file and keep them both until I can
figure out which is good.  That said, I'd certainly keep the backup copy
because odds are, it is good since the error came well after my last
backup.  At this point tho, I don't know what file was on that bad spot. 


>> Given
>> the error is recent on my drive, I'd suspect the backups to still be a
>> good file.  For that reason, I'd suspect the backup file to be good
>> therefore not to be overwritten.  I was trying to avoid a bad file
>> replacing a good file on the backup which then destroys all good files
>> and leaves only bad ones.  This is why I like that SMART at least let me
>> know there is a problem. 
> I also tend to rely on smart, but it’s not all-knowing and probably not
> infallible.
>
>


This is very true.  I mentioned elsewhere that things like spindle motor
failure or failure of the motor that moves the heads are usually not
detectable.  Some component failures can be detected, but not all or even
most, from what I've read.  Basically, the best you can hope for is SMART
seeing a bad spot on the media itself.  That, it seems, it can detect
most of the time. 

TL;DR next two paragraphs.  Just an interesting story along this line.  I
used to work in parts at a fortune 500 office company.  We had millions of
dollars of just computer stuff in inventory.  That was in the early 90's. 
They also had copiers and their parts, paper etc. etc.  We used an NCR
computer as the computer system for the whole company.  At the end of the
building was a speed bump so people wouldn't go flying down the one-lane
road between the building and the fence on the property line.  One day a
large, almost empty truck went a little faster than normal over the last
speed bump.  It shook the building to the point I could feel it about 150
feet away.  The computer room was about 50 feet away from that side of
the building.  It seems the hard drive felt it very well.  One, maybe
more, of the heads got under the media and started peeling it off the
platter, making a really ugly screeching sound.  No routine shutdown,
they just pulled the plug.  As you can imagine tho, it did no good.  Even
way back then, drives were spinning fast enough.  I suspect that by the
time a person could blink it was way past fixing. 

That of course was way before SMART came along, but SMART would never be
able to predict such a failure.  Even NCR said it was likely a one in a
million chance that the truck would hit just when the head was moving
over a weak spot.  Several thousand dollars later, and a private plane
bringing in a new drive, the drive was replaced.  Of course, the idiot in
charge had no backups that were of any use.  All of them were several
weeks old, likely over a month.  Luckily he stayed far away from me for
at least a month.  Otherwise, I'd likely still be in jail, with my hands
around the neck of his corpse.  :-@

SMART isn't a sure thing but it can help in some cases which is better
than nothing at all. 

Dale

:-)  :-) 
