Mailing List Archive

Strange disk corruption with Linux >= 2.6.13
Hi there. I'm seeing a really strange problem on my system lately and I
am not really sure that it has anything to do with the kernels.

I would appreciate any guidance with my problems. Any help is welcome.

My desktop has a Duron 1.3GHz (but, for some reason, it runs only at
1.1GHz) and an Asus A7V motherboard, with chipset VIA KT133 (not the
enhanced version KT133A).

Its memory modules are all PC133 and it had a 128MB module + a 256MB
module. Then, I decided that it wasn't the right time for a new computer
and just bought myself 2 newer 512MB modules.

Now, the motherboard has 512MB + 512MB + 256MB (all slots filled). I
then recompiled the kernels with HIGHMEM support (the 4GB version) and
I've been seeing some strangeness since then.

The first thing that I noticed was that I run some file integrity
programs (debsums, which checks the md5sum signatures of the packages
that I have installed) to check the state of my system. I discovered
that the signatures of some packages didn't match the originals.

Then, I reinstalled said packages and ran debsums again. I got some
*other* packages with md5sum mismatches. Thinking that it could be
something related to the memory of my system, I decided to run
memtest86+ for some time.

After running for 6 hours, it could not find anything wrong with the
1.25GB of memory installed, which left me quite puzzled.

I then tried using the system again, but, still puzzled by the md5sum
mismatches, I tried to verify them again and I got some other packages
with problems.

At the same time, I was trying to stress test the machine a little bit
and decompressing the kernel tree from a tar.bz2 file, since a friend of
mine asked me to compile him a kernel >= 2.6.12 so that he could use
udev.

In the middle of the untarring, bzip2 stopped and said that it found
inconsistencies and that I should run bzip2recover on the file. I
removed the entire tree and tried uncompressing the tarball again and
the same result happened.

I then decided to reboot the machine, since I was fed up with this
strangeness (that I had never seen occurring before), and after the
boot, I tried running memtest86+ again for some minutes. It didn't find
anything.

Then, I booted back into Linux (at the time I was using 2.6.14-rc2) and
*succeeded* in uncompressing the tar.bz2 file that was "corrupted". At
this point in time, I did not understand anything.

I then left my computer running on memtest86+ while I went to work and
16 hours later, no problem was found and it was still running fine.

I then thought that it could be something with the harddisk and tried to
play with smartctl. I ran one long/off-line test on my HD, but it
succeeded (I conjectured that the drive could be running out of spare
sectors).

I also tried running the kernel with highmem=0K, but the symptoms of
corruption repeated themselves. I even thought that maybe Linux couldn't
have been very much exposed to systems with HIGHMEM on older hardware
(like mine), so I left the machine with just one 512MB module, and it
still has problems.

I have voluntary preempt enabled, but I had it before and didn't notice
anything strange. I am now back to kernel 2.6.13.2 (avoiding all the
niceness that is in the 2.6.14-rc's), just to be sure. I can't see many
other things to try, except disabling voluntary preempt (which hasn't
given me any problems with earlier kernels and even -mm kernels).

Other than that, I am stuck and without any ideas. Please, any help
would be much more than welcome.


Thank you very much for suggestions, Rogério Brito.

P.S.: If anybody knows of a live CD with memtest86+ and cpuburn and
other things so that I could test my system, I would be highly
interested to know.

I sincerely don't know if I have a software or a hardware problem here.

P.P.S.: I am using a Debian testing system and the most demanding thing
that I do with my system is to compress some files to MP3 and to type
some texts in LaTeX with Emacs under Fluxbox with the Minimal style
(which is quite easy on the machine---I have not yet dared to use any
heavy desktop environment).

If any other information is desired, please let me know. I will gladly
help you to help me, as I am almost desperate. Thanks.
--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Strange disk corruption with Linux >= 2.6.13
On Tue, 27 Sep 2005 08:10:39 -0300,
Rogério Brito <rbrito@ime.usp.br> wrote:

> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.


You don't say which filesystem you are using. Have you tried running fsck?
Re: Strange disk corruption with Linux >= 2.6.13
On Tue, 27 Sep 2005 08:10:39 -0300, Rogério Brito <rbrito@ime.usp.br> wrote:

>Hi there. I'm seeing a really strange problem on my system lately and I
>am not really sure that it has anything to do with the kernels.

Probably not, I had a similar problem recently and for a test case
copied a .iso image file then compared it to original (cp + cmp),
turned out to be bad memory, and yes, memtest86 did not find the
problem. Check mobo datasheet if 2+ double-sided memory allowed,
you may need to stay at 1GB to reduce bus loading.
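
For anyone wanting to repeat that cp + cmp test, here is a minimal
sketch in POSIX shell. The function name copy_and_compare and the
default round count are made up for illustration; only the idea of
copying a file and comparing it to the original comes from the message
above. A small file may be served entirely from the page cache, so a
file larger than RAM is more likely to exercise the disk path as well.

```shell
#!/bin/sh
# Hypothetical helper: copy a file several times and compare each copy
# byte-for-byte with the original. Returns 1 on the first mismatch.
copy_and_compare() {
    src=$1
    rounds=${2:-5}
    i=1
    while [ "$i" -le "$rounds" ]; do
        tmp=$(mktemp "${TMPDIR:-/tmp}/cmptest.XXXXXX") || return 2
        cp "$src" "$tmp"
        if cmp -s "$src" "$tmp"; then
            echo "round $i: OK"
        else
            echo "round $i: MISMATCH"
            rm -f "$tmp"
            return 1
        fi
        rm -f "$tmp"
        i=$((i + 1))
    done
    return 0
}

# Usage (the path is a placeholder):
#   copy_and_compare /path/to/image.iso 10
```

A mismatch on healthy media points at something between the disk and
the CPU (memory, chipset, cabling) rather than at the file itself.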

Cheers,
Grant.

Re: Strange disk corruption with Linux >= 2.6.13
Hi, Diego. Thank you very much for your reply.

On Sep 27, 2005, at 8:34 AM, Diego Calleja wrote:
> You don't say what filesystem are you using. Have you tried running
> fsck?

Oh, sure. I forgot to mention that. I am using ext3 with ACL/xattrs
and with hashed B-Trees (I optimized the filesystem with option -D of
fsck.ext2). Would one of these things be a possible cause for the
strange behaviour that I am seeing?


Again, thank you very much for your interest.

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/


Re: Strange disk corruption with Linux >= 2.6.13
Oops! I forgot to answer your question completely.

On Sep 27, 2005, at 8:58 AM, Rogério Brito wrote:
> On Sep 27, 2005, at 8:34 AM, Diego Calleja wrote:
>> You don't say what filesystem are you using. Have you tried
>> running fsck?
>
> Oh, sure. I forgot to mention that. I am using ext3 with ACL/xattrs
> and with hashed B-Trees (I optimized the filesystem with option -D
> of fsck.ext2). Would one of these things be a possible cause for
> the strange behaviour that I am seeing?

Yes, I did run fsck. Twice now, in a row (shutdown -r -F now).
Nothing was found, unfortunately. :-( I'm really running out of
ideas. :-(


Thanks, Rogério Brito.

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/


Re: Strange disk corruption with Linux >= 2.6.13
Hi, Grzegorz. Thank you for your response.

On Sep 27, 2005, at 8:43 AM, Grzegorz Kulewski wrote:
> What is your southbridge?

The southbridge is a VIA VT82C686.

> Maybe there are some problems there with DMA or cables.

Humm, cables. I forgot to check that. I will check that as soon as I
wake up. I spent the entire night trying to fix this, but of course,
I gave up after some days of effort and decided to ask for help.

> Anything in logs?

Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now
that you mention it, I remember that I also made my Matrox G400 use
speed 4x. I will try slowing it down to see if there is any influence
on what I see.

> Maybe southbridge or northbridge is simply overheating? Maybe you
> have bad power supply? What are readings of temperatures and
> voltages in BIOS after some heavy disk-memory activities?

I don't know, because lmsensors doesn't give accurate measurements,
unfortunately. :-(

> You can use http://pyropus.ca/software/memtester/ to check your
> memory in linux. You can run cpuburn at the same time. And you can
> do some disk activity at the same time (for example dd if=/dev/hda
> bs=200M | md5sum several times to check if it will give the same
> results).

I had already tried using memtester, but I guess that I was too
ambitious with the amount of memory that I tried it to allocate. I
will try this, but with my filesystem in read-only mode, as I cannot
afford to lose what I have (and Debian's mondo/mind isn't working
right now---I already filed a bug report that is shared by others).
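
The repeated-read check suggested above can be sketched roughly as
follows. This is only an illustration: the function name repeat_md5 and
the default of three passes are assumptions, not something from the
thread. It only reads from its target, so it should be safe on a
read-only filesystem; reading a raw device such as /dev/hda normally
requires root.

```shell
#!/bin/sh
# Hypothetical helper: read the same file or device several times and
# check that every pass produces the same MD5 digest.
repeat_md5() {
    target=$1
    rounds=${2:-3}
    first=""
    i=1
    while [ "$i" -le "$rounds" ]; do
        sum=$(dd if="$target" bs=1M 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "pass $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "digests differ"
            return 1
        fi
        i=$((i + 1))
    done
    echo "digests agree"
}

# Usage (device name as in the quoted suggestion):
#   repeat_md5 /dev/hda 3
```

Differing digests for unchanged data would suggest that reads are
being corrupted somewhere on the way from the disk to memory.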

> I will bet that you have some hardware problem there. You can try
> to remove the 256MB DDR module and turn HIGHMEM off. You can also
> try to check each module separately.

I already checked each module separately, but I didn't see any
corruption. I guess that I maybe wasn't paying too much attention. I
will try it again. Thanks for the suggestion.

> And the best choice will be probably to buy new mb (for example
> Abit KW7 or KV7) because yours is very old and it can start to
> silently break after so many years... Today mbs are very short
> living parts - 3-4 years and they are broken...

Yes, I was just trying to avoid getting a new system now, with all
the transitions going on (i386 -> x86_64 CPUs, PATA -> SATA etc). But
my time is also costing me some nights of sleep... :-( It sucks not
to be in the US, where things are cheaper. :-(


Thank you very much, Rogério.

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/


Re: Strange disk corruption with Linux >= 2.6.13
On Tue, 27 Sep 2005, Rogério Brito wrote:

> Hi, Grzegorz. Thank you for your response.

Hi, no problem.

> On Sep 27, 2005, at 8:43 AM, Grzegorz Kulewski wrote:
>> What is your southbridge?
>
> The southbridge is a VIA VT82C686.

I know. I had the same southbridge in my Abit KG7 but I don't know if you
have version A or version B. I had version B and it has several disk
problems fixed. For version A there are some workarounds in the kernel.


>> Maybe there are some problems there with DMA or cables.
>
> Humm, cables. I forgot to check that. I will check that as soon as I wake up.
> I spent the entire night trying to fix this, but of course, I gave up after
> some days of effort and decided to ask for help.
>
>> Anything in logs?
>
> Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now that

I don't think that there will be any oops or something like that. But
maybe some IDE messages - like failed commands or something. But if there
are no such messages then chance is that this is some memory/mb problem.


> you mention it, I remember that I also made my Matrox G400 use speed 4x. I
> will try slowing it down to see if there is any influence on what I see.

Yes, slowing down your graphics card could help.


>> Maybe southbridge or northbridge is simply overheating? Maybe you have bad
>> power supply? What are readings of temperatures and voltages in BIOS after
>> some heavy disk-memory activities?
>
> I don't know, because lmsensors doesn't give accurate measurements,
> unfortunately. :-(

So after burning, reboot fast and check the BIOS measurements. Temperatures
will not change that much in minute or two. If your system is overheating
they will be high for at least 5 minutes after reboot.


>> You can use http://pyropus.ca/software/memtester/ to check your memory in
>> linux. You can run cpuburn at the same time. And you can do some disk
>> activity at the same time (for example dd if=/dev/hda bs=200M | md5sum
>> several times to check if it will give the same results).
>
> I had already tried using memtester, but I guess that I was too ambitious
> with the amount of memory that I tried it to allocate. I will try this, but
> with my filesystem in read-only mode, as I cannot afford to lose what I have
> (and Debian's mondo/mind isn't working right now---I already filed a bug
> report that is shared by others).
>
>> I will bet that you have some hardware problem there. You can try to remove
>> the 256MB DDR module and turn HIGHMEM off. You can also try to check each
>> module separately.
>
> I already checked each module separately, but I didn't see any corruption. I
> guess that I maybe wasn't paying too much attention. I will try it again.
> Thanks for the suggestion.


Hmm... What did you change before the system started not working? Maybe
try with only 256MB module installed if that was the working
configuration...


>> And the best choice will be probably to buy new mb (for example Abit KW7 or
>> KV7) because yours is very old and it can start to silently break after so
>> many years... Today mbs are very short living parts - 3-4 years and they
>> are broken...
>
> Yes, I was just trying to avoid getting a new system now, with all the
> transitions going on (i386 -> x86_64 CPUs, PATA -> SATA etc). But my time is

Yeah, I am waiting for stable and better x86_64 too. But I replaced my KG7
to KW7 in the mean time just to be sure I have something before I will
buy x86_64. :-)


> also costing me some nights of sleep... :-( It sucks not to be in the US,
> where things are cheaper. :-(

Yeah, it sucks. I live in Poland and we have really big prices for
computer parts here. :-(


> Thank you very much, Rogério.

No problem.


Grzegorz Kulewski
Re: Strange disk corruption with Linux >= 2.6.13
Grant Coady wrote:
> On Tue, 27 Sep 2005 08:10:39 -0300, Rogério Brito <rbrito@ime.usp.br> wrote:
>
>
>>Hi there. I'm seeing a really strange problem on my system lately and I
>>am not really sure that it has anything to do with the kernels.
>
>
> Probably not, I had a similar problem recently and for a test case
> copied a .iso image file then compared it to original (cp + cmp),
> turned out to be bad memory, and yes, memtest86 did not find the
> problem. Check mobo datasheet if 2+ double-sided memory allowed,
> you may need to stay at 1GB to reduce bus loading.

I work a lot with hardware and my experience is that memtest is not very
good at detecting errors. I have a Socket 7 board somewhere with bad L2
cache - it was unstable but memtest was unable to find anything.
However, GoldMemory found some errors - they disappeared after disabling
L2 cache and crashes disappeared too. It's not free but at least
shareware - you can find it at http://www.goldmemory.cz/ The older
version (IIRC 5.07) was better, I had problems with some of the newer
ones on perfectly OK hardware (when the test should start, it rebooted
instead).

--
Ondrej Zary
Re: Strange disk corruption with Linux >= 2.6.13
On Tue, Sep 27, 2005 at 09:57:52PM +1000, Grant Coady wrote:
> Probably not, I had a similar problem recently and for a test case
> copied a .iso image file then compared it to original (cp + cmp),
> turned out to be bad memory, and yes, memtest86 did not find the
> problem. Check mobo datasheet if 2+ double-sided memory allowed,
> you may need to stay at 1GB to reduce bus loading.

The board is allowed 1.5GB using 3 x 512M. I believe the 512M modules
must be double sided to work but I am not 100% sure of that.

It is also generally unstable if set to anything over PC100 memory speed
in my experience (my machine has the same board). The memory speed
detection doesn't work properly. I have found it perfectly stable when
set to PC100 in bios and using PC133 memory. It seems to prefer having
the extra margin.

I have never personally had more than 2 x 256M on mine.

Len Sorensen
Re: Strange disk corruption with Linux >= 2.6.13
Hi

On Tue, 27 Sep 2005, Grzegorz Kulewski wrote:

> On Tue, 27 Sep 2005, Rogério Brito wrote:
>
> > The southbridge is a VIA VT82C686.
>
> I know. I had the same southbridge in my Abit KG7 but I don't know if you have
> version A or version B. I had version B and it has several disk problems
> fixed. For version A there are some workarounds in the kernel.

Version B here. It first had only 128MB, worked fine, I added 256MB,
system became unstable, memtest86 found "bad memory" around the last
megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
Every module alone works, but not together. But in my case memtest86 did
find errors. Try removing the 256MB module?...

Thanks
Guennadi
---
Guennadi Liakhovetski
Re: Strange disk corruption with Linux >= 2.6.13
On Tue, Sep 27, 2005 at 09:42:44PM +0200, Guennadi Liakhovetski wrote:
> Version B here. It first had only 128MB, worked fine, I added 256MB,
> system became unstable, memtest86 found "bad memory" around the last
> megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
> Every module alone works, but not together. But in my case memtest86 did
> find errors. Try removing the 256MB module?...

FWIW, some VIA based chipsets only take a single DDR400 module, not
two. The manuals are a bit vague about it.


Erik

--
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
Re: Strange disk corruption with Linux >= 2.6.13
On Tue, 27 Sep 2005, Erik Mouw wrote:

> On Tue, Sep 27, 2005 at 09:42:44PM +0200, Guennadi Liakhovetski wrote:
> > Version B here. It first had only 128MB, worked fine, I added 256MB,
> > system became unstable, memtest86 found "bad memory" around the last
> > megabytes. Then I bought 512MB, hoping to use it with 256MB - no way.
> > Every module alone works, but not together. But in my case memtest86 did
> > find errors. Try removing the 256MB module?...
>
> FWIW, some VIA based chipsets only take a single DDR400 module, not
> two. The manuals are a bit vague about it.

My manual says "2". And it's a A7VI-VM, so, unfortunately, no DDR400, just
PC133/VC133.

Thanks
Guennadi
---
Guennadi Liakhovetski
Re: Strange disk corruption with Linux >= 2.6.13
Hi Rogerio.

On Tue, 2005-09-27 at 21:10, Rogério Brito wrote:
> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.

I've seen the thread mostly following the hardware line. I'd like to
enquire down the kernel path because I've seen occasional, impossible to
reproduce problems too.

Can I ask first a few questions:

1) Are you using vanilla kernels, or do you have other patches applied?
2) Are you using ext3 only?
3) Is the corruption only ever in memory, or seen on disk too?
4) Is the corruption only in one filesystem or spread across several (if
applicable)? (ie in / but not /home or others?)

Regards,

Nigel


Re: Strange disk corruption with Linux >= 2.6.13
On Tue, Sep 27, 2005 at 08:10:39AM -0300, you [Rogério Brito] wrote:
> Hi there. I'm seeing a really strange problem on my system lately and I
> am not really sure that it has anything to do with the kernels.
>
> I would appreciate any guidance with my problems. Any help is welcome.
>
> My desktop has a Duron 1.3GHz (but, for some reason, it runs only at
> 1.1GHz) and an Asus A7V motherboard, with chipset VIA KT133 (not the
> enhanced version KT133A).

You may be running into this problem:

http://www.uwsg.iu.edu/hypermail/linux/kernel/0207.2/0574.html
http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/1727.html
http://www.cs.helsinki.fi/linux/linux-kernel/2002-01/1048.html
http://marc.theaimsgroup.com/?l=linux-kernel&m=99889965423508&w=2

(A google search will turn up more.)

I had enormous trouble with VIA KT133 and IDE.

Placing network card to a different PCI slot helped somewhat as did
upgrading the bios.

I NEVER got the board stable, and ended up ditching it.

It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
is utter crap period.

When browsing the viaarena.com forums, I found a huge number of problem
reports about KT133 corrupting DMA transfers with sound cards, video
editing cards and IDE. It seemed to me it just can't get DMA right when it
is under heavy load. The reports were mostly windows, btw.



-- v --

v@iki.fi

Re: Strange disk corruption with Linux >= 2.6.13
On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> I NEVER got the board stable, and ended up ditching it.
>
> It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> is utter crap period.

It was a FIFO bug, but the kernel knows about it and it should handle
this correctly. Is the hard disk running UDMA133 ?


Re: Strange disk corruption with Linux >= 2.6.13
On Thu, Sep 29, 2005 at 12:23:28AM +0100, you [Alan Cox] wrote:
> On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> > I NEVER got the board stable, and ended up ditching it.
> >
> > It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> > is utter crap period.
>
> It was a FIFO bug, but the kernel knows about it and it should handle
> this correctly.

Interesting. Since which version?

> Is the hard disk running UDMA133 ?

The hardware has long since been ditched for good after months of wasted
effort to get it working, but I think HPT370 on KT7 supports UDMA100 at
maximum, and the disks were likely UDMA66.



-- v --

v@iki.fi

Re: Strange disk corruption with Linux >= 2.6.13
On Thu, 2005-09-29 at 09:29 +0300, Ville Herva wrote:
> On Thu, Sep 29, 2005 at 12:23:28AM +0100, you [Alan Cox] wrote:
> > On Wed, 2005-09-28 at 11:43 +0300, Ville Herva wrote:
> > > I NEVER got the board stable, and ended up ditching it.
> > >
> > > It seemed to be a KT133 Northbridge DMA issue. My impression is that KT133
> > > is utter crap period.
> >
> > It was a FIFO bug, but the kernel knows about it and it should handle
> > this correctly.
>
> Interesting. Since which version?

Some fixes went in early 2.4 and they got refined later on. See the
function quirk_vialatency. There is a brief summary at the first URL
listed still. Essentially the chip has a flaw where it can lose a
transfer.

If people see this behaviour on a KT133 can you please check the quirk
is being run and displaying

printk(KERN_INFO "Applying VIA southbridge workaround.\n");
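
One way to look for that message is sketched below. The function name
check_via_quirk is hypothetical; the only thing taken from this mail is
the message text itself. The function reads a kernel log on standard
input, so it can be fed from dmesg or from a saved log file. If the
ring buffer has already wrapped since boot, the line may be missing
from dmesg even though the quirk ran.

```shell
#!/bin/sh
# Hypothetical helper: scan a kernel log on stdin for the message
# printed when the VIA southbridge workaround is applied.
check_via_quirk() {
    if grep -qF "Applying VIA southbridge workaround"; then
        echo "quirk applied"
    else
        echo "quirk message not found"
        return 1
    fi
}

# Usage:
#   dmesg | check_via_quirk
```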

Re: Strange disk corruption with Linux >= 2.6.13
Hi, Grzegorz. Thank you again for your response.

I haven't been keeping up with linux-kernel, since I have been
experimenting with my motherboard to see if I could make it stable.

On Sep 27 2005, Grzegorz Kulewski wrote:
> On Tue, 27 Sep 2005, Rogério Brito wrote:
> >The southbridge is a VIA VT82C686.
>
> I know. I had the same southbridge in my Abit KG7 but I don't know if
> you have version A or version B. I had version B and it has several
> disk problems fixed. For version A there are some workarounds in the
> kernel.

Didn't know that until I saw the following in the dmesg log:

- - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -
rbrito@dumont:~$ dmesg | grep -i via
Disabling VIA memory write queue (PCI ID 0305, rev 02): [55] 89 & 1f -> 09
PCI: Disabling Via external APIC routing
agpgart: Detected VIA Twister-K/KT133x/KM133 chipset
parport_pc: VIA 686A/8231 detected
parport_pc: VIA parallel port: io=0x378, irq=7
VP_IDE: VIA vt82c686a (rev 22) IDE UDMA66 controller on pci0000:00:04.1
- - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - -

This also answers the question of my motherboard having the revision A
of the southbridge.

> >Nothing in the logs. No oops, no stack trace, no nothing. :-( Oh, now
> >that
>
> I don't think that there will be any oops or something like that. But
> maybe some IDE messages - like failed commands or something. But if there
> are no such messages then chance is that this is some memory/mb
> problem.

Yes, I found some of them. See below.

> >you mention it, I remember that I also made my Matrox G400 use speed
> >4x. I will try slowing it down to see if there is any influence on
> >what I see.
>
> Yes, slowing down your graphics card could help.

This is something that I still have not tried, because I lost a good
amount of time using Gold Memory (already mentioned in this thread) to
scan for bad memory.

Even though GM is shareware and I was limited to the "quick
tests", it did a *much* better job than memtest86+ at finding errors (i.e.,
Gold Memory found errors with my system even when memtest86+ didn't).
Perhaps some of those tests could be included in memtest86+.

Oh, and the fact that we have both memtest86{,+} doesn't help one when
choosing what to use. :-(

> >>I will bet that you have some hardware problem there. You can try to
> >>remove the 256MB DDR module and turn HIGHMEM off. You can also try to
> >>check each module separately.
> >
> >I already checked each module separately, but I didn't see any corruption.
> >I guess that I maybe wasn't paying too much attention. I will try it
> >again. Thanks for the suggestion.
>
> Hmm... What did you change before the system started not working?

It had 256MB + 128MB running at PC100 speed (even though both were rated
to work at PC133 speeds).

> Maybe try with only 256MB module installed if that was the working
> configuration...

The catch is that the problem seems to be transient and not that easy to
reproduce. For instance, I had 2 x 512MB + 256MB installed and it
"worked" (meaning that it booted Linux and the system was useable, even
though I saw some problems with md5sums on my system).

Then, just removing the 256MB module made the computer not even POST
anymore! Weird, isn't it? Beyond anything that I can explain yet.

> >It sucks not to be in the US, where things are cheaper. :-(
>
> Yeah, it sucks. I live in Poland and we have really big prices for
> computer parts here. :-(

So, you know what I am talking about when I want to keep what I have
just for the moment.


Regards,

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
Re: Strange disk corruption with Linux >= 2.6.13
Hi, Guennadi.

On Sep 27 2005, Guennadi Liakhovetski wrote:
> Version B here. It first had only 128MB, worked fine, I added 256MB,
> system become unstable, memtest86 found "bad memory" around the last
> megabytes.

This is *quite* similar to what I am seeing.

> Then I bought 512MB, hoping to use it with 256MB - no way.

Again, similar to what I see.

> Every module alone works, but not together. But in my case memtest86
> did find errors.

This is something puzzling: when I first installed the modules to get
1.25GB, things "worked", but I had problems with memtest86+ (not
memtest86).

I changed things (removing modules), got frustrated having only 512MB on
the system with all the other modules laying around here and put them
back.

This second time, I relaxed the memory timings in the BIOS from 2-2-2 to
3-3-3, and it booted and memtest86+ didn't find any errors. Yet, I saw some
corruption, which was what prompted me to send the original mail to
linux-kernel (since I didn't know if it was a hardware or a software
problem, as memtest86+ had not found any errors).

> Try removing the 256MB module?...

Right now, I'm only using one 512MB module, even though I have already
paid for the second one, and it wasn't cheap. :-(

I suspect that the system is stable now, but I am not sure. If I
reinstall some packages with apt, it still gets some problems with the
md5sum signatures of *other* packages, which is highly weird. But I
don't see any other problems.

Puzzling, huh? I already ran a SMART offline/long self-test on the disk
(to rule out it being a problem) and it passed with flying colors. I
also already used badblocks on this very disk (but in read-only mode),
and it also didn't find any problems.

I have a Quantum FIREBALLlct15 drive here.


Thanks,

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
Re: Strange disk corruption with Linux >= 2.6.13
Hi, Ondrej and others,

On Sep 27 2005, Ondrej Zary wrote:
> I have a Socket 7 board somewhere with bad L2 cache - it was unstable
> but memtest was unable to find anything.

Right.

> However, GoldMemory found some errors - they disappeared after
> disabling L2 cache and crashes disappeared too.

I have not yet tried disabling the cache in my case (since both L1 and
L2 caches here are integrated into the processor). It may be a
possibility, though.

> It's not free but at least shareware - you can find it at
> http://www.goldmemory.cz/

Thank you very much for this hint. It indeed found problems that
memtest86+ didn't find. I think that it would be nice to have some of
those tests integrated in memtest86+.


Thanks again,

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
Re: Strange disk corruption with Linux >= 2.6.13
On Sep 27 2005, Lennart Sorensen wrote:
> The board is allowed 1.5GB using 3 x 512M. I believe the 512M modules
> must be double sided to work but I am not 100% sure of that.

Right now, I'm using just a single 512MB module, but it is single-sided
(I guess that by double-sided you guys mean that it has chips on both
sides of the module, right?). The only double-sided module that I have
here is the 256MB module.

OTOH, with just one 512MB everything *seems* to be working fine, but,
honestly, I'm not sure.

> It is also generally unstable if set to anything over PC100 memory speed
> in my experience (my machine has the same board).

Hummm, nice to see that you have also experienced this. With 256 + 128,
I had to use PC100 to have it work stably.

> The memory speed detection doesn't work properly. I have found it
> perfectly stable when set to PC100 in bios and using PC133 memory. It
> seems to prefer having the extra margin.

I'd obviously prefer to have everything working at PC133 speed, but
wouldn't mind running at PC100 speed if I could use everything, since I
sometimes need to use some large programs (for some dynamic programming
problems).


Thanks for sharing your experiences,

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
Re: Strange disk corruption with Linux >= 2.6.13
On Sep 28 2005, Nigel Cunningham wrote:
> Hi Rogerio.

Hi, Nigel.

> On Tue, 2005-09-27 at 21:10, Rogério Brito wrote:
> > Hi there. I'm seeing a really strange problem on my system lately and I
> > am not really sure that it has anything to do with the kernels.
>
> I've seen the thread mostly following the hardware line. I'd like to
> enquire down the kernel path because I've seen occasional, impossible
> to reproduce problems too.

Nice. I also don't want to rule out anything before I really understand
what's going on.

> Can I ask first a few questions:

Of course.

> 1) Are you using vanilla kernels, or do you have other patches applied?

Yes, all the kernels that I use are just plain vanilla kernels taken
straight from kernel.org. No other patches applied.

> 2) Are you using ext3 only?

Yes, I am.

> 3) Is the corruption only ever in memory, or seen on disk too?

I have noticed the problem mostly on disk. One strange situation was
when I was untarring a kernel tree (compressed with bzip2) and in the
middle of the extraction, bzip2 complained that the thing was
corrupted.

I removed what was extracted right away and tried again to extract the
tree (at this point, even suspecting that something in software had
problems). The problem with bzip2 occurred again. Then, I rebooted the
system and the problem magically went away.
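
To check whether a tarball decompresses the same way every time, the
extraction can be repeated in memory without touching the filesystem.
This is a rough sketch using Python's standard bz2 module; the helper
names bz2_digest and repeat_check are hypothetical. If the same file
fails on one run and succeeds on another, or yields different digests,
the compressed file on disk is probably not the culprit.

```python
import bz2
import hashlib


def bz2_digest(path, chunk_size=1 << 20):
    """Decompress `path` fully; return (ok, SHA-1 hex digest or error text)."""
    digest = hashlib.sha1()
    try:
        with bz2.open(path, "rb") as stream:
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
    except (OSError, EOFError) as error:
        # OSError: invalid data (or an unreadable file); EOFError: truncated.
        return False, str(error)
    return True, digest.hexdigest()


def repeat_check(path, runs=3):
    """Return the set of distinct outcomes seen over `runs` decompressions."""
    return {bz2_digest(path) for _ in range(runs)}
```

On a healthy machine repeat_check should return a set with exactly one
element, whether that element is a digest or an error.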

> 4) Is the corruption only in one filesystem or spread across several
> (if applicable)? (ie in / but not /home or others?)

I only have one filesystem right now, but given the difficulties that
I'm seeing, I do plan to go back to a multiple filesystem setup (which I
always used but had come to think was overkill---nothing like time to
teach us what is safest).

If you want to know anything else, don't hesitate to ask.


Regards,

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/
Re: Strange disk corruption with Linux >= 2.6.13
On Sat, 1 Oct 2005 18:36:55 -0300, Rogério Brito <rbrito@ime.usp.br> wrote:

>
>I have noticed the problem mostly on disk. One strange situation was
>when I was untarring a kernel tree (compressed with bzip2) and in the
>middle of the extraction, bzip2 complained that the thing was
>corrupted.
>
>I removed what was extracted right away and tried again to extract the
>tree (at this point, suspecting even that something in software had
>problems). The problem with bzip2 occurred again. Then, I rebooted the
>system and the problem magically went away.

This rings a bell, recently I reported a problem:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0508.1/1332.html

Turned out to be bad memory stick :o)

Cheers,
Grant.

Re: Strange disk corruption with Linux >= 2.6.13
Rogério Brito <rbrito@ime.usp.br> wrote:
> On Sep 28 2005, Nigel Cunningham wrote:

>> 3) Is the corruption only ever in memory, or seen on disk too?
>
> I have noticed the problem mostly on disk. One strange situation was
> when I was untarring a kernel tree (compressed with bzip2) and in the
> middle of the extraction, bzip2 complained that the thing was
> corrupted.
>
> I removed what was extracted right away and tried again to extract the
> tree (at this point, suspecting even that something in software had
> problems). The problem with bzip2 occurred again. Then, I rebooted the
> system and the problem magically went away.

I have a similar problem:
It's a corruption while reading data from the HDD into the cache.
The affected page will contain (pseudo?)random data in the first four
bytes (at least on my system it did).

If you waited long enough, the cache page would be discarded and the next
read from the disk would be correct. However, if it happens e.g. in an
inode block, the corruption may find its way to the disk and/or fubar
your data.

This happens mostly if there are concurrent DMA transfers like playing
sound or watching TV on bttv cards. I'm affected by the latter cause;
setting no_overlay reduced it.
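
To look for this pattern, two reads of the same data (one suspect, one
known good) can be compared page by page. The sketch below is only an
illustration: the function name differing_pages is made up, and the
4096-byte page size is an assumption that matches x86. A result where
the first and last differing offsets are 0 and 3 would match the "first
four bytes" symptom described above.

```python
def differing_pages(first, second, page_size=4096):
    """Compare two equal-length reads page by page.

    Returns one (page_index, first_bad_offset, last_bad_offset) tuple for
    every page that differs; offsets are relative to the start of the page.
    """
    if len(first) != len(second):
        raise ValueError("reads must have the same length")
    report = []
    for start in range(0, len(first), page_size):
        a = first[start:start + page_size]
        b = second[start:start + page_size]
        if a == b:
            continue
        offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        report.append((start // page_size, offsets[0], offsets[-1]))
    return report
```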

--
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.
Re: Strange disk corruption with Linux >= 2.6.13
Hi, Grant, Nigel and others following this thread.

On Oct 02 2005, Grant Coady wrote:
> On Sat, 1 Oct 2005 18:36:55 -0300, Rogério Brito <rbrito@ime.usp.br> wrote:
> >I removed what was extracted right away and tried again to extract
> >the tree (at this point, suspecting even that something in software
> >had problems). The problem with bzip2 occurred again. Then, I
> >rebooted the system and the problem magically went away.
>
> This rings a bell, recently I reported a problem:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0508.1/1332.html

Thanks for the information. I am on-and-off experimenting with
GoldMemory and memtest86+ to see if I can find a configuration with
more than 512MB that is stable.

I am, right now, using 512MB + 256MB slowed down to PC100 speeds. It
seems to be stable with this configuration (having survived some memory
tests, the decoding of lots of FLAC files in a row and using the machine
as usual---with low consumption things like mutt and browsing with
lynx).
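
For a quick check from inside the running system, a userspace pattern
test can be sketched as below. This is only a rough sanity check under
stated assumptions: the function name pattern_test and the choice of
patterns are illustrative, a userspace program cannot choose which
physical addresses it gets, and it is far weaker than memtest86+ or
GoldMemory.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each byte pattern and read it back.

    Returns the list of patterns that did not read back intact.
    """
    failed = []
    buffer = bytearray(size_bytes)
    for pattern in patterns:
        buffer[:] = bytes([pattern]) * size_bytes  # write pass
        if buffer.count(pattern) != size_bytes:    # read-back pass
            failed.append(pattern)
    return failed
```

An empty list from, say, pattern_test(256 * 1024 * 1024) means only
that this particular buffer survived; it does not prove that the memory
is good.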

> Turned out to be bad memory stick :o)

The thing is that any stick alone doesn't seem to cause a problem. The
problems only appear when the sticks are used together.

I will test it more to see what may be wrong with my setup. :-( I still
have not isolated and understood the problem completely. :-(


Thanks for the feedback, Rogério.

--
Rogério Brito : rbrito@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/