Mailing List Archive

Re: Attention: Problematic Update for SLES11 (kernel, DLM, cLVM)
Hi!

Let me elaborate on why this problem is considered very nasty (according to my current knowledge of it):

It seems that updating the kernel on one node in the cluster causes the SCTP socket of the DLM on the other node to die. There seems to be no specific log message or action for that event on the affected node. The only work-around seems to be a restart of the DLM on the affected node. Unfortunately, restarting DLM through the cluster means stopping all dependent resources. I tried to kill the DLM process and restart it instead (as the actual DLM data seem to reside in the kernel), but the node fenced itself almost immediately after the DLM process was killed.
And in a two-node cluster you cannot migrate resources, as the DLM on the other node doesn't work (it cannot communicate with the first node because the first node's socket went away).

So I think (to my current knowledge) that the problem is not the new kernel but the old kernel (or maybe the DLM). The new kernel triggers a problem in the old kernel on the other node (you might call this a remote denial-of-service attack).

The other problem is the cluster stuff: DLM on the updated node cannot start, and the start action is marked as failed, yet the kernel continues to write log messages at an insane speed. Likewise, when trying to set the updated node to standby (to stop the flooding of syslog), the stop action runs into a timeout, causing a node fence.
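
To make the "insane speed" visible, here is a minimal Python sketch that counts the dlm SCTP retry messages per host and second in syslog lines. It assumes the messages look like the ones quoted in the original report further down (with the wrapped lines joined again). The names count_dlm_retries and flooded_seconds and the threshold of 3 are made up for illustration and are not part of any cluster tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# "Jul 30 08:17:09 h05 kernel: [ 563.700629] dlm: Trying to connect to 172.20.16.1"
DLM_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*\d+\.\d+\]\s+dlm:\s+(?P<msg>.*)$"
)

# Substrings taken from the messages quoted in the original report.
RETRY_MARKERS = (
    "Can't start SCTP association",
    "Retrying SCTP association",
    "Retry sending",
)


def count_dlm_retries(lines):
    """Map (host, timestamp) to the number of dlm SCTP retry messages seen."""
    counts = Counter()
    for line in lines:
        m = DLM_LINE.match(line)
        if m and any(marker in m.group("msg") for marker in RETRY_MARKERS):
            counts[(m.group("host"), m.group("stamp"))] += 1
    return counts


def flooded_seconds(lines, threshold=3):
    """Return sorted (host, timestamp) keys with at least `threshold` retries.

    The default threshold is a hypothetical value, not a tested limit.
    """
    return sorted(k for k, n in count_dlm_retries(lines).items() if n >= threshold)


# The four messages from the original report, with line wraps removed.
SAMPLE = [
    "Jul 30 08:17:09 h05 kernel: [ 563.700629] dlm: Trying to connect to 172.20.16.1",
    "Jul 30 08:17:09 h05 kernel: [ 563.700836] dlm: Can't start SCTP association - retrying",
    "Jul 30 08:17:09 h05 kernel: [ 563.700843] dlm: Retry sending 48 bytes to node id 17831084",
    "Jul 30 08:17:09 h05 kernel: [ 563.700852] dlm: Retrying SCTP association init for node 17831084",
]

if __name__ == "__main__":
    for host, stamp in flooded_seconds(SAMPLE):
        print(host, stamp)
```

On a real node you would feed it the lines of the syslog file instead of SAMPLE (probably /var/log/messages, but that path is an assumption about your syslog setup). The timestamps have a resolution of one second only, so the count per second is a rough indicator at best.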

You see, once you have it, it's a very ugly problem. The question is: Does it occur every time you update one node in a two-node cluster?

The only work-around seems to be to restart the older node. But if I have to restart the node anyway, I could also install the newer kernel (hoping that the new kernel isn't the problem). There's no update to DLM AFAIK, however.

Regards,
Ulrich

>>> "Ulrich Windl" <Ulrich.Windl@rz.uni-regensburg.de> wrote on 30.07.2014 at
16:05 in message <53D917C3020000A1000167AE@gwsmtp1.uni-regensburg.de>:
> Hello!
>
> An update: The problem is known at SUSE and there is a temporary fix
> (PTF.876616) for this issue. Unfortunately the kernel with the defect is
> newer than the PTF, i.e. the PTF is not included in the latest kernel.
>
> Regards,
> Ulrich
>
>>>> Ulrich Windl wrote on 30.07.2014 at 08:47 in message <53D894E7.ECA : 161 : 60728>:
>> Hi!
>>
>> I wanted to notify you that one of the recent updates for SLES11 SP3 may
>> cause trouble when using cLVM: On an updated node, cLVM won't start any more,
>> and the kernel will flood your syslog with messages like:
>>
>> Jul 30 08:17:09 h05 kernel: [ 563.700629] dlm: Trying to connect to 172.20.16.1
>> Jul 30 08:17:09 h05 kernel: [ 563.700836] dlm: Can't start SCTP association - retrying
>> Jul 30 08:17:09 h05 kernel: [ 563.700843] dlm: Retry sending 48 bytes to node id 17831084
>> Jul 30 08:17:09 h05 kernel: [ 563.700852] dlm: Retrying SCTP association init for node 17831084
>>
>> The issue will be investigated, but be prepared for trouble if you update
>> just one node in your cluster.
>>
>> Regards,
>> Ulrich
>>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
