Mailing List Archive

TOTEM implementation error (SLES11 SP2)?
Hello,

I'm wondering about these messages:

Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:53:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6

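If you look at the first and the last item in each retransmit list, it's obvious that this cannot be a ring buffer (as I was expecting): the first and last entries (2a5 and 2a6) just swap places from one message to the next, while the entries in between stay the same. To me it looks like an implementation error.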
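To make the ordering observation concrete, here is a minimal Python sketch (not part of corosync) that parses one of the lines quoted above and reports where the list stops being ascending. It assumes the entries are hexadecimal sequence numbers, which is what the samples suggest; the regular expression and the helper names `parse_retransmit_list` and `out_of_order_positions` are made up for this illustration.

```python
import re

# Pattern inferred from the syslog samples in this thread (an assumption,
# not an official corosync log format specification).
RETRANSMIT_RE = re.compile(r"\[TOTEM\s*\] Retransmit List: (?P<seqs>[0-9a-f ]+)$")


def parse_retransmit_list(line):
    """Return the entries of a 'Retransmit List' line as ints, or None."""
    match = RETRANSMIT_RE.search(line.strip())
    if match is None:
        return None
    # Assumption: entries are hexadecimal sequence numbers.
    return [int(token, 16) for token in match.group("seqs").split()]


def out_of_order_positions(seqs):
    """Return every index i where seqs[i] > seqs[i + 1]."""
    return [i for i in range(len(seqs) - 1) if seqs[i] > seqs[i + 1]]


line = ("Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: "
        "2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f "
        "2a0 2a1 2a2 2a3 2a4 2a6")
seqs = parse_retransmit_list(line)
print(out_of_order_positions(seqs))  # positions where ascending order breaks
```

For the line above, only the first entry is out of place; everything after it is ascending.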

Those messages appear and disappear without apparent reason. Maybe the reason is having two independent rings combined with poor logging. Here is how the situation switches:

Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a6 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a5
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46
Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 1f 20 21 22 23 24 25 26 27 28

I doubt the network can have as many problems as TOTEM reports:

[...]
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 780
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 780
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 782
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 784
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 784
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 786
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 786
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Marking ringid 1 interface 192.168.0.64 FAULTY
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 788
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 789
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 78c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Automatically recovered ring 1
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79a
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79c
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79e
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 79e
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 7a0
Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 7a2
[...]

# grep "Retransmit List" /var/log/messages | wc -l
5504

(All in less than an hour when some nodes booted)
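The total from `grep | wc -l` hides how the messages are distributed over time. As a rough sketch (assuming the classic syslog timestamp prefix seen in the excerpts above; `retransmits_per_minute` is a hypothetical helper, not an existing tool), the lines could be bucketed per minute like this:

```python
from collections import Counter


def retransmits_per_minute(lines):
    """Count 'Retransmit List' lines per minute of the syslog timestamp."""
    counts = Counter()
    for line in lines:
        if "Retransmit List" not in line:
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        # Assumed prefix: "Feb 25 14:54:32 host ..." -> key "Feb 25 14:54"
        month, day, clock = fields[:3]
        counts[f"{month} {day} {clock[:5]}"] += 1
    return counts


# A few lines copied from the excerpts above.
sample = [
    "Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6",
    "Feb 25 14:54:18 so4 corosync[12457]: [TOTEM ] Retransmit List: 3d 3e 3f 40 41 42 43 44 45 46",
    "Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 780",
    "Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Marking ringid 1 interface 192.168.0.64 FAULTY",
    "Feb 25 14:54:32 so4 corosync[12457]: [TOTEM ] Retransmit List: 788",
]
counts = retransmits_per_minute(sample)
for minute, count in sorted(counts.items()):
    print(minute, count)
```

To run it against the real log, pass an open file instead of `sample`, for example `retransmits_per_minute(open("/var/log/messages"))`.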

Regards,
Ulrich

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: TOTEM implementation error (SLES11 SP2)?
On 2013-02-25T15:26:36, Ulrich Windl <Ulrich.Windl@rz.uni-regensburg.de> wrote:

> Hello,
>
> I'm wondering about these messages:
>
> Feb 25 14:53:31 so4 corosync[12457]: [TOTEM ] Retransmit List: 2a5 28b 28d 28e 295 296 297 298 299 29a 29b 29c 29d 29e 29f 2a0 2a1 2a2 2a3 2a4 2a6

That has nothing to do with Linux HA; this belongs to the corosync list.

Or, as always, to support, if you want to have it fixed in our product
;-)

> Those messages appear and disappear without apparent reason. Maybe the reason is having two independent rings combined with poor logging. Here is how the situation switches:

It's a corosync issue affecting some network environments that we are
actively tracing. It'll sometimes happen even with one ring, and
persists even in 1.4.5. Alas. If you report it to support, we can add
that environment as a data point.


Regards,
Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
