Mailing List Archive

Losing corosync communication clusterwide
Hello,

I just had an issue on my Pacemaker setup: my dlm/clvm/gfs2 stack was
blocked.

The “dlm_tool ls” command told me “wait ringid”.

The corosync-* commands hang (corosync-quorumtool, for example).

The Pacemaker “crm_mon” displays nothing wrong.

I'm using Ubuntu Trusty Tahr:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

I had to reboot the cluster manually.

Any idea how to debug such a situation?
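
A few generic inspection commands can at least show where corosync is
stuck (a sketch; it assumes standard debugging tools are installed and
Trusty's default log location):

corosync-cfgtool -s                     # ring status; may itself hang if the IPC is stuck
strace -f -p $(pidof corosync)          # which system call each thread is blocked in
gdb -p $(pidof corosync) -batch -ex 'thread apply all bt'   # full thread backtraces
grep -i -e totem -e token /var/log/syslog                   # log location may differ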

Regards.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Losing corosync communication clusterwide [ In reply to ]
Daniel Dehennin <daniel.dehennin@baby-gnu.org> writes:

> Hello,

Hello,

> I just had an issue on my Pacemaker setup: my dlm/clvm/gfs2 stack was
> blocked.
>
> The “dlm_tool ls” command told me “wait ringid”.

It happened again:

root@nebula2:~# dlm_tool ls
dlm lockspaces
name datastores
id 0x1b61ba6a
flags 0x00000004 kern_stop
change member 4 joined 1 remove 0 failed 0 seq 3,3
members 1084811078 1084811079 1084811080 1084811119
new change member 3 joined 0 remove 1 failed 1 seq 4,4
new status wait ringid
new members 1084811078 1084811079 1084811080

name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 4 joined 1 remove 0 failed 0 seq 3,3
members 1084811078 1084811079 1084811080 1084811119
new change member 3 joined 0 remove 1 failed 1 seq 4,4
new status wait ringid
new members 1084811078 1084811079 1084811080

root@nebula2:~# dlm_tool status
cluster nodeid 1084811079 quorate 1 ring seq 21372 21372
daemon now 8351 fence_pid 0
fence 1084811119 nodedown pid 0 actor 0 fail 1415634527 fence 0 now 1415634734
node 1084811078 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 1084811079 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 1084811080 M add 5089 rem 0 fail 0 fence 0 at 0 0
node 1084811119 X add 5766 rem 8144 fail 8144 fence 0 at 0 0

Any idea?
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Losing corosync communication clusterwide [ In reply to ]
I think you don't have fencing configured in your cluster.
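
The cluster itself can confirm this either way (a sketch, assuming
crmsh and pacemaker's stonith_admin are available on the nodes):

crm configure show | grep -i stonith    # stonith-enabled property and any stonith primitives
stonith_admin --list-registered         # fence devices the cluster actually registered
crm_verify -L -V                        # sanity-check the live CIB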

2014-11-10 17:02 GMT+01:00 Daniel Dehennin <daniel.dehennin@baby-gnu.org>:
> [...]
>
> Any idea?



--
this is my life and I live it as long as God wills

Re: Losing corosync communication clusterwide [ In reply to ]
A hanging corosync sounds like a libqb problem: Trusty comes with 0.16, which likes to hang from time to time. Try building libqb 0.17.
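
A source build could look roughly like this (a sketch; the tarball URL
and package name are assumptions, check the ClusterLabs releases):

apt-get build-dep libqb                 # pull in the build dependencies
wget https://github.com/ClusterLabs/libqb/archive/v0.17.0.tar.gz
tar xzf v0.17.0.tar.gz && cd libqb-0.17.0
./autogen.sh && ./configure --prefix=/usr
make && make install
# corosync and pacemaker link against libqb, so restart the whole stack afterwards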

Daniel Dehennin <daniel.dehennin@baby-gnu.org> wrote:
>[...]
>
>Any idea how to debug such a situation?

--
Sent from K-9 Mail.
Re: Losing corosync communication clusterwide [ In reply to ]
emmanuel segura <emi2fast@gmail.com> writes:

> I think you don't have fencing configured in your cluster.

I have fencing configured and working, modulo fencing VMs on a dead host[1].

Regards.

Footnotes:
[1] http://oss.clusterlabs.org/pipermail/pacemaker/2014-November/022965.html

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Losing corosync communication clusterwide [ In reply to ]
Tomasz Kontusz <tomasz.kontusz@gmail.com> writes:

> A hanging corosync sounds like a libqb problem: Trusty comes with 0.16,
> which likes to hang from time to time. Try building libqb 0.17.

Thanks, I'll look at this.

Is there a way to get back to a normal state without rebooting all the
machines and interrupting services?

I thought about a lightweight procedure, something like the following
(a rough shell sketch follows the list):

1. stop Pacemaker on all nodes without doing anything to the resources,
so they all keep running

2. stop corosync on all nodes

3. start corosync on all nodes

4. start Pacemaker on all nodes; as the services are still running,
nothing needs to be done
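
In shell terms, roughly (a sketch, assuming crmsh and the Trusty init
scripts; whether dlm and clvmd tolerate corosync restarting underneath
them is a separate question):

crm configure property maintenance-mode=true   # on one node: Pacemaker stops managing resources

service pacemaker stop    # on every node; managed services keep running
service corosync stop     # on every node, once Pacemaker is down

service corosync start    # on every node, once corosync is down everywhere
service pacemaker start   # on every node

crm configure property maintenance-mode=false  # on one node, once crm_mon shows all nodes online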

I looked in the documentation but failed to find any kind of cluster
management best practices.

Regards.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Losing corosync communication clusterwide [ In reply to ]
> On 11 Nov 2014, at 4:39 am, Daniel Dehennin <daniel.dehennin@baby-gnu.org> wrote:
>
> emmanuel segura <emi2fast@gmail.com> writes:
>
>> I think you don't have fencing configured in your cluster.
>
> I have fencing configured and working, modulo fencing VMs on a dead host[1].

Are you saying that the host and the VMs running inside it are both part of the same cluster?

>
> Footnotes:
> [1] http://oss.clusterlabs.org/pipermail/pacemaker/2014-November/022965.html


Re: Losing corosync communication clusterwide [ In reply to ]
Andrew Beekhof <andrew@beekhof.net> writes:


[...]

>> I have fencing configured and working, modulo fencing VMs on a dead host[1].
>
> Are you saying that the host and the VMs running inside it are both part of the same cluster?

Yes, one of the VMs needs to access the GFS2 filesystem like the nodes do,
and the other VM is a quorum node (standby=on).

Regards.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Losing corosync communication clusterwide [ In reply to ]
> On 11 Nov 2014, at 10:12 pm, Daniel Dehennin <daniel.dehennin@baby-gnu.org> wrote:
>
> Andrew Beekhof <andrew@beekhof.net> writes:
>
>
> [...]
>
>>> I have fencing configured and working, modulo fencing VMs on a dead host[1].
>>
>> Are you saying that the host and the VMs running inside it are both part of the same cluster?
>
> Yes, one of the VMs needs to access the GFS2 filesystem like the nodes do,
> and the other VM is a quorum node (standby=on).

That sounds like a recipe for disaster, to be honest.
If you want VMs to be part of a cluster, it would be advisable to have their host(s) be in a different one.
Re: Losing corosync communication clusterwide [ In reply to ]
Tomasz Kontusz <tomasz.kontusz@gmail.com> writes:

> A hanging corosync sounds like a libqb problem: Trusty comes with 0.16,
> which likes to hang from time to time. Try building libqb 0.17.

It was already reported on the Ubuntu tracker[1].

Regards.

Footnotes:
[1] https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF