Mailing List Archive

Corosync fails to start when NIC is absent
Hi guys,

Corosync fails to start if a network interface referenced in corosync.conf is
not configured in the system.
Even with "rrp_mode: passive" the problem is the same when at least one of the
referenced network interfaces is not configured in the system.

Is this the expected behavior?
I thought that when you use redundant rings, it is enough to have at least
one of the NICs configured in the system. Am I wrong?

Thank you,
Kostya
Re: Corosync fails to start when NIC is absent [ In reply to ]
According to https://access.redhat.com/solutions/638843 , the interface
that is defined in corosync.conf must be present in the system (see the
"ROOT CAUSE" section at the bottom of the article).
To confirm that, I ran a couple of tests.

Here is the relevant part of the corosync.conf file (paraphrased; the
original config file is also attached):
===============================
rrp_mode: passive
ring0_addr is defined in corosync.conf
ring1_addr is defined in corosync.conf
===============================
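For reference, a redundant-ring configuration of that shape might look roughly like the sketch below. This is illustrative only; the transport and all addresses are assumptions, not taken from the attached file (169.254.1.3 is borrowed from the log excerpt later in this message):

```
totem {
    version: 2
    rrp_mode: passive
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 10.0.0.1       # illustrative; must be assigned on this node
        ring1_addr: 169.254.1.3    # illustrative; must be assigned on this node
    }
}
```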

-------------------------------

Two-node cluster

-------------------------------

Test #1:
--------------------------------------------------
IP for ring0 is not defined on the system:
--------------------------------------------------
Start Corosync simultaneously on both nodes.
Corosync fails to start.
From the logs:
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
config: No interfaces defined
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
Engine exiting with status 8 at main.c:1343.
Result: Corosync and Pacemaker are not running.

Test #2:
--------------------------------------------------
IP for ring1 is not defined on the system:
--------------------------------------------------
Start Corosync simultaneously on both nodes.
Corosync starts.
Start Pacemaker simultaneously on both nodes.
Pacemaker fails to start.
From the logs, the last entries from corosync:
Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0
interface 169.254.1.3 FAULTY
Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically
recovered ring 0
Result: Corosync and Pacemaker are not running.


Test #3:

"rrp_mode: active" leads to the same result, except Corosync and Pacemaker
init scripts return status "running".
But still "vim /var/log/cluster/corosync.log" shows a lot of errors like:
Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
to the CPG API failed: Library error (2)

Result: Corosync and Pacemaker report their status as "running", but
"crm_mon" cannot connect to the cluster database, and half of
Pacemaker's services are not running (including the Cluster Information Base
(CIB)).


-------------------------------

Single-node mode

-------------------------------

IP for ring0 is not defined on the system:

Corosync fails to start.

IP for ring1 is not defined on the system:

Corosync and Pacemaker are started.

Sometimes the configuration is applied successfully (roughly 50% of the time);
sometimes the cluster is not running any resources; sometimes the node cannot
be put into standby mode (it shows a communication error); and sometimes the
cluster runs all resources, but the applied configuration is not guaranteed to
be fully loaded (some rules can be missing).


-------------------------------

Conclusions:

-------------------------------

In some rare cases (see the comments to the bug) the cluster may keep
working, but its state is unstable and it can stop working at any moment.


So, is this correct? Do my assumptions make sense? I couldn't find any other
explanation on the net.



Thank you,
Kostya

Re: Corosync fails to start when NIC is absent [ In reply to ]
Kostiantyn,



Corosync needs all interfaces during startup and at runtime. This doesn't
mean they must be connected (that would make corosync useless for surviving a
physical NIC, switch, or cable failure), but they must be up and have the
correct IP.

When this is not the case, corosync rebinds to localhost and weird things
happen. Removing this rebinding has long been on the TODO list, but there are
still more important bugs (especially because the rebind can be avoided).
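Since the rebind is triggered by a ring address that is not assigned locally, one way to avoid it is to check the addresses before starting corosync. Below is a minimal sketch of such a preflight check; it is hypothetical (not part of corosync) and only assumes the "ringX_addr:" syntax shown earlier in the thread:

```python
# Hypothetical preflight check: verify that every ringX_addr listed in
# corosync.conf is assigned to a local interface, since corosync rebinds
# to localhost when a configured address is missing.
import re
import socket

def ring_addrs(conf_text):
    """Extract all ring0_addr/ring1_addr values from corosync.conf text."""
    return re.findall(r"ring[01]_addr:\s*(\S+)", conf_text)

def addr_is_local(addr):
    """Return True if 'addr' can be bound locally, i.e. it is assigned
    to some interface on this host."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind((addr, 0))
        s.close()
        return True
    except OSError:
        return False

def check_conf(conf_text):
    """Return the list of ring addresses that are NOT assigned locally."""
    return [a for a in ring_addrs(conf_text) if not addr_is_local(a)]
```

Running check_conf() on the node's corosync.conf and refusing to start corosync while it returns a non-empty list would avoid the localhost rebind described above.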

Regards,
Honza



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Corosync fails to start when NIC is absent [ In reply to ]
Honza,

Thank you for helping me.
So, there is no defined behavior when one of the interfaces is not present in
the system?


Thank you,
Kostya

Re: Corosync fails to start when NIC is absent [ In reply to ]
Kostiantyn,

> Honza,
>
> Thank you for helping me.
> So, there is no defined behavior in case one of the interfaces is not in
> the system?

You are right. There is no defined behavior.

Regards,
Honza




Re: Corosync fails to start when NIC is absent [ In reply to ]
Thank you. Now I am aware of it.

Thank you,
Kostya

Re: Corosync fails to start when NIC is absent [ In reply to ]
One more thing to clarify.
You said "rebind can be avoided" - what does it mean?

Thank you,
Kostya

On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko <
konstantin.ponomarenko@gmail.com> wrote:

> Thank you. Now I am aware of it.
>
> Thank you,
> Kostya
>
> On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse@redhat.com> wrote:
>
>> Kostiantyn,
>>
>> > Honza,
>> >
>> > Thank you for helping me.
>> > So, there is no defined behavior in case one of the interfaces is not in
>> > the system?
>>
>> You are right. There is no defined behavior.
>>
>> Regards,
>> Honza
>>
>>
>> >
>> >
>> > Thank you,
>> > Kostya
>> >
>> > On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse@redhat.com>
>> wrote:
>> >
>> >> Kostiantyn,
>> >>
>> >>
>> >>> According to the https://access.redhat.com/solutions/638843 , the
>> >>> interface, that is defined in the corosync.conf, must be present in
>> the
>> >>> system (see at the bottom of the article, section "ROOT CAUSE").
>> >>> To confirm that I made a couple of tests.
>> >>>
>> >>> Here is a part of the corosync.conf file (in a free-write form) (also
>> >>> attached the origin config file):
>> >>> ===============================
>> >>> rrp_mode: passive
>> >>> ring0_addr is defined in corosync.conf
>> >>> ring1_addr is defined in corosync.conf
>> >>> ===============================
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Two-node cluster
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Test #1:
>> >>> --------------------------------------------------
>> >>> IP for ring0 is not defines in the system:
>> >>> --------------------------------------------------
>> >>> Start Corosync simultaneously on both nodes.
>> >>> Corosync fails to start.
>> >>> From the logs:
>> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
>> >>> config: No interfaces defined
>> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync
>> Cluster
>> >>> Engine exiting with status 8 at main.c:1343.
>> >>> Result: Corosync and Pacemaker are not running.
>> >>>
>> >>> Test #2:
>> >>> --------------------------------------------------
>> >>> IP for ring1 is not defines in the system:
>> >>> --------------------------------------------------
>> >>> Start Corosync simultaneously on both nodes.
>> >>> Corosync starts.
>> >>> Start Pacemaker simultaneously on both nodes.
>> >>> Pacemaker fails to start.
>> >>> From the logs, the last writes from the "corosync":
>> >>> Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid
>> 0
>> >>> interface 169.254.1.3 FAULTY
>> >>> Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ]
>> Automatically
>> >>> recovered ring 0
>> >>> Result: Corosync and Pacemaker are not running.
>> >>>
>> >>>
>> >>> Test #3:
>> >>>
>> >>> "rrp_mode: active" leads to the same result, except Corosync and
>> >> Pacemaker
>> >>> init scripts return status "running".
>> >>> But still "vim /var/log/cluster/corosync.log" shows a lot of errors
>> like:
>> >>> Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
>> Connection
>> >>> to the CPG API failed: Library error (2)
>> >>>
>> >>> Result: Corosync and Pacemaker show their statuses as "running", but
>> >>> "crm_mon" cannot connect to the cluster database. And half of the
>> >>> Pacemaker's services are not running (including Cluster Information
>> Base
>> >>> (CIB)).
>> >>>
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> For a single node mode
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> IP for ring0 is not defines in the system:
>> >>>
>> >>> Corosync fails to start.
>> >>>
>> >>> IP for ring1 is not defined in the system:
>> >>>
>> >>> Corosync and Pacemaker are started.
>> >>>
>> >>> The outcome varies: the configuration may be applied successfully (about
>> >>> 50% of the time); the cluster may not run any resources; the node may
>> >>> refuse to be put into standby mode (showing a communication error); or
>> >>> the cluster may run all resources, but without a guarantee that the
>> >>> applied configuration is fully loaded (some rules can be missed).
>> >>>
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Conclusions:
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> In some rare cases (see the comments to the bug) the cluster may still
>> >>> work, but its working state is unstable and the cluster can stop
>> >>> working at any moment.
>> >>>
>> >>>
>> >>> So, is this correct? Do my assumptions make any sense? I couldn't find
>> >>> any other explanation on the net.
>> >>
>> >> Corosync needs all interfaces during start and runtime. This doesn't
>> >> mean they must be connected (that would make corosync unusable on
>> >> physical NIC/switch or cable failure), but they must be up and have the
>> >> correct IP.
>> >>
>> >> When this is not the case, corosync rebinds to localhost and weird
>> >> things happen. Removing this rebinding is a long-standing TODO, but
>> >> there are still more important bugs (especially because the rebind can
>> >> be avoided).
>> >>
>> >> Regards,
>> >> Honza
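Honza's point above — that every configured ring address must be up with the correct IP before corosync starts, or it rebinds to localhost — can be turned into a small pre-flight check. The sketch below is a hypothetical helper, not part of corosync: the parsing assumes `ringX_addr: <ip>` lines in the config, and the interface dump is passed in as an argument (e.g. the output of `ip -o addr show`) so the function stays testable anywhere.

```shell
# check_ring_addrs CONF DUMP
# CONF: path to a corosync.conf containing "ringX_addr: <ip>" lines
# DUMP: local address listing, e.g. "$(ip -o addr show)"
# Returns non-zero if any ring address is not bound to a local interface.
check_ring_addrs() {
    local conf=$1 addrs=$2 addr missing=0
    # Pull the last field of every ringX_addr line, regardless of indentation.
    for addr in $(awk '/ring[0-9]+_addr/ {print $NF}' "$conf"); do
        if printf '%s\n' "$addrs" | grep -qw "$addr"; then
            echo "OK: $addr is configured locally"
        else
            echo "MISSING: $addr is not on any local interface" >&2
            missing=1
        fi
    done
    return $missing
}
```

In practice an init wrapper could run `check_ring_addrs /etc/corosync/corosync.conf "$(ip -o addr show)"` and refuse to start corosync on a non-zero return, avoiding the localhost rebind entirely.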
>> >>
>> >>>
>> >>>
>> >>>
>> >>> Thank you,
>> >>> Kostya
>> >>>
>> >>> _______________________________________________
>> >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>>
>> >>> Project Home: http://www.clusterlabs.org
>> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >>> Bugs: http://bugs.clusterlabs.org
Re: Corosync fails to start when NIC is absent [ In reply to ]
Kostiantyn,


> One more thing to clarify.
> You said "rebind can be avoided" - what does it mean?

By that I mean that as long as you don't shut down the interface, everything
will work as expected. Shutting an interface down is an administrator's
decision; the system doesn't do it automagically :)

Regards,
Honza
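
For reference, the "two rings" setup being tested reads in corosync.conf roughly as follows. This is an illustrative sketch only — the networks and ports are assumptions, not the poster's actual values (the real config was attached to the first mail):

```
totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0        # assumed ring0 network
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.1.0        # assumed ring1 network
        mcastaddr: 239.255.2.1
        mcastport: 5407
    }
}
```

Per the explanation above, both bindnetaddr networks must correspond to addresses actually configured on the node, or corosync hits the rebind behavior described in this thread.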

>
> Thank you,
> Kostya
>
> On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko <
> konstantin.ponomarenko@gmail.com> wrote:
>
>> Thank you. Now I am aware of it.
>>
>> Thank you,
>> Kostya
>>
>> On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse@redhat.com> wrote:
>>
>>> Kostiantyn,
>>>
>>>> Honza,
>>>>
>>>> Thank you for helping me.
>>>> So, there is no defined behavior in case one of the interfaces is not in
>>>> the system?
>>>
>>> You are right. There is no defined behavior.
>>>
>>> Regards,
>>> Honza
Re: Corosync fails to start when NIC is absent [ In reply to ]
Got it. Thank you =)
I was just thinking about the possibility of a NIC burning out.

Thank you,
Kostya

Re: Corosync fails to start when NIC is absent [ In reply to ]
On Tue, Jan 20, 2015 at 11:50 AM, Jan Friesse <jfriesse@redhat.com> wrote:
> Kostiantyn,
>
>
>> One more thing to clarify.
>> You said "rebind can be avoided" - what does it mean?
>
> By that I mean that as long as you don't shutdown interface everything
> will work as expected. Interface shutdown is administrator decision,
> system doesn't do it automagically :)
>

What is possible, though, is that an interface (the hardware) fails e.g.
during reboot and goes undetected. That would lead to a complete outage of
the node, which could otherwise be avoided.
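
Per-ring health can also be watched at runtime with `corosync-cfgtool -s`, whose status output marks failed rings as FAULTY (as in the log lines quoted earlier in the thread). A tiny hypothetical wrapper might look like this — the status text is passed in as an argument, so the function itself does not require a running cluster:

```shell
# ring_faults STATUS_TEXT
# STATUS_TEXT: ring status listing, e.g. "$(corosync-cfgtool -s)"
# Prints any FAULTY ring lines and returns non-zero if one is found.
ring_faults() {
    local status=$1 faulty
    faulty=$(printf '%s\n' "$status" | grep -i 'FAULTY')
    if [ -n "$faulty" ]; then
        printf '%s\n' "$faulty"
        return 1
    fi
    echo "all rings healthy"
}
```

Something like `ring_faults "$(corosync-cfgtool -s)"` in a periodic monitoring job would then catch a faulty ring before it escalates into the kind of outage described above.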
