Mailing List Archive

Epsilon and root move from an aggr to another
Hi everybody,
Last Friday, during an operation that was presented as non-disruptive (NDO), we had a service interruption on the NAS side.
We had to move the root aggregate from some old disks to new ones, and we followed to the letter the procedure reported here (our cDOT is 8.3.2P9 on a 4-node cluster):

https://kb.netapp.com/app/answers/answer_view/a_id/1030179

Put very simply, it says:
A. Check whether epsilon is on the node you have to migrate and, if so, move it to another node
A.1 There's a warning about SAN protocol interruptions, but we did NOT have SAN protocols running, only NFS/CIFS.
B. Migrate the LIFs after the aggregate relocation
Well, NFS was restarted and all the servers and apps depending on it went down! I'll let you imagine the customer's reaction...
Also, the console, after this command:
    system node modify -node node01 -eligibility false
gave us a warning about SAN disruption. As I wrote, that did not matter to us.

Only afterwards did we find this in the manual; but, as usual, manuals are less up to date than the knowledge base, so they are the last place you would look for fresh information!

https://library.netapp.com/ecmdocs/ECMP1367947/html/GUID-AB52F821-3A25-4E02-B1EF-1B09EBE4009D.html

Moving epsilon for certain manually initiated takeovers
Note: Although cluster formation voting can be modified by using the cluster modify -eligibility false command, you should avoid this except for situations such as restoring the node configuration or prolonged node maintenance. If you set a node to be ineligible, it stops serving SAN data until the node is reset to eligible and rebooted. NAS data access to the node might also be affected when the node is ineligible.

And what does "might" mean here? I translate it as "nobody knows, try it"...

Now the most important question (we must migrate the other three nodes!) is this:
Assuming we have understood correctly that the order should be 1. migrate the LIFs and only then 2. epsilon false, is there an official answer or document with up-to-date information confirming that this is the right procedure to avoid interrupting NAS protocols as well?

Thank you very much,


Dott. Giacomo Milazzo
Senior Consultant & Technical Account Manager
mobile: +39 340.6001045
@-mail: g.milazzo@sinergy.it
Web: http://www.sinergy.it

SINERGY SpA Viale dei Santi Pietro e Paolo 50
00144 - Roma RM Tel. +39 06 44243674 Fax +39 06 44245272


_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
Re: Epsilon and root move from an aggr to another
Hmmmmm,

Going through the steps in the KB, I would have done the epsilon and
eligibility steps (Step 1 in the KB) right before the reboot (Step 8),
*after* moving the aggregate and the LIFs away from the node to be worked
on.

At that point in time it shouldn't disturb anything, since no user traffic
should pass through this node's interfaces or disks.

What do you think?
(I'm a little unclear about the meaning of "NFS was restarted", but I have
a feeling the above change in sequence should help)

Also, if you look at the revert steps, the KB first restores eligibility
and HA failover and only at the end reverts aggregates and LIFs.
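
To make the ordering argument concrete, here is a minimal Python sketch that checks a maintenance plan for the one constraint discussed above: the node should only be made ineligible after its aggregates and LIFs have been moved away. The step labels and the function name ordering_problems are hypothetical and exist only for this illustration. The only real command shown is the one quoted earlier in the thread. This is not an official NetApp procedure, just a way of writing the sequence down.

```python
# Step labels are hypothetical names for this sketch, not KB step titles.
RELOCATE_AGGRS = "relocate aggregates away from the node"
MIGRATE_LIFS = "migrate LIFs away from the node"
MOVE_EPSILON = "move epsilon to another node"
# The only literal command here is the one quoted in the original post.
SET_INELIGIBLE = "system node modify -node node01 -eligibility false"
REBOOT = "reboot the node"

# Assumption of this sketch: these steps must be done before the node
# is made ineligible, so that no user traffic depends on it any more.
MUST_PRECEDE_INELIGIBLE = (RELOCATE_AGGRS, MIGRATE_LIFS)


def ordering_problems(plan):
    """Return a list of messages for steps that come too late in `plan`."""
    if SET_INELIGIBLE not in plan:
        return []
    done_before = plan[: plan.index(SET_INELIGIBLE)]
    return [
        f"'{step}' should be completed before the node is set ineligible"
        for step in MUST_PRECEDE_INELIGIBLE
        if step not in done_before
    ]


# Order as described in the original post (epsilon/eligibility first).
order_as_run = [MOVE_EPSILON, SET_INELIGIBLE, RELOCATE_AGGRS, MIGRATE_LIFS, REBOOT]
# Order suggested in this reply (epsilon/eligibility right before the reboot).
order_suggested = [RELOCATE_AGGRS, MIGRATE_LIFS, MOVE_EPSILON, SET_INELIGIBLE, REBOOT]

if __name__ == "__main__":
    for name, plan in (("as run", order_as_run), ("suggested", order_suggested)):
        print(name, "->", ordering_problems(plan) or "no ordering problems")
```

The check only encodes the order of steps. It says nothing about whether NAS access is actually affected while a node is ineligible, which is exactly the point that would still need an official answer.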

Regards

Sebastian


On Mon, Mar 12, 2018, 18:39 Milazzo Giacomo <G.Milazzo@sinergy.it> wrote:

--

sent from my mobile, spellcheck might have messed up...
Re: Epsilon and root move from an aggr to another
Hi Sebastian,

Thanks for your answer. How are you?
A long time has passed since the cDOT course we attended together in Vienna.
:-)

So it seems you agree with me that the KB sequence is wrong (there's also another mistake, where it says to revert the LIFs, but that one is easy to spot).

By "NFS restart" I mean an NFS stop/start cycle that caused the loss of communication.

Indeed, the revert steps follow the same order you're suggesting.

By the way, that KB gives inadequate information and should be corrected ASAP: either it fails to mention the NAS interruption or the sequence is wrong.
Also, the console prompt for the command does not warn about NAS.

Lastly, the customer would like an official response about this sequence. Do you think there might be some documentation?

The customer can accept further risk only if the steps are "certified". Otherwise they will plan an application stop to avoid disruptions.

Regards




Sent by Mobile

On 12 Mar 2018 20:39, "Sebastian P. Goetze" <spgoetze@gmail.com> wrote: