Mailing List Archive: [nova][publiccloud-wg] Proposal to shelve on stop/suspend

[nova][publiccloud-wg] Proposal to shelve on stop/suspend

Sep 14, 2018, 4:25 PM

Post #1 of 3 (460 views)

tl;dr: I'm proposing a new parameter to the server stop (and suspend?)
APIs to control whether nova shelve offloads the server.

Long form: This came up during the public cloud WG session this week
based on a couple of feature requests [1][2]. When a user stops/suspends
a server, the hypervisor frees up resources on the host but nova
continues to track those resources as being used on the host so the
scheduler can't put more servers there. What operators would like is
for nova to actually shelve offload the server when a user stops it,
so that new servers can be scheduled on that host. On start/resume of
the server, nova would find a new host for the server.
This also came up in Vancouver where operators would like to free up
limited expensive resources like GPUs when the server is stopped. This
is also the behavior in AWS.
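
To make the accounting problem concrete, here is a minimal Python sketch.
It is not nova code: the Host and Server classes and their method names
are hypothetical, and only vCPUs are modeled. It shows that a stopped
server still counts against the host until it is shelve offloaded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    vcpus: int
    status: str  # e.g. "ACTIVE", "SHUTOFF", "SHELVED_OFFLOADED"


@dataclass
class Host:
    """Hypothetical stand-in for a compute host's tracked capacity."""
    total_vcpus: int
    servers: List[Server] = field(default_factory=list)

    def tracked_free_vcpus(self) -> int:
        # Every server assigned to the host counts, running or stopped.
        return self.total_vcpus - sum(s.vcpus for s in self.servers)

    def shelve_offload(self, name: str) -> Server:
        # Offloading removes the server from the host entirely, which is
        # what releases the tracked capacity for the scheduler.
        for index, server in enumerate(self.servers):
            if server.name == name:
                offloaded = self.servers.pop(index)
                offloaded.status = "SHELVED_OFFLOADED"
                return offloaded
        raise KeyError(name)


host = Host(total_vcpus=8, servers=[
    Server("web", 4, "ACTIVE"),
    Server("batch", 4, "SHUTOFF"),  # stopped, but still tracked
])
free_before = host.tracked_free_vcpus()
host.shelve_offload("batch")
free_after = host.tracked_free_vcpus()
```

In this toy model the host looks full while "batch" is merely stopped,
and half of it becomes schedulable again once "batch" is offloaded.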

The problem with shelve is that it's great for operators but users just
don't use it, maybe because they don't know what it is and stop works
just fine. So how do you get users to opt into shelving their server?

I've proposed a high-level blueprint [3] where we'd add a new
(microversioned) parameter to the stop API with three options:

* auto
* offload
* retain

Naming is obviously up for debate. The point is we would default to auto
and if auto is used, the API checks a config option to determine the
behavior - offload or retain. By default we would retain for backward
compatibility. For users that don't care, they get auto and it's fine.
For users that do care, they either (1) don't opt into the microversion
or (2) specify the specific behavior they want. I don't think we need to
expose what the cloud's configuration for auto is because again, if you
don't care then it doesn't matter and if you do care, you can opt out of
this.
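
The resolution logic described above could be sketched roughly as
follows. This is an illustration only: StopBehavior,
resolve_stop_behavior and configured_default are hypothetical names, not
existing nova code or config options, and rejecting the parameter on
older microversions is an assumption.

```python
from enum import Enum
from typing import Optional


class StopBehavior(Enum):
    AUTO = "auto"
    OFFLOAD = "offload"
    RETAIN = "retain"


def resolve_stop_behavior(
    requested: Optional[str],
    supports_new_microversion: bool,
    configured_default: StopBehavior = StopBehavior.RETAIN,
) -> StopBehavior:
    """Return the behavior a stop request would end up with."""
    if not supports_new_microversion:
        # Older microversions do not know the parameter (assumption:
        # sending it would be an error) and keep today's behavior.
        if requested is not None:
            raise ValueError("parameter requires the new microversion")
        return StopBehavior.RETAIN

    # With the new microversion, omitting the parameter means "auto".
    behavior = StopBehavior(requested) if requested else StopBehavior.AUTO
    if behavior is StopBehavior.AUTO:
        # "auto" defers to the operator's config, which must itself be
        # a concrete choice.
        if configured_default is StopBehavior.AUTO:
            raise ValueError("config must be 'offload' or 'retain'")
        return configured_default
    return behavior
```

A user who cares passes "offload" or "retain" explicitly (or stays on an
older microversion); everyone else gets whatever the operator configured.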

"How do we get users to use the new microversion?" I'm glad you asked.

Well, nova CLI defaults to using the latest available microversion
negotiated between the client and the server, so by default, anyone
using "nova stop" would get the 'auto' behavior (assuming the client and
server are new enough to support it). Long-term, openstack client plans
on doing the same version negotiation.
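
For readers unfamiliar with the negotiation, the idea is roughly "use
the highest microversion both sides support". The sketch below is a
simplified illustration, not the actual client code; parse and negotiate
are hypothetical helper names, and the version numbers in the example
are arbitrary.

```python
from typing import Optional, Tuple

Version = Tuple[int, int]


def parse(version: str) -> Version:
    # "2.60" -> (2, 60), so that 2.10 compares greater than 2.9.
    major, minor = version.split(".")
    return int(major), int(minor)


def negotiate(client_min: str, client_max: str,
              server_min: str, server_max: str) -> Optional[str]:
    """Return the highest microversion in both ranges, or None."""
    low = max(parse(client_min), parse(server_min))
    high = min(parse(client_max), parse(server_max))
    if low > high:
        return None  # the ranges do not overlap
    return f"{high[0]}.{high[1]}"


# A newer client talking to an older server settles on the server's max.
chosen = negotiate("2.1", "2.65", "2.1", "2.60")
```

Under this scheme a user only gets the 'auto' behavior once both the
client and the server have been upgraded past the microversion that
introduces the new parameter.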

As for the server status changes, if the server is stopped and shelved,
the status would be 'SHELVED_OFFLOADED' rather than 'SHUTOFF'. I
believe this is fine especially if a user is not being specific and
doesn't care about the actual backend behavior. On start, the API would
allow starting (unshelving) shelved offloaded (rather than just stopped)
instances. Trying to hide shelved servers as stopped in the API would be
overly complex IMO so I don't want to try and mask that.
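
The start-side change could look roughly like the sketch below. It is
hypothetical: action_for_start is an invented helper, not nova code. The
status strings are the ones the compute API reports, where a stopped
server shows up as SHUTOFF.

```python
STOPPED = "SHUTOFF"  # API status of a stopped server
SHELVED_OFFLOADED = "SHELVED_OFFLOADED"


def action_for_start(status: str) -> str:
    """Map a server's current status to what a start request would do."""
    if status == STOPPED:
        # Today's behavior: power the server on where it already lives.
        return "start"
    if status == SHELVED_OFFLOADED:
        # Proposed: treat start as unshelve, which goes through the
        # scheduler to pick a host.
        return "unshelve"
    raise ValueError(f"cannot start a server in status {status!r}")
```

The status itself is not masked: the user still sees SHELVED_OFFLOADED,
and only the start request accepts one more state.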

It is possible that a user who stopped and shelved their server could
hit a NoValidHost when starting (unshelving) the server, but that really
shouldn't happen in a cloud that's configuring nova to shelve by default
because if they are doing this, their SLA needs to reflect they have the
capacity to unshelve the server. If you can't honor that SLA, don't
shelve by default.

So, what are the general feelings on this before I go off and start
writing up a spec?

[1] https://bugs.launchpad.net/openstack-publiccloud-wg/+bug/1791681
[2] https://bugs.launchpad.net/openstack-publiccloud-wg/+bug/1791679
[3] https://blueprints.launchpad.net/nova/+spec/shelve-on-stop

--

Thanks,

Matt

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Re: [openstack-dev] [nova][publiccloud-wg] Proposal to shelve on stop/suspend

Tim.Bell at cern

Sep 15, 2018, 5:38 AM

Post #2 of 3 (460 views)

One extra user motivation that came up during past forums was to have a different quota for shelved instances (or to remove them from the project quota altogether). Currently, I believe that a shelved instance still counts towards the instances/cores quota, so the user's reduced usage is not reflected in the quotas.

One point raised at the time was that the user is still reserving IPs and the instances still occupy storage, so the resource usage is not zero.

(We disabled shelving for other reasons so I'm not able to check easily)

Tim

-----Original Message-----
From: Matt Riedemann <mriedemos@gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>
Date: Saturday, 15 September 2018 at 01:27
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev@lists.openstack.org>, "openstack-operators@lists.openstack.org" <openstack-operators@lists.openstack.org>, "openstack-sigs@lists.openstack.org" <openstack-sigs@lists.openstack.org>
Subject: [openstack-dev] [nova][publiccloud-wg] Proposal to shelve on stop/suspend

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][publiccloud-wg] Proposal to shelve on stop/suspend

Tim.Bell at cern

Sep 15, 2018, 7:51 AM

Post #3 of 3 (460 views)

Found the previous discussion at http://lists.openstack.org/pipermail/openstack-operators/2016-August/011321.html from 2016.

Tim
