Mailing List Archive

Circuit Breakers interaction with Shards
This got zero responses on the solr-user list, so I’ll raise the issue here.

Should circuit breakers only kill external search requests and not cluster-internal requests to shards?

Circuit breakers can kill any request, whether it is a client request from outside the cluster or an internal distributed request to a shard. Killing a portion of a distributed request will affect the main request. I’m not sure whether a 503 from a shard will kill the whole request or cause partial results, but either way it isn’t good.

We run with 8 shards. If a circuit breaker is killing 10% of requests on each host, that will hit 57% of all external requests (0.9^8 = 0.43). That seems like “overkill” to me. If it only kills external requests, then 10% means 10%.
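
The arithmetic above can be checked with a short sketch. It assumes that each shard rejects independently with the same probability and that one rejected shard sub-request counts as a hit on the external request (failure or partial results, which is left open above). The function name is illustrative only.

```python
def fraction_of_requests_hit(per_shard_reject: float, shards: int) -> float:
    """Probability that at least one shard sub-request is rejected.

    Assumes rejections are independent across shards, which is a
    simplification and not something the thread establishes.
    """
    all_shards_ok = (1.0 - per_shard_reject) ** shards
    return 1.0 - all_shards_ok


if __name__ == "__main__":
    # Show how the hit rate grows with fanout at a 10% per-host kill rate.
    for shards in (1, 2, 4, 8, 16):
        hit = fraction_of_requests_hit(0.10, shards)
        print(f"{shards:2d} shards: {hit:.0%} of external requests hit")
```

With one shard the result equals the per-host rate, which is the "10% means 10%" case for external-only rejection.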

Killing only external requests requires that external requests go roughly equally to all hosts in the cluster, or at least all NRT or PULL replicas.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Circuit Breakers interaction with Shards
The issue with this is that it can still lead to node outages if the
fanout for a query is high.

Circuit breakers follow a simple rule -- defend the node at the cost of
degraded responses.

Ideally, only a few requests will be completely rejected -- some will see
partial results. Because circuit breakers are non-discriminating, the
typical blip in service quality due to high resource usage is
short-lived.

However, it is possible to write a circuit breaker which rejects only
external requests on the master branch (we have the ability to identify
requests as internal or external there).
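
As a rough illustration of that idea, here is a minimal sketch of a breaker that trips on a load threshold but rejects only external requests. It is not Solr's circuit breaker API: the `Request` class, the `is_internal` flag and the load values are all hypothetical stand-ins for whatever the server uses to tell the two kinds of request apart.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical flag: True for a sub-request sent by another node in
    # the cluster, False for a request from an outside client.
    is_internal: bool


class ExternalOnlyBreaker:
    """Trips on a load threshold but only rejects external requests."""

    def __init__(self, load_threshold: float):
        self.load_threshold = load_threshold

    def is_tripped(self, current_load: float) -> bool:
        return current_load > self.load_threshold

    def should_reject(self, request: Request, current_load: float) -> bool:
        # Internal shard requests always pass, so a tripped breaker does
        # not degrade a distributed query that another node already admitted.
        if request.is_internal:
            return False
        return self.is_tripped(current_load)
```

The fanout concern raised at the top of this reply still applies to such a design: internal requests are never shed, so a node receiving mostly shard traffic gets no relief from its own breaker.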

Regards,

Atri

On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wunder@wunderwood.org> wrote:

> This got zero responses on the solr-user list, so I’ll raise the issue
> here.
>
> Should circuit breakers only kill external search requests and not
> cluster-internal requests to shards?
>
> Circuit breakers can kill any request, whether it is a client request from
> outside the cluster or an internal distributed request to a shard. Killing
> a portion of a distributed request will affect the main request. Not sure
> whether a 503 from a shard will kill the whole request or cause partial
> results, but it isn’t good.
>
> We run with 8 shards. If a circuit breaker is killing 10% of requests on
> each host, that will hit 57% of all external requests (0.9^8 = 0.43). That
> seems like “overkill” to me. If it only kills external requests, then 10%
> means 10%.
>
> Killing only external requests requires that external requests go roughly
> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
Re: Circuit Breakers interaction with Shards
Ideally, it would only affect a few queries. In reality, with a sharded system, the impact will be large.

I disagree that the goal is to protect a node. The goal is to make the entire cluster avoid congestion failure when overloaded, while providing good service for the load that it can handle.

I have had Solr clusters take down entire websites when overloaded, both at Netflix and Chegg, and I’ve built defenses for this at both places. I’m a huge fan of circuit breakers.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 14, 2021, at 9:50 AM, Atri Sharma <atri@apache.org> wrote:
>
> This has an issue of still leading to node outages if the fanout for a query is high.
>
> Circuit breakers follow a simple rule -- defend the node at the cost of degraded responses.
>
> Ideally, only a few requests will be completely rejected -- some will see partial results. Because circuit breakers are non-discriminating, the typical blip in service quality due to high resource usage is short-lived.
>
> However, it is possible to write a circuit breaker which rejects only external requests on the master branch (we have the ability to identify requests as internal or external there).
>
> Regards,
>
> Atri
Re: Circuit Breakers interaction with Shards
The way I look at it is that for cluster level stability, rate limiters
should be used which allow rate limiting of only external requests. They
are "circuit breakers" in the sense of defending against cluster level
instability, which is what you describe.

Circuit breakers, in the Solr world, are meant to be the last-resort
defense of a node.

As I said earlier, it is possible to write a circuit breaker which rejects
only external requests, but I personally do not see the benefit in the
presence of rate limiters.

On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wunder@wunderwood.org> wrote:

> Ideally, it would only affect a few queries. In reality, with a sharded
> system, the impact will be large.
>
> I disagree that the goal is to protect a node. The goal is to make the
> entire cluster avoid congestion failure when overloaded, while providing
> good service for the load that it can handle.
>
> I have had Solr clusters take down entire websites when overloaded, both
> at Netflix and Chegg, and I’ve built defenses for this at both places. I’m
> a huge fan of circuit breakers.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
Re: Circuit Breakers interaction with Shards
We’ve looked at and rejected rate limiters as high-maintenance and not sufficient protection.

We would have run nginx on each node, sent external traffic to nginx on a different port and let internal traffic stay on the default Solr port. This has other advantages (monitoring), but the rate limiting part is way too fiddly.

Rates depend on how much CPU is used per query and on the size of the cluster (if the limiters are not on each node). Here are some examples from our largest cluster, each of which would need a change in rate limits. Some of these could be set by doing offline load benchmarks, some not.

* Experiment cell that uses 2.5X more CPU for each query (running now in prod)
* Increasing traffic allocated to that cell (did this last week)
* Increase in index size (number of docs and CPU requirements increase about 5% every month)
* Website slowdown that shifts most traffic to mobile, where queries use 2X as much CPU
* Horizontal scaling from 24 to 48 nodes
* Vertical scaling from c5.8xlarge to c5.18xlarge

And so on. Rate limiting would require almost weekly load benchmarks and it still wouldn’t catch the outage-causing problems.
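
To make the dependence concrete, here is a back-of-the-envelope sketch of how a fixed rate limit would have to move when per-query CPU cost or cluster size changes. Every number in it (cores per node, CPU seconds per query, the utilization target) is a made-up placeholder, not a measurement from the cluster described above, and the model assumes every query costs the same CPU time in total across the cluster.

```python
def sustainable_qps(nodes: int, cores_per_node: int,
                    cpu_seconds_per_query: float,
                    target_utilization: float = 0.75) -> float:
    """Rough external query rate a cluster can absorb.

    target_utilization is an arbitrary headroom factor chosen for the
    example; real traffic also does not have a uniform per-query cost.
    """
    cpu_seconds_per_second = nodes * cores_per_node * target_utilization
    return cpu_seconds_per_second / cpu_seconds_per_query


if __name__ == "__main__":
    baseline = sustainable_qps(24, 32, 0.20)
    heavier = sustainable_qps(24, 32, 0.20 * 2.5)  # queries cost 2.5X more CPU
    scaled = sustainable_qps(48, 32, 0.20)         # 24 to 48 nodes
    print(f"baseline: {baseline:.0f} qps")
    print(f"2.5X CPU per query: {heavier:.0f} qps")
    print(f"48 nodes: {scaled:.0f} qps")
```

Each item in the list above changes one of these inputs, so a limit derived this way goes stale whenever any of them moves.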

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 14, 2021, at 10:25 AM, Atri Sharma <atri@apache.org> wrote:
>
> The way I look at it is that for cluster level stability, rate limiters should be used which allow rate limiting of only external requests. They are "circuit breakers" in the sense of defending against cluster level instability, which is what you describe.
>
> Circuit breakers, in the Solr world, are meant to be the last-resort defense of a node.
>
> As I said earlier, it is possible to write a circuit breaker which rejects only external requests, but I personally do not see the benefit in the presence of rate limiters.
Re: Circuit Breakers interaction with Shards
This is a debate better suited for a different forum -- but I would
disagree with your assertion that rate limiting is a bad idea.

Solr allows you to specify node level request quotas which also follow the
principle of not limiting internal requests. I find that to be pretty
useful in three forms: 1. Use it in conjunction with a global request limit
which is typically 0.75 of my total load capacity given my average query
resource consumption. 2. Allow per node request limits to ensure fairness
and dedicated capacity for different types of requests. 3. Allow circuit
breakers to handle cases where a couple of rogue queries can take down
nodes.

We digress -- as I said, it should be fairly simple to have a circuit
breaker which rejects only external requests, but it should be clearly
documented along with its downsides.

On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wunder@wunderwood.org> wrote:

> We’ve looked at and rejected rate limiters as high-maintenance and not
> sufficient protection.
>
> We would have run nginx on each node, sent external traffic to nginx on a
> different port and let internal traffic stay on the default Solr port. This
> has other advantages (monitoring), but the rate limiting part is way too
> fiddly.
>
> Rates depend on how much CPU is used per query and on the size of the
> cluster (if they are not on each node). Some examples from our largest
> cluster which would need a change in rate limits. Some of these could be
> set by doing offline load benchmarks, some not.
>
> * Experiment cell that uses 2.5X more CPU for each query (running now in
> prod)
> * Increasing traffic allocated to that cell (did this last week)
> * Increase in index size (number of docs and CPU requirements increase
> about 5% every month)
> * Website slowdown that shifts most traffic to mobile, where queries use
> 2X as much CPU
> * Horizontal scaling from 24 to 48 nodes
> * Vertical scaling from c5.8xlarge to c5.18xlarge
>
> And so on. Rate limiting would require almost weekly load benchmarks and
> it still wouldn’t catch the outage-causing problems.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
Re: Circuit Breakers interaction with Shards
Rate limiting is a good idea. It requires a lot of ongoing engineering to adjust the rates to the current cluster behavior. It doesn’t help with some kinds of overload. The ROI just doesn’t work out. It is too much work for not enough benefit.

Rate limiting works if the collection size doesn’t change and the queries don’t change.

At Netflix, we limited traffic based on number of connections to each server. This is basically the length of the queue of requests for that server. This is similar to limiting by load average, which is also the work waiting to be done. It has the same weaknesses as the load average circuit breaker, but it did not need to be changed when average CPU usage per query increased. It was “set and forget”. Rate limiters require constant adjustment.
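
A minimal sketch of that kind of limit, assuming a simple in-process counter: a request is admitted only while fewer than a fixed number are in flight. This is not the actual Netflix setup, whose details are not given here; the class and method names are hypothetical.

```python
import threading


class ConnectionLimiter:
    """Admit a request only while fewer than max_in_flight are active."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # the caller would answer with an overload error
            self._in_flight += 1
            return True

    def release(self) -> None:
        # Called when a request finishes, whether it succeeded or failed.
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

The limit is expressed in work outstanding, not in requests per second, so it does not need retuning when the average cost of a query changes.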

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 14, 2021, at 11:44 AM, Atri Sharma <atri@apache.org> wrote:
>
> This is a debate better suited for a different forum -- but I would disagree with your assertion that rate limiting is a bad idea.
>
> Solr allows you to specify node level request quotas which also follow the principle of not limiting internal requests. I find that to be pretty useful in three forms: 1. Use it in conjunction with a global request limit which is typically 0.75 of my total load capacity given my average query resource consumption. 2. Allow per node request limits to ensure fairness and dedicated capacity for different types of requests. 3. Allow circuit breakers to handle cases where a couple of rogue queries can take down nodes.
>
> We digress -- as I said, it should be fairly simple to have a circuit breaker which rejects only external requests, but should be clearly documented with its downsides.
Re: Circuit Breakers interaction with Shards
Walter, it sounds like you were doing rate limiting, just in a different
way that is more dynamic than a simple (yet fiddly) constant?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <wunder@wunderwood.org>
wrote:

> Rate limiting is a good idea. It requires a lot of ongoing engineering to
> adjust the rates to the current cluster behavior. It doesn’t help with some
> kinds of overload. The ROI just doesn’t work out. It is too much work for
> not enough benefit.
>
> Rate limiting works if the collection size doesn’t change and the queries
> don’t change.
>
> At Netflix, we limited traffic based on number of connections to each
> server. This is basically the length of the queue of requests for that
> server. This is similar to limiting by load average, which is also the work
> waiting to be done. It has the same weaknesses as the load average circuit
> breaker, but it did not need to be changed when average CPU usage per query
> increased. It was “set and forget”. Rate limiters require constant
> adjustment.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
Re: Circuit Breakers interaction with Shards
Limiting open connections is not the same as rate limiting. Open connections is a count of the requests being processed by a node. When the load balancer gets a new request and all current connections are waiting for a response, a new connection is opened.

If the requests are all the same query and returned from the query cache, the rate can be very high with a few connections. If the requests are very slow, like deep paging, it only takes a few hundred requests to max out the connections. 100 queries/sec could be 5% CPU or 100% CPU.
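
A rough way to see why the same request rate can mean very different connection counts is Little's law: average requests in flight = arrival rate x average time per request. The sketch below uses made-up latencies (5 ms for a cache hit, 3 s for a deep-paging query); they are assumptions for illustration, not measurements from any cluster.

```python
def in_flight_connections(queries_per_sec: float, avg_latency_sec: float) -> float:
    """Little's law: average requests in flight = arrival rate * average latency."""
    return queries_per_sec * avg_latency_sec


if __name__ == "__main__":
    rate = 100.0  # queries/sec, the figure used in the message above

    # Hypothetical latencies, chosen only to contrast the two cases.
    cached = in_flight_connections(rate, 0.005)      # 5 ms query-cache hit
    deep_paging = in_flight_connections(rate, 3.0)   # 3 s deep-paging query

    print(f"cached queries:      about {cached:.1f} connections in flight")
    print(f"deep-paging queries: about {deep_paging:.0f} connections in flight")
```

Under these assumed latencies the rate is identical in both cases, but the number of open connections differs by a factor of several hundred.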

Think of the count of requests waiting to be handled (the number of active connections) as something like a cluster-wide load average: one connection per request being processed, plus one connection per request waiting.
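
As a minimal sketch of limiting by connection count rather than by rate, the class below admits a request only while fewer than a fixed number are in flight and otherwise answers 503. This is not how Solr or any particular load balancer implements it; `ConnectionLimiter`, `handle`, and `max_in_flight` are hypothetical names for illustration.

```python
import threading


class ConnectionLimiter:
    """Admit a request only while fewer than max_in_flight are being handled."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True and count the request if there is room, else False."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(limiter: ConnectionLimiter, run_query):
    """Run the query if admitted; otherwise reject with a 503 status."""
    if not limiter.try_acquire():
        return 503, None
    try:
        return 200, run_query()
    finally:
        limiter.release()
```

The limit is expressed in work waiting to be done, not in queries per second, so it does not need retuning when the average cost per query changes.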

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 16, 2021, at 8:53 AM, David Smiley <dsmiley@apache.org> wrote:
>
> Walter, it sounds like you were doing rate limiting, just in a different way that is more dynamic than a simple (yet fiddly) constant?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood <wunder@wunderwood.org> wrote:
> Rate limiting is a good idea. It requires a lot of ongoing engineering to adjust the rates to the current cluster behavior. It doesn’t help with some kinds of overload. The ROI just doesn’t work out. It is too much work for not enough benefit.
>
> Rate limiting works if the collection size doesn’t change and the queries don’t change.
>
> At Netflix, we limited traffic based on number of connections to each server. This is basically the length of the queue of requests for that server. This is similar to limiting by load average, which is also the work waiting to be done. It has the same weaknesses as the load average circuit breaker, but it did not need to be changed when average CPU usage per query increased. It was “set and forget”. Rate limiters require constant adjustment.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Feb 14, 2021, at 11:44 AM, Atri Sharma <atri@apache.org> wrote:
>>
>> This is a debate better suited for a different forum -- but I would disagree with your assertion that rate limiting is a bad idea.
>>
>> Solr allows you to specify node-level request quotas, which also follow the principle of not limiting internal requests. I find that to be pretty useful in three ways:
>>
>> 1. Use it in conjunction with a global request limit, which is typically 0.75 of my total load capacity given my average query resource consumption.
>> 2. Allow per-node request limits to ensure fairness and dedicated capacity for different types of requests.
>> 3. Allow circuit breakers to handle cases where a couple of rogue queries can take down nodes.
>>
>> We digress -- as I said, it should be fairly simple to have a circuit breaker which rejects only external requests, but should be clearly documented with its downsides.
>>
>> On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wunder@wunderwood.org> wrote:
>> We’ve looked at and rejected rate limiters as high-maintenance and not sufficient protection.
>>
>> We would have run nginx on each node, sent external traffic to nginx on a different port and let internal traffic stay on the default Solr port. This has other advantages (monitoring), but the rate limiting part is way too fiddly.
>>
>> Rates depend on how much CPU is used per query and on the size of the cluster (if they are not on each node). Some examples from our largest cluster which would need a change in rate limits. Some of these could be set by doing offline load benchmarks, some not.
>>
>> * Experiment cell that uses 2.5X more CPU for each query (running now in prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase about 5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use 2X as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>>
>> And so on. Rate limiting would require almost weekly load benchmarks and it still wouldn’t catch the outage-causing problems.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Feb 14, 2021, at 10:25 AM, Atri Sharma <atri@apache.org> wrote:
>>>
>>> The way I look at it is that for cluster level stability, rate limiters should be used which allow rate limiting of only external requests. They are "circuit breakers" in the sense of defending against cluster level instability, which is what you describe.
>>>
>>> Circuit breakers, in Solr world, are targeted to be the last resort defense of a node.
>>>
>>> As I said earlier, it is possible to write a circuit breaker which rejects only external requests, but I personally do not see the benefit in presence of rate limiters.
>>>
>>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wunder@wunderwood.org> wrote:
>>> Ideally, it would only affect a few queries. In reality, with a sharded system, the impact will be large.
>>>
>>> I disagree that the goal is to protect a node. The goal is to make the entire cluster avoid congestion failure when overloaded, while providing good service for the load that it can handle.
>>>
>>> I have had Solr clusters take down entire websites when overloaded, both at Netflix and Chegg, and I’ve built defenses for this at both places. I’m a huge fan of circuit breakers.
>>>
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <atri@apache.org> wrote:
>>>>
>>>> This has an issue of still leading to node outages if the fanout for a query is high.
>>>>
>>>> Circuit breakers follow a simple rule -- defend the node at the cost of degraded responses.
>>>>
>>>> Ideally, only a few requests will be completely rejected -- some will see partial results. Due to this non-discriminating nature of circuit breakers, the typical blip on service quality due to high resource usage is short lived.
>>>>
>>>> However, it is possible to write a circuit breaker which rejects only external requests in master branch (we have the ability to identify requests as internal or external there).
>>>>
>>>> Regards,
>>>>
>>>> Atri
>>>>
>>>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wunder@wunderwood.org> wrote:
>>>> This got zero responses on the solr-user list, so I’ll raise the issue here.
>>>>
>>>> Should circuit breakers only kill external search requests and not cluster-internal requests to shards?
>>>>
>>>> Circuit breakers can kill any request, whether it is a client request from outside the cluster or an internal distributed request to a shard. Killing a portion of a distributed request will affect the main request. Not sure whether a 503 from a shard will kill the whole request or cause partial results, but it isn’t good.
>>>>
>>>> We run with 8 shards. If a circuit breaker is killing 10% of requests on each host, that will hit 57% of all external requests (0.9^8 = 0.43). That seems like “overkill” to me. If it only kills external requests, then 10% means 10%.
>>>>
>>>> Killing only external requests requires that external requests go roughly equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>
>>
>
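
The idea discussed in the quoted thread, a circuit breaker that rejects only external requests, could look roughly like the sketch below. It is not Solr's implementation: the `Request.is_internal` flag stands in for whatever mechanism identifies shard sub-requests (the thread only says that master can tell internal from external), and the load-average threshold is an assumed trip condition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical flag: True for a cluster-internal shard sub-request,
    # False for a client request arriving from outside the cluster.
    is_internal: bool


class ExternalOnlyBreaker:
    """Trip on high load, but shed only external requests."""

    def __init__(self, load_threshold: float) -> None:
        # Assumed trip point, e.g. a load average above which the node sheds work.
        self.load_threshold = load_threshold

    def should_reject(self, request: Request, current_load: float) -> bool:
        if request.is_internal:
            # Never reject shard sub-requests, so one overloaded node does not
            # degrade distributed requests coordinated by other nodes.
            return False
        return current_load > self.load_threshold
```

Two caveats from the thread apply. Because internal requests are never shed, a node can still be overloaded when query fan-out is high, and the approach only sheds load evenly if external requests are spread roughly equally across the hosts.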