This is a debate better suited for a different forum -- but I would
disagree with your assertion that rate limiting is a bad idea.
Solr allows you to specify node-level request quotas, which also follow the
principle of not limiting internal requests. I find that to be pretty
useful in three forms:
1. Use it in conjunction with a global request limit, which is typically
0.75 of my total load capacity given my average query resource consumption.
2. Allow per-node request limits to ensure fairness and dedicated capacity
for different types of requests.
3. Allow circuit breakers to handle cases where a couple of rogue queries
can take down nodes.
We digress -- as I said, it should be fairly simple to have a circuit
breaker which rejects only external requests, but it should be clearly
documented with its downsides.
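
For reference, the fan-out arithmetic from the original message quoted
further down (8 shards, 10% of requests rejected on each host) can be
sketched as follows. This is only a rough illustration: it assumes each
shard rejects independently and that a single rejected shard request
degrades the whole external request. The function name is hypothetical and
is not part of Solr.

```python
def fraction_affected(per_node_reject_rate: float, shards: int) -> float:
    """Fraction of external requests that hit at least one rejected shard request.

    Assumption (not measured on a real cluster): rejections on each shard
    are independent, so the chance that all shards accept is (1 - p) ** shards.
    """
    if not 0.0 <= per_node_reject_rate <= 1.0:
        raise ValueError("per_node_reject_rate must be between 0 and 1")
    if shards < 1:
        raise ValueError("shards must be at least 1")
    return 1.0 - (1.0 - per_node_reject_rate) ** shards


if __name__ == "__main__":
    # Breaker rejects internal and external requests alike, 8 shards:
    print(round(fraction_affected(0.10, 8), 2))  # 0.57
    # Breaker rejects only external requests, so there is no fan-out effect:
    print(round(fraction_affected(0.10, 1), 2))  # 0.1
```

With only external requests rejected, the per-host rate and the rate seen
by clients are the same, which is the "10% means 10%" case.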
On Mon, 15 Feb 2021, 00:33 Walter Underwood, <wunder@wunderwood.org> wrote:
> We’ve looked at and rejected rate limiters as high-maintenance and not
> sufficient protection.
>
> We would have run nginx on each node, sent external traffic to nginx on a
> different port and let internal traffic stay on the default Solr port. This
> has other advantages (monitoring), but the rate limiting part is way too
> fiddly.
>
> Rates depend on how much CPU is used per query and on the size of the
> cluster (if they are not on each node). Some examples from our largest
> cluster which would need a change in rate limits. Some of these could be
> set by doing offline load benchmarks, some not.
>
> * Experiment cell that uses 2.5X more CPU for each query (running now in
> prod)
> * Increasing traffic allocated to that cell (did this last week)
> * Increase in index size (number of docs and CPU requirements increase
> about 5% every month)
> * Website slowdown that shifts most traffic to mobile, where queries use
> 2X as much CPU
> * Horizontal scaling from 24 to 48 nodes
> * Vertical scaling from c5.8xlarge to c5.18xlarge
>
> And so on. Rate limiting would require almost weekly load benchmarks and
> it still wouldn’t catch the outage-causing problems.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Feb 14, 2021, at 10:25 AM, Atri Sharma <atri@apache.org> wrote:
>
> The way I look at it is that for cluster level stability, rate limiters
> should be used which allow rate limiting of only external requests. They
> are "circuit breakers" in the sense of defending against cluster level
> instability, which is what you describe.
>
> Circuit breakers, in Solr world, are targeted to be the last resort
> defense of a node.
>
> As I said earlier, it is possible to write a circuit breaker which rejects
> only external requests, but I personally do not see the benefit in the
> presence of rate limiters.
>
> On Sun, 14 Feb 2021, 23:50 Walter Underwood, <wunder@wunderwood.org>
> wrote:
>
>> Ideally, it would only affect a few queries. In reality, with a sharded
>> system, the impact will be large.
>>
>> I disagree that the goal is to protect a node. The goal is to make the
>> entire cluster avoid congestion failure when overloaded, while providing
>> good service for the load that it can handle.
>>
>> I have had Solr clusters take down entire websites when overloaded, both
>> at Netflix and Chegg, and I’ve built defenses for this at both places. I’m
>> a huge fan of circuit breakers.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Feb 14, 2021, at 9:50 AM, Atri Sharma <atri@apache.org> wrote:
>>
>> This has an issue of still leading to node outages if the fanout for a
>> query is high.
>>
>> Circuit breakers follow a simple rule -- defend the node at the cost of
>> degraded responses.
>>
>> Ideally, only a few requests will be completely rejected -- some will see
>> partial results. Due to this non-discriminating nature of circuit breakers,
>> the typical blip in service quality due to high resource usage is short
>> lived.
>>
>> However, it is possible to write a circuit breaker which rejects only
>> external requests in master branch (we have the ability to identify
>> requests as internal or external there).
>>
>> Regards,
>>
>> Atri
>>
>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, <wunder@wunderwood.org>
>> wrote:
>>
>>> This got zero responses on the solr-user list, so I’ll raise the issue
>>> here.
>>>
>>> Should circuit breakers only kill external search requests and not
>>> cluster-internal requests to shards?
>>>
>>> Circuit breakers can kill any request, whether it is a client request
>>> from outside the cluster or an internal distributed request to a shard.
>>> Killing a portion of a distributed request will affect the main request. Not
>>> sure whether a 503 from a shard will kill the whole request or cause
>>> partial results, but it isn’t good.
>>>
>>> We run with 8 shards. If a circuit breaker is killing 10% of requests on
>>> each host, that will hit 57% of all external requests (0.9^8 = 0.43). That
>>> seems like “overkill” to me. If it only kills external requests, then 10%
>>> means 10%.
>>>
>>> Killing only external requests requires that external requests go
>>> roughly equally to all hosts in the cluster, or at least all NRT or PULL
>>> replicas.
>>>
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>
>>
>