Mailing List Archive: Approach for a new Autoscaling framework

Approach for a new Autoscaling framework

Jul 23, 2020, 5:26 AM

Post #1 of 14 (671 views)

[I’m moving a discussion from the PR
<https://github.com/apache/lucene-solr/pull/1684> for SOLR-14613
<https://issues.apache.org/jira/browse/SOLR-14613> to the dev list for a
wider audience. This is about replacing the now (in master) gone
Autoscaling framework with a way for clients to write their customized
placement code]

It took me a long time to write this mail and it's quite long, sorry.
Please anybody interested in the future of Autoscaling (not only those I
cc'ed) do read it and provide feedback. Very impacting decisions have to be
made now.

Thanks Noble for your feedback.
I believe it is important that we are aligned on what we build here, esp.
at the early defining stages (now).

Let me try to elaborate on your concerns and provide in general the
rationale behind the approach.

*> Anyone who wishes to implement this should not require to learn a lot
before even getting started*
For somebody who knows Solr (what is a Node, Collection, Shard, Replica)
and basic notions related to Autoscaling (getting variables representing
current state to make decisions), there’s not much to learn. The framework
uses the same concepts, often with the same names.

*> I don't believe we should have a set of interfaces that duplicate
existing classes just for this functionality.*
Where appropriate we can have existing classes be the implementations for
these interfaces and be passed to the plugins, that would be perfectly ok.
The proposal doesn’t include implementations at this stage, therefore
there’s no duplication, or not yet... (we must get the interfaces right and
agreed upon before implementation). If some interface methods in the
proposal have a different name from equivalent methods in internal classes
we plan to use, of course let's rename one or the other.

Existing internal abstractions are most of the time concrete classes and
not interfaces (Replica, Slice, DocCollection, ClusterState). Making these
visible to contrib code living elsewhere is making future refactoring hard
and contrib code will most likely end up reaching to methods it shouldn’t
be using. If we define a clean set of interfaces for plugins, I wouldn’t
hesitate to break external plugins that reach out to other internal Solr
classes, but will make everything possible to keep the API backward
compatible so existing plugins can be recompiled without change.

*> 24 interfaces to do this is definitely over engineering*
I don’t consider the number of classes or interfaces a metric of complexity
or of engineering quality. There are sample
<https://github.com/apache/lucene-solr/pull/1684/files#diff-ddbe185b5e7922b91b90dfabfc50df4c>
plugin implementations to serve as a base for plugin writers (and for us
defining this framework) and I believe the process is relatively simple.
Trying to do the same things with existing Solr classes might prove a lot
harder (but might be worth the effort for comparison purposes to make sure
we agree on the approach? For example, getting sister replicas of a given
replica in the proposed API is: replica.getShard()
<https://github.com/apache/lucene-solr/pull/1684/files#diff-a2d49bd52fddde54bb7fd2e96238507eR27>
.getReplicas()
<https://github.com/apache/lucene-solr/pull/1684/files#diff-9633f5e169fa3095062451599daac213R31>.
Doing so with the internal classes likely involves getting the DocCollection
and Slice name from the Replica, then get the DocCollection from the
cluster state, there get the Slice based on its name and finally
getReplicas() from the Slice). I consider the role of this new framework is
to make life as easy as possible for writing placement code and the like,
make life easy for us to maintain it, make it easy to write a simulation
engine (should be at least an order of magnitude simpler than the previous
one), etc.

An example regarding readability and number of interfaces: rather than
defining an enum with runtime annotation for building its instances (
Variable.Type
<https://github.com/apache/lucene-solr/blob/branch_8_6/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/Variable.java#L98>)
and then very generic access methods, the proposal defines a specific
interface for each “variable type” (called properties
<https://github.com/apache/lucene-solr/pull/1684/files#diff-4c0fa84354f93cb00e6643aefd00fd3c>).
Rather than concatenating strings to specify the data to return from a
remote node (based on snitches
<https://github.com/apache/lucene-solr/blame/branch_8_6/solr/core/src/java/org/apache/solr/cloud/rule/ImplicitSnitch.java#L60>,
see doc
<https://lucene.apache.org/solr/guide/8_1/solrcloud-autoscaling-policy-preferences.html#node-selector>),
the proposal is explicit and strongly typed (here
<https://github.com/apache/lucene-solr/pull/1684/files#diff-4ec32958f54ec8e1f7e2d5ce8de331bb>
example
to get a specific system property from a node). This definitely does
increase the number of interfaces, but reduces IMO the effort to code to
these abstractions and provides a lot more compile time and IDE assistance.

Goal is to hide all the boilerplate code and machinery (and to a point -
complexity) in the implementations of these interfaces rather than have
each plugin writer deal with the same problems.

We’re moving from something that was complex and hard to read and debug yet
functionally extremely rich, to something simpler for us, more demanding
for users (write code rather than policy config if there's a need for new
behavior) but that should not be less "expressive" in any significant way.
One could even imagine reimplementing the former Autoscaling config Domain
Specific Language on top of these API (maybe as a summer internship project
:)

*> This is a common mistake that we all do. When we design a feature we
think that is the most important thing.*
If by *"most important thing"* you mean investing the best reasonable
effort to do things right then yes.
If you mean trying to make a minor feature look more important and inflated
than it is, I disagree.
As a personal note, replica placement is not the aspect of SolrCloud I'm
most interested in, but the first bottleneck we hit when pushing the scale
of SolrCloud. I approach this with a state of mind "let's do it right and
get it out of the way" to move to topics I really want to work on (around
distribution in SolrCloud and the role of Overseer). Implementing
Autoscaling in a way that simplifies future refactoring (or that does not
make them harder than they already are) is therefore *very high* on my
priority list, to support modest changes (Slice to Shard renaming) and more
ambitious ones (replacing Zookeeper, removing Overseer, you name it).

Thanks for reading, again sorry for the long email, but I hope this helps
(at least helps the discussion),
Ilan

On Thu 23 Jul 2020 at 08:16, Noble Paul <notifications@github.com> wrote:

> I don't believe we should have a set of interfaces that duplicate existing
> classes just for this functionality. This is a common mistake that we all
> do. When we design a feature we think that is the most important thing. We
> endup over designing and over engineering things. This feature will remain
> a tiny part of Solr. Anyone who wishes to implement this should not require
> to learn a lot before even getting started. Let's try to have a minimal set
> of interfaces so that people who try to implement them do not have a huge
> learning cure.
>
> Let's try to understand the requirement
>
> - Solr wants a set of positions to place a few replicas
> - The implementation wants to know what is the current state of the
> cluster so that it can make those decisions
>
> 24 interfaces to do this is definitely over engineering
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/lucene-solr/pull/1684#issuecomment-662837142>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AKIOMCFT5GU2II347GZ4HTTR47IVTANCNFSM4PC3HDKQ>
> .
>