Mailing List Archive: Accessibility of MergeThread.rateLimiter

Accessibility of MergeThread.rateLimiter

Sep 22, 2021, 12:03 AM

Post #1 of 5 (329 views)

In an effort to prepare Elasticsearch for modularization, we are
investigating and eliminating split packages. The situation has improved
through recent refactoring in Lucene 9.0, but a number of split
packages still remain. This message identifies one such, so that it can
be discussed in isolation, with a view to a potential solution either in
Lucene or possibly within Elasticsearch itself.

Elasticsearch defines an `ElasticsearchConcurrentMergeScheduler` [1]
(subclass of lucene's `ConcurrentMergeScheduler`), that provides
tracking of merge times for all and current merges. It does so by
extracting the relevant information from the `MergeThread.rateLimiter`
to report, a) the total-bytes-written, via
`MergeRateLimiter::getTotalBytesWritten`, and b) the MB-per-second
throttle, via `MergeRateLimiter::getMBPerSec`. Currently access to the
package-private `rateLimiter` is gained by effectively (but not
literally) injecting into a lucene package [2].

The elasticsearch ECMS overrides the factory for creating merge threads,
so is in control of the thread creation, but still cannot gain access to
the merge tracking information.

There are several ways that we could resolve this, e.g.

1) In Lucene, declare `MergeThread.rateLimiter` as a public field.
Elasticsearch can access the rate limiter directly.

2) Same as #1, but instead declare `MergeThread.rateLimiter` as
protected field. Elasticsearch can then subclass MergeThread and
access the rate limiter.

3) In Lucene, provide an overloaded MergeThread constructor that accepts
a `MergeRateLimiter`. Elasticsearch can then create the merge rate
limiter itself and pass it to the merge thread on construction. This
would work, but would require ES to retain a map from thread to rate
limiter - not ideal.

4) Add public accessors directly to MergeThread to expose the necessary
information, e.g.
class MergeThread {
...
/** Returns the current mb per second rate limit. */
public double getMBPerSec() { return rateLimiter.getMBPerSec(); }

/** Returns the total bytes written by this merge. */
public long getTotalBytesWritten() { return rateLimiter.getTotalBytesWritten(); }
}

I'm sure there are probably a few other ideas that are not listed above,
but so far no.2 above seems the least intrusive. No.4 is also reasonable
if we consider this functionality more broadly desirable.

-Chris.

[1] https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/engine/ElasticsearchConcurrentMergeScheduler.java#L39
[2] https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/index/OneMergeHelper.java#L16

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Accessibility of MergeThread.rateLimiter [ In reply to ]

jpountz at gmail

Sep 22, 2021, 2:12 AM

Post #2 of 5 (329 views)

Permalink

Hi Chris,

I looked into this and Elasticsearch seems to only need access to the rate
limiter for logging purposes, without adding any information that Lucene
doesn't have.
Maybe another option would consist of moving the logging to Lucene? Having
information in the IndexWriter's InfoStream about rate limiting for each
completed merge sounds like something that would generally be useful.

On Wed, Sep 22, 2021 at 9:03 AM Chris Hegarty
<christopher.hegarty@elastic.co.invalid> wrote:

> In an effort to prepare Elasticsearch for modularization, we are
> investigating and eliminating split packages. The situation has improved
> through recent refactoring in Lucene 9.0, but a number of split
> packages still remain. This message identifies one such, so that it can
> be discussed in isolation, with a view to a potential solution either in
> Lucene or possibly within Elasticsearch itself.
>
> Elasticsearch defines an `ElasticsearchConcurrentMergeScheduler` [1]
> (subclass of lucene's `ConcurrentMergeScheduler`), that provides
> tracking of merge times for all and current merges. It does so by
> extracting the relevant information from the `MergeThread.rateLimiter`
> to report, a) the total-bytes-written, via
> `MergeRateLimiter::getTotalBytesWritten`, and b) the MB-per-second
> throttle, via `MergeRateLimiter::getMBPerSec`. Currently access to the
> package-private `rateLimiter` is gained by effectively (but not
> literally) injecting into a lucene package [2].
>
> The elasticsearch ECMS overrides the factory for creating merge threads,
> so is in control of the thread creation, but still cannot gain access to
> the merge tracking information.
>
> There are several ways that we could resolve this, e.g.
>
> 1) In Lucene, declare `MergeThread.rateLimiter` as a public field.
> Elasticsearch can access the rate limiter directly.
>
> 2) Same as #1, but instead declare `MergeThread.rateLimiter` as
> protected field. Elasticsearch can then subclass MergeThread and
> access the rate limiter.
>
> 3) In Lucene, provide an overloaded MergeThread constructor that accepts
> a `MergeRateLimiter`. Elasticsearch can then create the merge rate
> limiter itself and pass it to the merge thread on construction. This
> would work, but would require ES to retain a map from thread to rate
> limiter - not ideal.
>
> 4) Add public accessors directly to MergeThread to expose the necessary
> information, e.g.
> class MergeThread {
> ...
> /** Returns the current mb per second rate limit. */
> public double getMBPerSec() { return rateLimiter.getMBPerSec(); }
>
> /** Returns the total bytes written by this merge. */
> public long getTotalBytesWritten() { return
> rateLimiter.getTotalBytesWritten(); }
> }
>
> I'm sure there are probably a few other ideas that are not listed above,
> but so far no.2 above seems the least intrusive. No.4 is also reasonable
> if we consider this functionality more broadly desirable.
>
> -Chris.
>
> [1]
> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/engine/ElasticsearchConcurrentMergeScheduler.java#L39
> [2]
> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/index/OneMergeHelper.java#L16
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Adrien

Re: Accessibility of MergeThread.rateLimiter [ In reply to ]

christopher.hegarty at elastic

Sep 22, 2021, 4:40 AM

Post #3 of 5 (329 views)

Permalink

Hi Adrien,

Great suggestion. If I understand correctly, then what you are suggesting is something along the lines of ( subject to exact message details, which we can trash out in a PR ):

diff --git a/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java b/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java
index 4500d5cf7ce..76f7ea2a9f2 100644
--- a/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java
+++ b/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java
@@ -691,13 +691,23 @@ public class ConcurrentMergeScheduler extends MergeScheduler {
public void run() {
try {
if (verbose()) {
- message(" merge thread: start");
+ message(String.format(Locale.ROOT, "merge thread %s start", this.getName()));
}

doMerge(mergeSource, merge);

if (verbose()) {
- message(" merge thread: done");
+ message(
+ String.format(
+ Locale.ROOT,
+ "merge thread %s done estSize=%.1f MB (written=%.1f MB) runTime=%.1fs (stopped=%.1fs, paused=%.1fs) rate=%s",
+ this.getName(),
+ bytesToMB(merge.estimatedMergeBytes),
+ bytesToMB(rateLimiter.getTotalBytesWritten()),
+ nsToSec(System.nanoTime() - merge.mergeStartNS),
+ nsToSec(rateLimiter.getTotalStoppedNS()),
+ nsToSec(rateLimiter.getTotalPausedNS()),
+ rateToString(rateLimiter.getMBPerSec())));
}
runOnMergeFinished(mergeSource);
} catch (Throwable exc) {

Then ES can leverage such from the infoStream, right? ( thus avoiding the need for ES extract the inaccessible information directly itself, while also being more generally useful in Lucene logs ).

Or have I misinterpreted your comment?

-Chris.

> On 22 Sep 2021, at 10:12, Adrien Grand <jpountz@gmail.com> wrote:
>
> Hi Chris,
>
> I looked into this and Elasticsearch seems to only need access to the rate limiter for logging purposes, without adding any information that Lucene doesn't have.
> Maybe another option would consist of moving the logging to Lucene? Having information in the IndexWriter's InfoStream about rate limiting for each completed merge sounds like something that would generally be useful.
>

Re: Accessibility of MergeThread.rateLimiter [ In reply to ]

jpountz at gmail

Sep 22, 2021, 4:56 AM

Post #4 of 5 (329 views)

Permalink

You interpreted my suggestion correctly.

Elasticsearch can indeed leverage information that is sent to the
InfoStream: look up the class called LoggerInfoStream, which forwards all
the InfoStream logging to Log4J.

On Wed, Sep 22, 2021 at 1:40 PM Chris Hegarty
<christopher.hegarty@elastic.co.invalid> wrote:

> Hi Adrien,
>
> Great suggestion. If I understand correctly, then what you are suggesting
> is something along the lines of ( subject to exact message details, which
> we can trash out in a PR ):
>
> *diff --git
> a/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java
> b/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java*
> *index 4500d5cf7ce..76f7ea2a9f2 100644*
> *---
> a/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java*
> *+++
> b/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java*
> @@ -691,13 +691,23 @@ public class ConcurrentMergeScheduler extends
> MergeScheduler {
> public void run() {
> try {
> if (verbose()) {
> - message(" merge thread: start");
> + message(String.format(Locale.ROOT, "merge thread %s start",
> this.getName()));
> }
>
>
> doMerge(mergeSource, merge);
>
>
> if (verbose()) {
> - message(" merge thread: done");
> + message(
> + String.format(
> + Locale.ROOT,
> + "merge thread %s done estSize=%.1f MB (written=%.1f MB)
> runTime=%.1fs (stopped=%.1fs, paused=%.1fs) rate=%s",
> + this.getName(),
> + bytesToMB(merge.estimatedMergeBytes),
> + bytesToMB(rateLimiter.getTotalBytesWritten()),
> + nsToSec(System.nanoTime() - merge.mergeStartNS),
> + nsToSec(rateLimiter.getTotalStoppedNS()),
> + nsToSec(rateLimiter.getTotalPausedNS()),
> + rateToString(rateLimiter.getMBPerSec())));
> }
> runOnMergeFinished(mergeSource);
> } catch (Throwable exc) {
>
>
> Then ES can leverage such from the infoStream, right? ( thus avoiding the
> need for ES extract the inaccessible information directly itself, while
> also being more generally useful in Lucene logs ).
>
> Or have I misinterpreted your comment?
>
> -Chris.
>
>
> On 22 Sep 2021, at 10:12, Adrien Grand <jpountz@gmail.com> wrote:
>
> Hi Chris,
>
> I looked into this and Elasticsearch seems to only need access to the rate
> limiter for logging purposes, without adding any information that Lucene
> doesn't have.
> Maybe another option would consist of moving the logging to Lucene? Having
> information in the IndexWriter's InfoStream about rate limiting for each
> completed merge sounds like something that would generally be useful.
>
>
>

--
Adrien

Re: Accessibility of MergeThread.rateLimiter [ In reply to ]

christopher.hegarty at elastic

Sep 22, 2021, 6:02 AM

Post #5 of 5 (329 views)

Permalink

Hi Adrien,

> On 22 Sep 2021, at 12:56, Adrien Grand <jpountz@gmail.com> wrote:
>
> You interpreted my suggestion correctly.

Great. Thanks for the confirmation.

I filed the following issue to track this:
https://issues.apache.org/jira/browse/LUCENE-10118 <https://issues.apache.org/jira/browse/LUCENE-10118>

> Elasticsearch can indeed leverage information that is sent to the InfoStream: look up the class called LoggerInfoStream, which forwards all the InfoStream logging to Log4J.

Got it. Thanks.

Cheers,
-Chris.