Could we allow an IndexInput to read from a still writing IndexOutput?
Hi Team,

Today, Lucene's Directory abstraction does not allow opening an IndexInput
on a file until the file is fully written and closed via IndexOutput. We
enforce this in tests, and some of our core Directory implementations
depend on it (e.g., caching the file's length when opening an IndexInput).

Yet, most filesystems will easily allow simultaneous read/append of a
single file. We just don't expose these IO semantics to Lucene, but could
we allow random-access reads with append-only writes on one file? Is there
a strong reason that we don't allow this?
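
For what it's worth, here is a tiny sketch (plain java.nio, nothing
Lucene-specific; the class name is illustrative, and visibility guarantees
vary by OS) of the kind of read-while-append access most filesystems
already permit:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadWhileAppend {
  public static void main(String[] args) throws Exception {
    Path path = Path.of("scratch.bin");
    try (FileChannel out = FileChannel.open(path,
             StandardOpenOption.CREATE, StandardOpenOption.WRITE,
             StandardOpenOption.APPEND);
         FileChannel in = FileChannel.open(path, StandardOpenOption.READ)) {
      // Append four bytes through the writer channel:
      out.write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));

      // Random-access (positional) read of already-appended bytes, while
      // the writer channel is still open; on POSIX filesystems both
      // channels share the page cache, so the bytes are visible:
      ByteBuffer buf = ByteBuffer.allocate(2);
      in.read(buf, 1);
      buf.flip();
      System.out.println(buf.get() + " " + buf.get()); // prints: 2 3
    }
  }
}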

Quick TL;DR context: we are trying to enable FST compilation to write
off-heap (directly to disk), so we can create arbitrarily large FSTs with
bounded heap, matching how FSTs can now be read off-heap, and it would be
much more RAM efficient if we could read/append the same file at once.

Full gory details context: inspired by how Tantivy
<https://github.com/quickwit-oss/tantivy> (awesome and fast Rust search
engine!) writes its FSTs <https://blog.burntsushi.net/transducers/>, over
in this issue <https://github.com/apache/lucene/issues/12543> and PR
<https://github.com/dungba88/lucene/commit/882f5a5b1f60d4321d2e09986335063368c08e9b>,
we (thank you Dzung Bui / @dungba88!) are trying to fix Lucene's FST
building to immediately stream the FST to disk, instead of buffering the
whole thing in RAM and then writing to disk.

This would allow building arbitrarily large FSTs without using up heap, and
symmetrically matches how we can now read FSTs off-heap, plus FST building
is already (mostly) append-only. This would also allow removing some of the
crazy abstractions we have for writing FST bytes into RAM (FSTStore,
BytesStore). It would enable interesting things like a Codec whose term
dictionary is stored entirely in an FST
<https://github.com/apache/lucene/pull/12688> (also inspired by Tantivy).

The wrinkle is that, while the FST is being built, the builder sometimes
looks back and reads previously written bytes, to share suffixes and create
a minimal (or near-minimal) FST. So if IndexInput could read those bytes,
even as the FST is still appending to IndexOutput, it would "just work".
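
Schematically, the look-back access pattern is something like this sketch
(in Lucene's DataOutput/RandomAccessInput terms; the method names are
illustrative, not the actual FSTCompiler/NodeHash code):

import java.io.IOException;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.store.RandomAccessInput;

class NodeDedupSketch {
  /** Append a serialized node; returns its start address in the stream. */
  long writeNode(DataOutput out, long bytesWrittenSoFar, byte[] nodeBytes)
      throws IOException {
    out.writeBytes(nodeBytes, 0, nodeBytes.length); // append-only write
    return bytesWrittenSoFar; // position where this node begins
  }

  /** Look back at a previously written node to decide whether to share it. */
  boolean sameNode(RandomAccessInput in, long address, byte[] candidate)
      throws IOException {
    for (int i = 0; i < candidate.length; i++) {
      if (in.readByte(address + i) != candidate[i]) { // random-access read
        return false;
      }
    }
    return true;
  }
}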

Failing that, our plan B is to wastefully duplicate the byte[] slices from
the already written bytes into our own private (heap resident, boo) copy,
which would use quite a bit more RAM while building the FST, and make less
minimal FSTs for a given RAM budget. I haven't measured the added wasted
RAM if we have to go this route, but I fear it is sizable in practice, i.e.
it strongly negates the whole idea of writing an FST off-heap since it's
effectively storing a possibly large portion of the FST in many duplicated
byte[] fragments (in the NodeHash).
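
Roughly, plan B amounts to this kind of sketch (illustrative names, not
the real NodeHash):

import java.util.HashMap;
import java.util.Map;

class HeapNodeCopies {
  // Private heap-resident duplicates of nodes already written to disk,
  // keyed by their on-disk address; this duplication is the waste.
  private final Map<Long, byte[]> copies = new HashMap<>();

  void record(long address, byte[] nodeBytes) {
    copies.put(address, nodeBytes.clone());
  }

  byte[] lookup(long address) {
    return copies.get(address);
  }
}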

So ... could we somehow relax Lucene's Directory semantics to allow opening
an IndexInput on a still appending IndexOutput, since most filesystems are
fine with this?

Mike McCandless

http://blog.mikemccandless.com
Re: Could we allow an IndexInput to read from a still writing IndexOutput?
what will happen on windows?

sorry, could not resist.

Re: Could we allow an IndexInput to read from a still writing IndexOutput?
I think there is a certain beauty (of tape-backed storage flavor...) in
the existing abstractions and I wouldn't change them unless absolutely
necessary (FST construction isn't the dominant cost in indexing). Also,
random seeks all over the place may be really problematic in certain
scenarios (as is opening a still-being-written file for reading, as Robert
mentioned).

> Failing that, our plan B is to wastefully duplicate the byte[] slices
from the already written bytes into our own private (heap resident, boo)
copy, which would use quite a bit more RAM while building the FST, and make
less minimal FSTs for a given RAM budget.

Well, this node cache doesn't have to be on heap... It can be a plain
temporary file (with full random access). It's a scratch-only structure
which you can delete after the FST is written. It does add I/O overhead but
doesn't interfere with the rest of the code in Lucene. Perhaps, instead of
changing IndexInput and IndexOutput, one could start with a plain temp file
(NIO API)?
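
Roughly something like this sketch (plain NIO; class and method names are
hypothetical):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class TempFileNodeCache implements AutoCloseable {
  private final Path path;
  private final FileChannel channel;
  private long length;

  TempFileNodeCache() throws Exception {
    path = Files.createTempFile("fst-node-cache", ".tmp");
    channel = FileChannel.open(path,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
  }

  /** Append node bytes; returns the address to remember in the NodeHash. */
  long append(byte[] bytes) throws Exception {
    long address = length;
    channel.write(ByteBuffer.wrap(bytes), address); // positional write
    length += bytes.length;
    return address;
  }

  /** Full random-access read-back of a previously appended node. */
  void read(long address, byte[] dest) throws Exception {
    channel.read(ByteBuffer.wrap(dest), address); // positional read
  }

  @Override
  public void close() throws Exception {
    channel.close();
    Files.deleteIfExists(path); // scratch only: gone once the FST is written
  }
}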

I also think that the tradeoffs presented in the graphs on the
fst-node-cache issue are not so bad at all. Yes, the FST is not minimal,
but the construction-space vs. output-size tradeoff looks quite all right
to me.

Dawid

Re: Could we allow an IndexInput to read from a still writing IndexOutput?
Hi, the biggest problem is with IndexInputs that work through the FS cache
(MMapDirectory). The file size changes while you are writing, so it
could cause strange issues. In particular, the mmap mapping may not see
the changes you have already written, as there is no happens-before
relationship.

Basically, the IO model of Lucene is WORM: something that is visible to
readers must never change again.

So, as the others said, if you need stuff already written, keep it in
memory (like the nodes). We should really not change our IO model for this
single use case. A 1% slowdown while writing due to some caching or
buffering does not matter; risking corrupted indexes or errors while
reading does.

Uwe


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de
Re: Could we allow an IndexInput to read from a still writing IndexOutput?
Thanks everyone! Responses below:

On Thu, Oct 19, 2023 at 11:17 AM Robert Muir <rcmuir@gmail.com> wrote:

> what will happen on windows?
>
> sorry, could not resist.
>

LOL, yeah, sigh.

On Thu, Oct 19, 2023 at 10:36 PM Dawid Weiss <dawid.weiss@gmail.com> wrote:

>
> I think there is a certain beauty (of tape-backed storage flavor...) in
> existing abstractions and I wouldn't change them unless absolutely
> necessary (FST construction isn't the dominant cost in indexing). Also,
> random seeks all over the place may be really problematic in certain
> scenarios (as is opening a written-to file for reading, as Robert
> mentioned).
>

I do agree. I love how minimalistic the IO semantics Lucene actually
requires are.

> Failing that, our plan B is to wastefully duplicate the byte[] slices
> from the already written bytes into our own private (heap resident, boo)
> copy, which would use quite a bit more RAM while building the FST, and make
> less minimal FSTs for a given RAM budget.
>
> Well, this node cache doesn't have to be on heap... It can be a plain
> temporary file (with full random access). It's a scratch-only structure
> which you can delete after the FST is written. It does add I/O overhead but
> doesn't interfere with the rest of the code in Lucene. Perhaps, instead of
> changing IndexInput and IndexOutput, one could start with a plain temp file
> (NIO API)?
>

That's an interesting option. I had ruled out "bypassing the Directory
abstraction and going straight to JDK IO APIs", but maybe it's OK to do
so. I like this option, Dawid!


> I also think that the tradeoffs presented in graphs on the fst-node-cache
> issue are not so bad at all. Yes, the FST is not minimal, but the
> construction-space vs output size is quite all right to me.
>

Well, the tradeoffs I posted in this PR
<https://github.com/apache/lucene/pull/12633> (now merged to main, and
eventually to 9.x) hold only if we still buffer the whole FST in RAM, so
that buffer doubles as our random-access cache of past FST nodes. If we
succeed in changing FST writing to be fully off-heap (appending bytes
directly to disk), then we need that random-access storage somewhere else
(maybe a direct NIO file, maybe just duplicated byte[] copies in the
NodeHash). So net/net those curves will get worse -- more RAM required to
achieve the same minimality. I haven't tested just how much worse yet ...
I wanted to probe this possibility (random read access on a
still-appending file) first, to avoid wastefully duplicating these bytes
in RAM.

On Sat, Oct 21, 2023 at 1:09 AM Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi, the biggest problem is with IndexInputs that work through the FS cache
> (MMapDirectory). The file size changes while you are writing, so it could
> cause strange issues. In particular, the mmap mapping may not see the
> changes you have already written, as there is no happens-before relationship.
>
Hmm, I didn't realize Panama's mmap implementation had this limitation. Or
maybe you are saying this is an OS-level limitation? Because when you map
a region of a file, you must give a bounded range (0 .. file-length), and
then if the file grows, you would have to re-map or make a 2nd, 3rd, ...
map? Yeah OK this seems problematic indeed.
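
Something like this sketch, I guess (plain FileChannel.map; names are
hypothetical, and it ignores the 2 GB per-mapping limit for brevity):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class GrowingFileMapper {
  // A mapping covers a fixed range chosen at map() time, so a
  // still-growing file forces repeated re-mapping, and bytes appended
  // after map() carry no happens-before guarantee for the reader.
  static MappedByteBuffer remapIfGrown(FileChannel channel,
      MappedByteBuffer current) throws IOException {
    long size = channel.size();
    if (current == null || size > current.capacity()) {
      current = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
    }
    return current;
  }
}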

> So, as the others said, if you need stuff already written, keep it in
> memory (like the nodes). We should really not change our IO model for this
> single use case. A 1% slowdown while writing due to some caching or
> buffering does not matter; risking corrupted indexes or errors while
> reading does.
>
Yeah OK I'm convinced :) Let's leave Lucene's "WORM" IO semantics intact,
and either use direct NIO for the suffix hash (NodeHash), or burn the RAM
duplicating the FST nodes (and measure the impact on RAM vs minimality).

Thanks everyone,

Mike McCandless

http://blog.mikemccandless.com