Mailing List Archive: Learning Lucene from ground up

rahul196452 at gmail

Nov 4, 2022, 9:23 PM

Hello,
I have been working with Lucene and Solr for quite some time and have a
good understanding of many of the moving parts at the code level. However, I
wish to learn Lucene internals from the ground up and want to familiarize
myself with all the dirty details. I would like to know the best way to go
about it.

To kick things off, I have been thinking about picking up “Lucene in
Action”, but have been hesitant (possibly wrongly) since it is based on
Lucene 3.0 and we have come a long way since then. As an example of the
level of detail I wish to learn (among other things): which parts of a
segment (.tim, .tip, etc.) get loaded into memory at search time, which
parts use finite state machines and why, etc.

I would really appreciate any thoughts/inputs on how I can go about this.
Thanks in advance!

Regards,
Rahul

Re: Learning Lucene from ground up

mycoy.zhang at gmail

Nov 5, 2022, 6:12 PM

I just started learning the Lucene HNSW source code last month.

I find the most effective way is to start with the test cases, set debugging
breakpoints in the code you're interested in, and walk through the code.

Regards
MyCoy

Re: Learning Lucene from ground up

jpountz at gmail

Nov 7, 2022, 2:50 AM

+1 to MyCoy's suggestion.

To answer your most immediate questions:
- Lucene mostly loads metadata into memory when it opens a segment (the
dvm, tmd, fdm, vem, nvm, and kdm files). Other files are memory-mapped, and
Lucene relies on the filesystem cache to make their data efficiently
available. This allows Lucene to have a very small memory footprint for
searching.
- Finite state machines are mostly used for suggesters and for the terms
index (tip file), which essentially stores all prefixes that are shared by
25-40 terms in an FST.
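
To make the memory-mapping point above a bit more concrete, here is a minimal sketch using only the JDK's FileChannel.map. It is not Lucene code: the class name MmapSketch, the helper methods, and the file contents are all made up for illustration, and Lucene's own MMapDirectory is considerably more involved. The idea it shows is the same, though: mapping a file does not copy it onto the Java heap, and pages are pulled in through the filesystem cache only when they are actually read.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; none of these names come from Lucene.
public class MmapSketch {

    /** Maps the whole file read-only. No file bytes are copied onto the Java heap here. */
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Reads length bytes starting at offset; only the touched pages need to be resident. */
    public static String readSlice(MappedByteBuffer mapped, int offset, int length) {
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = mapped.get(offset + i); // absolute read, served via the OS page cache
        }
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large index data file (made-up content).
        Path file = Files.createTempFile("segment-data", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, "header|postings|footer".getBytes(StandardCharsets.US_ASCII));

        MappedByteBuffer mapped = mapReadOnly(file);
        System.out.println(readSlice(mapped, 7, 8)); // prints "postings"
    }
}
```

In this sketch the heap cost of a read is only the small byte array for the requested slice, regardless of how large the mapped file is, which is the property that keeps the search-time memory footprint small.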

--
Adrien