Mailing List Archive

memory usage - RE: your crawler
I may have misunderstood something, but if you're looking to reduce
memory/RAM usage while keeping a convenient data structure in LARM, you
might consider jdbm (http://jdbm.sf.net/), a disk-based BTree.

I wrote a Map interface to it called PersistentMap so you can program
against it in a convenient form and possibly just drop it in as a
replacement for any existing Map (HashMap, TreeMap). It's not hard to do
a persistent Set or List either.

http://www.tropo.com/techno/java/jdbm/

The main alternative to jdbm is JISP, which I haven't used:
http://www.coyotegulch.com/algorithm/jisp/
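
For illustration, here is a minimal sketch of what such a drop-in Map
wrapper can look like. The DiskBTree interface below is a hypothetical
stand-in for jdbm's actual BTree class (whose real API differs in names
and signatures), so this shows the shape of the idea rather than the
PersistentMap code itself:

import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a disk-based BTree such as jdbm's;
// the real jdbm API differs in names and signatures.
interface DiskBTree {
    Object find(Object key) throws java.io.IOException;
    void insert(Object key, Object value) throws java.io.IOException;
    int size();
}

// A partial java.util.Map view over a disk-based BTree, so existing
// code written against Map can switch storage without changes.
public class PersistentMapSketch extends AbstractMap<Object, Object> {
    private final DiskBTree tree;

    public PersistentMapSketch(DiskBTree tree) { this.tree = tree; }

    @Override
    public Object get(Object key) {
        try { return tree.find(key); }
        catch (java.io.IOException e) { throw new RuntimeException(e); }
    }

    @Override
    public Object put(Object key, Object value) {
        try {
            Object old = tree.find(key);   // Map.put returns the old value
            tree.insert(key, value);
            return old;
        } catch (java.io.IOException e) { throw new RuntimeException(e); }
    }

    @Override
    public int size() { return tree.size(); }

    // Iterating a large on-disk tree is deliberately unsupported in this
    // sketch; a full version would stream entries from the BTree.
    @Override
    public Set<Entry<Object, Object>> entrySet() {
        throw new UnsupportedOperationException();
    }
}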



-----Original Message-----
From: Clemens Marschner [mailto:cmad@lanlab.de]
Sent: Friday, September 20, 2002 6:58 AM
To: Halácsy Péter
Cc: Lucene Developers List
Subject: Re: your crawler


----- Original Message -----
>From: Halácsy Péter
>To: cmad@lanlab.de
>Sent: Friday, September 20, 2002 12:10 PM
>Subject: your crawler
>
>
>BTW, what is the status of the LARM crawler? Two months ago I promised I
>could help from September because I would be a PhD student of Budapest
>University of Technology. Did you choose Avalon as a component framework?


I'm in the last days of my master's thesis. I will get back to the
crawler after Oct. 2nd (and a week of vacation on the beautiful shores
of Lake Garda).

Otis has played around with the crawler in the last two weeks, and we
have had long email conversations. We have found some problems one has
to cope with, e.g. LARM has a relatively high memory overhead per server
(I mentioned it was made for large intranets). Otis's 100 MB of RAM
overflowed after crawling about 40,000 URLs in the .hr domain. I myself
have crawled 500,000 files from 500 servers with about 400 MB of main
memory (by the way, that only takes about 2-3 hours, but imposes some
load on the servers...).

We have talked about how the more or less linearly rising memory
consumption could be controlled. Two components use up memory: the
URLVisitedFilter, which at this time simply holds a HashMap of already
visited URLs, and the FetcherTaskQueue, which holds a CachingQueue with
crawling tasks for each server. The CachingQueue itself holds up to two
blocks of the queue in RAM, so memory use may rise fast as the number of
servers rises (look at the Javadoc; I recall it's well documented).
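
As a rough illustration of that CachingQueue idea (a sketch of the
concept as described above, not LARM's actual class; see its Javadoc for
the real thing), a FIFO queue can keep only its first and last blocks in
RAM and spill full middle blocks to disk:

import java.io.*;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;

// Sketch of a FIFO queue that keeps only its head and tail blocks in
// RAM; full middle blocks are serialized to temp files and reloaded
// on demand, so RAM use stays bounded by two blocks per queue.
public class BlockCachingQueue<T extends Serializable> {
    private final int blockSize;
    private ArrayList<T> head = new ArrayList<>();   // block being consumed
    private ArrayList<T> tail = new ArrayList<>();   // block being filled
    private final Deque<File> spilled = new ArrayDeque<>(); // middle blocks

    public BlockCachingQueue(int blockSize) { this.blockSize = blockSize; }

    public void add(T item) throws IOException {
        tail.add(item);
        if (tail.size() >= blockSize) {
            if (head.isEmpty() && spilled.isEmpty()) {
                head = tail;            // nothing queued before it: promote
            } else {
                spilled.addLast(writeBlock(tail)); // spill to disk
            }
            tail = new ArrayList<>();
        }
    }

    public T poll() throws IOException, ClassNotFoundException {
        if (head.isEmpty()) {
            if (!spilled.isEmpty()) head = readBlock(spilled.pollFirst());
            else if (!tail.isEmpty()) { head = tail; tail = new ArrayList<>(); }
            else return null;           // queue exhausted
        }
        return head.remove(0);
    }

    private File writeBlock(ArrayList<T> block) throws IOException {
        File f = File.createTempFile("queue-block", ".bin");
        f.deleteOnExit();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(block);
        }
        return f;
    }

    @SuppressWarnings("unchecked")
    private ArrayList<T> readBlock(File f)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            ArrayList<T> block = (ArrayList<T>) in.readObject();
            f.delete();
            return block;
        }
    }
}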

We thought about controlling this by a) compressing the visitedFilter's
contents, b) taking advantage of some locality property of URL
distributions (making it possible to move some of the URLs to secondary
storage), and c) binding a server to only one thread, minimizing the
need for synchronization (and providing more possibilities to move the
tasks out of RAM). a) can be accomplished by compressing the sorted list
of URLs (there are papers about that on CiteSeer): incoming URLs would
have to be divided into blocks (i.e. per server) and, when a time/space
threshold is reached, the block is compressed. I have done a little work
on that already, although my implementation only works in batch mode,
not incrementally.
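
One standard way to compress a sorted URL list, and I would guess the
flavor of technique those CiteSeer papers describe, is front coding:
store for each URL only the length of the prefix it shares with its
predecessor plus the differing suffix. A minimal sketch:

import java.util.ArrayList;
import java.util.List;

// Front coding: each sorted URL is stored as (shared-prefix length,
// suffix). Sorted URL lists compress well because neighbors share long
// prefixes (scheme, host, directory path).
public class FrontCodedUrls {

    public static List<String> compress(List<String> sortedUrls) {
        List<String> encoded = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int shared = commonPrefixLength(prev, url);
            encoded.add(shared + "|" + url.substring(shared));
            prev = url;
        }
        return encoded;
    }

    public static List<String> decompress(List<String> encoded) {
        List<String> urls = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int sep = entry.indexOf('|');
            int shared = Integer.parseInt(entry.substring(0, sep));
            String url = prev.substring(0, shared) + entry.substring(sep + 1);
            urls.add(url);
            prev = url;
        }
        return urls;
    }

    private static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}

A production version would write varint lengths and raw bytes instead of
strings, but the space savings come from the same shared-prefix
observation.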

Finally, the LuceneStorage is far from optimized and is a major
bottleneck. We thought about separating the crawling process from the
indexing process.

btw: Has anybody used a profiler on the Lucene indexing part? I suppose
there is still a lot to optimize there.

Regarding Avalon: I haven't had the time to look at it thoroughly.
Mehran Mehr wanted to do that, but I haven't heard anything from him for
weeks now. Probably he wants to present us with the perfect solution
very soon...

What I have done is try the Jakarta BeanUtils for loading the config
files. It works pretty simply (just a few lines of code, very
straightforward), but the check for mandatory parameters etc. would then
have to be done by hand afterwards, something I would expect an XML
reader to get from an XSD file or something, at least optionally.
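
Roughly, that approach looks like the sketch below, assuming a flat
properties-style config file and a made-up CrawlerConfig bean (both
hypothetical; only BeanUtils.populate is the real library call). The
mandatory-parameter check at the end is the by-hand part mentioned
above:

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.commons.beanutils.BeanUtils;

// Hypothetical config bean; property names must match the config keys.
public class CrawlerConfig {
    private String startUrl;
    private int maxThreads = 10;

    public String getStartUrl() { return startUrl; }
    public void setStartUrl(String startUrl) { this.startUrl = startUrl; }
    public int getMaxThreads() { return maxThreads; }
    public void setMaxThreads(int maxThreads) { this.maxThreads = maxThreads; }

    public static CrawlerConfig load(String file) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(file)) {
            props.load(in);
        }
        Map<String, Object> map = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            map.put(name, props.getProperty(name));
        }
        // BeanUtils copies matching keys onto the bean, converting types
        // (e.g. the maxThreads string becomes an int).
        CrawlerConfig config = new CrawlerConfig();
        BeanUtils.populate(config, map);

        // The validation an XSD-aware reader could do for us, by hand:
        if (config.startUrl == null) {
            throw new IllegalArgumentException(
                "mandatory parameter 'startUrl' missing");
        }
        return config;
    }
}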

Back to my 15 hour day... :-|

--Clemens




Re: memory usage - RE: your crawler
Actually, Clemens already has something called CachingQueue, which does
a good job of keeping most of the 'URLs to fetch' in a queue stored on
disk.

A little 'background':

But there is also a need for a data structure that contains all URLs the
crawler has already 'seen' (found and extracted from fetched documents).
There is no need to store the same URL twice, and fetch it twice, so
there is this 'VisitedURLsFilter' (don't confuse 'visited' with
'fetched'; it just means 'already seen'), which contains a list of all
seen URLs.
Every URL extracted from a fetched document needs to be looked up in
this VisitedURLsFilter. If it is not there already, it is added to the
filter (and to the queue of URLs to fetch). If it is there already, it
is thrown away.

Because of this, the data structure that the VisitedURLsFilter uses to
store and look up URLs must be super fast, which suggests it cannot be
on disk. However, a crawler normally encounters hundreds of thousands or
even millions of URLs, so storing them all in RAM in this filter
wouldn't work either.

So the question is how to store such a large number of URLs and at the
same time provide fast lookup access.

Otis


--- "Spencer, Dave" <dave@lumos.com> wrote:
> I may have misunderstood something, but
> if you're looking to reduce memory/RAM usage by
> a convenient data structure in LARM you might
> consider jdbm (http://jdbm.sf.net/), a disk-based
> BTree.
>
> I wrote a Map interface to it called PersistentMap so you
> can program to it in a convenient form and possibly just drop it in
> as a replacement for any existing Map's (HashMap, TreeMap).
> It's not hard to do a presistent Set or List either.
>
> http://www.tropo.com/techno/java/jdbm/
>
> The main alternative to jdbm is JISP, which I haven't used:
> http://www.coyotegulch.com/algorithm/jisp/
>
>
>
> -----Original Message-----
> From: Clemens Marschner [mailto:cmad@lanlab.de]
> Sent: Friday, September 20, 2002 6:58 AM
> To: Halácsy Péter
> Cc: Lucene Developers List
> Subject: Re: your crawler
>
>
> Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/index F
> ieldsReader.java>----- Original Message -----
> >From: Halácsy Péter
> >To: cmad@lanlab.de
> >Sent: Friday, September 20, 2002 12:10 PM
> >Subject: your crawler
> >
> >
> >BTW what is the status of the LARM crawler. 2 months ago I promised
> I
> could
> help from September because I would >be a PHD student of Budapest
> University
> of Technology. Did you choose avalon as a component framework?
>
>
> I'm in the last days of my master's thesis. I will get back to the
> crawler
> after Oct. 2nd (and a week of vacation on Garda's beautiful
> lakeside).
>
> Otis has played around with the crawler in the last two weeks, and we
> had
> long email conversations. We have found some problems one has to cope
> with.
> I.e. LARM has a relatively high memory overhead per server (I
> mentioned
> it
> was made for large intranets). Otis's 100MB RAM overflew after
> crawling
> about 40000 URLs in the .hr domain.
> I for myself have crawled 500.000 files from 500 servers with about
> 400
> mb
> of main memory (by the way, that only takes about 2-3 hours [but
> imposes
> some load on the servers...])
>
> We have talked about how the more or less linear rising memory
> consumption
> could be controlled. Two components use up memory: The
> URLVisitedFilter,
> which at this time simply holds a HashMap of already visited URLs;
> and
> the
> FetcherTaskQueue, which holds a CachingQueue with crawling tasks for
> each
> server. The cachingQueue itself holds up to two blocks of the queue
> in
> RAM,
> so this may rise fast if the number of servers rises (look at the
> Javadoc, I
> recall it's well documented).
>
> We though about controlling this by a) compressing the
> visitedFilter's
> contents, b) taking advantage of some locality property of URL
> distributions
> (making it possible to move some of the URLs to secondary storage)
> and
> c)
> binding a server to only one thread, minimizing the need for
> synchronization
> (and providing more possibilities to move the tasks out of the RAM).
> a)
> can
> be accomplished by compressing the sorted list of URLs (there are
> papers
> about that on Citeseer). Incoming URLs would have to be divided into
> blocks
> (i.e. per server) and, when a time/space threshold is reached, the
> block
> is
> compressed. I have done a little work on that already, although my
> implementation only works in batch mode, not incrementally.
>
> Finally, the LuceneStorage is far from being optimized, and is a
> major
> bottleneck. We thought about dividing the crawling from the indexing
> process.
>
> btw: Has anybody used a profiler with the Lucene indexing part? I
> suppose
> there is still a lot to optimize there.
>
> Regarding Avalon: I haven't had the time to look at it thoroughly.
> Mehran
> Mehr wanted to to that, but I haven't heard anything from him for
> weeks
> now.
> Probably he wants to present us the perfect solution very soon...
>
> What I have done is I tried to use the Jakarta BeanUtils for loading
> the
> config files. Works pretty simple (just a few lines of code, vers
> straightforward) but then the check for mandatory parameters etc.
> would
> have
> to be done by hand afterwards, something I would expect an XML reader
> to
> get
> from an xsd file or something, at least optionally.
>
> Back to my 15 hour day... :-|
>
> --Clemens
>
>
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
>
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-dev-help@jakarta.apache.org>
>


__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: memory usage - RE: your crawler
Otis Gospodnetic wrote:
> [snip]
>
> So the question is how to store such a large number of URLs and at the
> same time provide fast lookup access.

One way to speed things up without storing all URLs in RAM would be to
batch the filtering. You start with a list of pages to crawl. As each is
downloaded, you extract its URLs and add them to a queue of URLs to be
filtered, and you process this queue periodically. If you sort the queue
by URL and then merge it with a sorted offline data structure, such as a
B-Tree, you minimize the amount of I/O required. This is not as fast as
keeping everything in RAM, but it is much faster than doing a B-Tree
lookup as each URL is encountered.
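
A minimal sketch of that batch merge, assuming the 'offline' structure
is simply a sorted file of seen URLs (a simple stand-in for the B-Tree):
sort the pending batch, then walk both sorted sequences once, emitting
only the genuinely new URLs:

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Batch duplicate filtering: merge a sorted in-memory batch of candidate
// URLs against a sorted on-disk file of already-seen URLs in one pass.
public class BatchUrlFilter {

    // Returns the URLs from 'batch' not present in 'seenFile', and
    // rewrites 'seenFile' to include them, kept sorted for the next
    // merge. 'seenFile' is assumed to exist (create it empty at startup).
    public static List<String> filterAndUpdate(List<String> batch, File seenFile)
            throws IOException {
        TreeSet<String> sortedBatch = new TreeSet<>(batch); // sort + dedupe
        List<String> fresh = new ArrayList<>();
        File merged = new File(seenFile.getPath() + ".tmp");

        try (BufferedReader seen = new BufferedReader(new FileReader(seenFile));
             PrintWriter out = new PrintWriter(
                 new BufferedWriter(new FileWriter(merged)))) {
            String old = seen.readLine();
            for (String candidate : sortedBatch) {
                // Copy seen URLs smaller than the candidate straight through.
                while (old != null && old.compareTo(candidate) < 0) {
                    out.println(old);
                    old = seen.readLine();
                }
                if (old != null && old.equals(candidate)) continue; // dup: drop
                fresh.add(candidate);    // genuinely new: keep, record as seen
                out.println(candidate);
            }
            // Copy the remaining tail of the seen file.
            while (old != null) {
                out.println(old);
                old = seen.readLine();
            }
        }
        if (!merged.renameTo(seenFile)) {
            throw new IOException("could not replace " + seenFile);
        }
        return fresh;
    }
}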

Doug


RE: memory usage - RE: your crawler
We basically do this (store all seen URLs) using a SleepyCat-backed
server over RMI, though obviously you could do it locally as well (this
is pretty much what JDBM appears to be, too: the BTree version of
SleepyCat). We average 3 ms per get, and we store information such as
URL, last accessed, MD5 checksum, dupe count, and error count, so we can
age the URLs instead of just fetching them once. Puts are < 1 ms. I
think we currently have about 2 million URL states in our cache. Access
time and speed are generally a function of using enough cache, and/or
whether you can exploit any locality in your cache. Our system averages
about 1000 reads per minute from this DB. I wrote this about 6 months
ago, so I don't remember what the max was; plus, the machine I'm running
in production is a bit faster (1.6 GHz, Linux, ReiserFS, 1 GB RAM, 48 MB
of SleepyCat cache) than what I had in house, but I would assume
100+/sec should be doable. Guess it depends on what "super fast" means.
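
A minimal sketch of such a URL-state store, written against the Berkeley
DB Java Edition API (which postdates this thread; the poster used
SleepyCat's earlier Java binding, and the UrlStateStore class here is
made up for illustration):

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

// Hypothetical URL-state store: maps a URL to a small serialized record
// (last accessed, checksum, error count, ...) in a Berkeley DB BTree.
public class UrlStateStore {
    private final Environment env;
    private final Database db;

    public UrlStateStore(File home) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setCacheSize(48 * 1024 * 1024); // ~48 MB, as in the post
        env = new Environment(home, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "url-state", dbConfig);
    }

    public void put(String url, byte[] state) {
        db.put(null, new DatabaseEntry(url.getBytes()),
               new DatabaseEntry(state));
    }

    // Returns the stored state, or null if the URL has never been seen.
    public byte[] get(String url) {
        DatabaseEntry data = new DatabaseEntry();
        OperationStatus status = db.get(
            null, new DatabaseEntry(url.getBytes()), data, LockMode.DEFAULT);
        return status == OperationStatus.SUCCESS ? data.getData() : null;
    }

    public void close() {
        db.close();
        env.close();
    }
}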

cwikla

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Friday, September 20, 2002 11:31 AM
To: Lucene Developers List
Subject: Re: memory usage - RE: your crawler


[snip]
RE: memory usage - RE: your crawler
You could try a Bloom filter. It's a probabilistic data structure, which
means there is a small probability of a false answer (here, a URL
reported as already seen when it was not). However, Bloom filters are
very efficient in both speed and memory usage, and the probability of a
false answer can be controlled by using more memory.

More about Bloom filters can be found at:
http://www.cap-lore.com/code/BloomTheory.html
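
A minimal sketch of the idea (the double-hashing scheme and sizing below
are illustrative choices, not from the page above):

import java.util.BitSet;

// Bloom filter for URL strings: k bit positions per URL, derived from
// two base hashes (double hashing). May report a URL as seen when it
// was not (false positive), but never the reverse.
public class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public UrlBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    public void add(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    public boolean mightContain(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) return false; // surely unseen
        }
        return true; // seen, or a false positive
    }

    // i-th derived hash, mapped into the bit array.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Second, independent-ish hash: FNV-1a over the string's chars.
    private static int fnv1a(String s) {
        int hash = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x01000193;
        }
        return hash;
    }
}

As a rule of thumb, about 10 bits per URL with 7 hash functions gives a
false-positive rate under 1%, so a few megabytes of RAM cover millions
of URLs.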

Magnus


-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: den 20 september 2002 20:31
To: Lucene Developers List
Subject: Re: memory usage - RE: your crawler


[snip]