I may have misunderstood something, but if you're looking to reduce the memory/RAM usage of a data structure in LARM, you might consider jdbm (http://jdbm.sf.net/), a disk-based BTree.
I wrote a Map interface to it called PersistentMap, so you can program against it in a convenient form and possibly just drop it in as a replacement for any existing Map (HashMap, TreeMap). It's not hard to do a persistent Set or List either.
http://www.tropo.com/techno/java/jdbm/
The main alternative to jdbm is JISP, which I haven't used:
http://www.coyotegulch.com/algorithm/jisp/
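To make the drop-in idea concrete, here is a minimal sketch of code that programs against the Map interface only, so an in-memory HashMap can later be swapped for a disk-backed implementation such as the PersistentMap wrapper around jdbm. The VisitedUrls class and its method names are hypothetical, chosen just for illustration; only the HashMap-backed version is shown so the sketch stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: code that depends only on java.util.Map, so the backing store
// can be swapped (e.g. for a disk-based jdbm-backed Map) without
// touching the call sites. VisitedUrls is an illustrative name, not a
// LARM class.
public class VisitedUrls {
    private final Map<String, Boolean> seen;

    // Any Map implementation works here; pass a disk-backed one to
    // trade RAM for disk I/O.
    public VisitedUrls(Map<String, Boolean> backing) {
        this.seen = backing;
    }

    // Returns true the first time a URL is seen, false afterwards.
    public boolean markVisited(String url) {
        return seen.put(url, Boolean.TRUE) == null;
    }
}
```

The point of the design is that callers never name HashMap directly; choosing between an in-memory and a persistent map becomes a one-line change at construction time.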
-----Original Message-----
From: Clemens Marschner [mailto:cmad@lanlab.de]
Sent: Friday, September 20, 2002 6:58 AM
To: Halácsy Péter
Cc: Lucene Developers List
Subject: Re: your crawler
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/index FieldsReader.java
>----- Original Message -----
>From: Halácsy Péter
>To: cmad@lanlab.de
>Sent: Friday, September 20, 2002 12:10 PM
>Subject: your crawler
>
>
>BTW what is the status of the LARM crawler? 2 months ago I promised I
>could help from September because I would be a PhD student of Budapest
>University of Technology. Did you choose Avalon as a component framework?
I'm in the last days of my master's thesis. I will get back to the crawler after Oct. 2nd (and a week of vacation on the beautiful shore of Lake Garda).
Otis has played around with the crawler in the last two weeks, and we had long email conversations. We have found some problems one has to cope with.
For example, LARM has a relatively high memory overhead per server (I mentioned it was made for large intranets). Otis's 100 MB of RAM overflowed after crawling about 40,000 URLs in the .hr domain.
I myself have crawled 500,000 files from 500 servers with about 400 MB of main memory (by the way, that only takes about 2-3 hours [but imposes some load on the servers...]).
We have talked about how the more or less linearly rising memory consumption could be controlled. Two components use up memory: the URLVisitedFilter, which at this time simply holds a HashMap of already visited URLs; and the FetcherTaskQueue, which holds a CachingQueue of crawling tasks for each server. The CachingQueue itself holds up to two blocks of the queue in RAM, so memory use may rise fast as the number of servers rises (look at the Javadoc; I recall it's well documented).
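The two-blocks-in-RAM idea can be sketched roughly as follows. This is a simplified stand-in, not the actual CachingQueue: the queue is split into fixed-size blocks, only the head block (being consumed) and the tail block (being filled) live in RAM, and full middle blocks would be written to disk. Here the "spilled" blocks are kept in an in-memory list so the sketch stays runnable; a real implementation would serialize them to secondary storage.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a block-based queue that keeps only the head and
// tail blocks in RAM. The "spilled" list stands in for blocks that a
// real implementation would move to disk.
public class BlockQueue<T> {
    private final int blockSize;
    private ArrayDeque<T> head = new ArrayDeque<>();          // block being consumed
    private ArrayDeque<T> tail = new ArrayDeque<>();          // block being filled
    private final List<List<T>> spilled = new ArrayList<>();  // stand-in for disk

    public BlockQueue(int blockSize) {
        this.blockSize = blockSize;
    }

    public void add(T item) {
        tail.add(item);
        if (tail.size() == blockSize) {       // tail block is full: spill it
            spilled.add(new ArrayList<>(tail));
            tail.clear();
        }
    }

    public T poll() {
        if (head.isEmpty()) {
            if (!spilled.isEmpty()) {         // refill head from "disk" first
                head.addAll(spilled.remove(0));
            } else {                          // nothing spilled: drain the tail
                ArrayDeque<T> t = head;
                head = tail;
                tail = t;
            }
        }
        return head.poll();                   // null if the queue is empty
    }
}
```

With one such queue per server, worst-case RAM use is bounded by roughly two blocks per server, which is exactly why the total still rises fast as the number of servers rises.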
We thought about controlling this by a) compressing the visitedFilter's contents, b) taking advantage of some locality property of URL distributions (making it possible to move some of the URLs to secondary storage), and c) binding a server to only one thread, minimizing the need for synchronization (and providing more possibilities to move the tasks out of RAM). a) can be accomplished by compressing the sorted list of URLs (there are papers about that on CiteSeer). Incoming URLs would have to be divided into blocks (i.e. per server) and, when a time/space threshold is reached, the block is compressed. I have done a little work on that already, although my implementation only works in batch mode, not incrementally.
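One simple way to compress a sorted block of URLs is front-coding: store only the length of the prefix shared with the previous URL plus the differing suffix. Sorted URLs from the same server share long prefixes, which is the locality property mentioned above. The sketch below is a hypothetical batch-mode illustration (matching the batch-only implementation described in the text), not the actual LARM code.

```java
import java.util.ArrayList;
import java.util.List;

// Front-coding sketch: each sorted URL is stored as
// "<sharedPrefixLength>|<suffix>" relative to the previous URL.
public class FrontCoder {

    // Encode a lexicographically sorted list of URLs.
    public static List<String> encode(List<String> sortedUrls) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String url : sortedUrls) {
            int shared = 0;
            int max = Math.min(prev.length(), url.length());
            while (shared < max && prev.charAt(shared) == url.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + url.substring(shared));
            prev = url;
        }
        return out;
    }

    // Decode back to the full URL list.
    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int sep = e.indexOf('|');
            int shared = Integer.parseInt(e.substring(0, sep));
            String url = prev.substring(0, shared) + e.substring(sep + 1);
            out.add(url);
            prev = url;
        }
        return out;
    }
}
```

Because decoding needs the previous entry, a block has to be decompressed sequentially; that is why this scheme fits batch mode naturally and needs extra work to support incremental lookups.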
Finally, the LuceneStorage is far from optimized and is a major bottleneck. We thought about separating the crawling process from the indexing process.
BTW: has anybody used a profiler on the Lucene indexing part? I suppose there is still a lot to optimize there.
Regarding Avalon: I haven't had the time to look at it thoroughly. Mehran Mehr wanted to do that, but I haven't heard anything from him for weeks now. Probably he wants to present us the perfect solution very soon...
What I have done is try to use the Jakarta BeanUtils for loading the config files. It works pretty simply (just a few lines of code, very straightforward), but the check for mandatory parameters etc. would then have to be done by hand afterwards, something I would expect an XML reader to get from an XSD file or something, at least optionally.
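The by-hand check looks roughly like this. To keep the sketch self-contained it uses a plain stand-in for BeanUtils population (BeanUtils itself would copy the name/value pairs onto the bean via reflection); CrawlerConfig and its fields are hypothetical names for illustration, not LARM's actual config class.

```java
import java.util.Map;

// Sketch of the manual mandatory-parameter check described above.
// Population silently leaves absent properties at their defaults, so
// required settings must be verified afterwards by hand.
public class CrawlerConfig {
    String startUrl;      // mandatory
    int maxThreads = 10;  // optional, has a default

    // Stand-in for BeanUtils.populate(): copy known keys onto the bean.
    static CrawlerConfig fromProperties(Map<String, String> props) {
        CrawlerConfig c = new CrawlerConfig();
        if (props.containsKey("startUrl")) {
            c.startUrl = props.get("startUrl");
        }
        if (props.containsKey("maxThreads")) {
            c.maxThreads = Integer.parseInt(props.get("maxThreads"));
        }
        return c;
    }

    // The check an XSD-aware reader could do automatically.
    void validate() {
        if (startUrl == null) {
            throw new IllegalArgumentException(
                "missing mandatory parameter: startUrl");
        }
    }
}
```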
Back to my 15 hour day... :-|
--Clemens
--
To unsubscribe, e-mail:
<mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-dev-help@jakarta.apache.org>