Mailing List Archive

jdbm vs sleepycat RE: memory usage - RE: your crawler
I think you're right that JDBM is "equivalent" to SleepyCat (a.k.a.
Berkeley DB, right?).

JDBM is pure Java. Last I checked, SleepyCat was a JNI wrapper around the
C Berkeley DB. SleepyCat has seen extensive use for over 10 years, I'm
sure, whereas JDBM is new, and I'm sure their file formats are
incompatible.

If you have hard-core requirements you should use SleepyCat; if you need
pure Java for some reason (convenience, portability...) then JDBM is
nice.
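
If you go the pure-Java route, using JDBM feels pretty much like using a
Map. A rough sketch (untested, and assuming roughly the JDBM 1.0 API with
RecordManagerFactory, BTree.createInstance(), insert() and find(); older
releases differ slightly):

    import java.io.IOException;
    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.btree.BTree;
    import jdbm.helper.StringComparator;

    public class SeenUrls {
        public static void main(String[] args) throws IOException {
            // Open (or create) a disk-backed BTree keyed by URL.
            RecordManager recman =
                RecordManagerFactory.createRecordManager("seen-urls");
            BTree seen = BTree.createInstance(recman, new StringComparator());

            // Map-like put/get, backed by disk instead of the heap.
            seen.insert("http://example.com/", Boolean.TRUE, true); // replace
            Object hit = seen.find("http://example.com/"); // null if unseen
            System.out.println("seen before? " + (hit != null));

            recman.commit();
            recman.close();
        }
    }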

-----Original Message-----
From: John Cwikla [mailto:Cwikla@Biz360.com]
Sent: Friday, September 20, 2002 12:08 PM
To: 'Lucene Developers List'
Subject: RE: memory usage - RE: your crawler



We basically do this (store all seen URLs) using a SleepyCat-backed
server over RMI, though obviously you could do it locally as well (this
is pretty much what JDBM appears to be, too: the BTree version of
SleepyCat). We average 3 ms per get, and we store information such as
URL, last accessed, MD5 checksum, dupe count and error count, so we can
age the URLs instead of just fetching them once. Puts are < 1 ms. I think
we currently have about 2 million URL states in our cache. Access time
and speed are generally a function of using enough cache, and/or whether
you can exploit any locality in your access pattern.
Our system averages about 1000 reads per minute from this db. I wrote
this about 6 months ago so I don't remember what the max was, plus the
machine I'm running in production is a bit faster (1.6 GHz, Linux,
ReiserFS, 1 GB RAM, 48 MB of SleepyCat cache) than what I had in house,
but I would assume 100+/sec should be doable.
Guess it depends on what "superfast" means.
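
Simplified, the record and the remote interface look something like this
(a sketch with invented names, not our actual code):

    import java.io.Serializable;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // One entry per URL we have ever seen.
    public class UrlState implements Serializable {
        public String url;
        public long   lastAccessed;  // millis since epoch
        public byte[] md5;           // checksum of the last fetch
        public int    dupeCount;     // times the URL was re-discovered
        public int    errorCount;    // failed fetch attempts, for aging
    }

    // The SleepyCat-backed server, exported over RMI.
    public interface UrlStateStore extends Remote {
        UrlState get(String url) throws RemoteException;     // ~3 ms avg
        void     put(UrlState state) throws RemoteException; // < 1 ms avg
    }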

cwikla

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: Friday, September 20, 2002 11:31 AM
To: Lucene Developers List
Subject: Re: memory usage - RE: your crawler


Actually, Clemens already has something called CachingQueue, which does
a good job of keeping most of the 'URLs to fetch' in a queue stored on
disk.

A little 'background':

But there is also a need for a data structure that contains all URLs
that the crawler has already 'seen' (found and extracted from fetched
documents). There is no need to store the same URL twice, and fetch it
twice, so there is this 'VisitedURLsFilter' (don't confuse 'visited'
with 'fetched', it just means 'already seen') which contains a list of
all seen URLs.
Every URL extracted from a fetched document needs to be looked up in
this VisitedURLsFilter. If it is not there already, it is added to it
(and to the queue of URLs to fetch). If it is already there, it is
thrown away.

Because of this, the data structure that the VisitedURLsFilter uses to
store and look up URLs must be super fast, which means it cannot live
on disk.
However, a crawler normally encounters hundreds of thousands or even
millions of URLs, so storing them all in this filter wouldn't work (RAM
issue).

So the question is how to store such a large number of URLs and at the
same time provide fast lookup access to them.
Otis


--- "Spencer, Dave" <dave@lumos.com> wrote:
> I may have misunderstood something, but
> if you're looking to reduce memory/RAM usage with
> a convenient data structure in LARM, you might
> consider jdbm (http://jdbm.sf.net/), a disk-based
> BTree.
>
> I wrote a Map interface to it called PersistentMap, so you
> can program against it in a convenient form and possibly just drop it
> in as a replacement for any existing Maps (HashMap, TreeMap).
> It's not hard to do a persistent Set or List either.
>
> http://www.tropo.com/techno/java/jdbm/
>
> The main alternative to jdbm is JISP, which I haven't used:
> http://www.coyotegulch.com/algorithm/jisp/
>
>
>
> -----Original Message-----
> From: Clemens Marschner [mailto:cmad@lanlab.de]
> Sent: Friday, September 20, 2002 6:58 AM
> To: Halácsy Péter
> Cc: Lucene Developers List
> Subject: Re: your crawler
>
>
> ----- Original Message -----
> > From: Halácsy Péter
> > To: cmad@lanlab.de
> > Sent: Friday, September 20, 2002 12:10 PM
> > Subject: your crawler
> >
> > BTW, what is the status of the LARM crawler? Two months ago I
> > promised I could help from September, because I would be a PhD
> > student at the Budapest University of Technology. Did you choose
> > Avalon as a component framework?
>
>
> I'm in the last days of my master's thesis. I will get back to the
> crawler after Oct. 2nd (and a week of vacation on Lake Garda's
> beautiful shore).
>
> Otis has played around with the crawler in the last two weeks, and we
> have had long email conversations. We have found some problems one
> has to cope with. For example, LARM has a relatively high memory
> overhead per server (I mentioned it was made for large intranets).
> Otis's 100 MB of RAM overflowed after crawling about 40,000 URLs in
> the .hr domain.
> I myself have crawled 500,000 files from 500 servers with about
> 400 MB of main memory (by the way, that only takes about 2-3 hours
> [but imposes some load on the servers...]).
>
> We have talked about how the more or less linearly rising memory
> consumption could be controlled. Two components use up memory: the
> URLVisitedFilter, which at this time simply holds a HashMap of
> already visited URLs; and the FetcherTaskQueue, which holds a
> CachingQueue with crawling tasks for each server. The CachingQueue
> itself holds up to two blocks of the queue in RAM, so this may rise
> fast as the number of servers rises (look at the Javadoc, I recall
> it's well documented).
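>
> (The idea, roughly, as a simplified hypothetical sketch rather than
> the actual class: the queue is split into fixed-size blocks; only the
> first and last block stay in RAM, and full middle blocks get
> serialized to disk.)
>
>     import java.util.LinkedList;
>
>     // Sketch of a two-blocks-in-RAM queue. The 'middle' list stands
>     // in for blocks that the real class would write to disk.
>     public class TwoBlockQueue {
>         private static final int BLOCK_SIZE = 64;
>         private LinkedList head = new LinkedList();         // dequeue end
>         private LinkedList tail = new LinkedList();         // enqueue end
>         private final LinkedList middle = new LinkedList(); // of spilled blocks
>
>         public void add(Object task) {
>             tail.add(task);
>             if (tail.size() == BLOCK_SIZE) { // block full: spill it
>                 middle.add(tail);            // real class: serialize to disk
>                 tail = new LinkedList();
>             }
>         }
>
>         public Object next() {
>             if (head.isEmpty()) {            // pull the next block, in FIFO order
>                 head = middle.isEmpty() ? tail : (LinkedList) middle.removeFirst();
>                 if (head == tail) tail = new LinkedList();
>             }
>             return head.isEmpty() ? null : head.removeFirst();
>         }
>     }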
>
> We thought about controlling this by a) compressing the
> visitedFilter's contents, b) taking advantage of some locality
> property of URL distributions (making it possible to move some of the
> URLs to secondary storage), and c) binding a server to only one
> thread, minimizing the need for synchronization (and providing more
> possibilities to move the tasks out of RAM). a) can be accomplished
> by compressing the sorted list of URLs (there are papers about that
> on CiteSeer). Incoming URLs would have to be divided into blocks
> (i.e. per server), and when a time/space threshold is reached, the
> block is compressed. I have done a little work on that already,
> although my implementation only works in batch mode, not
> incrementally.
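>
> (One standard trick from that literature is front coding: sort the
> block of URLs, then store each URL as the length of the prefix it
> shares with its predecessor plus the remaining suffix. A hedged
> sketch, not my actual implementation:)
>
>     // Front-code a sorted block of URLs: URLs from the same server
>     // share long prefixes, so the stored suffixes are short.
>     public class UrlBlockCompressor {
>         public static String[] compress(String[] sortedUrls) {
>             String[] out = new String[sortedUrls.length];
>             String prev = "";
>             for (int i = 0; i < sortedUrls.length; i++) {
>                 String url = sortedUrls[i];
>                 int p = 0, max = Math.min(prev.length(), url.length());
>                 while (p < max && prev.charAt(p) == url.charAt(p)) p++;
>                 out[i] = p + "|" + url.substring(p); // prefix length + suffix
>                 prev = url;
>             }
>             return out;
>         }
>     }
>
> (For "http://a.com/x/1" followed by "http://a.com/x/2", this stores
> "0|http://a.com/x/1" and then just "15|2".)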
>
> Finally, the LuceneStorage is far from optimized and is a major
> bottleneck. We have thought about separating the crawling from the
> indexing process.
>
> By the way: has anybody used a profiler on the Lucene indexing part?
> I suppose there is still a lot to optimize there.
>
> Regarding Avalon: I haven't had the time to look at it thoroughly.
> Mehran Mehr wanted to do that, but I haven't heard anything from him
> for weeks now. Probably he wants to present us the perfect solution
> very soon...
>
> What I have done is try to use the Jakarta BeanUtils for loading the
> config files. It works pretty simply (just a few lines of code, very
> straightforward), but then the checks for mandatory parameters etc.
> would have to be done by hand afterwards, something I would expect an
> XML reader to get from an XSD file or something, at least optionally.
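>
> (Roughly what that looks like, as a hedged sketch with an invented
> CrawlerConfig bean rather than the actual LARM code:)
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.commons.beanutils.BeanUtils;
>
>     public class ConfigLoader {
>         public static void main(String[] args) throws Exception {
>             Map props = new HashMap();
>             props.put("maxThreads", "10"); // config values arrive as strings
>             props.put("startUrl", "http://example.com/");
>
>             CrawlerConfig config = new CrawlerConfig();
>             // populate() matches keys to setters by name and converts types.
>             BeanUtils.populate(config, props);
>
>             // Mandatory-parameter checks still have to be done by hand:
>             if (config.getStartUrl() == null)
>                 throw new IllegalArgumentException("startUrl is required");
>         }
>     }
>
>     // Invented example bean, not the actual LARM config class.
>     class CrawlerConfig {
>         private int maxThreads;
>         private String startUrl;
>         public void setMaxThreads(int n) { maxThreads = n; }
>         public int getMaxThreads() { return maxThreads; }
>         public void setStartUrl(String u) { startUrl = u; }
>         public String getStartUrl() { return startUrl; }
>     }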
>
> Back to my 15-hour day... :-|
>
> --Clemens


