Mailing List Archive

Maximum file size problem
Hi,

I ran into a problem earlier this week, whereby an index of 8
million small documents resulted in an index file of 2GB. It turns
out this is a common file system limit (some say it might be a Java
limit as well).

Anyway, I have no idea which index file it was, but it seems that I
need some way to control how the index is distributed across files?

Each document looks like this:

exact_keywords: my_phrase_with_underscores (indexed, but not tokenized)
body: my phrase with underscores (repeated N times for weight)
<EOM> title words <EOM> body <EOM>
add_info1:
add_info2: (just stored, short info)
add_info3:

The body is pretty short (say 120 words or so).
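
For reference, each document gets built roughly like this (a sketch
only, assuming the Lucene 1.x Field.Keyword/Field.Text/Field.UnIndexed
calls; the class, method name and values are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: field names mirror the layout above.
class DocBuilder {
    static Document buildDoc(String phrase, String weightedBody, String shortInfo) {
        Document doc = new Document();
        doc.add(Field.Keyword("exact_keywords", phrase));  // stored, indexed, not tokenized
        doc.add(Field.Text("body", weightedBody));         // stored, indexed, tokenized
        doc.add(Field.UnIndexed("add_info2", shortInfo));  // stored only, not indexed
        return doc;
    }
}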


As it turns out, the index I had created was actually usable (thank
heaven, after 24 hours of indexing :)).

*** Anyway, is there any way to control how big the index files grow? ***


Cheers,
Winton


p.s. A big thank you to Doug -- I was able to deliver my experimental
results on time :) My boss is V.Happy :)


Re: Maximum file size problem [ In reply to ]
Yes, the 2GB file limit is a common problem.
I think the answer to your question is no. However, if you are using
Linux, note that newer kernels have support for very large files
(i.e. files over 2GB). FreeBSD and 64-bit OSs don't have that limit
either.

This is an OS limitation, not a Java one.

Some people have reported that they have successfully created indices
over 2GB in size, but they were probably running one of the operating
systems that don't suffer from this limitation.

Otis


--- Winton Davies <wdavies@overture.com> wrote:
> Hi,
>
> I ran into a problem earlier this week, whereby an index of 8
> million small documents resulted in an index file of 2GB. It turns
> out this is a common file system limit (some say it might be a Java
> limit as well).
>
> Anyway, I have no idea which index file it was, but it seems that I
> need some way to control how the index is distributed across files?
>
> Each document looks like this:
>
> exact_keywords: my_phrase_with_underscores (indexed, but not tokenized)
> body: my phrase with underscores (repeated N times for weight)
> <EOM> title words <EOM> body <EOM>
> add_info1:
> add_info2: (just stored, short info)
> add_info3:
>
> The body is pretty short (say 120 words or so).
>
>
> As it turns out, the index I had created was actually usable (thank
> heaven, after 24 hours of indexing :)).
>
> *** Anyway, is there any way to control how big the index files grow? ***
>
>
> Cheers,
> Winton
>
>
> p.s. A big thank you to Doug -- I was able to deliver my experimental
> results on time :) My boss is V.Happy :)
>



RE: Maximum file size problem [ In reply to ]
Otis has already answered most of this.

> From: Winton Davies [mailto:wdavies@overture.com]
>
> *** Anyway, is there anyway to control how big the indexes
> grow ? ****

The easiest thing is to set IndexWriter.maxMergeDocs. Since you hit 2GB at
8M docs, set this to 7M. That will keep Lucene from trying to merge an
index that won't fit in your filesystem. (It will effectively round this
down to the next lower power of IndexWriter.mergeFactor. So with the
default mergeFactor=10, maxMergeDocs=7M will generate a series of 1M
document indexes, since merging 10 of these would exceed the max.)
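
In code that's just two settings on the writer. A minimal sketch,
assuming the 1.x API where mergeFactor and maxMergeDocs are public
fields on IndexWriter (the class name, path and analyzer here are
placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BuildIndex {
    public static void main(String[] args) throws IOException {
        // "index" is a placeholder path; true = create a new index
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        writer.mergeFactor = 10;        // default: merge segments in groups of 10
        writer.maxMergeDocs = 7000000;  // never merge beyond ~7M docs per segment
        // ... writer.addDocument(doc) for each document ...
        writer.close();
    }
}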

Slightly more complex: you could further minimize the number of segments by
optimizing the index once you've added seven million documents and then
starting a new index. Then use MultiSearcher to search across them.
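
On the search side that looks roughly like this (a sketch -- the class
name, directory names, field and query string are made up; assumes the
1.x QueryParser and MultiSearcher APIs):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class SearchAll {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per sub-index; directory names are placeholders.
        Searchable[] searchables = {
            new IndexSearcher("index-1"),
            new IndexSearcher("index-2")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);
        Query query = QueryParser.parse("my phrase", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents");
        searcher.close();
    }
}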

Even more complex and optimal: write a version of FSDirectory that, when a
file exceeds 2GB, creates a subdirectory and represents the file as a series
of files. (I've done this before, and found that, on at least the version
of Solaris that I was using, the files had to be a few hundred KB less than 2GB
for programs like 'cp' and 'ftp' to operate correctly on them.)
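
The heart of that scheme is just mapping a logical file offset to a
sub-file plus an offset within it. A sketch of the arithmetic only, not
a real FSDirectory (the chunk size is an assumption based on the
Solaris note above):

// Sketch only: not a Directory implementation, just the offset math.
class ChunkedFile {
    // A bit under 2GB, so tools like 'cp' and 'ftp' stay happy.
    static final long CHUNK_SIZE = 2147000000L;

    // Which sub-file a logical offset falls in...
    static int chunkIndex(long logicalOffset) {
        return (int) (logicalOffset / CHUNK_SIZE);
    }

    // ...and where within that sub-file.
    static long chunkOffset(long logicalOffset) {
        return logicalOffset % CHUNK_SIZE;
    }
}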

Doug

RE: Maximum file size problem [ In reply to ]
Hi Otis and Doug,

Many thanks for your earlier comments -- sorry I wasn't able to act on
them at the time; I was under time pressure for results :)

I'm still hitting the 2GB limit though -- and I'm on a Solaris box
with large-file support supposedly enabled. I'm using Java 1.2 --
could this be the problem?

I adopted Doug's suggestion of setting maxMergeDocs and mergeFactor.
However, optimize() still wanted to create one big file (I think that
was the cause). Is there any way to semi-optimize :-) ?


Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

RE: Maximum file size problem [ In reply to ]
From http://developer.java.sun.com/developer/bugParade/bugs/4116672.html

It looks like Java 1.2 had a bug at least... and it's in RandomAccessFile as well.
I'm trying 1.3...

Cheers,
Winton



Bug Id: 4116672
Votes: 1
Synopsis: java.io does not support large files >2GB
Category: java:classes_io
Reported Against: 1.1.5, 1.1.3, 1.2beta2
Release Fixed: 1.2beta4
State: Closed, fixed
Related Bugs: 4100321, 4110555
Submit Date: Mar 02, 1998

Description
Java cannot handle files >2GB.


------------------------------------------------------------------------

SunSoft bug 4109867.

Java cannot handle files >2GB.

(1) **API Problems which need immediate attention**

+ The method FileInputStream.available() (for disk files) returns the
number of bytes remaining in the file as an int. The actual value for
a large file may not fit in an int (needs a long).

+ The RandomAccessFile class has a skipBytes(int) method,
rather than the skip(long) method which the InputStreams have.

(2) Implementation - these methods have problems:

+ java.io.FileInputStream.skip()
+ java.io.FileInputStream.available()
+ java.io.RandomAccessFile.getFilePointer()
+ java.io.RandomAccessFile.length()
+ java.io.RandomAccessFile.setLength() [JDK1.2]
+ java.io.File.length()





Workaround
None.

Evaluation
> How do you plan to manage the API change required for
> the available method? To just change the signature
> seems like an incompatible change.

People often assume that the FileInputStream.available method is supposed to
return the total number of bytes remaining in the file, but this is incorrect.
The specification in the JLS (22.3.5) says:

The general contract of available is that it returns an integer k; the next
caller of a method for this input stream, which might be the same thread or
another thread, can then expect to be able to read or skip up to k bytes
without blocking (waiting for input data to arrive).

Now there may well be a bug in that available returns a nonsense value if there
are more than 2GB remaining; if so, we'll certainly fix that. I don't see a
pressing need, however, to change the signature of the available method (which,
actually, we cannot do) or to provide some other method. We already have the
File.length method, which returns a long; this isn't exactly the same as having
an available method that returns a long, but it's pretty close.
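
(For illustration, a small sketch -- the file name is made up -- of the
difference: File.length() returns a long, while available() returns an
int and so cannot describe more than 2GB:)

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class LargeFileSize {
    public static void main(String[] args) throws IOException {
        File f = new File("huge.dat");                         // placeholder name
        System.out.println("length():    " + f.length());     // long: safe for >2GB files
        FileInputStream in = new FileInputStream(f);
        System.out.println("available(): " + in.available()); // int: cannot represent >2GB
        in.close();
    }
}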

> Do you plan to manage the API change for skip as suggested
> in the "Suggested Fix"?

We can certainly consider adding a skip(long) method to RandomAccessFile as you
suggest, but only in 1.2.

> Do you plan not to address this in the 1.1 release
> sequence? Its a bug both places.

We can fix bugs in the 1.1 implementation, but we cannot add to APIs.

-- xxxxx@xxxxx 3/4/1998


xxxxx@xxxxx 1998-03-05
I think getting a consistent semantic/syntax for the "available" family of
methods is critical. The intended use is clear from the language specification
quotation above, but the syntax is in conflict with the semantic specification.
There are two ways to resolve this. Either we change the syntax or we change
the semantic. There are a couple of variations of each theme:

Change the syntax:

1) If this were Java 0.9 (i.e.: before any FCS) the simple solution would
be to retain the simple existing semantic and provide a syntax which
matched appropriately. This implies changing the syntax of the
entire family of "available()" methods to return a long rather than
an int. This is probably inappropriate at this time, as it would be too
significant an incompatible change.

2) A related and palatable solution is to define a new set of "available()"
methods (say, "bytesAvailable()") and deprecate (but support) the
existing "available()" family of methods. This must be accompanied
by a semantic definition of the behavior of "available" when the
current semantic definition can't be accommodated by the syntax.
Either option A or B below could be used.

Change the semantics:

A) Augment the semantics of available() such that if the number of bytes
available without blocking exceeds the value Integer.MAX_VALUE the
value Integer.MAX_VALUE will be returned. On the surface this may
seem a bit kludgy, but if you realize that the intent of the
available method is to feed its result as the third operand of a
subsequent invocation of a read(byte b[], int off, int len) method,
it kinda makes sense.

B) Augment the semantics of available such that a "file too large" io
exception is the specified behavior. This probably isn't ideal, but
it is clear.

My preference is A) or 2) combined with A). Note that both of these have the
property of not being able to break any existing applications. A) by itself
is probably sufficient.

The solution for RandomAccessFile.skipBytes(int) is also less than obvious.
First one needs to note that the semantics of skip(long) and skipBytes(int)
are different. skipBytes is guaranteed to skip the appropriate number of
bytes while skip is very explicit about not extending such a guarantee. I
think I can guess that the "must" semantic of skipBytes was determined to
be necessary for some classes (such as ObjectInputStream), but don't see
why it was appropriate for RandomAccessFile. I might further guess that
because the "must" semantic could be easily delivered for this class it was
chosen even though skip(long) would be more regular. (The other classes
which have skipBytes are reading structured types, rather than a byte stream.)

Considering the above, it would be good and increase the regularity of the
platform if a skip(long) method were to be added to the RandomAccessFile
class. Considering the differing semantics, I would suggest not deprecating
the skipBytes method (although, if it didn't exist, I'd not suggest adding it).

However (as has always been acknowledged) this isn't a required change for
the correct implementation of large file support. A resolution of the
available() method semantics is.


Re. FileInputStream.available: I wasn't sufficiently clear in my earlier
comment. The JLS text quoted above does not imply that the available method
must return the total number of bytes remaining in the file. The value
returned by an available method may, in fact, be much less than the total
number of bytes that can be read without pausing. A valid, though not
particularly useful, implementation of available can return zero every time
it's invoked.

So we don't need to change the signature or the specification of any of the
available methods. Since the true meaning of available is so widely
misunderstood, we should clarify the specifications on this point. We should
note in particular that if more than 2GB are immediately available, then
Integer.MAX_VALUE will be returned.

Re. RandomAccessFile.skipBytes(int): RandomAccessFile implements the DataInput
interface, which is why it must have a skipBytes method. The JLS does not
require "must" semantics for skipBytes; the old javadoc, which is in the
process of being updated to contain the JLS text, was incorrect on this point.

We could add a skip(long) method to RandomAccessFile, but it's roughly
equivalent to this idiom, assuming raf is a RandomAccessFile:

raf.seek(raf.getFilePointer() + offset);

The only difference is that skip(long) wouldn't throw an IOException if you try
to skip past the end of the file. If you think this functionality is important,
please file a separate RFE. I'd like to use this bug report just to track the
implementation bugs.

-- xxxxx@xxxxx 3/9/1998


xxxxx@xxxxx 1998-03-09

I pretty much agree with the above. I got thrown for a major loop by the
out-of-sync and incorrect javadoc entries. Note that skipBytes() javadoc
has already been fixed in 1.2.

I don't know if I'd go as far as specifying that available() return
Integer.MAX_VALUE. I believe that taking the JLS description of available
and inserting it into javadoc would be sufficient.


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
