Mailing List Archive

Creating indexes
I have a big (40 MB or so) file to index. The file contains a whole bunch
of documents, which are each pretty small, about a few typewritten pages
long. There's a title, date, and author for each document, in addition to
the documents' actual text.

I'm not quite sure how you index this in Lucene. For each document in the
original file, I assume that I create a separate Lucene Document object in
the index with author, date, title, and text fields. If so, my question is:
when I'm reading in the original file for indexing, does Lucene know
where each document begins and ends in the original file? Or do I have to
write a parser or filter or something for the InputStream that's reading the
file?

Chris Sibert



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Creating indexes
It depends on how the document is built, but I guess not.
I had to write my own XML parser; you get better results when
you customize something like that to your needs.

-----Original Message-----
From: Chris Sibert [mailto:chrissibert@attbi.com]
Sent: Wednesday, June 12, 2002 10:27 AM
To: Lucene Users List
Subject: Creating indexes


Re: Creating indexes
Lucene doesn't know where each of your documents starts or ends. It does know where a file starts and ends, but in your case one file contains many small documents. If you want to split your big file into small files, you must do that yourself. Take a look at the Document class and you will see that Lucene uses a Reader to index the body of a file, so maybe you should build a class that returns a Reader for each sub-document you want.
But I think it is easier to split your main document into small documents and index these small documents with a common "keyword" that is the name of the actual big file, so that when you search you can tell where each "sub" document is located. After you index those files you can delete them. What you need is a BigDocumentManager that:

1. Splits your big file(s).
2. Indexes the pieces (don't forget the keyword => big doc name).
3. Deletes those "sub" documents (they are like temp docs).
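
A minimal sketch of the "Reader for each sub-document" idea above, in plain Java with no Lucene dependency. The class name SubDocumentSplitter and the delimiter pattern are hypothetical; the sketch assumes every sub-document starts with a line that matches the delimiter, and it holds the pieces in memory, which should be acceptable for a file of about 40 MB.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): one Reader per sub-document. */
class SubDocumentSplitter {

    /** Splits the big file into the raw text of each sub-document, in file order. */
    static List<String> splitText(Reader bigFile, Pattern delimiter) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // stays null until the first delimiter line
        try {
            BufferedReader in = new BufferedReader(bigFile);
            String line;
            while ((line = in.readLine()) != null) {
                if (delimiter.matcher(line.trim()).matches()) {
                    if (current != null) {
                        parts.add(current.toString()); // close the previous piece
                    }
                    current = new StringBuilder();
                }
                if (current != null) { // lines before the first delimiter are dropped
                    current.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current != null) {
            parts.add(current.toString());
        }
        return parts;
    }

    /** Wraps each sub-document in its own Reader, ready to hand to an indexer. */
    static List<Reader> split(Reader bigFile, Pattern delimiter) {
        List<Reader> readers = new ArrayList<>();
        for (String part : splitText(bigFile, delimiter)) {
            readers.add(new StringReader(part));
        }
        return readers;
    }
}
```

Each Reader returned by split could then become the body of one Lucene Document. The delimiter line is kept as the first line of each piece so the indexing code can still see which sub-document it is handling.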

Hope this helps.


Re: Creating indexes
The file that I have is big, about 40 MB. And it's got a whole lot of
smaller documents in it - about 15 thousand - too many to separate into
individual files. These individual documents are actually similar to emails
stored in a large text file. The file is structured to an extent, with a
number before each document - (ex: __10001__, __10002__, etc.), with the
date, etc. Kind of like email headers.

In the Lucene index, it seems like I'll have to: 1) use a DocumentNumbers
field to index all of the document numbers, 2) a Dates field to index the
document dates, 3) and a TextBody field to index all of the document text
together. I'll have to write an InputStreamFilter or something to parse the
data as it's coming in to the Lucene IndexWriter, create a new document
every time I hit a new number, and parse out the numbers - like __10001__ -
so I can separate them out in the DocumentNumbers field, the dates into a
Dates field, and the text in a TextBody field. It won't be pleasant writing
that parser, but...
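
A rough sketch of such a parser, in plain Java without Lucene. It starts a new entry every time it hits a marker line such as __10001__. The exact layout of the file is not given in this thread, so the sketch assumes that each marker sits on its own line, is followed by "Name: value" header lines (for example a Date line), and that a blank line separates the headers from the text. The names ParsedDocument and BigFileParser are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One parsed entry of the big file; the field names are illustrative. */
class ParsedDocument {
    final String number;                                        // "10001" from __10001__
    final Map<String, String> headers = new LinkedHashMap<>();  // e.g. Date -> value
    final StringBuilder text = new StringBuilder();             // body text
    ParsedDocument(String number) { this.number = number; }
}

class BigFileParser {
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__");
    // Assumption: header lines look like "Name: value".
    private static final Pattern HEADER = Pattern.compile("([A-Za-z][A-Za-z-]*):\\s*(.*)");

    static List<ParsedDocument> parse(Reader source) {
        List<ParsedDocument> docs = new ArrayList<>();
        ParsedDocument current = null;
        boolean inHeaders = false;
        try {
            BufferedReader in = new BufferedReader(source);
            String line;
            while ((line = in.readLine()) != null) {
                Matcher marker = MARKER.matcher(line.trim());
                if (marker.matches()) {          // a new number starts a new document
                    current = new ParsedDocument(marker.group(1));
                    docs.add(current);
                    inHeaders = true;
                    continue;
                }
                if (current == null) {
                    continue;                    // ignore anything before the first marker
                }
                if (inHeaders) {
                    if (line.trim().isEmpty()) {
                        inHeaders = false;       // blank line ends the header block
                        continue;
                    }
                    Matcher header = HEADER.matcher(line);
                    if (header.matches()) {
                        current.headers.put(header.group(1), header.group(2).trim());
                        continue;
                    }
                    inHeaders = false;           // not a header line: treat it as body text
                }
                current.text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

Each ParsedDocument would then be turned into one Lucene Document (number, date, and text as separate fields) and passed to IndexWriter.addDocument, so the index ends up with one entry per number rather than one field holding all numbers together.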

My other issue at this point is how to then display the documents that
relate to the search hits. I have to be able to open that 40 MB file and go
to the document(s) that correspond to the hits in the index, for display to
the user. Does Lucene keep a location stored in the index of where each word
is found in the original file? How do I know at what point in the original
data file to find the offset to display the original document? Is this
something that I have to store myself in each document object in the index?
Is this why you create separate document objects in the Lucene index, so that
each new document object in the index will contain the file offset to the
original data file? And if Lucene doesn't put that file offset in there
automagically, I would have to store that myself as I create the index, in
something like a FileOffsetLocation field, for each document. Am I on the
right track here?
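
If the offsets do have to be recorded by hand, one way to sketch it in plain Java is below. The scan method notes the byte offset at which each __NNNNN__ marker line starts, and the read method seeks straight to a recorded range instead of re-parsing the whole file. The class name OffsetIndex is hypothetical, and the sketch assumes a single-byte encoding (ISO-8859-1) and marker lines of the form described above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: remembers where each sub-document lives in the big file. */
class OffsetIndex {
    private static final Pattern MARKER = Pattern.compile("__(\\d+)__");

    /** Maps each document number to {startOffset, endOffset} in bytes. */
    static Map<String, long[]> scan(byte[] data) {
        Map<String, long[]> offsets = new LinkedHashMap<>();
        long[] previous = null;
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == '\n') {
                String line = new String(data, lineStart, i - lineStart,
                        StandardCharsets.ISO_8859_1).trim();
                Matcher m = MARKER.matcher(line);
                if (m.matches()) {
                    if (previous != null) {
                        previous[1] = lineStart;   // previous document ends here
                    }
                    previous = new long[] {lineStart, data.length};
                    offsets.put(m.group(1), previous);
                }
                lineStart = i + 1;
            }
        }
        return offsets;
    }

    /** Reads one sub-document by seeking to its recorded byte range. */
    static String read(Path bigFile, long start, long end) {
        try (RandomAccessFile file = new RandomAccessFile(bigFile.toFile(), "r")) {
            byte[] buffer = new byte[(int) (end - start)];
            file.seek(start);
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The two offsets for each number could be saved as stored fields on the matching Lucene Document (for example the FileOffsetLocation field suggested above), so that a search hit carries enough information to call read on the original file.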

Whew.

----- Original Message -----
From: "none none" <korfut@lycos.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, June 12, 2002 11:56 AM
Subject: Re: Creating indexes


RE: Creating indexes
Just store the whole thing in the index. It'll make the index bigger,
but it'll let you find the method in the madness; manually parsing a
forty-meg file every time you need to display search results is too intensive.

Nader Henein

-----Original Message-----
From: Chris Sibert [mailto:chrissibert@attbi.com]
Sent: Wednesday, June 19, 2002 10:47 AM
To: Lucene Users List
Subject: Re: Creating indexes


Re: Creating indexes
Thanks.

----- Original Message -----
From: "Nader S. Henein" <nsh@bayt.net>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, June 19, 2002 2:54 AM
Subject: RE: Creating indexes


RE: Creating indexes
OK, let's reorganize:

1. You know how to split the file.
2. If your file is about 40 MB and you do not have to index a large number of them, you can "store" each "sub_document" in the index. Take a look at the Document class: the body field should be indexed, tokenized, and stored.
3. A stored field can be loaded just by retrieving the Document from the Hits and getting the field value.
4. If you don't like point 3, you can also store your "sub_documents" in your file system. I suggest this solution: when you parse the big file, split it into little files and save them as [keyword].txt in a common folder named [big file name], e.g.:
/[big file name]/__10001__.txt
5. Run the indexer on those files.
6. Add a keyword to recognize the "big file name": add the folder name as a keyword field that is indexed, stored, and not tokenized.
7. Run your query as you want, and if you need to search only a particular big file, run a query using MultiFieldQueryParser.java and set the folder keyword to your preferred big_file_name.
8. If you want to highlight your document, search this mailing list for "highlight" and you'll find something.
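
The file layout from point 4 can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the sub-documents are assumed to be already split and keyed by their marker (for example __10001__), and UTF-8 is an arbitrary choice of output encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: writes each sub-document to [big file name]/[keyword].txt. */
class SubDocumentFiles {

    /** Creates the common folder and writes one small file per sub-document. */
    static List<Path> writeAll(Path parentDir, String bigFileName,
                               Map<String, String> subDocuments) {
        try {
            Path folder = Files.createDirectories(parentDir.resolve(bigFileName));
            List<Path> written = new ArrayList<>();
            for (Map.Entry<String, String> entry : subDocuments.entrySet()) {
                Path target = folder.resolve(entry.getKey() + ".txt");
                Files.write(target, entry.getValue().getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The folder name doubles as the keyword that identifies the big file (point 6). */
    static String bigFileKeyword(Path subDocumentFile) {
        return subDocumentFile.getParent().getFileName().toString();
    }
}
```

When indexing each small file, the value returned by bigFileKeyword would go into the untokenized keyword field from point 6, so a query can later be restricted to one big file.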

Is that ok?
ciao.
