Mailing List Archive

Indexing Text File By Sections In Lucene
Hi, I have a requirement in which I have to index a text file using Lucene.

The text in the file comes from a PDF. I used Tika to extract the text from
the PDF and write it to the text file.

I want to index the text file in the following way.

1. I don't want to index the whole text file content.
2. I don't want to index sentence by sentence.
3. Instead, I want to index the text file by sections. (The text file is
huge.)

How can I do this? Any help would be greatly appreciated.

--Sunil



--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Text-File-By-Sections-In-Lucene-tp4156843.html
Sent from the Lucene - General mailing list archive at Nabble.com.
Re: Indexing Text File By Sections In Lucene [ In reply to ]
Hello Sunil,

You can use XML to mark up the different sections of the text file, and you
can use a separate *Field* to store each section of the document. When the
document is indexed, it will then be indexed by section, and you can query
the sections as your requirements dictate.
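
As a rough illustration of the XML idea, here is a minimal sketch that wraps
already-separated section strings in XML using only the JDK. The class name
SectionsToXml, the element names "document" and "section", and the "source"
and "id" attributes are assumptions made for this example; they are not part
of Lucene or Tika. How the sections are found in the first place depends on
the extraction step.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SectionsToXml {

    /** Wraps each section string in a <section> element under a <document> root. */
    public static String toXml(String source, String[] sections) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            // Element and attribute names are illustrative only.
            Element root = doc.createElement("document");
            root.setAttribute("source", source);
            doc.appendChild(root);

            for (int i = 0; i < sections.length; i++) {
                Element section = doc.createElement("section");
                section.setAttribute("id", String.valueOf(i + 1));
                // Characters such as & and < are escaped when serialized.
                section.setTextContent(sections[i]);
                root.appendChild(section);
            }

            // Serialize the DOM tree to a string without the XML declaration.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build section XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("report.txt", new String[] {"Scope & aims", "Results"}));
    }
}
```

Each section element could then be read back and added to the index as its
own field or document, which is the step the reply above describes.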

Hope this helps.

Thank you
Prakash Kumar Dubey


Re: Indexing Text File By Sections In Lucene [ In reply to ]
Thank you for the response.

Can you elaborate a little more on how we can use XML here?

I have all the data in a text file (read from a PDF file using the Tika API).
How can I transform this text data into XML? There is no fixed structure in
my text data.

--Sunil



Re: Indexing Text File By Sections In Lucene [ In reply to ]
On 04/09/2014 07:09, sunilragidi wrote:
> Hi, I have a requirement in which I have to index a text file using Lucene.
>
> The text file data is from a PDF file. I have used Tika to extract text from
> PDF and put it into the text file.

This may be your mistake - IIRC Tika isn't great at preserving structure
within PDFs. We had a similar requirement a while ago to index large
PDFs by paragraph, and the paragraph markers were being lost. I suggest
you look at other ways of extracting the plain text - pdftotext may
preserve more of the structure; I think that's what we used. Once you
have the individual sections, you can index them as separate documents in
Solr, with metadata to indicate the document they came from.
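
To make the last step concrete, here is a minimal sketch that splits
extracted plain text into sections and tags each one with the file it came
from. It assumes sections are separated by one or more blank lines, which
may not hold for every extractor's output. SectionSplitter and Section are
hypothetical names for this example, not Lucene or Solr types.

```java
import java.util.ArrayList;
import java.util.List;

public class SectionSplitter {

    /** One section of a source file; a plain holder, not a Lucene type. */
    public static final class Section {
        public final String source;  // file the section came from
        public final int number;     // 1-based position within that file
        public final String text;    // section body

        Section(String source, int number, String text) {
            this.source = source;
            this.number = number;
            this.text = text;
        }
    }

    /** Splits text into sections wherever one or more blank lines occur. */
    public static List<Section> split(String source, String text) {
        List<Section> sections = new ArrayList<>();
        // \R matches any line break; \R\s*\R therefore matches a run of
        // line breaks that contains at least one blank line.
        for (String chunk : text.split("\\R\\s*\\R")) {
            String body = chunk.trim();
            if (!body.isEmpty()) {
                sections.add(new Section(source, sections.size() + 1, body));
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Introduction\nFirst paragraph.\n\nMethods\nSecond paragraph.\n";
        for (Section s : split("report.txt", text)) {
            System.out.println(s.source + " #" + s.number + ": "
                    + s.text.replace('\n', ' '));
        }
    }
}
```

Each Section would then become one indexed document, with the source and
number values stored as metadata fields alongside the section text. If the
real sections start with recognisable headings instead, the split pattern
would need to change accordingly.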

HTH

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Indexing Text File By Sections In Lucene [ In reply to ]
Hello Everyone,
I'm not sure that this is the best solution for your problem, but the
following link, PDFBox Extracting Paragraphs
<http://stackoverflow.com/questions/9451312/pdfbox-extracting-paragraphs>,
might help.

Hope this helps!

Thanks and Regards
Prakash Kumar Dubey


On Thu, Sep 4, 2014 at 1:22 PM, Charlie Hull <charlie@flax.co.uk> wrote:

> This may be your mistake - IIRC Tika isn't great at preserving structure
> within PDFs. We had a similar requirement a while ago to index large PDFs
> by paragraphs, and the paragraph markers were being lost. I suggest you
> look at other ways of extracting the plain text - pdftotext may preserve
> more of the structure, I think that's what we used. Once you have the
> individual sections you can index them as separate documents in Solr, with
> metadata to indicate the document they came from.