Mailing List Archive: Investigating Lucene for Applicability to [Unusual?] Use Case

Jun 13, 2007, 12:02 PM

Post #1 of 4

Hello:

I'm investigating Lucene as a replacement for a special-purpose search
technology that was developed long before Lucene (or any of the current IR
libraries) became available.

The use case involves so-called print streams. Imagine 20,000 statements
concatenated into one large file suitable for delivery to a print system.
The document formats vary, but include AFP (an IBM printer format), PCL (an
HP format), PostScript, PDF, and even "plain-text".

The indexing application must track the total page count of the embedded
statements. On a hit, the search application must extract and return the
[possibly multi-page] statement embedded within the larger print-stream
file.

How would the search application know (be informed by the Lucene/indexer)
the extent of the internal document(s)?

I'm not seeing this scenario discussed in forums or books. Does anyone have
comments or thoughts on Lucene's applicability as a solution?

Thanks.

Brad
--
View this message in context: http://www.nabble.com/Investigating-Lucene-for-Applicability-to--Unusual---Use-Case-tf3917031.html#a11106468
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: Investigating Lucene for Applicability to [Unusual?] Use Case

sarowe at syr

Jun 13, 2007, 12:19 PM

Post #2 of 4

Hi Brad,

Brad Harper wrote:
> The use case involves so-called print streams. Imagine 20,000 statements
> concatenated into one large file suitable for delivery to a print system.
> The document formats vary, but include AFP (an IBM printer format), PCL (an
> HP format), PostScript, PDF, and even "plain-text".
>
> The indexing application must track the total page count of the embedded
> statements. On a hit, the search application must extract and return the
> [possibly multi-page] statement embedded within the larger print-stream
> file.
>
> How would the search application know (be informed by the Lucene/indexer)
> the extent of the internal document(s)?

You'll get faster/better responses to questions like this if you direct
them to the java-user list.

One solution is to use a Lucene stored field (call it "source")
containing the name of the print stream file (stored, I assume,
externally to the indexer), along with the document's extent within that
file, maybe in a format like "filename:beg:end". Of course, you could
also use three separate fields, one for each piece of information.

Then when the search app gets a hit, the "source" field can be retrieved
and consulted for the information you want.
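
As a rough illustration of the "filename:beg:end" idea, here is a minimal
sketch in plain Java. It is not code from this thread and it deliberately
uses no Lucene API: the class name StatementExtent, its methods, and the
choice of byte offsets with an exclusive end are all assumptions. The
encoded string is what would go into the stored "source" field at index
time; on a hit, the search app would parse it back and read that byte range
from the print-stream file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper (not part of Lucene): where one statement lives
// inside a print-stream file.
final class StatementExtent {
    final String fileName; // print-stream file, stored outside the index
    final long begin;      // offset of the statement's first byte (inclusive)
    final long end;        // offset just past its last byte (exclusive)

    StatementExtent(String fileName, long begin, long end) {
        if (fileName.isEmpty() || begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid extent");
        }
        this.fileName = fileName;
        this.begin = begin;
        this.end = end;
    }

    // Value for the stored "source" field, e.g. "june.afp:1024:4096".
    String encode() {
        return fileName + ":" + begin + ":" + end;
    }

    // Inverse of encode(). Splits on the LAST two colons so that the
    // file name itself may contain ':'.
    static StatementExtent parse(String source) {
        int last = source.lastIndexOf(':');
        int prev = source.lastIndexOf(':', last - 1);
        if (prev < 1) {
            throw new IllegalArgumentException(
                "expected filename:beg:end, got " + source);
        }
        long begin = Long.parseLong(source.substring(prev + 1, last));
        long end = Long.parseLong(source.substring(last + 1));
        return new StatementExtent(source.substring(0, prev), begin, end);
    }

    // Reads the statement's raw bytes out of the print-stream file.
    byte[] read() throws IOException {
        byte[] buf = new byte[Math.toIntExact(end - begin)];
        try (RandomAccessFile in = new RandomAccessFile(fileName, "r")) {
            in.seek(begin);
            in.readFully(buf);
        }
        return buf;
    }
}
```

Parsing from the right means a file name that contains ':' (a Windows drive
letter, for example) still round-trips. With three separate fields instead
of one, the parse step goes away entirely.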

Steve

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Re: Investigating Lucene for Applicability to [Unusual?] Use Case

gsingers at apache

Jun 13, 2007, 12:23 PM

Post #3 of 4

You might get more responses on java-user@lucene.apache.org.

On the surface, I don't see any reason why Lucene couldn't handle
this. Essentially, you are splitting the stream into Lucene
Documents and indexing them. Keep in mind that Lucene doesn't care
where the text comes from (PDF, AFP, whatever); that is up to the
application to control.

So, basically, the answer is that Lucene can enable what you want, but
you will still need to write the application-level logic.
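
To make "application-level logic" a little more concrete, here is a minimal
sketch of the splitting step for the plain-text case only. Everything in it
is an assumption, not something stated in this thread: that each page ends
with a form feed, that the first page of every statement begins with a
known marker, and the names PrintStreamSplitter and Extent. AFP, PCL,
PostScript, and PDF streams would need format-specific parsing instead.
The output is one byte range plus page count per statement, which is what
the indexer would then record alongside each Lucene Document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter for a plain-text print stream (assumed layout:
// pages end with a form feed; a statement's first page starts with a marker).
final class PrintStreamSplitter {
    private static final byte FORM_FEED = 0x0C;

    // One embedded statement: bytes [begin, end) and its page count.
    static final class Extent {
        final int begin;
        final int end;
        final int pages;

        Extent(int begin, int end, int pages) {
            this.begin = begin;
            this.end = end;
            this.pages = pages;
        }
    }

    static List<Extent> split(byte[] stream, byte[] marker) {
        List<Extent> out = new ArrayList<>();
        int stmtBegin = -1; // start of the statement being collected, -1 if none yet
        int pages = 0;      // pages seen so far in that statement
        int pageStart = 0;  // start of the page being scanned

        for (int i = 0; i <= stream.length; i++) {
            boolean atEnd = (i == stream.length);
            if (!atEnd && stream[i] != FORM_FEED) {
                continue;
            }
            // A page ends here; the form feed belongs to the page it closes.
            int pageEnd = atEnd ? i : i + 1;
            if (pageEnd > pageStart) { // ignore an empty tail after the last form feed
                if (startsWith(stream, pageStart, marker)) {
                    if (stmtBegin >= 0) {
                        out.add(new Extent(stmtBegin, pageStart, pages));
                    }
                    stmtBegin = pageStart;
                    pages = 0;
                }
                if (stmtBegin >= 0) {
                    pages++; // pages before the first marker are skipped
                }
            }
            pageStart = pageEnd;
        }
        if (stmtBegin >= 0) {
            out.add(new Extent(stmtBegin, stream.length, pages));
        }
        return out;
    }

    private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (offset + prefix.length > data.length) {
            return false;
        }
        for (int j = 0; j < prefix.length; j++) {
            if (data[offset + j] != prefix[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Each Extent maps directly onto the stored location information: the begin
and end offsets say what to extract on a hit, and pages gives the count the
indexing application has to track.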

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ

Re: Investigating Lucene for Applicability to [Unusual?] Use Case

brad.harper at epsiia

Jun 13, 2007, 12:25 PM

Post #4 of 4

Steve:

Thanks for the reply. I posted my inquiry here because it didn't seem to be
a java-only issue, as such, and I didn't want to cross-post.

Brad
