Mailing List Archive

best practice for indexing multiple equiv fieldnames
I'm planning to use Lucene to index scads of XML files whose data model
includes replicated blocks of tags. Translation: a novice question follows.

My files have a common XML pattern (for illustrative purposes):

<blocks>
<block id="123">some text 1</block>
<block id="456">some text 2</block>
<block id="789">some text 3</block>
</blocks>

Each block has a unique id, but the tag name is identical. The actual data
model has nested tags within these blocks, i.e. metadata with the same
tag names within each block. So, in the real data model, there are multiple
identical tag names that are associated with a specific parent. Something
more like this:

<blocks>
<block id="123">
<author>Joe Blow</author>
<job>hack</job>
</block>
<block id="456">
<author>Jane Doe</author>
<job>President</job>
</block>
</blocks>

In the latter case, I need to be able to search by author or job, for example,
and get the tag's text contents as well as the parent block id.

Adding a field name of "block" or "author" or "job" multiple times to the
same Lucene Document, according to the Lucene javadoc, has the effect of
appending the text for search purposes. I take that to mean, in order to
use a 'hit' I would need to somehow uniquely identify the field from which
the content came even though the content was appended for search purposes.

If I searched an 'author' field name and got a hit, I would not be able to
disambiguate which block id the actual hit belonged to. Or if I searched on
"job", how would I know whether a hit belonged to block id 456 rather than
block id 123?

What is the Lucene approach for indexing a single document that has the same
field name appearing in multiple places and then using the hit to find the
exact association of block id in the above example?

Hope this question makes sense. I'm sure I'm missing something
obvious/simple in how the API would work in this case. Thanks,

Landon Cox


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: best practice for indexing multiple equiv fieldnames
Dude, Landon-

How are you doing? To the novice question I have what might be a novice
answer... but hope it helps.

I don't think that the "Lucene documents" you create and add to the index
need to have the same structure as the "XML documents" you read. Instead
of creating one Lucene document for each XML document, perhaps things will
be easier for you if you create multiple Lucene documents for each XML file
you parse (one Lucene document for each block).
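
As a rough illustration of the one-record-per-block idea, here is a minimal sketch that uses only the JDK's DOM parser. Plain maps stand in for Lucene Documents, and the class name BlockFlattener and the field name "blockId" are made up for this example; a real indexer would add each map entry as a field on a Lucene Document instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class BlockFlattener {

    /** Returns one field map per <block>; each map stands in for one Lucene Document. */
    static List<Map<String, String>> flatten(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse XML", e);
        }
        List<Map<String, String>> records = new ArrayList<>();
        NodeList blocks = dom.getElementsByTagName("block");
        for (int i = 0; i < blocks.getLength(); i++) {
            Element block = (Element) blocks.item(i);
            Map<String, String> fields = new LinkedHashMap<>();
            // Keep the block id with the record so every hit can report it.
            fields.put("blockId", block.getAttribute("id"));
            NodeList children = block.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    // Child tag name becomes the field name, e.g. "author" or "job".
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<blocks>"
                + "<block id=\"123\"><author>Joe Blow</author><job>hack</job></block>"
                + "<block id=\"456\"><author>Jane Doe</author><job>President</job></block>"
                + "</blocks>";
        for (Map<String, String> record : flatten(xml)) {
            System.out.println(record);
        }
    }
}
```

Because each record carries its own "blockId", a hit on "author" or "job" no longer needs to be disambiguated after the fact.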

best,
Belskis


--
Alexander Belskis

SchlumbergerSema - International Telematics Applications
Biotechnologies & Healthcare
c/Albasanz, 12 - 28037 Madrid (Spain)
Tel. (+34) 91 440 8800 (Ext. 7629)

-----Original Message-----
From: Landon Cox [SMTP:lcox@interactive-media.com]
Sent: Wednesday, 1 May 2002 1:52
To: Lucene Users List
Subject: best practice for indexing multiple equiv fieldnames

Re: best practice for indexing multiple equiv fieldnames
Landon,

As Alexander suggested, I would also recommend breaking your XML documents
into multiple Lucene Documents. That way each element, processing
instruction, or other node of interest can have its own Lucene Document. You
can identify the sets of related Lucene Documents that represent the same
true XML document with a per-document identifier.

If you want to see an example of this, check out the XML-Indexing
contribution (c/o W. Eliot Kimber). This solution preserves many XML
structural relationships and makes them searchable. This includes keeping
elements with the same name and the same parent separately identifiable. A
DOM tree location is stored, so after a search it is available for
navigating to specific elements in the document. It also indexes attributes,
etc.

Specific to your requirements, I would recommend specializing the indexing
to the Document Type/Schema to only index the elements you want to search
on. This will minimize the number of Lucene Docs created for each real
document. You will also need to store the parent id attribute value in child
elements based on the node name. You could therefore index the parent id as
a Field on the Child element Documents and make that a part of your queries.
It should only take a few minutes to make this change.
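
Separately from the contribution's code, the parent-id idea can be pictured with a small sketch. Each child element becomes its own record carrying the id of its enclosing block, and a lookup matches on tag name and text. ElementDoc, parentId and the helper methods are hypothetical names for this example, and the linear scan only stands in for a real Lucene query.

```java
import java.util.ArrayList;
import java.util.List;

final class ParentIdDemo {

    /** Stand-in for one Lucene Document describing a single child element. */
    static final class ElementDoc {
        final String tagName;
        final String text;
        final String parentId; // id attribute copied down from the enclosing <block>

        ElementDoc(String tagName, String text, String parentId) {
            this.tagName = tagName;
            this.text = text;
            this.parentId = parentId;
        }
    }

    /** Returns the parent ids of every element whose tag name and text match. */
    static List<String> parentIdsFor(List<ElementDoc> index, String tagName, String text) {
        List<String> ids = new ArrayList<>();
        for (ElementDoc doc : index) {
            if (doc.tagName.equals(tagName) && doc.text.equalsIgnoreCase(text)) {
                ids.add(doc.parentId);
            }
        }
        return ids;
    }

    /** The sample data from the original question, one entry per child element. */
    static List<ElementDoc> sampleIndex() {
        List<ElementDoc> index = new ArrayList<>();
        index.add(new ElementDoc("author", "Joe Blow", "123"));
        index.add(new ElementDoc("job", "hack", "123"));
        index.add(new ElementDoc("author", "Jane Doe", "456"));
        index.add(new ElementDoc("job", "President", "456"));
        return index;
    }

    public static void main(String[] args) {
        // Which block does the job "President" belong to?
        System.out.println(parentIdsFor(sampleIndex(), "job", "president"));
    }
}
```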

The contribution's code should be pretty self-explanatory, but feel free to
email me if you have any specific questions.

One problem you will encounter with this approach arises when you search for
a single document that must contain several element names combined with a
logical 'AND'.

For example:
tagname:book AND tagname:magazine

These two tags will be in separate documents, and therefore you must either
make a small alteration to the query parser or do a post-search process to
combine searches based on the document id.
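
The post-search route can be sketched as a set intersection. This assumes each sub-query (for example tagname:book and tagname:magazine) is run on its own and yields a list of per-document identifiers. The class AndCombiner and the file names are hypothetical, and no Lucene classes are involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AndCombiner {

    /**
     * Intersects the source-document ids of several hit lists, so that only
     * XML documents matched by every sub-query remain.
     */
    static Set<String> documentsMatchingAll(List<List<String>> hitListsByClause) {
        Set<String> result = null;
        for (List<String> docIds : hitListsByClause) {
            if (result == null) {
                // The first clause seeds the result; its order is preserved.
                result = new LinkedHashSet<String>(docIds);
            } else {
                result.retainAll(new HashSet<String>(docIds));
            }
        }
        return result == null ? new LinkedHashSet<String>() : result;
    }

    public static void main(String[] args) {
        List<String> bookHits = Arrays.asList("a.xml", "b.xml", "c.xml");
        List<String> magazineHits = Arrays.asList("c.xml", "a.xml", "d.xml");
        System.out.println(documentsMatchingAll(Arrays.asList(bookHits, magazineHits)));
    }
}
```

Altering the query parser instead would avoid materializing every sub-query's hit list, which matters more as result sets grow.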

Hope this helps,

-Brandon

Brandon Jockman, brandonj@isogen.com
Consultant, ISOGEN International, LLC.

----- Original Message -----
From: "Alexander Belskis" <alexander.belskis@sema.es>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Sent: Monday, May 06, 2002 3:47 AM
Subject: RE: best practice for indexing multiple equiv fieldnames


RE: best practice for indexing multiple equiv fieldnames
Thanks to Alexander and Brandon for replying.

I've made a lot of progress since first posting. I did end up taking the
approach of one Lucene document per XML element. That seemed to be the most
flexible if not the most intuitive approach initially.

In that process I also attached various bits of auxiliary information to
the Lucene doc, needed to find my way back to the element, including the path
to the file itself as well as the XML attributes on that tag. That makes
indexing and searching XML attributes as easy as the tag content itself, so
that was cool.
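
A minimal sketch of what such per-element records might look like, using the JDK's DOM parser rather than JDOM. Maps again stand in for Lucene Documents. The field names "path", "tag" and "text", the "@" prefix for attributes, and the class name ElementRecords are assumptions for illustration, not what the application described here actually used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class ElementRecords {

    /** One field map per element: file path, tag name, each attribute, and own text. */
    static List<Map<String, String>> fromXml(String filePath, String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<Map<String, String>> records = new ArrayList<>();
            collect(filePath, root, records);
            return records;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse " + filePath, e);
        }
    }

    private static void collect(String filePath, Element element, List<Map<String, String>> out) {
        Map<String, String> fields = new LinkedHashMap<>();
        out.add(fields); // added before recursing, so records stay in document order
        fields.put("path", filePath);
        fields.put("tag", element.getTagName());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            fields.put("@" + attribute.getNodeName(), attribute.getNodeValue());
        }
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue()); // only the element's own text
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(filePath, (Element) child, out);
            }
        }
        fields.put("text", text.toString().trim());
    }

    public static void main(String[] args) {
        String xml = "<blocks><block id=\"123\"><author>Joe Blow</author></block></blocks>";
        for (Map<String, String> record : fromXml("data/a.xml", xml)) {
            System.out.println(record);
        }
    }
}
```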

Because, as Brandon and Alexander pointed out, the elements are in separate
Lucene documents, there's a second aggregation phase that's required to
filter all the Lucene doc hits down to the unique list of actual XML file
hits.

Even the post-process works very fast as far as I can tell - I stuff a hash
table with hits where the key is the hit's path to the file (stored when
indexed), then I iterate through the hash. Multiple hits will have the same
XML path, so the hash key effectively creates a list of unique files which
qualify as a 'final' hit. Granted, it's not sorted at that point, but
nonetheless it boils down hits to the unique list very quickly.
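
The aggregation step might look roughly like the sketch below. The Hit class is a hypothetical stand-in for a Lucene hit whose stored "path" field has already been read out. It uses a LinkedHashMap instead of a plain hash table, which is a small departure from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HitAggregator {

    /** Stand-in for one element-level hit; "path" is the file path stored at index time. */
    static final class Hit {
        final String path;
        final String tag;

        Hit(String path, String tag) {
            this.path = path;
            this.tag = tag;
        }
    }

    /** Groups element-level hits by file path; the key set is the unique file list. */
    static Map<String, List<Hit>> groupByFile(List<Hit> hits) {
        Map<String, List<Hit>> byFile = new LinkedHashMap<>();
        for (Hit hit : hits) {
            byFile.computeIfAbsent(hit.path, p -> new ArrayList<>()).add(hit);
        }
        return byFile;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("docs/a.xml", "author"),
                new Hit("docs/b.xml", "job"),
                new Hit("docs/a.xml", "job"));
        // Prints each file once, however many of its elements matched.
        System.out.println(groupByFile(hits).keySet());
    }
}
```

With a LinkedHashMap, files come out in the order of their first hit, so if the incoming hits are ranked, the unique list keeps roughly that ranking.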

For my application and data model, the typical expansion was 1 XML file to
about 50 elements (Lucene docs). It was reassuring to see some of the scale
numbers posted to the list in the last month re: indexing in the 15-20
million doc range, so the expansion in my case was not too much of a
concern, other than that the initial indexing of that much info will take
quite a long time.

I also have implemented pre-filtering on some tags and attribute names that
are simply internal info to the XML data model, never to be searched, so
those don't get in at all.

Not sure how you're addressing the element à la "A DOM tree location is
stored", but it made sense to me to store an XPath query string so JDOM
could address the element directly if need be once the hit is selected in
the application. For that, when the Lucene document is created during
indexing, it's a tree walk up from the element to build the XPath query, and
then the query string is attached to the Lucene document unindexed. It would
be nice for JDOM to return the XPath for an Element object, but I couldn't
find anything like that... perhaps I'm overlooking something in JDOM.
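
The tree walk can be sketched as follows. This version uses the JDK's org.w3c.dom API rather than JDOM, assumes documents without namespaces, and builds a positional path by counting same-named preceding siblings. The class name XPathBuilder and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class XPathBuilder {

    /** Walks from the element up to the root, producing e.g. /blocks[1]/block[2]/job[1]. */
    static String xpathOf(Element element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element;
                n != null && n.getNodeType() == Node.ELEMENT_NODE;
                n = n.getParentNode()) {
            // Prepend each step, since the walk runs from leaf to root.
            path.insert(0, "/" + n.getNodeName() + "[" + positionAmongSameNamedSiblings(n) + "]");
        }
        return path.toString();
    }

    /** 1-based position of the node among its siblings with the same element name. */
    private static int positionAmongSameNamedSiblings(Node node) {
        int position = 1;
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE
                    && s.getNodeName().equals(node.getNodeName())) {
                position++;
            }
        }
        return position;
    }

    /** Parses a string into a DOM; checked exceptions are wrapped to keep the sketch short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse("<blocks>"
                + "<block id=\"123\"><author>Joe Blow</author><job>hack</job></block>"
                + "<block id=\"456\"><author>Jane Doe</author><job>President</job></block>"
                + "</blocks>");
        Element secondJob = (Element) dom.getElementsByTagName("job").item(1);
        System.out.println(xpathOf(secondJob));
    }
}
```

The resulting string would be stored with the Lucene document as an unindexed field and evaluated against the parsed file once a hit is selected.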

In any case, I've progressed quite far and really appreciate Lucene the more
I get into it. I appreciate the responses and general approach advice as
well - it got me started in a productive direction. Thanks,

Landon Cox


RE: best practice for indexing multiple equiv fieldnames [ In reply to ]
Sorry to subject the list to my "thinking out loud" process. I realized
there are some other logic problems in my last mail that I need to solve. I
will sort through those and repost what will hopefully be a better synopsis
of a solution. I think Brandon's point about the logical query limitations
may be bigger than I first thought, particularly for large result sets. I
would like to stay with the element/Lucene doc approach, but may end up
twisting in the wind.

With respect to scale, I probably raised some eyebrows by boiling a hit list
down using a hashtable. I should have mentioned that in this application,
we're not expecting the hit lists to be any longer than a human is willing to
sit and look through (much < 1,000 and probably much < 100 in most cases).
There are definitely some UI things to think through, even though the backend
is plenty happy speed-wise at that size.

Will go quiet again, for now.


