We are experimenting with XML-aware indexing. The approach we're trying
is to index every element in a given XML document as a separate Lucene
document, along with another Lucene document that captures just the
concatenated text content of the document (to handle searching for
phrases across element boundaries). We're calling that one the
"all-content" Lucene document.
We are using a "node type" field to distinguish the different types of
XML document constructs we are indexing (elements, comments, PIs, etc.)
and also thought we would use node type to distinguish the "all-content"
document. When we get a hit list, we can then use the node type to
figure out which XML constructs contained the target text and reduce the
per-element Lucene documents to single XML documents for the final query
result. We can also use node type to limit the query (you might want to
search just in PIs or just in comments, for example).
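To make the scheme above concrete, here is a minimal sketch of the flattening step using only the JDK's DOM parser. Plain field maps stand in for Lucene documents, so nothing here is Lucene API. The class name NodeFlattener, the field name "nodetype", and the node type labels other than ALL_CONTENT are assumptions for illustration, as is the choice to store each element's full descendant text as its content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeFlattener {

    /** Returns one field map per XML construct, followed by one ALL_CONTENT map. */
    public static List<Map<String, String>> flatten(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<Map<String, String>> records = new ArrayList<>();
        walk(dom, records);
        // The all-content record: concatenated text of the whole document,
        // so a phrase that spans element boundaries appears in one place.
        records.add(record("ALL_CONTENT", dom.getDocumentElement().getTextContent()));
        return records;
    }

    private static void walk(Node node, List<Map<String, String>> out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                // Assumption: an element's content is all of its descendant text.
                out.add(record("ELEMENT", node.getTextContent()));
                break;
            case Node.COMMENT_NODE:
                out.add(record("COMMENT", node.getNodeValue()));
                break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                out.add(record("PI", node.getNodeValue()));
                break;
            default:
                break; // document and text nodes get no record of their own
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }

    private static Map<String, String> record(String nodeType, String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("nodetype", nodeType); // hypothetical name for the "node type" field
        fields.put("content", content);
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>some <b>text</b> here</p><!-- a note --><?target data?></doc>";
        for (Map<String, String> r : flatten(xml)) {
            System.out.println(r);
        }
    }
}
```

In a real indexer each map would presumably become one Lucene document, with some extra field tying the per-node records back to their source XML document so that a hit list can be reduced to whole documents.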
Our question is this: for the all-content document we could either use
the default "content" field for the text plus the node type field to
label the document as the all-content node, or simply use a different
field name for the content (e.g., "alltext" or something). Which of the
following queries would tend to perform better? This:
"some text" AND nodetype:ALL_CONTENT
or:
alltext:"some text"
Or is there any practical difference?
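To pin down what the two alternatives ask of an index, here is a toy in-memory model of both layouts. It is only a sketch: it is not Lucene code, it matches single terms rather than phrases, and it says nothing about Lucene's real cost for either query. The class FieldLayoutSketch and its methods are hypothetical, and the all-content text is indexed under both "content" and "alltext" purely so the two query shapes can be compared side by side.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class FieldLayoutSketch {

    // Toy inverted index: "field:term" -> sorted set of document ids.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Splits text on whitespace and records each lowercased term under the field. */
    public void add(int docId, String field, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    public TreeSet<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), new TreeSet<>());
    }

    /** First layout: content:term AND nodetype:ALL_CONTENT (two lookups, one intersection). */
    public TreeSet<Integer> queryWithNodeType(String term) {
        TreeSet<Integer> hits = new TreeSet<>(lookup("content", term));
        hits.retainAll(lookup("nodetype", "ALL_CONTENT"));
        return hits;
    }

    /** Second layout: alltext:term (a single lookup). */
    public TreeSet<Integer> queryDedicatedField(String term) {
        return new TreeSet<>(lookup("alltext", term));
    }

    public static void main(String[] args) {
        FieldLayoutSketch index = new FieldLayoutSketch();
        // Doc 0: a per-element document.
        index.add(0, "content", "some text");
        index.add(0, "nodetype", "ELEMENT");
        // Doc 1: the all-content document, indexed both ways for comparison.
        index.add(1, "content", "some text here");
        index.add(1, "nodetype", "ALL_CONTENT");
        index.add(1, "alltext", "some text here");

        System.out.println(index.queryWithNodeType("text"));
        System.out.println(index.queryDedicatedField("text"));
    }
}
```

In this model both queries select the same all-content document. The difference is that the first also has to read the "content" postings contributed by every per-element document before the node type restriction removes them, while the second only sees terms that came from all-content documents. Whether that difference is measurable in Lucene itself is exactly the open question.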
Which way we construct the Lucene document will affect how our front-end
and/or users have to construct queries. It would be slightly more
convenient for front-ends to get the all-content doc by default (using
the "content" field for the text), but we thought the "AND" query needed
to limit searches to just the text (thus ignoring element-specific
searching) might incur a performance penalty.
In a related question, is there anything we can or need to do to
optimize Lucene to handle lots of little Lucene documents?
Thanks,
Eliot
--
. . . . . . . . . . . . . . . . . . . . . . . .
W. Eliot Kimber | Lead Brain
1016 La Posada Dr. | Suite 240 | Austin TX 78752
T 512.656.4139 | F 512.419.1860 | eliot@isogen.com
w w w . d a t a c h a n n e l . c o m