Mailing List Archive

Index Optimization: Which is Better?
We are experimenting with XML-aware indexing. The approach we're trying
is to index every element in a given XML document as a separate Lucene
document, along with another Lucene document that captures just the
concatenated text content of the whole document (to handle searching
for phrases across element boundaries). We call this the "all-content"
Lucene document.

We are using a "node type" field to distinguish the different types of
XML document constructs we are indexing (elements, comments, PIs, etc.)
and also thought we would use node type to distinguish the "all-content"
document. When we get a hit list, we can then use the node type to
figure out which XML constructs contained the target text and reduce the
per-element Lucene documents to single XML documents for the final query
result. We can also use node type to limit the query (you might want to
search just in PIs or just in comments, for example).
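
In code terms, the two kinds of Lucene document look roughly like this
(a sketch, not our actual code; docId, elementText, and wholeDocText
stand in for values pulled out of the DOM, and writer is an open
IndexWriter):

    Document elemDoc = new Document();
    elemDoc.add(Field.Keyword("docid", docId));              // source XML doc id
    elemDoc.add(Field.Keyword("nodetype", "ELEMENT_NODE"));  // per-element doc
    elemDoc.add(Field.Text("contents", elementText));        // the element's PCDATA
    writer.addDocument(elemDoc);

    Document allDoc = new Document();
    allDoc.add(Field.Keyword("docid", docId));
    allDoc.add(Field.Keyword("nodetype", "ALL_CONTENT"));    // the all-content doc
    allDoc.add(Field.Text("contents", wholeDocText));        // whole document text
    writer.addDocument(allDoc);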

Our question is this: given that for the all-content document we could
either use the default "content" field for the text and the node type
field to label the document as the all-content node or simply use a
different field name for the content (e.g., "alltext" or something),
which of the following queries would tend to perform better? This:

"some text" AND nodtype:ALL_CONTENT

or:

alltext:"some text"

Or is there any practical difference?
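
In terms of the query API, the two forms would be built something like
this (again just a sketch):

    // Form 1: "some text" AND nodetype:ALL_CONTENT
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("contents", "some"));
    phrase.add(new Term("contents", "text"));
    BooleanQuery form1 = new BooleanQuery();
    form1.add(phrase, true, false);                  // required clause
    form1.add(new TermQuery(new Term("nodetype", "ALL_CONTENT")),
              true, false);                          // required clause

    // Form 2: alltext:"some text"
    PhraseQuery form2 = new PhraseQuery();
    form2.add(new Term("alltext", "some"));
    form2.add(new Term("alltext", "text"));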

How we construct the Lucene documents will affect how our front-end
and/or users have to construct queries. It would be slightly more
convenient for front-ends to get the all-content doc by default (using
the "content" field for the text), but we thought the "AND" query
needed to limit searches to just the text (thus ignoring
element-specific searching) might incur a performance penalty.

In a related question, is there anything we can or need to do to
optimize Lucene to handle lots of little Lucene documents?

Thanks,

Eliot

--
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX 78752
T 512.656.4139 | F 512.419.1860 | eliot@isogen.com

w w w . d a t a c h a n n e l . c o m
Re: Index Optimization: Which is Better?
Doug wrote:

> I'm having trouble getting a clear picture of your indexing scheme.

I've been doing a lot of thinking about this same problem, so I
may be a little more in tune with what Eliot's saying. By the way,
Eliot, I'm very interested in your results. I considered the basic
approach you're using, but I thought it was a bit extreme in terms of
having zillions of tiny Lucene Documents. I'm working on a quick
kludge that may serve my immediate purposes (if it does, I'm planning
to post the details here).

> Could you provide some simple examples, e.g., for the xml:

> <tag1>this is some text
> <tag2>and some other text</tag2>
> </tag1>
> would you have something like the following?
> doc1
> node_type: tag1
> contents: this is some text
> doc2
> node_type: tag2
> contents: and some other text
> doc3
> node_type: all_contents
> contents: this is some text and some other text

I think that's exactly what Eliot is intending.


> My first instinct would be to have something like:
> doc1
> tag1: this is some text
> tag2: and some other text
> all-tags: this is some text and some other text
> What do you need that that does not achieve?

Name collisions: you can have elements with the same name at
different levels of the tree, and attributes and elements can share a
name as well. Obviously one way around this is "Don't do that", but
that could get really tiresome, quickly.
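
For instance, with a made-up document like:

    <book>
    <title>Lucene In Practice</title>
    <chapter title="Intro"><title>Intro</title></chapter>
    </book>

the two title elements and the title attribute would all get mapped to
the same "title" field.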

If you just conflate the elements and attributes under the same
name (i.e. field "blah" contains a concatenated set of values from all
occurrences of both elements and attributes) then your searches become
much more limited in what you can specify. This is, by the way, the
approach I'm trying out, with a second stage to refine the results and
drop out false positives. But I'll have to wait on saying any more
about that.
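
Very roughly, the conflated document gets built something like this
(just a sketch of the general idea, not my actual kludge; here
collectValues() stands in for a DOM walk that appends every element's
text and every attribute's value to a per-name StringBuffer):

    // One Lucene Document per XML document; each distinct element or
    // attribute name becomes a single field holding all occurrences.
    Hashtable buffers = collectValues(domDocument);  // name -> StringBuffer
    Document doc = new Document();
    for (Enumeration names = buffers.keys(); names.hasMoreElements(); ) {
        String name = (String) names.nextElement();
        doc.add(Field.Text(name, buffers.get(name).toString()));
    }
    writer.addDocument(doc);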

All of this, of course, is in the context of having arbitrary XML
documents. If you have predefined XML schemas then you can hand-code
the mappings from elements to lucene document fields. But then you
trade a heck of a lot of flexibility for a lot of maintenance.

Steven J. Owens
puff@darksleep.com
RE: Index Optimization: Which is Better?
Eliot,

I'm having trouble getting a clear picture of your indexing scheme.

Could you provide some simple examples, e.g., for the xml:
<tag1>this is some text
<tag2>and some other text</tag2>
</tag1>
would you have something like the following?
doc1
node_type: tag1
contents: this is some text
doc2
node_type: tag2
contents: and some other text
doc3
node_type: all_contents
contents: this is some text and some other text

That would help me.

My first instinct would be to have something like:
doc1
tag1: this is some text
tag2: and some other text
all-tags: this is some text and some other text
What do you need that that does not achieve?

Doug
Re: Index Optimization: Which is Better?
"Steven J. Owens" wrote:

> I think that's exactly what Eliot is intending.

Steven is correct. For each element in the XML document we create a
separate Lucene document with the following fields:

- docid (unique identifier of the input XML document, e.g., file system
path, object ID from a repository, URL, etc.)
- list of ancestor element types
- DOM tree location
- text of direct PCDATA content
- DOM node type (ELEMENT_NODE, PROCESSING_INSTRUCTION_NODE,
COMMENT_NODE) [This list is probably incomplete but it was enough for
us to test the idea.]
- For each attribute of the element, a field whose name is the attribute
name and whose value is the attribute value.

We also capture all the text content of the input XML document as a
single Lucene document with the same docid and the node type
"ALL_CONTENT".

Given these Lucene documents, I can do queries like this:

"big brown dog" AND ancestor:tag2 AND NOT ancestor:tag3 AND
language:english

This will return one doc for each element instance that contains the
phrase "big brown dog", is within a tag2 element, is not within a tag3
element, and has the value "english" for its language attribute.

To make sure you match the phrase even when it crosses element
boundaries, just include the all-content doc as an alternative:

"big brown dog" AND ((ancestor:tag2 AND NOT ancestor:tag3 AND
language:english) OR nodetype:ALL_CONTENT)

Given this set of Lucene docs, we can then collect them by docid to
determine which XML documents are represented. The ancestor list and
tree location enable correlating each hit back to its original location
in the input document. They also enable post-processing to do more
involved contextual filtering, such as "find 'foo' in all paras that
are first children of chapters".
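
Collapsing the hit list by docid is then a trivial pass over the
results (sketch; Hits is the result class):

    Hits hits = searcher.search(query);
    Hashtable byDoc = new Hashtable();     // docid -> Vector of hit docs
    for (int i = 0; i < hits.length(); i++) {
        Document d = hits.doc(i);
        String id = d.get("docid");
        Vector v = (Vector) byDoc.get(id);
        if (v == null) byDoc.put(id, v = new Vector());
        v.addElement(d);
    }
    // byDoc.keys() now enumerates the distinct XML documents hit; each
    // Vector holds that document's per-element hits, whose ancestor
    // and treeloc fields point back into the original XML.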

We have implemented a first pass at code that does this indexing but we
have no idea how it will perform (we only got this fully working
yesterday and haven't had time to stress it yet).

I agree that this is somewhat "twisted". In fact my colleague John
Heintz, who suggested the approach of one Lucene doc per element,
characterized the idea as an "abuse" of Lucene's design. But we haven't
been able to think of a better or easier way to do it.

It was really easy to write the DOM processing code to generate this
index, and the interaction with Lucene's API couldn't have been easier.
This is my first experience programming against Lucene, and I'm really
impressed with the simplicity of the API and the power of the
architecture.

The functionality described above for XML retrieval already surpasses
anything I know how to do with Verity, Fulcrum, Excalibur, etc., and it
was freakishly easy to do once we got the idea for the approach. I just
hope it performs adequately.

Cheers,

E.

--
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX 78752
T 512.656.4139 | F 512.419.1860 | eliot@isogen.com

w w w . d a t a c h a n n e l . c o m