Mailing List Archive: DB-XML/Lucene integration

The ever-growing presence of mingled structured and unstructured data is a
fact of life that modern systems have to deal with. Clearly, full-text
indexing is moving towards DB functionality, i.e.
<attribute,value> fields for projection/filtering, sorting, faceted queries,
transactional CRUD operations, etc. Though set manipulation is not Lucene's
or Solr's forte, the document-object model maps very well to rows of
relational sets or tables, even more so since CLOBs and TEXT fields were
introduced.

On the other hand, relational databases with XML and OO extensions and
native XML repositories still have to deal with the problem of RANKING
unstructured text and of combining text fragments with structured
conditions. They are thus no longer dealing just with a set/relational
model that yields binary answers, but are extending their query languages
to handle the concepts of fuzziness, relevance, etc. (e.g. SQL/MM,
XQuery-FullText).
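
As an illustration of the difference between a binary, set-style answer and
a ranked one, here is a minimal Python sketch. It is not Lucene's, Solr's or
any database's API: the sample documents, the field names and the functions
filter_year, score and ranked_search are all hypothetical, and the scoring
is a toy term count rather than a real relevance model.

```python
from collections import Counter

# Hypothetical sample data: a structured field plus a free-text body.
DOCS = [
    {"id": 1, "year": 2007, "body": "lucene index index search"},
    {"id": 2, "year": 2005, "body": "lucene index"},
    {"id": 3, "year": 2007, "body": "relational database tables"},
    {"id": 4, "year": 2007, "body": "lucene search"},
]


def filter_year(docs, year):
    """Set-style predicate: each document either matches or it does not."""
    return [d for d in docs if d["year"] == year]


def score(doc, terms):
    """Toy relevance: number of query-term occurrences in the body."""
    counts = Counter(doc["body"].split())
    return sum(counts[t] for t in terms)


def ranked_search(docs, year, terms):
    """Apply the structured condition, then order the survivors by score."""
    hits = [(score(d, terms), d["id"]) for d in filter_year(docs, year)]
    hits = [h for h in hits if h[0] > 0]  # drop documents with no term match
    # Highest score first; ties broken by document id.
    return sorted(hits, key=lambda h: (-h[0], h[1]))


if __name__ == "__main__":
    print([d["id"] for d in filter_year(DOCS, 2007)])
    print(ranked_search(DOCS, 2007, ["lucene", "index"]))
```

The filter alone returns an unordered yes/no membership, while the combined
query returns (score, id) pairs, which is the kind of graded answer that
SQL/MM and XQuery-FullText add on top of their set-based query models.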

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and Full-Text as simply different, as we
analyze concepts to further understand the real path for evolution of
Lucene/Solr.

Here is a very interesting attempt by Marcelo Ochoa to create a special
type of "index", called a Domain Index, to query unstructured data within
Oracle:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text:
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model

-- Joaquin Delgado

Hi,

On 5/11/07, J. Delgado <joaquin.delgado@gmail.com> wrote:
> I would like once again to open this can of worms, and perhaps think out of
> the box, without classifying DB and Full-Text as simply different, as we
> analyze concepts to further understand the real path for evolution of
> Lucene/Solr

Other data points to consider are hierarchical content repositories
like Jackrabbit and the myriad custom "database + lucene"
repositories out there.

I had some interesting discussions during the last ApacheCon about the
potentially converging storage and feature requirements of various
components along the DB/full-text index axis. At least from the
Jackrabbit perspective it would be interesting to look at how to
better integrate the Lucene search index with our native persistence
model. I guess people from DB projects like Derby have similar
interests.

Looking further into the future it might even make sense to completely
unify the storage model of these various projects, i.e. have set
handling from Derby, hierarchies from Jackrabbit, and indexing from
Lucene in a single extensible storage engine that could be used as a
unified backend layer by various projects.

I believe that many of the current storage data structures need to be
redesigned in any case due to current trends in computing. The driving
trends I see are increasingly cheap storage and the switch to more
parallelism in computing. I believe that within the next 5-10 years
we'll see many projects switching to append-only data structures that
focus on massively parallel read operations with zero locking.
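
To make the idea of append-only structures with lock-free reads concrete,
here is a minimal Python sketch. It is only an illustration under stated
assumptions, not how Lucene, Jackrabbit or Derby store data: the class
AppendOnlyStore and the function scan are hypothetical names, and the
sketch assumes that writers may still serialize among themselves while
readers never take a lock.

```python
import threading


class AppendOnlyStore:
    """Hypothetical store whose records live in immutable segments."""

    def __init__(self):
        self._segments = ()  # immutable tuple of immutable segments
        self._write_lock = threading.Lock()  # serializes writers only

    def commit(self, records):
        """Append a new segment; existing segments are never modified."""
        segment = tuple(records)
        with self._write_lock:
            # Build a new tuple and publish it by rebinding the attribute.
            self._segments = self._segments + (segment,)

    def snapshot(self):
        """Readers take no lock: they just grab the current tuple."""
        return self._segments


def scan(snapshot):
    """Read every record of a snapshot, oldest segment first."""
    return [record for segment in snapshot for record in segment]


if __name__ == "__main__":
    store = AppendOnlyStore()
    store.commit(["a", "b"])
    old = store.snapshot()
    store.commit(["c"])
    print(scan(old))               # the earlier snapshot is unchanged
    print(scan(store.snapshot()))  # a fresh snapshot sees the new segment
```

Because nothing is ever updated in place, a reader holding an old snapshot
keeps a consistent view while later commits proceed, which is what lets
many readers run in parallel without coordinating with each other.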

I guess I'm sufficiently out of the box now... I'll crawl back in. :-)

BR,

Jukka Zitting