Mailing List Archive: XML retrieval with Intervals

May 6, 2022, 2:22 AM

Post #1 of 5 (517 views)

Hi Devs!

I found intervals quite nice and natural for retrieving scoped data
(thanks, Alan!). For a document like
<tag>foo stuff bar</tag>
this query matches "foo" and "bar" inside the same tag:
I.containing(I.ordered(I.term("<tag>"), I.term("</tag>")),
             I.unordered(I.term("bar"), I.term("foo")));
It works like a charm until it encounters nested tags of the same name:
<tag>foo <tag>bug</tag> bar</tag>
Because intervals are minimized, the ordered source picks the inner tag
pair, which does not contain both terms. I feel like plain intervals
backed by positions lack tag scoping information.
Do you know any approaches for retrieving XML in Lucene?
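
To make the failure concrete, here is a minimal sketch in plain Java. It is
not Lucene code: the class MinimalTagIntervals and its method are
hypothetical names, and the sketch assumes every token, tags included,
occupies exactly one position. It only mimics the effect of minimization on
an ordered open-tag/close-tag pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Lucene) that mimics interval minimization. */
final class MinimalTagIntervals {

    /**
     * Returns the minimal {start, end} position pairs in which an open tag
     * is followed by a close tag, i.e. pairs that contain no smaller pair.
     */
    static List<int[]> minimalOrdered(String[] tokens, String open, String close) {
        List<int[]> result = new ArrayList<>();
        int lastOpen = -1;
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].equals(open)) {
                // A later open tag always yields a tighter interval.
                lastOpen = pos;
            } else if (tokens[pos].equals(close) && lastOpen >= 0) {
                result.add(new int[] {lastOpen, pos});
                // Earlier open tags are never revisited, so the outer scope is lost.
                lastOpen = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] nested = {"<tag>", "foo", "<tag>", "bug", "</tag>", "bar", "</tag>"};
        for (int[] interval : minimalOrdered(nested, "<tag>", "</tag>")) {
            System.out.println(Arrays.toString(interval)); // prints [2, 4]
        }
    }
}
```

The only interval produced is [2, 4], the inner tag pair. "foo" at position 1
and "bar" at position 5 span [1, 5], which [2, 4] does not contain, so the
containing query finds nothing.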

--
Sincerely yours
Mikhail Khludnev

Re: XML retrieval with Intervals

msokolov at gmail

May 6, 2022, 5:35 AM

Post #2 of 5 (516 views)

Many years ago I started the Lux project, which was designed to
build an XML-aware index using Solr; see
https://github.com/msokolov/lux/tree/master/src/main/java/lux/index/analysis
for the analysis chain I used. Maybe you'll find something useful in
this project? It has been dormant for years and pre-dates interval
queries, but the code is still the code, and XML has not really changed...

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: XML retrieval with Intervals

wunder at wunderwood

May 6, 2022, 8:28 AM

Post #3 of 5 (515 views)

If you need to search XML, consider MarkLogic. It is a very full-featured database and search engine based on XML.

https://www.marklogic.com

Disclaimer: I worked there for a couple of years ten years ago. But I’ve been inside that product and it is non-muggle technology.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Re: XML retrieval with Intervals

msokolov at gmail

May 6, 2022, 9:04 AM

Post #4 of 5 (515 views)

+1 MarkLogic is an excellent product. This Lux thing was inspired by it.

Re: XML retrieval with Intervals

romseygeek at gmail

May 6, 2022, 9:11 AM

Post #5 of 5 (515 views)

I *think* it would be possible to write an IntervalsSource implementation that took opening and closing tags and did the right thing here. As you say, a standard `containing` will try to minimise things, but you could write something that matched an opening tag with its corresponding closing tag by counting how many other opening tags occur before the next closing tag. You’d need to do some caching to handle the look-ahead aspect, but I don’t think that would be too tricky.
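
The matching idea can be sketched in plain Java, independent of Lucene. This is only an illustration under stated assumptions, not an IntervalsSource: the class BalancedTagScopes and its methods are hypothetical names, each token is assumed to occupy one position, and a stack of pending opening tags stands in for the look-ahead and caching mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Lucene) that pairs tags by nesting depth. */
final class BalancedTagScopes {

    /**
     * Pairs every close tag with the most recent unmatched open tag and
     * returns the {start, end} positions, in order of closing position.
     */
    static List<int[]> balancedScopes(String[] tokens, String open, String close) {
        List<int[]> scopes = new ArrayList<>();
        Deque<Integer> pendingOpens = new ArrayDeque<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].equals(open)) {
                pendingOpens.push(pos);
            } else if (tokens[pos].equals(close) && !pendingOpens.isEmpty()) {
                // The stack depth accounts for opening tags seen since the outer one.
                scopes.add(new int[] {pendingOpens.pop(), pos});
            }
        }
        return scopes;
    }

    /** True if any scope fully contains the interval [start, end]. */
    static boolean anyScopeContains(List<int[]> scopes, int start, int end) {
        for (int[] scope : scopes) {
            if (scope[0] <= start && end <= scope[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] nested = {"<tag>", "foo", "<tag>", "bug", "</tag>", "bar", "</tag>"};
        List<int[]> scopes = balancedScopes(nested, "<tag>", "</tag>");
        for (int[] scope : scopes) {
            System.out.println(Arrays.toString(scope)); // prints [2, 4] then [0, 6]
        }
        // "foo" is at position 1 and "bar" at position 5.
        System.out.println(anyScopeContains(scopes, 1, 5)); // prints true
    }
}
```

Unlike minimised intervals, the outer scope [0, 6] survives here, so the foo/bar pair is found. A real implementation would still have to follow the iteration contract of Lucene's interval iterators, which this sketch ignores.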

It’s a fun idea to think about, I’ll see if I can come up with something over the weekend :)
