Mailing List Archive

Re: [jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
I haven't looked at this at all, so please disregard if I'm off base, but
it sounds as if we are materializing power sets? Would it be better to just
store an array of the members and generate the combinations iteratively on
demand?

On Mon, May 31, 2021, 5:27 PM Haoyu Zhai (Jira) <jira@apache.org> wrote:

>
> [
> https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660
> ]
>
> Haoyu Zhai commented on LUCENE-9983:
> ------------------------------------
>
> I've added a simple static counter just for the adversarial test, and
> here's the stats:
> * {{incr}} called: 106073079
> * entry added to set: 100076079
> * {{decr}} called: 106069079
> * entry removed from set: 100072079
> * {{computeHash}} called: 40057
> * {{freeze}} called: 14056
>
> So seems to me my guess above holds, we're doing way more put/remove entry
> operations than others
>
> > Stop sorting determinize powersets unnecessarily
> > ------------------------------------------------
> >
> > Key: LUCENE-9983
> > URL: https://issues.apache.org/jira/browse/LUCENE-9983
> > Project: Lucene - Core
> > Issue Type: Improvement
> > Reporter: Michael McCandless
> > Priority: Major
> > Time Spent: 40m
> > Remaining Estimate: 0h
> >
> > Spinoff from LUCENE-9981.
> > Today, our {{Operations.determinize}} implementation builds powersets of
> all subsets of NFA states that "belong" in the same determinized state,
> using [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction
> ].
> > To hold each powerset, we use a malleable {{SortedIntSet}} and
> periodically freeze it to a {{FrozenIntSet}}, also sorted. We pay a high
> price to keep these growing maps of int key, int value sorted by key, e.g.
> upgrading to a {{TreeMap}} once the map is large enough (> 30 entries).
> > But I think sorting is entirely unnecessary here! Really all we need is
> the ability to add/delete keys from the map, and hashCode / equals (by key
> only – ignoring value!), and to freeze the map (a small optimization that
> we could skip initially). We only use these maps to lookup in the
> (growing) determinized automaton whether this powerset has already been
> seen.
> > Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals }}to only use keys (not values).
> > This change should be a big speedup for the kinds of (admittedly
> adversarial) regexps we saw on LUCENE-9981.
> >
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
>
>
Re: [jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily [ In reply to ]
Hi Mike,

Looking at the algorithm it's the opposite - it tries hard not to iterate
over all the combinations. The "power sets" come from combinations of
accessible sets of states using different NFA paths. This example is pretty
good:

https://en.wikipedia.org/wiki/Powerset_construction#Example

Dawid

On Tue, Jun 1, 2021 at 1:11 AM Michael Sokolov <msokolov@gmail.com> wrote:

> I haven't looked at this at all, so please disregard if I'm off base, but
> it sounds as if we are materializing power sets? Would it be better to just
> store an array of the members and generate the combinations iteratively on
> demand?
>
> On Mon, May 31, 2021, 5:27 PM Haoyu Zhai (Jira) <jira@apache.org> wrote:
>
>>
>> [
>> https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660
>> ]
>>
>> Haoyu Zhai commented on LUCENE-9983:
>> ------------------------------------
>>
>> I've added a simple static counter just for the adversarial test, and
>> here's the stats:
>> * {{incr}} called: 106073079
>> * entry added to set: 100076079
>> * {{decr}} called: 106069079
>> * entry removed from set: 100072079
>> * {{computeHash}} called: 40057
>> * {{freeze}} called: 14056
>>
>> So seems to me my guess above holds, we're doing way more put/remove
>> entry operations than others
>>
>> > Stop sorting determinize powersets unnecessarily
>> > ------------------------------------------------
>> >
>> > Key: LUCENE-9983
>> > URL: https://issues.apache.org/jira/browse/LUCENE-9983
>> > Project: Lucene - Core
>> > Issue Type: Improvement
>> > Reporter: Michael McCandless
>> > Priority: Major
>> > Time Spent: 40m
>> > Remaining Estimate: 0h
>> >
>> > Spinoff from LUCENE-9981.
>> > Today, our {{Operations.determinize}} implementation builds powersets
>> of all subsets of NFA states that "belong" in the same determinized state,
>> using [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction
>> ].
>> > To hold each powerset, we use a malleable {{SortedIntSet}} and
>> periodically freeze it to a {{FrozenIntSet}}, also sorted. We pay a high
>> price to keep these growing maps of int key, int value sorted by key, e.g.
>> upgrading to a {{TreeMap}} once the map is large enough (> 30 entries).
>> > But I think sorting is entirely unnecessary here! Really all we need
>> is the ability to add/delete keys from the map, and hashCode / equals (by
>> key only – ignoring value!), and to freeze the map (a small optimization
>> that we could skip initially). We only use these maps to lookup in the
>> (growing) determinized automaton whether this powerset has already been
>> seen.
>> > Maybe we could simply poach the {{IntIntScatterMap}} implementation
>> from [HPPC|https://github.com/carrotsearch/hppc]? And then change its
>> {{hashCode}}/{{equals }}to only use keys (not values).
>> > This change should be a big speedup for the kinds of (admittedly
>> adversarial) regexps we saw on LUCENE-9981.
>> >
>>
>>
>>
>> --
>> This message was sent by Atlassian Jira
>> (v8.3.4#803005)
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: issues-help@lucene.apache.org
>>
>>