Mailing List Archive

ComplexPhraseQueryParser performance question
Hi,

I hope everyone is doing great.

I have seen an issue with this class: a search for "term1*"
is fast (about 4 millisecs when term1 has >= 5 chars, and about 250
millisecs when it has only 2 chars),

but when you search for "term1 term2*", where term2 is a single
char, performance degrades badly.

The query "term1 term2*" slows down 50 times (~200 millisecs) compared
to the "term1*" case when term1 has > 5 chars and term2 is still 1 char.

The query "term1 term2*" slows down 400 times (~1500 millisecs) compared
to "term1*" when term1 has > 5 chars and term2 is still 1 char.

Are there any suggestions for speeding it up?

Best regards



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: ComplexPhraseQueryParser performance question
Please ignore the first comparison there. I was comparing {term1
with 2 chars} vs {term1 with >= 5 chars + term2 with 1 char}.


The slowdown is:

The query "term1 term2*" slows down 400 times (~1500 millisecs) compared
to "term1*" when term1 has >5 chars and term2 is still 1 char.

Best regards



Re: ComplexPhraseQueryParser performance question
How can this slowdown be resolved?
Is this another limitation of this class?
Thanks



Re: ComplexPhraseQueryParser performance question
It's slow per se, since it loads term positions. The usual advice is
shingling or edge n-grams. Note that if this is not free text but a string or enum
field, other tricks can probably be applied. Another idea: perhaps
IntervalQueries can be smarter and faster in certain cases, although they
are backed by the same slow positions.
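
For illustration, a minimal index-time edge n-gram analyzer could look like
the sketch below (Lucene 8.x API assumed; the class name, the 1..5 gram
range, and keeping the original token are just example choices to tune):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Indexes the prefixes (edge n-grams) of every token, so a short prefix
    // like "t" becomes a cheap exact term lookup instead of a wildcard that
    // expands over many terms and walks their positions.
    public class EdgeNGramIndexAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream result = new LowerCaseFilter(source);
            // 1..5 character prefixes; 'true' also keeps the original token.
            result = new EdgeNGramTokenFilter(result, 1, 5, true);
            return new TokenStreamComponents(source, result);
        }
    }

This would typically be applied to a separate field used only for the
prefixed term, so the main field and its scoring stay unchanged.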


--
Sincerely yours
Mikhail Khludnev
Re: ComplexPhraseQueryParser performance question
Thanks, but I thought this class would have a mechanism to fix this issue.
Thanks



Re: ComplexPhraseQueryParser performance question
Hi,

Regarding the mechanisms I mentioned below:

Does this class offer any shingling capability embedded in it?

I could not find any API within ComplexPhraseQueryParser for
that purpose.


For instance, does this class offer a "most commonly used words" API?

I could then use one of those words to take the second (and third) char
from it and search like

term1 term2FirstCharTerm2SecondChar* (where I would look up
term2FirstChar in my dictionary hashmap, get the most common word for that
letter, and bring its second char into the search query).


Having a second char in the search query reduces search time by about 20 times.
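
A minimal sketch of that lookup (the class name and the map are hypothetical;
the map would be built offline from the indexed data) could be:

    import java.util.Map;

    // Hypothetical helper: given "term1 t*", widen the one-char prefix "t" to a
    // two-char prefix using the most common indexed word that starts with 't'.
    public final class PrefixWidener {
        private final Map<Character, String> mostCommonWordByFirstChar; // built offline

        public PrefixWidener(Map<Character, String> mostCommonWordByFirstChar) {
            this.mostCommonWordByFirstChar = mostCommonWordByFirstChar;
        }

        public String widen(String firstTerm, char lastTermFirstChar) {
            String common = mostCommonWordByFirstChar.get(lastTermFirstChar);
            if (common == null || common.length() < 2) {
                return firstTerm + " " + lastTermFirstChar + "*"; // nothing to widen with
            }
            return firstTerm + " " + lastTermFirstChar + common.charAt(1) + "*";
        }
    }

Note that widening the prefix this way also narrows the set of matching terms,
so it trades some recall for speed.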


Otherwise, do I have to use the following at index time? I already have a
TextField index built with my custom analyzer.

How should I embed the shingle filter into my current custom analyzer?
I don't want to disturb my current indexing.

All I want to do is find the most common word in my data for each letter
of the alphabet.

Should I do this at search time? That would be costly, right?


From http://www.philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/:


If you need to parse the token n-grams of a string, you may use
the facilities offered by Lucene analyzers.

What you simply have to do is build your own analyzer using a
ShingleMatrixFilter with the parameters that suit your needs. For
instance, here are the few lines of code to build a token bi-grams analyzer:

    public class NGramAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new StopFilter(new LowerCaseFilter(
                new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
                StopAnalyzer.ENGLISH_STOP_WORDS);
        }
    }

The parameters of the ShingleMatrixFilter simply state the minimum
and maximum shingle size. "Shingle" is just another name for a
token n-gram, and shingles are popular as the basic units for solving
problems in spell checking, near-duplicate detection and others.
Note also the use of a StandardTokenizer to deal with basic special
characters like hyphens or other "disturbers".

To use the analyzer, you can for instance do:

    public static void main(String[] args) {
        try {
            String str = "An easy way to write an analyzer for tokens "
                + "bi-gram (or even tokens n-grams) with lucene";
            Analyzer analyzer = new NGramAnalyzer();

            TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
            Token token = new Token();
            while ((token = stream.next(token)) != null) {
                System.out.println(token.term());
            }
        } catch (IOException ie) {
            System.out.println("IO Error " + ie.getMessage());
        }
    }

The output will print:

    an easy
    easy way
    way to
    to write
    write an
    an analyzer
    analyzer for
    for tokens
    tokens bi
    bi gram
    gram or
    or even
    even tokens
    tokens n
    n grams
    grams with
    with lucene

Note that the text "bi-gram" was treated like two different tokens, as a
desired consequence of using a StandardTokenizer in the
ShingleMatrixFilter initialization.
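
ShingleMatrixFilter no longer exists in recent Lucene releases; the usual
building block now is ShingleFilter. A minimal sketch against a Lucene 8.x
API (the class name and shingle sizes are just examples) could be:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Emits word bigrams ("shingles") alongside the original tokens. Intended
    // for a dedicated shingle field so the existing field and its scoring are
    // left untouched.
    public class ShingleIndexAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream result = new LowerCaseFilter(source);
            ShingleFilter shingles = new ShingleFilter(result, 2, 2); // bigrams only
            shingles.setOutputUnigrams(true);                         // keep single tokens too
            return new TokenStreamComponents(source, shingles);
        }
    }

One way to avoid disturbing the existing index is to add the shingles as a
separate field (for example via PerFieldAnalyzerWrapper) and route
phrase-plus-wildcard queries to that field.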


Best regards


Re: ComplexPhraseQueryParser performance question
Hi,

See org.apache.lucene.search.PhraseWildcardQuery in Lucene's sandbox
module. It was recently added by my amazing colleague Bruno. At this time
there is, unfortunately, no query parser in Lucene that uses it, but you can
rectify this for your own purposes. I hope this query "graduates" to
Lucene core some day. Its placement in the sandbox is why it can't be added
to any of Lucene's query parsers like complex phrase.
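
For reference, building such a query directly might look roughly like the
sketch below (sandbox API around Lucene 8.5 assumed; the field name, the
expansion limit of 128, and the exact Builder methods should be checked
against the javadoc of the version in use):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseWildcardQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.BytesRef;

    public class PhraseWildcardExample {
        // Roughly the phrase "term1 t*": one exact term followed by one
        // multi-term (prefix) position, with a cap on prefix expansions.
        public static Query term1ThenPrefixT() {
            return new PhraseWildcardQuery.Builder("content", 128)
                .addTerm(new BytesRef("term1"))
                .addMultiTerm(new PrefixQuery(new Term("content", "t")))
                .build();
        }
    }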

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: ComplexPhraseQueryParser performance question
Thanks David, can I look at the source code?
I think ComplexPhraseQueryParser uses
something similar.
I will check the differences, but do you know them offhand for quick reference?
Thanks



Re: ComplexPhraseQueryParser performance question
org.apache.lucene.search.PhraseWildcardQuery
looks very good; I hope this makes it into a Lucene
release soon.
Thanks

Re: ComplexPhraseQueryParser performance question
Hello,

I picked the first two questions to reply to.


> Does this class offer any shingling capability embedded in it?
>
No, it doesn't allow expanding a wildcard phrase with shingles.


> I could not find any API within ComplexPhraseQueryParser for
> that purpose.
>

There is none.





--
Sincerely yours
Mikhail Khludnev
Re: ComplexPhraseQueryParser performance question
Thanks Mikhail.


