Mailing List Archive: QueryParser - proposed change may break existing queries.

QueryParser - proposed change may break existing queries.

markharwood at gmail

Sep 16, 2020, 2:04 AM

Post #1 of 12 (740 views)

In LUCENE-9445 we'd like to add a case-insensitive option to regex queries
in the query parser, of the form:
/Foo/i

However, today people can search for:

/foo.com/index.html

and not get an error. The searcher may think this is a query for a URL but
it's actually parsed as a regex "foo.com" ORed with a term query.
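
To make the split described above concrete, here is a minimal sketch. It is a simplified stand-in for the query parser's tokenization, not the actual JavaCC grammar: the class name, the `clauses` helper and the token pattern are all assumptions made for illustration, and field names, quoting and boosts are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurrentSplitSketch {
    // Assumed token shape: "/", then non-slash characters or escaped
    // characters, then a closing "/". Hypothetical, for illustration only.
    private static final Pattern REGEX_TOKEN =
            Pattern.compile("^/((?:[^/\\\\]|\\\\.)*)/");

    /** Splits a query into a leading regex clause (if any) plus plain terms. */
    static List<String> clauses(String query) {
        List<String> out = new ArrayList<>();
        String rest = query;
        Matcher m = REGEX_TOKEN.matcher(rest);
        if (m.find()) {
            out.add("regex:" + m.group(1));
            rest = rest.substring(m.end());
        }
        // Whatever follows the closing slash is treated as ordinary terms.
        for (String term : rest.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                out.add("term:" + term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The "URL" is silently read as a regex clause plus a term clause.
        System.out.println(clauses("/foo.com/index.html"));
    }
}
```

With the default OR operator, the two clauses produced here correspond to the "regex ORed with a term query" reading.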

I'd like to draw attention to this proposed change in behaviour because I
think it could affect many existing systems. Arguably it may be a positive,
in that it exposes a number of existing silent failures (unescaped
searches for URLs or file paths), but it could equally be seen as a negative,
breaking change by some.

What is our backward-compatibility (BWC) policy for changes to the query parser?
Do the benefits of the proposed new regex feature outweigh the costs of the
breakages in your view?

https://issues.apache.org/jira/browse/LUCENE-9445?focusedCommentId=17196793&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17196793

RE: QueryParser - proposed change may break existing queries. [ In reply to ]

uwe at thetaphi

Sep 16, 2020, 4:00 AM

Post #2 of 12 (740 views)

In my opinion, the proposed syntax change should require whitespace or some other separator character after the regex “i” flag.

Uwe

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: uwe@thetaphi.de

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

markharwood at gmail

Sep 16, 2020, 9:45 AM

Post #3 of 12 (740 views)

The strictness I was thinking of adding would make all of the following queries an error:
/foo/bar
/foo//bar/
/foo/iphone
/foo/AND x

These would be allowed:
/foo/i bar
(/foo/ OR /bar/)
(/foo/ OR /bar/i)
/foo/^2
/foo/i^2
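
The two lists above can be read as one rule: after the closing slash and the optional "i", the next character must be whitespace, ")", "^", or the end of the query. The sketch below checks that rule. It is only an illustration under that assumption, not the proposed patch; `StrictRegexSketch` and `isAccepted` are hypothetical names, and the check ignores quoted phrases and slashes that appear inside ordinary terms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictRegexSketch {
    // A regex token with an optional trailing "i" flag (assumed shape).
    private static final Pattern REGEX_WITH_FLAG =
            Pattern.compile("/(?:[^/\\\\]|\\\\.)*/i?");

    /** Returns false if any regex token is followed by an unexpected character. */
    static boolean isAccepted(String query) {
        Matcher m = REGEX_WITH_FLAG.matcher(query);
        while (m.find()) {
            if (m.end() == query.length()) {
                continue; // the regex token ends the query
            }
            char next = query.charAt(m.end());
            boolean separator =
                    Character.isWhitespace(next) || next == ')' || next == '^';
            if (!separator) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"/foo/bar", "/foo/iphone", "/foo/i bar", "/foo/i^2"};
        for (String q : samples) {
            System.out.println(q + " -> " + (isAccepted(q) ? "allowed" : "error"));
        }
    }
}
```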

RE: QueryParser - proposed change may break existing queries. [ In reply to ]

uwe at thetaphi

Sep 17, 2020, 5:55 AM

Post #4 of 12 (740 views)

Hi,

My idea would have been not to be too strict and instead only detect it as a regex if it’s separated. So /foo/bar and /foo/iphone would both go through, ignoring the regex; only ‘/foo/ bar’ or ‘/foo/i phone’ would interpret the first token as a regex.

That’s just my idea; I’m not sure if it makes sense to have this relaxed parsing. I was always very skeptical of adding the regexes, as it breaks many queries. Now it breaks even more.
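
As a rough illustration of this relaxed reading, the sketch below treats the start of a query as a regex only when the closing slash (plus optional "i") is followed by whitespace or the end of the input. The class and method names are hypothetical, and this is not how the query parser grammar is written; it only shows the decision rule.

```java
import java.util.regex.Pattern;

public class RelaxedRegexSketch {
    // Regex token, optional "i", then a lookahead for whitespace or end of input.
    private static final Pattern SEPARATED_REGEX =
            Pattern.compile("/(?:[^/\\\\]|\\\\.)*/i?(?=\\s|$)");

    /** True if the query starts with a regex token that is properly separated. */
    static boolean startsWithRegex(String query) {
        return SEPARATED_REGEX.matcher(query).lookingAt();
    }

    public static void main(String[] args) {
        String[] samples = {"/foo/bar", "/foo/iphone", "/foo/ bar", "/foo/i phone"};
        for (String q : samples) {
            System.out.println(q + " -> " + (startsWithRegex(q) ? "regex" : "plain term"));
        }
    }
}
```

Under this rule nothing is rejected; unseparated input simply falls back to being an ordinary term.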

Uwe

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

markharwood at gmail

Sep 17, 2020, 6:29 AM

Post #5 of 12 (740 views)

I think the decision comes down to choosing between silent
(mis)interpretations of ambiguous queries and noisy failures.

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

gus.heck at gmail

Sep 17, 2020, 10:55 AM

Post #6 of 12 (740 views)

And as I understand it, current behavior is the silent misinterpretation.
To me, the failure to require a space after the regex (and otherwise either
not treat the input as a regex or complain about an invalid regex) might be
considered a bug...

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

sarowe at gmail

Sep 17, 2020, 12:09 PM

Post #7 of 12 (740 views)

You could avoid (some of?) these problems by supporting /(?i)foo/ instead of /foo/i

--
Steve

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

uwe at thetaphi

Sep 17, 2020, 12:48 PM

Post #8 of 12 (740 views)

That's a much better idea, I like it. It's basically what Java's regex parser in the Pattern class also does.

If we do this we won't even need a syntax change.
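
For reference, this is how the embedded flag behaves in java.util.regex.Pattern. The example only demonstrates the JDK class; it says nothing about what Lucene's own RegExp class accepts, and the class and method names are just for illustration.

```java
import java.util.regex.Pattern;

public class InlineFlagDemo {
    /** Case-insensitive match requested inside the pattern string itself. */
    static boolean matchesInline(String input) {
        return Pattern.compile("(?i)foo").matcher(input).matches();
    }

    /** The same request made through the compile-time flag instead. */
    static boolean matchesWithFlag(String input) {
        return Pattern.compile("foo", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    /** No flag at all: matching is case-sensitive. */
    static boolean matchesPlain(String input) {
        return Pattern.compile("foo").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesInline("FOO"));   // true
        System.out.println(matchesWithFlag("FOO")); // true
        System.out.println(matchesPlain("FOO"));    // false
    }
}
```

Because the flag lives inside the slashes, nothing new can follow the closing "/", which is why the ambiguity with /foo/iphone does not arise.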

Uwe

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

dawid.weiss at gmail

Sep 17, 2020, 12:59 PM

Post #9 of 12 (740 views)

I like this idea. The only downside is that folks will tend to think
it's a full Java Pattern and try other options. :)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

hossman_lucene at fucit

Sep 17, 2020, 2:04 PM

Post #10 of 12 (740 views)

: And as I understand it, current behavior is the silent misinterpretation.
: To me, the failure to require a space after the regex (and either not
: become a regex in that case or complain about invalid regex) might be
: considered a bug...

I would agree ...

: >> However, today people can search for :
: >> /foo.com/index.html
: >> and not get an error. The searcher may think this is a query for a URL
: >> but it's actually parsed as a regex "foo.com" ORed with a term query.

... I didn't realize that was happening. To me that seems like it should
definitely be considered a bug, and the "regex" branch of the grammar
shouldn't be used if there are any unexpected characters after the closing
"/" ... the current behavior Mark is describing seems analogous to the
grammar assuming "WESS ANDERSON" should be parsed as "WESS +DERSON"

: > You could avoid (some of?) these problems by supporting /(?i)foo/
: > instead of /foo/i
:
: I like this idea. The only downside is that folks will tend to think
: it's a full Java Pattern and try other options. :)

If they try to use any other options than 'i', we throw a ParseException
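
A minimal sketch of that check follows. It uses java.text.ParseException purely as a stand-in for the query parser's own exception type, and the class, the helper methods and the error message are all hypothetical; it only looks at a flag group at the very start of the regex body.

```java
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineFlagCheckSketch {
    // A leading "(?letters)" group at the start of the regex body.
    private static final Pattern LEADING_FLAGS =
            Pattern.compile("^\\(\\?([a-zA-Z]+)\\)");

    /** True if the body starts with (?i); throws for any other inline option. */
    static boolean isCaseInsensitive(String regexBody) throws ParseException {
        Matcher m = LEADING_FLAGS.matcher(regexBody);
        if (!m.find()) {
            return false; // no inline options at all
        }
        String flags = m.group(1);
        if (!flags.equals("i")) {
            throw new ParseException(
                    "Unsupported regex option(s) '" + flags + "'; only 'i' is supported",
                    m.start(1));
        }
        return true;
    }

    /** Convenience wrapper: reports whether the body would be rejected. */
    static boolean rejects(String regexBody) {
        try {
            isCaseInsensitive(regexBody);
            return false;
        } catch (ParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("(?i)foo")); // false
        System.out.println(rejects("(?s)foo")); // true
    }
}
```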

-Hoss
http://www.lucidworks.com/

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

dawid.weiss at gmail

Sep 17, 2020, 11:46 PM

Post #11 of 12 (740 views)

> If they try to use any other options than 'i', we throw a ParseException

+1. Complex-syntax parsers should throw (human-palatable) exceptions
on syntax errors. A lenient, "naive user" query parser should be
separate and accept a very, very rudimentary query syntax (so that
there are literally no chances of making a syntax error).

D.

Re: QueryParser - proposed change may break existing queries. [ In reply to ]

markharwood at gmail

Sep 18, 2020, 3:05 AM

Post #12 of 12 (740 views)

> You could avoid (some of?) these problems by supporting /(?i)foo/ instead
> of /foo/i

That would avoid our parsing dilemma but brings some other concerns. This
inline syntax can normally be used to selectively turn on case insensitivity
for sections of a regex and then turn it off again with (?-i).
We could potentially implement this support in the underlying
o.a.l.util.automaton.RegExp class. We changed that class recently to take a
separate global flag alongside the regex string, which can determine case
sensitivity. I guess any inline (?i) syntax would override whatever default
option had been passed in the constructor flag. That might be a hairy change
though: the RegExp parser logic is hand-crafted rather than generated by
JavaCC.
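
The scoping and override behaviour mentioned here can be seen in java.util.regex.Pattern, shown below as a point of comparison. This demonstrates the JDK class only; whether Lucene's RegExp would adopt the same semantics is exactly the open question, and the class and method names are illustrative.

```java
import java.util.regex.Pattern;

public class ScopedFlagDemo {
    /** (?i) applies from where it appears until (?-i) switches it off again. */
    static boolean scoped(String input) {
        return Pattern.compile("(?i)foo(?-i)bar").matcher(input).matches();
    }

    /** An inline (?-i) overrides a case-insensitive default given at compile time. */
    static boolean inlineOverridesDefault(String input) {
        return Pattern.compile("(?-i)foo", Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .matches();
    }

    public static void main(String[] args) {
        System.out.println(scoped("FOObar"));              // true: "foo" part ignores case
        System.out.println(scoped("fooBAR"));              // false: "bar" part is case-sensitive
        System.out.println(inlineOverridesDefault("FOO")); // false: inline flag wins
    }
}
```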
