Mailing List Archive: Re: [External] Re: How to ignore certain words based on query specifics

Re: [External] Re: How to ignore certain words based on query specifics

Jul 9, 2019, 10:17 AM

Post #1 of 5 (1045 views)

Michael,
Thanks for your reply.

You are correct, the desired effect is to not match 'freedom ...'.
I hadn't considered the case where both free* and freedom match.

My solution 'free* and not freedom' would NOT match either of your examples.

I think what I really want is:
get every matching term from a matching document,
and if a term also matches an ignore word, ignore that match.
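
A rough sketch of that rule in plain Java, outside of Lucene: a document counts as a hit only if at least one of its terms matches the query pattern and is not an ignore word. The class and method names are hypothetical, the tokenization is a simple split on non-word characters rather than a Lucene Analyzer, and only `*` and `?` wildcards are handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class IgnoreAwareMatcher {

    // Hypothetical helper: '*' matches any run of characters, '?' exactly one.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Terms in the text that match the pattern and are not ignore words.
    static List<String> survivingMatches(String text, String wildcard, Set<String> ignoreWords) {
        Pattern p = wildcardToRegex(wildcard);
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(t -> !t.isEmpty())
                .filter(t -> p.matcher(t).matches())
                .filter(t -> !ignoreWords.contains(t))
                .collect(Collectors.toList());
    }

    // A document is a hit only if some matching term survives the ignore list.
    static boolean documentMatches(String text, String wildcard, Set<String> ignoreWords) {
        return !survivingMatches(text, wildcard, ignoreWords).isEmpty();
    }

    public static void main(String[] args) {
        Set<String> ignore = Set.of("freedom");
        // Only 'freedom' matches free*, and it is ignored: no hit.
        System.out.println(documentMatches("freedom from stupidity", "free*", ignore));
        // 'free' matches and is not ignored: hit.
        System.out.println(documentMatches("free freedom from stupidity", "free*", ignore));
    }
}
```

With this rule the two example documents behave as described: "freedom from stupidity" is rejected and "free freedom from stupidity" is accepted.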

I hadn't considered the stopwords approach, I'll look into that.
If I add all the ignore words as stop words, will that affect highlighting?
Are the stopwords still available for highlighting?

Thanks,
David Shifflett

On 7/9/19, 11:58 AM, "Michael Sokolov" <msokolov@gmail.com> wrote:

I think what you're saying in your example is that "free*" should
match anything with a term matching that pattern, but not *only*
freedom. In other words, if a document has "freedom from stupidity"
then it should not match, but if the document has "free freedom from
stupidity" then it should.

Is that correct?

You could apply stopwords, except that it sounds as if this is a
per-user blacklist, and you want them to share the same index?

On Tue, Jul 9, 2019 at 11:29 AM Shifflett, David [USA]
<Shifflett_David@bah.com> wrote:
>
> Sorry for the weird reply path, but I couldn’t find an easy reply method via the list archive.
>
> Anyway …
>
> The use case is as follows:
> Allow the user to specify queries such as ‘free*’
> and also include similar words to be ignored, such as freedom.
> Another example would be ‘secret*’ and secretary.
>
> I want to keep the ignore words separate so they apply to all queries,
> but then realized the ignore words should only apply to relevant (matching) queries.
>
> I don’t want the users to be required to add ‘and not WORD’ many times to each of the listed queries.
>
> David Shifflett
>
> From: Diego Ceccarelli
>
> Could you please describe the use case? Maybe there is an easier solution.
>
>
>
> From: "Shifflett, David [USA]" <Shifflett_David@bah.com>
> Date: Tuesday, July 9, 2019 at 8:02 AM
> To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> Subject: How to ignore certain words based on query specifics
>
> Hi all,
> I have a configuration file that lists multiple queries, of all different types,
> and that lists words to be ignored.
>
> Each of these lists is user configured, variable in length and content.
>
> I know that, in general, unless the ignore word is in the query it won’t match,
> but I need to be able to handle wildcard, fuzzy, and regex queries, which might match.
>
> What I need to be able to do is ignore the words in the ignore list,
> but only when they match terms the query would match.
>
> For example: if the query is ‘free*’ and ‘freedom’ should be ignored,
> I could modify the query to be ‘free*’ and not freedom.
>
> But if ‘liberty’ is also to be ignored, I don’t want to add ‘and not liberty’ to that query
> because that could produce false negatives for documents containing free and liberty.
>
> I think what I need to do is:
>   for each query
>     for each ignore word
>       if the query would match the ignore word,
>         add ‘and not ignore word’ to the query
>
> How can I test if a query would match an ignore word without putting the ignore words into an index
> and searching the index?
> This seems like overkill.
>
> To make matters worse, for a query like A and B and C,
> this won’t match an index of ignore words that contains C, but not A or B.
>
> Thanks in advance, for any suggestions or advice,
> David Shifflett
>
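
For a single wildcard term, the check asked about above can be done in memory without building an index of ignore words. The sketch below assumes the wildcard is translated to a `java.util.regex` pattern and tested against each ignore word. The class and method names are hypothetical, and fuzzy queries and multi-clause queries such as A and B and C are not covered. Note that, as the reply at the top of this message points out, the resulting NOT clause also excludes documents that contain both free and freedom.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class IgnoreWordExpander {

    // Hypothetical helper: '*' matches any run of characters, '?' exactly one.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Ignore words the wildcard term would match, checked without an index.
    static List<String> matchedIgnoreWords(String wildcard, List<String> ignoreWords) {
        Pattern p = wildcardToRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String word : ignoreWords) {
            if (p.matcher(word).matches()) {
                hits.add(word);
            }
        }
        return hits;
    }

    // Append a NOT clause only for the ignore words this query can match.
    static String rewrite(String wildcard, List<String> ignoreWords) {
        StringBuilder query = new StringBuilder(wildcard);
        for (String word : matchedIgnoreWords(wildcard, ignoreWords)) {
            query.append(" AND NOT ").append(word);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        List<String> ignore = List.of("freedom", "liberty", "secretary");
        System.out.println(rewrite("free*", ignore));   // free* AND NOT freedom
        System.out.println(rewrite("secret*", ignore)); // secret* AND NOT secretary
    }
}
```

Here 'liberty' is never added to the free* query, so documents containing free and liberty are not lost.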

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: [External] Re: How to ignore certain words based on query specifics

msokolov at gmail

Jul 10, 2019, 6:11 PM

Post #2 of 5 (1040 views)

I'm not as au courant with highlighters as I used to be. I think some of them
work using postings, and for those, no, you wouldn't be able to highlight
stop words. But maybe you can use the old default highlighter that would
reanalyze the document from a stored field, using an Analyzer that doesn't
remove stop words? Sorry I'm not sure if that exists any more, maybe
someone else will know.


Re: [External] Re: How to ignore certain words based on query specifics

Shifflett_David at bah

Jul 11, 2019, 5:38 AM

Post #3 of 5 (1040 views)

I just tested this with the search.highlight.Highlighter class.
Is this the 'old default highlighter'?

I phrased my question badly.
Of course the stop words shouldn't be highlighted,
as they wouldn't match any query.

My question was really, would the stop words be available for
inclusion in the highlight context (surrounding a match)?

The answer is yes the stop words do appear in the context,
and are not highlighted.
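
The observed behavior can be pictured with a small plain-Java sketch. This is not the Lucene Highlighter, only an illustration under simple assumptions: words are split with `\w+`, the query is a single regex term, and matches are wrapped in `<B>` tags. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextHighlightSketch {

    private static final Pattern WORD = Pattern.compile("\\w+");

    // Wrap query matches in <B> tags; stop words stay in the text untouched.
    static String highlight(String text, Pattern queryTerm, Set<String> stopWords) {
        Matcher m = WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy whatever sits between the previous word and this one.
            out.append(text, last, m.start());
            String token = m.group();
            String normalized = token.toLowerCase(Locale.ROOT);
            boolean hit = !stopWords.contains(normalized)
                    && queryTerm.matcher(normalized).matches();
            out.append(hit ? "<B>" + token + "</B>" : token);
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String result = highlight("free freedom from stupidity",
                Pattern.compile("free.*"), Set.of("freedom"));
        System.out.println(result); // <B>free</B> freedom from stupidity
    }
}
```

The stop word 'freedom' remains part of the surrounding context but is never wrapped in tags.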

Thanks,
David Shifflett


Re: [External] Re: How to ignore certain words based on query specifics

evert.wagenaar at gmail

Jul 11, 2019, 8:26 AM

Post #4 of 5 (1040 views)

I see it as a feature, not a bug. The appearance of stop words in the Search Summary makes it more clear what the Hit is about.
Not sure, but I think Google does the same in search summaries.

-Evert (http://www.ejwagenaar.com)

Re: [External] Re: How to ignore certain words based on query specifics

Shifflett_David at bah

Jul 11, 2019, 8:31 AM

Post #5 of 5 (1040 views)

Evert,
It is definitely not a bug.
I was asking how to do something I couldn't quite figure out.
Stop words are the way to go.

David Shifflett
