Mailing List Archive

Getting CirrusSearch full-text search to incorporate a new field?
Hi,

Suppose I have an extension that utilizes SearchIndexFieldsHook and
SearchDataForIndexHook to add a novel field, called 'more_file_text',
to the indexed data for pages. This new field is of INDEX_TYPE_TEXT.

Via those two hooks, I've managed to get data for this field stored in the
elasticsearch index. But... how do I get CirrusSearch to actually include
this field in its searches?

It would be a-ok if the 'more_file_text' could just be treated as additional
content for the 'file_text' field. (However, simply populating the existing
'file_text' field via the SearchDataForIndexHook does not work, because the
FileContentHandler::getDataForSearchIndex() method runs after the hook and
always forcefully overwrites the 'file_text' field.)

Will any existing CirrusSearch hooks or configuration parameters allow me
to achieve this?

(And/or, is there a better forum for me to ask this question?)

-mm

_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: Getting CirrusSearch full-text search to incorporate a new field?
Hello!

The Discovery list [1] might be a better place, but most of the people working
on Search should be on Wikitech as well. I've pinged a few people who might
be able to provide some guidance.

Have fun!

Guillaume



[1] https://lists.wikimedia.org/postorius/lists/discovery.lists.wikimedia.org/

On Mon, 25 Oct 2021 at 06:37, Matto Marjanovic <maddog@mir.com> wrote:

> Hi,
>
> Suppose I have an extension that utilizes SearchIndexFieldsHook and
> SearchDataForIndexHook to add a novel field, called 'more_file_text',
> to the indexed data for pages. This new field is of INDEX_TYPE_TEXT.
>
> Via those two hooks, I've managed to get data for this field stored in the
> elasticsearch index. But... how do I get CirrusSearch to actually include
> this field in its searches?
>
> It would be a-ok if the 'more_file_text' could just be treated as additional
> content for the 'file_text' field. (However, simply populating the existing
> 'file_text' field via the SearchDataForIndexHook does not work, because the
> FileContentHandler::getDataForSearchIndex() method runs after the hook and
> always forcefully overwrites the 'file_text' field.)
>
> Will any existing CirrusSearch hooks or configuration parameters allow me
> to achieve this?
>
> (And/or, is there a better forum for me to ask this question?)
>
> -mm


--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
Re: Getting CirrusSearch full-text search to incorporate a new field?
Hi Matto,

please see my answers inline.

On Mon, Oct 25, 2021 at 6:37 AM Matto Marjanovic <maddog@mir.com> wrote:

> [...]

> It would be a-ok if the 'more_file_text' could just be treated as additional
> content for the 'file_text' field. (However, simply populating the existing
> 'file_text' field via the SearchDataForIndexHook does not work, because the
> FileContentHandler::getDataForSearchIndex() method runs after the hook and
> always forcefully overwrites the 'file_text' field.)
>

This should be do-able by implementing the CirrusSearchBuildDocumentParse
hook which runs very late in the process (see cirrus doc under
docs/hooks.txt).
It could be only CirrusSearchBuildDocumentParse if you have the data at
hand when this hook runs or a combination of SearchDataForIndexHook to
populate a "more_file_text" field like you do +
CirrusSearchBuildDocumentParse to append this "more_file_text" to the
existing "file_text" and possibly empty the "more_file_text" field if you
no longer need it.
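
The append-and-clear step described above can be sketched in a few lines of Python. This is only a toy illustration on a plain dict standing in for the document being indexed, not CirrusSearch code: the function name `merge_extra_text` and the dict layout are assumptions made for this example, and only the field names 'file_text' and 'more_file_text' come from the thread.

```python
def merge_extra_text(doc, source="more_file_text", target="file_text",
                     clear_source=True):
    """Append doc[source] to doc[target]; optionally drop the source field.

    `doc` is a plain dict standing in for the indexed document (an
    assumption for this sketch, not the real document API).
    """
    extra = doc.get(source)
    if not extra:
        return doc  # nothing to append, leave the document untouched
    existing = doc.get(target) or ""
    # Join with a newline so the last word of one text and the first word
    # of the other are not fused into a single token.
    doc[target] = f"{existing}\n{extra}" if existing else extra
    if clear_source:
        # Drop the helper field once its content lives in the target field.
        doc.pop(source, None)
    return doc


doc = {"file_text": "text extracted from the PDF",
       "more_file_text": "text added by the extension"}
merge_extra_text(doc)
print(doc)
```

Whether to clear the helper field is a design choice: keeping it costs index space, while dropping it means the extra text can no longer be queried separately from 'file_text'.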

There are probably more ways to achieve what you want with greater control
of the ranking but this will probably be much more involved (i.e. writing
your own search query builder).

--
David Causse
Re: Getting CirrusSearch full-text search to incorporate a new field?
On 10/25/21 2:22 AM, David Causse wrote:
...
> On Mon, Oct 25, 2021 at 6:37 AM Matto Marjanovic <maddog@mir.com> wrote:
> ...
> > It would be a-ok if the 'more_file_text' could just be treated as additional
> > content for the 'file_text' field. (However, simply populating the existing
> > 'file_text' field via the SearchDataForIndexHook does not work, because the
> > FileContentHandler::getDataForSearchIndex() method runs after the hook and
> > always forcefully overwrites the 'file_text' field.)
>
> This should be do-able by implementing the CirrusSearchBuildDocumentParse
> hook which runs very late in the process (see cirrus doc under
> docs/hooks.txt).
> It could be only CirrusSearchBuildDocumentParse if you have the data at
> hand when this hook runs or a combination of SearchDataForIndexHook to
> populate a "more_file_text" field like you do +
> CirrusSearchBuildDocumentParse to append this "more_file_text" to the
> existing "file_text" and possibly empty the "more_file_text" field if you
> no longer need it.

I guess I should point out that I am working with MediaWiki 1.35 (and beyond)...

Alas, it seems that between 1.31 and 1.35, the CirrusSearchBuildDocumentParse
hook was removed, and then reinstated very *early* in the process. It is now
run in BuildDocument::initialize(), even before the SearchDataForIndexHook.
So, again, anything it does to the 'file_text' field will just get stomped on
by the FileContentHandler later on. (And, comments in BuildDocument say
"Use of this hook is deprecated ... restoring this hook is a temporary hack
for WikibaseMediaInfo", so I wouldn't want to depend on it moving forward.)
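
To make the ordering problem concrete, here is a small Python model of the stage order described above. It is a sketch under stated assumptions, not MediaWiki code: the three functions are stand-ins named after the hooks and handler mentioned in this thread, the placeholder strings are invented, and `run` is a hypothetical helper that records which stage last changed each field.

```python
def build_document_parse_hook(doc):
    # Stand-in for an extension's CirrusSearchBuildDocumentParse handler,
    # which tries to append its text to file_text.
    doc["file_text"] = doc.get("file_text", "") + " [extension text]"


def search_data_for_index_hook(doc):
    # Stand-in for an extension's SearchDataForIndexHook handler.
    doc["more_file_text"] = "[extension text]"


def file_content_handler(doc):
    # Stand-in for FileContentHandler: sets file_text unconditionally.
    doc["file_text"] = "[text extracted from file]"


def run(pipeline):
    """Apply stages in order; return the document and each field's last writer."""
    doc, last_writer = {}, {}
    for stage in pipeline:
        before = dict(doc)
        stage(doc)
        for key, value in doc.items():
            if before.get(key) != value:
                last_writer[key] = stage.__name__
    return doc, last_writer


# Order as described in the message above for MediaWiki 1.35.
order_1_35 = [build_document_parse_hook, search_data_for_index_hook,
              file_content_handler]
# Hypothetical order in which the parse hook really does run last.
order_late = [file_content_handler, search_data_for_index_hook,
              build_document_parse_hook]

doc_1_35, writers_1_35 = run(order_1_35)
doc_late, writers_late = run(order_late)
print(doc_1_35["file_text"])  # [text extracted from file]
print(doc_late["file_text"])  # [text extracted from file] [extension text]
```

In the first ordering the extension's text never reaches 'file_text', because the handler is the last writer; only the hypothetical late ordering preserves it.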

> There are probably more ways to achieve what you want with greater
> control of the ranking but this will probably be much more involved
> (i.e. writing your own search query builder).

If only the SearchDataForIndexHook were properly run late, this would be so
simple....

-mm

ps: There are a bunch of broken things in all this code:

 o SearchDataForIndexHook is run by getDataForSearchIndex() in each
   ContentHandler, but the design ensures that it is run at some
   ambiguous place in the middle of getDataForSearchIndex(), instead
   of resolutely at the end.

 o SearchIndexFieldsHook *is not* run by getFieldsForSearchIndex() of a
   ContentHandler. This means that CirrusSearch\BuildDocument never
   sees the definitions for any fields added by the hook. The only code
   in CirrusSearch that does see the definitions is MappingConfigBuilder
   (maintenance code).
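
The second point can be pictured with another toy Python model. Everything here is an assumption made for illustration: `handler_fields` stands in for what a ContentHandler reports, `mapping_fields` for what the mapping-building code sees once hook-added fields are merged in, and `fields_without_definition` is a hypothetical helper. None of these are real MediaWiki functions.

```python
# Field definitions contributed through the hook (name -> type).
HOOK_FIELDS = {"more_file_text": "text"}


def handler_fields():
    # Stand-in for the handler's own field list: the hook is not consulted.
    return {"file_text": "text"}


def mapping_fields():
    # Stand-in for the mapping-building view: handler fields plus hook fields.
    return {**handler_fields(), **HOOK_FIELDS}


def fields_without_definition(doc, definitions):
    """Return the document fields that have no entry in `definitions`."""
    return sorted(set(doc) - set(definitions))


doc = {"file_text": "[text extracted from file]",
       "more_file_text": "[extension text]"}

print(fields_without_definition(doc, handler_fields()))  # ['more_file_text']
print(fields_without_definition(doc, mapping_fields()))  # []
```

The document carries the extra field in both cases; what differs is whether the code looking at it has a definition for that field.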

I suppose I should figure out how to file a bug report somewhere.
Re: Getting CirrusSearch full-text search to incorporate a new field?
On 10/25/21 9:23 AM, Matto Marjanovic wrote:
...
> o SearchDataForIndexHook is run by getDataForSearchIndex() in each
> ContentHandler, but the design ensures that it is run at some
> ambiguous place in the middle of getDataForSearchIndex(), instead
> of resolutely at the end.
>
> o SearchIndexFieldsHook *is not* run by getFieldsForSearchIndex() of a
> ContentHandler. This means that CirrusSearch\BuildDocument never
> sees the definitions for any fields added by the hook. The only code
> in CirrusSearch that does see the definitions is MappingConfigBuilder
> (maintenance code).
>
> I suppose I should figure out how to file a bug report somewhere.

Follow-up for anyone following this thread: with thanks to all the
good documentation (and those who wrote/write it), I did indeed file a
bug report and a (first) patch.

https://phabricator.wikimedia.org/T294405

-mm