Mailing List Archive

Word embeddings / vector search
A partial outcome of the research in natural language processing
in the last decade is the representation of language as numeric
vectors, called word embeddings. These are used in large language
models such as BERT, ELMo, and (Chat)GPT. A peculiar aspect of
these numeric vectors is that they cluster semantically, so that
words for similar concepts (dog, puppy, pet) group together even
though their spelling is very different. This can be used for
"semantic" search. If a search query (dog) is converted to a vector,
it can be compared against the vectors of terms found in documents
(e.g. wiki articles), and those with similar vectors can be returned
as having similar content even though the text doesn't match.
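
To make the clustering idea concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented for illustration (real embeddings are learned from text and have hundreds of dimensions), and the names `EMBEDDINGS`, `cosine_similarity` and `nearest_terms` are my own, not part of any library.

```python
import math

# Made-up 3-dimensional vectors, for illustration only. Real embeddings
# are learned from large text corpora and have hundreds of dimensions.
EMBEDDINGS = {
    "dog": [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.90, 0.15],
    "pet": [0.80, 0.70, 0.30],
    "carburetor": [0.10, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_terms(query, k=2):
    """Return the k terms whose vectors are most similar to the query's."""
    q = EMBEDDINGS[query]
    scored = [
        (cosine_similarity(q, vector), term)
        for term, vector in EMBEDDINGS.items()
        if term != query
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


print(nearest_terms("dog"))
```

With these toy numbers, "puppy" and "pet" rank above "carburetor" for the query "dog", although none of the strings share a spelling.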

https://en.wikipedia.org/wiki/Word_embedding

Here are just two of very many videos that explain the concept:
https://www.youtube.com/watch?v=xzHhZh7F25I
https://www.youtube.com/watch?v=MUve9LiEAeI

Is there any ongoing work at WMF or around the MediaWiki software
to apply this new technique to search in Wikipedia?


--
Lars Aronsson (lars@aronsson.se, user:LA2)
Linköping, Sweden
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: Word embeddings / vector search
I'm curious what the actual question is. The basic concepts have been
studied for about 60 years, and have been in use for about 20 to 30 years.
One particular detail the industry apparently needs to re-learn every
time is how easily such vector spaces encode and reproduce any
existing bias, racism, phobia, and so on, and how hard it is to raise
awareness, let alone do something about it.

That said, Elasticsearch, which we currently run on Wikimedia
infrastructure in version 7.10.x, is already responding to the current
machine learning hype cycle: version 8.0 introduced approximate nearest
neighbor search.

https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
https://en.wikipedia.org/wiki/Special:Version
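
For orientation, a rough sketch of what such a setup might look like in Elasticsearch 8.x, written as Python dicts that serialize to the JSON request bodies. As far as I know, vectors are stored in a `dense_vector` field and queried with a `knn` clause. The field name `passage_vector`, the three-dimensional toy vector, and the values of `k` and `num_candidates` are made-up examples. The endpoint that accepts the `knn` clause differs between 8.x releases, so check the documentation for the version in use.

```python
import json

# Hypothetical field name and toy dimensionality; real embeddings have
# hundreds of dimensions.
FIELD = "passage_vector"
DIMS = 3

# Index mapping: a dense_vector field indexed for nearest neighbor search.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            FIELD: {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

# Search body: ask for the 10 nearest neighbors of a query vector,
# considering 100 candidates per shard.
knn_query = {
    "knn": {
        "field": FIELD,
        "query_vector": [0.9, 0.8, 0.1],
        "k": 10,
        "num_candidates": 100,
    },
    "_source": ["title"],
}

print(json.dumps(mapping, indent=2))
print(json.dumps(knn_query, indent=2))
```

This only builds the request bodies. Producing the query vector still needs an embedding model, which is outside Elasticsearch's scope in this sketch.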

We certainly need to update some day, but I think nobody is actively
working on this at the moment. However, the topic appears in the
currently discussed annual plan. The responsible Search Platform team
is also quite active and monitors a good selection of communication
channels, including a separate mailing list.

https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours

Kind regards
Thiemo
Re: Word embeddings / vector search
On 2023-05-09 09:27, Thiemo Kreuz wrote:
> I'm curious what the actual question is. The basic concepts are
> studied for about 60 years, and are in use for about 20 to 30 years.

Sorry to hear that you're so negative. It's quite obvious that this is not
currently used in Wikipedia, but is presented everywhere as a novelty
that has not been around for 20 or 30 years.

> https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
> https://en.wikipedia.org/wiki/Special:Version
> https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
> https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours


Thanks! This answers my question. It's particularly interesting to read
the talk page to the plan. Part of the problem is that "word embedding"
and "vector search" are not mentioned there, but a vector search could
have found the "ML-enabled natural language search" that is mentioned.
If and when this is tried, we will need to evaluate how well it works for
various languages.


--
Lars Aronsson (lars@aronsson.se, user:LA2)
Linköping, Sweden

Re: Word embeddings / vector search
I encourage you to reach out to the search team; they're lovely folks and
even better engineers.

Re: Word embeddings / vector search
+1 to the suggestion to connect with the Search team. Also a few more
thoughts about vector / natural-language search and its relevance to
Wikimedia from my perspective in Research:

- The common critique of lexical / keyword-based search and why folks
point to vector / embedding-based search is handling more natural-language
queries (e.g., "What are the different objectives of the United Nations
Sustainable Development Goals?" vs. "UN SDG"). The former has a lot of
words in it that lead to keyword overlap with less-relevant pages so
keyword-based search doesn't do as well. The latter is much more direct and
even matches an existing redirect on Wikipedia to the article on UN
Sustainable Development Goals, so our existing keyword-based search handles
it very well.
- Most existing users of Wikimedia's search are probably doing something
closer to the latter above -- i.e. using pretty exact keywords to navigate
to a specific page (or find it exists). This is backed up by the data: 80%
of searches on Wikipedia are auto-completed directly to article pages
<https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
In that sense, the system is working quite well! The Search team also has
added quite a bit of normalization into the pipeline (see
https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
for a fun overview). For the more complicated natural-language queries to
find relevant Wikipedia articles, my sense is that folks using natural
language searches are probably doing that within external search engines,
which have huge teams/infrastructure to support this, and then clicking
through to Wikipedia.
- That said, there are probably use-cases where natural-language search
would be more valuable. For example, within new interaction domains such as
chat-bots or for new editors / developers who don't yet know the exact
terminology to search for but want to do generic things like get access to
Toolforge or find out how to add a link to a page. I've been putting
together an example of this for Wikitech for the upcoming Hackathon
(details <https://phabricator.wikimedia.org/T333853>) and others have
proposed something similar, e.g., for Project pages to help editors find
answers to questions about editing (details
<https://phabricator.wikimedia.org/T335013>).
- Finally, there's a second, related aspect to this which is the size
and diversity of a given document. Within the Wikipedia article namespace,
documents are generally about a single, constrained topic. So the fact that
lexical search systems like Elasticsearch operate at the document-level is
a very good fit -- i.e. index all the keywords for a given article
together. When thinking about other namespaces like Project/Help pages or
Wikitech documentation, a single page can be much larger and be about far
more diverse topics. This presents further challenges to finding good
keyword-overlap because often the search would ideally find a very specific
paragraph in a much larger document about many other things. Vector search
doesn't directly solve this but in practice, folks tend to learn embeddings
for smaller passages than an entire doc -- e.g., sections or even
paragraphs within the section. For that reason alone, I suspect vector
search will do better for namespaces outside of the article namespace on
Wikipedia. Whether it's worth the cost is a separate question as it also
introduces substantial new challenges in keeping the embeddings up-to-date
:)
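
The passage-level point in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_embed` is only a stand-in (a bag-of-words count vector) for wherever a learned embedding model would go, the page title and text in `PAGES` are hypothetical, and all function names are my own. What it shows is the granularity: each paragraph gets its own vector, so a search can land on one specific passage of a long page.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # Stand-in for a learned embedding model: a bag-of-words count vector.
    # A real system would call a trained model here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def index_passages(pages):
    """Split each page on blank lines and embed every paragraph separately."""
    index = []
    for title, text in pages.items():
        for number, passage in enumerate(re.split(r"\n\s*\n", text.strip())):
            index.append((title, number, passage, toy_embed(passage)))
    return index


def best_passage(index, query):
    """Return (title, paragraph number, text) of the closest passage."""
    q = toy_embed(query)
    return max(index, key=lambda entry: cosine(q, entry[3]))[:3]


# Hypothetical documentation page covering several unrelated topics.
PAGES = {
    "Help:Getting started": (
        "Request a developer account before you do anything else.\n\n"
        "To get access to Toolforge, ask to join the project after your "
        "account exists.\n\n"
        "Report bugs in the issue tracker."
    ),
}

index = index_passages(PAGES)
print(best_passage(index, "get access to Toolforge"))
```

Every edit to a page means re-embedding the paragraphs that changed, which is the upkeep cost mentioned above.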

Hope that helps.

Best,
Isaac

On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <dandreescu@wikimedia.org>
wrote:

> I encourage you to reach out to the search team, they're lovely folks and
> even better engineers.



--
Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia
Foundation
Re: Word embeddings / vector search
On top of the ones mentioned, the
ORES topic detection model <https://github.com/wikimedia/drafttopic> (the
one that says which WikiProject an article belongs to; an example
<https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been
using word embeddings since 2018-ish.

HTH




--
Amir (he/him)
Re: Word embeddings / vector search
Hi Lars!

It's certainly not a new idea; I literally wrote my master's thesis on it
<https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-166373> (in German).

It's an interesting idea, but it's not easy to make it work well. There is
a lot of noise in the data.

Here's a presentation I gave at Wikimania 2009 on applying it to image search:

* https://commons.wikimedia.org/wiki/File:Wikimania2009-WikiWord-Paper.pdf
* https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikimania2009-WikiWord.pdf
* https://commons.wikimedia.org/wiki/File:200908280921-Daniel_Kinzler-WikiWord_Multilingual_image_search_and_more.ogv

On 09.05.2023 at 22:36, Amir Sarabadani wrote:
> On top of the ones mentioned,
> ores topic detection model <https://github.com/wikimedia/drafttopic>(the one
> that says what wikiproject an article belongs to, an example
> <https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been
> using word embedding since 2018-ish.
>
> HTH

--
Daniel Kinzler
Principal Software Engineer, Platform Engineering
Wikimedia Foundation
Re: Word embeddings / vector search
On 2023-05-09 22:09, Isaac Johnson wrote:

> +1 to the suggestion to connect with the Search team. Also a few more
> thoughts about vector / natural-language search and its relevance to
> Wikimedia from my perspective in Research:
>
> * The common critique of lexical / keyword-based search and why
> folks point to vector / embedding-based search is handling more
> natural-language queries (e.g., "What are the different objectives
> of the United Nations Sustainable Development Goals?" vs. "UN
> SDG"). The former has a lot of words in it that lead to keyword
> overlap with less-relevant pages so keyword-based search doesn't
> do as well. The latter is much more direct and even matches an
> existing redirect on Wikipedia to the article on UN Sustainable
> Development Goals, so our existing keyword-based search handles it
> very well.
> * Most existing users of Wikimedia's search are probably doing
> something closer to the latter above -- i.e. using pretty exact
> keywords to navigate to a specific page (or find it exists).
>

I disagree. The benefit we should expect from vector search is not the
ability to write questions with fuzzy grammar while still using exact
terminology, but instead to use fuzzy terminology. Today most users
search with exact terms, because that's the only thing our search
function can handle. You can only search for the terms that are used
in the articles. That's not any stranger than the observation that owners
of a Fortran compiler tend to write programs in Fortran, as those are
the only ones that will compile into running code. Most users would not
search for "sustainable development goals" because they are not familiar
with this exact UN terminology. Instead they might wonder how the UN
envisions the future for humanity. And if those exact words are not in
the relevant article, the current text-based search will yield nothing.
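
A toy illustration of that failure mode, assuming a search that requires every query term to appear in the article. The two one-line article texts in `ARTICLES` are hypothetical stand-ins, and `keyword_search` is my own simplification, not how the production search works.

```python
import re

# Hypothetical one-line stand-ins for article text.
ARTICLES = {
    "Sustainable Development Goals": (
        "The Sustainable Development Goals are 17 objectives adopted by "
        "the United Nations in 2015."
    ),
    "Fortran": (
        "Fortran is a compiled programming language suited to numeric "
        "computation."
    ),
}


def tokens(text):
    """Lowercase word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query):
    """Return titles of articles that contain every term of the query."""
    wanted = tokens(query)
    return [
        title
        for title, text in ARTICLES.items()
        if wanted <= tokens(title + " " + text)
    ]


print(keyword_search("sustainable development goals"))
print(keyword_search("UN vision for the future of humanity"))
```

The first query uses the article's own terminology and finds it; the second describes the same topic in other words and comes back empty.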

On Meta there's a list of mailing lists that mentions "wikimedia-search",
but that list seems to be dead and the archive is full of spam.
Another list exists, called "discovery", but not listed on Meta.
https://lists.wikimedia.org/hyperkitty/list/discovery@lists.wikimedia.org/


--
Lars Aronsson (lars@aronsson.se, user:LA2)
Linköping, Sweden

Re: Word embeddings / vector search
>
> On Meta there's a list of mailing lists that mentions "wikimedia-search",
> but that list seems to be dead and the archive is full of spam.
> Another list exists, called "discovery", but not listed on Meta.
> https://lists.wikimedia.org/hyperkitty/list/discovery@lists.wikimedia.org/


Indeed the Discovery
<https://lists.wikimedia.org/hyperkitty/list/discovery@lists.wikimedia.org/thread/HTEAZLNPUT3W3HTLAMP6B4AXKLXPSDSV/>
list is the one I've seen get Search-related traffic. I wanted to update
the relevant Meta page
<https://meta.wikimedia.org/wiki/Mailing_lists/Overview> but got very lost
when I saw the translate tag inside a row table template with an identifier
(397). The list that's listed under Search and Discovery hasn't seen
actual traffic from anyone on the search team in over 7 years. The
discovery list I linked here is the appropriate substitute, if anyone knows
how to edit that scary template thing.