Mailing List Archive: Search Engines, Chinese and Python.

The "how do you build a search engine in Python?" question has been
asked and answered enough times that I'll spare you all the agony. Given
the choice, I'd use Ultraseek*, WAIS or something rather than rebuild
this again myself, but I need this to work in Chinese. Foreseeing a
great demand for this (for myself and in general) and failing to find a
decent ready-made solution, I figured that I may as well have a stab at
it. (If all else fails, at least I may improve my Mandarin.)

Snipping from a thread last December:

[snip]
>Richard Jones <richard.jones@fulcrum.com.au> wrote:
>: The short answer is "you don't".
>
The big answer might be (this is not a Gadfly answer, but hey):

(This uses indexing on item submission to speed fetching.)

A pair of (g)dbm's. One stores your entries under some unique per-item
id.

Churn through each new item looking for words. "Stop and stem" this
list (i.e. kill "and", "at", "the"; standardise case; collapse "runner",
"running" -> "run" or whatever. Can be tricky :-) You can automate
the stop-list just by counting word occurrences.)
The HTML parser, rfc822 et al. could also be used to pull out
details for searches like "url:www.host.com".
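
A minimal sketch of the "stop and stem" step in present-day Python, for
illustration only. The stop-list, the suffix rules and the 0.8 threshold
are assumptions of mine, not anything from the thread, and the suffix
stripping is far cruder than a real stemmer.

```python
import re
from collections import Counter

# Hypothetical hand-picked stop-list (an assumption for this sketch).
STOP_WORDS = {"and", "at", "the", "a", "of", "in"}

# Crude suffix rules, longest first. Illustration only, not a real stemmer.
SUFFIXES = ("ning", "ner", "ing", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lower-case the text, drop stop words, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def automatic_stop_list(documents, threshold=0.8):
    """Return words that occur in more than `threshold` of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document.
        doc_freq.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    limit = threshold * len(documents)
    return {word for word, count in doc_freq.items() if count > limit}


sample_terms = terms("The runner is running at the park")
sample_stops = automatic_stop_list(
    ["the cat sat", "the dog ran", "the bird flew"]
)
```

Here "runner" and "running" both collapse to "run", and "the" is the only
word common enough to land on the automatic stop-list.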

The second file holds a relation between words and the documents that
contain them, i.e. an inverted list.
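
One way the pair of files might look with the standard dbm module, as a
sketch under assumptions: the names add_item and lookup, the JSON
encoding of the id lists and the whitespace tokeniser are all mine, not
from the quoted post.

```python
import dbm
import json
import os
import tempfile


def add_item(items_db, index_db, item_id, text):
    """Store the item under its id, then add the id to each word's list."""
    items_db[item_id.encode("utf-8")] = text.encode("utf-8")
    for word in set(text.lower().split()):  # naive whitespace tokeniser
        key = word.encode("utf-8")
        ids = json.loads(index_db[key]) if key in index_db else []
        if item_id not in ids:
            ids.append(item_id)
            index_db[key] = json.dumps(ids).encode("utf-8")


def lookup(index_db, word):
    """Return the list of item ids whose text contains the word."""
    key = word.lower().encode("utf-8")
    return json.loads(index_db[key]) if key in index_db else []


# Both files live in a throwaway directory for this demonstration.
workdir = tempfile.mkdtemp()
with dbm.open(os.path.join(workdir, "items"), "c") as items_db, \
        dbm.open(os.path.join(workdir, "index"), "c") as index_db:
    add_item(items_db, index_db, "msg1", "python programming tips")
    add_item(items_db, index_db, "msg2", "python search engine")
    python_ids = lookup(index_db, "python")
    engine_ids = lookup(index_db, "engine")
```

The indexing cost is paid once when an item is submitted, so a lookup is
a single key fetch from the second file.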

Query comes in: search for "python programming":
list1 = db2["python"]
list2 = db2["programming"]

The intersection of the lists is the set of documents that contain both
words. As things get big, you may need to overload __getitem__
to return a smaller list and store things like the number of times
the word appears in an item (you can then sort the inverted lists
on this attribute).
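
A small in-memory sketch of the intersection and of ranking by word
count. Plain dicts stand in for the dbm files here, and the names
build_index and search are hypothetical. Ranking by summed occurrences
is just one simple reading of "sort on this attribute".

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Map each word to {doc_id: number of times the word appears}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def search(index, query):
    """Return ids of documents containing every query word, best first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    # Documents present in every word's inverted list.
    matches = set(postings[0]).intersection(*postings[1:])
    # Most total occurrences first; ties are broken by id.
    return sorted(matches, key=lambda d: (-sum(p[d] for p in postings), d))


docs = {
    "a": "python programming python",
    "b": "python cooking",
    "c": "programming python programming python",
}
index = build_index(docs)
hits = search(index, "python programming")
```

Document "b" drops out because it lacks "programming", and "c" ranks
above "a" because the two words appear four times in it against three.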

Hope this helps.

(Start small!)
-- James Preston, waiting for Godot.
[/snip]

Is it really as easy as that?

It seems that the real work is in the indexing and this is going to be
even more of a chore with Chinese because words aren't separated by
spaces - so we'll also have to build a parsing engine to work that out :(
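
Two common ways round the missing spaces, sketched here as assumptions
rather than anything proposed in the thread: index overlapping character
pairs (bigrams), or greedily take the longest word found in a word list
(forward maximum matching). The tiny lexicon below is hypothetical; a
real one would hold many thousands of entries.

```python
def bigrams(text):
    """Split text into overlapping two-character tokens, ignoring spaces."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical word list, for illustration only.
LEXICON = {"喜欢", "搜索", "引擎", "搜索引擎"}

segmented = max_match("我喜欢搜索引擎", LEXICON)
pairs = bigrams("搜索引擎")
```

Bigrams need no dictionary but produce spurious tokens: "搜索引擎"
("search engine") also yields "索引" ("index"). Maximum matching avoids
that but is only as good as its word list.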

If anybody has worked with Chinese text and has any caveats with regard to the
above project or programming double-byte characters in general, I'm all ears.
I'm still struggling with getting my servers/scripts to write Chinese to
the screen of Chinese-Windows machines, let alone programming this into a
database.
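
A sketch of the double-byte pitfalls in present-day Python, where text is
Unicode and GB2312 or Big5 only matter at the edges. The helper names
cut_is_safe and render_page are hypothetical, and the CGI-style header
is an assumption about how the pages are served.

```python
text = "中文搜索"

# The same four characters become different bytes in each encoding.
gb_bytes = text.encode("gb2312")
big5_bytes = text.encode("big5")
sizes = (len(text), len(gb_bytes), len(big5_bytes))


def cut_is_safe(data, position, charset):
    """Report whether slicing the bytes at position keeps characters whole."""
    try:
        data[:position].decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def render_page(body, charset="gb2312"):
    """Build a CGI-style response that names the charset it is encoded in."""
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    return header.encode("ascii") + body.encode(charset)


page = render_page("中文")
```

Slicing or counting on the raw bytes splits characters in half, so it is
safer to decode on input, work on characters, and encode on output with
the charset stated for the browser.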

Thank you very much,

chas

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

I have a sneaky feeling that somebody on the Python list first mentioned
this URL a couple of years back, but I've just rediscovered this great
resource on CJK processing (I knew I bookmarked it for a reason):

http://www.ora.com/people/authors/lunde/cjk_inf.html

So I've answered the second half of my own question.
Now, maybe the Infoseek guys would be interested in porting their
engine to the most widely spoken language in the world ;-)

chas

*Just as an aside, it was due to Infoseek that I first looked at Python: I
always thought it was the best search engine, and I read that they used Python
just as I was despairing of another P-language... haven't looked back since. :)

sweeting@neuronet.com.my wrote:
> [full quote of the original message snipped]
