Mailing List Archive

KinoSearch & Similar/Duplicate Documents
Hello!

I love using KinoSearch. So far it's doing everything we need for our
project. I wonder if you could suggest a way to retrieve similar
documents and duplicates. We index a few web sites, and sometimes the
same documents are posted under different URLs. How would you solve this?

Another issue we have is not related to KinoSearch. We would like to
remove parts of a page that are shared across pages (say, a navigation
menu that appears on every page). Removing the content is quite easy,
but how would you detect which parts are repeated across pages? A diff
algorithm? What kind of approach would you suggest?

Thank you,
Vlad

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: KinoSearch & Similar/Duplicate Documents
On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote:

> I wonder if you could suggest a way to retrieve similar documents and
> duplicates. We index a few web sites, and sometimes the same documents
> are posted under different URLs. How would you solve this?

Off the top of my head, I don't know of an easy or reliable approach.
I'm sure that there is academic research out there on the subject.

Brainstorming...

This is a two-stage problem. The hard part is identifying candidates
which may be similar to each other. Once you have candidates, you can
roll through the seemingly matching docs and see what kind of
matching content is really there. Is it boilerplate template code
(e.g. nav menus) that ought to be discarded? Or is this truly
meaningful content which has been duplicated in multiple locations?

Say you were to build a pure vector space search engine, as described
at <http://www.perl.com/pub/a/2003/02/19/engine.html>. Then you
perform a search using the entire contents of one document as a
query. Documents with duplicate content will appear nearly on top of
each other in vector space.
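
To make the idea concrete, here is a toy illustration in plain Perl (not
KinoSearch code; the whitespace tokenizer and the 0.9 cutoff are just
placeholders) of the cosine comparison a vector space engine does under
the hood:

    use strict;
    use warnings;

    # Turn a document into a bag-of-words term-frequency vector.
    sub term_vector {
        my ($text) = @_;
        my %tf;
        $tf{ lc $_ }++ for $text =~ /(\w+)/g;
        return \%tf;
    }

    # Cosine similarity between two term-frequency vectors.
    sub cosine {
        my ($v1, $v2) = @_;
        my ($dot, $n1, $n2) = (0, 0, 0);
        $dot += $v1->{$_} * ($v2->{$_} || 0) for keys %$v1;
        $n1  += $_ ** 2 for values %$v1;
        $n2  += $_ ** 2 for values %$v2;
        return 0 unless $n1 && $n2;
        return $dot / (sqrt($n1) * sqrt($n2));
    }

    my ($doc_a, $doc_b) = ("full text of one page ...", "full text of another ...");
    my $sim = cosine(term_vector($doc_a), term_vector($doc_b));
    print "probable duplicate\n" if $sim > 0.9;    # threshold is a guess

In a real collection you would not compare every pair of documents; that
is exactly where the candidate-selection stage (or an LSA-style
decomposition) comes in.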

An uncompressed vector space search engine is not feasible for large
document collections; however, I suspect that a decomposed vector
engine a la LSA (latent semantic analysis) would do a good job at
picking candidates. An excellent introduction to LSA is available at
<http://www.knowledgesearch.org/lsi/cover_page.htm>. (I've started
collecting these links on a wiki page at
<http://www.rectangular.com/kinosearch/wiki/VectorSpaceModel>.)

The patent on Latent Semantic Analysis expires this year. It ought to
be possible to extend KinoSearch with a KSx::LSA distro, which would
include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on.

> Another issue we have is not related to KinoSearch. We would like to
> remove parts of a page that are shared across pages (say, a navigation
> menu that appears on every page). Removing the content is quite easy,
> but how would you detect which parts are repeated across pages? A diff
> algorithm? What kind of approach would you suggest?

I haven't studied this one in depth; from what I understand it's quite
a difficult problem. (I vaguely recall a discussion in some Lucene
forum where Andrzej Bialecki, one of Lucene's biggest contributors,
threw up his hands.) Especially annoying is template code which
varies subtly, making verification of suspected boilerplate a
challenging prospect. I can think of some vector-based techniques I
might try, but hunting down academic research on the topic is likely
to be more fruitful.
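
For example, one crude (non-vector) first pass, just a sketch off the top
of my head rather than a tested recipe, is to count how many pages each
word shingle appears on; shingles that show up on nearly every page are
almost certainly template text:

    use strict;
    use warnings;

    # Read each page's plain text from files named on the command line.
    my @pages = map {
        open my $fh, '<', $_ or die "$_: $!";
        local $/;
        scalar <$fh>;
    } @ARGV;

    # Count, for every 8-word shingle, how many distinct pages contain it.
    my %shingle_df;
    for my $page (@pages) {
        my @words = map { lc } $page =~ /(\w+)/g;
        my %seen_here;
        $seen_here{ join ' ', @words[ $_ .. $_ + 7 ] } = 1 for 0 .. $#words - 7;
        $shingle_df{$_}++ for keys %seen_here;
    }

    # Shingles found on 80% or more of the pages are suspected boilerplate.
    my $cutoff = 0.8 * @pages;
    print "$_\n" for grep { $shingle_df{$_} >= $cutoff } keys %shingle_df;

The shingle length and the 80% cutoff are arbitrary, and keeping every
shingle in memory gets expensive on a big crawl, but it does hand you a
list of suspects to verify against the actual pages.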

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: KinoSearch & Similar/Duplicate Documents
Vladimir Vlach wrote on 2/25/08 5:00 AM:
> Hello!
>
> I love using KinoSearch. So far it's doing everything we need for our
> project. I wonder if you could suggest a way to retrieve similar
> documents and duplicates. We index a few web sites, and sometimes the
> same documents are posted under different URLs. How would you solve this?
>

Duplicates can be identified simply by MD5-ing the doc content. That's what
Swish-e's spider.pl does.
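
Something along these lines (just a sketch, not spider.pl's actual code;
index_document() is a hypothetical stand-in for however you hand docs to
KS):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my %seen;    # MD5 digest => URL of the first copy we kept

    # Skip exact duplicates before they ever reach the indexer.
    sub maybe_index {
        my ($url, $content) = @_;
        my $digest = md5_hex($content);
        if (exists $seen{$digest}) {
            warn "$url duplicates $seen{$digest}, skipping\n";
            return;
        }
        $seen{$digest} = $url;
        index_document($url, $content);    # hypothetical hand-off to KS
    }

Note this only catches byte-for-byte duplicates; a one-character
difference in the template defeats it.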

Similarity is a much tougher nut. LSA is a decent approach (as Marvin
suggested). One Swish-e user tried this:

http://swish-e.org/archive/2005-02/8967.html

The key, IMO, is to avoid indexing duplicate and for-some-value-of-similar
documents in the first place. Implement these features at the document
aggregator level, before handing docs to KS.


> Another issue we have is not related to KinoSearch. We would like to
> remove parts of a page that are shared across pages (say, a navigation
> menu that appears on every page). Removing the content is quite easy,
> but how would you detect which parts are repeated across pages? A diff
> algorithm? What kind of approach would you suggest?

If you have control over the content, you might add <!-- noindex --> tags around
the stuff you want excluded, and then s/// that out before you pass to KS.
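
e.g. something like this (assuming you also add a matching
<!-- /noindex --> closing comment; use whatever marker pair suits your
templates):

    use strict;
    use warnings;

    my $html = do { local $/; <> };    # slurp the page from STDIN or a file arg

    # Strip everything between the noindex markers before the text goes to KS.
    $html =~ s/<!--\s*noindex\s*-->.*?<!--\s*\/noindex\s*-->//gis;

    print $html;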

If you don't have control, and the improvement is worth your time, consider
identifying some text patterns in your documents and just s/// those, as in the
example above.

--
Peter Karman . http://peknet.com/ . peter@peknet.com

Re: KinoSearch & Similar/Duplicate Documents
On 2/25/08, Vladimir Vlach <vladaman@gmail.com> wrote:
> Another issue we have is not related to KinoSearch. We would like to
> remove parts of a page that are shared across pages (say, a navigation
> menu that appears on every page). Removing the content is quite easy,
> but how would you detect which parts are repeated across pages? A diff
> algorithm? What kind of approach would you suggest?

I was recently talking with a friend about how to do this for indexing
a blog aggregator. In his case, a straight 'diff'-type algorithm wasn't
going to work very well due to rotating ads and page-specific
navigation. Peter's suggestions (custom regexps) make good sense if you
have control of the pages or are scraping a fixed set of sites.

Another approach would be to do the analysis at the DOM level rather
than the text level. There's an HTML::ContentExtractor module that
might be a good starting point for this:
<http://search.cpan.org/~jzhang/HTML-ContentExtractor/lib/HTML/ContentExtractor.pm>
It does DOM parsing, and makes simple statistical guesses about what
is real content and what is junk based on the percentage of text to
tags. With a better (or per-site-customized) algorithm for
classification, I think this has potential.
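
A quick-and-dirty version of that heuristic, using HTML::TreeBuilder
directly rather than HTML::ContentExtractor's own API (the element list
and the 25% cutoff here are arbitrary guesses):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = do { local $/; <> };    # slurp a page from STDIN or a file arg
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Walk the block-level containers and drop any whose visible text is a
    # small fraction of its markup; nav menus and link farms score very low.
    for my $node ($tree->look_down(sub { $_[0]->tag =~ /^(?:div|table|ul|ol)$/ })) {
        next unless $node->parent;     # already removed along with an ancestor
        my $text_len   = length $node->as_text;
        my $markup_len = length $node->as_HTML;
        next unless $markup_len;
        $node->delete if $text_len / $markup_len < 0.25;
    }

    print $tree->as_HTML;
    $tree->delete;    # free the tree's memory

HTML::ContentExtractor presumably does something smarter than this, but
even a crude ratio test kills most pure-link navigation blocks.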

For my friend, it looked like http://dapper.net might be useful as
well. Dapper is a web service that lets you create customized RSS feeds
of sites based on graphically entered parameters. It probably won't
work for your needs, but it might be worth checking out for ideas.

Good luck!

Nathan Kurz
nate@verse.com
