Mailing List Archive

How do you index ms office (.doc, .xls, .ppt) files with kinosearch
hi
I've read through most of the documentation trying to understand which
file types KS supports. There is an interesting OSCON presentation at
http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf, where
you can find this statement on page 13:

What is KinoSearch not?
...
- Not a file parser
...

So if I get this right, KinoSearch doesn't care about your .doc, .xls,
.ppt files. As much as I personally try to avoid these formats, I think
it's realistic to assume that you have to index such files when
creating something like an intranet search.

My question is, what would you suggest for indexing office formats?
How do you extract text without OLE and an office installation on
the server?

thanks in advance
ben

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch
On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
> My question is, what would you suggest for indexing office formats?
> How do you extract text without OLE and an office installation on
> the server?

You use file conversion utilities such as pdftotext, xlhtml, wvHtml, etc.
Most of these are far from perfect and sometimes crash.
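
A rough sketch of that approach, in Python for illustration only: pick a
converter by file extension, run it as a subprocess, and treat a crash, a
hang, or a missing binary as "no text" so one bad file does not stop the
indexing run. The function name extract_text, the CONVERTERS table, and
the 30 second timeout are hypothetical. The command lines are assumptions
to check against each tool's man page. ppthtml is not mentioned above; it
is the PowerPoint counterpart that ships with xlhtml. wvHtml is normally
given an output filename rather than writing to stdout, so it would need
a small temp-file wrapper that is not shown here.

```python
import subprocess
from pathlib import Path

# Extension -> command template; "{path}" is replaced by the input file.
# Assumed command lines: verify them against the installed tools.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" sends the text to stdout
    ".xls": ["xlhtml", "{path}"],          # writes HTML to stdout
    ".ppt": ["ppthtml", "{path}"],         # writes HTML to stdout
}


def extract_text(path, converters=CONVERTERS, timeout=30):
    """Return the converter's output for path, or None if the file type
    is unsupported or the conversion failed."""
    template = converters.get(Path(path).suffix.lower())
    if template is None:
        return None  # no converter registered for this extension
    cmd = [str(path) if arg == "{path}" else arg for arg in template]
    try:
        result = subprocess.run(
            cmd, capture_output=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status (e.g. a crash), or timeout.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Whatever comes back (plain text, or HTML that still needs its tags
stripped) is what you would then hand to the indexer as the document
body. Files that return None are best logged and skipped.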

Regards
Henry


Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch
On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing office formats?
>> How do you extract text without OLE and an office installation on
>> the server?
>
> You use file conversion utilities such as pdftotext, xlhtml, wvHtml, etc.
> Most of these are far from perfect and sometimes crash.
>

Also, check out SWISH::Filter on CPAN, which uses many of those tools underneath but
provides a common interface for converting documents to parseable text.
--
Peter Karman . peter@peknet.com . http://peknet.com/

