Mailing List Archive

Indexing local PDFs: Lucene/Solr/Nutch ?
Hello,
first of all, thanks for these great projects.
I discovered Lucene and its subprojects only a day ago, and they all seem amazing.

My goal:
--------
A file server with numerous folders containing documents (PDF, DOC, TXT, etc.)
that need to be indexed and made searchable via a web interface or similar.
The number of files might be from 500 000 to 1 000 000 or so.
Ideally the solution would be capable of handling a lot more than that,
in case of future growth.

My question:
------------
Which of the projects (Lucene, Solr, Nutch) will be most suitable in my case?

Thank you very much.

--
Veselin K
Re: Indexing local PDFs: Lucene/Solr/Nutch ?
The trunk of Solr with the new ExtractingRequestHandler (Tika) will
surely be the easiest way to get rolling. A simple script that
recurses your folders and issues a simple request posting each file in
turn to Solr will give you a full text searchable index in no time
(well, ok, it'll take a little time, but it'll be as fast as anything
else out there).
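
For illustration, here is a rough sketch of what such a script might look like; the /update/extract path, the literal.id and resource.name params, the "id" uniqueKey field and the /data/docs folder are all assumptions to check against your own solrconfig.xml and schema.xml:

#!/usr/bin/env python
# Rough sketch only: walk a folder tree and post each document's raw bytes
# to Solr's ExtractingRequestHandler for Tika to parse and index.
import os
import urllib.parse
import urllib.request

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"   # assumed endpoint
DOC_ROOT = "/data/docs"                                       # hypothetical folder
EXTENSIONS = (".pdf", ".doc", ".txt")

def post_file(path):
    params = urllib.parse.urlencode({
        "literal.id": path,      # use the file path as the unique id
        "resource.name": path,   # helps Tika detect the file type
    })
    with open(path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        SOLR_EXTRACT + "?" + params,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    urllib.request.urlopen(req).read()

for dirpath, _dirs, files in os.walk(DOC_ROOT):
    for name in files:
        if name.lower().endswith(EXTENSIONS):
            post_file(os.path.join(dirpath, name))

# one commit at the end rather than per document
urllib.request.urlopen("http://localhost:8983/solr/update?commit=true").read()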

Erik

On Dec 14, 2008, at 9:15 AM, Veselin Kantsev wrote:

Re: Indexing local PDFs: Lucene/Solr/Nutch ?
: the easiest way to get rolling. A simple script that recurses your folders
: and issues a simple request posting each file in turn to Solr will give you a
: full text searchable index in no time (well, ok, it'll take a little time, but
: it'll be as fast as anything else out there).

If all the files are "local" on the machine that Solr is running on, you
don't even need to POST them; Solr can be configured to read the files by
local filename using the "stream.file" param...

http://wiki.apache.org/solr/ContentStream
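
For illustration, a rough sketch of the same kind of extract request using "stream.file" instead of uploading the file body; remote streaming must be switched on (enableRemoteStreaming="true" on <requestParsers> in solrconfig.xml), and the /update/extract endpoint and the "id" field are the same assumptions as in the earlier sketch:

# Sketch only: ask Solr to read a file from its own local disk via the
# "stream.file" param instead of uploading the bytes in the request body.
import urllib.parse
import urllib.request

path = "/data/docs/report.pdf"   # hypothetical file on the Solr host
params = urllib.parse.urlencode({
    "stream.file": path,     # Solr opens this path itself
    "literal.id": path,      # supply the unique id explicitly
    "commit": "true",
})
urllib.request.urlopen(
    "http://localhost:8983/solr/update/extract?" + params
).read()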

That said: if your file server implementation already exposes all of the
files over HTTP, then using Nutch and its crawler might be an easier way
to get started on indexing all of them ... hard to say without being in
your shoes. You may want to experiment with both.



-Hoss
Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Thank you Erik, Hoss.

- If using either Solr's "stream.file" or Nutch's crawler,
what is the procedure for adding new files?
That is to say, if I did not know which files in a specific folder were
new and simply passed all files to Solr/Nutch, would it skip the ones
that have already been indexed?

- Also, what if a file gets modified? Would Solr/Nutch detect the change
and re-index just that modified file, or should some kind of cache be
cleared and everything re-indexed?

- In order to give the user the option to search the indexes of two
separate Solr/Nutch servers, do I need to link both servers somehow and
join their indexes into one, or is it just a question of designing the
web front-end so that it offers the choice of sending the search query
to one or more servers?


Thank you,
Veselin K


On Sun, Dec 14, 2008 at 11:22:00AM -0800, Chris Hostetter wrote:
Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Hello,
I am now using Solr 1.3 with Tomcat 6 on a Debian Lenny box.

Could you please point me to any other instructions/HowTos available online
on integrating Tika (or maybe RichDocumentHandler) with Solr, apart from the
Solr Wiki? Following those examples did not help in my case.


Thank you.

Veselin K.


On Wed, Dec 17, 2008 at 10:43:57AM +0000, Veselin K wrote:
Re: Indexing local PDFs: Lucene/Solr/Nutch ?
Can you provide details about the parts of the examples that weren't
clear? Perhaps I can clean up the docs or help you figure it out.

-Grant

On Dec 27, 2008, at 3:42 PM, Veselin Kantsev wrote:


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Indexing local PDF/Doc/XLS files with Solr?
Hello, I think the latest tarball worked for me out of the box.

I'm trying to design my Schema at present.
My goal is to index PDF/Doc/XLS files with the following fields:

0. ID number
1. Filename
2. File path
3. Modification date
4. File contents
5. Number of pages

- Any tips on what field types I should use to get this data indexed?

- Is there a way to have the ID number incremented automatically by Solr
each time a document is added to the index?

- Would I be able to extract the information above using just the
Solr/Tika features, or would I have to supply all values myself (except
"file contents") and pass them to Solr when indexing? (A rough sketch of
passing such values follows below.)


Thank you very much.

Regards,
Veselin K


On Sat, Dec 27, 2008 at 09:29:05PM -0500, Grant Ingersoll wrote:
Re: Indexing local PDF/Doc/XLS files with Solr?
: I'm trying to design my Schema at present.
: My goal is to index PDF/Doc/XLS files with the following fields:

I strongly suggest you ask these questions on the solr-user@lucene mailing
list. general@lucene is for general discussions about all Lucene
projects, or for questions from people interested in "search"-related
technologies who don't yet know which Lucene subproject(s) might be useful
to them.


-Hoss