Mailing List Archive

A feasibility question
Hello,

I apologize for taking your time. I am not trained in this area, but
someone suggested that this software could do what I need, and I would
like to enquire whether it can.

I need to match a series of titles (currently over 40k), each held in its
own cell of a worksheet, against the contents of rich documents (e.g. Word,
PDF). The search would need to be automated, since there are tens of
thousands of titles and numerous documents. The matching would be "fuzzy",
since there may be some variation in punctuation or a misused preposition.

The software would record the relevance of each match (e.g. as a
percentage score), along with the name of the document and the page number
where the match was found. This information would be saved in a format that
Excel can open. Since there are likely to be multiple matches in the same
document or across documents, each match for each title would have its own
row.

I would appreciate your assistance, and I look forward to your reply.

Cheers!
Re: A feasibility question
Chris,

Yes, Solr can help you with that. We did something similar with company
names.
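
For illustration only, here is a minimal sketch of how one title might be turned into a fuzzy Solr query. It relies on two operators of Solr's standard query parser: proximity search ("phrase"~N) and fuzzy terms (term~N). Everything else is an assumption made for the example: the field names content, doc_name and page, the idea that each page has been indexed as its own Solr document (Solr does not track page numbers by itself), and the helper names fuzzy_title_query and select_params.

```python
import re
from urllib.parse import urlencode


def fuzzy_title_query(title, field="content", max_edits=1, slop=2):
    """Build a standard-query-parser string for one title (hypothetical helper)."""
    # Keep only word characters, lowercased, so punctuation differences
    # ("Budget, 2014" vs "Budget 2014") produce the same query words.
    words = re.findall(r"\w+", title.lower())

    # Proximity phrase: all words must occur, but may be up to `slop`
    # position moves apart (e.g. one extra word inserted in the text).
    phrase = '{}:"{}"~{}'.format(field, " ".join(words), slop)

    # Individual terms, combined with the parser's default OR operator, so a
    # page can still match (with a lower score) when one word differs, such
    # as a preposition. Longer alphabetic words also get a fuzzy suffix that
    # allows up to `max_edits` character edits.
    terms = []
    for w in words:
        if len(w) > 3 and w.isalpha():
            terms.append("{}:{}~{}".format(field, w, max_edits))
        else:
            terms.append("{}:{}".format(field, w))

    return "{} OR ({})".format(phrase, " ".join(terms))


def select_params(title, rows=50):
    """URL-encode request parameters for a Solr /select handler."""
    return urlencode({
        "q": fuzzy_title_query(title),
        # `score` is Solr's relevance pseudo-field; doc_name and page are
        # assumed field names in a hypothetical schema.
        "fl": "doc_name,page,score",
        "rows": rows,
        "wt": "json",
    })
```

The string returned by select_params would be appended to the /select URL of whichever core holds the pages. One query per title could then be issued in a loop over the worksheet rows.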

If you are just getting started, this training course can help you
understand how Solr works:
http://www.pluralsight.com/courses/table-of-contents/enterprise-search-using-apache-solr
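
The output described in the original message (one row per match, in a file Excel can open) could be produced from Solr's JSON responses along the following lines. This is a sketch under assumptions, not a definitive implementation. The doc_name and page fields are hypothetical and presume that each page was indexed as a separate Solr document. Solr scores are not percentages, so the "percentage" here is simply each score relative to the best match for that title (Solr reports this as maxScore when the score is requested), which is only comparable within a single query.

```python
import csv


def matches_to_rows(title, solr_response):
    """Turn one Solr JSON response (already parsed to a dict) into CSV rows."""
    resp = solr_response["response"]
    docs = resp["docs"]
    # Use Solr's maxScore if present; otherwise fall back to the best score seen.
    top = resp.get("maxScore") or max((d["score"] for d in docs), default=0.0)

    rows = []
    for d in docs:
        # Score relative to the best hit for this title, as a percentage.
        pct = round(100.0 * d["score"] / top, 1) if top else 0.0
        rows.append([title, d["doc_name"], d["page"], pct])
    return rows


def write_csv(rows, out):
    """Write a header plus one line per match to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["title", "document", "page", "relative_score_pct"])
    writer.writerows(rows)
```

Passing a file opened with open("matches.csv", "w", newline="") as the out argument gives a CSV file that Excel opens directly. Rows from all titles can be collected into one list before writing.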


On Fri, Nov 7, 2014 at 10:36 AM, Chris Manu <chrismanu90@hotmail.com> wrote:

> [...]

--

*Xavier Morera*

RE: A feasibility question
Thank you for the reply. I will use the recommended site.

Cheers!

> From: xavier@familiamorera.com
> Date: Fri, 7 Nov 2014 10:43:27 -0600
> Subject: Re: A feasibility question
> To: general@lucene.apache.org; chrismanu90@hotmail.com
>
> [...]