Mailing List Archive: advice on integrating NLP engine during indexing

advice on integrating NLP engine during indexing

Dec 20, 2007, 6:48 AM

Post #1 of 5 (2033 views)

Greetings all. I am new to Lucene and am looking for a little
advice/direction/feedback on what I am trying to do. I want to index and
query millions of documents that are unstructured and resemble
crime/police/psychiatric reports; no problem, Lucene is perfect for this.

The trick is that I need to exclude certain terms from the index such as
those terms that are negated or information that could potentially identify
people. I have a collection of natural language processing tools that are
able to tag or remove/replace such terms.

I need to design the indexing such that I can feed each document through
these tools and then incorporate the results into the indexing strategy.

As an example, if I have a report that has the phrase: "Mr. Smith has no
history of violence against women prior to this event"

The NLP engine would recognize the name Smith and the negation of the term
"violence" and would tag them as such. I would then like to exclude those
terms from the indexing as seems prudent.
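
A minimal sketch of that exclusion step, assuming the NLP engine hands back
(token, tag) pairs. The tag names NAME and NEGATED, the function
indexable_tokens, and the hand-written tagging below are all made up for
illustration; they are not the output of any particular tool.

```python
# Hypothetical tags an NLP engine might attach to a token.
EXCLUDED_TAGS = {"NAME", "NEGATED"}


def indexable_tokens(tagged_tokens):
    """Return lower-cased tokens whose tag is not in EXCLUDED_TAGS."""
    return [tok.lower() for tok, tag in tagged_tokens if tag not in EXCLUDED_TAGS]


# Hand-written stand-in for the NLP engine's output on the example sentence.
tagged = [
    ("Mr.", None), ("Smith", "NAME"), ("has", None), ("no", None),
    ("history", None), ("of", None), ("violence", "NEGATED"),
    ("against", None), ("women", None), ("prior", None), ("to", None),
    ("this", None), ("event", None),
]

kept = indexable_tokens(tagged)
print(kept)  # neither "smith" nor "violence" is handed to the indexer
```

Only the surviving tokens would then be passed on to the indexer.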

Another strategy I would like to look at is to include the tags in the index
and incorporate them into the search, so that a query can distinguish whether
a subject "likely" has a history of violence, "may" have a history of
violence, or "does not" have a history of violence.

I assume that I will need to design a custom analyzer to do this, but I was
hoping to solicit any comments, advice, or general suggestions before I get
started.
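
For the second strategy (keeping the tags and searching on them), the rough
idea could look like the sketch below. It is a toy in-memory index, not
Lucene; the class TinyIndex, the field name violence_history and the values
likely / may / does_not are assumptions chosen to mirror the example above.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory index mapping (field, term) to a set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        # fields maps a field name to the list of terms stored under it.
        for field, terms in fields.items():
            for term in terms:
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        # .get avoids creating empty postings for unknown terms.
        return sorted(self.postings.get((field, term), set()))


index = TinyIndex()
index.add(1, {"body": ["history", "event"], "violence_history": ["does_not"]})
index.add(2, {"body": ["history", "assault"], "violence_history": ["likely"]})
index.add(3, {"body": ["report", "event"], "violence_history": ["may"]})

print(index.search("violence_history", "likely"))
```

Keeping the assertion status in its own field leaves the body text free of
the negated term while still letting a query filter on the status.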

Thanks in advance,

j

--
View this message in context: http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: advice on integrating NLP engine during indexing

gsingers at apache

Dec 20, 2007, 6:55 AM

Post #2 of 5 (1946 views)

FYI: you will get a broader audience on java-user; this list is mostly
for discussion of higher-level Lucene things that affect two or more
of the Lucene projects.

That being said, a custom analyzer is the way to go to redact the
appropriate information. If you have your files in some sort of
markup, you can easily create fields to contain the various metadata
that you have generated (e.g. history of violence). One new thing
that I have been intrigued with for use in NLP applications is the new
TeeTokenFilter and SinkTokenizer, which can be used to siphon off
interesting tokens for other fields based on the tokens of an existing
field. This can save on the need to reanalyze content over and over
for different analysis needs. This is, however, advanced usage for
now (although I hope it will become more common).
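
To make the tee/sink idea concrete, here is a language-neutral sketch in
Python. It is not the Lucene TeeTokenFilter/SinkTokenizer API; the names
tee_to_sink and looks_like_year are invented for illustration. It only shows
the shape of the technique: one pass over the tokens of a field, with the
interesting ones copied aside for a second field.

```python
def tee_to_sink(tokens, is_interesting, sink):
    """Yield every token unchanged, copying interesting ones into `sink`."""
    for token in tokens:
        if is_interesting(token):
            sink.append(token)
        yield token


def looks_like_year(token):
    # Hypothetical notion of "interesting": a four-digit number.
    return token.isdigit() and len(token) == 4


text = "incident reported in 2005 and reviewed in 2007"
year_sink = []

# The main field still sees every token; the sink fills up as a side effect.
body_tokens = list(tee_to_sink(text.split(), looks_like_year, year_sink))

print(body_tokens)
print(year_sink)
```

The content is tokenized once, and the sink can then feed a second field
without analyzing the text again.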

Cheers
Grant

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

RE: advice on integrating NLP engine during indexing

james at ryley

Dec 20, 2007, 8:08 AM

Post #3 of 5 (1947 views)

Hi,

I can't answer your question -- sorry! But, I was curious about the NLP you
describe. Are there algorithms available for determining negation
automatically, and are they accurate?

Sincerely,
James

RE: advice on integrating NLP engine during indexing

iragoldstein at usa

Dec 20, 2007, 8:43 AM

Post #4 of 5 (1966 views)

James --
Hi. Various people have used RegEx to do negation detection.
http://www.albany.edu/~ou372553/Tawanda-Meng.pdf includes a description of
one implementation that does negation detection.
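
A very small sketch of what regex-based negation detection can look like.
This is not the implementation from the paper linked above; the cue list, the
function is_negated and the max_gap parameter are assumptions picked for
illustration.

```python
import re

# Illustrative cue list; a real system would need a much larger one.
NEGATION_CUES = r"(?:no|not|without|denies|never)"


def is_negated(term, sentence, max_gap=3):
    """True if a negation cue precedes `term` with at most `max_gap` words between."""
    pattern = re.compile(
        rf"\b{NEGATION_CUES}\b(?:\s+\w+){{0,{max_gap}}}?\s+{re.escape(term)}\b",
        re.IGNORECASE,
    )
    return pattern.search(sentence) is not None


sentence = "Mr. Smith has no history of violence against women prior to this event"
print(is_negated("violence", sentence))  # cue "no" is two words before the term
print(is_negated("event", sentence))     # too far from any cue
```

The max_gap limit keeps a cue from negating everything that follows it in a
long sentence.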
--Ira

RE: advice on integrating NLP engine during indexing

jd_cowan at yahoo

Dec 20, 2007, 11:26 AM

Post #5 of 5 (1951 views)

Hi James. Ira's link is a good starting point. There is another algorithm
called NegEx used in parsing medical texts that was published out of the
University of Pittsburgh. You can find a high level description here:
http://healthinformatics.wikispaces.com/NegEx+Algorithm
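
To give a feel for the trigger-and-scope style of approach, here is a sketch
loosely in the spirit of NegEx. It is not the published algorithm: the
trigger list, the termination terms and the window of five tokens are
assumptions for illustration only.

```python
# Illustrative lists, not the published NegEx phrase lists.
PRE_NEGATION_TRIGGERS = {"no", "denies", "without", "never"}
TERMINATION_TERMS = {"but", "however", "although"}
WINDOW = 5  # how many tokens after a trigger stay inside its scope


def negated_tokens(sentence):
    """Return the set of lower-cased tokens that fall inside a negation scope."""
    tokens = [t.strip(".,;:").lower() for t in sentence.split()]
    negated = set()
    remaining = 0  # tokens left in the current negation scope
    for token in tokens:
        if token in TERMINATION_TERMS:
            remaining = 0          # a termination term closes the scope early
        elif token in PRE_NEGATION_TRIGGERS:
            remaining = WINDOW     # a trigger opens (or restarts) the scope
        elif remaining > 0:
            negated.add(token)
            remaining -= 1
    return negated


clinical = negated_tokens("Patient denies chest pain but reports nausea")
report = negated_tokens(
    "Mr. Smith has no history of violence against women prior to this event"
)
print(clinical)
print(report)
```

The scope also sweeps up function words such as "of", so in practice only
terms from the domain vocabulary would be checked against the result.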

Although much of the research in the field is being done in medical
informatics, the general principles are really universal as long as you have
a good understanding of the domain vocabulary. You could probably search
PubMed for current literature on the subject.

As to the question of accuracy, I have found that most of the published
results are based on a "best case scenario" and that any method will need to
be tweaked for a particular problem to get the best results. You will
probably never find a method that is perfectly accurate, even a human-based
one. My philosophy when evaluating these algorithms is "Don't let the perfect
be the enemy of the good".
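
One way to check a method against your own data rather than the published
numbers is to hand-label a sample and compute precision and recall on it. A
minimal sketch follows; the function name and the toy label sets are made up
for illustration and are not results from any real evaluation.

```python
def precision_recall(predicted, gold):
    """Compare predicted negated terms against hand-labeled ones."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy labels only: terms a reviewer marked as negated vs. what a method found.
gold = {"violence", "assault", "theft", "arson"}
predicted = {"violence", "assault", "fraud"}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```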

j
