Greetings all. I am new to Lucene and am looking for a little
advice/direction/feedback on what I am trying to do. I want to index and
query millions of unstructured documents that resemble
crime/police/psychiatric reports; no problem, Lucene is perfect for this.
The trick is that I need to exclude certain terms from the index, such as
terms that are negated or information that could potentially identify
people. I have a collection of natural language processing tools that are
able to tag or remove/replace such terms.
I need to design the indexing so that I can feed each document through
these tools and then incorporate the results into the indexing strategy.
As an example, suppose a report contains the phrase: "Mr. Smith has no
history of violence against women prior to this event"
The NLP engine would recognize the name "Smith" and the negation of the term
"violence" and would tag them as such. I would then like to exclude those
terms from the index, as seems prudent.
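
To make the example concrete, here is a rough sketch of the filtering step I have in mind. It is plain Java and uses no Lucene classes. The TaggedToken type and the tag names "NAME", "NEGATED", and "O" are placeholders I made up for illustration; the real NLP tools will emit something different, and where this step should live inside Lucene is exactly what I am asking about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical shape of one token as emitted by the NLP tools: the surface
// text plus a tag. The tag vocabulary is an assumption for illustration.
final class TaggedToken {
    final String text;
    final String tag; // e.g. "NAME", "NEGATED", or "O" for untagged

    TaggedToken(String text, String tag) {
        this.text = text;
        this.tag = tag;
    }
}

final class TagExclusionSketch {
    // Tags whose tokens should never reach the index (assumed names).
    static final Set<String> EXCLUDED_TAGS = Set.of("NAME", "NEGATED");

    // Returns the lower-cased text of every token whose tag is not excluded,
    // preserving the original token order.
    static List<String> indexableTerms(List<TaggedToken> tokens) {
        List<String> terms = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (!EXCLUDED_TAGS.contains(token.tag)) {
                terms.add(token.text.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

For the phrase above, if "Smith" is tagged NAME and "violence" is tagged NEGATED, only the remaining words would be handed on for indexing.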
Another strategy I would like to look at is to include the tags in the index
and incorporate them into the search engine, so that a query can distinguish
whether a subject "likely" has a history of violence, "may" have a history of
violence, or "does not" have a history of violence.
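
One way I could imagine representing this, sketched in plain Java with no Lucene classes, is to fold the concept and its certainty into a single synthetic term. The Certainty enum, the encode helper, and the underscore format are all assumptions of mine for illustration, not anything Lucene provides.

```java
import java.util.Locale;

// Assumed certainty levels, mirroring "likely", "may", and "does not" above.
enum Certainty { LIKELY, POSSIBLE, NEGATED }

final class CertaintyTermSketch {
    // Builds one synthetic term such as "violence_negated" from a concept
    // and the certainty the NLP engine assigned to it.
    static String encode(String concept, Certainty certainty) {
        return concept.toLowerCase(Locale.ROOT) + "_"
                + certainty.name().toLowerCase(Locale.ROOT);
    }
}
```

The idea would be to index such terms in a field of their own, so that a search for "violence_likely" does not match a report that only denies violence.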
I assume that I will need to design a custom analyzer to do this, but I was
hoping to solicit any comments, advice, or general suggestions before I get
started.
Thanks in advance,
j
--
View this message in context: http://www.nabble.com/advice-on-integrating-NLP-engine-during-indexing-tp14437913p14437913.html
Sent from the Lucene - General mailing list archive at Nabble.com.