Mailing List Archive

Full-Text Search in a Relational Model
Hi,

(Warning, not for the weak-hearted)

I'm currently working on a project where we have a large and complex data
model, related to Genomics. We are trying to build a search engine that
provides "full text" and "field-based text" searches for our customer base
(mostly academic research), and are evaluating different tools for this
purpose.

As a starting point, here is an example set of objects (stored in tables
as a relational model):
Gene [ID, Symbol, Description]
Article - M:M with Gene [ID, Title]
Disease - M:M with Gene [ID, Name]
Author - M:M with Article [ID, Name]
(Note: the M:M link tables exist and contain only the linked IDs.)
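
For readers who want something concrete, the layout above could be sketched
roughly as follows. This is only an illustration using Python's built-in
sqlite3 module; the table and column names (gene_article, article_author,
and so on) are assumptions, not the actual schema of the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene    (id INTEGER PRIMARY KEY, symbol TEXT, description TEXT);
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT);
-- M:M link tables hold only pairs of IDs (names are hypothetical).
CREATE TABLE gene_article   (gene_id INTEGER, article_id INTEGER);
CREATE TABLE gene_disease   (gene_id INTEGER, disease_id INTEGER);
CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);
""")

conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    (1, "EGFR", "epidermal growth factor receptor"),
    (2, "AHCY", "S-adenosylhomocysteine hydrolase"),
])
conn.executemany("INSERT INTO article VALUES (?, ?)", [
    (1, "EGFR mutations in lung cancer: correlation with clinical "
        "response to gefitinib therapy"),
    (2, "Proteomics analysis of epidermal protein kinases by target "
        "class-selective prefractionation and tandem mass spectrometry"),
    (3, "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
        "implications for the three-dimensional structure"),
])
conn.execute("INSERT INTO disease VALUES (1, 'Epidermal sluffing')")
conn.executemany("INSERT INTO author VALUES (?, ?)", [
    (1, "H. Michaelson"), (2, "J. Watson"), (3, "M. Roberts"),
    (4, "B. Cohen"), (5, "L. Alexander"),
])
conn.executemany("INSERT INTO gene_article VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3), (2, 2)])
conn.execute("INSERT INTO gene_disease VALUES (1, 1)")
conn.executemany("INSERT INTO article_author VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)])


def genes_for_article(article_id):
    """Return the symbols of all genes linked to one article."""
    rows = conn.execute(
        "SELECT g.symbol FROM gene g "
        "JOIN gene_article ga ON ga.gene_id = g.id "
        "WHERE ga.article_id = ? ORDER BY g.id",
        (article_id,),
    )
    return [row[0] for row in rows]


print(genes_for_article(2))
```

Article 2 is the interesting case, since it is linked to both genes.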

An example instance of the model (shown hierarchically, with shared related
objects duplicated under each parent):

Gene [ID=1, Symbol=EGFR, Description=epidermal growth factor receptor]
  Article [ID=1, Title=EGFR mutations in lung cancer: correlation with
  clinical response to gefitinib therapy]
    Author [ID=1, Name=H. Michaelson]
    Author [ID=2, Name=J. Watson]
  Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
  target class-selective prefractionation and tandem mass spectrometry]
    Author [ID=1, Name=H. Michaelson]
    Author [ID=3, Name=M. Roberts]
  Disease [ID=1, Name=Epidermal sluffing]

Gene [ID=2, Symbol=AHCY, Description=S-adenosylhomocysteine hydrolase]
  Article [ID=3, Title=Limited proteolysis of S-adenosylhomocysteine
  hydrolase: implications for the three-dimensional structure]
    Author [ID=4, Name=B. Cohen]
    Author [ID=5, Name=L. Alexander]
  Article [ID=2, Title=Proteomics analysis of epidermal protein kinases by
  target class-selective prefractionation and tandem mass spectrometry]
    Author [ID=1, Name=H. Michaelson]
    Author [ID=3, Name=M. Roberts]

Note the IDs in the objects above: they convey the relations in the
hierarchical model (the same ID under two genes means the same object).
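
One way to picture the "relations dealt with as duplications" idea is to
flatten the normalized data into one gene-centric list of entries per gene.
The sketch below is an assumption about how that could look in Python; the
dictionaries stand in for the tables, and gene_document is a hypothetical
helper, not part of any library.

```python
# Normalized data, keyed by ID (stand-ins for the relational tables).
genes = {
    1: ("EGFR", "epidermal growth factor receptor"),
    2: ("AHCY", "S-adenosylhomocysteine hydrolase"),
}
articles = {
    1: "EGFR mutations in lung cancer: correlation with clinical "
       "response to gefitinib therapy",
    2: "Proteomics analysis of epidermal protein kinases by target "
       "class-selective prefractionation and tandem mass spectrometry",
    3: "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
       "implications for the three-dimensional structure",
}
diseases = {1: "Epidermal sluffing"}
authors = {1: "H. Michaelson", 2: "J. Watson", 3: "M. Roberts",
           4: "B. Cohen", 5: "L. Alexander"}

# M:M link tables: plain pairs of IDs.
gene_article = [(1, 1), (1, 2), (2, 3), (2, 2)]
gene_disease = [(1, 1)]
article_author = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 4), (3, 5)]


def gene_document(gene_id):
    """Flatten everything reachable from one gene into
    (object type, object ID, field, text) entries."""
    symbol, description = genes[gene_id]
    entries = [
        ("Gene", gene_id, "Symbol", symbol),
        ("Gene", gene_id, "Description", description),
    ]
    for g_id, a_id in gene_article:
        if g_id != gene_id:
            continue
        entries.append(("Article", a_id, "Title", articles[a_id]))
        # Authors hang off the article, so they are duplicated per gene too.
        for art_id, au_id in article_author:
            if art_id == a_id:
                entries.append(("Author", au_id, "Name", authors[au_id]))
    for g_id, d_id in gene_disease:
        if g_id == gene_id:
            entries.append(("Disease", d_id, "Name", diseases[d_id]))
    return entries


for entry in gene_document(2):
    print(entry)
```

Article 2 and its authors end up in the documents of both genes, which is
exactly the duplication shown in the example.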

In our Full-Text search, we would like to allow users to search ANY textual
field for any string. For instance, a user searches for the term
"epidermal", and we display the list of genes that have that term anywhere
in their associated data (ranked, of course).
Our list of results would be something like:

EGFR
  Found in Description (epidermal growth factor receptor)
  Found in Article ID#2, in Title (proteomics analysis of epidermal protein
  kinases by target class-selective prefractionation and tandem mass
  spectrometry)
  Found in Disease ID#1, in Name (Epidermal sluffing)

AHCY
  Found in Article ID#2, in Title (proteomics analysis of epidermal protein
  kinases by target class-selective prefractionation and tandem mass
  spectrometry)

Note that the results retain a hierarchical view of our Genes (being
Gene-Centric, we're pretty much framing the question as "find this term in
any information related to those genes"). Also note that Article ID #2 has
an M:M with both Gene ID 2 (AHCY) and Gene ID 1 (EGFR), and only because of
that link is AHCY considered a gene that has "epidermal" in its annotations.
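
To make the desired behaviour concrete, here is a rough sketch of a
gene-grouped search over already-flattened entries. It uses a plain
case-insensitive substring match, which is an assumption on my part and far
simpler than what a real search engine does; ENTRIES is a hand-written
subset of the example data and search is a hypothetical function.

```python
from collections import defaultdict

# (gene symbol, object type, object ID, field, text); subset of the example.
ENTRIES = [
    ("EGFR", "Gene", 1, "Description", "epidermal growth factor receptor"),
    ("EGFR", "Article", 1, "Title",
     "EGFR mutations in lung cancer: correlation with clinical "
     "response to gefitinib therapy"),
    ("EGFR", "Article", 2, "Title",
     "Proteomics analysis of epidermal protein kinases by target "
     "class-selective prefractionation and tandem mass spectrometry"),
    ("EGFR", "Author", 1, "Name", "H. Michaelson"),
    ("EGFR", "Disease", 1, "Name", "Epidermal sluffing"),
    ("AHCY", "Gene", 2, "Description", "S-adenosylhomocysteine hydrolase"),
    ("AHCY", "Article", 3, "Title",
     "Limited proteolysis of S-adenosylhomocysteine hydrolase: "
     "implications for the three-dimensional structure"),
    ("AHCY", "Article", 2, "Title",
     "Proteomics analysis of epidermal protein kinases by target "
     "class-selective prefractionation and tandem mass spectrometry"),
]


def search(term):
    """Group every entry whose text contains the term by gene symbol."""
    needle = term.lower()
    hits = defaultdict(list)
    for symbol, kind, obj_id, field, text in ENTRIES:
        if needle in text.lower():
            hits[symbol].append((kind, obj_id, field, text))
    return dict(hits)


results = search("epidermal")
for symbol, found in results.items():
    print(symbol)
    for kind, obj_id, field, text in found:
        print(f"  Found in {kind} ID#{obj_id}, in {field} ({text})")
```

AHCY shows up only through the duplicated Article 2 entry, matching the
note above.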

Obviously, we'd like to rank results by location in the hierarchy (a term
in a gene name scores higher than a term in the name of an author of an
article related to a gene) and by number of hits (the number of times a term
is found in data related to that gene, 3 in the case of EGFR above).
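
As a worked illustration of that ranking idea: give each (object, field)
pair a weight and sum the weights over a gene's hits, so that both position
in the hierarchy and hit count feed into the score. The weights below are
made-up numbers purely for illustration, and score and rank are
hypothetical helpers.

```python
# Hypothetical weights: fields closer to the gene itself count for more.
FIELD_WEIGHTS = {
    ("Gene", "Symbol"): 10.0,
    ("Gene", "Description"): 8.0,
    ("Disease", "Name"): 4.0,
    ("Article", "Title"): 3.0,
    ("Author", "Name"): 1.0,
}
DEFAULT_WEIGHT = 0.5  # fallback for any field not listed above


def score(hits):
    """Sum the weights of all hits, so more hits also means a higher score."""
    return sum(FIELD_WEIGHTS.get(hit, DEFAULT_WEIGHT) for hit in hits)


def rank(hits_by_gene):
    """Return (gene, score, hit count) tuples, best score first."""
    scored = [(gene, score(hits), len(hits))
              for gene, hits in hits_by_gene.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hits for the term "epidermal" from the example above.
hits_by_gene = {
    "AHCY": [("Article", "Title")],
    "EGFR": [("Gene", "Description"), ("Article", "Title"),
             ("Disease", "Name")],
}

for gene, total, count in rank(hits_by_gene):
    print(f"{gene}: score={total} hits={count}")
```

With these assumed weights EGFR scores 8 + 3 + 4 = 15 and AHCY scores 3, so
EGFR is listed first.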

Ideas for how to take on this challenge? Implementation? Tools?

Thanks!
Yaron Golan

--
View this message in context: http://www.nabble.com/Full-Text-Search-in-a-Relational-Model-tp15063631p15063631.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: Full-Text Search in a Relational Model
My first impression is that you need a proper DB with a separate search
layer on top of it (rather than searching through the DB/SQL itself).
Perhaps you could try these -
1) http://www.opensymphony.com/compass/content/about.html
2) http://kasparov.skife.org/blog/2004/09/11/#lucene-ojb
3) http://www.dbsight.net/
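
To illustrate what a search layer on top of the DB means in general terms
(this is not how any of the tools above are implemented, just the basic
inverted-index idea in a few lines of Python), one could map each token to
the set of genes whose flattened text contains it. DOCS, tokenize and
build_index are hypothetical names for this sketch.

```python
import re
from collections import defaultdict

# Gene symbol -> all text reachable from that gene (subset of the example).
DOCS = {
    "EGFR": [
        "epidermal growth factor receptor",
        "Proteomics analysis of epidermal protein kinases by target "
        "class-selective prefractionation and tandem mass spectrometry",
        "Epidermal sluffing",
    ],
    "AHCY": [
        "S-adenosylhomocysteine hydrolase",
        "Proteomics analysis of epidermal protein kinases by target "
        "class-selective prefractionation and tandem mass spectrometry",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a token -> set of gene symbols mapping (an inverted index)."""
    index = defaultdict(set)
    for gene, texts in docs.items():
        for text in texts:
            for token in tokenize(text):
                index[token].add(gene)
    return index


index = build_index(DOCS)
print(sorted(index.get("epidermal", set())))
print(sorted(index.get("hydrolase", set())))
```

The index is built once from the database rows and queried without touching
SQL at search time.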

Please let us know if you find any other useful information in your search.

- SJ

On Jan 24, 2008 5:59 PM, yarongolan <yarong@xennexinc.com> wrote:
> [...]