Mailing List Archive: Problem in Unicode field value retrieval

Problem in Unicode field value retrieval

Jun 10, 2002, 4:12 AM

Post #1 of 4 (401 views)

Hi

I am trying to index and search Unicode (UTF-8) content. The code I am using to index the documents is as follows:

/**************************************************************************************************************************************/
IndexWriter iw = new IndexWriter("d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\index", new SimpleAnalyzer(), true);
String dirBase = "d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\docs";
File docDir = new File(dirBase);
String[] docFiles = docDir.list();
InputStreamReader isr;
InputStream is;
Document doc;
for (int i = 0; i < docFiles.length; i++)
{
    File tempFile = new File(dirBase + "\\" + docFiles[i]);
    if (tempFile.isFile())
    {
        System.out.println("Indexing File :" + docFiles[i]);
        is = new FileInputStream(tempFile);
        isr = new InputStreamReader(is, "utf-8");
        doc = new Document();
        doc.add(Field.UnIndexed("path", tempFile.toString()));
        doc.add(Field.Text("abc", (Reader) isr));
        doc.add(Field.Text("all", "sansui"));
        iw.addDocument(doc);
        is.close();
        isr.close();
        doc = null;
    }
}
iw.close();
is = null;
isr = null;
iw = null;
docDir = null;

System.out.println("Indexing Complete");

/**************************************************************************************************************************************/

Now, when I search the index and try to retrieve the field called abc using doc.get("abc"), I get null as the output.

Can anyone please tell me where I am going wrong?

Thanks And Regards
Harpreet

Re: Problem in Unicode field value retrieval

ian at digimem

Jun 10, 2002, 4:45 AM

Post #2 of 4 (396 views)

I don't think you can retrieve the contents of Fields that have
been loaded by a Reader. From the javadoc for Field:

Text(String name, Reader value)

Constructs a Reader-valued Field that is tokenized and indexed, but is
not stored in the index verbatim.
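
By contrast, the String overload, Field.Text(String name, String value), is documented as tokenized, indexed and stored, so one possible workaround is to read each file into a String before adding it to the Document. What follows is only a minimal sketch, assuming the files are small enough to hold in memory. ReaderContents and readAll are hypothetical names, not part of Lucene, and the Lucene call appears only in a comment so that the block runs with the JDK alone:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper, not part of Lucene: drains a Reader into a String
// so the text can be handed to Lucene as a String value.
class ReaderContents {

    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns -1 once the underlying stream is exhausted.
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a FileInputStream over one UTF-8 document.
        byte[] utf8 = "na\u00efve caf\u00e9".getBytes("UTF-8");
        InputStream is = new ByteArrayInputStream(utf8);
        Reader isr = new InputStreamReader(is, "UTF-8");
        try {
            String contents = readAll(isr);
            // With Lucene on the classpath, the indexing loop would then use
            //   doc.add(Field.Text("abc", contents));
            // instead of passing the Reader itself.
            System.out.println(contents);
        } finally {
            isr.close();
        }
    }
}
```

Storing the full text makes the index larger. If that matters, an alternative is to keep only the stored "path" field and re-read the file when a hit is displayed.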

--
Ian.
ian@digimem.net


Re: Problem in Unicode field value retrieval

harpreet at sansuisoftware

Jun 10, 2002, 5:21 AM

Post #3 of 4 (396 views)

Hi,

That was the problem, thanks :-). I am still struggling to get Lucene to
search non-English Unicode content, though. It works partially with
SimpleAnalyzer but returns no results with StandardAnalyzer. Is there a way
to output the exact contents that are going into the index?
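
One way to see what would reach the index is to print the tokens produced for a sample string before indexing it. The sketch below does not call Lucene at all. It is a JDK-only stand-in that assumes SimpleAnalyzer behaves as its javadoc describes, splitting text at non-letters and lower-casing the result. LetterTokens and tokens are hypothetical names, and the real analyzer may differ in detail:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not Lucene code: mimics the documented behaviour of
// SimpleAnalyzer (split at non-letters, then lower-case) using only the JDK.
class LetterTokens {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token, lower-cased.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter ends the current token.
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // One token per line, so the terms can be compared with the query.
        for (String t : tokens("Indexing File: R\u00e9sum\u00e9-2002.txt")) {
            System.out.println(t);
        }
    }
}
```

Character.isLetter returns false for digits and for combining marks, so under this assumption a script that relies on combining marks would be split in the middle of words. That might be one reason the results are only partial, though I have not verified it against the analyzer itself.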

Thanks and regards,
Harpreet

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Problem in Unicode field value retrieval

otis_gospodnetic at yahoo

Jun 10, 2002, 8:08 AM

Post #4 of 4 (411 views)

Hello,

> That was the problem, thanks :-). I am still struggling to get Lucene to
> search non-English Unicode content, though. It works partially with
> SimpleAnalyzer but returns no results with StandardAnalyzer. Is there a way
> to output the exact contents that are going into the index?

Perhaps something like this will help. This is a very recent post from
the searchable mailing list archives at http://nagoya.apache.org/:

http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgId=352570

Otis
