Mailing List Archive

Indexing other documents type than html and txt
Hi all,
I have a question. I know that Lucene can index HTML and text documents,
but can it index other types of documents, such as PDF, DOC, and XLS
files? If it can, how can I implement it? Perhaps it can be implemented
like the HTML and txt indexing?

regards

Antonio




--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Indexing other documents type than html and txt [ In reply to ]
You'd have to write a parser for each of those document types to convert
them to text, and then index that text.
Sure, you can feed it something like XML, but then you may want to
consider something like xmldb.org instead.
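The parser-per-format idea can be sketched as a small dispatch table; this is a hypothetical modern-Java illustration (ParserRegistry and TextParser are invented names), with the eventual Lucene call only indicated in a comment:

```java
import java.util.HashMap;
import java.util.Map;

public class ParserRegistry {
    /** A format-specific parser turns raw file bytes into plain text. */
    public interface TextParser {
        String parse(byte[] raw);
    }

    private final Map<String, TextParser> parsers = new HashMap<>();

    public void register(String extension, TextParser p) {
        parsers.put(extension.toLowerCase(), p);
    }

    /** Extract text for a file name, or null if no parser is registered. */
    public String extract(String fileName, byte[] raw) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) return null;
        TextParser p = parsers.get(fileName.substring(dot + 1).toLowerCase());
        if (p == null) return null;
        // The extracted text would then become a field on a Lucene Document, e.g.
        // doc.add(Field.Text("contents", text));
        return p.parse(raw);
    }
}
```

A PDF or XLS parser would be registered under its extension; anything without a registered parser is simply skipped.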

Otis

--- Antonio Vazquez <antonio_listas@yahoo.es> wrote:
>
> Hi all,
> I have a doubt. I know that lucene can index html and text documents,
> but
> can it index other type of documents like pdf,docs, and xls
> documents? if it
> can, how can I implement it? Perhaps can implement it like html and
> txt
> indexing?
>
> regards
>
> Antonio
>
>
> _________________________________________________________
> Do You Yahoo!?
> Get your free @yahoo.com address at http://mail.yahoo.com
>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>


Re: Indexing other documents type than html and txt (XML) [ In reply to ]
I have started to create a set of generic Lucene document types that can
be easily manipulated depending on the fields.
I know others have generated Documents out of PDF.
Is there some place we can add contributed classes to the Lucene web
page?

Here is my current version of the XMLDocument. It's a bit slow.
It takes a path (borrowed from the Document example) and, based on field
name / XPath pairs (key / value) from either an array or a property file,
generates a Lucene document with the specified fields.

I have not tested all permutations of Document (I have used the File and
Properties variants) and it works.

Note:
It uses the Xalan example ApplyXpath class to evaluate the XPaths.

I hope this helps.

--Peter

--------------------------------------------------

package xxx.lucene.xml;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;

import org.apache..../ApplyXpath;
import java.util.Properties;
import java.io.File;
import java.util.Enumeration;
import java.io.FileInputStream;

/**
 * A utility for making a Lucene document from an XML source and a set
 * of XPaths, based on the Document example from Lucene.
 */
public class XMLDocument
{
    private XMLDocument() { }

    /**
     * @param file the document to be converted to a Lucene document
     * @param propertyList properties where the key is the field name
     *        and the value is the XML XPath
     * @throws java.io.FileNotFoundException
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File file, Properties propertyList)
        throws java.io.FileNotFoundException, Exception
    {
        Document doc = new Document();

        // add path
        doc.add(Field.Text("path", file.getPath()));

        // add date modified
        doc.add(Field.Keyword("modified",
            DateField.timeToString(file.lastModified())));

        // add each field listed in the property list
        Enumeration e = propertyList.propertyNames();
        while (e.hasMoreElements())
        {
            String key = (String) e.nextElement();
            String xpath = propertyList.getProperty(key);
            String[] valueArray = ApplyXpath(file.getPath(), xpath);
            StringBuffer value = new StringBuffer("");
            for (int i = 0; i < valueArray.length; i++)
            {
                value.append(valueArray[i]);
            }
            //System.out.println("add key " + key + " with value = " + value);
            filter(key, value);
            doc.add(Field.Text(key, value.toString()));
        }

        return doc;
    }

    /**
     * @param file the document to be converted to a Lucene document
     * @param fieldNames field names for the Lucene document
     * @param xpaths XML XPaths for the information you want to get
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File file, String[] fieldNames,
                                    String[] xpaths)
        throws Exception
    {
        if (fieldNames.length != xpaths.length)
        {
            throw new IllegalArgumentException(
                "String arrays are not of equal size");
        }

        Properties propertyList = new Properties();

        // generate properties from the arrays
        for (int i = 0; i < fieldNames.length; i++) {
            propertyList.setProperty(fieldNames[i], xpaths[i]);
        }

        return Document(file, propertyList);
    }

    /**
     * @param path path of the document to be converted to a Lucene document
     * @param fieldNames field names for the Lucene document
     * @param xpaths XML XPaths for the information you want to get
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, String[] fieldNames,
                                    String[] xpaths)
        throws Exception
    {
        File file = new File(path);
        return Document(file, fieldNames, xpaths);
    }

    /**
     * @param path path of the document you want to convert to a Lucene document
     * @param propertyList properties where the key is the field name
     *        and the value is the XML XPath
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, Properties propertyList)
        throws Exception
    {
        File file = new File(path);
        return Document(file, propertyList);
    }

    /**
     * @param documentPath path of the document to be converted to a Lucene document
     * @param propertyPath path of a file containing properties where the key
     *        is the field name and the value is the XML XPath
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String documentPath, String propertyPath)
        throws Exception
    {
        File file = new File(documentPath);
        FileInputStream fis = new FileInputStream(propertyPath);
        Properties propertyList = new Properties();
        propertyList.load(fis);
        return Document(file, propertyList);
    }

    /**
     * @param documentFile the document to be converted to a Lucene document
     * @param propertyFile a file containing properties where the key is the
     *        field name and the value is the XML XPath
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File documentFile, File propertyFile)
        throws Exception
    {
        FileInputStream fis = new FileInputStream(propertyFile);
        Properties propertyList = new Properties();
        propertyList.load(fis);
        return Document(documentFile, propertyList);
    }

    private static String filter(String key, StringBuffer value) {
        return value.toString();
    }
}
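The same field-name/XPath mapping can be sketched without the Xalan ApplyXpath sample; this hypothetical stand-in (XPathFields is an invented name) uses the JDK's built-in javax.xml.xpath API, which postdates this thread, to evaluate each property's XPath against the parsed document:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class XPathFields {
    /** Evaluate each property's XPath (value) and map it to its field name (key). */
    public static Map<String, String> fields(String xml, Properties props)
            throws Exception {
        org.w3c.dom.Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : props.stringPropertyNames()) {
            // evaluate() with a String result returns the text value of the match
            out.put(key, xp.evaluate(props.getProperty(key), dom));
        }
        return out;
    }
}
```

For example, a property file line title=/doc/title would yield a "title" field containing the text of the document's <title> element; each resulting entry would then be added as a Field on a Lucene Document.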


Re: Indexing other documents type than html and txt (XML) [ In reply to ]
I second the motion to have a place to store contributed Document
"generators".

I've developed an HTML file handler that creates a Document using JTidy
under the covers to DOM'ify the input, pull only the non-HTML-tagged text
into a "content" field, and strip the <title> out as a separate field. It
would actually be far more extensible if it handed the DOM'ified HTML off
to Peter's XMLDocument class so that XPath could be used to turn things
into fields. I'm not sure how my code compares to the demo HTMLParser.jj
(mine probably requires cleaner HTML and may not be as fast, but it can
use a DOM to extract elements/attributes).
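A much cruder stand-in for the JTidy/DOM approach can be sketched in plain Java; HtmlText is a hypothetical name, and a real handler would parse a DOM rather than scan characters, but it shows the two fields being separated:

```java
public class HtmlText {
    /** Very rough tag stripper: drops <...> spans and keeps the text between them. */
    public static String strip(String html) {
        StringBuilder sb = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) sb.append(c);
        }
        return sb.toString().trim();
    }

    /** Pull the <title> element out as its own field, if present. */
    public static String title(String html) {
        String lower = html.toLowerCase();
        int s = lower.indexOf("<title>");
        int e = lower.indexOf("</title>");
        if (s < 0 || e <= s) return "";
        return html.substring(s + "<title>".length(), e).trim();
    }
}
```

The stripped body would go into the "content" field and the title into its own field, exactly as the JTidy-based handler does.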

How does lucene-dev feel about creating a 'contrib' area in CVS for these
kinds of things that folks really need to make Lucene come to life for them,
but are obviously not part of the main engine?

Erik

----- Original Message -----
From: <carlson@bookandhammer.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, November 29, 2001 12:03 PM
Subject: Re: Indexing other documents type than html and txt (XML)


> I have started to create a set of generic lucene document types that can
> be easily manipulated depending on the fields.
> [rest of quoted message and XMLDocument code snipped; quoted in full above]


Re: Indexing other documents type than html and txt [ In reply to ]
Here is another version of something I had posted earlier. It attempts to
read the "text" out of binary files. It's not perfect, and it doesn't work
at all on PDF. It permits you to use the "reader" form of a Field to index.

import java.util.*;
import java.io.*;

/**
<p>This class is designed to retrieve text from binary files.
The occasion for its development was to find a generic way to
index typical office documents, which are almost always in a
proprietary and binary form.
<p>This class will <b>not</b> work with PDF files.
<p>You can exercise some control over the result by using the
<code>setCharArray()</code> method and the
<code>setShortestToken()</code> method.
<ul>
<li><code>setCharArray()</code>: allows you to override the default
characters to keep. All others are eliminated. The default "keepers"
are all ASCII characters plus whitespace. This means that if a
text file is the input, it will pass through unchanged (except that
consecutive blanks are squeezed to a single blank).
<li><code>setShortestToken()</code>: allows you to keep only strings of
a minimum length. By default the length is zero, meaning that all
tokens are passed.
</ul>
<p>Note lastly that this class is only designed to work with ASCII.
It may not be difficult to change it to support Unicode, but I do
not know how to do that.
*/

public class BinaryReader
    extends java.io.FilterReader
{
    // private vars
    // for debugging
    private int count = 0;
    private int rawcnt = 0;
    private int shortestToken = 0;
    // default char set to keep; blank out everything else
    private char[][] charArray = {
        {'!', '~'},
        {'\t', '\t'},
        {'\r', '\r'},
        {'\n', '\n'},
    };

    private String leftovers = "";

    private char charFilter(char c) {
        for (int i = 0; i < charArray.length; i++) {
            if (c >= charArray[i][0] && c <= charArray[i][1]) {
                return c;
            }
        }
        return ' ';
    }

    public BinaryReader(Reader in) {
        super(in);
    }

    /**
    <p>This method may be used to override the ranges of characters
    that are retained. All others are eliminated. The default is:
    <code>
    private char[][] charArray = {
        {'!', '~'},
        {'\t', '\t'},
        {'\r', '\r'},
        {'\n', '\n'},
    };
    </code>
    <p>Note that the ranges are inclusive and that to pick out a
    "single" character instead of a range, just make that character
    both the min and max (as shown for the whitespace characters above).
    @param keepers array of character ranges to keep
    */
    public void setCharArray(char[][] keepers) {
        // in each row, column 1 is the min and column 2 is the max;
        // to pick out a single character instead of a range,
        // just make it both min and max.
        charArray = keepers;
    }

    /**
    <p>This method may be used to eliminate "short strings" of text.
    By default it passes even single letters, since the value is
    initialized to zero. For example, if the length 3 is used, one-
    and two-letter strings will not be returned.
    <p><b>Warning: the test doesn't always work for strings that
    begin a line of text (at least in DOS/Windows).</b>
    @param len length of the shortest strings to pass
    */
    public void setShortestToken(int len) {
        shortestToken = len;
    }

    /**
    <p>Reads a single character and runs it through the filter. The
    (int) character returned will either be -1 for end-of-file,
    a blank (indicating it was filtered), or the character unchanged.
    */
    public int read() throws IOException
    {
        int c = in.read();
        if (c == -1) return c;
        rawcnt++;
        count++;
        return charFilter((char) c);
    }

    /**
    <p>Reads from the stream and populates the supplied char array.
    @param cbuf character buffer to fill
    @return number of characters actually placed into the buffer
    */
    public int read(char[] cbuf) throws IOException
    {
        return read(cbuf, 0, cbuf.length);
    }

    /**
    <p>Reads from the stream and populates the supplied char array
    using the offset and length provided.
    @param cbuf character buffer to fill
    @param off offset at which to begin filling the array
    @param len maximum characters to place into the array
    @return number of characters actually placed into the buffer
    */
    public int read(char[] cbuf, int off, int len)
        throws IOException
    {
        char[] cb = new char[len];
        int cnt = in.read(cb);
        if (cnt == -1) {
            //System.out.println("At end, rawcnt is " + rawcnt);
            return cnt; // done
        }
        int loc = off;
        for (int i = 0; i < cnt; i++) {
            cbuf[loc++] = charFilter(cb[i]);
        }

        char[] weeded = filter(new String(cbuf, off, cnt));
        int cnt2 = weeded.length;
        // redo the buffer with the weeded characters
        for (int i = 0; i < cnt2; i++) {
            cbuf[off + i] = weeded[i];
        }

        rawcnt += cnt;
        count += cnt2;
        return cnt2;
    }

    private char[] filter(String instring)
    {
        // record the buffer size (ie, size of the incoming string)
        int max = instring.length();
        // combine leftovers into the incoming string and reset leftovers
        String s = leftovers + instring;
        leftovers = "";

        StringBuffer sb = new StringBuffer(s.length());
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.length() < shortestToken) {
                // skip it
                continue;
            }
            sb.append(tok);
            sb.append(' ');
        }

        String t = sb.toString();
        if (t.length() > 0) {
            t = t.substring(0, t.length() - 1); // remove the appended blank
        }
        if (t.length() > max) {
            leftovers = t.substring(max);
            t = t.substring(0, max);
        }
        return t.toCharArray();
    }

    /**
    <p>Returns the number of characters read from the Reader stream.
    @return number of characters read
    */
    public int getInputCount() { return rawcnt; }

    /**
    <p>Returns the number of characters passed by the filter (ie, after
    the binary characters are removed).
    @return number of characters not filtered out
    */
    public int getOutputCount() { return count; }

    /**
    <p>A handy main() method to test or perform the filtering using the
    defaults. Started from the command line, it takes two required
    arguments: the input filename and the output filename.
    <p>An optional third argument is an integer to set the
    shortest tokens to pass the filter.
    */
    public static void main(String[] args)
        throws FileNotFoundException, IOException
    {
        if (args.length < 2) {
            System.out.println(
                "Usage: java BinaryReader infile outfile [shortest]");
            System.out.println("where 'shortest' is the shortest token passed.");
            System.exit(0);
        }
        FileWriter fw = new FileWriter(args[1]);
        FileReader fr = new FileReader(args[0]);
        BufferedReader br = new BufferedReader(fr);
        BinaryReader binr = new BinaryReader(br);
        if (args.length > 2) {
            binr.setShortestToken(Integer.parseInt(args[2]));
        }
        char[] cb = new char[1024];
        int cnt;
        while ((cnt = binr.read(cb)) != -1) {
            fw.write(cb, 0, cnt);
        }
        fw.close();
        int ocnt = binr.getOutputCount();
        int icnt = binr.getInputCount();
        System.out.println("Input Character Count =" + icnt);
        System.out.println("Output Character Count=" + ocnt);
    }

}
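The token-length filtering at the heart of filter() can be tried in isolation; this stand-alone sketch (TokenFilterDemo is a hypothetical name) uses the same StringTokenizer approach:

```java
import java.util.StringTokenizer;

public class TokenFilterDemo {
    /** Keep only blank-separated tokens of at least minLen characters. */
    public static String keepLong(String s, int minLen) {
        StringBuilder sb = new StringBuilder();
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.length() < minLen) continue; // drop short fragments
            if (sb.length() > 0) sb.append(' ');
            sb.append(tok);
        }
        return sb.toString();
    }
}
```

With a minimum length of 3, "ab cde f ghij" filters to "cde ghij", which is the effect setShortestToken(3) has on the character stream.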

----- Original Message -----
From: Antonio Vazquez <avazquez@cystelcom.com>
To: Lucene Users List <lucene-user@jakarta.apache.org>
Sent: Thursday, November 29, 2001 6:26 AM
Subject: Indexing other documents type than html and txt


> Hi all,
> I have a doubt. I know that lucene can index html and text documents, but
> can it index other type of documents like pdf,docs, and xls documents? if it
> can, how can I implement it? Perhaps can implement it like html and txt
> indexing?
>
> regards
>
> Antonio
>
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>



RE: Indexing other documents type than html and txt (XML) [ In reply to ]
> From: Erik Hatcher [mailto:lists@ehatchersolutions.com]
>
> How does lucene-dev feel about creating a 'contrib' area in
> CVS for these
> kinds of things that folks really need to make Lucene come to
> life for them,
> but are obviously not part of the main engine?

I think this is a fine idea, but it needs to be managed. We don't want an
area where anyone can upload anything. It could easily become filled with
things that don't even compile, and would cause folks more headaches than it
would relieve.

So if someone would like to volunteer to administer this area, then I'm for
it. Administration would include some limited testing of each contributed
module, ensuring that each has documentation, rejecting poorly written
modules, writing a top-level document that describes all contributed
modules, etc. Anyone interested?

Doug

Re: Indexing other documents type than html and txt (XML) [ In reply to ]
I'll take on creating a Document repository.
I would like to get some ideas about what kinds of Documents people are
creating and what they want.

What's the next step Doug?

--Peter


On Friday, November 30, 2001, at 08:22 AM, Doug Cutting wrote:

>> From: Erik Hatcher [mailto:lists@ehatchersolutions.com]
>>
>> How does lucene-dev feel about creating a 'contrib' area in
>> CVS for these
>> kinds of things that folks really need to make Lucene come to
>> life for them,
>> but are obviously not part of the main engine?
>
> I think this is a fine idea, but it needs to be managed. We don't want an
> area where anyone can upload anything. It could easily become filled with
> things that don't even compile, and would cause folks more headaches than
> it would relieve.
>
> So if someone would like to volunteer to administer this area, then I'm
> for it. Administration would include some limited testing of each
> contributed module, ensuring that each has documentation, rejecting
> poorly written modules, writing a top-level document that describes all
> contributed modules, etc. Anyone interested?
>
> Doug
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

