Mailing List Archive

excluding files / refining search
Hello,

I've been working with lucene for about a month now. I've got my indexes
created, but I'm having a problem with the results I've been returning.

The site I'm working on has a lot of small html files that are used for page
construction (nav bars, footers, etc) and they're being returned high in the
results because they contain the search term(s) I'm looking for and are
small so they rank higher than larger documents.

I want to exclude them from the index and I've come up with two ideas:

1) move them to a directory, which I will exclude from the index, but I'll
have to change a bunch of links

2) detect them with some sort of flag and exclude them from the index. We
were thinking that we could have a fake tag that lucene would detect and not
index those pages.

Has anyone run into this problem before? How difficult would it be to
implement 2? Is there a way to detect a fake tag? I'm assuming that I can
create a new boolean in the HTMLDocument class (true if it contains the
exclude tag) and then not run Indexwriter.addDocument() if I find it.

b


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: excluding files / refining search [ In reply to ]
Brian Rook <brian.rook@xor.com> writes:
> The site I'm working on has a lot of small html files that are used for page
> construction (nav bars, footers, etc) and they're being returned high in the
> results because they contain the search term(s) I'm looking for and are
> small so they rank higher than larger documents.
>
> I want to exclude them from the index and I've come up with two ideas:
>
> 1) move them to a directory, which I will exclude from the index, but I'll
> have to change a bunch of links
>
> 2) detect them with some sort of flag and exclude them from the index. We
> were thinking that we could have a fake tag that lucene would detect and not
> index those pages.

Why not just have an exclude list of some sort? In the code you
wrote to select files for indexing, just have it check against a list
of files you want to exclude. In the demo application, you would edit

jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java


The quick and dirty method would be to edit this section of code:

public static void indexDocs(IndexWriter writer, File file)
throws Exception {
if (file.isDirectory()) {
String[] files = file.list();
for (int i = 0; i < files.length; i++)
indexDocs(writer, new File(file, files[i]));
} else {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}


To something like this:


public static void indexDocs(IndexWriter writer, File file)
throws Exception {
if (file.isDirectory()) {
String[] files = file.list();
for (int i = 0; i < files.length; i++)
indexDocs(writer, new File(file, files[i]));
} else {
if (checkFileName(file)) {
System.out.println("skipping " + file) ;
} else {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}
}

public static boolean checkFileName(File file) {
String name = file.getName() ;
if (name == "footer.html" ||
name == "header.html" ||
name == "menu.html" ||
name == "navbar.html") {
return false ;
}
return true ;
}


A more realistic implementation would use an "exclude file" of
filenames to ignore, load them into a collection (probably a HashSet)
and keep that collection around as an instance variable. Then
checkFileName() just returns !excludedSet.contains(name).

Steven J. Owens
puff@darksleep.com

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: excluding files / refining search [ In reply to ]
Or could be handled by the Ant index task I submitted:

<index index="${build.dir}/index">
<fileset dir="docs">
<exclude name="**/header.html"/>
</fileset>
</index>


----- Original Message -----
From: "Steven J. Owens" <puffmail@darksleep.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, February 14, 2002 8:41 PM
Subject: Re: excluding files / refining search


> Brian Rook <brian.rook@xor.com> writes:
> > The site I'm working on has a lot of small html files that are used for
page
> > construction (nav bars, footers, etc) and they're being returned high in
the
> > results because they contain the search term(s) I'm looking for and are
> > small so they rank higher than larger documents.
> >
> > I want to exclude them from the index and I've come up with two ideas:
> >
> > 1) move them to a directory, which I will exclude from the index, but
I'll
> > have to change a bunch of links
> >
> > 2) detect them with some sort of flag and exclude them from the index.
We
> > were thinking that we could have a fake tag that lucene would detect and
not
> > index those pages.
>
> Why not just have an exclude list of some sort? In the code you
> wrote to select files for indexing, just have it check against a list
> of files you want to exclude. In the demo application, you would edit
>
> jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java
>
>
> The quick and dirty method would be to edit this section of code:
>
> public static void indexDocs(IndexWriter writer, File file)
> throws Exception {
> if (file.isDirectory()) {
> String[] files = file.list();
> for (int i = 0; i < files.length; i++)
> indexDocs(writer, new File(file, files[i]));
> } else {
> System.out.println("adding " + file);
> writer.addDocument(FileDocument.Document(file));
> }
> }
>
>
> To something like this:
>
>
> public static void indexDocs(IndexWriter writer, File file)
> throws Exception {
> if (file.isDirectory()) {
> String[] files = file.list();
> for (int i = 0; i < files.length; i++)
> indexDocs(writer, new File(file, files[i]));
> } else {
> if (checkFileName(file)) {
> System.out.println("skipping " + file) ;
> } else {
> System.out.println("adding " + file);
> writer.addDocument(FileDocument.Document(file));
> }
> }
> }
>
> public static boolean checkFileName(File file) {
> String name = file.getName() ;
> if (name == "footer.html" ||
> name == "header.html" ||
> name == "menu.html" ||
> name == "navbar.html") {
> return false ;
> }
> return true ;
> }
>
>
> A more realistic implementation would use an "exclude file" of
> filenames to ignore, load them into a collection (probably a HashSet)
> and keep that collection around as an instance variable. Then
> checkFileName() just returns !excludedSet.contains(name).
>
> Steven J. Owens
> puff@darksleep.com
>
> --
> To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>