Mailing List Archive

fast phrase matching [patch]
Here's a stab at a faster phrase matcher. I haven't tested it on
anything other than an improved version of t/502-phrase-query and
Matthew's test case from the weekend. Looking at the object code it
generates, I think it should be considerably more efficient than the
current code, but it's also unlikely that the current code is ever a
bottleneck unless searching for 'and the' in a giant document.

I don't know that it makes sense as a patch right now. I won't be
offended if you decide that the current code is more reliable and
readable. I pulled out the DEBUG() and ASSERT() statements I used to
write it, which felt odd. I also convoluted it to agree with 'gcc
-pedantic', which I think definitely hurt readability. I think the
expanded tests for phrase offsets merit inclusion, though.

What is your opinion on C99 versus -pedantic? And how do you feel
about littering the code with assertions and environmentally
controlled debug statements?

Nathan Kurz
nate@verse.com
Re: fast phrase matching [patch] [ In reply to ]
On Sep 10, 2007, at 6:05 PM, Nathan Kurz wrote:

> Here's a stab at a faster phrase matcher. I haven't tested it on
> anything other than an improved version of t/502-phrase-query and
> Matthew's test case from the weekend.

Thanks! This looks like a very nice patch. I'll test it and
presumably commit it after I finish some other work.

> Looking at the object code it generates,

This is something I've dabbled in, but would like to pursue in
earnest. Can you suggest some links or a course of study to get me
on my way?

> I think it should be considerably more efficient than the
> current code, but it's also unlikely that the current code is ever a
> bottleneck unless searching for 'and the' in a giant document.

Yes, but... we're about to attempt a generalization of this position
code.

> I don't know that it makes sense as a patch right now. I won't be
> offended if you decide that the current code is more reliable and
> readable. I pulled out the DEBUG() and ASSERT() statements I used to
> write it, which felt odd. I also convoluted it to agree with 'gcc
> -pedantic', which I think definitely hurt readability. I think the
> expanded tests for phrase offsets merit inclusion, though.
>
> What is your opinion on C99 versus -pedantic?

MSVC is a target, and it doesn't support C99.

I really hate the declaration after statement limitation in
particular, but KS needs to be maximally portable.

> And how do you feel
> about littering the code with assertions and environmentally
> controlled debug statements?

Often, assertions are clarifying. Other times, they're just paranoid
-- like checking every dang pointer that could be a NULL. I see the
addition of clarifying or important ones as beneficial and would be
pleased to have them in the code base.

PS: Tabs suck.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: fast phrase matching [patch] [ In reply to ]
On 9/10/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > Looking at the object code [it] generates,
>
> This is something I've dabbled in, but would like to pursue in
> earnest. Can you suggest some links or a course of study to get me
> on my way?

Unfortunately, I'm at best an advanced beginner at such things. The
approach I used here was to compile with -ggdb3 and use 'objdump -S'
on the object file. Then I stared at the output until it started to
make sense. My analysis here went no further than the generalization
that shorter code with fewer branches is better. While that's generally
true, with modern processors one probably needs to test. I had the
advantage that this algorithm was adapted from something similar I
wrote earlier, which I had actually benchmarked on an Opteron, although
the modifications were major enough that the old results might not
apply anymore.

That said, here are two links that I've found useful (and that greatly
exceed my knowledge):

Software Optimization Guide for the AMD64 Processors
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
Specific to AMD, but includes a lot of background.

Software optimization resources
http://www.agner.org/optimize/
Concentrating on C++ and Assembly, but helpful for C as well.

For the purposes of improving KinoSearch, though, I think the
biggest room for improvement is going to come from integrating better
with the virtual memory manager
(http://www.informit.com/content/images/0131453483/downloads/gorman_book.pdf),
from profiling to find bottlenecks with OProfile, and from L2
cache optimization with Cachegrind. I still think proper use of mmap
has tremendous potential.

> > What is your opinion on C99 versus -pedantic?
>
> MSVC is a target, and it doesn't support C99.
>
> I really hate the declaration after statement limitation in
> particular, but KS needs to be maximally portable.

OK. This is probably a fine decision, but I think it will definitely
have a cost. I was hoping that targeting gcc under Cygwin would be
enough. Are there people actively compiling this under MSVC
currently? I know nothing about Windows.

> Often, assertions are clarifying. Other times, they're just paranoid
> -- like checking every dang pointer that could be a NULL. I see the
> addition of clarifying or important ones as beneficial and would be
> pleased to have them in the code base.

I'll send you a version tomorrow with such things included for you to
decide where you want to draw that line. It's out of sync with the
patch right now.

> PS: Tabs suck.

Oops. Have I been including them in things I send?

Nathan Kurz
nate@verse.com

Re: fast phrase matching [patch] [ In reply to ]
On Sep 10, 2007, at 10:27 PM, Nathan Kurz wrote:

> On 9/10/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>>> Looking at the object code [it] generates,
>>
>> This is something I've dabbled in, but would like to pursue in
>> earnest. Can you suggest some links or a course of study to get me
>> on my way?
>
> Unfortunately, I'm at best an advanced beginner at such things. The
> approach I used here was to compile with -ggdb3 and use 'objdump -S'
> on the object file. Then I stared at the output until it started to
> make sense.

I found what I was looking for. There's a book called "Programming
from the Ground Up", by Jonathan Bartlett, available both in dead-
tree format and as a PDF under the GNU Free Documentation License.

http://xrl.us/6t7x (Link to search.barnesandnoble.com)
http://savannah.nongnu.org/projects/pgubook/
http://download.savannah.gnu.org/releases/pgubook/

A reviewer from the Barnes and Noble site explains its utility:

Most people realize that Gnu's assembler (as, commonly referred to as
'Gas') is not fit for human consumption. After all, its real purpose
in life is to assemble GCC output. As *nobody* writes in assembly
language anymore, who needs to know anything about Gas? Well,
surprise, surprise, people *do* still program in assembly language.
While there are other assemblers available for Linux and other
platforms where Gas can run (e.g., NASM, FASM, HLA), anyone wanting
to work with GNU tools will probably want to come to grips with GNU's
AT&T syntax at one point or another. The only problem is that there
are very few books on this subject. The FSF/GNU documentation is a
complete joke. This is where Jonathan Bartlett's book comes in. It's
the first reasonable beginner's book on Gas written for the x86 that
I've found. While I can't personally recommend that someone do all
their x86 work with Gas (though some will disagree with me), I'm also
of the opinion that anyone who works in assembly language is going to
have to deal with Gas sooner or later. And when that day arrives, this
book will prove very handy. Combined with the FSF/GNU documentation
for Gas (as a reference), this book will help someone overcome the
roadblock that Gas' AT&T syntax has been in the past.

> My analysis here went no further than the generalization
> that shorter code with fewer branches is better. While that's generally
> true, with modern processors one probably needs to test.

Sure. Some sophisticated benchmarking code has been contributed to
Lucene over the last year. I've been waiting for the pace of commits
to settle down before contemplating a port. It seems to have
stabilized over the last couple months, and might be worth a looksee
now.

Nevertheless, I think we'll get pretty far just by looking over
generated assembly code. We'll almost always have a good idea of
what we want the processor to be doing; if the output of 'gcc -S' or
'objdump -S' doesn't look like we think it should, we can tweak the C
code until we're satisfied.

> For the purposes of improving KinoSearch, though, I think that the
> biggest room for improvement is going to be through integrating better
> with the Virtual Memory Manager
> (http://www.informit.com/content/images/0131453483/downloads/gorman_book.pdf)
> from profiling to find bottlenecks with Oprofile, and through L2
> cache optimization with Cachegrind. I still think proper use of mmap
> has tremendous potential.

The classes we should focus on are InStream and OutStream. First
order of business is to study how the simple functions look in
assembly: read_byte, read_int, etc.

After that, we move on to what are currently called VInts and
VLongs. (I'm contemplating renaming them C32 and C64 for "compressed
32/64-bit integer", since they are no longer the same as Lucene VInts/
VLongs -- they're now BER compressed integers, as used by Perl's
pack() function.) First, we'd like to see whether those functions are
fully optimized under the current scheme. Second, data compression
is the bugaboo for integrating mmap, and we can brainstorm
alternative approaches while optimizing.
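By way of illustration, here is a minimal sketch of a BER compressed
integer codec in the pack("w") style (the function names are invented
here, not the actual KS routines): seven payload bits per byte, most
significant group first, with the high bit set on every byte except
the last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value as a BER compressed integer.  Writes 1-5 bytes into
 * buf and returns the number written. */
size_t c32_encode(uint32_t value, unsigned char *buf) {
    unsigned char tmp[5];
    size_t n = 0, i;
    tmp[n++] = value & 0x7F;              /* last byte: high bit clear   */
    while ((value >>= 7) != 0)
        tmp[n++] = (value & 0x7F) | 0x80; /* earlier bytes: high bit set */
    for (i = 0; i < n; i++)               /* tmp is reversed; flip it    */
        buf[i] = tmp[n - 1 - i];
    return n;
}

/* Decode one BER compressed integer, advancing *p past it.  Assumes
 * the caller guarantees the bytes are valid (no bounds checking). */
uint32_t c32_decode(const unsigned char **p) {
    const unsigned char *s = *p;
    uint32_t value = 0;
    while (*s & 0x80)
        value = (value << 7) | (uint32_t)(*s++ & 0x7F);
    value = (value << 7) | *s++;
    *p = s;
    return value;
}
```

Note the declarations-at-the-top, C89-friendly style, in keeping with
the -pedantic discussion above.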

> Are there people actively compiling this under MSVC
> currently? I know nothing about Windows.

The maint branch compiles under MSVC, which makes it possible to
produce a PPM and support the good people at ActiveState. Randy
Kobes was instrumental in getting us this far; tye, Corion, and
some other Windows users from PerlMonks have also been helpful. It's
actually Randy who publishes the PPMs because KinoSearch's 5.8.3
requirement prevents ActiveState from publishing its own. There are
a number of people who have acquired KS via this route.

Basically the only place maint doesn't compile is under Solaris, I
think because of an alignment problem which is fixed in trunk. Trunk
does have some issues which currently prevent compiling under
Windows, but I know about them and plan to fix them. The goal is to
have KS work on everything except for esoteric systems where e.g.
floats don't conform to IEEE 754.

Supporting MSVC from within KinoSearch isn't too complicated or
painful, because most of the hard work is handled at a higher level.
In maint, we're using the information generated by Perl's Configure
script (itself generated by Metaconfig). In trunk, the Charmonizer
compatibility layer plays this role.

> I'll send you a version tomorrow with such things included for you to
> decide where you want to draw that line. It's out of sync with the
> patch right now.

I went ahead and committed the version you supplied as r2555. Thanks!

Some mild mods followed in r2556 (<http://xrl.us/6uch>):

* The "inline" keyword has been replaced with the recently introduced
INLINE Charmonizer macro, which is empty if the compiler doesn't
support inline functions.
* A sanity check was added at the top of winnow_anchors() to prevent
  possible invalid pointer de-refs. Technically, this wasn't necessary
  because the calling function cannot currently supply data which
  triggers such a problem, but I was uneasy about the absence of a
  local safety mechanism.
* The assignment of the iteration variable "i" was moved from the
variable declaration at the top of PhraseScorer_calc_phrase_freq
to the loop initiation. This will thwart problems if stuff gets
moved around and i winds up with a new value prior to the loop.
Again, not technically necessary, just defensive programming
practice.
* winnow_anchors() now returns a u32_t rather than a size_t,
because it's returning a count of u32_t rather than char and the
C standard defines size_t as "A type to define sizes of strings
and memory blocks."

>> PS: Tabs suck.
>
> Oops. Have I been including them in things I send?

In this patch at least. No big deal, I zapped 'em.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: fast phrase matching [patch] [ In reply to ]
On 9/27/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> I found what I was looking for. There's a book called "Programming
> from the Ground Up", by Jonathan Bartlett, available both in dead-
> tree format and as a PDF under the GNU Free Documentation License.

Thanks, I shall check it out. I don't recall if I sent this in
another message, but provenance aside this is excellent for the
memory management in Linux side:
http://www.linux-security.cn/ebooks/ulk3-html/

> The classes we should focus on are InStream and OutStream. First
> order of business is to study how the simple functions look in
> assembly: read_byte, read_int, etc.

Yes, so long as the Zero order of business is to determine if they
make sense architecturally. :) I'd wonder whether keeping this more
encapsulated in the Posting class might be more flexible. We'll
still probably need the same routines, though.

> After that, we move on to what are currently called VInts and
> VLongs. (I'm contemplating renaming them C32 and C64 for "compressed
> 32/64-bit integer", since they are no longer the same as Lucene VInts/
> VLongs -- they're now BER compressed integers, as used by Perl's
> pack() function.) First, we'd like to see whether those functions are
> fully optimized under the current scheme.

Check. Presumably the internal Perl code is optimized. And does
SQLite use the same format? The code there is generally pretty and
likely well tuned.

> Second, data compression
> is the bugaboo for integrating mmap, and we can brainstorm
> alternative approaches while optimizing.

I don't think this is really a problem. As you mentioned in another
thread, we don't really need to be using bulk reads, and once we are
decompressing a single posting at a time I think there are elegant
solutions for this.

> > I'll send you a version tomorrow with such things included for you to
> > decide where you want to draw that line. It's out of sync with the
> > patch right now.
>
> I went ahead and committed the version you supplied as r2555. Thanks!

Sorry for never sending that. I've been distracted with other projects.

> Some mild mods followed in r2556 (<http://xrl.us/6uch>):
>
> * The "inline" keyword has been replaced with the recently introduced
> INLINE Charmonizer macro, which is empty if the compiler doesn't
> support inline functions.

Yes, good idea.

> * A sanity check was added at the top of winnow_anchors() to prevent
>   possible invalid pointer de-refs. Technically, this wasn't necessary
>   because the calling function cannot currently supply data which
>   triggers such a problem, but I was uneasy about the absence of a
>   local safety mechanism.

I think this might be a better case for an ASSERT, but I understand
the impulse.

> * The assignment of the iteration variable "i" was moved from the
> variable declaration at the top of PhraseScorer_calc_phrase_freq
> to the loop initiation. This will thwart problems if stuff gets
> moved around and i winds up with a new value prior to the loop.
> Again, not technically necessary, just defensive programming
> practice.

Good thing. It ended up the way it did because I converted it from
C99 where 'i' was both declared and initialized in the loop header.

> * winnow_anchors() now returns a u32_t rather than a size_t,
> because it's returning a count of u32_t rather than char and the
> C standard defines size_t as "A type to define sizes of strings
> and memory blocks."

OK. I went with this because we add the return value to a pointer,
and I thought that on a 64-bit system this might save a back and forth
conversion. One could either argue that it is being used to define
the size of a memory block, or that the standard's definition is not
exclusive of other uses.

> >> PS: Tabs suck.
> >
> > Oops. Have I been including them in things I send?
>
> In this patch at least. No big deal, I zapped 'em.

Hmm, I downloaded the patch I attached, and didn't find any tabs in
it. Either I've done something wrong twice, or maybe something else is
adding them.

Nathan Kurz
nate@verse.com

Re: fast phrase matching [patch] [ In reply to ]
On Thu, Sep 27, 2007 at 02:06:16PM -0600, Nathan Kurz wrote:
> > The classes we should focus on are InStream and OutStream. First
> > order of business is to study how the simple functions look in
> > assembly: read_byte, read_int, etc.
>
> Yes, so long as the Zero order of business is to determine if they
> make sense architecturally. :)

I like the sound of that. :)

>I'd wonder whether keeping this more
> encapsulated in the Posting class might be more flexibile. We'll
> still probably need the same routines, though.

Other things besides Postings need these routines: Lexicons, term vectors,
document storage, etc.

> > After that, we move on to what are currently called VInts and
> > VLongs. (I'm contemplating renaming them C32 and C64 for "compressed
> > 32/64-bit integer", since they are no longer the same as Lucene VInts/
> > VLongs -- they're now BER compressed integers, as used by Perl's
> > pack() function.) First, we'd like to see whether those functions are
> > fully optimized under the current scheme.
>
> Check. Presumably the internal Perl code is optimized.

That seems likely.

IIRC, what's in KS is very close to what's in pp_pack.c. I certainly studied
it.

There are a couple interesting differences between the Lucene VInt and the BER
compressed integer. You can tack "leading zeroes" onto the BER compressed
integer, but not the VInt. Also, there's an extra count to keep track of at
either read or write; VInt pays the penalty on read, BER pays the penalty on
write.
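To make that byte-level contrast concrete, here is the value 130 in
both layouts (helper names invented for illustration; both helpers
cover only the two-byte range 128-16383). VInt emits the low-order
group first with the high bit meaning "more bytes follow", while BER
emits the high-order group first with the high bit set on all but the
last byte -- which is why a 0x80 pad byte (seven zero bits plus
continuation) is a harmless "leading zero" for BER but nonsense for a
VInt.

```c
#include <assert.h>
#include <stdint.h>

/* Two-byte Lucene-style VInt for values in [128, 16383]:
 * low seven bits first, high bit = "another byte follows". */
void vint_encode2(uint32_t value, unsigned char out[2]) {
    out[0] = (unsigned char)((value & 0x7F) | 0x80);
    out[1] = (unsigned char)(value >> 7);
}

/* Two-byte BER compressed integer for the same range:
 * high seven bits first, high bit set on every byte but the last. */
void ber_encode2(uint32_t value, unsigned char out[2]) {
    out[0] = (unsigned char)((value >> 7) | 0x80);
    out[1] = (unsigned char)(value & 0x7F);
}
```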

> And does SQLite use the same format? The code there is generally pretty and
> likely well tuned.

Dunno about that. I've never spelunked the SQLite code base, and a web search
for 'sqlite compressed integer' didn't turn up anything of interest.

> > Second, data compression
> > is the bugaboo for integrating mmap, and we can brainstorm
> > alternative approaches while optimizing.
>
> I don't think this is really a problem. As you mentioned in another
> thread, we don't really need to be using bulk reads, and once we are
> decompressing a single posting at a time I think there are elegant
> solutions for this.

Cool. There ought to be room for improvement here. KS is double buffering
all input/output. The setvbuf thing is really weird, but I couldn't deny
the benchmarks. Maybe an 'objdump -S FSFileDes.o' will tell me something.

> > > I'll send you a version tomorrow with such things included for you to
> > > decide where you want to draw that line. It's out of sync with the
> > > patch right now.
> >
> > I went ahead and committed the version you supplied as r2555. Thanks!
>
> Sorry for never sending that. I've been distracted with other projects.

No worries, mate. Can you show me how you normally like your DEBUG and ASSERT
macros set up?
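(For concreteness, one common shape for such macros -- a hypothetical
sketch, not whatever KS ends up with -- compiles ASSERT away unless a
debug flag is defined, and gates DEBUG output on an environment
variable so it can be toggled without recompiling:)

```c
#include <stdio.h>
#include <stdlib.h>

/* ASSERT vanishes entirely unless the build defines KS_DEBUG. */
#ifdef KS_DEBUG
#define ASSERT(cond) \
    do { \
        if (!(cond)) { \
            fprintf(stderr, "ASSERT failed at %s:%d: %s\n", \
                    __FILE__, __LINE__, #cond); \
            abort(); \
        } \
    } while (0)
#else
#define ASSERT(cond) ((void)0)
#endif

/* DEBUG prints to stderr only when KINO_DEBUG is set in the
 * environment, at the cost of one getenv() per call. */
#define DEBUG(...) \
    do { \
        if (getenv("KINO_DEBUG") != NULL) \
            fprintf(stderr, __VA_ARGS__); \
    } while (0)
```

Note that the variadic DEBUG is itself a C99 feature; under the strict
C89/-pedantic regime discussed earlier it would need fixed-arity
variants (DEBUG1, DEBUG2, ...) instead.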

> > * A sanity check was added at the top of winnow_anchors() to prevent
> > possible invalid pointer de-refs. Technically, this wasn't
> > necessary because the calling function cannot currently supply data
> > which triggers such a problem, but I was uneasy about the
> > absence of a
> > local safety mechanism.
>
> I think this might be a better case for an ASSERT, but I understand
> the impulse.

My guess was that you'd had an ASSERT there but deleted it. However, I don't
think it's a good candidate for an ASSERT, because you need specifically
pathological data to trigger the problem.

> > * winnow_anchors() now returns a u32_t rather than a size_t,
> > because it's returning a count of u32_t rather than char and the
> > C standard defines size_t as "A type to define sizes of strings
> > and memory blocks."
>
> OK. I went with this because we add the return value to a pointer,
> and I thought that on a 64-bit system this might save a back and forth
> conversion.

OK, that's a good enough reason. I'll change it back.

> Hmm, I downloaded the patch I attached, and didn't find any tabs in
> it. Either I've done something wrong twice, or maybe something else is
> adding them.

Output of a search for tabs is below. I also find them with vim.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

teddy:~ marvin$ curl -o winnow_anchors.diff http://www.rectangular.com/pipermail/kinosearch/attachments/20070910/79f710e0/winnow_anchors-0001.obj
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 8914 100 8914 0 0 35623 0 --:--:-- --:--:-- --:--:-- 88862
teddy:~ marvin$ ack -a "\t" winnow_anchors.diff
--- c_src/KinoSearch/Search/PhraseScorer.c (revision 2526)
+++ c_src/KinoSearch/Search/PhraseScorer.c (working copy)
+ const u32_t *candidates, const u32_t *const candidates_end,
+ size_t offset)
+{
+ if (++candidates == candidates_end) goto DONE;
+ if (++anchors == anchors_end) goto DONE;
+ /* copy anchor positions skipping those less than initial offset */
+ if (phrase_offsets[0] > posting->prox[i]) anchors_end--;
+ /* subtract phrase_offsets[0] to account for leading virtual terms */
+ else *anchors++ = posting->prox[i] - phrase_offsets[0];
+ i++;
+ /* prepare the non-overwritten list of potential next terms */
+ /* reduce anchor_set in place to those that match the next term */
+ anchors_remaining = winnow_anchors(anchors_start, anchors_end,
+ candidates_start, candidates_end,
+ phrase_offsets[i]);
+ /* adjust end for number of anchors that remain */
+ anchors_end = anchors_start + anchors_remaining;
--- perl/t/502-phrasequery.t (revision 2526)
+++ perl/t/502-phrasequery.t (working copy)
teddy:~ marvin$


Re: fast phrase matching [patch] [ In reply to ]
On 9/28/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Thu, Sep 27, 2007 at 02:06:16PM -0600, Nathan Kurz wrote:
> >I'd wonder whether keeping this [varint routines] more
> > encapsulated in the Posting class might be more flexible. We'll
> > still probably need the same routines, though.
>
> Other things besides Postings need these routines: Lexicons, term vectors,
> document storage, etc.

Yes, I was being unclear. The varint routines would still exist
outside as library routines. I think it might make more sense for the
Posting class to get a pointer to raw data (representing one
compressed posting) rather than a stream to read from. This would both
allow for greater flexibility of data store (mmap, SQL database) and
greater efficiency (data copied directly to Scorer without
intermediary).

This last part is dependent on an earlier exchange we had:
http://www.gossamer-threads.com/lists/kinosearch/discuss/1099
In this scheme (which I still like) Posting_read does the
decompression directly from data-on-disk to Scorer-struct.

> > And does SQLite use the same format? The code there is generally pretty and
> > likely well tuned.
>
> Dunno about that. I've never spelunked the SQLite code base

In my opinion, the core SQLite code is some of the best open source
C-code out there. Also, squinted at from the right point of view it
solves a very similar problem. There are now full-text-search
extensions for it (http://www.sqlite.org/cvstrac/wiki?p=FtsTwo),
although it doesn't currently support Unicode or complex scoring.

SQLite also has a license that I quite admire:
The author disclaims copyright to this source code. In place of a
legal notice, here is a blessing:
** May you do good and not evil.
** May you find forgiveness for yourself and forgive others.
** May you share freely, never taking more than you give.

Its variable integer code is here:
http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/util.c

>The setvbuf thing is really weird, but I couldn't deny
> the benchmarks. Maybe an 'objdump -S FSFileDes.o' will tell me something.

I'm not familiar with setvbuf. I just read the man pages, and my
guess is that it is orthogonal to the system level caching. But I'm
uncertain.

> No worries, mate. Can you show me how you normally like your DEBUG and ASSERT
> macros set up?

I will clean things up and try to send a version late tonight.

> > OK. I went with this because we add the return value to a pointer,
> > and I thought that on a 64-bit system this might save a back and forth
> > conversion.
>
> OK, that's a good enough reason. I'll change it back.

I didn't test this, though. It's possible the compiler already
optimizes away any conversions, since it is an inline.

> > Hmm, I downloaded the patch I attached, and didn't find any tabs in
> > it. Either I've done something wrong twice, or maybe something else is
> > adding them.

It was me messing up twice. I tried again, and found the tabs just as
you said. Problem fixed for the future (I think).

Nathan Kurz
nate@verse.com

Re: fast phrase matching [patch] [ In reply to ]
On Fri, Sep 28, 2007 at 01:16:11PM -0600, Nathan Kurz wrote:
> Yes, I was being unclear. The varint routines would still exist
> outside as library routines. I think it might make more sense for the
> Posting class to get a pointer to raw data (representing one
> compressed posting) rather than stream to read from . This would both
> allow for greater flexibility of data store (mmap, SQL database) and
> greater efficiency (data copied directly to Scorer without
> intermediary).

OK, this is worth its own thread. Since we have one going called "The Life
Cycle of a Posting", I'll use that.

> > > And does SQLite use the same format? The code there is generally pretty and
> > > likely well tuned.
> >
> > Dunno about that. I've never spelunked the SQLite code base
>
> In my opinion, the core SQLite code is some of the best open source
> C-code out there. Also, squinted at from the right point of view it
> solves a very similar problem. There are now full-text-search
> extensions for it (http://www.sqlite.org/cvstrac/wiki?p=FtsTwo),
> although it doesn't currently support Unicode or complex scoring.

Haha, when I pigeonholed Dr. Hipp a year ago at OSCON, he mentioned that he
had studied the Lucene file format. At a glance, it looks like he's using the
same basic segmentation scheme used by Lucene and KS.

> Its variable integer code is here:
> http://www.sqlite.org/cvstrac/fileview?f=sqlite/src/util.c

Looks like he's using the BER format. There are a couple neat things
about his implementation. We'll deal with it in a specialized thread.

> >The setvbuf thing is really weird, but I couldn't deny
> > the benchmarks. Maybe an 'objdump -S FSFileDes.o' will tell me something.
>
> I'm not familiar with setvbuf. I just read the man pages, and my
> guess is that it is orthogonal to the system level caching. But I'm
> uncertain.

Well, here's the deal: fread(), fwrite() and friends are buffered i/o. But KS
InStream and OutStream objects maintain their own buffers. (These buffers are
1k each, but I've tried other sizes.) It should be possible to turn off the
buffering via setvbuf and still get all the classic benefits of buffered i/o.
But if I do that, KS starts suckin' pondwater (to quote my high school soccer
coach).
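For reference, the experiment in question looks roughly like this
(hypothetical helper, not KS's actual code):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Open a file for reading with stdio's internal buffer disabled, so
 * the only buffering left is whatever the caller does itself. */
FILE *open_unbuffered(const char *path) {
    FILE *fh = fopen(path, "rb");
    if (fh == NULL)
        return NULL;
    /* _IONBF: unbuffered -- each fread() turns into a read() call. */
    if (setvbuf(fh, NULL, _IONBF, 0) != 0) {
        fclose(fh);
        return NULL;
    }
    return fh;
}
```

In principle the 1k InStream-level buffer should then be doing all the
work; the surprise described above is that the benchmarks got much
worse anyway.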

It's been a while now, but I'm almost certain I tried this on three different
systems -- my G4 laptop, a Pentium 4 running FreeBSD 5.3, and a dual Xeon
running Red Hat 9 -- and KS sucked pondwater on all of 'em.

I'm tempted to try re-implementing InStream to use mmap exclusively. I think
that the i/o usage patterns of KS might make it a suitable candidate for that
treatment. But then your idea of allowing internal classes to bypass the
stream classes altogether is interesting.

One of the downsides of bypassing would be that any class that did that
wouldn't be able to use RAMFolder. RAMFolder's "files" are implemented as a
ragged array of ByteBuf objects, so you don't get a pointer to a contiguous
chunk of memory you can pass along. RAMFolder is mostly for testing, but I'd
be crying in my beer if all the KS tests had to use disk i/o.

Another thing about mmap: how well does it work on 32-bit systems when dealing
with large files (which are common with KS)?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: fast phrase matching [patch] [ In reply to ]
On 9/28/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> I'm tempted to try re-implementing InStream to use mmap exclusively. I think
> that the i/o usage patterns of KS might make it a suitable candidate for that
> treatment. But then your idea of allowing internal classes to bypass the
> stream classes altogether is interesting.

I think this is worth looking at. The major problem is going to be
how to fake mmap() for Windows systems where it does not exist. A
possible compromise is to keep the stream classes but change them to
return raw data in page-size-multiple chunks, with Windows
double-buffering and the Linux implementation doing nothing other
than returning a pointer into a mapped region.

> RAMFolder is mostly for testing, but I'd
> be crying in my beer if all the KS tests had to use disk i/o.

The goal is to get everything running as fast as (or faster than)
RAMFolder works now. There is not going to be any physical disk i/o
happening during testing other than that needed to get the data cached
by the system page buffer the first time it is read.

> Another thing about mmap: how well does it work on 32-bit systems when dealing
> with large files (which are common with KS)?

I don't think this is going to be a problem, although I haven't
thought it through in detail. The underlying implementation of
system read() is essentially mmap(), so we shouldn't hit any
fundamental problems. The total amount mapped at one time can't be
larger than the address space (< 4GB for 32-bit Linux), but I think we
can solve this by mapping and unmapping as necessary.

Nathan Kurz
nate@verse.com

Re: fast phrase matching [patch] [ In reply to ]
On Sun, Sep 30, 2007 at 03:11:15PM -0600, Nathan Kurz wrote:
> I think the major problem is going to be how to fake the mmap() for Windows
> systems where it does not exist.

Do a web search for 'windows mmap' -- it doesn't look like it's too hard to
fake up. This is the kind of thing Charmonizer is for.

> It's possible that a compromise is possible where we keep the
> stream classes, but change them to return raw data in
> page-size-multiple chunks, with Windows double-buffering and the Linux
> implementation doing nothing other returning a pointer into a mapped
> region.

The priority is to get InStream to use mmap. OutStream is less important, and
I'm not even thinking about it right now.

FWIW, other than going back and creating the compound file for each segment,
there is never any re-reading during the index creation. (Hmm. I think there
aren't even any seeks. It might be possible to eliminate OutStream_SSeek).
Also, all files are written once and never revised.

> > RAMFolder is mostly for testing, but I'd
> > be crying in my beer if all the KS tests had to use disk i/o.
>
> The goal is to get everything running as fast as (or faster than)
> RAMFolder works now. There is not going to be any physical disk i/o
> happening during testing other than that needed to get the data cached
> by the system page buffer the first time it is read.

The majority of tests don't hit disk *at all*. It's going to be hard to beat
that. :)

> > Another thing about mmap: how well does it work on 32-bit systems when dealing
> > with large files (which are common with KS)?
>
> I don't think this is going to be a problem, although I haven't
> thought it through in detail. The underlying implementation of
> system read() is essentially mmap(), so we shouldn't hit any
> fundamental problems. The total amount mapped at one time can't be
> larger than the address space (< 4GB for 32-bit Linux), but I think we
> can solve this by mapping and unmapping as necessary.

What I was thinking we'd try is just substituting an mmap/munmap pair for each
buffer refill we currently perform. Sounds like you and I are on the same
page. ;)
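Concretely, the mmap/munmap-pair-per-refill idea might be sketched
like this (struct and names invented for illustration; error handling
minimal). The one wrinkle is that mmap offsets must be page-aligned,
so a refill rounds down and remembers the slop:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical mmap-backed InStream: instead of read()ing the next
 * chunk into a private buffer, map a window over it and hand out a
 * pointer directly into the page cache. */
typedef struct {
    int    fd;
    char  *window;      /* currently mapped region, or NULL   */
    size_t window_len;  /* length of the mapped region        */
    char  *buf;         /* caller-visible start of valid data */
} MMapStream;

int mmapstream_refill(MMapStream *self, off_t offset, size_t len) {
    long   page    = sysconf(_SC_PAGESIZE);
    off_t  aligned = offset - (offset % page);  /* round down to a page */
    size_t slop    = (size_t)(offset - aligned);

    if (self->window != NULL)
        munmap(self->window, self->window_len); /* drop the old window */

    self->window_len = len + slop;
    self->window = mmap(NULL, self->window_len, PROT_READ, MAP_SHARED,
                        self->fd, aligned);
    if (self->window == MAP_FAILED) {
        self->window = NULL;
        return -1;
    }
    self->buf = self->window + slop;            /* skip alignment slop */
    return 0;
}
```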

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



