Mailing List Archive

namelookups and databases
HaHa! <-(PeeWee Herman laugh)

I've been attempting to shove my log data into Postgres and am
coming to a sobering realization. It has taken 9 hours to process
15,000 requests..... As I am in the process of discovering,
much of that time is spent doing gethostbyaddr() for each entry.
A subsequent reload without doing lookups is on track to be done
in 2 hours. This rate would obviously create a serious backlog
on some sites if the server were directly connected to the database.

I suppose that my gethostbyaddr() results are not being cached
by the local nameserver. One way to improve this may be to create
that cache in my perl program. Any other ideas on how to improve
this? I am beginning to question the value of this data....
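
For the cache, something along these lines is roughly what I have
in mind (just a sketch; the sub and hash names are made up, and it
assumes perl5 with the Socket module):

    use Socket;                      # for inet_aton() and AF_INET

    my %hostcache;                   # dotted-quad IP -> hostname

    sub lookup_host {
        my ($ip) = @_;
        return $hostcache{$ip} if defined $hostcache{$ip};
        my $name = gethostbyaddr(inet_aton($ip), AF_INET);
        # remember failures too (as the bare IP) so the resolver is
        # never asked twice about the same address in one run
        $hostcache{$ip} = defined($name) ? $name : $ip;
        return $hostcache{$ip};
    }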

-Randy
Re: namelookups and databases
> HaHa! <-(PeeWee Herman laugh)
>
> I've been attempting to shove my log data into Postgres and am
> coming to a sobering realization. It has taken 9 hours to process
> 15,000 requests.....

What are you running it on, an 8086 PC ?

There's something really wrong if it is taking that long.
Is it reindexing on each entry ?.. that'd be a braindead approach.

> As I am in the process of discovering,
> much of that time is spent doing gethostbyaddr() for each entry.
> A subsequent reload without doing lookups is on track to be done
> in 2 hours.

That's still really slow.

It shouldn't take more than a couple of minutes of perl time to
swallow a *100,000* request access log and be ready to do some neat
tricks with it.

What exactly are you doing with the data ?

rob
--
http://nqcd.lanl.gov/~hartill/
Re: namelookups and databases
On Tue, 25 Jul 1995, Rob Hartill wrote:
> > I've been attempting to shove my log data into Postgres and am
> > coming to a sobering realization. It has taken 9 hours to process
> > 15,000 requests.....
>
> What are you running it on, an 8086 PC ?
>
> There's something really wrong if it is taking that long.
> Is it reindexing on each entry ?.. that'd be a braindead approach.
>
> > As I am in the process of discovering,
> > much of that time is spent doing gethostbyaddr() for each entry.
^^^^^^^^^^^^^^^
> What exactly are you doing with the data ?

gethostbyaddr can be *really* slow on sites not on the 155mbps backbone
like yours, Rob ;)

If it's taking that long, and your web site isn't heavily loaded, why not
just do a gethostbyaddr at the time of the request (like the default) or
immediately *after* a response is sent?

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/
Re: namelookups and databases
> > > As I am in the process of discovering,
> > > much of that time is spent doing gethostbyaddr() for each entry.
> ^^^^^^^^^^^^^^^
> > What exactly are you doing with the data ?
>
> gethostbyaddr can be *really* slow on sites not on the 155mbps backbone
> like yours, Rob ;)

Yeah, but he said that it was going to take 2 hours to process the
15,000 requests without the gethostbyaddr... that's a loooong time.
Re: namelookups and databases
On Tue, 25 Jul 1995, Randy Terbush wrote:

> I've been attempting to shove my log data into Postgres and am
> coming to a sobering realization. It has taken 9 hours to process
> 15,000 requests.....

That's a bit sad. I'm currently working on a web stats package that
stores the data in an mSQL database (including number of hits/bytes per
page per date, number of hits from each domain per page per date etc).
From this the stats front end can work through the historical data and
create whatever report is required.

Shoving the data into the database isn't a big problem. I just parse CLF
logs keeping track of the date/time of the last logged access and go from
there. Works pretty well (eg -HUP the server at 00:01am and ram
yesterday's log down the loading programme). This is still
work-in-progress but it is looking like a good bit of code.
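
For what it's worth, pulling one of those CLF lines apart is just a
regexp along these lines (a rough sketch; the variable names are
only illustrative):

    # CLF: host ident authuser [date] "request" status bytes
    # assumes the current log line is in $_
    if (/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)/) {
        my ($host, $ident, $user, $when, $request, $status, $bytes) =
            ($1, $2, $3, $4, $5, $6, $7);
        $bytes = 0 if $bytes eq '-';   # CLF logs '-' when nothing was sent
    }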


__ David J. Hughes - Bambi@Bond.edu.au
/ \ / / / http://Bond.edu.au/People/bambi
/___/ __ _ ____/ / / _
/ \ / \ / \ / / / / / \ / Senior Network Programmer, Bond University
\___/ \__// / \__/ \__/ / / / Qld. 4229 AUSTRALIA (+61 75 951450)
Re: namelookups and databases
>
> > HaHa! <-(PeeWee Herman laugh)
> >
> > I've been attempting to shove my log data into Postgres and am
> > coming to a sobering realization. It has taken 9 hours to process
> > 15,000 requests.....
>
> What are you running it on, an 8086 PC ?

With a Z80 chip....

:-) Seriously, 486/66.

> There's something really wrong if it is taking that long.
> Is it reindexing on each entry ?.. that'd be a braindead approach.

Agreed. I need to get a bit more intimate with the API for
Postgres (assuming I stick with it). I am reading a line from
the log, formatting it, doing a name lookup, and sending an
INSERT query to the database. It would be nice to figure out
how to lock the database, shove all of the INSERTS in, and
re-index.
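
One route I may try (only a rough sketch; the staging file name and
the parse_line/lookup_host subs are made-up stand-ins) is to have
perl do nothing but emit tab-separated rows, so a single bulk load
into Postgres can swallow the whole file instead of paying for an
INSERT per request:

    # append one tab-separated row per request to a staging file;
    # one bulk load (whatever import facility Postgres offers) then
    # picks the whole file up in a single pass
    open(STAGE, ">>/tmp/access.tab") || die "open: $!";
    while (<LOG>) {
        my @row = parse_line($_);        # stand-in for the formatting step
        next unless @row;
        $row[0] = lookup_host($row[0]);  # stand-in for the name lookup
        print STAGE join("\t", @row), "\n";
    }
    close(STAGE);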

As I mentioned in my earlier mail, eliminating the lookup reduced
it to 1 hour (not 2). I have not determined whether gethostbyaddr()
results get cached by the local nameserver or not. Anyone know?
If not, I would be wise to include a simple cache in my perl
program.

> It shouldn't take more than a couple of minutes of perl time to
> swallow a *100,000* request access log and be ready to do some neat
> tricks with it.

Reading in the log file is not the issue. Looking up the names is
the real bottleneck. As Brian suggests in a later reply, having
Apache do the lookup might be wise, but I am trying to minimize
the load on the HTTP server, anticipating more traffic.

RST says he is most CPU poor... NOT! My server is running on
a Sparc 1+.... This is the one that is currently handling about
10,000/day.

> What exactly are you doing with the data ?

I want to create an accounting system that can be easily queried
for bytes transferred, with specifics for servername and URL. I
also want a more space-efficient way of storing this data. I can't
even imagine what a site that is getting 100,000 requests per day,
let alone 500,000, is doing with this data. Log to /dev/null?

-Randy
Re: namelookups and databases
> If it's taking that long, and your web site isn't heavily loaded, why not
> just do a gethostbyaddr at the time of the request (like the default) or
> immediately *after* a response is sent?
> Brian

It isn't heavily loaded now. I would like to come up with a solution
that I (and others) can grow into. As I pointed out in earlier mail,
I am on a Sparc 1+. It doesn't take much to make it heavily loaded...

I would like to eventually be running a 500,000/day server (I think).
I can't imagine at this stage how to handle that and would like to
make it easier before getting there...
Re: namelookups and databases
>
>
> On Tue, 25 Jul 1995, Randy Terbush wrote:
>
> > I've been attempting to shove my log data into Postgres and am
> > coming to a sobering realization. It has taken 9 hours to process
> > 15,000 requests.....
>
> That's a bit sad. I'm currently working on a web stats package that
> stores the data in an mSQL database (including number of hits/bytes per
> page per date, number of hits from each domain per page per date etc).
> From this the stats front end can work through the historical data and
> create whatever report is required.
>
> Shoving the data into the database isn't a big problem. I just parse CLF
> logs keeping track of the date/time of the last logged access and go from
> there. Works pretty well (eg -HUP the server at 00:01am and ram
> yesterday's log down the loading programme). This is still
> work-in-progress but it is looking like a good bit of code.
>

David, after my experience with Postgres, I am planning to bring
up mSQL and give it a go on the same task. Knowing you are working on
the same thing, perhaps we should pull together...

Two things I have liked about Postgres that I did not find
in mSQL are no need for table keys, and unlimited (or large)
row lengths. I know you planned to implement the latter.

BTW - I am logging a nonstandard format file. I have changed
the date to be an integer+microseconds, which could easily be
used as the key. I have also changed the log to be TAB-separated
to be more easily digestible for the parsing program.
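
(Pulling a record back apart is then trivial, roughly like this,
with the field order here only an example:)

    # one tab-separated record per line; field order is illustrative
    chomp(my $line = <LOG>);
    my ($time, $host, $url, $status, $bytes) = split(/\t/, $line);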

-Randy
Re: namelookups and databases
On Tue, 25 Jul 1995, Randy Terbush wrote:

> David, after my experience with Postgres, I am planning to bring
> up mSQL and give it a go on the same task. Knowing you are working on
> the same thing, perhaps we should pull together...


Sounds good to me. You can see a _very_ early version of the code with
not much data loaded (only a few days in Jun 1995) at

http://Bond.edu.au/webstats/


> Two things I have liked about Postgres that I did not find
> in mSQL are no need for table keys, and unlimited (or large)
> row lengths. I know you planned to implement the latter.

You don't have to have a key on a table in mSQL (but it does help). In
this situation a key isn't that good an idea anyway. Something like
"give me the number of hits per page for every page on 2 July 1995"
doesn't gain anything from having a primary key. The only primary key
would be the combination of the date and the URL, which doesn't help if
you do a "select URL,hits from data where date = 'some date'".

As for mega length fields and rows, I'm about to start on a major redesign
of mSQL that will take it up into the realm of the bigger boys (including
multiple keys, secondary index support, faster data handling,
lightweight transactions, etc).


> BTW - I am logging a nonstandard format file. I have changed
> the date to be an integer+microseconds, which could easily be
> used as the key. I have also changed the log to be TAB-separated
> to be more easily digestible for the parsing program.

OK, mine uses CLF and only logs hits to html pages (doesn't log inline
gif images, redirects, errors etc).
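
(The filtering is nothing fancy; roughly something like this, with
the patterns and status test only an example:)

    # skip inline images, redirects and errors before loading
    next if $url =~ /\.(gif|jpg|xbm)$/i;
    next if $status >= 300;              # 3xx redirects, 4xx/5xx errors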

I'll be working on it more over the next week or so. When I have
something of use I'll get back to the list.



Bambi
...
Re: namelookups and databases
> > There's something really wrong if it is taking that long.
> > Is it reindexing on each entry ?.. that'd be a braindead approach.
>
> Agreed. I need to get a bit more intimate with the API for
> Postgres (assuming I stick with it). I am reading a line from
> the log, formating it, doing a name lookup, and sending an
> INSERT query to the database. It would be nice to figure out
> how to lock the database, shove all of the INSERTS in, and
> re-index.

Look to see if there's an "import" option. Simple PC databases have this,
but I don't know about Postgres. An import would presumably swallow
a plain text file of your making and reindex more efficiently (you'd hope).

> As I mentioned in my earlier mail, eliminating the lookup reduced
> it to 1 hour (not 2). I have not determined whether gethostbyaddr()
> results get cached by the local nameserver or not. Anyone know?
> If not, I would be wise to include a simple cache in my perl
> program.

You could also run a cron job throughout the day to build yourself a
perl database of IPs and names.
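
Something as simple as a DBM file that the cron job keeps topped up
would do (a rough sketch; the file name and @new_ips are only
placeholders):

    use Socket;                          # for inet_aton() and AF_INET

    # the cron job resolves addresses it hasn't seen before
    dbmopen(%HOSTS, "/var/tmp/hostnames", 0644) || die "dbmopen: $!";
    foreach my $ip (@new_ips) {
        next if defined $HOSTS{$ip};
        my $name = gethostbyaddr(inet_aton($ip), AF_INET);
        $HOSTS{$ip} = $name if defined $name;
    }
    dbmclose(%HOSTS);
    # the loading script opens the same DBM and falls back to the
    # raw IP for anything not resolved yet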

> > It shouldn't take more than a couple of minutes of perl time to
> > swallow a *100,000* request access log and be ready to do some neat
> > tricks with it.
>
> Reading in the log file is not the issue. Looking up the names is
> the real bottleneck. As Brian suggests in a later reply, having
> Apache do the lookup might be wise, but I am trying to minimize
> the load on the HTTP server, anticipating more traffic.

I still find an hour to read ~15,000 lines into a database to be
slow.

> > What exactly are you doing with the data ?
>
> I want to create an accounting system that can be easily queried
> for bytes transferred, with specifics for servername and URL. I
> also want a more space-efficient way of storing this data. I can't
> even imagine what a site that is getting 100,000 requests per day,
> let alone 500,000, is doing with this data. Log to /dev/null?

The ~100,000 requests at Cardiff are read, and a sorted list of the
most frequent client addresses is created (prodigy, AOL...), plus a
list of the most requested URLs (QUERY string stripped).
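
(Which in perl boils down to counting into a couple of hashes and
sorting; roughly, with made-up hash names:)

    # count requests per client address and per URL (QUERY string stripped)
    $hits_by_host{$host}++;
    (my $page = $url) =~ s/\?.*//;       # drop any query string
    $hits_by_url{$page}++;

    # once the whole log has been read, most frequent first
    my @top_hosts = sort { $hits_by_host{$b} <=> $hits_by_host{$a} }
                         keys %hits_by_host;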

The addresses are added to a big list of known addresses.

Counts for new addresses, all addresses, and requests are logged to
a plain file. Graphs are requested via URLs to show these three.
See http://arachnid.cm.cf.ac.uk/htbin/Graphs/show_stats


The raw logs are then gzipped and when disk space drops low, the older
ones get deleted.

rob
--
http://nqcd.lanl.gov/~hartill/
Re: namelookups and databases
Last time, Randy Terbush uttered the following other thing:
>
> > What exactly are you doing with the data ?
>
> I want to create an accounting system that can be easily queried
> for bytes transferred, with specifics for servername and URL. I
> also want a more space-efficient way of storing this data. I can't
> even imagine what a site that is getting 100,000 requests per day,
> let alone 500,000, is doing with this data. Log to /dev/null?

Well, I'm not sure how many requests www.ncsa is doing now, but this
is what some professors here at the UofI have been doing with the data
(all access logs are available to research institutions which can
handle it). Isn't mass store grand?

http://www-pablo.cs.uiuc.edu/Papers/WWW.ps.Z

And, for virtual reality stills from the CAVE:

http://www-pablo.cs.uiuc.edu/Projects/Mosaic/mosaic.html

Brandon


--
Brandon Long (N9WUC) "I think, therefore, I am confused." -- RAW
Computer Engineering Run Linux '95. It's that Easy.
University of Illinois blong@uiuc.edu http://www.uiuc.edu/ph/www/blong
Don't worry, these aren't even my views.