Mailing List Archive: LogFile Formats

> This might be bad form to complain about this functionality this late in
> the game, but conceptually I have a hard time justifying the
> two-web-log-hits effect of error response redirects. I.e., when I access
> a protected area under a bogus username/password:
>
> fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 -
> fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703
>
> The problem is that the second one, when not in the context of the first,
> looks like a valid user "asdfsaf" accessed a page under authentication.
> I'd have to tell my scripts "no, no, toss out all accesses to 401.html
> before doing any user-based analysis".

This is not a bad thing, if it brings you closer to the truth.

> What do people think?

The new behaviour came about as a result of Rob H's fix. I *want* to see
a complete record of the results that redirects produce, so I do want to see
both entries...

However I think the current solution to logging is flawed 'cuz you don't
get told explicitly that the second log entry is as a result of the first.
Rob H and I have bickered about augmenting the logfile format so's it records
a unique identifier for each 'transaction'. A normal GET would result in:

fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456

where '123456' is the unique id.

A hit that generated a redirect would produce:

fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456
fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703 123456

and so we know that the two log entries are related.
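
As a rough sketch of how a stats script could use such an id, the Python below groups log lines by their last field. It assumes the id is always the final space-separated field, as in the examples above; the function name group_by_transaction is made up for illustration and is not part of any existing tool.

```python
from collections import defaultdict

def group_by_transaction(lines):
    """Group log lines by trailing transaction id (assumed to be the last field)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off the final field; everything before it is the usual entry.
        entry, _, txn_id = line.rpartition(" ")
        groups[txn_id].append(entry)
    return dict(groups)

SAMPLE = [
    'fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /Login/ HTTP/1.0" 401 - 123456',
    'fully - asdfsaf [19/Apr/1995:01:03:05 -0700] "GET /401.html" 200 703 123456',
]

if __name__ == "__main__":
    for txn_id, entries in group_by_transaction(SAMPLE).items():
        # The first entry in each group is the request the client actually made;
        # any later ones are the redirects it triggered.
        print(txn_id, len(entries), entries[0])
```

With the entries grouped like this, a user-based analysis can look only at the first entry of each group instead of special-casing 401.html.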

The drawback is that we now have a non-common log format, and that a lot of
existing log munging scripts will croak accordingly.

I'd like to propose that we do 3 things:

1) Log everything, absolutely everything, and nothing but everything.
As a rule of thumb, if the action results in some text being sent out
of the server then that transmission should be logged. Even if it's
a 204 No Content or whatever.

2) Use unique ids to tie related log entries together. The ids can just
be strings; I think that using the low-order bytes of the clock
is common practice. Either that or some guaranteed non-repeating
sequence.

[It needs to be a solution that works for the non-forking model too.]

3) Provide a support/apache2common script that sucks up Apache log
files and spits out Common Format logfiles. This means that
Joe.Webster's end-of-day stats programs can get something useful to
read.

[This would come running to the aid of Brian's "no, no, toss out all
accesses to 401.html before doing any user-based analysis" cries.]
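
For point 2, one possible way to build the ids is sketched below in Python. It combines the process id, the low-order bits of the clock and a per-process counter. The id layout and the name next_transaction_id are assumptions made for illustration, not an agreed format. The counter is what guarantees uniqueness within a single process, so the scheme does not rely on forking a fresh child per request.

```python
import itertools
import os
import threading
import time

_counter = itertools.count()
_lock = threading.Lock()

def next_transaction_id():
    """Return an id string of the (hypothetical) form pid-clockbits-counter."""
    # Low-order 24 bits of the microsecond clock, as suggested in point 2.
    low_clock = int(time.time() * 1_000_000) & 0xFFFFFF
    # The lock keeps the counter safe if several threads serve requests.
    with _lock:
        serial = next(_counter)
    return "%d-%06x-%d" % (os.getpid(), low_clock, serial)

if __name__ == "__main__":
    print(next_transaction_id())
    print(next_transaction_id())
```

The clock bits alone can repeat, which is why the counter is included. Ids could still collide across server restarts if a process id is reused at an unlucky moment, so treat this as a starting point only.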

The apache format logfile behaviour could be a .conf setting 'LogFileFormat'
with values either 'Common' or 'Apache'. As a further enhancement the
format of the logfile could be specified in a .conf file, a single line of
the form:

ApacheLogFileFormat HOST REMOTENAME USERID TIME ACCESS STATUS SIZE UNIQUE

This same entry could be read by support/apache2common when deciphering the
present state of the real logfile and converting it to the Common form.
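
A minimal sketch of what support/apache2common might do with that line, in Python. It assumes the TIME field is wrapped in [ ], the ACCESS field in double quotes, and every other field contains no spaces, as in the examples earlier in this message. The function names are invented for illustration.

```python
import re

# Field order of the Common Log Format, using the names from the line above.
COMMON_FIELDS = ["HOST", "REMOTENAME", "USERID", "TIME", "ACCESS", "STATUS", "SIZE"]

# A token is a [bracketed] run, a "quoted" run, or a run of non-space characters.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

def parse_format(directive):
    """Return the list of field names from an ApacheLogFileFormat line."""
    keyword, *fields = directive.split()
    if keyword != "ApacheLogFileFormat":
        raise ValueError("not an ApacheLogFileFormat line: %r" % directive)
    return fields

def to_common(line, fields):
    """Rewrite one log line into Common Format order, dropping extra fields."""
    tokens = TOKEN.findall(line)
    if len(tokens) != len(fields):
        raise ValueError("line does not match the configured format: %r" % line)
    record = dict(zip(fields, tokens))
    # Fields missing from the Apache format are written as "-".
    return " ".join(record.get(name, "-") for name in COMMON_FIELDS)

if __name__ == "__main__":
    fields = parse_format(
        "ApacheLogFileFormat HOST REMOTENAME USERID TIME ACCESS STATUS SIZE UNIQUE"
    )
    line = ('fully - asdfsaf [19/Apr/1995:01:03:05 -0700] '
            '"GET /Login/ HTTP/1.0" 401 - 123456')
    print(to_common(line, fields))
```

Because the converter is driven by the field list rather than by fixed positions, the same code copes with a format line that drops or reorders fields.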

This approach also lets you drop fields you don't care about, or add new ones,
CGI-VARS perhaps, if you're running your own stats programs. If these
field names become a standard then mebbies people will write better stats
programs that don't even need support/apache2common.

> Brian

Cheers,
Ay.

Andrew Wilson URL: http://www.cm.cf.ac.uk/User/Andrew.Wilson/
Elsevier Science, Oxford Office: +44 01865 843155 Mobile: +44 0589 616144