Mailing List Archive

Problem with page counts
I am trying to analyse pages from a large 'portal' site and am having real
problems with page counts; all my attempts with PAGEINCLUDE, TYPE and
FILEALIAS and other experiments have failed.

The site generates URLs similar to:
/bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m

It seems to be the periods in the query-string parameters that are causing
the problem, as the File Type report then lists things like:

reqs    %reqs   Gbytes  %bytes  extension
 7277   0.08%   0.18    0.32%   .s=tl"
12683   0.15%   0.11    0.20%   .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment"
 4485   0.05%   0.11    0.20%   .s=m"

Note the very low percentages: in effect each page is being counted as a
different file type.

So I've tried things like:

PAGEINCLUDE *.s*
PAGEINCLUDE *.t*

(with and without the trailing *).

I've also tried patterns like:

PAGEINCLUDE /home

But all attempts fail.

It looks like analog is parsing the URL from the right and taking everything
after the last period to be the file extension, rather than first dropping
everything from the query separator (?) onwards and then looking at the
suffix of what remains.

I've tried FILEALIAS as in:

FILEALIAS .s* .html
FILEALIAS .t* .html
FILEALIAS .l2* .html
FILEALIAS .l1* .html

That was another attempt at making analog treat these portal-generated URLs
as pages, but nothing seems to work. I may have to work on the basis of 'all
requests minus css/images/js files' = pages :-(

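(If I do end up falling back on that, I assume the blunt version in the
config file would be to make everything a page and then knock out the obvious
asset types; something along these lines, though I haven't checked the exact
wildcard/list syntax against the Readme:

PAGEINCLUDE *
# hypothetical catch-all: exclude the asset types that clearly aren't pages
PAGEEXCLUDE *.css,*.js,*.gif,*.jpg,*.png,*.ico

My understanding is that when a file matches more than one of these commands
the later one wins, so the excludes should override the catch-all include.)
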
Thanks for any ideas.../Iain
Re: Problem with page counts [ In reply to ]
2009/2/20 Iain Hunneybell <iain@ipmarketing.co.uk>:
> I am trying to analyse pages from a large 'portal' site and am having real
> problems with page counts and all attempts with PAGEINCLUDE, TYPE and
> FILEALIAS and other experiments fail.
>
> The site generates URLs similar to:
> /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
>
> It seems to be the period in the input vars that's causing the problem as
> the File Type report then lists things like:
>
> reqs    %reqs   Gbytes  %bytes  extension
>  7277   0.08%   0.18    0.32%   .s=tl"
> 12683   0.15%   0.11    0.20%   .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment"
>  4485   0.05%   0.11    0.20%   .s=m"
>
> Note the very low percentages as this is in effect counting page by page as
> a different file type.
>

I'm not seeing this. I just tried this experiment and I see this file
listed as [no extension] which is correct. What do they look like in
your raw logfiles? For example, is the question mark encoded as %3F,
which would be a literal question mark instead of an argument
separator?

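A quick way to check, once you can get at the raw files, is something like
this (the filename is just a placeholder for one of your logs):

grep -c '%3F' access.log

If that comes back non-zero, the question marks are being logged encoded, so
analog never sees an argument separator and everything after the last dot is,
quite reasonably, taken as the extension.
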
> So I've tried things like:
>
> PAGEINCLUDE *.s*
> PAGEINCLUDE *.t*
>
> (with and without the trailing *).
>
> I've also tried patterns like:
>
> PAGEINCLUDE /home
>
> But all attempts fail.
>

PAGEINCLUDE /bdotg/action/home

works for me. But if my hypothesis above is correct, you might need

PAGEINCLUDE /bdotg/action/home*

The PAGEINCLUDE has nothing to do with the file types by the way
(although it's typically used that way). You can make any single file
into a "page".

--
Stephen Turner
RE: Problem with page counts [ In reply to ]
Thanks for the quick reply :-)

Sadly I have no UNIX host to hand and these are multi-gigabyte files, so I
can't head/tail/grep them easily. Windows grep dies... I'll write something
to parse the files so I can have a real look at the records...

I've some other 'funnies', like 1.98% Netscape/4 browser usage (according to
the summary), but if I run something like the top 2000 browser sigs in the
full browser report I can't find a single reference to Netscape/4 (of course,
if I could simply grep the files... :-( ).

I'll try the PAGEINCLUDE you suggest, but something would seem to be going
wrong from the way parts of the query string are showing up in the File Type
report. Yes, I get [no extension] pages, but from a page tracking service I'm
expecting around 460K pages and I'm 'only' seeing 395K. And it's the long
tail of .s=m and similar 'files' which suggests some counting is going astray.

My thought was that I could 'mop these up' by defining each 'mis-read'
filetype as a page, but my various attempts have failed. I'm running
6.0/Win32, if that makes a difference?

Thanks again.../Iain



Re: Problem with page counts [ In reply to ]
2009/2/20 Iain Hunneybell <iain@ipmarketing.co.uk>:
>
> Sadly I have no UNIX host to hand and these are Gig files and so I can't
> head/tail/grep easily. Windows grep dies... I'll write something to parse
> the files so I can have a real look at the records...
>

Can you install Cygwin?

--
Stephen Turner
RE: Problem with page counts [ In reply to ]
I can certainly try.

Actually, this long tail only accounts for around 1%, so maybe I shouldn't
waste time on it now and should come back to it later.

I'll share anything I find as to why some records are spilling out like
this.

Thanks for your help.../Iain


RE: Problem with page counts [ In reply to ]
Well Cygwin is a big help, thanks... Only it now raises more questions!

One thing that struck me as odd is that analog is reporting quite high usage
of Netscape 4, which caused me to look further. Analog says:

5    172430  2.00%  Netscape
     170370  1.98%  Netscape/4
     167023  1.94%  Netscape/4.06
       3281  0.04%  Netscape/4.0
         41         Netscape/4.77
          3         Netscape/4.5
         16         Netscape/4.76
          2         Netscape/4.61
          1         Netscape/4.05
          3         Netscape/4.7
       1645  0.02%  Netscape/7
       1643  0.02%  Netscape/7.2
          2         Netscape/7.1
        414         Netscape/8
        371         Netscape/8.1
         43         Netscape/8.1.3

Most of it seems to be Netscape 4.06 which indeed would be old. So I tried:

grep 'Netscape/' *.log > netscape.log

I then used Excel to summarise netscape.log and came up with...

user-agent       Total
Netscape/7.1         2   [matches analog]
Netscape/7.2      1695   [analog says 1695]
Netscape/8.0.4       5   [missing from analog]
Netscape/8.1       387   [analog says 371]
Netscape/8.1.3      43   [matches analog]
Grand Total       2132   [way off, as analog sees lots of Netscape/4 traffic]

grep does not find any 'Netscape/4' strings at all. Note some counts
correspond: Netscape/8.1.3 is 43 under both counts, Netscape/7.1 is 2 under
both counts.

Is there some user-agent signature mapping going on within analog that is
treating some string(s) other than 'Netscape/4' as Netscape v4 user agents?
These figures will be used to derive browser compatibility tests, so I'll be
challenged on my Netscape 4 figures and want to be certain :-)

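(In hindsight a straight tally from the logs would probably have saved the
Excel step; something like the following, assuming Cygwin's GNU grep supports
-o and -h as usual:

grep -ho 'Netscape/[0-9.]*' *.log | sort | uniq -c | sort -rn

which should print each distinct Netscape/x.y string with its count.)
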
Thx.../Iain


RE: Problem with page counts [ In reply to ]
I am still struggling to reconcile page counts. Let me explain what I'm
seeing...

The summary says:
8.62M requests
268,910 pages
433,041 corrupt lines [which out of 8.6M lines is 5%]

Daily summary says:
268,910 pages [agrees with summary]

Hourly summary totals to:
268,910 pages [so again agrees]

Now, looking at the File Types report (and summarising it) I see:
7.9M css, images and js file requests [92% of requests]
395,722 [no extension] [4.6% of requests]
201,376 [directories] [2.3% of requests]
Then a long tail of these 'misunderstood' .s=tl and similar 'file types'

So the above 3 lines represent 98.9% of all requests, so our pages are
definitely in these numbers - or are some hiding in those 433K corrupt lines?
[Is there a way to have analog spit out the lines it sees as corrupt so I can
examine them?]

I would consider a 'no extension' or [directory] request to be equivalent to
a page, a URL similar to the one I mentioned before being a 'no extension'
example:
/bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m

And a directory presenting a default page (index.html for instance). But...

395,722 + 201,376 = 597,098, which doesn't match the 268,910 page figure
mentioned before. Also, the [directories] count is missing from the pie chart
while .gif and .jpg, with lower request volumes, are included?

As a further twist, the pages are tagged with a tracking bug (similar to
Google analytics) and this gives me a page count of 460,516 for the day, so
I can't get any of the page count data to match up (and I need to show that
the httpd log analysis ties in with the tracking service).

What I *have* shown is that the *shape* of the analog *requests* data nicely
corresponds with the tracking bug page view count [different scales, but it
shows I don't have time differentials shifting data], but the analog page
count data is way off the tracking service figure (whereas the request shape
is a very nice fit).

What I see is that in the early hours (00:00-08:00) analog's page views are
significantly higher than the bug's (8,000/hr vs. 2,000). Maybe this is
spidering going on, where pages are being read but the bug's js isn't being
run, so analog is giving a view of what's really going on.

But then from 08:00-23:25 the bug traffic levels are much higher than analog
shows (46K vs. 15K). Maybe this is proxies at work, where pages are being
re-served to clients which execute the bug script and so record the page
view, but where no request reaches the web site? Strangely, the analog
*requests* data closely matches the bug page view shape (and not the analog
page view shape), but maybe this is css and other widgets, many of which are
marked no-cache?

I have 304ISSUCCESS ON and so presume 304 responses will count towards the
page count? I have no STATUSINCLUDE defined and so presume all responses will
be counted by analog?

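(One check I'm considering, assuming I've understood the status-code commands
correctly from the docs, is a re-run that drops the conditional GETs:

# hypothetical check: how much of the page count comes from 304 responses?
STATUSEXCLUDE 304

and then comparing the page totals between the two runs.)
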
Understanding what's going on is very important as I am using this
information to work out capacity and headroom. Are we serving 46K or 15K
pages/hour? Hence if I scale up what's my max page serve rate?

Thanks for any insight into how analog is counting pages, why my [no
extension] and [directory] figures exceed my page view data, whether I may
be missing lots of pages in corrupt log lines etc.

Thx.../Iain



Re: Problem with page counts [ In reply to ]
Iain Hunneybell wrote:
>
> One thing which is odd is that analog is reporting quite high usage of
> Netscape 4 which seemed odd and so caused me to look further.

Maybe it helps to turn on browser reporting and compare the numbers; that
will display a list of the (top 50) browser identification strings found.
Netscape 4 is some kind of "Mozilla/4", BTW.

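a quick check against the raw logs, once Cygwin is in place, would be
something like (the filename is just an example)

grep -c 'Mozilla/4\.06' access.log

note that MSIE also calls itself Mozilla/4.x, but with a "compatible;
MSIE ..." token, which is presumably how analog tells them apart.
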
p.
Re: Problem with page counts [ In reply to ]
OK, there are lots of things here, but the first important thing to
say is that logfile analysis and page tagging will never match up.
They use fundamentally different techniques, and each makes errors
that the other is not susceptible to. For page views you would
normally expect to see the logfile analysis numbers lower, because
page tagging will see the page again if the visitor returns to it, but
logfile analysis won't.

You do have too many corrupt lines. If you turn debugging on, you will
see all the corrupt lines, and where in the line they were corrupt.

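For example, something like (I'm quoting the syntax from memory, so check the
Readme for the exact debugging options)

DEBUG ON

in the configuration file, and then run analog with its diagnostic output
captured, e.g.

analog 2> corrupt-lines.txt

adjusting for however you normally invoke it, and assuming the warnings and
debugging messages go to stderr as usual.
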
It looks like you have about 100,000 of these strange ".s=tl" lines,
right? Page tagging may be including them as pages, depending what
they really are and whether they are tagged, so it may be worth
tracking them down in the logfiles.

Sorry, no great insights, but at least that might give you some
avenues to look down.

--
Stephen Turner
RE: Problem with page counts [ In reply to ]
Well to answer my own question...

My Netscape/4.06 seems to be reported with a user-agent string of
'Mozilla/4.06', so there is a 'transcription' being done by analog... but its
results seem correct :-)

But just to show where log analysis can take you, looking at the requests I
see they come from private address space and so it seems something on an
internal network is generating these requests. Now the task is to find out
what and why!

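(If it turns out to be internal monitoring rather than real users, I assume I
can simply drop it from the analysis with host exclusions along these lines,
the patterns obviously being guesses at our internal ranges:

HOSTEXCLUDE 10.*
HOSTEXCLUDE 192.168.*

so that the browser and page figures reflect genuine visitors.)
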
As for page counts, my best rationalisation is that the high overnight count
is spiders, so analog is correctly showing page requests that aren't being
recorded by the page view 'bug'. Then during the day, proxies are causing the
page bug to record higher page views than are seen by the servers. It's the
best rationalisation I've come up with so far!

I've had analog dump out the log lines it sees as corrupt and they do indeed
seem to be truncated; they account for about 5% of the log lines, which seems
high. Now to understand why the servers would be doing this!

.../Iain



RE: Problem with page counts [ In reply to ]
Many thanks for this Stephen.

Yes, there do seem to be an unusually high number of corrupt log lines. The
problem seems to be truncation. I don't yet know (haven't had time to check)
whether they all terminate at a specific length... possibly so... so maybe
it's a server config issue, with long URIs causing the log lines to overflow
and be truncated, rendering them useless. So it's probably fair to guess that
a lot of these lines are page reads with long associated URIs that have been
truncated. Hence I'm losing page reads.

Re browser activity and caching: with '304ISSUCCESS ON' I presume a GET
request answered with a 304 will be counted as a page read? (Of course, that
only applies if the browser, or an intermediate proxy, actually sends the
request to the server at all...)

I've not yet got to the bottom of the 'mis-typed' URLs. I've grepped out some
of the 'file type' patterns, but looking at the results I see the matches are
in the referrer, not the target URL, so I need to do more to find the lines
analog is seeing as a specific file type.

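(My next attempt will be to match on the request field only rather than the
whole line. Assuming these are standard combined-format logs, where the
requested URL is the seventh whitespace-separated field, something like

awk '$7 ~ /\.s=tl/' *.log | head

should pull out only the lines where the pattern is in the target URL rather
than in the referrer.)
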
Thanks for your help.../Iain



Re: Problem with page counts [ In reply to ]
On 2/21/2009 1:27 PM, Iain Hunneybell wrote:
>
> As for page counts, my best rationalisation is that the high overnight count
> is spiders and so analog is correctly showing page requests that aren't
> being recorded by the page view 'bug'. Then over day proxies are causing the
> page bug to record higher page views than seen by the servers. It's the best
> rationalisation I've come up with so far!

You should be able to test that by using FROM and TO to do a log analysis for
an hour in the middle of the night, and then looking at the Full Browser
report. Most well-behaved spiders identify themselves.

You can also do a Full Browser report on requests for /robots.txt and
then use that to create a list of BROWEXCLUDE commands so that you can
see if the human-driven traffic patterns make more sense.

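Roughly something like this, with the dates as placeholders and the syntax
quoted from memory (check the Readme for the exact forms):

# overnight hour only, with the full browser breakdown
FROM 090221:0300
TO   090221:0359
FULLBROWSER ON

and for the robots.txt pass:

FILEINCLUDE /robots.txt
FULLBROWSER ON

then feed what you find back in as exclusions for the main run, e.g.

BROWEXCLUDE *Googlebot*
BROWEXCLUDE *Slurp*
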
Aengus
Re: Problem with page counts [ In reply to ]
2009/2/21 Iain Hunneybell <iain@ipmarketing.co.uk>:
>
> Re browser activiy and caching, with '304ISSUCCESS ON' I presume a GET
> request with a 304 will be counted as a page read? Of course, if the browser
> (or an intermediate proxy) doesn't return the request to the server...)
>

Yes, with 304ISSUCCESS, it will be treated just the same as a 200. So
it will be a page if that file is defined as a 'page'. But the
situation in your last sentence is common: often the page will be
displayed again without another request getting to your server.

--
Stephen Turner
RE: Problem with page counts [ In reply to ]
That's a very good idea :-)

Many thanks Aengus


RE: Problem with page counts [ In reply to ]
Thanks...this all helps me build a picture in my mind of what I'm really
seeing. You know, lies, damn lies and statistics :-)

It's a case of ensuring you actually understand the figures in front of
you.../Iain

