I am still struggling to reconcile page counts. Let me explain what I'm
seeing...
The summary says:
8.62M requests
268,910 pages
433,041 corrupt lines [which out of 8.6M lines is 5%]
Daily summary says:
268,910 pages [agrees with summary]
Hourly summary totals to:
268,910 pages [so again agrees]
Now, looking at the File Types report (and summarising it) I see:
7.9M css, images and js file requests [92% of requests]
395,722 [no extension] [4.6% of requests]
201,376 [directories] [2.3% of requests]
Then a long tail of these 'misunderstood' .s=tl and similar 'file
types'
The three lines above account for 98.9% of all requests, so our pages must
be among these numbers, unless some are hiding in those 433K corrupt
lines. [Is there a way to have analog spit out the lines it sees as corrupt
so I can examine them?]
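One way to look at candidate corrupt lines independently of analog is to run the raw log through a strict pattern and keep whatever fails. The following is only a rough sketch: it assumes the server writes the common Apache/NCSA "combined" layout, the sample lines are made up for illustration, and the lines it rejects will not necessarily be the same ones analog rejects.

```python
import re

# Assumed "combined" log layout; adjust if the server's LogFormat differs.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)(?: (?P<proto>HTTP/\d\.\d))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def split_log(lines):
    """Return (parsed, rejected): field dicts, and raw lines that did not fit."""
    parsed, rejected = [], []
    for line in lines:
        match = COMBINED.match(line)
        if match:
            parsed.append(match.groupdict())
        else:
            rejected.append(line.rstrip("\n"))
    return parsed, rejected

# Hypothetical sample lines, not taken from the real logfile.
sample = [
    '10.0.0.1 - - [19/Feb/2009:08:00:01 +0000] '
    '"GET /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
    'this line is not a log entry',
]

parsed, rejected = split_log(sample)
print(len(parsed), "parsed;", len(rejected), "rejected")
for line in rejected:
    print("REJECTED:", line)
```

On a real logfile, `split_log(open(path))` would do the same job, and a glance at the first few hundred rejected lines should show whether pages are being lost there.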
I would consider a [no extension] or [directories] request to be equivalent
to a page. A 'no extension' page would be a URL similar to the one I
mentioned before:
/bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
and a directory request would present a default page (index.html, for
instance). But 395,722 + 201,376 = 597,098, which doesn't match the 268,910
page figure mentioned before. Also, the [directories] count is missing from
the pie chart, while .gif and .jpg, with lower request volumes, are included?
As a further twist, the pages are tagged with a tracking bug (similar to
Google Analytics), and this gives me a page count of 460,516 for the day. So
I can't get any of the page count data to match up, and I need to show that
the httpd log analysis ties in with the tracking service.
What I *have* shown is that the *shape* of the analog *requests* data
corresponds nicely with the tracking bug page view count [different scales,
but it shows I don't have time differentials shifting the data]. The analog
*page* count data, by contrast, is way off the tracking service figure.
What I see is that in the early hours (00:00-08:00) analog's page views are
significantly higher than the bug's (8,000/hr vs. 2,000). Maybe this is
spidering, where pages are being read but the bug's js script isn't being
run, in which case analog is giving a view of what's really going on.
But then 08:00-23:25 the bug traffic levels are much higher than analog
shows (46K vs. 15K). Maybe this is proxies at work where pages are being
re-served to clients which execute the bug script and so record the page
view, but where no request reaches the web site? Strangely the analog
*requests* data closely matches the bug page views shape (and not the analog
page view shape), but maybe this is css and other widgets many of which are
marked no-cache?
I have 304ISSUCCESS ON and so presume 304 responses will count towards page
count? I have no STATUSINCLUDE defined and so presume all responses will be
counted by analog?
Understanding what's going on is very important as I am using this
information to work out capacity and headroom. Are we serving 46K or 15K
pages/hour? Hence if I scale up what's my max page serve rate?
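For the capacity question it may help to see what each page figure implies in requests per page and pages per second. This is only a back-of-envelope sketch using the numbers quoted in this message (8.62M is rounded, and the 15K and 46K hourly rates are the daytime figures above); it says nothing about which figure is right.

```python
# Figures quoted in this message; 8.62M is rounded, so results are approximate.
requests_per_day = 8_620_000
analog_pages_per_day = 268_910
bug_pages_per_day = 460_516

def requests_per_page(requests, pages):
    """Average number of logged requests behind each counted page."""
    return requests / pages

def per_second(per_hour):
    """Convert an hourly rate to a per-second rate."""
    return per_hour / 3600

for label, pages in [("analog", analog_pages_per_day), ("bug", bug_pages_per_day)]:
    ratio = requests_per_page(requests_per_day, pages)
    print(f"{label}: {ratio:.1f} requests per page")

for label, hourly in [("analog", 15_000), ("bug", 46_000)]:
    print(f"{label}: {per_second(hourly):.1f} pages/second at the daytime rate")
```

If the number of css, image and js requests per page is roughly known from the page templates, comparing it with the two requests-per-page ratios is one rough way to judge which page count is more plausible.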
Thanks for any insight into how analog is counting pages, why my [no
extension] and [directory] figures exceed my page view data, whether I may
be missing lots of pages in corrupt log lines etc.
Thx.../Iain
-----Original Message-----
From: analog-help-bounces@lists.meer.net
[mailto:analog-help-bounces@lists.meer.net] On Behalf Of Stephen Turner
Sent: 20 February 2009 19:47
To: analog-help
Subject: Re: [analog-help] Problem with page counts
2009/2/20 Iain Hunneybell <iain@ipmarketing.co.uk>:
> I am trying to analyse pages from a large 'portal' site and am having
> real problems with page counts and all attempts with PAGEINCLUDE, TYPE
> and FILEALIAS and other experiments fail.
>
> The site generates URLs similar to:
> /bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m
>
> It seems to be the period in the input vars that's causing the problem
> as the File Type report then lists things like:
>
>  reqs  %reqs  Gbytes  %bytes  extension
>  7277  0.08%    0.18   0.32%  .s=tl"
> 12683  0.15%    0.11   0.20%  .t=CAMPAIGN&furlname=selfassessment&furlparam=selfassessment"
>  4485  0.05%    0.11   0.20%  .s=m"
>
> Note the very low percentages as this is in effect counting page by
> page as a different file type.
>
I'm not seeing this. I just tried this experiment and I see this file listed
as [no extension] which is correct. What do they look like in your raw
logfiles? For example, is the question mark encoded as %3F, which would be a
literal question mark instead of an argument separator?
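The effect of an encoded question mark can be illustrated with Python's standard URL parser. This is a sketch of the hypothesis only: `naive_extension` is a simplified stand-in written for this example, not analog's actual rule for deciding file types.

```python
from urllib.parse import urlsplit

def naive_extension(request_target):
    """Text after the last '.' in the final path segment, or '[no extension]'.

    Simplified stand-in for illustration; not analog's real algorithm.
    """
    path = urlsplit(request_target).path
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return "[no extension]"
    return "." + last_segment.rsplit(".", 1)[-1]

# Same URL, once with a literal '?' and once with it encoded as %3F.
plain = "/bdotg/action/home?r.l1=1078549133&r.lc=en&r.s=m"
encoded = "/bdotg/action/home%3Fr.l1=1078549133&r.lc=en&r.s=m"

print(naive_extension(plain))    # [no extension]
print(naive_extension(encoded))  # .s=m
```

With a literal `?` the arguments are split off before the extension is looked for; with `%3F` they stay part of the file name, and the last period in `r.s=m` then looks like the start of an extension.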
> So I've tried things like:
>
> PAGEINCLUDE *.s*
> PAGEINCLUDE *.t*
>
> (with and without the trailing *).
>
> I've also tried patterns like:
>
> PAGEINCLUDE /home
>
> But all attempts fail.
>
PAGEINCLUDE /bdotg/action/home
works for me. But if my hypothesis above is correct, you might need
PAGEINCLUDE /bdotg/action/home*
The PAGEINCLUDE has nothing to do with the file types by the way (although
it's typically used that way). You can make any single file into a "page".
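The difference between the two patterns can be sketched with Python's shell-style matcher. Note the assumptions: `fnmatchcase` is only a stand-in for analog's own wildcard matching, and the sketch assumes the pattern is compared against the whole file name with any real query string already removed.

```python
from fnmatch import fnmatchcase

# File names as they might be seen after the arguments are split off.
normal_name = "/bdotg/action/home"
# What the name would look like if '?' were logged as %3F (hypothesis above).
encoded_name = "/bdotg/action/home%3Fr.l1=1078549133&r.lc=en&r.s=m"

patterns = ["/bdotg/action/home", "/bdotg/action/home*", "/home"]

for pattern in patterns:
    for name in (normal_name, encoded_name):
        result = fnmatchcase(name, pattern)
        print(f"{pattern!r:24} vs {name[:30]!r:34} -> {result}")
```

Under these assumptions the exact pattern matches only the normal name, the pattern with the trailing `*` matches both, and `/home` on its own matches neither, because it does not cover the leading `/bdotg/action` part.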
--
Stephen Turner
+------------------------------------------------------------------------
| TO UNSUBSCRIBE from this list:
| http://lists.meer.net/mailman/listinfo/analog-help
|
| Analog Documentation: http://analog.cx/docs/Readme.html
| List archives: http://www.analog.cx/docs/mailing.html#listarchives
| Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------