Mailing List Archive

General questions, analog, summary, large log sets, incremental
Hello, I am looking at Analog. I am a current summary.net user. Having
to pay for an update to get features I am not going to need, when what
I mostly need is new user agent strings, is getting tiresome.

My general questions are:

Looking at the analog demo stats, I am pretty sure my clients would
consider it a step down from summary, on visual appearance. Technical
merits would not be something they consider. I am pretty adept at css
and html, is it possible to "skin" or "theme" analog?

I have some older machines that run Mac OS 9 WebStar, logging in what
I believe is an extended common log format. Can analog handle most of
what I throw at it for log formats? For those that are out of the
ordinary, I was able, in Summary, to create custom log mappings; is
this possible?

Every Apache server we have logs all hits to one log, which is rolled
nightly. Summary used built in ftp to pull down only the new log
files. Is there a provision to get logs from remote machines, or will
I need to look at something like rsync to make this happen?

I am dealing with hundreds of GB of log data, an entire year's worth,
so we start small, and at the end of the year there is a lot. In
Summary, I used incremental processing on a 5 minute schedule, and it
was able to generate new reports for each client in around 10 minutes.
Will I be parsing the entire log set over and over again?

At the end of the year, I was able to "render" out the logs to html
files, so my clients can have access to their past data, will analog
be able to perform something similar?

Is there a way to get aggregate bandwidth on a per machine basis? I
also use Summary for getting rough ideas of the bandwidth used over
http for the entire machine, as well as granular down to each virtual
host.

Will I be able to pull out the individual virtual hosts, password
protect each report, even though I have log data that is on one single
access_log?

Are DNS lookups cached?

Maybe this could all be solved with a matrix that compared stats
reporting software in a chart, has anyone ever seen one of these that
includes analog?

Thank you for your time.
--
Scott * If you contact me off list replace talklists@ with scott@ *

+------------------------------------------------------------------------
| TO UNSUBSCRIBE from this list:
| http://lists.meer.net/mailman/listinfo/analog-help
|
| Analog Documentation: http://analog.cx/docs/Readme.html
| List archives: http://www.analog.cx/docs/mailing.html#listarchives
| Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------
Re: General questions, analog, summary, large log sets, incremental
Scott Haneda <talklists@newgeo.com> wrote:
> Hello, I am looking at Analog. I am a current summary.net user. Having
> to pay for an update to get features I am not going to need, when what
> I mostly need is new user agent strings, is getting tiresome.
>
> My general questions are:
>
> Looking at the analog demo stats, I am pretty sure my clients would
> consider it a step down from summary, on visual appearance. Technical
> merits would not be something they consider. I am pretty adept at css
> and html, is it possible to "skin" or "theme" analog?

Check ReportMagic - http://www.reportmagic.org/

> I have some older machines that run Mac OS 9 WebStar, logging in what
> I believe is an extended common log format. Can analog handle most of
> what I throw at it for log formats? For those that are out of the
> ordinary, I was able, in Summary, to create custom log mappings; is
> this possible?

Analog has extremely flexible support for custom logformats. Afaik, Webstar logformats are supported out of the box.

http://analog.cx/docs/logfmt.html

> Every Apache server we have logs all hits to one log, which is rolled
> nightly. Summary used built in ftp to pull down only the new log
> files. Is there a provision to get logs from remote machines, or will
> I need to look at something like rsync to make this happen?

Analog just analyzes the logfiles, it doesn't do any logfile "management", so you'd have to handle that yourself.

> I am dealing with hundreds of GB of log data, an entire year's worth,
> so we start small, and at the end of the year there is a lot. In
> Summary, I used incremental processing on a 5 minute schedule, and it
> was able to generate new reports for each client in around 10 minutes.
> Will I be parsing the entire log set over and over again?

Analog supports incremental processing. http://analog.cx/docs/cache.html

> At the end of the year, I was able to "render" out the logs to html
> files, so my clients can have access to their past data, will analog
> be able to perform something similar?

All Analog reports are static HTML reports - it's not a database that renders reports on the fly (though it can be configured to handle on the fly queries).

> Is there a way to get aggregate bandwidth on a per machine basis? I
> also use Summary for getting rough ideas of the bandwidth used over
> http for the entire machine, as well as granular down to each virtual
> host.

Yes, the Virtual Host Report can show you the amount of data handled by each host.

> Will I be able to pull out the individual virtual hosts, password
> protect each report, even though I have log data that is on one single
> access_log?

Yes.

> Are DNS lookups cached?

Yes. It is strongly recommended that you use a different tool to generate the DNS cache, though - Analog is not optimised for DNS lookups. http://analog.cx/docs/dns.html

Aengus

Re: General questions, analog, summary, large log sets, incremental
>> Every Apache server we have logs all hits to one log, which is rolled
>> nightly. Summary used built in ftp to pull down only the new log
>> files. Is there a provision to get logs from remote machines, or will
>> I need to look at something like rsync to make this happen?
>
> Analog just analyzes the logfiles, it doesn't do any logfile
> "management", so you'd have to handle that yourself.


What would be the best way to manage this, then? Consider a system
where there are Apache access_logs from 10 machines. There is an
11th machine that will run analog. I roll logs every 24 hours, which
means I could rsync the remote log directories of the 10 machines and
keep all the 24 hour log files up to date. However, the live log,
access_log, which is still in progress, needs to come over just before
analog runs. This, with incremental processing, gives the client what
appears to be near real time stats.

I could run rsync every 4 minutes, and have analog run every 5, but
this is a poor method, as times get out of sync, some logs are larger
than others, etc. I am going to assume analog is triggered by a
scheduler? If that is the case, could I write a simple script...

1) rsync the log files
2) when above reports success, run analog
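
The two steps above can be sketched as a small wrapper that the
scheduler calls instead of calling analog directly. This is only a
sketch under stated assumptions: the directory paths and config file
name are hypothetical placeholders, and the rsync options and analog's
`-G` / `+g` command line arguments should be checked against the
installed versions before relying on them.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical placeholders; none of these names come from the thread.
REMOTE_LOG_DIR = "/var/log/apache2/"
LOCAL_LOG_ROOT = "/srv/logs"
ANALOG_CONFIG = "/etc/analog/facility.cfg"

Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status (0 means success)."""
    return subprocess.run(list(argv), check=False).returncode


def rsync_command(host: str) -> List[str]:
    # -a preserves modification times, which lets rsync's default quick
    # check skip rolled logs that have not changed; -z compresses in transit.
    return ["rsync", "-az", f"{host}:{REMOTE_LOG_DIR}", f"{LOCAL_LOG_ROOT}/{host}/"]


def analog_command() -> List[str]:
    # Assumption: -G skips the default config file and +g names the one to read.
    return ["analog", "-G", f"+g{ANALOG_CONFIG}"]


def sync_then_analyze(hosts: Sequence[str], runner: Runner = run_command) -> bool:
    """Step 1: rsync every host. Step 2: run analog only if all syncs succeeded."""
    for host in hosts:
        if runner(rsync_command(host)) != 0:
            return False  # skip this cycle rather than report on partial logs
    return runner(analog_command()) == 0
```

A scheduler entry would call `sync_then_analyze([...])` with the list
of web servers. Because analog starts only after the last rsync has
finished, the two cannot drift apart the way separate 4 and 5 minute
schedules can.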

I believe, that Summary, is more or less doing just that, it just uses
ftp for the transport, and MDTM to determine the need to download an
ftp log again.

I understand analog is one of the most popular, though if it is not a
good fit for a large shared hosting environment, please let me know.
I have seen setups where logs are dropped into the virtual host's
client directory, and analog is set as an option to point to just that
user's files. I, however, prefer to parse out my entire facility's
worth of logs.

Thanks for the replies and help, most appreciated.
--
Scott * If you contact me off list replace talklists@ with scott@ *


Re: General questions, analog, summary, large log sets, incremental
On 10/4/2009 4:47 PM, Scott Haneda wrote:
>>> Every Apache server we have logs all hits to one log, which is rolled
>>> nightly. Summary used built in ftp to pull down only the new log
>>> files. Is there a provision to get logs from remote machines, or will
>>> I need to look at something like rsync to make this happen?
>>
>> Analog just analyzes the logfiles, it doesn't do any logfile
>> "management", so you'd have to handle that yourself.
>
> What would be the best way to manage this then.

How long is a piece of string? Different people will set it up in
different ways.

> Consider a system where
> there are Apache access_logs from 10 machines. There is an 11th
> machine that will do analog. I have a log rolling on 24 hours, which
> means, I could rsync the remote logs directories of the 10 machines and
> keep all 24 hour log files up to date. However, the live log,
> access_log, that is in progress, needs to come over just before analog
> runs. This, with incremental, gives the client, what appears to be near
> real time stats.

It really depends on the size of the logfiles. When the client is
looking for "real time" stats, are they just interested in the last
hour's worth of activity? Rather than having a machine churning away 24
hours a day generating "real time" charts that get over-written every 5
minutes, I'd be more inclined to use something like the Analog Form
interface to allow the user to generate the report "on demand".

> I could run rsync every 4 minutes, and have analog run every 5, but this
> is a poor method, as times get out of sync, some logs are larger than
> others etc. I am going to assume analog is triggered by scheduler?

You can trigger it by scheduler, or manually (through a CGI-type form, in
this case - http://analog.cx/docs/form.html)

> I understand analog is one of the most popular, though if it is not a
> good fit for a large shared hosting environment, please let me know. I
> have seen where logs are dropped into the virtual hosts client
> directory, and analog is set as an option to point to just that user's
> files. I however, prefer to parse out my entire facility's worth of logs.

Analog can generate reports for the whole facility from a set of
"combined" logs, or from a bunch of "per host" logs - it's simply a
matter of configuration. If you're going to allow the user to customize
their own reports, there's less chance of inadvertently giving them
access to someone else's log data if you generate separate logfiles, but
it's really just a matter of preference. (Your preference, and your
customers' - some customers pay scant attention, or only sporadic
attention, to their logs; others may spend a lot of time delving into them).

Analog is extremely flexible, and is often used in large hosted
environments. But there isn't one "right" way to deploy it - it really
does depend on what you want to achieve.

Aengus

Re: General questions, analog, summary, large log sets, incremental
Thanks! I think I need to just jump in and see how it works. Your
pointer to the form-based, on-demand method is good. That may come in
handy, as then I am only using CPU cycles as they are needed. A few
comments below...

On Oct 4, 2009, at 3:15 PM, Aengus wrote:

> On 10/4/2009 4:47 PM, Scott Haneda wrote:
>>>> Every Apache server we have logs all hits to one log, which is rolled
>>>> nightly. Summary used built in ftp to pull down only the new log
>>>> files. Is there a provision to get logs from remote machines, or will
>>>> I need to look at something like rsync to make this happen?
>>>
>>> Analog just analyzes the logfiles, it doesn't do any logfile
>>> "management", so you'd have to handle that yourself.
>>
>> What would be the best way to manage this then.
>
> How long is a piece of string? Different people will set it up in
> different ways.

Generally, my strings are pretty long, though I often end up cutting
them short, as I get tired of measuring them all the time :)

>> Consider a system where there are Apache access_logs from 10
>> machines. There is an 11th machine that will do analog. I have a
>> log rolling on 24 hours, which means, I could rsync the remote logs
>> directories of the 10 machines and keep all 24 hour log files up to
>> date. However, the live log, access_log, that is in progress,
>> needs to come over just before analog runs. This, with
>> incremental, gives the client, what appears to be near real time
>> stats.
>
> It really depends on the size of the logfiles. When the client is
> looking for "real time" stats, are they just interested in the last
> hour's worth of activity?

I am only familiar with Summary, and as I said, wanted to get away
from it. Not because it is doing anything wrong, but I strongly
believe that I should not have to pay for an update just to get new
user agent strings. Every time a new iPhone, or browser comes out,
Summary will not know about it. Yes, I get some new features as well,
but the user agent and a few other things are such moving targets,
this really needs to be a file that can be user maintained.

When I say "real time" I more mean, around a 5 minute delay, which is
how I was able to work this with Summary. I could turn on
incremental, and a 5 minute schedule would be able to parse all the
log data in very short time, well under the 5 minutes it would be
before the next schedule was due.

Now, if I had to reprocess the entire batch, and was near the end of
the year, as I keep a year's worth live at all times, that would take
about an hour. Keep in mind, this was on an older PowerPC
Power Mac with an 800 MHz CPU upgrade, so some pretty slow stuff.

> Rather than having a machine churning away 24 hours a day generating
> "real time" charts that get over-written every 5 minutes, I'd be
> more inclined to use something like the Analog Form interface to
> allow the user to generate the report "on demand".

In the case of Summary, it was a relatively small CPU spike for a very
short time, but I do thank you for pointing me to this on demand form
interface, it seems in theory, to be a much smarter way to deal with
this. As you mention below, not all my clients even use the stats,
but they all have the ability, so those that do not need them, I may
as well not process those.

>> I could run rsync every 4 minutes, and have analog run every 5, but
>> this is a poor method, as times get out of sync, some logs are
>> larger than others etc. I am going to assume analog is triggered
>> by scheduler?
>
> You can trigger it by scheduler, or manually (though a cgi-type
> form, in this case - http://analog.cx/docs/form.html)

Perfect, thank you. I think the last time I used Analog was to parse
out mail server logs, it supported an obscure email server out of the
box as well. I am pretty sure I can configure analog to get to where
I need to be.

>> I understand analog is one of the most popular, though if it is not
>> a good fit for a large shared hosting environment, please let me
>> know. I have seen where logs are dropped into the virtual hosts
>> client directory, and analog is set as an option to point to just
>> that user's files. I however, prefer to parse out my entire
>> facility's worth of logs.
>
> Analog can generate reports for the whole facility from a set of
> "combined" logs, or from a bunch of "per host" logs - it's simply a
> matter of configuration. If you're going to allow the user to
> customize their own reports, there's less chance of inadvertently
> giving them access to someone else's log data if you generate
> separate logfiles, but it's really just a matter of preference.

I generally lock off a lot of reports, as otherwise I will spend too
much time on tech support, explaining to users what each one means.
They see "error" or "hijacking" and think there is something wrong on
my end, so it is best to give them just what they need to be able to
detect big mistakes, and not too much, as it will be a burden on our
support team to answer basic questions.

I do need to make sure there is no pollution of one client's log data
into another's. I log the host request header (virtual host name), so
as long as I can limit the report by that, I should be fine. I will
have to find out how to lock that setting from being changed by the
user; otherwise, if they were to guess a host name on the machine, they
could gain a lot of sensitive data.

I will start with the install, and a test log set, and see where I can
get with the docs. You have answered my basic set of questions, which
tell me analog will work for my needs, it just depends on how long a
string I am willing to maintain.

> Analog is extremely flexible, and is often used in large hosted
> environments. But there isn't one "right" way to deploy it - it
> really does depend on what you want to achieve.

Thank you, time to spend some time in the docs.
I again thank you for your time especially over a weekend.
--
Scott * If you contact me off list replace talklists@ with scott@ *

Re: General questions, analog, summary, large log sets, incremental
On 10/4/2009 7:34 PM, Scott Haneda wrote:

> I am only familiar with Summary, and as I said, wanted to get away from
> it. Not because it is doing anything wrong, but I strongly believe that
> I should not have to pay for an update just to get new user agent
> strings. Every time a new iPhone, or browser comes out, Summary will
> not know about it. Yes, I get some new features as well, but the user
> agent and a few other things are such moving targets, this really needs
> to be a file that can be user maintained.

Then I should point out that Analog's basic handling of user agent
strings is hard coded, and adding completely new strings may require
recompiling, though in many cases a simple BROWALIAS command will
suffice. The only real discussion on this topic has been about adding
Windows Vista and Windows 7 to the Operating System report, rather than
dealing with fundamentally new browser agent strings.

> I do need to make sure there is no pollution of one client's log data to
> another's. I log the host request header, (virtual host name) so as long
> as I can limit the report by that, I should be fine. I will have to
> find out how to lock out that setting from being changed by the user, or
> if they were to guess a host name on the machine, they could gain a lot
> of sensitive data.

VHOSTINCLUDE is the command that tells Analog which virtual host to
report on. I'm not sure whether it can be specified in the form, or if
it can be excluded by adding it to the forbidden set in anlgform.pl, but
that's where to look when you're testing this.
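
One way to sidestep the question, sketched here as an alternative to
exposing the virtual host setting in the form, is to generate a fixed
configuration file per client and never accept a host name from the
user. The sketch assumes Analog's LOGFILE and OUTFILE commands
alongside VHOSTINCLUDE; the paths, the host name, and the helper name
are hypothetical, and the log format must actually record the virtual
host for the filter to have anything to match.

```python
import re

# Conservative pattern for host names; anything else is rejected.
_HOSTNAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?")


def vhost_config(vhost: str, logfile: str, report_root: str) -> str:
    """Return Analog config text that reports on a single virtual host."""
    if not _HOSTNAME.fullmatch(vhost):
        # Reject spaces, newlines, and wildcards so a crafted name cannot
        # widen the filter or smuggle in extra config commands.
        raise ValueError(f"not a plain host name: {vhost!r}")
    lines = [
        f"LOGFILE {logfile}",
        f"VHOSTINCLUDE {vhost}",
        f"OUTFILE {report_root}/{vhost}/index.html",
    ]
    return "\n".join(lines) + "\n"
```

Each generated report then lives in its own directory, which the web
server can password protect separately, and the host name comes from
the hosting provider's own client list rather than from anything the
client types.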

Aengus
