Mailing List Archive

Struggling with memory problem using Search Query
I'm trying to get Analog 6.0 to generate Search Query and Search Word
reports, but I'm running into trouble that I believe is memory-related.

I'm running Analog on a Dell PowerEdge 2450/600 with 512MB RAM. When I
analyze my monthly logs, with about 2.1 million log lines, I'm able to
use the extensive lists of SearchEngines.txt or SearchQuery.txt without
problems; the report is generated just as I expect. It takes about 16.5
minutes to generate.

However, when I try to do the same with a whole year's volume of log
entries (about 25 million lines) I get a segmentation fault. I was able
to prevent this by setting "REFLOWMEM 3". REFLOWMEM 2 still creates a
seg fault. But, REFLOWMEM 3 prevents any Search Query or Search Word
report.

My first question is: Am I completely screwed at this point, and nothing
short of adding more memory to the server will help?

I thought that maybe I could generate a Search report by not using the
entire lists available in SearchEngines.txt and SearchQuery.txt.
Instead, I'm trying to just look at the top ten search engines that
refer to my site. I started with Google. I entered this in my Analog
config file:

# Creating Search Query and Word reports here
REFARGSEXCLUDE * #Reject all ref arguments, to prevent seg fault with 12 months of data, then
REFARGSINCLUDE /search* #accept only the one for Google.

SEARCHENGINE http://*.google.com/* q,as_q,as_oq,as_epq,query
SEARCHENGINE http://*.google.co.*/* q,as_q,as_oq,as_epq,query
SEARCHENGINE http://*.google.com.*/* q,as_q,as_oq,as_epq,query

This didn't produce a seg fault (with the default REFLOWMEM 0 setting),
but also didn't produce a Search Query or Search Word report. An example
of some logs that I think should have made it into my reports include:

kevinz@cn2:/opt/analog/conf.d$ fgrep www.google.com/search /opt/analog/logdata/web1/access_log.20071231 |head -5
ABTS-NCR-Dynamic-013.35.163.122.airtelbroadband.in - - [31/Dec/2007:00:55:13 -0500] "GET /igwg/presentations/Monday/SubplenB/PromotionMale.pdf HTTP/1.1" 200 44424 "http://www.google.com/search?q=graduate+housewives+in+india&hl=en&rlz=1T4GGLJ_en-GBIN214IN214&start=20&sa=N" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
85.185.229.106 - - [31/Dec/2007:00:55:43 -0500] "GET /pubs/sp/20/20.pdf HTTP/1.0" 200 466095 "http://www.google.com/search?hl=fa&q=AIDS%2BPDF&btnG=%D8%AC%D8%B3%D8%AA%D8%AC%D9%88%D9%8A+Google&lr=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
66.249.85.131 - - [31/Dec/2007:00:57:48 -0500] "GET /asia/bangladesh/nsdp.shtml HTTP/1.1" 200 20061 "http://www.google.com/search?q=child+delivery+video&hl=en&start=70&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
c-98-204-115-120.hsd1.dc.comcast.net - - [31/Dec/2007:00:58:02 -0500] "GET /pubs/ HTTP/1.1" 200 30575 "http://www.google.com/search?q=jhccp&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
pool-71-182-79-153.ptldor.fios.verizon.net - - [31/Dec/2007:01:23:25 -0500] "GET /quality/expo.shtml HTTP/1.1" 200 10440 "http://www.google.com/search?hl=en&q=putting+quality+first&btnG=Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
kevinz@cn2:/opt/analog/conf.d$

Can anyone offer me any advice on what I can try to generate these
reports?

Thank you in advance for your help and suggestions.

-Kevin


Kevin Zembower
Internet Services Group manager
Center for Communication Programs
Bloomberg School of Public Health
Johns Hopkins University
111 Market Place, Suite 310
Baltimore, Maryland 21202
410-659-6139

+------------------------------------------------------------------------
| TO UNSUBSCRIBE from this list:
| http://lists.meer.net/mailman/listinfo/analog-help
|
| Analog Documentation: http://analog.cx/docs/Readme.html
| List archives: http://www.analog.cx/docs/mailing.html#listarchives
| Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
+------------------------------------------------------------------------
Re: Struggling with memory problem using Search Query
Zembower, Kevin <kzembowe@jhuccp.org> wrote:
>
> I thought that maybe I could generate a Search report by not using the
> entire lists available in SearchEngines.txt and SearchQuery.txt.
> Instead, I'm trying to just look at the top ten search engines that
> refer to my site. I started with Google. I entered this in my Analog
> config file:
>
> # Creating Search Query and Word reports here
> REFARGSEXCLUDE * #Reject all ref arguments, to prevent seg fault with 12 months of data, then
> REFARGSINCLUDE /search* #accept only the one for Google.

Why not just use REFINCLUDE *.google.*? Your problem is not that your list of search engines is too big; it's that your list of log entries is too big. By excluding every entry that wasn't referred by Google, you should be well able to report on just the Google search terms. If you're primarily interested in the Search Reports, you need to use the LOWMEM commands for everything _except_ the Referrers, since that's the information that you want.
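
As a rough sketch of what I mean (untested against your logs; the LOWMEM levels are my guess, and a lower level may well be enough), something like this in the config file keeps full detail for referrers while discarding the other item types:

```
# Keep only requests referred by Google
REFINCLUDE *.google.*

# Keep full referrer detail; the Search reports are built from it
REFLOWMEM 0

# Assumed levels: 3 drops the item type altogether, so any report
# based on hosts, browsers, users, virtual hosts or files goes too
HOSTLOWMEM 3
BROWLOWMEM 3
USERLOWMEM 3
VHOSTLOWMEM 3
FILELOWMEM 3
```

The docs describe what each LOWMEM level gives up, so check them before relying on level 3, particularly for files.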

Aengus

RE: Struggling with memory problem using Search Query
Aengus, thanks so much for your suggestion. Unfortunately, it doesn't
seem to work for me. I made these entries in my analog config file:
REFINCLUDE *.google.*,*.jhuccp.org*,*.yahoo.*

SEARCHENGINE http://*.google.com/* q,as_q,as_oq,as_epq,query
SEARCHENGINE http://*.google.co.*/* q,as_q,as_oq,as_epq,query
SEARCHENGINE http://*.google.com.*/* q,as_q,as_oq,as_epq,query

This generates this section in the '--settings' output:
Including (+) and excluding (-) the following referrers:
All excluded, then
+ *.google.*
+ *.jhuccp.org*
+ *.yahoo.*

This leads me to believe that it's working correctly. I wanted these
three referrers, because they make up most of my referrer report.
However, I get this output when I run analog:
../analog-6.0/analog: analog version 6.0/Unix
../analog-6.0/analog: Warning M: Logfile
/opt/analog/logdata/web1/access_log*
contains lines with no referrers, which are being filtered
(For help on all errors and warnings, see docs/errors.html)
../analog-6.0/analog: Warning M: Logfile
/opt/analog/logdata/db/ccp-apps2/ex*
contains lines with no referrers, which are being filtered
sh: line 20: 19024 Segmentation fault ../analog-6.0/analog +gweb1.analog.cfg

This warning troubles me, as it seems to indicate that log entries
without referrers are being dropped. I'll have trouble interpreting this
report, I think. Even if this report didn't seg fault, I'm not sure it
would be useful to me.

Unless I or anyone else on this list can think of another suggestion, I
think that I just have two options:
1) Add more memory to this host.
2) Generate a report with just the Search Query and Search Word
sections, and minimize or eliminate everything else. Any guesses if this
would work?

Thanks, again, for your help and suggestions.

-Kevin

-----Original Message-----
From: analog-help-bounces@lists.meer.net
[mailto:analog-help-bounces@lists.meer.net] On Behalf Of Aengus
Sent: Monday, January 07, 2008 2:34 PM
To: Support for analog web log analyzer
Subject: Re: [analog-help] Struggling with memory problem using Search
Query

Zembower, Kevin <kzembowe@jhuccp.org> wrote:
>
> I thought that maybe I could generate a Search report by not using the
> entire lists available in SearchEngines.txt and SearchQuery.txt.
> Instead, I'm trying to just look at the top ten search engines that
> refer to my site. I started with Google. I entered this in my Analog
> config file:
>
> # Creating Search Query and Word reports here
> REFARGSEXCLUDE * #Reject all ref arguments, to prevent seg fault with 12 months of data, then
> REFARGSINCLUDE /search* #accept only the one for Google.

Why not just use REFINCLUDE *.google.*? Your problem is not that your
list of search engines is too big, it's that your list of log entries is
too big. By excluding every entry that wasn't referred by Google, you
should be well able to report on just the Google Search terms. If you're
primarily interested in the Search Reports, you need to use the LOWMEM
commands for everything _except_ the Referrers - that's the information
that you want.

Aengus

Re: Struggling with memory problem using Search Query
Zembower, Kevin <kzembowe@jhuccp.org> wrote:
> Aengus, thanks so much for your suggestion. Unfortunately, it doesn't
> seem to work for me. I made these entries in my analog config file:
> REFINCLUDE *.google.*,*.jhuccp.org*,*.yahoo.*
>
> SEARCHENGINE http://*.google.com/* q,as_q,as_oq,as_epq,query
> SEARCHENGINE http://*.google.co.*/* q,as_q,as_oq,as_epq,query
> SEARCHENGINE http://*.google.com.*/* q,as_q,as_oq,as_epq,query
>
> This generates this section in the '--settings' output:
> Including (+) and excluding (-) the following referrers:
> All excluded, then
> + *.google.*
> + *.jhuccp.org*
> + *.yahoo.*
>
> This leads me to believe that it's working correctly. I wanted these
> three referrers, because they make up most of my referrer report.
> However, I get this output when I run analog:
> ../analog-6.0/analog: analog version 6.0/Unix
> ../analog-6.0/analog: Warning M: Logfile
> /opt/analog/logdata/web1/access_log*
> contains lines with no referrers, which are being filtered
> (For help on all errors and warnings, see docs/errors.html)
> ../analog-6.0/analog: Warning M: Logfile
> /opt/analog/logdata/db/ccp-apps2/ex*
> contains lines with no referrers, which are being filtered
> sh: line 20: 19024 Segmentation fault ../analog-6.0/analog +gweb1.analog.cfg
>
> This error message troubles me, as it seems to indicate that log
> entries without referrers were being dropped. I'll have trouble
> interpreting this report, I think. Even if this report didn't seg
> fault, I'm not sure it would be useful to me.

If you want to generate a Search Report, you might as well ignore lines without referrers, because the Search report information comes from the Referrer.

(As a general rule, the number of lines that have no referrer at all is usually pretty tiny. Under normal circumstances, only bookmarks and e-mail links generate log entries with no referrer. 90% of the rest of the referrers will usually be "internal": referrers from your own site.)

> Unless I or anyone else on this list can think of another suggestion,
> I think that I just have two options:
> 1) Add more memory to this host.

You can test this very easily: exclude a couple of months from the report and see whether the error goes away when you generate a smaller report.
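
For instance, assuming your logs cover calendar year 2007, a FROM/TO pair (dates in yymmdd form) would restrict a test run to the last quarter. The range here is only an illustration, so pick whatever subset suits you:

```
# Hypothetical test range: analyse October to December 2007 only
FROM 071001
TO 071231
```

If the shorter run completes and the full year doesn't, that points towards memory rather than the configuration.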

> 2) Generate a report with just the Search Query and Search Word
> sections, and minimize or eliminate everything else. Any guesses if
> this would work?

Sorry, I thought that you were already doing this. Yes, if you have too many logfiles to crunch in a single report, running separate reports may allow you to get the information you need. If just turning off the other reports doesn't work, try using the LOWMEM commands for everything except the referrers.
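
A minimal sketch of a search-only report, assuming the ALL command and the per-report ON/OFF commands behave the way I remember from the docs (I haven't run this against your setup):

```
# Turn every report off, then re-enable only the two search reports
ALL OFF
SEARCHQUERY ON
SEARCHWORD ON
```

Your existing SEARCHENGINE lines would stay as they are; this only changes which reports get written.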

Aengus
