Mailing List Archive

"The chaching bug" - a mystery?
Less than two hours ago, I found the page
http://www.wikipedia.com/wiki/Sweden blank again. So I clicked Edit,
then Save, to restore its full contents. This has happened before.
See the page's History.

Is this "the caching bug"? How does it work? Wouldn't the obsolete
cached copy me destroyed the first time I did this? Why does the
blank page come back?

Is caching really necessary for performance now? When caching was
activated, I thought several other functions were disabled at the same
time. Do we know the impact on response times of not using caching?


--
Lars Aronsson
<lars@aronsson.se>
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: "The chaching bug" - a mystery? [ In reply to ]
On dim, 2002-05-12 at 17:38, Lars Aronsson wrote:
> Less than two hours ago, I found the page
> http://www.wikipedia.com/wiki/Sweden blank again. So I clicked Edit,
> then Save, to restore its full contents. This has happened before.
> See the page's History.
>
> Is this "the caching bug"?

Not as far as I know...

> How does it work? Wouldn't the obsolete
> cached copy me destroyed the first time I did this?

Sure should have been. Here's how the caching works:

* When a page is loaded, the contents of the cur_cache field are
checked. If it is not empty (!= ""), the cached HTML is dropped into the
output and various parsing/rendering steps are bypassed. Otherwise, the
page is rendered anew, shown, and stored in cur_cache for the next time
the page is shown.

* When a page is saved, the cur_cache field is cleared (set to ""), so
that the next time it's loaded (which should be immediately!) it will be
re-rendered.

* When a new page is created, the "unlinked" table is checked to see if
any other pages were linking to the previously-nonexistant page. If so,
those pages have their cur_cache cleared (set to ""), so that when next
viewed, they can show the link as being to an existing page. (This won't
happen currently, as the "unlinked" table is broken.)

"The caching bug" is due to the fact that, if caching is disabled, the
cur_cache field isn't touched at all, not even to clear caches when old
pages are saved: the software has no way of knowing the cache is old
(because there's no cache timestamp), so the old version is displayed as
if it were the current version of the page. Basically, it assumes that
caching will either be always off, or start off and turn on forever.
Turning it on, then turning it off for a while, then back on without
manually clearing the cache leaves bad cached pages.

This can easily be solved in the future by performing the cache-clears
whether caching is enabled or not. (I assume this was not done in the
first place so that the code would still work until the cur_cache field
was added to the database!)

The solution for our current problem is simply to clear all the cache
fields:
UPDATE cur SET cur_cache="",cur_timestamp=cur_timestamp

which can be done by someone with is_developer access (that would be
nobody, currently!) or by someone with direct access to the db server
(messieurs Wales and Richey). That will get rid of all the old cached
pages, and the _correct_ caches can start to fill up again. (Except that
unlinked links may stick around for newly created pages due to the
separate "unlinked" table problem.)

> Why does the blank page come back?

I have no idea! I haven't actually seen this problem yet, I've only seen
a couple notes that said roughly "page was blank, so I saved it again,
fine now."

Does the page _stay_ blank after reloading; ie does it consistently come
up blank until you resave it? Or does it come up blank once, but is okay
if you reload? Is it completely blank, or does it contain spaces or
something? Are there any error messages? Can you see anything suspicious
in the HTML source of the page?

If anyone with is_sysop status notices a page doing this, try doing a
direct SQL query before doing anything to the article:

SELECT * FROM cur WHERE cur_title="Title_of_the_page"

What's in the cur_cache field? Does anything else look out of place?
(Hmm, looks like special_asksql.php needs to slip in a couple of
<nowiki></nowiki> tags.)

> Is caching really necessary for performance now?

I wouldn't doubt that it's _helping_.

> When caching was
> activated, I thought several other functions were disabled at the same
> time. Do we know the impact on response times of not using caching?

Alas, yes and no in that order. Since the main limiting factor is still
the multiple/complex/slow database queries, not CPU time in PHP, caching
may be a smaller help than anything else we're doing; but I'd be quite
surprised if it's not helping at all. And what's the sense in throwing
away a performance gain if we've got one?

(Note that caching does cut down a bit on database queries, since links
don't need to be checked for existence.)

-- brion vibber (brion @ pobox.com)
Re: "The chaching bug" - a mystery? [ In reply to ]
Brion L. VIBBER wrote:
> Does the page _stay_ blank after reloading; ie does it consistently come
> up blank until you resave it?

Yes it stays blank. I can tell you how I discovered this. I
suspected that my response time monitoring script was seeing timeouts
instead of real pages when response times where in the 500+ seconds
range. So I added HTTP status code and document length to my log
file. I can tell you the Main Page is 18K in normal cases and 4K when
only the page header and footer are returned.

With times in GMT, the normal sequence is like this:

date time resp status url bytes
20020510 Fri 11:40:00 9.02 200 http://www.wikipedia.com/ 18459
20020510 Fri 11:50:00 1.13 200 http://www.wikipedia.com/ 18459

As you can see, the status code is 200 and response time very
reasonable (1.13 - 9.02 seconds). But then something happens at
12:00 GMT (5:20 am PDT) Friday:

20020510 Fri 12:00:00 0.90 200 http://www.wikipedia.com/ 3999
20020510 Fri 12:10:00 1.09 200 http://www.wikipedia.com/ 3999

It stays this way until restored at 14:10 GMT (7:10 am PDT) Friday:

20020510 Fri 13:50:00 7.80 200 http://www.wikipedia.com/ 3999
20020510 Fri 14:00:00 8.71 200 http://www.wikipedia.com/ 3999
20020510 Fri 14:10:00 1.27 200 http://www.wikipedia.com/ 18444
20020510 Fri 14:20:00 1.18 200 http://www.wikipedia.com/ 18444

The same thing happens at 13:20 GMT (6:20 am PDT) Saturday:

20020511 Sat 13:00:00 1.65 200 http://www.wikipedia.com/ 18443
20020511 Sat 13:10:00 1.12 200 http://www.wikipedia.com/ 18443
20020511 Sat 13:20:00 174.32 200 http://www.wikipedia.com/ 3998
20020511 Sat 13:30:00 3.31 200 http://www.wikipedia.com/ 3998

And is restored at 15:40 GMT (8:40 am PDT) Saturday:

20020511 Sat 15:20:00 0.59 200 http://www.wikipedia.com/ 3998
20020511 Sat 15:30:00 0.59 200 http://www.wikipedia.com/ 3998
20020511 Sat 15:40:00 7.68 200 http://www.wikipedia.com/ 18445
20020511 Sat 15:50:00 1.18 200 http://www.wikipedia.com/ 18445

And 15:30-16:30 GMT (8:30 am - 9:30 am PDT) Sunday:

20020512 Sun 15:00:00 13.99 200 http://www.wikipedia.com/ 18443
20020512 Sun 15:20:00 1.88 200 http://www.wikipedia.com/ 18443
20020512 Sun 15:30:00 0.59 200 http://www.wikipedia.com/ 3998
20020512 Sun 15:40:00 0.68 200 http://www.wikipedia.com/ 3998
[...]
20020512 Sun 16:10:00 0.61 200 http://www.wikipedia.com/ 3998
20020512 Sun 16:20:00 0.62 200 http://www.wikipedia.com/ 3998
20020512 Sun 16:30:00 4.79 200 http://www.wikipedia.com/ 18397
20020512 Sun 16:40:00 1.22 200 http://www.wikipedia.com/ 18289

Obviously, I didn't restore the Main Page since I don't have the
privileges to do this. And I have no idea who or what did this.

For the [[Sweden]] page some response time observations are so long or
timed out (zero length, no status code) that log entries appear in the
wrong order. The two restore times (15:50 GMT = 8:50 am PDT Saturday,
and 22:50 GMT = 3:50 pm PDT Sunday) correspond to the revision History
page (8:48 and 15:48), when I restored the page:

20020511 Sat 13:10:00 45.22 200 http://www.wikipedia.com/wiki/Sweden 13360
20020511 Sat 13:20:00 769.98 200 http://www.wikipedia.com/wiki/Sweden 13360
20020511 Sat 13:30:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0
20020511 Sat 13:40:00 2288.51 200 http://www.wikipedia.com/wiki/Sweden 4577
20020511 Sat 13:50:00 940.10 200 http://www.wikipedia.com/wiki/Sweden 4577
[...]
20020511 Sat 15:30:00 0.80 200 http://www.wikipedia.com/wiki/Sweden 4577
20020511 Sat 15:40:00 0.81 200 http://www.wikipedia.com/wiki/Sweden 4577
20020511 Sat 15:50:00 1.22 200 http://www.wikipedia.com/wiki/Sweden 12807
20020511 Sat 16:00:00 1.09 200 http://www.wikipedia.com/wiki/Sweden 12807
[...]
20020512 Sun 10:40:00 384.72 200 http://www.wikipedia.com/wiki/Sweden 12807
20020512 Sun 11:00:00 327.15 200 http://www.wikipedia.com/wiki/Sweden 12807
20020512 Sun 11:10:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0
20020512 Sun 11:20:00 2549.71 http://www.wikipedia.com/wiki/Sweden 0
20020512 Sun 11:30:00 2549.82 http://www.wikipedia.com/wiki/Sweden 0
20020512 Sun 12:10:00 481.58 200 http://www.wikipedia.com/wiki/Sweden 4576
20020512 Sun 11:40:00 2286.75 200 http://www.wikipedia.com/wiki/Sweden 4576
[...]
20020512 Sun 22:30:00 1.90 200 http://www.wikipedia.com/wiki/Sweden 4576
20020512 Sun 22:40:00 0.80 200 http://www.wikipedia.com/wiki/Sweden 4576
20020512 Sun 22:50:00 1.10 200 http://www.wikipedia.com/wiki/Sweden 12805
20020512 Sun 23:10:00 1.10 200 http://www.wikipedia.com/wiki/Sweden 12805



--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: "The chaching bug" - a mystery? [ In reply to ]
On lun, 2002-05-13 at 00:36, Lars Aronsson wrote:
> 20020511 Sat 13:20:00 769.98 200 http://www.wikipedia.com/wiki/Sweden 13360
> 20020511 Sat 13:30:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0
> 20020511 Sat 13:40:00 2288.51 200 http://www.wikipedia.com/wiki/Sweden 4577
[...]
> 20020512 Sun 11:00:00 327.15 200 http://www.wikipedia.com/wiki/Sweden 12807
> 20020512 Sun 11:10:00 2549.62 http://www.wikipedia.com/wiki/Sweden 0
> 20020512 Sun 11:20:00 2549.71 http://www.wikipedia.com/wiki/Sweden 0
> 20020512 Sun 11:30:00 2549.82 http://www.wikipedia.com/wiki/Sweden 0
> 20020512 Sun 12:10:00 481.58 200 http://www.wikipedia.com/wiki/Sweden 4576

Woo-hoo! Thanks for the detailed stats, I think I've got it figured out
now. The timeouts just before the blanking look VERY suspicious...

Here's what I think is going on:

* The main database query times out on a page load.

* wikiPage::load() doesn't check the return code from mysql_query()
(oops!) and ends up with nothing for, among other things, $this->cache
and $this->text.

* Since the cache is empty (not != ""), the text is re-rendered. Blank
text actually ends up rendering to "<p><p>\n<p>\n". (Is that what you
see if you check the HTML code of the blank pages?)

* The new rendered blankness gets saved in the cache. (For some reason,
the PHP process for the page hasn't been killed for being around too
long by this point.)

* Subsequent page views get shown with the pseudo-blank cache, since
it's not actually a null string (hence, != "")

So, it's not THE caching bug, but it's A caching bug.

Fix: always check return codes! I've added a quick check to
wikiPage::load() for the main query, so if it can't execute the query it
should spit out an error message and _not_ cache the result. All the
other queries really should be checked too, though.

-- brion vibber (brion @ pobox.com)