Mailing List Archive

Content-encoding: gzip
The last batch of fun upgrades included use of gzip compression on
cached pages where browsers accept it. I'm happy to report that this
seems to have cut the English-language Wikipedia's bandwidth usage by
roughly 25% per hit.

Data from http://www.wikipedia.org/stats/usage_200306.html

kilobytes / hits = kB/hit
2003-06-01 6392951 / 619534 = 10.319 \
2003-06-02 7908928 / 793065 = 9.973 |
2003-06-03 8267879 / 822025 = 10.058 | mean 10.1
2003-06-04 7513917 / 755482 = 9.946 | range 0.37
2003-06-05 7347843 / 723717 = 10.153 |
2003-06-06 6300476 / 614552 = 10.252 |
2003-06-07 5159151 / 503097 = 10.255 |
2003-06-08 5732741 / 566484 = 10.120 /
-- gzip cache activated: --
2003-06-09 5376987 / 726971 = 7.396 \
2003-06-10 5442685 / 732897 = 7.426 | mean 7.6
2003-06-11 5735325 / 765204 = 7.495 | range 0.85
2003-06-12 6362049 / 772002 = 8.241 /
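
(Going from a mean of about 10.1 kB/hit to about 7.6 kB/hit is a drop of
(10.1 - 7.6) / 10.1, i.e. roughly 25% fewer bytes sent per hit.)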

These counts include pages, images, css, everything. (But not the other
languages, mailing list, or database dump downloads.)

The bandwidth usage did go up a bit today, so it remains to be seen just
how stable the effect is. A number of things can affect it:

* Since caching, and thus the sending of gzipped cached pages, is
currently only done for anonymous users, an increase in activity by
registered users would tend to reduce the overall percentage of savings
(there's a sketch of the serving logic after this list)

* Lots of editing and loading of dynamic pages, which are not
compressed, would do the same

* A large increase in brief visits by newbies drawn in by a link or news
mention would increase the relative bandwidth used by linked items
(logo, style sheet, etc.), which are not additionally compressed

* Lots of work with new images might increase bandwidth usage.
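
For the curious, the serving decision boils down to roughly the
following (an illustrative sketch in Python, not the actual wiki code;
the parameter names and helpers here are made up):

    def respond(accept_encoding, is_anonymous, cached_gzipped, render_dynamic):
        # accept_encoding: the request's Accept-Encoding header ('' if absent)
        # cached_gzipped:  gzip-compressed bytes of the cached page, or None
        # render_dynamic:  callable returning the freshly rendered HTML string
        if (is_anonymous and cached_gzipped is not None
                and 'gzip' in accept_encoding):
            # Anonymous view, cache hit, browser takes gzip: send the
            # pre-compressed copy as-is.
            return cached_gzipped, {'Content-Encoding': 'gzip'}
        # Logged-in users, uncached pages, and non-gzip browsers get the
        # normal, uncompressed dynamic render.
        return render_dynamic().encode('utf-8'), {}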


Other thoughts:
- So far, no one's complained about being unable to read pages because
they were sent compressed when they shouldn't have been. (There was in
fact a bug to this effect which made the wiki unusable with Safari, but
I don't know if anyone but me noticed before I fixed it. :)

- Since gzipping is done only at cache time, this should use very little
CPU. IIRC gzip was one of the faster steps when I was test profiling
this ;) and the number of times gzipping is done should generally not
exceed the number of edits times some factor reflecting page
creation/deletion rates and the number of links per page. (More, of
course, when the cache has to be regenerated en masse.)
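
In sketch form (illustrative Python with made-up file names, not the
real code path), the compression happens exactly once, when the rendered
page is written into the cache:

    import gzip
    import os

    def save_to_cache(cache_dir, title, rendered_html):
        # Compress once at cache-save time; later anonymous views just
        # stream the stored bytes back out.  For now both a plain and a
        # gzipped copy are kept (see the next point).
        data = rendered_html.encode('utf-8')
        with open(os.path.join(cache_dir, title + '.html'), 'wb') as f:
            f.write(data)
        with open(os.path.join(cache_dir, title + '.html.gz'), 'wb') as f:
            f.write(gzip.compress(data))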

- The page cache presently stores both uncompressed and compressed
copies of each page, which is space-inefficient, though we're not
presently hurting for space on larousse. Someone suggested storing just
the compressed copies and, in the relatively rare case that a browser
won't accept gzipped pages, unzipping them on the fly.
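
That alternative would look roughly like this (again an illustrative
sketch, not real code):

    import gzip

    def serve_cached(cached_gzipped, accept_encoding):
        # Only the compressed copy is stored.  Most browsers take it
        # as-is; for the rare client without gzip support, decompress
        # on the fly.
        if 'gzip' in accept_encoding:
            return cached_gzipped, {'Content-Encoding': 'gzip'}
        return gzip.decompress(cached_gzipped), {}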

- We could, either by default or as a user option, compress dynamically
generated pages as well, which could shave some more percentage points
off the bandwidth usage. Might be a help for the modem folks who do log
in. :) However, I'm not sure how much this would affect CPU usage. In
any case there's no urgency; it's just something we might do if we have
the cycles to burn (I don't think we do just now, but we might one day).
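
If we ever try it, the compression itself is a one-liner, and zlib's
compression level gives a knob for trading CPU against bandwidth (once
more just an illustrative sketch, not a plan):

    import gzip

    def compress_dynamic(rendered_html, level=6):
        # Lower levels (1-3) cost noticeably less CPU per request and
        # still compress HTML fairly well; 9 squeezes hardest but costs
        # the most.
        return gzip.compress(rendered_html.encode('utf-8'), compresslevel=level)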

-- brion vibber (brion @ pobox.com)
Re: Content-encoding: gzip
On Thu, Jun 12, 2003 at 11:02:37PM -0700, Brion Vibber wrote:
>
> - We could, either by default or as a user option, compress dynamically
> generated pages as well, which could shave some more percentage points
> off the bandwidth usage. Might be a help for the modem folks who do log
> in. :) However, I'm not sure how much this would affect CPU usage. In
> any case there's no urgency; it's just something we might do if we have
> the cycles to burn (I don't think we do just now, but we might one day).
>
Modem people have hardware compression in their analog modems. ISDN
people might benefit, since they normally don't have link compression.

JeLuF