Mailing List Archive

Purging. Now what?
Now that I can trigger a purge when a customer presses "save" in our CMS, the
next step is trying to do it somewhat smarter...

Purging everything from all hosts in the cache is simple via telnet, but a bit
brutish. It could get noticeable as well, with maybe 50 customers saving
through the day..

Purging one url at a time is more precise, but then I have to keep track of
what to purge. Finding all urls in a site is not very efficient, and 95% of
those would not be in the cache anyway.

I could build a small daemon to tail the access logs, and keep a running
buffer of recently accessed pages. Then I could easily prefetch pages as
well, after purging them. But this does not feel quite right either...
Sort of building a shadow copy of Varnish's timeout mechanism.
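
To make it concrete, the daemon I have in mind would be something like this
(a rough, untested Python sketch; the log path and log format are just
placeholders for whatever the access log actually looks like):

import re
import time
from collections import deque

LOG_PATH = "/var/log/varnish/access.log"   # placeholder path
URL_RE = re.compile(r'"GET (\S+) HTTP')    # naive common-log-format match

recent = deque(maxlen=10000)               # ring buffer of recent URLs

def follow(path):
    # yield lines as they are appended to the file, like tail -f
    f = open(path)
    f.seek(0, 2)
    while True:
        line = f.readline()
        if line:
            yield line
        else:
            time.sleep(0.5)

for line in follow(LOG_PATH):
    m = URL_RE.search(line)
    if m and m.group(1) not in recent:
        recent.append(m.group(1))
    # on "save", purge (and maybe re-fetch with wget) every URL in
    # `recent` that belongs to the customer's site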

Any good suggestions?

Regards
Gaute Amundsen
Purging. Now what? [ In reply to ]
Hi Gaute,

and thanks for the script examples in earlier posts.

With regards to your question about purging I must admit that I am
unsure what you are trying to achieve. Could we be talking corner
case here?
I will try to explain why and how purging is used in general, so
please don't be offended if I state the obvious, can't see your
challenge, or if what I am saying is too trivial.

Purging is not used to control content expiration; the HTTP headers
like Expires and max-age etc. do that. Purging is used as an "oh-shit-
have-to-delete-now" mechanism. Let's say you have a default cache
time for your article/page of 5 min., but 5 min. is too long to wait
for an update if there is important stuff that needs to replace the
content in the cache. That's when you would use purging to "force" an
update on that/those page(s). If it's only one page, go directly for
that page (no reg-exp); if there is more than one you have to keep
control over what pages need to be refreshed or use a URL schema so
that a reg-exp purge will delete them.
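
For a single page that is just one HTTP request. A minimal Python sketch
(untested; it assumes your Varnish is configured to accept PURGE requests
from this host, and the hostname and path are made up):

import httplib   # Python 2 stdlib; http.client in Python 3

def purge(host, path, varnish="localhost", port=80):
    # ask the cache to drop one exact URL; the Host header selects
    # which site the path belongs to
    conn = httplib.HTTPConnection(varnish, port)
    conn.request("PURGE", path, None, {"Host": host})
    status = conn.getresponse().status
    conn.close()
    return status

purge("www.example.com", "/news/article-42.html")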

Does this answer your question? Please explain a bit deeper, with
examples, if this does not.

Anders Berg
Sys.adm
VG Nett // www.vg.no


Purging. Now what? [ In reply to ]
I am well aware of what purging is intended for.
I guess at VG the journalists don't mind if they post a story and it takes 5
minutes to appear on the frontpage? Or, more likely, you can predict what will
change, and purge that.

Thing is, we have a whole bunch of CMS sites on different domains, and the
customers are not all so tech-savvy, so telling them "sometimes it may take 5
minutes" would only create confusion.
Furthermore, the sites all differ enough that it is not simple to predict
what pages will be affected by an update. However, few of them would have
very many pages in the cache at any time, so flushing them all is not a
problem. Flushing all the pages of all the customers every time one of them
saves the least little thing IS a problem, since that would happen often
enough during "prime time" to make our sites appear unpredictable and
fickle.

Is that any clearer?

In technical terms, as I tried to ask before, I need to:
1 ) be able to purge by domain on the console
2 ) purge by pattern match via HTTP PURGE
Or if neither of those is possible
3 ) keep track of the urls in the cache, with or without the help of varnish,
so that I can purge them by HTTP PURGE.
Being able to prefetch would be the payoff for the hassle of the last
alternative I guess :)

I know the docs say that 2 can't be done.
An authoritative confirmation that 1 can't be done either would be helpful, as
that would let me concentrate on 3.

A way to list what is being cached at any one time would be really helpful,
both in implementing 3 of course, but also in getting the headers configured
right for all the different pieces of code that live on our servers.

The information is in the logs, I know, I only find it a bit cumbersome to
work with. If I end up building some small log-tailers to assist me in this,
would that be in line with the intentions of the architecture do you think?

Gaute.

Purging. Now what? [ In reply to ]
On 11/14/06, Gaute Amundsen <gaute at pht.no> wrote:
> 3 ) keep track of the urls in the cache, with or without the help of varnish,
> so that I can purge them by HTTP PURGE.

I don't see why you would need to keep track of what is cached. If a
page has changed, purge it. It doesn't matter if it is in the cache or
not. If it's not, then nothing gets purged. :-)
This should be easily done with an HTTP call from the site system which
calls a purge for the changed page, its parent, and maybe the
frontpage or something (obviously dependent on your site structure).
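
Roughly like this, say (an untested Python sketch; the hostnames and paths
are made up, and what counts as "related pages" obviously depends on your
site):

import httplib   # Python 2 stdlib
import posixpath

def purge(host, path, varnish="localhost", port=80):
    conn = httplib.HTTPConnection(varnish, port)
    conn.request("PURGE", path, None, {"Host": host})
    conn.getresponse()
    conn.close()

def purge_changed(host, path):
    # the changed page itself, its parent, and the frontpage
    parent = posixpath.dirname(path.rstrip("/")) or "/"
    for p in (path, parent, "/"):
        purge(host, p)

purge_changed("www.example.com", "/section/page.html")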

--
Lennart Regebro, Nuxeo http://www.nuxeo.com/
CPS Content Management http://www.nuxeo.org/
Purging. Now what? [ In reply to ]
On Tuesday 14 November 2006 10:07, Lennart Regebro wrote:
> On 11/14/06, Gaute Amundsen <gaute at pht.no> wrote:
> > 3 ) keep track of the urls in the cache, with or without the help of
> > varnish, so that I can purge them by HTTP PURGE.
>
> I don't see why you would need to keep track of what is cached. If a
> page has changed, purge it. It doesn't matter if it is in the cache or
> not. If it's not, then nothing gets purged. :-)
> This should be easily done with an HTTP call from the site system which
> calls a purge for the changed page, its parent, and maybe the
> frontpage or something (obviously dependent on your site structure).

Mostly because, as I said, many of our sites are large, and it is hard to
predict what pages any one change will affect, and many of our pages do not
even exist as recognizable objects, just as URLs.
( If you had ever worked with Zope and its concept of acquisition, you would
understand :-/ )

To be able to "watch the cache" would give me confidence that we were not
caching things that should not be.
Watching a "pool of cache misses" would be good as well, come to think of it.

Until prefetch is implemented in Varnish, an added benefit would be the
ability to do that myself, by having wget fetch the pages I just
purged.

Gaute
Purging. Now what? [ In reply to ]
One of my friends has had "So much code to hack, so little time"
in his signature for many years.

Yes, many of the features you request are already talked about and
planned, it's just a matter of coding.

But things do take time.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Purging. Now what? [ In reply to ]
On Mon, Nov 13, 2006 at 09:53:39PM +0100, Anders Berg wrote:
> With regards to your question about purging I must admit that I am
> unsure what you are trying to achieve. Could we be talking corner
> case here?

It's possible that I have an edge case for you :)

> I will try to explain why and how purging is used in general, so
> please don't be offended if I state the obvious, can't see your
> challenge, or if what I am saying is too trivial.
>
> Purging is not used to control content expiration; the HTTP headers
> like Expires and max-age etc. do that.

Right. But my problem is that I don't know in advance when something
expires, I only know when it has expired.

I'm using Varnish to cache maptiles from a WMS-server. We also have two
layers of WMS-servers. First there's mapserver, which is used to draw
general maps, then there's our own WMS-server, that's used to draw
weather maps. All requests to our WMS-server pass through
mapserver. Unfortunately, mapserver does not support any content
expiration headers, like Expires or If-Modified-Since or anything like
that. Besides, if it did - how should it behave if it needs to fetch
two layers from our WMS-server, and one of them is unchanged?
Mapserver can't be expected to keep a cache of its own.

> Purging is used as an "oh-shit-
> have-to-delete-now" mechanism. Let's say you have a default cache
> time for your article/page of 5 min., but 5 min. is too long to wait
> for an update if there is important stuff that needs to replace the
> content in the cache. That's when you would use purging to "force" an
> update on that/those page(s).

> If it's only one page, go directly for that page (no reg-exp); if
> there is more than one you have to keep control over what pages
> need to be refreshed or use a URL schema so that a reg-exp purge
> will delete them.

Well, since we're using tiled images, the number of potential URLs
becomes quite large very fast. At any given zoomlevel, there are
4**zoomlevel tiles. So at zoomlevel 12 (which we use for Google
Earth), there are 16.7 million tiles. We have about 20 different
datalayers, and every layer contains data for every hour for 48
hours. That means that we have to purge up to 16 billion tiles when a
datamodel is updated (about every 12 hours)[1]. Since our internal
WMS-server is fairly slow, we really, really don't want to generate
any tiles more than once, so it's important for us that Varnish keeps
everything in cache until it's purged.

> Does this answer your question? Please explain a bit deeper, with
> examples, if this does not.

I hope this explains why we'd like to purge URLs regularly with a
regex.

[1] Of course, only a tiny fraction of this will actually be generated
in the first place. We don't expect users to zoom in this close on
every part of the world, but it illustrates the potential number of
URLs that need to be purged if it can't be done with a regex.

--
Trond Michelsen
Purging. Now what? [ In reply to ]
In message <20061114100441.GA24310 at crusaders.no>, Trond Michelsen writes:

>Well, since we're using tiled images, the number of potential URLs
>becomes quite large very fast. At any given zoomlevel, there are
>4**zoomlevel tiles. So at zoomlevel 12 (which we use for Google
>Earth), there are 16.7 million tiles. We have about 20 different
>datalayers, and every layer contains data for every hour for 48
>hours. That means that we have to purge up to 16 billion tiles when a
>datamodel is updated (about every 12 hours)[1]. Since our internal
>WMS-server is fairly slow, we really, really don't want to generate
>any tiles more than once, so it's important for us that Varnish keeps
>everything in cache until it's purged.

This is exactly what Varnish's url.purge facility is supposed
to help you do. All that is required is that you can express
the regexp to match what you want to purge.

Right now url.purge is not available in VCL, but that is on
the list of todo items.
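
A script driving the console could look roughly like this (an untested
Python sketch; the management port and the tile URL scheme here are made up,
so use whatever you gave varnishd and whatever your tile URLs actually look
like):

import telnetlib   # Python 2 stdlib

def url_purge(regex, mgmt_host="localhost", mgmt_port=6082):
    # send a url.purge command to the management console, exactly
    # as one would by hand over telnet
    tn = telnetlib.Telnet(mgmt_host, mgmt_port)
    tn.write("url.purge %s\n" % regex)
    tn.close()

# e.g. drop every cached tile for one data layer when the model updates
url_purge(r"^/wms/temperature/.*")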

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Purging. Now what? [ In reply to ]
On Tuesday 14 November 2006 10:50, Poul-Henning Kamp wrote:
> One of my friends has had "So much code to hack, so little time"
> in his signature for many years.
>
> Yes, many of the features you request are already talked about and
> planned, it's just a matter of coding.
>
> But things do take time.

That's something of which I am very aware :)

A pity I am nowhere near good enough with C to be able to contribute anything
directly. I guess some "duplication of effort in python" is unavoidable
then :)

I would suggest that my company contribute a bit financially, but with the
model you suggest for that in the "version 2 and beyond" mail, I'm afraid
that it would only be a small drop in the bucket, and as such rather
impractical.

Gaute
Purging. Now what? [ In reply to ]
On Tuesday 14 November 2006 11:04, Trond Michelsen wrote:
> On Mon, Nov 13, 2006 at 09:53:39PM +0100, Anders Berg wrote:
> > With regards to your question about purging I must admit that I am
> > unsure what you are trying to achieve. Could we be talking corner
> > case here?
>
> It's possible that I have an edge case for you :)
>

Following the saying that one should always try to make one's blunders in
front of people who can tell you where you went wrong, I will hazard a
guess:

Your best option is to keep track of every single tile that gets generated,
and then do a match against that data to find what to purge when something
changes.
That way you can use your own regexps, in your own data, in the language of
your choice, to find all the exact urls to purge that have actually been
served in your chosen period of time.

If things change only a few times a day, I think I would just parse the logs
there and then.
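
Concretely, something like this (an untested Python sketch with a made-up
log format and tile URL scheme; it assumes Varnish accepts HTTP PURGE as
discussed earlier in the thread):

import re
import httplib   # Python 2 stdlib

# made-up example: every served tile for one data layer
TILE_RE = re.compile(r'"GET (/wms/\S*layer=temperature\S*) HTTP')

def purge(path, host="maps.example.com", varnish="localhost", port=80):
    conn = httplib.HTTPConnection(varnish, port)
    conn.request("PURGE", path, None, {"Host": host})
    conn.getresponse()
    conn.close()

seen = set()
for line in open("/var/log/varnish/access.log"):   # placeholder path
    m = TILE_RE.search(line)
    if m and m.group(1) not in seen:
        seen.add(m.group(1))
        purge(m.group(1))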

Gaute
Purging. Now what? [ In reply to ]
On Tue, Nov 14, 2006 at 11:58:27AM +0100, Gaute Amundsen wrote:
> On Tuesday 14 November 2006 11:04, Trond Michelsen wrote:
>> On Mon, Nov 13, 2006 at 09:53:39PM +0100, Anders Berg wrote:
>>> With regards to your question about purging I must admit that I am
>>> unsure what you are trying to achieve. Could we be talking corner
>>> case here?
>> It's possible that I have an edge case for you :)
> Following the saying that one should always try to make one's blunders in
> front of people who can tell you where you went wrong, I will hazard a
> guess:
>
> Your best option is to keep track of every single tile that gets generated,
> and then do a match against that data to find what to purge when something
> changes.
> That way you can use your own regexps, in your own data, in the language of
> your choice, to find all the exact urls to purge that have actually been
> served in your chosen period of time.
>
> If things change only a few times a day, I think I would just parse the logs
> there and then.

That seems unnecessarily complex. I have to have some way of finding
the relevant URLs from the log, and that's likely to be a regex. And I
can't really see how it's easier to use a regex on a logfile to find
relevant URLs and then purge those URLs individually, instead of simply
telling Varnish to purge all URLs matching the very same regex.

--
Trond Michelsen
Purging. Now what? [ In reply to ]
On Tuesday 14 November 2006 13:22, Trond Michelsen wrote:
> That seems unnecessarily complex. I have to have some way of finding
> the relevant URLs from the log, and that's likely to be a regex. And I
> can't really see how it's easier to use a regex on a logfile to find
> relevant URLs and then purge those URLs individually, instead of simply
> telling Varnish to purge all URLs matching the very same regex.

Well, as I have been hashing out for the last few days, that is not
implemented 100% yet.
You can (1) purge by regexp on the management console, but it does not know
about the hostname,
or (2) you can purge individual urls by HTTP PURGE, hostname and all.

In your case, perhaps you only have a few hostnames, so (1) will work?

Gaute
Purging. Now what? [ In reply to ]
On 11/14/06, Gaute Amundsen <gaute at pht.no> wrote:
> To be able to "watch the cache" would give me confidence that we were not
> caching things that should not be.

Oh, you are worried about URLs disappearing when you modify stuff but
still being cached... Yeah, that can be tricky.

--
Lennart Regebro, Nuxeo http://www.nuxeo.com/
CPS Content Management http://www.nuxeo.org/
Purging. Now what? [ In reply to ]
On Tuesday 14 November 2006 15:39, Lennart Regebro wrote:
> On 11/14/06, Gaute Amundsen <gaute at pht.no> wrote:
> > To be able to "watch the cache" would give me confidence that we were
> > not caching things that should not be.
>
> Oh, you are worried about URLs disappearing when you modify stuff but
> still being cached... Yeah, that can be tricky.

More worried that I could be caching some obscure search result, or other
"must be dynamic" page deep in a site somewhere.

As it is I have to purge EVERYTHING each time something changes, so "ghost
urls" are not a problem - yet.

Control, that is the main thing :)

Gaute
Purging. Now what? [ In reply to ]
On 11/14/06, Gaute Amundsen <gaute at pht.no> wrote:
> More worried that I could be caching some obscure search result, or other
> "must be dynamic" page deep in a site somewhere.

Ah, of course, you can cache search-results as well, I didn't even
think of that. :-)

> As it is I have to purge EVERYTHING each time something changes, so "ghost
> urls" are not a broblem, - yet.

Well, it's a brutal method, but it works, assuming you update your
site much less frequently than your cache times out. ;-)

--
Lennart Regebro, Nuxeo http://www.nuxeo.com/
CPS Content Management http://www.nuxeo.org/