Mailing List Archive

Re: Re: Some stats for service times with the new server
>>The "Astronomer" page contains an extraordinary number of links,
>>mostly to "year" pages. Each link on a page requires a database
>>lookup (a quick one, but still a lookup). "Current events" had
>>the same problem before I reorganized it. I'm inclined to just
>>remove the year links here. They don't really serve much purpose.

>Argh! I didn't notice that. Thanks for pointing that out.
>Wouldn't page caching sort this, though?

I have something of a love/hate relationship with caching. It
solves some problems, and it creates others. I'd like to avoid
it if possible, and I think it is possible. But I may be proven
wrong about that if people create lots of other link-fest pages.
Re: Some stats for service times with the new server [ In reply to ]
On 7/22/02 2:40 PM, "lcrocker@nupedia.com" <lcrocker@nupedia.com> wrote:

>
>>> The "Astronomer" page contains an extraordinary number of links,
>>> mostly to "year" pages. Each link on a page requires a database
>>> lookup (a quick one, but still a lookup). "Current events" had
>>> the same problem before I reorganized it. I'm inclined to just
>>> remove the year links here. They don't really serve much purpose.
>
>> Argh! I didn't notice that. Thanks for pointing that out.
>> Wouldn't page caching sort this, though?
>
> I have something of a love/hate relationship with caching. It
> solves some problems, and it creates others. I'd like to avoid
> it if possible, and I think it is possible. But I may be proven
> wrong about that if people create lots of other link-fest pages.
>
Some kind of page-caching is probably a good idea. We should definitely
allow for linkfest pages.
Re: Some stats for service times with the new server [ In reply to ]
lcrocker@nupedia.com wrote:

>>>The "Astronomer" page contains an extraordinary number of links,
>>>mostly to "year" pages. Each link on a page requires a database
>>>lookup (a quick one, but still a lookup). "Current events" had
>>>the same problem before I reorganized it. I'm inclined to just
>>>remove the year links here. They don't really serve much purpose.
>
>>Argh! I didn't notice that. Thanks for pointing that out.
>>Wouldn't page caching sort this, though?
>
>I have something of a love/hate relationship with caching. It
>solves some problems, and it creates others. I'd like to avoid
>it if possible, and I think it is possible. But I may be proven
>wrong about that if people create lots of other link-fest pages.
>
Lee,

I haven't read the code, but do you do the lookups as lots of individual
SQL calls per link, or do you gather them up into a list and do a single
SQL call? The latter might improve performance; and if the
latency/pipelining hit of one giant call proved too much, doing the
lookups in batches might also help.

Neil
Re: Some stats for service times with the new server [ In reply to ]
> >>> The "Astronomer" page contains an extraordinary number of links,
> >>> mostly to "year" pages. Each link on a page requires a database
> >>> lookup (a quick one, but still a lookup). "Current events" had

If this is true, I think the code should be rewritten to use an in-core
hash table that contains (as keys) the names of all existing pages. It
would only be used to determine whether a [[link]] should be underlined or
a ?-link. This in-core data structure would not be shared between
threads of a PHP application (at least not if load balancing over multiple
hosts), but it could be recreated from the database (select name from cur)
every few minutes or so (and whenever a reload or post is served). Pages
that are created between these intervals will falsely have ?-links for a
short while, big deal. Or hits would be read from the hash, but misses
would force a read from the database. Avoiding a roundtrip down to the
database sounds like a great saving to me.
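Lars's scheme can be sketched roughly as follows. This is a minimal illustration, not the actual code: the class and function names are invented, and it includes the miss-falls-through-to-the-database variant from the end of his message.

```python
# Sketch of a per-process page-title cache: hits are answered from memory,
# misses fall back to one real database lookup, and the whole set is
# rebuilt from the database every few minutes. All names here are
# illustrative assumptions, not taken from the Wikipedia codebase.
import time

REFRESH_SECONDS = 300  # "every few minutes or so"

class TitleCache:
    def __init__(self, fetch_all_titles):
        # fetch_all_titles would wrap something like: select name from cur
        self._fetch = fetch_all_titles
        self._titles = set()
        self._loaded_at = 0.0

    def _maybe_refresh(self):
        if time.time() - self._loaded_at > REFRESH_SECONDS:
            self._titles = set(self._fetch())
            self._loaded_at = time.time()

    def exists(self, title, db_lookup):
        """Decide whether [[title]] is a live link or a ?-link.
        A cache miss still checks the database once, so freshly created
        pages do not stay ?-links until the next refresh."""
        self._maybe_refresh()
        if title in self._titles:
            return True
        if db_lookup(title):
            self._titles.add(title)
            return True
        return False
```

The trade-off Lars accepts is staleness between refreshes; the miss-path lookup removes the false-?-link problem at the cost of keeping some database traffic.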


--
Lars Aronsson (lars@aronsson.se)
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: Some stats for service times with the new server [ In reply to ]
Brion VIBBER wrote:

> Lars Aronsson wrote:
>
>>>>>> The "Astronomer" page contains an extraordinary number of links,
>>>>>> mostly to "year" pages. Each link on a page requires a database
>>>>>> lookup (a quick one, but still a lookup). "Current events" had
>>
>> If this is true, I think the code should be rewritten to use an in-core
>> hash table that contains (as keys) the names of all existing pages. It
>> would only be used to determine whether a [[link]] should be underlined
>> or a ?-link. This in-core data structure would not be shared between
>> threads of a PHP application (at least not if load balancing over
>> multiple hosts), but it could be recreated from the database (select
>> name from cur) every few minutes or so (and whenever a reload or post
>> is served). Pages that are created between these intervals will falsely
>> have ?-links for a short while, big deal.
>
> Hmm, maybe.
>
>> Or hits would be read from the hash, but misses
>> would force a read from the database. Avoiding a roundtrip down to the
>> database sounds like a great saving to me.
>
> As Neil suggested, couldn't these checks be done in just one big query
> rather than a lot of separate ones? ie, something like:
>
> SELECT cur_title
> FROM cur
> WHERE cur_title
> IN ('Scientist','Research','Astronomy','Astrophysics',[...])
>
> or would that just get more complicated?
>
I agree. Roll-your-own hash tables are hard to scale, and a bad idea
unless nothing else works. Keeping everything in SQL will make it much
easier to move to a 3-tier Wikipedia (with a separate database
server/server cluster and multiple page servers) at a later date.

I _think_ that the big-query approach should be _lots_ faster: for say,
300 links, it will do only one lot of query-round-trip and parsing
overhead, instead of 300 lots.
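The saving Neil describes can be illustrated like this; sqlite3 stands in for the real database purely for demonstration, and only the `cur`/`cur_title` names come from the thread:

```python
# One round trip for all the links on a page instead of one query per
# link. The table and column names follow the cur schema quoted in this
# thread; the in-memory sqlite database is just a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cur (cur_title TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO cur VALUES (?)",
                 [("Scientist",), ("Astronomy",)])

links = ["Scientist", "Research", "Astronomy", "Astrophysics"]

# Build one parameterized IN (...) query for the whole page.
placeholders = ",".join("?" for _ in links)
existing = {row[0] for row in conn.execute(
    f"SELECT cur_title FROM cur WHERE cur_title IN ({placeholders})",
    links)}

# Anything not returned gets rendered as a ?-link.
missing = [t for t in links if t not in existing]
```

For a 300-link page this is one parse and one round trip rather than 300, which is where the bulk of the per-link overhead goes.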

Neil
Re: Re: Some stats for service times with the new server [ In reply to ]
> As Neil suggested, couldn't these checks be done in just one big
> query rather than a lot of separate ones? ie, something like:
>
>SELECT cur_title
> FROM cur
> WHERE cur_title
> IN ('Scientist','Research','Astronomy','Astrophysics',[...])
>
> or would that just get more complicated?

It would actually be more like:

SELECT cur_id,cur_namespace,cur_title FROM cur WHERE
(cur_namespace=0 AND cur_title='Scientist') OR
(cur_namespace=0 AND cur_title='Research') OR
(cur_namespace=1 AND cur_title='Scientist')

etc. I'm not sure how well that would optimize.
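One way to keep the batched form close to a plain IN query despite the namespace column is to group the titles by namespace first, so each namespace contributes a single IN clause. A sketch, under the assumption that this helper is purely illustrative and not part of the actual code:

```python
# Group (namespace, title) pairs so the WHERE clause becomes
#   (cur_namespace=0 AND cur_title IN (...)) OR
#   (cur_namespace=1 AND cur_title IN (...)) ...
# instead of one OR-ed pair per link. Illustrative helper only.
from collections import defaultdict

def build_batched_query(links):
    """links: iterable of (namespace, title) pairs.
    Returns (sql_text, parameter_list) for a single batched query."""
    by_ns = defaultdict(list)
    for ns, title in links:
        by_ns[ns].append(title)
    clauses, params = [], []
    for ns in sorted(by_ns):
        titles = by_ns[ns]
        marks = ",".join("?" for _ in titles)
        clauses.append(f"(cur_namespace={ns} AND cur_title IN ({marks}))")
        params.extend(titles)
    sql = ("SELECT cur_id,cur_namespace,cur_title FROM cur WHERE "
           + " OR ".join(clauses))
    return sql, params
```

With an index on (cur_namespace, cur_title), each IN list should still be answerable by index lookups, though how well the optimizer handles the OR of groups is exactly the open question Lee raises.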
Re: Re: Some stats for service times with the new server [ In reply to ]
On Tue, Jul 23, 2002 at 04:11:14PM -0700, lcrocker@nupedia.com wrote:
>
> > As Neil suggested, couldn't these checks be done in just one big
> > query rather than a lot of separate ones? ie, something like:
> >
> >SELECT cur_title
> > FROM cur
> > WHERE cur_title
> > IN ('Scientist','Research','Astronomy','Astrophysics',[...])
> >
> > or would that just get more complicated?
>
> It would actually be more like:
>
> SELECT cur_id,cur_namespace,cur_title FROM cur WHERE
> (cur_namespace=0 AND cur_title='Scientist') OR
> (cur_namespace=0 AND cur_title='Research') OR
> (cur_namespace=1 AND cur_title='Scientist')
>
> etc. I'm not sure how well that would optimize.

You can get all the links with one simple SQL statement from the table
'links' except when doing a preview.

-- Jan Hidders
Re: Some stats for service times with the new server [ In reply to ]
Lars Aronsson wrote:
>>>>>The "Astronomer" page contains an extraordinary number of links,
>>>>>mostly to "year" pages. Each link on a page requires a database
>>>>>lookup (a quick one, but still a lookup). "Current events" had
>
> If this is true, I think the code should be rewritten to use an in-core
> hash table that contains (as keys) the names of all existing pages. It
> would only be used to determine whether a [[link]] should be underlined or
> a ?-link. This in-core data structure would not be shared between
> threads of a PHP application (at least not if load balancing over multiple
> hosts), but it could be recreated from the database (select name from cur)
> every few minutes or so (and whenever a reload or post is served). Pages
> that are created between these intervals will falsely have ?-links for a
> short while, big deal.

Hmm, maybe.

> Or hits would be read from the hash, but misses
> would force a read from the database. Avoiding a roundtrip down to the
> database sounds like a great saving to me.

As Neil suggested, couldn't these checks be done in just one big query
rather than a lot of separate ones? I.e., something like:

SELECT cur_title
FROM cur
WHERE cur_title
IN ('Scientist','Research','Astronomy','Astrophysics',[...])

or would that just get more complicated?

-- brion vibber (brion @ pobox.com)
Re: Re: Re: Some stats for service times with the new server [ In reply to ]
> You can get all the links with one simple SQL statement from
> the table 'links' except when doing a preview.
>-- Jan Hidders

Now there's an idea. I'm generally not relying on the links
table for anything important, because I don't really consider it
part of the "real" information. But there's no reason I can't
use it for performance. I'll think about that.
Re: Some stats for service times with the new server [ In reply to ]
On Wed, 24 Jul 2002, Neil Harris wrote:
> I agree. Roll-your-own hash tables are hard to scale, and a bad idea

Hash tables or associative arrays are built into Perl, and I guess into
PHP as well.


--
Lars Aronsson (lars@aronsson.se)
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/