Mailing List Archive

Please run this DB query to fix Special:Randompage
Someone needs to run this:

UPDATE cur SET cur_random=RAND()

Here's a quote from Village Pump explaining why:

The pseudorandom number generator which backs the Special:Randompage link
appears to be not so good. I just hit it about two dozen times and had
several links come up multiple times. -º¡º

Emphasis on pseudo. I got RNA transcription four times in about a dozen
clicks - twice in succession. Also, Zog's favorite song came up - but
redirected to Johnny Rebel. I thought redirects were excluded. Geoffrey

Here's what I found out with a few SQL queries. The cur_random field values
are strongly clustered up around the high end. In fact, there are no
articles at all between about 0.18 and 0.4, and only a few below 0.18. A few
minutes of browsing through the old versions of SpecialRandompage.php shows
why. A previous version of the software selected the article with the lowest
cur_random value, and then set that value to a new random number. So here's
why we now see poor results: most of the pages are clustered up above 0.9 or
so, so when you click Special:Randompage, there's a high chance of picking
one of the few low-numbered articles. The cur_random value is then reset,
and there's still a high chance of the new value being below 0.9. Hence, the
few privileged low-numbered articles get selected far more often, and unless
someone re-randomizes the cur_random column, it takes a long time for the
high-numbered articles to diffuse back down. -- Tim Starling 04:25 May 1,
2003 (UTC)

--END QUOTE--
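
The "few SQL queries" mentioned in the quote aren't shown; a diagnostic of
that kind might look like the following sketch, which buckets cur_random
into tenths to expose the clustering:

-- Hypothetical diagnostic: count pages in each tenth of the
-- cur_random range.
SELECT FLOOR(cur_random * 10) / 10 AS bucket, COUNT(*) AS pages
FROM cur
GROUP BY bucket
ORDER BY bucket;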

-- Tim Starling.

Re: Please run this DB query to fix Special:Randompage
On Wed, 2003-04-30 at 21:48, Tim Starling wrote:
> Someone needs to run this:
>
> UPDATE cur SET cur_random=RAND()

Ok, it'll be a bit before it's done. Hooray for big slow databases. :D

-- brion vibber (brion @ pobox.com)
Re: Please run this DB query to fix Special:Randompage
On Thu, May 01, 2003 at 08:10:38AM -0700, Brion Vibber wrote:
> On Wed, 2003-04-30 at 21:48, Tim Starling wrote:
> > Someone needs to run this:
> >
> > UPDATE cur SET cur_random=RAND()
>
> Ok, it'll be a bit before it's done. Hooray for big slow databases. :D
>
> -- brion vibber (brion @ pobox.com)

Off the top of my head, I can't think of any simple mathematical way to
do what you want to do (that being making articles selected randomly
less likely to be selected randomly over time). Even if every cur_random
is totally re-randomized, it still isn't going to be fair.

Seems to me that the best way to do this is either:
1) Be truly random. In PostgreSQL, this would be something like:
SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)
(this function should be really fast, not sure if faster than:)
SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(*) FROM cur)
(but you get the idea; it may need other constraints -- see the corrected sketch after this list)
2) Keep a timestamp instead of a random number. That way, whenever an
article is "randomly" selected, it gets its timestamp updated to the
current time. Always select the oldest article for a "random" page.
New articles always get a current timestamp here.
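
A corrected sketch of both options follows. PostgreSQL's random() takes no
argument, so the offset has to be computed from it; the cur_last_served
column in option 2 is hypothetical:

-- Option 1 (PostgreSQL): random() returns a value in [0,1), so
-- scale it by the row count to get a uniform offset.
SELECT cur_id FROM cur
ORDER BY cur_id
OFFSET floor(random() * (SELECT count(*) FROM cur))::bigint
LIMIT 1;

-- Option 2: serve the least-recently-served page, then send it to
-- the back of the queue (:picked_id is the cur_id just returned).
SELECT cur_id FROM cur ORDER BY cur_last_served LIMIT 1;
UPDATE cur SET cur_last_served = now() WHERE cur_id = :picked_id;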

--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
Re: Please run this DB query to fix Special:Randompage
On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
> Off the top of my head, I can't think of any simple mathematical way to
> do what you want to do (that being making articles selected randomly
> less likely to be selected randomly over time)

Not sure who wants to do that... For me it's quite sufficient to make
the probability of selection random, and let the size of the selection
field make multiple selections _reasonably_ rare.

The resetting of the random weight upon load is just meant to keep the
pot stirring, and "in theory" shouldn't have any effect at all on
randomness if the random indexes were random to begin with.

> Seems to me that the best way to do this is either:
> 1) Be truly random. In PostgreSQL, this would be something like:
> SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id) FROM cur)
> (this function should be really fast, not sure if faster than:)

MySQL doesn't seem to let you use a non-constant expression for
'LIMIT offset,count' (and besides, still no subqueries), but we can pick
a random number easily enough in PHP. The problem is that we aren't sure
what range to use; hence a random index attached to each article whose
range is known (between 0 and 1).
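
A sketch of the lookup being described, with 0.42 standing in for the
random threshold picked in PHP (the namespace and redirect filters are
assumptions carried over from the next paragraph):

-- An index on cur_random turns this into a cheap range scan; a
-- real implementation would retry from 0 if no row matches.
SELECT cur_id, cur_title FROM cur
WHERE cur_random >= 0.42
AND cur_namespace = 0 AND cur_is_redirect = 0
ORDER BY cur_random
LIMIT 1;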

We don't want to select non-article pages or redirects, but we don't
currently have a count of exactly that. There's the "good articles"
count, but it has other criteria, so if we used it we'd always crop off
relatively recent articles. If we inflate it based on an estimate we'd
get overemphasis on the _most_ recent article if we estimate too high.

Sure, there's count(*), but that takes time to run... might be faster
with an index specifically on cur_namespace,cur_is_redirect. Or we could
maintain another column in the stats table which counts non-redirect
article pages with _no_ other criteria (such as size or comma/bracket
count). Rebuilding indexes on cur is _very_ slow, so if we need to
change the indexes, best to plan ahead.
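
Sketches of both possibilities (the index, table, and column names here are
placeholders):

-- An index covering the filter columns, so the count can avoid a
-- full table scan:
CREATE INDEX cur_ns_redirect ON cur (cur_namespace, cur_is_redirect);
SELECT COUNT(*) FROM cur
WHERE cur_namespace = 0 AND cur_is_redirect = 0;

-- Or a hypothetical pre-maintained counter in the stats table,
-- bumped whenever an article is created or deleted:
UPDATE stats SET article_count = article_count + 1;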

-- brion vibber (brion @ pobox.com)
Re: Please run this DB query to fix Special:Randompage
>From: Brion Vibber <brion@pobox.com>
>
>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
> > Off the top of my head, I can't think of any simple mathematical way to
> > do what you want to do (that being making articles selected randomly
> > less likely to be selected randomly over time)
>
>Not sure who wants to do that... For me it's quite sufficient to make
>the probability of selection random, and let the size of the selection
>field make multiple selections _reasonably_ rare.

With over 100,000 articles, it's unlikely the user will ever see the same
article come up twice. A previous flawed algorithm had the property of
making a few articles much more likely to be selected.

>The resetting of the random weight upon load is just meant to keep the
>pot stirring, and "in theory" shouldn't have any effect at all on
>randomness if the random indexes were random to begin with.

Quite right. I would recommend only setting cur_random on article creation.
It would save that little tiny bit of database load. There's no reason to
think this "pot stirring" will make the selection more random, now that the
pot has been pre-stirred.
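
Creation-time initialisation might look like this sketch (the column list
is trimmed to the relevant fields; a real insert would set more of them):

-- Set cur_random exactly once, when the row is first written; page
-- views never touch it again.
INSERT INTO cur (cur_namespace, cur_title, cur_text, cur_is_redirect, cur_random)
VALUES (0, 'Example_page', 'Initial text.', 0, RAND());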

> > Seems to me that the best way to do this is either:
> > 1) Be truly random. In PostgreSQL, this would be something like:
> > SELECT cur_id from cur LIMIT 1 OFFSET random(SELECT count(cur_id)
> > FROM cur)
> > (this function should be really fast, not sure if faster than:)
>
>MySQL doesn't seem to let you use a non-constant expression for
>'LIMIT offset,count' (and besides, still no subqueries), but we can pick
>a random number easily enough in PHP. The problem is that we aren't sure
>what range to use; hence a random index attached to each article whose
>range is known (between 0 and 1).

The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1" for
versions later than 3.23. But I think the current algorithm is quite clever.
Its only drawback is the overhead associated with maintaining the extra
index, and initialising the field on article creation.
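
For reference, the documented one-liner (Brion reports its performance
further down):

-- Forces MySQL to materialise and sort every matching row before
-- taking one.
SELECT cur_id FROM cur ORDER BY RAND() LIMIT 1;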

>We don't want to select non-article pages or redirects, but we don't
>currently have a count of exactly that. There's the "good articles"
>count, but it has other criteria, so if we used it we'd always crop off
>relatively recent articles. If we inflate it based on an estimate we'd
>get overemphasis on the _most_ recent article if we estimate too high.
>
>Sure, there's count(*), but that takes time to run... might be faster
>with an index specifically on cur_namespace,cur_is_redirect. Or we could
>maintain another column in the stats table which counts non-redirect
>article pages with _no_ other criteria (such as size or comma/bracket
>count). Rebuilding indexes on cur is _very_ slow, so if we need to
>change the indexes, best to plan ahead.


Maybe we could have a "good article" flag, stored in cur on edit.
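
As a sketch, with a hypothetical column name:

-- A flag recomputed on each edit, usable both as a filter and for
-- a cheap count:
ALTER TABLE cur ADD cur_is_good TINYINT NOT NULL DEFAULT 0;
SELECT COUNT(*) FROM cur WHERE cur_is_good = 1;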

-- Tim Starling.

Re: Please run this DB query to fix Special:Randompage
On Fri, 02 May 2003 10:16:21 +1000, Tim Starling <ts4294967296@hotmail.com>
gave utterance to the following:

>
>> From: Brion Vibber <brion@pobox.com>
>>
>> On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
>> > Off the top of my head, I can't think of any simple mathematical way to
>> > do what you want to do (that being making articles selected randomly
>> > less likely to be selected randomly over time)
>>
>> Not sure who wants to do that... For me it's quite sufficient to make
>> the probability of selection random, and let the size of the selection
>> field make multiple selections _reasonably_ rare.
>
> With over 100,000 articles, it's unlikely the user will ever see the same
> article come up twice. A previous flawed algorithm had the property of
> making a few articles much more likely to be selected.
>
I thought the reason we were having this discussion was because articles
were repeating too often, indicating some fault in the randomizer?
I have commonly seen an article twice in a sequence of 3-5 random page
requests - about a dozen times in the past month. On one occasion I saw an
article 3 times in a sequence of 6, which is definitely not statistically
random.
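
For scale: with N independent uniform picks from M pages, the chance of any
repeat is roughly

P(repeat) ≈ N(N-1) / (2M)

so for M = 100,000 pages and N = 5 requests that's about 10/100,000 =
0.01%, which makes repeats as frequent as those described wildly improbable
for a working uniform sampler.
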
--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Re: Please run this DB query to fix Special:Randompage
On Thu, 2003-05-01 at 17:55, Richard Grevers wrote:
> I thought the reason we were having this discussion was because articles
> were repeating too often, indicating some fault in the randomizer?

Well, two things:

First, the existing assigned indexes were very un-randomly distributed,
which may have led to a somewhat increased chance of duplicates.
I re-randomized the whole set this morning, and I just now ran 100
random pages and got no duplicates.

Second, I'm not 100% sure the randomizer is reliable under heavy load.
We're running MySQL 3.23.54; I notice a line in the changelog for
3.23.56 (released in mid-March, apparently, when I wasn't looking) which
may be relevant:
* Better RAND() initialization for new connections.

Were we not about to replace it with a new machine and MySQL 4.0, I'd
recommend trying an upgrade. :D

-- brion vibber (brion @ pobox.com)
Re: Please run this DB query to fix Special:Randompage
On Thu, May 01, 2003 at 08:17:27PM -0700, Brion Vibber wrote:
> On Thu, 2003-05-01 at 17:16, Tim Starling wrote:
> > >From: Brion Vibber <brion@pobox.com>
> > >The resetting of the random weight upon load is just meant to keep the
> > >pot stirring, and "in theory" shouldn't have any effect at all on
> > >randomness if the random indexes were random to begin with.
> >
> > Quite right. I would recommend only setting cur_random on article creation.
>
> Eh, may as well. Done.
>
> > The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1" for
> > versions later than 3.23.
>
> That seems to end up being dreadfully slow... "where used; Using
> temporary; Using filesort" Hmm, I wonder if it's changed.... Yeah, okay,
> after a minute and a half of "Copying to tmp table" I decided to cancel
> that one. Still sucks. :)
>
> The original code used the ORDER BY RAND() trick to fill up a pool of
> 1000, then grabbed out of those 1000, which was faster than out of the
> whole wiki. However when it came time to refill the queue, it was still
> incredibly slow, so I replaced it with a nice friendly *indexable*
> pre-chosen random number attached to every page.
>
> -- brion vibber (brion @ pobox.com)

If MySQL is like PostgreSQL, a query like that gets _all_ the rows
first, randomizes them, and then applies the limit. That's why you have
to use count and offset, which makes things super-duper fast. :)


--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
Re: Please run this DB query to fix Special:Randompage
On Thu, 2003-05-01 at 17:16, Tim Starling wrote:
> >From: Brion Vibber <brion@pobox.com>
> >The resetting of the random weight upon load is just meant to keep the
> >pot stirring, and "in theory" shouldn't have any effect at all on
> >randomness if the random indexes were random to begin with.
>
> Quite right. I would recommend only setting cur_random on article creation.

Eh, may as well. Done.

> The MySQL documentation suggests using "ORDER BY RAND() LIMIT 1" for
> versions later than 3.23.

That seems to end up being dreadfully slow... "where used; Using
temporary; Using filesort" Hmm, I wonder if it's changed.... Yeah, okay,
after a minute and a half of "Copying to tmp table" I decided to cancel
that one. Still sucks. :)

The original code used the ORDER BY RAND() trick to fill up a pool of
1000, then grabbed out of those 1000, which was faster than out of the
whole wiki. However when it came time to refill the queue, it was still
incredibly slow, so I replaced it with a nice friendly *indexable*
pre-chosen random number attached to every page.
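
The old refill query would have looked something like this sketch (the
filter conditions are assumed):

-- Refilling the 1000-row pool: the same full scan and filesort as
-- ORDER BY RAND() LIMIT 1, just amortised over 1000 picks.
SELECT cur_id FROM cur
WHERE cur_namespace = 0 AND cur_is_redirect = 0
ORDER BY RAND() LIMIT 1000;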

-- brion vibber (brion @ pobox.com)
Re: Please run this DB query to fix Special:Randompage
>>>From: Brion Vibber <brion@pobox.com>
>>>
>>>On Thu, 2003-05-01 at 08:53, Nick Reinking wrote:
>>> > Off the top of my head, I can't think of any simple mathematical way to
>>> > do what you want to do (that being making articles selected randomly
>>> > less likely to be selected randomly over time)
>>>
>>>Not sure who wants to do that... For me it's quite sufficient to make
>>>the probability of selection random, and let the size of the selection
>>>field make multiple selections _reasonably_ rare.
>>
>>With over 100,000 articles, it's unlikely the user will ever see the same
>>article come up twice. A previous flawed algorithm had the property of
>>making a few articles much more likely to be selected.
>>
>I thought the reason we were having this discussion was because articles
>were repeating too often, indicating some fault in the randomizer?
>I have commonly seen an article twice in a sequence of 3-5 random page
>requests - about a dozen times in the past month. On one occasion I saw an
>article 3 times in a sequence of 6, which is definitely not statistically
>random.

A calculator is only as smart as the fool pressing the buttons. Same with
pseudorandom number generators. The PRNG appears to have been working
perfectly, allowing us to misuse it with amazing precision. For an
explanation of what went wrong, see my first post on this topic.

As for Brion's comment about MySQL updating RAND() seeding -- I assumed they
would be using a persistent seed, but that comment suggests they're making
the same mistake our own code makes: initialising the seed on every
connection, from the system time. Just because you ask for microseconds
doesn't mean you get microseconds. The PHP manual doesn't say how
microtime() is implemented; what if it's using the old 18.2 Hz clock?

-- Tim Starling.
