Mailing List Archive

PostgreSQL and deadlock
Hi everybody,


So, you know the code I posted a week or so ago for what I called the
"index trigger"?

It's having a moment, and I'm wondering if anything in the error
messages will ring bells with any of you.

Background:

1. This template gets called from cron every three minutes. Its job is
to look through recently-published stories, figure out what covers need
to be updated, and then publish those covers.

2. Sometimes publishing all those covers takes a while (say, longer than
three minutes). When that happened, a big stack of bric_republish
processes threatened to take down the entire server.

3. So I added a bit to the top of the template, which I hoped would
prevent the nightmare situation described in #2. Here's the start of the
<%init> block:

my @running_copies_of_myself = Bric::Util::Job->list({
executing => 1,
story_instance_id => $story->get_version_id,
failed => 0,
});

if ($running_copies_of_myself[1]) {
use Data::Dumper;
$m->out(Dumper(@running_copies_of_myself));
return 0;
}

About 90% of the time, this works perfectly. If one of these templates
is already going, trying to launch another process just exits the way I
had hoped.

But when the server is under load, the Job->list() frequently hangs for
a long time. Long enough that it's still running three minutes later, so
the processes pile up.

Sometimes the Job->list() gives up. In that case, here's what the Dumper
dumps out:

$VAR1 = bless( { 'priority' => '3',
'_class_id' => '80',
'story_instance_id' => '68277',
'media_instance_id' => undef,
'sched_time' => '2010-03-23 21:45:03',
'id' => 1616337, 'tries' => '2',
'name' => 'Publish "Index trigger"',
'_executing' => '1',
'error_message' => 'Unable to execute SQL statement:
DBD::Pg::st execute failed: ERROR: deadlock detected DETAIL: Process
30142 waits for ShareLock on transaction 187816962; blocked by process
30000. Process 30000 waits for ShareLock on transaction 187816914;
blocked by process 30142. [for Statement " UPDATE resource SET id = ?,
parent_id = ?, path = ?, uri = ?, size = ?, mod_time = ?, is_dir = ?,
media_type__id = (SELECT id FROM media_type WHERE name = ?) WHERE id = ?
" with ParamValues: 9=\'99437\']
at /home/rsn/bricolage_dev/lib/Bric/Util/DBI.pm line 1087, line 19.
[/home/rsn/bricolage_dev/lib/Bric/Util/DBI.pm:1088]
[/home/rsn/bricolage_dev/lib/Bric/Dist/Resource.pm:1676]
[/home/rsn/bricolage_dev/lib/Bric/Util/Burner.pm:2045]
[/home/rsn/bricolage_dev/lib/Bric/Util/Burner/Mason.pm:598]
[/home/rsn/bricolage_dev/lib/Bric/Util/Burner/Mason.pm:297]
[/home/rsn/bricolage_dev/lib/Bric/Util/Burner.pm:1518]
[/home/rsn/bricolage_dev/lib/Bric/Util/Burner.pm:1271]
[/home/rsn/bricolage_dev/lib/Bric/Util/Job/Pub.pm:193]
[/home/rsn/bricolage_dev/lib/Bric/Util/Job.pm:1889]
[/home/rsn/bricolage_dev/lib/Bric/Util/Job.pm:1808]
[/home/rsn/bricolage_dev/lib/Bric/SOAP/Workflow.pm:294]
[/usr/lib/perl5/vendor_perl/5.8.4/SOAP/Lite.pm:2322]
[/usr/lib/perl5/vendor_perl/5.8.4/SOAP/Transport/HTTP.pm:286]
[/usr/lib/perl5/vendor_perl/5.8.4/SOAP/Transport/HTTP.pm:464]
[/home/rsn/bricolage_dev/lib/Bric/SOAP/Handler.pm:184] [PerlHandler
subroutine `Bric::SOAP::Handler::handler\':1] ',
'comp_time' => undef,
'_failed' => '0',
'grp_ids' => [ '30' ],
'_dirty' => undef,
'user_id' => '1168',
'type' => '0'
},
'Bric::Util::Job::Pub' );

$VAR2 = bless( { 'priority' => '3',
'_class_id' => '80',
'story_instance_id' => '68277',
'media_instance_id' => undef,
'sched_time' => '2010-03-23 21:48:00',
'id' => 1616345,
'tries' => '1',
'name' => 'Publish "Index trigger"',
'_executing' => '1',
'error_message' => undef,
'comp_time' => undef,
'_failed' => '0',
'grp_ids' => [ '30' ],
'_dirty' => undef,
'user_id' => '1168',
'type' => '0' },
'Bric::Util::Job::Pub' );

Anyway, I'm feeling a bit lost. Does anybody recognize the deadlock
error, or have suggestions for troubleshooting it?


Thanks,

Bret




--
Bret Dawson
Producer
Pectopah Productions Inc.
(416) 895-7635
bret@pectopah.com
www.pectopah.com
Re: PostgreSQL and deadlock [ In reply to ]
On Mar 23, 2010, at 9:20 PM, Bret Dawson wrote:

> Anyway, I'm feeling a bit lost. Does anybody recognize the deadlock
> error, or have suggestions for troubleshooting it?

A deadlock happens when two each transactions try to update two or more of the same records with foreign key relations in different orders. For example, say you have two tables:

CREATE TABLE a (
id serial primary key
);

CREATE TABLE b (
id serial primary key,
a_id integer references a(id)
);

Let's insert a couple of values;

INSERT INTO A VALUES (DEFAULT);
INSERT INTO B VALUES (DEFAULT, 1);¨

Now, start a transaction and update b:

BEGIN;
UPDATE b SET a_id = 1;

Now start a second psql section, start a transaction, and update a:

BEGIN;
UPDATE a SET id = 1;

Go back to the first section and try to update a there, too:

UPDATE a SET id = 1;

You'll notice that it will hang, not returning. This is because the second psql connection has a write lock on table a row 1, and the session needs to wait for that lock to be released before it can actually update it.

Now go back to the second section again and try to update b:

UPDATE b SET a_id = 1;
ERROR: deadlock detected
DETAIL: Process 27255 waits for ShareLock on transaction 64808; blocked by process 27198.
Process 27198 waits for ShareLock on transaction 64809; blocked by process 27255.
HINT: See server log for query details.

This happens because the first session already has a lock on table b row 1, and is waiting on the lock on table a row 1. The second session has the lock on table a row 1 and tries to update table b row 1, and has to wait for the lock that the first connection has.

This is a deadlock: each transaction is waiting for a lock to be released by the other transaction. This is an irresolvable situation, so what PostgreSQL does is to pick one transaction and let it finish, while forcing the other to fail and need to be rolled back.

Lesson over. Now, the way to eliminate deadlocks is to make sure that your code always updates tables in the same order. If both connections updated table a and then table b, there'd be no problem. The problem comes when one updates table a and then table b, while the other updates table b and then table a.

It also makes sense to try to keep transactions a short as possible, to avoid long locks. This is why each job runs in its own transaction. Bricolage used to run all the jobs it was processing in a queue in a single transaction, and deadlocks were a huge problem. I haven't heard about them in quite a long while, though.

So let's look at your error again:

> 'error_message' => 'Unable to execute SQL statement:
> DBD::Pg::st execute failed: ERROR: deadlock detected DETAIL: Process
> 30142 waits for ShareLock on transaction 187816962; blocked by process
> 30000. Process 30000 waits for ShareLock on transaction 187816914;
> blocked by process 30142. [for Statement " UPDATE resource SET id = ?,
> parent_id = ?, path = ?, uri = ?, size = ?, mod_time = ?, is_dir = ?,
> media_type__id = (SELECT id FROM media_type WHERE name = ?) WHERE id = ?
> " with ParamValues: 9=\'99437\']

So this transaction is trying to update resource after updating something else, while process 30142 has already updated this resource row and is trying to update whatever it is that this process has already updated.

Figuring this out is hard. My [other company](http://pgexperts.com/) has developed a pretty good set of skills to work though such things. One of our tools is this query:

SELECT pg_stat_activity.datname,
pg_class.relname,
substr(pg_stat_activity.current_query,1,250) AS query,
age(now(),pg_stat_activity.query_start) AS "age"
FROM pg_stat_activity
JOIN pg_locks ON pg_locks.pid = pg_stat_activity.procpid
LEFT OUTER JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE pg_locks.mode = 'AccessExclusiveLock'
AND pg_class.relname IS NOT NULL
ORDER by pg_stat_activity.datname, pg_class.relname;

Run that when things are starting to pile up, to see what's locking what. It should help you to identify what processes are locking resource rows, which will hopefully help you to identify what code is running queries out of order.

HTH,

David
Re: PostgreSQL and deadlock [ In reply to ]
Thanks David. This is totally illuminating, at least for understanding
what a deadlock is.

The big mystery for me is in figuring out why a Job->list() operation
would be doing an UPDATE at all. (Or in figuring out what's actually
attempting the UPDATE, when I'm incorrectly blaming Job->list().)


Anyway, the big clog usually happens in the late morning, so I'll run
the pgexperts query then and let you know what happens.


Thanks again,

Bret


On Tue, 2010-03-23 at 22:09 -0400, David E. Wheeler wrote:
> On Mar 23, 2010, at 9:20 PM, Bret Dawson wrote:
>
> > Anyway, I'm feeling a bit lost. Does anybody recognize the deadlock
> > error, or have suggestions for troubleshooting it?
>
> A deadlock happens when two each transactions try to update two or more of the same records with foreign key relations in different orders. For example, say you have two tables:
>
> CREATE TABLE a (
> id serial primary key
> );
>
> CREATE TABLE b (
> id serial primary key,
> a_id integer references a(id)
> );
>
> Let's insert a couple of values;
>
> INSERT INTO A VALUES (DEFAULT);
> INSERT INTO B VALUES (DEFAULT, 1);¨
>
> Now, start a transaction and update b:
>
> BEGIN;
> UPDATE b SET a_id = 1;
>
> Now start a second psql section, start a transaction, and update a:
>
> BEGIN;
> UPDATE a SET id = 1;
>
> Go back to the first section and try to update a there, too:
>
> UPDATE a SET id = 1;
>
> You'll notice that it will hang, not returning. This is because the second psql connection has a write lock on table a row 1, and the session needs to wait for that lock to be released before it can actually update it.
>
> Now go back to the second section again and try to update b:
>
> UPDATE b SET a_id = 1;
> ERROR: deadlock detected
> DETAIL: Process 27255 waits for ShareLock on transaction 64808; blocked by process 27198.
> Process 27198 waits for ShareLock on transaction 64809; blocked by process 27255.
> HINT: See server log for query details.
>
> This happens because the first session already has a lock on table b row 1, and is waiting on the lock on table a row 1. The second session has the lock on table a row 1 and tries to update table b row 1, and has to wait for the lock that the first connection has.
>
> This is a deadlock: each transaction is waiting for a lock to be released by the other transaction. This is an irresolvable situation, so what PostgreSQL does is to pick one transaction and let it finish, while forcing the other to fail and need to be rolled back.
>
> Lesson over. Now, the way to eliminate deadlocks is to make sure that your code always updates tables in the same order. If both connections updated table a and then table b, there'd be no problem. The problem comes when one updates table a and then table b, while the other updates table b and then table a.
>
> It also makes sense to try to keep transactions a short as possible, to avoid long locks. This is why each job runs in its own transaction. Bricolage used to run all the jobs it was processing in a queue in a single transaction, and deadlocks were a huge problem. I haven't heard about them in quite a long while, though.
>
> So let's look at your error again:
>
> > 'error_message' => 'Unable to execute SQL statement:
> > DBD::Pg::st execute failed: ERROR: deadlock detected DETAIL: Process
> > 30142 waits for ShareLock on transaction 187816962; blocked by process
> > 30000. Process 30000 waits for ShareLock on transaction 187816914;
> > blocked by process 30142. [for Statement " UPDATE resource SET id = ?,
> > parent_id = ?, path = ?, uri = ?, size = ?, mod_time = ?, is_dir = ?,
> > media_type__id = (SELECT id FROM media_type WHERE name = ?) WHERE id = ?
> > " with ParamValues: 9=\'99437\']
>
> So this transaction is trying to update resource after updating something else, while process 30142 has already updated this resource row and is trying to update whatever it is that this process has already updated.
>
> Figuring this out is hard. My [other company](http://pgexperts.com/) has developed a pretty good set of skills to work though such things. One of our tools is this query:
>
> SELECT pg_stat_activity.datname,
> pg_class.relname,
> substr(pg_stat_activity.current_query,1,250) AS query,
> age(now(),pg_stat_activity.query_start) AS "age"
> FROM pg_stat_activity
> JOIN pg_locks ON pg_locks.pid = pg_stat_activity.procpid
> LEFT OUTER JOIN pg_class ON pg_locks.relation = pg_class.oid
> WHERE pg_locks.mode = 'AccessExclusiveLock'
> AND pg_class.relname IS NOT NULL
> ORDER by pg_stat_activity.datname, pg_class.relname;
>
> Run that when things are starting to pile up, to see what's locking what. It should help you to identify what processes are locking resource rows, which will hopefully help you to identify what code is running queries out of order.
>
> HTH,
>
> David
>
>


--
Bret Dawson
Producer
Pectopah Productions Inc.
(416) 895-7635
bret@pectopah.com
www.pectopah.com
Re: PostgreSQL and deadlock [ In reply to ]
On Mar 24, 2010, at 6:27 PM, Bret Dawson wrote:

> Thanks David. This is totally illuminating, at least for understanding
> what a deadlock is.
>
> The big mystery for me is in figuring out why a Job->list() operation
> would be doing an UPDATE at all. (Or in figuring out what's actually
> attempting the UPDATE, when I'm incorrectly blaming Job->list().)

I suspect it's your returning 0, but I'm not sure. That will trigger the job finishing and its values getting written. But since it's returning 0, no files will be written, so I'm not sure why resource would be updated.

> Anyway, the big clog usually happens in the late morning, so I'll run
> the pgexperts query then and let you know what happens.

Cool.

Best,

David
Re: PostgreSQL and deadlock [ In reply to ]
Well, I have a bit of news, but not tons. The clog happened, but this
query never returned anything:


SELECT pg_stat_activity.datname,
pg_class.relname,
substr(pg_stat_activity.current_query,1,250) AS query,
age(now(),pg_stat_activity.query_start) AS "age"
FROM pg_stat_activity
JOIN pg_locks ON pg_locks.pid = pg_stat_activity.procpid
LEFT OUTER JOIN pg_class ON pg_locks.relation = pg_class.oid
WHERE pg_locks.mode = 'AccessExclusiveLock'
AND pg_class.relname IS NOT NULL
ORDER by pg_stat_activity.datname, pg_class.relname;


So I changed the return 0 to $m->clear_and_abort, and that didn't change
anything either. Much of the time it would run correctly (killing itself
when an earlier publish job was still going), but it frequently got
stuck just doing the Job->list(), and never managed to make it to the
suicide part.


I looked through Job.pm. I'm pretty sure that this list:

my @running_copies_of_myself = Bric::Util::Job->list_ids({
executing => 1,
story_instance_id => $story->get_version_id,
failed => 0,
});

...produces this SQL:

SELECT DISTINCT a.id
FROM job a, member m, job_member c
WHERE a.id = c.object_id
AND m.id = c.member__id
AND m.active = '1'
AND a.story_instance__id = <$story_version_id>
AND a.executing = true
AND a.failed = false
ORDER BY a.id;


An EXPLAIN on that, with a hard-coded story version id, produces this:

Unique (cost=9.07..9.08 rows=1 width=4)
-> Sort (cost=9.07..9.07 rows=1 width=4)
Sort Key: a.id
-> Nested Loop (cost=0.00..9.06 rows=1 width=4)
-> Nested Loop (cost=0.00..5.43 rows=1 width=8)
-> Index Scan using idx_job__executing on job a
(cost=0.00..2.01 rows=1 width=4)
Index Cond: (executing = true)
Filter: ((story_instance__id = 68277) AND
executing AND (NOT failed))
-> Index Scan using fkx_job__job_member on
job_member c (cost=0.00..3.40 rows=1 width=8)
Index Cond: ("outer".id = c.object_id)
-> Index Scan using pk_member__id on member m
(cost=0.00..3.62 rows=1 width=4)
Index Cond: (m.id = "outer".member__id)
Filter: active
(13 rows)


I don't know much about the expense of the average query, but I wonder
if the list operation needs to bother with member and job_member at all.
Could we look only in job, unless there's a parameter asking
specifically about membership in a job group?

Greg points out that every record in the "All Jobs" group has active =
true in member. And All Jobs is the only job group.


Anyway, I'm rambling here. Does this sound like even slightly the right
approach?


Thanks,

Bret


On Thu, 2010-03-25 at 00:45 -0400, David E. Wheeler wrote:
> On Mar 24, 2010, at 6:27 PM, Bret Dawson wrote:
>
> > Thanks David. This is totally illuminating, at least for understanding
> > what a deadlock is.
> >
> > The big mystery for me is in figuring out why a Job->list() operation
> > would be doing an UPDATE at all. (Or in figuring out what's actually
> > attempting the UPDATE, when I'm incorrectly blaming Job->list().)
>
> I suspect it's your returning 0, but I'm not sure. That will trigger the job finishing and its values getting written. But since it's returning 0, no files will be written, so I'm not sure why resource would be updated.
>
> > Anyway, the big clog usually happens in the late morning, so I'll run
> > the pgexperts query then and let you know what happens.
>
> Cool.
>
> Best,
>
> David
>
>
>


--
Bret Dawson
Producer
Pectopah Productions Inc.
(416) 895-7635
bret@pectopah.com
www.pectopah.com
Re: PostgreSQL and deadlock [ In reply to ]
On Mar 25, 2010, at 12:47 PM, Bret Dawson wrote:

> I don't know much about the expense of the average query, but I wonder
> if the list operation needs to bother with member and job_member at all.
> Could we look only in job, unless there's a parameter asking
> specifically about membership in a job group?

I suppose, but that's not the problem, I suspect. This is a SELECT query, not an updating query, and so isn't triggering the deadlock.

> Greg points out that every record in the "All Jobs" group has active =
> true in member. And All Jobs is the only job group.

Yeah, but the groups are there for permissions. You gotta have them.

> Anyway, I'm rambling here. Does this sound like even slightly the right
> approach?

I'm afraid not. It's more likely that, because you're aborting, it's not writing files, so it's doing something to dissociate resources or something. But frankly, I really don't get it.

Best,

David
Re: PostgreSQL and deadlock [ In reply to ]
Oh dear. I had hoped the groups weren't as important as they seem to be.

I've attached a patch for the change I made this afternoon. It avoids
the member/job_member part if you're only calling Job->list_ids(), and
if you don't have a grp_id key in your parameter list. So if you call
Job->list, you get everything you always did. But Job->list_ids()
involves fewer tables.

In the morning (11:00am-ish, ET) I'll know for sure, but so far this
query seems less prone to sticking than the earlier version. So I'm sort
of thinking that the SQL error I posted before might not be even all
that directly related to the main problem.

Thoughts?


Cheers,

Bret





On Thu, 2010-03-25 at 22:20 -0400, David E. Wheeler wrote:
> On Mar 25, 2010, at 12:47 PM, Bret Dawson wrote:
>
> > I don't know much about the expense of the average query, but I wonder
> > if the list operation needs to bother with member and job_member at all.
> > Could we look only in job, unless there's a parameter asking
> > specifically about membership in a job group?
>
> I suppose, but that's not the problem, I suspect. This is a SELECT query, not an updating query, and so isn't triggering the deadlock.
>
> > Greg points out that every record in the "All Jobs" group has active =
> > true in member. And All Jobs is the only job group.
>
> Yeah, but the groups are there for permissions. You gotta have them.
>
> > Anyway, I'm rambling here. Does this sound like even slightly the right
> > approach?
>
> I'm afraid not. It's more likely that, because you're aborting, it's not writing files, so it's doing something to dissociate resources or something. But frankly, I really don't get it.
>
> Best,
>
> David
>
>


--
Bret Dawson
Producer
Pectopah Productions Inc.
(416) 895-7635
bret@pectopah.com
www.pectopah.com