Mailing List Archive

Stuck/Missing Grid Job for tools.william-avery-bot
Hi,

I got the email below telling me that my cron job running as
william-avery-bot had thrown an error, and I noticed that the Grid job that
it kicks off hasn't run since.

I tried deleting the job using the instructions at
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99
but it appeared "stuck".

"qstat -xml" outputs the following:
<?xml version='1.0'?>
<job_info xmlns:xsd="http://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/schemas/qstat/qstat.xsd">
<queue_info>
<job_list state="running">
<JB_job_number>9999749</JB_job_number>
<JAT_prio>0.25319</JAT_prio>
<JB_name>cron-TaxonbarSyncerBot</JB_name>
<JB_owner>tools.william-avery-bot</JB_owner>
<state>dr</state>
<JAT_start_time>2021-03-25T17:49:16</JAT_start_time>
<queue_name>task@tools-sgeexec-0916.tools.eqiad.wmflabs</queue_name>
<slots>1</slots>
</job_list>
</queue_info>
<job_info>
</job_info>
</job_info>

But when I ssh to tools-sgeexec-0916.tools.eqiad.wmflabs I see no sign of
any processes under tools.william-avery-bot, except the ones associated
with my interactive session.
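
For anyone checking for the same symptom: in Grid Engine's state codes, "d" marks a job as pending deletion and "r" as running, so "dr" is a running job that qdel has flagged but the scheduler has not yet released. A minimal sketch of picking such jobs out of "qstat -xml" output follows. The function name jobs_pending_deletion and the trimmed SAMPLE string are illustrative assumptions, not part of any Toolforge tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the qstat -xml output quoted above (illustrative sample).
SAMPLE = """<?xml version='1.0'?>
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>9999749</JB_job_number>
      <JB_name>cron-TaxonbarSyncerBot</JB_name>
      <state>dr</state>
      <queue_name>task@tools-sgeexec-0916.tools.eqiad.wmflabs</queue_name>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
"""


def jobs_pending_deletion(xml_text):
    """Return jobs whose state string contains the 'd' (deletion) flag."""
    root = ET.fromstring(xml_text)
    flagged = []
    for job in root.iter("job_list"):
        state = job.findtext("state", default="")
        if "d" in state:
            flagged.append({
                "number": job.findtext("JB_job_number"),
                "name": job.findtext("JB_name"),
                "state": state,
                "queue": job.findtext("queue_name"),
            })
    return flagged


if __name__ == "__main__":
    for job in jobs_pending_deletion(SAMPLE):
        print(job["number"], job["name"], job["state"], job["queue"])
```

In practice the XML would come from running "qstat -xml" rather than from a hard-coded string; the sample is used here only so the sketch is self-contained.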

Can anyone help resolve this or advise of a venue to raise it?

Thanks in advance,

Will

---------- Forwarded message ---------
From: Cron Daemon <root@tools.wmflabs.org>
Date: Thu, 25 Mar 2021 at 16:49
Subject: Cron <tools.william-avery-bot@tools-sgecron-01> /usr/bin/jsub -N
cron-TaxonbarSyncerBot -once -quiet ~/TaxonbarSyncerBot.sh
To: <tools.william-avery-bot@tools.wmflabs.org>


error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host
"tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud": got send error
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
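
The traceback is consistent with ElementTree being handed empty input: when the command could not reach qmaster, it presumably wrote its errors to stderr and nothing to stdout, and parsing an empty string raises exactly this ParseError. The source of /usr/bin/job is not shown in this thread, so the following is only a sketch of how a caller might guard against that case; parse_qstat_output is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_qstat_output(raw):
    """Parse qstat XML; return None if the output is empty or not valid XML."""
    if not raw.strip():
        # Nothing on stdout, e.g. because qmaster was unreachable.
        return None
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # Empty output reproduces the failure mode seen in the cron email.
    try:
        ET.fromstring(b"")
    except ET.ParseError as exc:
        print("unguarded parse failed:", exc)

    print("guarded parse returned:", parse_qstat_output(b""))
```

A caller could then treat a None result as "scheduler unavailable, try again later" instead of crashing with a traceback.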

Re: Stuck/Missing Grid Job for tools.william-avery-bot
On Fri, Mar 26, 2021 at 3:27 PM William Avery <willm.avery@gmail.com> wrote:
>
> Hi,
>
> I got the email below telling me that my cron job running as william-avery-bot had thrown an error, and I noticed that the Grid job that it kicks off hasn't run since.
>
> I tried deleting the job using the instructions at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99 but it appeared "stuck".

I have "force deleted" your job using my Toolforge admin rights.

$ sudo qdel -f 9999749
root forced the deletion of job 9999749

The Toolforge grid engine had numerous problems yesterday which led to
the scheduler losing track of the state of many jobs. Brooke did
several rounds of looking for these and cleaning the queue state, but
obviously yours was not cleaned up in that process. Thank you for your
report, and I hope you can get your tool back into its proper working
state.

Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Stuck/Missing Grid Job for tools.william-avery-bot
Thanks Bryan,

It has now resumed its not particularly critical task:
https://www.wikidata.org/wiki/Special:Contributions/William_Avery_Bot

Will

On Fri, 26 Mar 2021 at 21:45, Bryan Davis <bd808@wikimedia.org> wrote:

> On Fri, Mar 26, 2021 at 3:27 PM William Avery <willm.avery@gmail.com> wrote:
> >
> > Hi,
> >
> > I got the email below telling me that my cron job running as
> > william-avery-bot had thrown an error, and I noticed that the Grid job that
> > it kicks off hasn't run since.
> >
> > I tried deleting the job using the instructions at
> > https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Stopping_jobs_with_%E2%80%98qdel%E2%80%99_and_%E2%80%98jstop%E2%80%99
> > but it appeared "stuck".
>
> I have "force deleted" your job using my Toolforge admin rights.
>
> $ sudo qdel -f 9999749
> root forced the deletion of job 9999749
>
> The Toolforge grid engine had numerous problems yesterday which led to
> the scheduler losing track of the state of many jobs. Brooke did
> several rounds of looking for these and cleaning the queue state, but
> obviously yours was not cleaned up in that process. Thank you for your
> report, and I hope you can get your tool back into its proper working
> state.
>
> Bryan