f*****h Posts: 228 | 1 [The following text is reposted from the Computation board]
From: franfyh (franfyh), Board: Computation
Subject: how to shut down pbs server or kill all my jobs
Posted at: BBS 未名空间站 (Tue Oct 12 23:09:06 2010, US Eastern)
Hi, my PBS server seems to be jammed. None of the commands (qstat, qdel, etc.) seem to work.
I'm thinking of shutting down the PBS server or killing all the jobs.
Does anybody here have any ideas about how to do this? |
|
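For reference, a minimal Python sketch of the usual PBS/Torque route for the question above: `qselect -u <user>` prints that user's job IDs one per line, and `qdel` accepts a list of job IDs. The helper names and the sample job IDs are hypothetical, and the sketch only builds the command lines rather than running them. Because these client commands also talk to pbs_server, they would likely hang the same way if the server is truly unresponsive; in that case an administrator may need `qterm -t quick` or a restart of the pbs_server daemon itself.

```python
import shlex


def qselect_command(user):
    """Return the argv that lists all job IDs owned by `user`."""
    return ["qselect", "-u", user]


def build_qdel_command(qselect_output):
    """Turn qselect output (one job ID per line) into a qdel argv.

    Returns an empty list when there are no jobs to delete.
    """
    job_ids = [line.strip() for line in qselect_output.splitlines() if line.strip()]
    return ["qdel", *job_ids] if job_ids else []


# Hypothetical user name and qselect output, for illustration only.
print(shlex.join(qselect_command("someuser")))
print(shlex.join(build_qdel_command("101.server\n102.server\n")))
```

Building the argv as a list avoids shell quoting problems if the commands are later passed to `subprocess.run`.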
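S**********l Posts: 3835 | 3 One problem is that jobs already running on a compute node get killed.
Another is that a job gets displaced by the next job before it has finished running (there is no wall time limit).
Why?
Also, qstat only shows my own jobs, not other users' jobs.
Did something go wrong during installation? |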
|
L***i Posts: 11 | 4 On our PC cluster, we use DQS 3.3.2 to manage all jobs. There are
several classes of queues, for example X1000, X2000, and X3000. I have a
problem: jobs cannot be submitted to one queue, X2000. For instance, if
you execute 'qsub file.job' and then 'qstat', no job for queue X2000
appears in the queued or running list. But sometimes the jobs
do get picked up, which seems very strange to me.
Does anyone know why? I would greatly appreciate your help. |
|
d*****w Posts: 124 | 6 Are you sure the queue is okay?
1. Is dqs_execd running on the node?
2. Have the dead jobs in the queue been cleared?
Try qstat -f to find the problem. |
|
d*****w Posts: 124 | 7 It seems okay. Perhaps file.job is running on other nodes.
If qstat -f shows X2000 is UP, then it should be normal.
|
|
L***i Posts: 11 | 8 I checked all queues with "qstat -f"; every machine is UP. But the machines
with the X2000 queue cannot pick up jobs, even though dqs_execd is running.
The following messages are listed in err_file (host033 runs the qmaster daemon):
time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer host033 errno= 111 ../SRC/dqs_io.c 212 /usr/local/DQS_332/bin/dqs_execd332 host067
time=1058023801 DQS_ERROR_0458 unable to connect to host "host033" ../SRC/dqs_send_receive.c 170 /usr/ |
|
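The err_file lines quoted above can be decoded a little further. The following is a rough Python sketch, assuming the `time=` field is a Unix epoch timestamp in seconds and that `errno=` carries a standard OS error number; the function name and the regular expressions are illustrative and not part of DQS. On Linux, errno 111 is ECONNREFUSED, which would suggest that nothing was accepting connections on the qmaster port of host033 at that moment.

```python
import errno
import re
from datetime import datetime, timezone

# Layout inferred from the two quoted lines: time=<epoch> <DQS code> <message>
LINE_RE = re.compile(r"time=(?P<time>\d+)\s+(?P<code>DQS_[A-Z]+_\d+)\s+(?P<message>.*)")
ERRNO_RE = re.compile(r"errno=\s*(\d+)")


def parse_dqs_line(line):
    """Parse one err_file line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {
        # Assumption: the time field counts seconds since the Unix epoch.
        "when": datetime.fromtimestamp(int(match["time"]), tz=timezone.utc),
        "code": match["code"],
        "message": match["message"],
        "errno": None,
        "errno_name": None,
    }
    err = ERRNO_RE.search(match["message"])
    if err:
        number = int(err.group(1))
        record["errno"] = number
        # The symbolic name is platform dependent; None if unknown on this OS.
        record["errno_name"] = errno.errorcode.get(number)
    return record


sample = ("time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer "
          "host033 errno= 111 ../SRC/dqs_io.c 212 /usr/local/DQS_332/bin/dqs_execd332 host067")
print(parse_dqs_line(sample))
```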
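t*****z Posts: 812 | 9 Programs are normally submitted through PBS; otherwise, if everyone submitted jobs directly at once, the machine would slow to a crawl.
Check whether the qstat command is available. |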
|
d******x Posts: 11837 | 11 Tried it out, and it really does work...
######################
# Run qstat -f; catch stores its output (or the error message) in val1
catch {exec qstat -f} val1
# Case-insensitive check for "error" anywhere in the captured text
if {[string match -nocase "*ERROR*" $val1]} {
    puts "find error for exec..."
}
puts $val1
##########################
Output on screen:
find error for exec...
error: commlib error: access denied (client IP resolved to host name "einstein-a". This is not identical to clients host name "einstein")
error: unable to contact qmaster using port 536 on host "sgemaster" |
|
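For anyone scripting the same check outside Tcl, here is a rough Python equivalent of the snippet above. It is only a sketch: the function name is made up, and matching the word "error" case-insensitively is the same heuristic the Tcl code uses, not a documented qstat interface.

```python
import subprocess


def run_and_flag_errors(argv):
    """Run a command and report whether its combined output mentions "error".

    Returns (flagged, output). A command that cannot be started at all
    is also reported as flagged, with the OS error text as output.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        return True, str(exc)
    output = proc.stdout + proc.stderr
    return "error" in output.lower(), output


if __name__ == "__main__":
    flagged, output = run_and_flag_errors(["qstat", "-f"])
    if flagged:
        print("find error for exec...")
    print(output)
```

Checking `proc.returncode` as well would be more robust than string matching alone.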