L***i Posts: 11 | 1 On our PC cluster, we use DQS 3.3.2 to manage all jobs. There are
several classes of queues, for example X1000, X2000, and X3000. I now have
a problem: jobs cannot be submitted to one queue, X2000. For instance, if
you run 'qsub file.job' and then 'qstat', no job for queue X2000 appears
in either the queued list or the running list. Yet sometimes the jobs
do get picked up, which seems very strange to me.
Does anyone know what is going on? I would greatly appreciate your help. | d*****w Posts: 124 | 2 Are you sure the queue is okay?
1. Is dqs_execd running on the node?
2. Have the dead jobs in the queue been cleared?
Try qstat -f to find the problem.
| d*****w Posts: 124 | 3
It seems okay. Perhaps file.job is running on the other nodes.
If qstat -f shows X2000 as UP, then it should be working normally.
| L***i Posts: 11 | 4 I checked all queues with "qstat -f", and every machine is UP. But the machines
serving the X2000 queue still do not pick up jobs, even though dqs_execd is running on them.
The err_file contains the following messages (host033 is the host running the qmaster
daemon):
time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer host033 errno= 111 ../SRC/dqs_io.c 212 /usr/local/DQS_332/bin/dqs_execd332 host067
time=1058023801 DQS_ERROR_0458 unable to connect to host "host033" ../SRC/dqs_send_receive.c 170 /usr/
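A note on reading these messages: assuming the nodes run Linux, errno 111 is ECONNREFUSED, meaning host033 actively refused the TCP connection from host067 (typically because nothing was listening on the target port at that moment). The sketch below is only an illustration, not part of DQS: it splits an err_file line into its timestamp, message code, and errno, and offers a plain TCP probe for retrying the connection by hand. The function names parse_dqs_line, describe_errno, and probe are made up for this example, and the qmaster port is not given anywhere in the thread, so it has to be supplied from the local DQS configuration.

```python
import errno
import re
import socket
from datetime import datetime, timezone

# First log line from the post above, with the wrapped path rejoined.
SAMPLE = (
    "time=1058023801 DQS_WARNING_0257 dqs_open_tcp: cannot connect to peer "
    "host033 errno= 111 ../SRC/dqs_io.c 212 "
    "/usr/local/DQS_332/bin/dqs_execd332 host067"
)

# Assumed layout: "time=<epoch seconds> <DQS_LEVEL_NNNN> <free text>".
LOG_RE = re.compile(r"time=(\d+)\s+(DQS_[A-Z]+_\d+)\s+(.*)")
ERRNO_RE = re.compile(r"errno=\s*(\d+)")


def parse_dqs_line(line):
    """Return a dict with UTC time, message code, errno (or None), and text."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    text = match.group(3)
    err = ERRNO_RE.search(text)
    return {
        "time": datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc),
        "code": match.group(2),
        "errno": int(err.group(1)) if err else None,
        "text": text,
    }


def describe_errno(number):
    """Name an errno using the local platform's table.

    Only meaningful when run on the same OS family that wrote the log,
    because errno numbers differ between platforms.
    """
    return errno.errorcode.get(number, "unknown")


def probe(host, port, timeout=3.0):
    """Attempt a plain TCP connect; return (True, None) or (False, OSError)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, exc
```

Running parse_dqs_line(SAMPLE) gives the code DQS_WARNING_0257, errno 111, and a UTC time of 2003-07-12 15:30:01. Calling probe("host033", port) from host067, with port set to whatever the qmaster is configured to listen on, would show whether the refusal is still happening; a refusal that comes and goes would fit the observation that jobs are only sometimes picked up.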