Setting Up a MOP Worker Site.
Overview
========
MOP is a system for distributing CMS production jobs. There is a MOP master
site where jobs are defined by the CMS production scripts. The mop_submitter
then distributes those jobs to remote sites through CondorG and Globus. When
the jobs are finished, the output is collected by GDMP.
Preparing remote sites for MOP jobs
===================================
Remote site overview:
In general, the remote site will need VDT installed on one or more machines
and a Globus job manager for the local batch system.
Accounts:
The MOP worker site should map the cmsprod account from fnal.gov to one of its
local user accounts. The contact string is:
    "/O=Grid/O=Globus/OU=fnal.gov/CN=CMS Production"
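On Globus installations this kind of mapping is usually kept in the
grid-mapfile, one line per subject: the quoted subject followed by the local
account. The sketch below formats such a line and looks a subject up in
existing grid-mapfile text. The local account name "cmsprod_local" is a
hypothetical placeholder; each site picks its own.

```python
# Sketch: build and look up a grid-mapfile entry for the CMS production subject.
# "cmsprod_local" is a hypothetical local account name, not part of the source.

CMS_PROD_SUBJECT = "/O=Grid/O=Globus/OU=fnal.gov/CN=CMS Production"


def gridmap_entry(subject, local_user):
    """Return one grid-mapfile line: quoted subject, a space, the local account."""
    return f'"{subject}" {local_user}'


def lookup_local_user(gridmap_text, subject):
    """Return the first local account mapped to `subject`, or None if unmapped."""
    for line in gridmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith('"'):
            # Subjects are quoted because they may contain spaces.
            end = line.find('"', 1)
            if end == -1:
                continue  # malformed line, ignore it
            dn, accounts = line[1:end], line[end + 1:].strip()
        else:
            dn, _, accounts = line.partition(" ")
            accounts = accounts.strip()
        if dn == subject and accounts:
            # Several accounts may be listed, separated by commas.
            return accounts.split(",")[0]
    return None


entry = gridmap_entry(CMS_PROD_SUBJECT, "cmsprod_local")
print(entry)
```

Running `lookup_local_user` against the site's current grid-mapfile text before
editing it is a cheap way to avoid adding a duplicate entry.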
Remote site summary:
The following information needs to be conveyed to the MOP Master site owner:
--------------------------------------------------------------------------
(1-1) Stage-in job manager.
(1-2) GLOBUS_LOCATION value.
(1-3) Shared directory for MOP files, if not the home directory.
(2-1) Run job manager.
(2-2) Location of the CMS DAR installation. NB: only the path needs to be
      provided. The DAR file(s) themselves can be installed through MOP.
--------------------------------------------------------------------------
Example remote site values:
The first "remote" site is at Fermilab. Here are the Fermilab values:
(1-1) droidf.fnal.gov:/jobmanager
(1-2) /opt/globus/globus20
(1-3) /cms/work
(2-1) droidf.fnal.gov:/jobmanager-condor
(2-2) /data/dar
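As a convenience, a site administrator could collect the five values in a small
record before sending them to the MOP Master site owner. The sketch below is
only an illustration: the class name WorkerSiteInfo and its field names are
hypothetical and do not reflect the file format used by mop_submitter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSiteInfo:
    """Hypothetical container for the five values a worker site reports."""
    stage_in_jobmanager: str  # (1-1)
    globus_location: str      # (1-2)
    shared_directory: str     # (1-3)
    run_jobmanager: str       # (2-1)
    dar_location: str         # (2-2)

    def items(self):
        """Pair each value with its label from the site summary list."""
        labels = ("1-1", "1-2", "1-3", "2-1", "2-2")
        values = (
            self.stage_in_jobmanager,
            self.globus_location,
            self.shared_directory,
            self.run_jobmanager,
            self.dar_location,
        )
        return list(zip(labels, values))

    def missing(self):
        """Return the labels whose value was left empty."""
        return [label for label, value in self.items() if not value.strip()]

    def summary(self):
        """Render the values in the same '(n-m) value' layout as the list."""
        return "\n".join(f"({label}) {value}" for label, value in self.items())


# The Fermilab example values from the text.
fermilab = WorkerSiteInfo(
    stage_in_jobmanager="droidf.fnal.gov:/jobmanager",
    globus_location="/opt/globus/globus20",
    shared_directory="/cms/work",
    run_jobmanager="droidf.fnal.gov:/jobmanager-condor",
    dar_location="/data/dar",
)
print(fermilab.summary())
```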
The site parameters are stored in the mop_submitter/site-info directory. Job
manager and scratch directory info is in the *.site.* files. The .vars files
hold the following information:
Addendum 1:
============
(1) In the $GLOBUS_LOCATION/etc/globus-job-manager-condor.conf file, edit the
two lines at the bottom according to the comment above them, adding INTEL and
LINUX as arguments, like so:
    -condor-arch INTEL
    -condor-os LINUX
and remove the two comment lines (appearing just before):
    # Edit the following two lines to complete
    # the configuration of your condor jobmanager
Then add another line to retain debugging info on errors:
    -save-logfile on_errors
Finally, rename the file to globus-job-manager-condor-INTEL-LINUX.conf
(2) Let's use testulix.phys.ufl.edu as an example. In
$GLOBUS_LOCATION/etc/jobmanager-condor, add "-condor-os LINUX -condor-arch INTEL"
to the end of the argument list, and change the existing -conf and -rdn
arguments to refer to
    globus-job-manager-condor-INTEL-LINUX.conf
and
    testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX
instead of the old names. Finally, rename the file to
jobmanager-condor-INTEL-LINUX. We thereafter refer to that job manager as
    testulix.phys.ufl.edu/jobmanager-condor-INTEL-LINUX
instead of just
    testulix.phys.ufl.edu/jobmanager-condor
(when using globus-job-run, Condor-G, etc).
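The renaming in steps (1) and (2) follows one pattern: the architecture and
operating system are appended to "condor" in every name. A minimal sketch of
that pattern is shown here; the function name jobmanager_names and the
dictionary keys are hypothetical, chosen only for this illustration.

```python
def jobmanager_names(host, arch="INTEL", opsys="LINUX"):
    """Derive the renamed files and contact string described in steps (1)-(2)."""
    suffix = f"condor-{arch}-{opsys}"
    return {
        # New name of the job manager configuration file (step 1).
        "conf_file": f"globus-job-manager-{suffix}.conf",
        # New name of the job manager service file (step 2).
        "service_file": f"jobmanager-{suffix}",
        # Contact string to use with globus-job-run, Condor-G, etc.
        "contact": f"{host}/jobmanager-{suffix}",
        # Arguments appended to the end of the service file's argument list.
        "extra_args": f"-condor-os {opsys} -condor-arch {arch}",
    }


names = jobmanager_names("testulix.phys.ufl.edu")
for key, value in names.items():
    print(f"{key}: {value}")
```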
At Florida, the condor job manager is then:
    % cat jobmanager-condor-INTEL-LINUX
    stderr_log,local_cred - /usr/local/globus/globus-2.0/libexec/globus-job-manager
    globus-job-manager
    -conf /usr/local/globus/globus-2.0/etc/globus-job-manager-condor-INTEL-LINUX.conf
    -type condor
    -rdn testulix.phys.ufl.edu/condor-INTEL-LINUX -machine-type unknown
    -publish-jobs
    -condor-os LINUX -condor-arch INTEL
and the condor job manager config file reads as:
    % cat globus-job-manager-condor-INTEL-LINUX.conf
    -home "/usr/local/globus/globus-2.0"
    -e /usr/local/globus/globus-2.0/libexec
    -globus-gatekeeper-host testulix.phys.ufl.edu
    -globus-gatekeeper-port 2119
    -globus-gatekeeper-subject "/O=Grid/O=Globus/CN=testulix.phys.ufl.edu"
    -globus-host-cputype i686
    -globus-host-manufacturer pc
    -globus-host-osname Linux
    -globus-host-osversion 2.2.14-5.0smp
    -condor-arch INTEL
    -condor-os LINUX
    -save-logfile on_errors
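The edits from step (1) can also be applied to the text of the configuration
file programmatically. The sketch below assumes that the unedited file ends
with bare "-condor-arch" and "-condor-os" lines preceded by the two
instruction comments quoted in step (1); that layout is an assumption, so
compare it with the actual file first. The function name patch_condor_conf is
hypothetical. Renaming the file is left to the administrator.

```python
def patch_condor_conf(text, arch="INTEL", opsys="LINUX"):
    """Apply the step (1) edits to the text of globus-job-manager-condor.conf."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop the two instruction comments that precede the condor flags.
        if stripped.startswith("#") and (
            "Edit the following two lines" in stripped
            or "configuration of your condor jobmanager" in stripped
        ):
            continue
        parts = stripped.split()
        flag = parts[0] if parts else ""
        if flag == "-condor-arch":
            out.append(f"-condor-arch {arch}")
        elif flag == "-condor-os":
            out.append(f"-condor-os {opsys}")
        elif flag == "-save-logfile":
            continue  # re-added exactly once below
        else:
            out.append(line)
    # Retain debugging info on errors.
    out.append("-save-logfile on_errors")
    return "\n".join(out) + "\n"


# Assumed shape of the tail of an unedited file (illustrative only).
sample = (
    '-home "/usr/local/globus/globus-2.0"\n'
    "# Edit the following two lines to complete\n"
    "# the configuration of your condor jobmanager\n"
    "-condor-arch\n"
    "-condor-os\n"
)
patched = patch_condor_conf(sample)
print(patched, end="")
```

Applying the function to an already patched file leaves it unchanged, so it is
safe to run more than once.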
Addendum 2:
============
The default Condor installation is configured so that Condor suspends all jobs
when it detects keyboard activity. To fix this, modify "PART 3" of the
$CONDOR_LOCATION/etc/condor_config file, underneath where it says:
    #####################################################################
    ## This where you choose the configuration that you would like to
    ## use. It has no defaults so it must be defined. We start this
    ## file off with the UWCS_* policy.
    ######################################################################
The modifications are:
Original condor_config file:
    START = $(UWCS_START)
    SUSPEND = $(UWCS_SUSPEND)
    CONTINUE = $(UWCS_CONTINUE)
    PREEMPT = $(UWCS_PREEMPT)
    KILL = $(UWCS_KILL)
Modified condor_config file:
    #START = $(UWCS_START)
    START = True
    #SUSPEND = $(UWCS_SUSPEND)
    #CONTINUE = $(UWCS_CONTINUE)
    #PREEMPT = $(UWCS_PREEMPT)
    SUSPEND = False
    CONTINUE = True
    PREEMPT = False
    #KILL = $(UWCS_KILL)
    KILL = $(ActivityTimer) > $(MaxVacateTime)
    #PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
    PREEMPTION_REQUIREMENTS = False
Then, to enact the changes for the new Condor configuration, execute
    % condor_reconfig node1 node2 node3 ...
where node1, node2, node3 ... are the different Condor compute machines
(including the Condor Master machine).
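The condor_config modification described above can be sketched as a text
transformation: each UWCS_* policy line is commented out and followed by its
replacement. This is a minimal illustration, not a Condor tool; the names
OVERRIDES and apply_overrides are hypothetical. It places each replacement
directly under its commented-out line, which differs slightly in ordering from
the listing above but yields the same settings.

```python
import re

# Replacement values taken from the "Modified condor_config file" listing.
OVERRIDES = {
    "START": "True",
    "SUSPEND": "False",
    "CONTINUE": "True",
    "PREEMPT": "False",
    "KILL": "$(ActivityTimer) > $(MaxVacateTime)",
    "PREEMPTION_REQUIREMENTS": "False",
}

# Matches lines of the form "NAME = $(UWCS_NAME)" that are not commented out.
_ASSIGN = re.compile(r"^\s*([A-Za-z_]+)\s*=\s*\$\(UWCS_([A-Za-z_]+)\)\s*$")


def apply_overrides(config_text):
    """Comment out the UWCS_* policy lines and add the replacement settings."""
    out = []
    for line in config_text.splitlines():
        match = _ASSIGN.match(line)
        if match and match.group(1) == match.group(2) and match.group(1) in OVERRIDES:
            name = match.group(1)
            out.append("#" + line.strip())              # keep the old setting as a comment
            out.append(f"{name} = {OVERRIDES[name]}")   # new setting
        else:
            out.append(line)
    return "\n".join(out) + "\n"


original = (
    "START = $(UWCS_START)\n"
    "SUSPEND = $(UWCS_SUSPEND)\n"
    "CONTINUE = $(UWCS_CONTINUE)\n"
    "PREEMPT = $(UWCS_PREEMPT)\n"
    "KILL = $(UWCS_KILL)\n"
)
modified = apply_overrides(original)
print(modified, end="")
```

After writing the modified text back to condor_config, condor_reconfig still
has to be run as described above for the change to take effect.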