DPE 2_4_1 Production Operations Manual
(MOP for CMKIN/CMSIM/OSCAR/Hit/Digi with/without Pile-Up with direct
dcache stage-in/stage-out and access)
Quick Reference: Various (Modes of) Operations
General Setup:
Please carefully read all Notes and Important notes on this page.
1. Install DPE-Master 2_3 using pacman, as specified on the DPE page.
2. It will create a setup.sh file for you in your prod-ops directory with these contents:
export PROD_OPS=`pwd`
export DPE_PATH=<<DPEClientPath>>
export GLOBUS_PATH=$DPE_PATH/globus
source $DPE_PATH/setup.sh
export MOP_DIR=$PROD_OPS
#
##MOP Staging Setup
#
#Stage-in
#
# Set this flag to true if you want your input data files to be staged in directly from dCache
export MOP_DCACHE_STAGEIN_FLAG=true
#
# Provide the dCache host and path of the input files.
# Example: For assignment 3921, the input fz files were generated by assignment 3417
export MOP_DCACHE_STAGEIN_HOST=cmsgridftp.fnal.gov
export MOP_DCACHE_STAGEIN_PATH=/prod/PCP/3417/data     # <-------- pre-existing directory in dCache (mind the mount point, /pnfs/cms for FNAL)
#
#Stage-out
#
export MOP_LOG_STAGEOUT_HOST=cmsgridftp.fnal.gov
export MOP_DATA_STAGEOUT_HOST=cmsgridftp.fnal.gov
export MOP_MASTER_HOST=`hostname`
export MOP_LOG_GUC_DIR=/prod/PCP/validation/test/anzar/log     # <-------- pre-create this directory in dCache (mind the mount point)
export MOP_DATA_GUC_DIR=/prod/PCP/validation/test/anzar/data   # <-------- pre-create this directory in dCache (mind the mount point)
#
## McRunjob setup
#
export PROD_RESOURCES=$PROD_OPS/McRunjob/cms/ImpalaLite
export TrackingPath=$PROD_OPS/McRunjob/cms/ImpalaLite/cms_db
export MC_RUNJOB_PATH=$PROD_OPS/McRunjob/mcj_scripts/IMPLite
export localCacheArea=$PROD_OPS/McRunjob/mcj_scripts/IMPLite/localCache     # <--------- pre-create this directory
export PYTHONPATH=$PROD_OPS/mop_submitter:$PROD_OPS/McRunjob/py_script
export commonOutDir=$PROD_OPS/IGT-tests/commonOutDir
export MOP_MATCHMAKER_URL=<<match-maker-hostname>>:/jobmanager-condor     # <<--------------- This is a new variable
Note: In general the match-maker-hostname is the same as the MOP master host. One could also use plain jobmanager (fork), but using jobmanager-condor (condor) is highly recommended, even if that means setting up Condor for the local host only.
3. Source setup.sh
4. If you are user "X", make sure your DN is mapped to your user-id "X" in the grid-mapfile on the gatekeeper specified by MOP_MATCHMAKER_URL above (in general, the MOP Master). This is not required on other worker nodes. Contact Yujun Wu (yujun@fnal.gov) or the VOMS manager to make sure this happens on your MOP Master. If this condition is not met, the Match-maker might not function.
5. cd mop_submitter and run
./mop_matchmaker_monitor.py --scheduler --group_size = 20
This will start the Condor_G Match-Maker. Verify by running condor_q. This process should always be running.
Take special care when using wildcards with condor_rm not to remove this process too. If that happens, just restart it as described here.
6. Make sure the site files (in the mop_submitter/site-info directory) are UP TO DATE. Each site is now represented ONLY by a <site>.vars file (also read the note below on backward compatibility), which needs to have the following variables:
MOP_MAX_JOBS=100     <<------- Total number of jobs allowed at a site.
# Globus gatekeeper contact strings.
MOP_REMOTE_JOB_MANAGER_FOR_RUN=<worker node>:/jobmanager-condor
MOP_REMOTE_JOB_MANAGER_FOR_STAGE_IN=<worker node>:/jobmanager
MOP_REMOTE_JOB_MANAGER_FOR_STAGE_OUT=<worker node>:/jobmanager
MOP_REMOTE_JOB_MANAGER_FOR_PUBLISH=<worker node>:/jobmanager
MOP_REMOTE_JOB_MANAGER_FOR_CLEANUP=<worker node>:/jobmanager
# Where to create working directories for jobs.
MOP_REMOTE_RUNTIME_AREA=/home/anzar/MOPTMPAREA
MOP_EXPORT_DIR=/home/anzar/MOPTMPAREA/flatfiles
# this is the dir under which Globus is installed on the remote system
MOP_REMOTE_VDT_LOCATION=/vdt
# this is the dir under which the CMS DAR is installed on the remote system
MOP_REMOTE_DAR_ROOT=/home/anzar/DAR
##Leave this one like this
MOP_NO_SHARED_FS=N
###These are new parameters for localizing jobs on a worker node
#To turn on localization
MOP_USE_WORKER_SCRATCH=Y
#Local worker node area where you want MOP to create the runtime area for your jobs (sufficiently big, ~2 GB)
MOP_WORKER_SCRATCH_AREA=/tmp
#PU location, which is accessible to worker nodes
MOP_PURuntimeLocation=/MYPUSample/OnMyDisk/PUSamples/OSCAR_2_4_5
#Direct Dcache Access from worker (described below)
##Set to true if you want the worker node to pull files directly out of dcache
MOP_WN_DCACHE_FILE_CP=false
## Location of the preload library for pull/push of data and log files directly from/into dcache
MOP_SITE_LD_PRELOAD=/home/uscms01/mop/export/libpdcap.so
Also see: Direct Stage-in from dcache/cmsgridftp server
Note:
- The old format of 05 files per site is NOT preserved, so there is no need to maintain 05 files per site.
- Also make sure that the 05 site files for the "Generic site", plus generic.vars, are always present:
generic.site, generic.site.publish, generic.site.stage-out, generic.site.cleanup, generic.site.stage-in, generic.vars
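A site file can be sanity-checked before jobs are created. The following is only a minimal sketch in Python, assuming that a <site>.vars file consists of plain KEY=VALUE lines with # comments, as in the listing above. The helper names parse_vars and missing_keys, and the choice of which variables count as required, are illustrative and are not part of MOP.

```python
# Hypothetical helper, not part of MOP. Assumes plain KEY=VALUE lines.
REQUIRED_KEYS = [
    "MOP_MAX_JOBS",
    "MOP_REMOTE_JOB_MANAGER_FOR_RUN",
    "MOP_REMOTE_JOB_MANAGER_FOR_STAGE_IN",
    "MOP_REMOTE_JOB_MANAGER_FOR_STAGE_OUT",
    "MOP_REMOTE_JOB_MANAGER_FOR_PUBLISH",
    "MOP_REMOTE_JOB_MANAGER_FOR_CLEANUP",
    "MOP_REMOTE_RUNTIME_AREA",
    "MOP_EXPORT_DIR",
    "MOP_REMOTE_VDT_LOCATION",
    "MOP_REMOTE_DAR_ROOT",
]


def parse_vars(text):
    """Return a dict of KEY -> VALUE, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not values.get(key)]
```

For example, missing_keys(parse_vars(open("site-info/mysite.vars").read())) would list the variables still to be filled in for a hypothetical site called mysite.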
The following new Modes of Operation can be selected depending on the settings of Environment and MOP variables.
Direct Worker Node Local Area/Cache Writing
This mode enables worker nodes to write locally, instead of on the head node or an NFS-mounted shared area. This reduces load on the head node in the case of NFS-mounted areas.
To turn on this mode, these MOP variables in <site>.vars (the site file in the site-info directory) need to be set:
MOP_USE_WORKER_SCRATCH=Y
MOP_WORKER_SCRATCH_AREA=/tmp
Note: Direct writing to the Dcache area is still being worked on.
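Since the site-file listing asks for a scratch area that is sufficiently big (about 2 GB), it may help to check free space on a worker node before enabling this mode. This is a rough sketch only; the function name has_free_space is hypothetical, and the 2 GB default is simply taken from the figure quoted in the site file comments.

```python
import shutil


def has_free_space(path, required_bytes=2 * 1024**3):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free (default: 2 GB, per the site file comment)."""
    return shutil.disk_usage(path).free >= required_bytes
```

On a worker node, has_free_space("/tmp") would then correspond to the MOP_WORKER_SCRATCH_AREA=/tmp setting shown above.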
Direct Stage-in from Dcache
Worker sites can stage in directly from dcache even though they do not have dcache access. They use the dcache-grid-ftp gate to pull files onto the head node, similar to pulling files from the MOP master node. In this case only the "input data files" (FZ/ntpl/EVD*) are pulled from Dcache, while the rest of the stage-in operation pulls files from the MOP master.
To turn on this mode, these environment variables (in setup.csh/sh) need to be set before job creation:
setenv MOP_DCACHE_STAGEIN_FLAG true
setenv MOP_DCACHE_STAGEIN_HOST cmsgridftp.fnal.gov
setenv MOP_DCACHE_STAGEIN_PATH /prod/PCP/3401/data/
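Because these variables must be in place before job creation, a quick consistency check can catch a missing host or path early. The sketch below is an assumption-laden illustration: check_stagein_env is a hypothetical helper, and it assumes the flag only ever takes the literal values true or false.

```python
import os


def check_stagein_env(env=None):
    """Return a list of problems with the MOP_DCACHE_STAGEIN_* settings.

    `env` defaults to the current process environment; a plain dict can
    be passed instead for testing.
    """
    env = os.environ if env is None else env
    problems = []
    flag = env.get("MOP_DCACHE_STAGEIN_FLAG", "")
    if flag not in ("true", "false"):
        problems.append("MOP_DCACHE_STAGEIN_FLAG should be 'true' or 'false'")
    if flag == "true":
        # Host and path only matter when direct stage-in is switched on.
        for name in ("MOP_DCACHE_STAGEIN_HOST", "MOP_DCACHE_STAGEIN_PATH"):
            if not env.get(name):
                problems.append(name + " is not set")
    return problems
```

An empty list means the three variables are mutually consistent; it does not verify that the dcache path actually exists.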
Direct Dcache Access from Worker Node (New MODE of operation)
This is a new mode of operation in MOP, in which Worker Nodes that have direct Dcache access can stage in and stage out without involving the Head Node. This saves time and reduces the number of failure modes.
To turn on this mode, these MOP variables in <site>.vars (the site file in the site-info directory) need to be set:
MOP_WN_DCACHE_FILE_CP=true
MOP_SITE_LD_PRELOAD=/storage/data/lib/libpdcap.so     <===== Location of preload library (accessible to head node).
This environment variable (in setup.csh/sh) also needs to be set before job creation:
setenv MOP_DCACHE_STAGEIN_PATH /prod/PCP/3401/data/
Site file Creation Using ConfMon:
The site file for a particular site can be generated by running the configuration monitor client, like this:
<DPE>/confmon/client/client_query_glue.py site-host GIIS-host mop
IMPORTANT: Turning ON/OFF a worker site to Matchmaker.
Sites can be turned on or off for the Match-maker, i.e. the Match-maker will or will not submit jobs to a particular site.
- To make a site visible to the Match-maker, just "touch" a <sitename>.ClassAd file in the mop_submitter/site-info directory. Almost immediately, you will notice that the Match-maker puts the site ClassAd into this file.
- To remove a submission site, just delete the <sitename>.ClassAd file from the mop_submitter/site-info directory.
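The touch/delete procedure above can also be scripted. The following is a small sketch using only the Python standard library; the function names enable_site, disable_site and enabled_sites are hypothetical, and the only behaviour assumed is what the two bullets describe (an empty <sitename>.ClassAd file makes the site visible, removing it hides the site).

```python
from pathlib import Path


def enable_site(site_info_dir, sitename):
    """Create an empty <sitename>.ClassAd file (equivalent to `touch`)."""
    path = Path(site_info_dir) / (sitename + ".ClassAd")
    path.touch(exist_ok=True)
    return path


def disable_site(site_info_dir, sitename):
    """Delete <sitename>.ClassAd if present; return True if a file was removed."""
    path = Path(site_info_dir) / (sitename + ".ClassAd")
    if path.exists():
        path.unlink()
        return True
    return False


def enabled_sites(site_info_dir):
    """List the site names that currently have a .ClassAd file."""
    return sorted(p.stem for p in Path(site_info_dir).glob("*.ClassAd"))
```

For instance, enable_site("mop_submitter/site-info", "mysite") has the same effect as touching mysite.ClassAd by hand, where mysite is a placeholder name.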
Submitting to a site directly
1. Create jobs for "generic" site. (ALWAYS.)
2. Submit to the <site>.
To turn on this mode, these environment (in setup.csh/sh) variables
need to be set.
setenv MOP_DCACHE_STAGEIN_FLAG false
setenv MOP_DCACHE_STAGEIN_HOST cmsgridftp.fnal.gov
#echo "Direct stage in of EVD files from dcache for 4642, from 4034"
echo "Direct stage in of FZ files from dcache for 3921, from 3414"
#setenv MOP_DCACHE_STAGEIN_PATH /WAX/2/prod/PCP/4034/data
setenv MOP_DCACHE_STAGEIN_PATH /prod/PCP/3401/data/
Running CMKIN Production assignments
7. Edit McRunjob/cms/ImpalaLite/CMKIN.conf
Update the following variables as given here:
IfSaveOutput=true     <<<----- Note the change from previous version
OutProtocol=cp
OutputPath=/data/ANZAR/dgt-prod-ops/commonOutDir
useBoss=0
useDAR=1     <<<----- Note the change from previous version: instead of true/false we use 1/0 now.
DARpath=$MOP_REMOTE_DAR_ROOT
EnvironmentType=MOP
8. To run jobs, go to Step 8
Running CMSIM Production assignments
7. Edit McRunjob/cms/ImpalaLite/CMSIM.conf
Update the following variables as given here:
GeometryFile=./cms133.rz
InputPath=/data/ANZAR/dgt-prod-ops/commonOutDir     <<<----- Note the change from previous version
IfStageInput=true     <<<----- Note the change from previous version
IfSaveOutput=true     <<<----- Note the change from previous version
InProtocol=cp
IfSaveHBOOK=true
useBoss=0
useDAR=1     <<<----- Note the change from previous version
DARpath=$MOP_REMOTE_DAR_ROOT
EnvironmentType=MOP
8. To run jobs, go to Step 8
Running OSCAR Production assignments
7. Edit McRunjob/cms/ImpalaLite/OSCAR.conf
Update the following variables as given here:
EnvironmentType=MOP
GeometryPath=.
InputPath=/data/ANZAR/dgt-prod-ops/commonOutDir     <-------- input base-path of the datasetname directory containing ntpl file(s)
IfStageInput=true
InProtocol=cp
OutProtocol=cp
IfSaveOutput=true
useBoss=0
useDAR=1
DARpath=$MOP_REMOTE_DAR_ROOT
8. To run jobs, go to Step 8
Running Hit Production assignments
7. Edit McRunjob/cms/ImpalaLite/Hit.conf
EnvironmentType=MOP
InputPath=/data/ANZAR/dgt-prod-ops/commonOutDir     <-------- input base-path of the datasetname directory containing FZ file(s)
IfStageInput=true
IfSaveOutput=true
InProtocol=cp
OutProtocol=cp
useBoss=0
useDAR=1
DARpath=$MOP_REMOTE_DAR_ROOT
##Geometry_PATH will be resolved through DAR
8. To run jobs, go to Step 8
Running Digi w/o PU
7. Edit McRunjob/cms/ImpalaLite/Digi.conf
EnvironmentType=MOP
InputPath=/data/ANZAR/dgt-prod-ops/commonOutDir     <-------- input base-path of the datasetname directory containing FZ file(s)
IfStageInput=true
IfSaveOutput=true
InProtocol=cp
OutProtocol=cp
useBoss=0
useDAR=1
DARpath=$MOP_REMOTE_DAR_ROOT
##Geometry_PATH will be resolved through DAR
Make sure the following are commented out:
#EventFileInputPath=lxcmsa:/raid/cmsprod/TMP_Hits
#EventReadProtocol=rfio:
8. To run jobs, go to Step 8
Running Digi with PU
Whether Digi runs with or without PU is decided by the RefDB assignment; we only have to provide the PU path. At FNAL it is kept in dcache, but via the preload library it is used like any disk area, making it transparent for the application.
7. Edit McRunjob/cms/ImpalaLite/Digi.conf
EnvironmentType=MOP
InputPath=/data/ANZAR/dgt-prod-ops/commonOutDir     <-------- input base-path of the datasetname directory containing FZ file(s)
IfStageInput=true
IfSaveOutput=true
InProtocol=cp
OutProtocol=cp
useBoss=0
useDAR=1
DARpath=$MOP_REMOTE_DAR_ROOT
##Geometry_PATH will be resolved through DAR
##PU related variables.
PUReadProtocol=dcap:
IfStageInPU=false
PURuntimeLocation=$MOP_PURuntimeLocation
Make sure the following are commented out:
#EventFileInputPath=lxcmsa:/raid/cmsprod/TMP_Hits
#EventReadProtocol=rfio:
8. To run jobs, go to Step 8
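The Digi.conf requirements above (EnvironmentType=MOP, useDAR=1, and the two Event* variables commented out) lend themselves to an automated check. The sketch below assumes the .conf files are plain KEY=VALUE text with # comments, which is how they appear on this page; the helper names active_settings and digi_conf_problems are hypothetical and not part of McRunjob.

```python
# Variables this manual says must stay commented out in Digi.conf.
MUST_BE_COMMENTED = ("EventFileInputPath", "EventReadProtocol")


def active_settings(conf_text):
    """Return {key: value} for uncommented KEY=VALUE lines."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def digi_conf_problems(conf_text):
    """Return a list of deviations from the Digi.conf settings listed above."""
    settings = active_settings(conf_text)
    problems = [key + " should be commented out"
                for key in MUST_BE_COMMENTED if key in settings]
    if settings.get("EnvironmentType") != "MOP":
        problems.append("EnvironmentType should be MOP")
    if settings.get("useDAR") != "1":
        problems.append("useDAR should be 1")
    return problems
```

The same pattern could be adapted to the other .conf files by changing the expected values.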
Step 8: Creation and Submission of Jobs
i. cd McRunjob/py_script
ii. Run Linker and create jobs for the correct Assignment Type (<type> is one of CMKIN/CMSIM/OSCAR/Hit/Digi, e.g. CMKIN)
python Linker.py script=CMSOneStep.mcj context=CMSProduction.ctx:MOP.ctx AssignmentType=<type> AssignmentID=1234 nloop=5
For CMKIN, use this additional parameter on the command line: cmsimInputPath=$commonOutDir
iii. Submit jobs.
python Linker.py script=IMPLRunJob_MOP.mcj Scheduler=MOP AssignmentID=1234 useBoss=0 mopSite=generic mopDagPath=$commonOutDir mopNumOfJobs=1
Submitting to a site directly
python Linker.py script=IMPLRunJob_MOP.mcj Scheduler=MOP AssignmentID=1234 useBoss=0 mopSite=<site> mopDagPath=$commonOutDir mopNumOfJobs=1
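When many assignments are processed, the two Linker.py command lines can be assembled programmatically instead of being typed by hand. This is only a sketch: the function names are hypothetical, the key=value arguments are copied from the commands above, and it assumes the assignment type is passed as AssignmentType=CMKIN (or CMSIM, OSCAR, Hit, Digi). The functions only build the argument lists and do not run anything.

```python
def create_jobs_command(assignment_type, assignment_id, nloop):
    """Build the job-creation command line as an argument list."""
    return [
        "python", "Linker.py",
        "script=CMSOneStep.mcj",
        "context=CMSProduction.ctx:MOP.ctx",
        "AssignmentType=%s" % assignment_type,
        "AssignmentID=%d" % assignment_id,
        "nloop=%d" % nloop,
    ]


def submit_jobs_command(assignment_id, dag_path, site="generic", num_jobs=1):
    """Build the submission command line; site="generic" goes via the Match-maker."""
    return [
        "python", "Linker.py",
        "script=IMPLRunJob_MOP.mcj",
        "Scheduler=MOP",
        "AssignmentID=%d" % assignment_id,
        "useBoss=0",
        "mopSite=%s" % site,
        "mopDagPath=%s" % dag_path,
        "mopNumOfJobs=%d" % num_jobs,
    ]
```

The resulting lists can be handed to subprocess.call(..., cwd="McRunjob/py_script") after setup.sh has been sourced, with dag_path taken from the commonOutDir environment variable.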
To be Noted:
When submitting to the Generic site (Match Maker), for every submitted batch of jobs there will be TWO (instead of one) dagman processes displayed by condor_q. Do not mistake this for an error. The first dagman runs the job that you submitted to the match-maker site; the match-maker site has then submitted another dagman job to the grid site it matched.
Production Operation Tools.
MOP now has a new set of tools to help Production Operations. Thanks to Nickolai!
These tools are in the mop_submitter/misc directory. Adding this directory to your PATH/PYTHONPATH will make them available.
For details, follow this link: http://home.fnal.gov/~kuropat/cms/cms.html
Error Reporting: Please report errors to dpe-discuss[AT]fnal.gov or contact lists from DPE page.
======================================================================
M. Anzar Afaq anzar[AT]fnal.gov
Fermi National Accelerator Laboratory phone: (630) 840-6856
Computing Division - CMS Group fax : (630) 840-2783
P.O.Box 500, MS 234, Batavia, IL 60510 http://home.fnal.gov/~anzar
Last Updated: March 05, 2004