Stuff Steve has got to do or make other people do
- General Organizational Issues
- (1) get ISGC papers written (deadline, Sept. 18) 2/20/09
- (2) Set next allocation meeting with mcbride and wolbers 11/16/09
- (2) ITIL foundations training, get in the next course 11/4/09
- (5) contact Tsan at ASGC--re. getting asgc_osg into his control, upgrading the test OSG instance 4/27/09
- (2) next condor phone-con 12/3/09
- (2) set meeting with qizhong, margaret, diesburg, keith, me
- (2) next grid admins meeting, 12/4/09
- (2) make changes in the fermigrid request form, and deploy it 9/9/08
- (2) write a new FermiGrid Users survey 10/28/09
- (3) mcas requirements 7/13/09
- (3) check NetIDMgr from home 2/11/09
- (3) Add HEPSPEC numbers to newly-upgraded GIP--what's the right factor, 80 or 200, what is right hepspec for opteron 2389 8/25/09
- (4) contact M. Mattingly with dates to speak at AU seminar 12/30/08
- (4) Faarooq--make sure he gets safety shoes, pager, gridadmin 10/22/09
- (4) Tim Currie, IDM and Fed trust discussion, 11/10 3 pm (forward my comments) 11/4/09
- (6) how many total specint2K in FermiGrid, complete the spreadsheet. 6/24/08
- (6) learn how to use meeting maker web interface 12/6/08
- (6) update Escrow information, be sure up to date, reorganize secrets 1/7/08
- (6) add Keith to escrow ring 2/2/07
- (6) Watch for Operations Manager Training to be offered (asked) 8/25/08
- Meetings
- Grid Admins Meeting agenda items
- Allocations meeting agenda items
- (2) renormalize half of farm, 1/3 of farm allocations 11/16/09
- Grid Users Meeting agenda items
- (2) service desk requester console training 11/16/09
- Issues for potential FGS<->SSA management discussion
- (2) streamline/automate the process to get new VO's accepted in gplazma 1/23/09
- (2) get the stken/dca registered as an OSG Storage Element in OIM and get SSA hooked up to be the support center (asked) 1/23/09
- Issues for potential FGS<->FEF management discussion
- (1) Change in yp management/Automation of passwd file, alternative non-yp methods 4/27/09
- (2) yp errors on D0 cluster, why, will making dedicated slave yp servers help? 10/7/09
- (2) review of condor/osg Wn-client packaging installation for quality control 9/14/09
- (3) ace nodes, many disk failures, out of warranty in feb 2010, is there a plan 4/30/09
- (3) possible transition of CDF and GP grid worker nodes to separate partition for each job? 1/27/09
- (3) enlarging /local/stage space on GP worker nodes 6/5/09
- (4) SLF5 transition on worker nodes 10/5/09
- Future Project Planning
- FermiGrid Cloud project
- (1) make all cloud nodes a different password, different Escrow secret 10/6/09
- (2) call first cloud project meeting 9/2/09
- (2) find some rpath guy to bring in and talk 7/1/09
- (2) note cern batch virtualization talk, talk to them 10/26/09
- (3) check cloud stuff from ISGC conference.
- (3) Set up first virtualization roundtable 11/7/08
- (3) read up on Virtual Iron 5/19/08
- (3) figure out what BNL did to get on Amazon EC2, do the same 9/4/09
- (3) check out rpath, openqrm, enomalism, convirt 2/2/09
- (3) any cloud ideas for sbir? 10/12/09
- (3) buzzwords to try, nimbus,eucalyptus, elasticfox, spandexfox, ovirt,openqrm, xen orchestra 10/28/08
- (3) Keahey/Freeman, try out the workspaces, run on cloud, also contact Alex younts @ purdue to run on his cloud. 7/13/08
- (4) magellan project, doe cloud, talk to mine about including new researcher, talk to David Martin 10/7/09
- (4) Ask Jason/Ling for help to set up one virtual iron or oracleVM machine (asked) 9/4/09
- (5) download XenEnterprise evaluation license, see if it works, call K.Laughlin back 847 695 1990 1/7/08
- GPCF planning
- (1) review Stu's design doc 10/22/09
- (2) draw a functional box diagram 10/6/09
- (2) plan for bluearc mounts 10/15/09
- (2) plan for batch system 10/15/09
- Grid School issues
- (1) Fix the example tarball for fermigrid 201 (cdf test job doesn't work) 5/12/08
- (1) fix Fermigrid 101, outdated vdt client install instructions 4/13/09
- (1) more info on opportunistic use 5/15/09
- (1) get more info on which fermigrid gatekeeper is which in the tutorial, also describe /fermilab/grid group 2/20/09
- (1) add guidelines on moving I/O to and from /grid/data 6/5/09
- (1) get materials prepared for FermiGrid 203,204,301,401,501 (301 high priority) 3/19/08
- (2) get new Grid School schedule made. Is January good? 8/15/08
- Documentation issues
- (1) make overall shutdown procedure for all FermiGrid machines 7/6/08
- (1) update template of FermiGrid worker node software load 3/2/08
- (2) document how to access fermigrid and related cvs repositories 9/4/09
- (2) document how to do a gentle shutdown of a gatekeeper 12/12/08
- (2) kill the old user-guide.html on fermigrid web pages 12/8/08
- (2) Write the Goomsday guide for GUMS and SAZ intervention 8/20/08
- (2) update location and console info for all with rack diagrams, re-organize escrow secrets 9/10/07
- (2) document fermigrid condor sleeper pool 6/25/08
- (2) review Anne Heavey's iSGTW glossary http://www.isgtw.org/?pid=1001622 2/10/09
- (2) document HA stuff, osg twiki 3/5/09
- (3) troubleshooting--group trouble by possible symptom 1/29/08
- (3) gratia machines, what daemons should be running, how to start and stop, 8/22/07
- (4) pull myproxy info out of old users guide into separate page 7/24/07
- (4) Dept. web pages 2/16/07
- (4) C. Green issues--User guide for non-root client install, vomrs access from top level 6/13/07
- (5) update VO privilege doc pages (yocum) 3/30/07
- Laptop/desktop HW/OS
- (1) how to handle E-mail store 10/23/08
- (2) make java plugins work on laptop, 12/30/08
- (3) figure out how to get shockwave and flash working? 32-bit browser? 64-bit flash plugin? 9/30/08
- (3) get fgitbse down the hall to 8th fl. computer room 9/4/08
- FermiGrid Operations HW/OS
- FermiGrid Linux Workgroup issues
- (2) bison, flex, readline-devel 6/23/09
- (5) openmanage, can we get it to install out of yum repo 6/5/09
- (6) fermigrid workgroup, try biarchonly 5/8/07
- FermiGrid Production Systems HW/OS
- (1) make uniform /etc/ssh/sshd_config a la fg2x3 everywhere, first verifying that it won't break Keith's monitoring 7/21/09
- (1) fermigrid3, why the flashing yellow light on 1 drive (Faarooq) 11/9/09
- (2) clean out /boot/grub/grub.conf everywhere 5/21/09
- (2) verify tibs functionality on all 5/8/09
- (2) get modified sensors_wrap again running to check temperature on all dells (using openmanage) or ipmitool or the stuff that FEF has 9/11/08
- (2) new keyboard tray jams keys, either missing or repeating. Try to hook usb keyboard straight to kvm switch, does problem continue? 10/15/08
- (2) racsvc doesn't start up under xen kernel, can we disable and still have openmanage work? 9/5/08
- (2) what do we do with 24 old 2GB dimms (back pocket?!) 11/12/09
- (3) get ganglia gmond installed on any machine where it is missing 8/28/08
- (4) make sure sysrq is enabled on all dom0's and xen's--re-query list if there is any way to invoke it across the serial console on dom0 9/12/08
- (4) fermigrid0, pull all our kerberos keytabs and host certs there for copies and storage 2/28/07
- (4) work on provisioning method to distribute common config files for FermiGrid nodes 6/12/07
- (4) mysql port on fg1x1 shouldn't be open from offsite, use iptables 1/11/08
- (4) make startx work on certain machines with firefox 1/23/09
- (5) set non-default password on all DRAC O 8/8/08
- (5) understand how to get good cert onto DRAC's, openmanage, 3ware 8/25/08
- (6) periodic check, daemons, k5logins, etc. 5/19/09
- (7) does Xen VM migration from machine to machine help us 7/6/07
- (7) Consider having an nfs-mountable OSG Client on the bluearc 7/29/08
- (7) get a hypermail interface up for fermigrid-logwatch and fermigrid-yum mail
- FermiGrid Console Servers HW/OS
- (2) followup with T. Dawson on moving console server out of samgsrv1/2 rack
- (2) test nulled 8-wire cable so we can cross-connect console port of one server to a different one (found old spec, asked avocent) 8/4/09
- (2) syslog-ng, get all ACS logging to syslog-ng repository 8/20/08
- (2) all, make sure you can log in as admin but not as root on web ui 8/14/08
- (3) get the nfs-mounted files rotating again. 8/28/08
- FermiGrid Test/Development Systems (fgtest0-6) HW/OS
- (2) review Parag's install of glideinwms 8/27/09
- (2) clean out old ITB systems out of OIM 8/4/09
- (2) can we get a big-memory instance for Parag 2/3/09
- (2) get working X on fgtest0 12/11/08
- (2) middleware install on new fgtnxm 12/12/08
- (3) fgtest4 E07F0 Proc1 IERR, Proc2 IERR, why? 8/25/09
- (3) fgtest5, drive 0, went bad, reseated, keep an eye on whether it goes bad again (it is showing errors) 7/30/09
- (3) why didn't drac firmware take, is there a network firmware, why didn't base firmware take 12/30/08
- (3) get test glideinwms going 10/21/08
- (4) test IPMI on fgtest0 5/20/09
- (4) re-commission old fgtest1 as new fgtest2. 12/30/08
- CDF HW/OS
- Things to do on all CDF clusters
- (2) fcdfsrvt0, keep watch, five SC so far 11/6/09
- (2) check daemons that are started on all 4/29/09
- (3) make all condor masters have local certs not dependent on /grid/wnclient 6/2/09
- (3) Make syslog.conf and syslog-ng.conf uniform for all gatekeepers, get syslog-ng running on all 7/23/08
- CDF Grid Cluster 1 HW/OS
- CDF Grid Cluster 2 HW/OS
- (1) plan to enable private net on next downtime
- CDF Grid Cluster 3 HW/OS
- (2) plan to enable private net on next downtime
- (2) set fcdfosg4 up to take over and distribute proxies for fcdfosg3 if fcdfosg3 is down 10/6/08
- CDF Sleeper Pool HW/OS
- (1) document sleeper pool, formalize and automate startup/shutdown mechanism 5/14/08
- CDF test cluster (fcdfosgt1/fcdfsrvt0) HW/OS
- (2) old fcdfosgt1, fncdf211, fncdf213, fcdfheadt1, make sure they are off my prop list. 8/25/09
- (2) restore postgres and quill 6/30/08
- (3) Investigate private net worker nodes 7/31/09
- D0 HW/OS
- (1) network contractors access into gatekeeper rack on 12/1 downtime 12/1/09
- (1) install private net and crossover cable on 12/1 downtime 12/1/09
- (1) plan for kernel upgrades on or before 12/1/09
- (2) set meeting with qizhong, keith, diesburg, vicky re d0 on gpgrid
- (2) Why still seeing yp exceptions on gratia probes after new servers 3/12/09
- (3) watch decommissioned nodes and new nodes, reconfigure subclusters accordingly when they go 8/4/08
- (3) consult M. Diesburg on possible home area restrictions ose/gce 6/5/09
- (4) plans for bringing D0 PBS clusters in compliance with OSE baseline 10/2/08
- Fermilab Submit node fnpcsrv1 (and friends) HW/OS
- (1) have Neha fix the quota reports (asked) 4/29/09
- (2) fnpcsrv1--do all users have to be in /etc/shadow. Plan to keep shadow updated on fnpcsrv1, with nis map? 3/13/08
- GP Grid Cluster HW/OS
- (2) clean up and secure all the cables in the rack 8/25/09
- (2) set up redundant means of proxy distribution on GP Grid 10/6/08
- (2) check other staging areas, get people to clean out more stuff 4/17/08
- (2) make sure /etc/shadow has all entries on gatekeepers 6/25/08
- (2) get better way to monitor for ypbind failures and worker node disk failures, work with FEF to think about ldap passwd distribution 7/13/08
- (2) try to get lm_sensors working on non-dell machines and/or steal FEF wrapper 9/11/08
- (3) fnpcsrv5 and 6 console ports are swapped
- (3) need to check condor stop order on fnpc3x1, fnpc4x1, fnpc5x2, they didn't go down clean
- (3) install small switch for 192.168.167.0 private net (it's here) 4/28/09
- (4) Monitoring for busted dccp, srmcp, other isolated zombies 1/11/08
- (5) ypconnect failures, securenets on fnpc5x4, why 4/30/08
- (5) consolidate postgres on one of machines with raid10 6/12/07
- GRATIA machines HW/OS
- (1) gratia03 battery backup on raid controller bad, 10/5/09
- (1) new passwd/.k5login 6/4/09
- (2) get lowe account on all, also get Faarooq root on all 10/1/09
- (2) make sure console redirect bios setting is set on all 10/1/09
- (1) gratia06 failed drive was reset, keep a watch on it 8/3/09
- (2) gratia06 bmc firmware didn't take, why? 6/22/09
- (2) gratia01,02,03 (and all fgtest) BIOS updates didn't take, why 6/22/09
- (3) gratia-vm02, busted mysql-local yum repo in the config file 9/8/09
- (3) clean out all /boot/grub/grub.conf from non-xen kernels 9/8/09
- (3) gratia07, problems with double rpms of libselinux libacl and others 6/23/09
- (3) block mysql ports from off-site except where needed (see Tim Rupp suggestion) 8/26/08
- (4) watch gratia08,gratia09 after previous machine panics 11/6/09
- (4) gratia machines, dress power cords and monitor cables 4/1/08
- (4) gratia07,06 not getting yum updates, why?
- FGITB cluster HW/OS
- (2) fgitb301 orange light, memory error, why 11/19/09
- (2) get a private net on those nodes (switch ordered, here) 2/25/09
- (2) label the new fgitb nodes 2/25/09
- (3) installation of kvm panel on fgitb/cloud rack 11/6/09
- (4) get lm_sensors or ipmi working on those nodes, steal monitoring scripts from FEF for this purpose 9/1/09
- (4) prepare for SNMP signals from ups on battery when they come 9/1/09
- Gridworks FAPL HW/OS
- (2) enable serial consoles on gw014-gw018, also label apc power plugs (Dan) 10/1/09
- OSG-ReSS HW/OS
- SAMGRID HW/OS
- (2) decommission samgsrv2 in coordination with REX 11/19/09
- (2) plan to virtualize samgsrv1 6/9/09
- Non-FermiGrid machines HW/OS
- (5) syscollect checkout for M. Mihalek, get it done, 8/7/07
- FermiGrid Operations (Services)
- Metrics and Monitor Scripts Services
- (1) get new critical nodes added to ngop 4/7/09
- (1) histogram of cdf glidein clock times, look for peaks corresponding to exiting glidein that never matched (look at Giuseppe's work). 7/29/08
- (1) why intermittent failures of RSV 7/21/08
- (1) move monitoring host off of fermigrid0 to fg0x0? 7/21/08
- (1) other strategic grid metrics 7/25/08
- (1) N of slots available to FermiGrid per cluster 1/11/08
- (1) N of slots delivered to FermiGrid (total, other Fermi VO, Non-fermiOSG) 1/11/08
- (1) Percent of total jobs submitted via grid interface 1/11/08
- (1) percent of data moving through grid interface 2/1/08
- (1) number of jobs that come through site gateway 1/2/09
- (2) dedicated 2nd display to show a monitoring slideshow without login 10/2/09
- (2) get new machines into ngop 7/30/09
- (2) monitor for nanny timeouts on fg5x0/fg6x0 5/1/09
- (2) work with H. Wenzel to get his stuff deployed 2/10/09
- (2) automate the generation of graphs from gratia either as custom report on gratia-fermi or other means 7/17/08
- (2) percent pingable 1/11/08
- (2) percent up in pbs/condor 1/11/08
- (3) alternate X display for computer room graphics and no login 10/5/09
- (3) monitor held jobs on fnpcsrv1 and fg1x1 7/25/08
- (3) meaningful and fast condor_status for fermigrid1--free jobs by cluster by VO, with totals. 1/11/08
- (3) percent of glidein jobs that exit without ever running a user job. 1/11/08
- (4) monitor condor and globus port usage 7/29/08
- (5) evaluate Neha's Zabbix setup 1/7/08
- (5) make sure ngop is running ping agents for all 3/22/08
- (5) make sure ngop monitor gui is OK for all 3/22/08
- (6) ngop web monitor, admin tool can our people log in 3/9/08
- VOMRS Service
- (2) is it possible to split vomrs and SAM for the D0 vomrs 10/13/09
- (3) unexpire S. Timm and other certs in ilc and i2u2 vomrs 4/30/09
- (3) add Faarooq as admin to all Fermi-managed VO's 10/5/09
- (3) need procedure to remove a VOMS/VOMRS vo once we lose contact with the group 4/14/09
- RSV service
- (2) change my kcron to run off of fg0x3 for all 6/5/09
- (2) check the rsv proxy forwarding script, make it run voms-proxy-init only once and not many times 9/16/09
- (3) use the rsv role, make change as soon as we have gums/gratia mappings we want 6/5/09
- (3) have unified rsv server and web page? 6/5/09
- Worker node client service
- gLExec service
- (1) new/new glexec version for testing from Dave 8/31/09
- (1) test new glexec as part of current privilege project/xacml testing 12/11/08
- (2) to test: new gums/scas mapping, gums stress test, get cdf sleeper pool to use test glexec,
- (2) get the right sazc.conf into /usr/local/testgrid directories 8/31/09
- (2) check out the glideinwms factory on fgt3x7 7/17/08
- (2) bugfix requests, different timestamps in the logs, how to move glexec_log and glexec_monitor.log elsewhere.
- (2) CMS--when are they going to deploy gLExec on their worker nodes, check 9/10/09
- (3) ways to make a centralized monitoring system (asked 1/10), need FEF cooperation with /etc/syslog.conf 12/7/07
- FermiGrid Web Page Service
- (2) Move production web server off of fermigrid1 6/5/09
- (4) re-enable FermiGrid pacman mirror 6/5/09
- Site job gateway service
- (1) plans for VDT 2.0.0 upgrade 8/12/09
- (1) test nfs-lite 1/11/08
- (1) update osg-info-dynamic-condor 5/22/08
- (1) set up a BDII and try Brian Bockelman's bdii GIP 4/13/09
- (1) followup on Rank statement, add priority if possible (takes glue schema changes), also consider waiting jobs of high prio user 10/21/08
- (1) fix the gip os version using burt's workaround 9/14/09
- (2) try the multi-ce hint from Brian in gip 10/12/09
- (2) figure out how to calculate fcdfosg3/4 ambiguity in free slots 8/26/09
- (2) tests of gram 5 9/4/09
- (2) consider frontending the OSG using full proxy delegation 9/4/09
- (2) globus bugzilla--for jobmanager-condor should return "suspended" state if job is held on far end, follow up 9/4/09
- (2) figure out alternate method than GlobusResubmit to stop held/released jobs from getting resubmitted (system periodic remove) 2/25/09
- (2) obtain cemon collector for raw from IU 2/20/09
- (2) obtain perl scripts for raw from IU 2/20/09
- (2) consider reworking automount maps to make admin's home directories be of form /grid/login/ everywhere, that way we do not need auto.home automount map 2/2/09
- (2) clean out old vo's out of passwd file, do something with osgedu 2/2/09
- (2) fix iptables. 7/15/08
- (2) make an am_i_being_preempted function 8/31/09
- (2) report the OS version for the subclusters correctly in gip (Burt gave recipe) 9/4/09
- (2) check to make sure ress classads sent to osg-ress-1 are valid 5/22/08
- (2) add a $FERMIGRID_APP variable to the others, $OSG_APP, $OSG_DATA and get it into GIP 10/24/08
- (2) regression testing on wsgram cemon, pre-wsgram cemon 2/13/08
- (2) get back to B. Bockelman with final fermigrid count. 1/30/08
- (2) figure out how to clean up leftover files from WS-GRAM 1/30/08
- (2) d0cabosg2--figure out how to handle case of restricted free nodes 9/7/07
- (2) how to configure globus gatekeeper, WSGRAM, condor, with multiple IP's including service IP's on same node or Xen instance (done, need to document in twiki) 7/8/08
- (2) get container-real.log to rotate daily (charles gave us a recipe) 6/5/08
- (3) modify extrarsl parameters to make them the right ones (still need to figure out how to do it for PBS, see Jason's email in July 2008 on how they did the d0farm queue), also Paul Mercure's mail of the -l option that was used at mcgill. 4/10/08
- (3) redundant condor/gatekeeper/cemon/IG on fermigrid1,4 1/11/08
- (3) redundant non-nfs shared file system to support dual-write gatekeepers 1/23/09
- (4) double directories, both fg1x1.fnal.gov and fermigridosg1.fnal.gov under .globus/job, why? and why not fcdf1x1? 2/25/09
- (4) Tune system_periodic_remove on fg1x1 to remove held jobs of certain holdreasonsubcode (code 31 should have a -forcex). Condor ticket is in 8/28/08
- (4) Make second advertisement of fermigrid1 member clusters directly to osg-ress with changed contact strings to go through fermigrid1? 9/13/07
- (4) monitoring for dead globus-ws, where's the rsv probe? 10/21/09
- (5) Make sure pre-emption and priority fields of glue 1.3 are populated correctly 7/6/07
- syslog-ng service
- (1) why are all the fg1x1 logs blank 3/25/09
- (2) get syslog-ng running on fcdf* and d0osg* 4/7/09
- (2) fix startup script so it doesn't leave tails running of files that don't and won't exist 3/17/09
- (2) can we make a more generic way to do the config files on the client and server so manual intervention isn't needed each time a log file name is added or subtracted 3/17/09
- (2) can we package syslog-ng client as rpm? 1/12/09
- (3) consider failover options for syslog-ng, figure out what to do with blocking flow-control, should we do heartbeat 1/23/09
- ganglia service
- squid service
- (2) squid--review the other tuning parameters that Dave Dykstra suggested 5/6/09
- (3) why does a failed crl file update on fg2x3/fg3x3 get purged from the squid server and cause all to fail? 6/5/09
- (3) monitor the nanny timeouts 6/5/09
- MyProxy service
- (1) deploy vdt 2.0.0 version on fg3x4 and fg-myproxy.fnal.gov (Faarooq) 6/5/09
- (1) further tests and documentation on new myproxy version. Also let them know about change of command line options, are they cross-interoperable? (Faarooq) 10/15/07
- (2) plan for security review of myproxy--talk to major users, do they need passphrase retrieval, any policy on length of policies, any policy on renewers? What about full certs stored therein? (Faarooq)
- VOMS Service
- (1) Auger VOMS is moving, upgrading 7/20/09
- (2) either locking or error correction added to voms-sync 12/1/08
- (2) need way to monitor for voms-admin failure where part or all of the voms is unlistable due to a bad entry or inconsistent DB (CMS request), especially in the case of replica VOMS 3/2/09
- (3) why binary garbagoose at beginning of voms tomcat logs on fg6x1? 3/3/09
- (6) ILC VO cross-mirroring--waiting on internal discussion among ILC 8/10/07
- GUMS service
- (1) gums 1.3--Is there a better way to do manual mappings for cms and lqcd 5/15/09
- (2) gluex vo support 10/31/09
- (2) cms rsvuser mapping, why is it different with voms-proxy and without? 10/1/09
- (2) gplazma maps my proxy to fgtest, is that right 9/16/09
- (2) test the new xacml-based authorization 9/16/09
- (2) update new fermi groups in goc template (add minerva) 4/29/09
- (5) request fast command-line test for gums, similar to sazcheck (requested) 7/17/08
- (4) pragma should show as "gin" vo? 3/29/07
- (4) check /dzero/services group, should it also not have fqan required 3/6/09
- (4) check fgtest mappings again 8/8/08
- SAZ service
- (3) plan to automatically clear /var on saz servers periodically (Neha) 6/23/09
- MySQL Service
- CDF Gatekeeper Services
- All CDF Grid Cluster Gatekeeper Service
- (1) plans for vdt-2.0.0 upgrade (tentative Dec. 8) 8/12/09
- (2) problem with glexec jobs and system periodic remove, J. Boyd entered ticket 9/3/09
- (3) fix all log rotations particularly WS-GRAM log4j and glite/var/log log4j. (cemon) 7/23/08
- (3) check alter-attributes.conf for vo-by-vo-parameters 4/7/09
- (3) check subcluster sizes and number of classads 4/7/09
- (3) why is some CDF user writing in /grid/testhome/cdf 7/20/09
- (4) Make plans to add periodic VACUUM FULL to the postgres 4/7/09
- (4) change checkpoint_segments field in postgres 4/7/09
- (4) plans to deploy SLOTx_EXECUTE macro to keep one condor job from overwriting the other's disk space 3/4/08
- (5) set up periodic monitoring to detect lastupdate field in headers of condor_q 10/7/08
- (5) switch OSG_WN_TMP to be $_CONDOR_SCRATCH_DIR? (takes condor user_wrapper to make it work) 1/31/08
- CDF Grid Cluster 1 Gatekeeper Service
- CDF Grid Cluster 2 Gatekeeper Service
- CDF Grid Cluster 3 Gatekeeper Service
- CDF Sleeper Pool Gatekeeper Service
- (1) deploy condor 7.3.2, 9/18/09
- (1) schedd/shadow timeouts, condor ticket open 3/10/09
- (2) try shared-secret configuration 4/14/09
- (2) continue sleeper pool scalability tests. 8/8/08
- (3) get postgres and quill running again 12/8/08
- (5) Install condor locally on master machines to mitigate NFS use. 8/26/08
- (5) Make dedicated schedd to run the managed-fork 8/12/08
- (6) if time, try a nfs-lite configuration 8/26/08
- CDF Test Cluster Gatekeeper Service
- (2) install condor 7.3.2 everywhere 9/18/09
- (2) finish configuring condor and vdt on the fcdft0xn nodes 6/22/09
- D0 Gatekeeper Services
- (2) why twice as many classads getting seen by verify-gip-for-cemon, asked 11/3/09
- (2) enhanced (smarter) cleanout of globus/tmp/gram_job_state 1/7/09
- (2) set up redundant proxy distribution on D0 grid clusters 10/6/08
- (2) verify spec and worker node disk sizes in alter-attributes.conf and config.ini
- (2) change FermiGridWorkerNodeTmpDisk to GlueHost..... 7/21/08
- (2) glexec probes on d0cabosg1,2, and workers, not handshaking, why? Check ProbeConfigs 7/30/08
- (2) availablememory setting in gip, plus memoryenforce setting 7/23/08
- (2) Failure code 111 in gram_job_mgr.log what does it mean 2/25/09
- (3) followup with Mike re getting roles into and out of myproxy 5/15/09
- (3) RSL settings, can they be used to pick out the various subclusters 1/9/08
- (3) fix log rotations wsgram, cemon 7/3/08
- fnpcsrv1 Submission Services
- (1) plan for VDT upgrades, condor patches 8/12/09
- (2) enable condor_quill on all 4 9/18/09
- (2) install /usr/local/vdt on fnpcsrv5xx 9/18/09
- (2) plan for periodic VACUUM FULL 5/15/09
- (2) add a FermiGridSubmitHost and FermiGridJobID to all outbound grid uni. jobs from fnpcsrv1 and friends 7/21/08
- GP Grid Gatekeeper Services
- (1) override the jobmanager-condor-default from getting advertised on fnpc3x1/fnpc4x1 7/1/09
- (1) set the GIP group_quota to VO mapping, test it out 6/30/09
- (2) change the condor_submit mapping to use osg-user-vo-map.txt 7/1/0
- (2) setup a periodic VACUUM FULL on the postgres db 4/7/09
- (2) set up a fcp server on fnpc3x3, fnpc4x3 8/21/09
- (2) general condor hardening program of work (local wn client, certs, individual WN partitions, etc) 8/21/09
- (3) enable HEPSPEC option in GIP 10/16/09
- (3) make all condor masters have local certs not dependent on /grid/wnclient 6/2/09
- (2) do we want to add a max memory limit on the GP Grid cluster nodes? 3/24/09
- (2) check that we've done all the recommended optimizations for WS-GRAM, per B. Daudert E-mail 10/15/08
- (2) do we want to make a hierarchy of opportunistic use to prefer Fermilab VO? YES 2/25/09
- (2) investigate changing the SCHEDD_NAME to the same name as the service ip 1/16/09
- (3) what about the multiple GlueClusterService entry 8/28/08
- (3) fix log rotations, particularly log4j ones 7/3/08
- (5) new condorview, can we customize to do multi-user views? (asked B. Gietzel) 1/11/08
- (5) condor--group quota usage in negotiator, now in manual, try it out. 7/25/08
- (5) enable condor startd on fnpc3x4,4x4,5x4, make sure nodes configured with proper rpms, document proper SL5 worker config 7/8/08
- (5) request the /pnfs mounts on fnpc3x4, fnpc4x4, fnpc5x4 6/30/08
- (5) switch OSG_WN_TMP to be $_CONDOR_SCRATCH_DIR? (see easy wrapper from B. Holzman) 5/9/07
- Fermilab GRATIA service
- (1) Fermigrid opportunistic usage plots from gratia, group by group and cluster by cluster 10/8/08
- (2) do we want goc's rsv probe? (notified Chris of offer) 10/9/09
- (2) the yp errors on the psacct probes in D0 cluster 5/15/09
- (2) fix psacct gratia DB to fix the unknown entries 2/9/09
- (2) unphysical psaccting for FNAL_DZEROOSG_2 (reported to gratia-ops) 2/20/09
- (2) get gratia test job of fixed cpu time (monitors its own cpu time and quits when 1 hr is done) 7/8/08
- (4) is Gratia doing the right thing with WS-GRAM jobs? 8/8/07
- FermiGrid OSG Services
- OSG-TG Gateway
- (1) write short report of developer's work over the summer 10/12/09
- (2) give instructions to D0 on how to use the pre-wsgram interface
- (2) Is the globusrsl function working? Looks like it might not be 8/21/09
- (4) osg-tg gateway, is it in ReSS (Yes), is the GT4 service info correct? 6/22/09
- (4) osg-tg gateway, get it registered in OIM 2/2/09
- (4) osg-tg gateway, how to do the bureaucracy to get all OSG accepted on TG (Ruth) 1/2/09
- OSG ReSS service
- (1) condor vulnerability, use ALLOW* settings plus GSI authentication, make plan (test on new fgintinfo) 6/6/07
- OSG GRATIA service
- OSG VOMRS service
- OSG VOMS service
- OSG Persistent ITB Service
- (1) prepare for next round of ITB testing 9/3/09
- (1) get the x86_64 hardwire out of condor.pm 7/1/09
- (1) keep glexec, cemon, gratia current on worker nodes 1/11/08
- (2) update dCache on fgitbse 10/1/09
- (2) install condor 7.3.2 on all condor nodes 9/18/09
- (2) why is fgitbgkc2 giving "invalid group" for fnalgrid sometimes 8/4/09
- (2) report the osg printing version problem 7/1/09
- (2) check gratia ProbeConfig files on all 6/25/08
- (2) install the new gatekeepers 6/8/09
- (2) install the new condor, pbs, sge 6/8/09
- (2) fix postgres/quill 6/8/09
- OSG Integration Activity
- (1) get our gatekeepers up to vdt-2.0.99 cache, 10/15/09
- (1) participate in VDT 2.0/OSG 1.2 ITB 1.10 activity 6/8/09
- (1) participate in working with Doug Olson for next doegrids cert change 11/2/09
- (2) xacml, IGE, L+L testing, set schedule 11/4/09
- (2) SVOPME testing, in contact with Nanbor and Bala
- FermiGrid Integration Activity
- (1) confirm convention on GlueSubClusterUniqueID==GlueClusterUniqueID 9/5/09
- (2) try the XACML privilege stuff 2/20/09
- (4) Deploy proxy delegation (plus local proxy transfer) on fgitb-gk 3/27/08
- (4) try to run fgitb-gk with no $OSG_WN_TMP, or dynamically set 3/6/07
- (2) re-enable group quotas on fgitb-gk, fgitb-cm, see if GIP does it right 2/20/09
- (2) Need to re-make the fermi special changes to GIP. 4/10/08
- (3) try Parag's MPI advertising stuff on ITB 2/23/09
- (3) Why wasn't ldapsearch giving right answer on our bdii/gip output, need to check 4/10/08
- FermiGrid Security
- (1) security--run test against submit machine with large number of open client ssl connections 7/12/07
- (1) security--bring us into compliance with baselines and minor application plan (including mysql/database baseline) 1/11/08
- FermiGrid Support
- Support Center
- (2) get OIM contacts updated for all VO's 7/6/09
- (2) get CDF gridmap entry contact E-mail and order in gums template fixed. D. Box will request 7/6/09
- (2) add JDEM to list of VO's for support center (asked) 3/25/09
- (2) be ready to register voms and gums servers when it comes to that. 8/20/08
- (3) get Fermilab storage admins to register the SE with the GOC and be responsible for it, if possible, add category to SC E-mail if possible (asked), need meeting with Stan 8/28/08
- (3) Testing web services interface to new service desk remedy for ticket submission/exchange when ready 7/6/09
- uid/gid's to add to FermiGrid or remove from it
- (1) groups to add to Fermilab VO--muon g-2? argoneut? Microboone? E906?--muon g-2 and microboone are aware of grid trust doc. 5/20/09
- (2) get postgres user in all default passwd files. 6/5/09
- (2) add uscms2001-2999? still to do on D0 8/18/09
- (2) add non-numbered CMS users that map to people? only if we can automate 8/18/09
- (2) regularize nagy jfrey bbockelm glideinwms vofrontend users 8/27/09
- (3) dteam almost out of pool accounts, do we care? No. 8/18/09
- User issues
- (1) Review monthly for obsolete fnpcsrv1 users and check fermigrid-announce membership 4/30/08
- (2) Dave McGinnis (AD), how to select all cores of one box (see recipe in the condor how-to's) 3/2/08
- (3) Walter Giele, problems with condor compile 5/4/09
- (3) auger, get it to join osg 10/21/08
- (3) auger, does anyone want the files in /home/auger or /home/augerbat 3/26/09
- (2) Minos--monitor progress of Parrot deployment to get them off AFS 12/3/07
- FermiGrid Storage
- (1) review no_root_squash settings on FermiGrid, make plan to change 10/27/09
- (2) write short blurb on the jython installation stuff in the dcache readme 10/12/09
- (2) get the neutrino program bluearc stuff mounted on FermiGrid 10/15/09
- (2) revisit with Andy and Ray how we should be backing up the bluearc partitions, via ndmp or via one of our servers 7/1/09
- (2) followup with Ray on repartitioning plan, what is their feedback 5/1/09
- (2) review out-mounts of fermigrid-app, fermigrid-data, enforce -noexec (ilcsim1,2, minos25) 5/20/09
- (2) go over old bluearc exports lists, purge outdated ip's 8/12/09
- (3) Adam para over quota on /grid/data, keep track 4/7/09
- (2) Make plan to disable suid options 1/30/09
- (2) get monitoring on the new volumes (/grid/testhome, /grid/home)
- (2) liaison with SSA and others, make business process to automatically get new VO's into gplazma storage-authzdb 1/23/09
- (2) liaison with dcache developers to get gratia probes working for stken dcache. Followup and see if they did it. Probably only transfer probe will work. 7/27/09
- (2) verify content of new storage GIP provider. 7/27/09
- FermiGrid Outreach
- (4) send info on pbs prologues to Marco Mambelli 9/5/07
- (4) Suchandra Thapa pbs question (sent first reply) 10/8/07
- (5) Make contact with Sebastian Goasguen, OSG Campus Grids group 2/5/09
- FermiGrid Development
- SAZ development
- (1) fix my bugzilla account 11/10/09
- (1) try to build sazclient with suggested patches from Rachana 12/10/08
- (1) find the bug in /CN=proxy/CN=proxy/CN=proxy/CN=proxy/CN=proxy sazclient/saz-check problem (found, now trying to fix) goc issue 5339 7/25/08
- (1) is mysql exception in sazserver happening anymore 1/10/08
- (1) modify vdt's configure_gums to be a configure_saz 8/8/08
- (1) document how to do the build of the tarballs and pacman package 8/7/09
- (2) review SAZ project plan, make sure it contains plan for all known bugs 12/30/08
- (2) why can't we increase the heapsize in the sazserver above 1024 anymore? Could 64-bit JVM be the answer? Did we use to have that? 12/3/08
- (2) code for anam.jar and alldepends.jar, find it (Neha)--it's in beta 12/11/08
- (2) throttle on number of server threads that get spawned 12/12/08
- (2) stress testing of new version 10/19/09
- (2) input verification on incoming input 12/12/08
- (2) address other items found in the code review 12/30/08
- (2) why does the logger sometimes just not log even though saz is still working 12/12/08
- (2) make the verify_saz_function.sh script run offset in time on the 2 servers 11/20/08
- (2) check the changes Neha made in code re. the handling of X509_USER_PROXY to make the first GSS handshake 10/15/08
- (2) modify glexec getmapping script to trap for long proxies (5 or more delegations) 1/29/09
- (2) sazclient, can we link it static 7/29/08
- (2) sazclient, segfault in any -file argument that doesn't end in .conf 8/27/09
- (2) automate nightly build of sl4, sl5, 32/64 bit 6/27/08
- (2) get all docs in cvs area up to date 5/28/08
- (2) need to make regression tests more complete, more generic, include glexec, myproxy, make sure CA database inserts are what they are supposed to be 2/4/08
- (2) automated way to get all the logger.debug statements enabled in the code (server and client) 1/30/08
- (2) add -m option to sazclient (pilot or otherwise) 8/8/08
- (2) pacman setup script for sazserver needs updates to automate sazserver config 1/10/08
- (2) make sazserver package work with pacman update (should remove version number from name of package?) 1/10/08
- (2) unify server, client under "saz" subdirectory again in vdt installation 11/11/08
- (2) fix hardwired path name in log4j.saz.properties 11/19/08
- (2) try thread dumps, thread-enabled top and ps, jmx, other stuff from workshop 2/2/09
- (3) sazclient, sazserver, add optional questions package to answer the questions 11/11/08
- (3) saz, add timeouts to server 8/8/08
- (3) saz, add timeouts to client 8/8/08
- (3) voms cert dir, why does it need to be there for server? 10/12/07
- (3) long-term get our own way to have condor-delegated proxies to test saz/glexec 11/27/07
- (3) make changes to docs as requested (where are the saz servers, how to stress test) 1/10/08
- (3) co-test saz with glexec 1/10/08
- (3) no sazc.conf file equals segmentation fault in libSAZ, why? 7/23/08
- (3) can we trap the java nullpointer exception when scanning software and/or LVS ping connects? 2/5/08
- (3) Recompile with -Xlint:unchecked for details--why is it happening? 2/5/08
- (3) what is linux standards base and can we conform to it 2/5/08
- (4) SAZ could we have "OR" arrangement of clients for fallback 9/13/07
- (4) what's the difference between teragrid/nanohub certs and rest of world (supposedly grid-proxy-init -old? Neha is getting a cert) 1/8/08
- Jobmanager-Cemon Development
- (4) Jobmanager-cemon, IG, info prov. into the VDT (Neha) 8/9/07
- (4) fix brendan's scripts and re-install them on fg4x2, osg-ress-1, osg-ress-4 8/17/07
- FermiGrid R+D
- Condor improvement R+D
- (2) Try condor suggestions on using DELEGATE_JOB_GSI_CREDENTIALS in combination with transfer_files to move the proxy to the local worker node, not in NFS. Ditto for GT4. (part of nfs-lite) 3/6/08
- (2) try nfs-lite? make us less dependent on nfs home and wn client (at cost of needing more disk-heavy gatekeeper and making HA much much harder) 1/28/09
- (2) user wrapper, change OSG_WN_TMP to $_CONDOR_SCRATCH_DIR 1/29/09
- (2) Rolling maintenance feature: figure out how to remove from pool, upgrade condor, reboot, scheduled in the batch system 7/24/08
- (2) multiple core reservations within same machine, see condor howto 1/28/09
- (3) the new shared-secret authentication in 7.1 and 7.2 series, try in sleeper pool 12/30/08
- (3) chris green question, when is ExitBySignal ever true in a classad 8/28/08
- (3) try SLOTn_execute option in condor to get each job to execute on a different partition 2/13/08
- (3) check J. Hover's "Certify" package to manage host certs 12/19/08
- (3) review Fabio Martinelli's and Stu Martin's thread on globus-HA 2/25/09
- (3) monitoring, can we detect availability of yp, scratch disk and avoid black hole nodes 1/28/09
- (3) job monitoring capacity, does glideWMS help us there 1/28/09
- (3) ways to kill a job and still get stdout, stderr, other files back, maybe condor userlog too 1/28/09
- (3) dagman executing on distributed schedd, different nodes, how to share the dagman logs? NFS doesn't work 7/24/08
- (4) globus-gass-cache-destroy, try it out, plus other manual globus-gass-cache functions. 5/16/08
- (4) cms condor monitoring tool, make it work on GP Grid 1/11/08
- (4) Can we get condor rpm to work out of a yum repository 12/11/07
- (4) fairshare weighted on cpu speed, can it be done, (asked) accounting-weighted cpu usage, can it be measured? (asked) 7/25/08
- (5) make master-worker work across glidein, work through the examples 5/1/07
- (5) parallel universe, how to run mpi jobs on dedicated scheduler (kerberos or mpich-g2?) 1/11/08
- Globus improvement R+D
- Xen R+D
- (2) xendomains script, how to make sure it gets them all shutdown 7/30/09
- (2) private net nodes, what can we do with them, can we run condor, etc. 6/22/09
- Known open bugs and issues
- Condor support and development issues see http://fermigrid.fnal.gov/condor/
- VDT open bugs
- (1) look through VDT's RT list, update the list below. 4/7/09
- (1) hung globus-gass-cache-util, maybe due to messed up globus gass-cache, follow up with osg-sites, have others seen it, could it be due to bluearc weirdness, (checkpointing and/or backups) we think so. No ticket is open 7/29/08
- (2) gip/cemon making double classads on pbs sites 11/3/09
- (2) make sure condor feature requests are in the condor wiki 5/7/09
- (2) review the condor wiki re. condor_gridmanager features 5/7/09
- (2) vdt-support 3710, delegation depth in globus ssl chain. VDT sent a patch but it didn't solve the problem, working directly with globus people right now. 8/15/08
- (2) GIP--Make sure gluecepolicypriority (lower is better) is filled in everywhere. 7/17/08
- (2) gip--why is cpu_platform not showing up--asked, will be fixed in later release 11/4/09
- (2) gip, what is policy for storage paths, are they right? No! GlueCESEBindAccesspoint null for some, duplicated for others, what is right? 7/17/08
- (2) gip--do I have the right normalization factor for SI00->hepspec, should it be 80 or 200 11/4/09
- (2) check patch from Arvind Gopu re. rsv directories 3/12/09
- (2) tomcat-55 restarts for cemon not clean, takes 1 minute to quit, getting out of memory errors and having to clear the dir to restart clean, try Tim Cartwright suggestion. vdt-support 3632 7/17/08
- (2) cemon memory leak on ce, reported to the vdt 11/10/09, 8/25/08
- (3) pbs--how to make an rsl to select out a set of nodes with certain properties, use paul mercure example of -l option 7/17/08
- (4) sl5 globus gatekeeper/gsiftp /etc/services entries, vdt-support 6000 1/16/09
- (4) vdt-support 4159, ugly errors when installing vdt on a volume with no_root_squash, either as root or as non-root 10/15/08
- (5) put in feature request for vdt hostname (requested 9/15/08, vdt support 3960)--they acknowledge they can do it but it is a lot of work 9/15/08
- gLExec open bugs
- (1) double check all /usr/local/grid and /usr/local/testgrid, be sure correctly configured "getmapping" script points to the right config files, also make sure fetch-crl is running hourly 7/8/09
- (2) see if we can incorporate D. Dykstra changes to speed up nfs-mounted glexec on SL5, otherwise investigating de-nfsizing it--they are in new release, need to test. 7/20/09
- (2) glexec_monitor can pick up bad /usr/local/bin/python, is there a way to override, asked Dave 8/31/09
- (3) request glexec super-user knob--with given cert be able to kill process 4/30/09
- (3) get rid of zero-indexed month field in lcas_lcmaps.log, Igor asked developers again. 12/7/07
- (3) how to make logs go to alternate directory 7/7/09
- (4) slow performance of new glexec under SL5, related to NFS--dave has proposed fix 7/7/09
- (4) glexec SL5 rpms are malformed, should be making symlinks to /grid/wnclient_sl5 and /grid/testwnclient_sl5 7/8/09
- OSG MyOSG/OIM/VORS open issues
- (3) Plan for conversion to xml format, away from wizard pages 10/1/09
- OSG configure-osg open issues
- (2) glueceinfocontactstring/jobmanager contact in config.ini, doesn't accept :2119 anymore, reported, they will fix 11/3/09
- (2) cpu_platform field doesn't get into GIP 11/4/09
- OSG GIP open issues
- OSG other GOC open issues
Steven Timm
Last modified: Fri Nov 6 16:43:14 CST 2009