Friday, July 21, 2000 Break Out Session: Data processing operations, resources, current data distribution and future, the science database (SX), schedules and goals.

These notes are B. Lee's interpretation of the events in this meeting, with help from his own and Ani Thakar's notes. Apologies for inaccuracies, mis-attributed comments, and mistakes of any sort.

The session started with a question which prompted S. Kent to clarify that the Operational Database (OpDB) stores the data we need to run the survey, and is not intended for science but for the operation of the survey.

Q: Who/what is responsible for tracking seeing coverage, and allowing us to check off a piece of the sky as done? S. Kent and J. Munn: OpDB is the keeper of truth; it will be determined there. All the needed tools and data are there already and it could be done (for plates too). It is currently done by hand since the scripts to generate the reports have not been written.

Q: Is the OpDB robust? J. Munn: It has been corrupted once (due to an Objy bug). Since then we have been more careful not to use some of Objy's features that might introduce problems. But it is a concern.

Q: (Something about experience of Objy with large databases?) S. Kent says the SDSS data set is small compared to Objy's other clients. Our problems have not been with size but with features the other clients are not interested in, such as data mining capabilities.

Q: What happens to OpDB if Objy goes out of business? J. Munn notes that the survey can be done with flat files alone. The only thing we would lose is an easy way to merge multiple observations. And we have already changed databases once in the past. S. Kent says if Objy folded we would use another database or write our own to merge observations. A. Thakar: can we re-implement on a relational database (instead of an object-oriented one)? For the SX he says probably yes, with some effort (probably 6 months?).
For the OpDB, it would be more difficult to get the desired functionality. However, if Objy were to fold, we will still have a working version and are supposed to get the source code. S. Kent: We are moving to Linux for the next SX release, which should help with some of the problems. Steve also points out that we have worried about Objy folding for six years -- it hasn't happened yet, and now they've got a track record, so perhaps we're worrying too much about it.

Q: OpDB is more important because survey operations will halt if Objy goes under? J. Munn thinks that if Objy folds, we only lose the link between objects, and it will only require ~1-2 months of work to recover. Others suggest this is optimistic, but certainly it would not be more than a few months of work and would not be catastrophic to the survey.

Q: What do the data analysts worry about? C. Stoughton introduces Jen Adelman (photo pipeline operation, stuffs OpDB) and Bruce Greenwalt (automation of the pipelines, MT pipeline operation) and mentions Bob Peterson (spectro and MT pipeline operation) who is not present. J. Adelman's nightmare: too many plots to look at. Developers tend to add QA plots for her to look at, and rarely remove old ones. We are solving the problem this summer by asking each pipeline developer to come up with a single plot or test per pipeline, so that the operator only has to look further if something is wrong. Another worry is the tweaking of files and parameters required to push data through the pipelines. This does feed back into determining standardized parameters for the pipelines, and for some (such as astrom) this is no longer a problem. J. Munn: Astrom has run on many (10?) runs now with the standard parameters and has required no adjustments.

Q: When things don't run, what's the nature of the failure? J. Adelman: Bugs discovered with new objects on the sky. Jen mentions a photo bug found recently when photo worked for 2 days on a star cluster before finally stopping.
Also, the manual tweaking of parameters needed to get pipelines to run on data, although this was a lot more common a year ago than it is now. Still, we are not yet at the stage of complete automation. C. Stoughton: Configuration control used to be more of a problem. The files with hardware parameters and rough calibrations are now data model compliant, but configuration control is still done by hand. J. Munn: Most problems are because this is first-year commissioning data and a lot of things are changing and/or non-standard in the data. C. Stoughton: While some of these files (opcamera) are troublesome, others (such as opcamera) have been working. J. Munn: We know what we need to do but we are manpower limited. The goal is to have all of this information stored in the data, but J. Annis suggests this is several months to a year away due to other more urgent problems.

Q: Are the parameters used to run the pipelines saved so that the outputs can be reproduced later? J. Adelman: Yes, parameter and plan files are saved. D. Eisenstein: Is this information in the SX? Several answer: No, but run and rerun are, and from this one can reconstruct all this information. It is archived but not in a user-friendly way. Perhaps it should be added to the SX when the pipeline software is released to the public.

Q: Do we know what needs to be done to automate the pipelines? B. Greenwalt: Approximately 1.5 years ago, most of the knowledge needed to run the pipelines was in 1 or 2 people's heads. Bruce has been trying to put this into scripts to automate the pipelines. As of the present, basically all the needed automation scripts exist. The QC step is the problem; we are still waiting for tests that can be used in the scripts from the pipeline developers. This step is what holds up automation; right now QC for each pipeline must be done by hand. There have been major advances in the last few weeks, and we are maybe a month away from having these tests. G.
Knapp: Part of why this happened with the photo pipeline was that the group originally working on QC for photo bowed out and was not replaced. Princeton just started working on it about 6 months ago -- it took the photo team "longer than it should have" to realize that they needed to do this. J. Munn: This was also complicated by the fact that we were looking at a lot of bad or non-standard data. B. Greenwalt: MTPIPE needs some work by hand that can't be automated at this point, and the spectro pipelines are very new and still under development. So these two probably will not be automated as soon as the others. C. Stoughton: Zeljko Ivezic has made a web page that allows an operator to drill down through diagnostic plots for photo and diagnose pipeline problems quickly. The plan is to have developers available on the end of a phone line, and call them less and less. B. Greenwalt: Right now QC's main job is to tell you if something goes wrong. All data analysts should be able to identify problems and quickly troubleshoot the simple ones. Difficult problems are referred to local experts (J. Adelman, B. Yanny, etc.) first and then to developers.

Q: Have responses (fixes) to problems been quick? C. Stoughton: Yes, unless it's a really difficult problem.

Q: Is there a requirement that we declare data good or bad within the same dark run? S. Kent: There is no requirement that all the data be turned around within a dark run. The only requirement is that the data can be processed to produce plates, and declaring the runs needed for plates (complete stripes) good or bad is part of this. Stripes which still need to be filled in may not be examined until later. There is a strategy tool, and it likes to complete stripes, so there is a priority on getting complete stripes which can be used for plates. Currently, though, there is no direct connection between the good/bad declaration and the strategy tool.
As with many things, we know what needs to be done to complete this loop; it's just not automated yet. For spectroscopy, they know the (approximate) good/bad declaration on the mountain, and only have to ask FNAL about borderline cases, in which case we let them know the next day. We are currently calibrating the mountain vs. final S/N, which will reduce the number of borderline cases.

Q: What are other people's nightmares?

B. Lee's nightmare (final calibration pipeline developer): Not enough PT secondary patches and extinction measurements to really do and test the calibrations. We should have a patch every hour, and the test in the requirements actually calls for a patch every half hour. Currently we only have 1-3 patches per column in an 8 hour run. Things look fine now, but it's hard to really know and there are problems that are hard to examine. (The PT schedule is already quite tight; it's hard to get extra data.)

G. Knapp's nightmare: The photo pipeline is doing something wrong that we don't notice or can't detect. C. Stoughton: The problem of seeing-dependent systematic errors on the photo aperture magnitudes was only recently discovered, and had long gone unnoticed (in part due to the lack of real final calibrations). J. Munn: We don't enforce the regression testing requirements for pipelines. Does photo need another person to write regression tests? G. Knapp: Yes, probably someone outside Princeton so that it is independent.

S. Kent's nightmare: Steve worries about the completeness of target, and how we can test this. G. Knapp suggests that we have people look at individual spectra for a large number of plates. B. Lee: This study is already underway by B. Wilhite and others at FNAL; some results were briefly presented on Thursday. C. Stoughton: How do we make the survey uniform, and how do we get the bookkeeping straight? S.
Kent: An additional fear is that we are testing software and selection algorithms which keep changing, so we are testing something we have already thrown away. D. Eisenstein and J. Munn: Important to test by looking at overlapping 745/752/756 runs, see if we get the same answers for the same objects. This is currently being done.

A. Thakar's nightmare (SX developer): No time to do regression testing or system testing for changes made. Too many and too frequent changes are needed and there are not enough people to do the testing. Also, enough time hasn't been budgeted for loading the SX, another person is needed. Eventually loading will be handed over to J. Adelman at FNAL.

Q: Is the SX stable? There have been a lot of crashes (referring to Ani's plot presented on Thursday), how serious were these? A. Thakar: Generally only SX crashes, not the system, and SX is restarted by scripts within about a minute. One cause is a memory leak in the SX which has been fixed in the development version but not yet released. There have also been some problems with the SGI machines.

Q: Is SGI a good platform (for the SX)? A. Thakar: The SGI version of Objy is poorly written, slow, and prone to crashes as compared to other platforms due in part to some known problems with the SGIs. There was also a problem with output files being corrupted which was directly traceable to SGI; this has been fixed by SGI. We are now switching over to Linux which is much faster.

Q: Are people outside of JHU using the SX? (Another of Ani's plots seems to show little usage elsewhere.) D. Eisenstein: SX 2.2 (not yet released) is the first version with the full functionality end users really want, so previous versions have not been as useful as the flat files. J. Annis: Most people who have been actively pursuing science have pre-existing code to work from the flat files.
They will not switch until the SX provides better functionality and ease of use than their own code, which is something that is just now happening. Jim points out that UW, which traditionally has not dealt with the flat files and thus does not have as much pre-existing flat file code, has been the heaviest non-JHU user of the SX even in its current state. B. Lee: Until now the data sets have been (relatively) small and local copies of the flat files were possible. Already many institutions do not have enough disk space for their own copies of all the flat files they want, and soon only 2 or 3 will. This will be a strong incentive for using the SX instead of flat files. A. Thakar: SX 2.2 has several improvements over the earlier versions, including much better performance and more features. He adds that local copies would be very fast. D. Eisenstein: The SX is very easy to use and is likely to take over very soon.

Q: What are the loading problems with SX? Can't you just press a button and walk away? A. Thakar: There used to be frequent crashes during loading, but this is now less of a problem. The two main causes were the instability of the SGI (may be solved now) and changes to the data model, especially the spectra. The earlier the SX team is notified of the data model, the better. (D. Eisenstein suggests concentrated feedback within the next 2 weeks.)