InTheLoop | 11.06.2006

The weekly electronic newsletter for Berkeley Lab Computing Sciences employees

November 6, 2006

OSF Power Upgrade Gets NERSC Ready for UPS, Franklin

A planned power outage at the Oakland Scientific Facility (OSF) last week allowed the NERSC computer room to be safely upgraded to accommodate a new uninterruptible power supply (UPS) and future computing systems, including Franklin, NERSC's soon-to-be-installed Cray supercomputer. Several carefully timed email notices over the past month informed all NERSC users about the outage, which began last Monday morning, Oct. 30, and was scheduled to last for two days.

The electrical substations in the OSF basement were built to deliver up to 6 megawatts (MW) of power, but until now, only 2 MW were actually used in the machine room. Soon, however, NERSC will need about 3 MW to power the increased computing capability and cooling requirements of Franklin and future machines.

To meet these needs, PG&E upgraded its connection to the building, and new 480V feeds were connected between the basement and the machine room to deliver the increased power. The chilled water piping under the machine room floor is also being rearranged to improve the air flow, since each of Franklin's 102 racks will need 2300 cubic feet of cooled air per minute.
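As a rough back-of-the-envelope check on the figures above, the short Python sketch below totals the cooled-air requirement and the remaining substation capacity. It assumes that all 102 racks draw the full 2300 cubic feet per minute at the same time and that "about 3 MW" is exactly 3 MW; both are simplifications, and the function names are illustrative only.

```python
# Figures quoted in the article above.
RACKS = 102                  # Franklin racks
CFM_PER_RACK = 2300          # cubic feet of cooled air per minute, per rack
SUBSTATION_CAPACITY_MW = 6   # what the OSF substations were built to deliver
PROJECTED_LOAD_MW = 3        # assumption: "about 3 MW" treated as exactly 3


def total_airflow_cfm(racks, cfm_per_rack):
    """Total cooled air needed if every rack draws its full share at once."""
    return racks * cfm_per_rack


def headroom_mw(capacity_mw, load_mw):
    """Substation capacity left over after the given load."""
    return capacity_mw - load_mw


airflow = total_airflow_cfm(RACKS, CFM_PER_RACK)
spare = headroom_mw(SUBSTATION_CAPACITY_MW, PROJECTED_LOAD_MW)

print(f"Total airflow: {airflow} cubic feet per minute")
print(f"Spare substation capacity at {PROJECTED_LOAD_MW} MW: {spare} MW")
```

Under these assumptions, Franklin alone would need 234,600 cubic feet of cooled air per minute, and the substations would still have 3 MW of unused capacity at the projected load.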

NERSC staff began shutting down the computing, storage, and network systems at 4 a.m. on Monday, and the OSF power was shut off at 9:30 a.m. so the work could proceed safely. The power upgrade was completed a little ahead of schedule, with the OSF power restored and the computer room stabilized around 8 p.m. on Halloween Tuesday. NERSC staff cut short their trick-or-treating to return NERSC systems to production, with most systems restored by 10 a.m. Wednesday, Nov. 1. A few more hours of unscheduled hardware and software maintenance were required on Seaborg on Wednesday and Thursday evenings due to a new kernel bug that caused nodes to crash, but the NERSC web site kept users informed with system status updates.

For the first time, NERSC is installing an uninterruptible power supply (UPS) to protect critical data in the NERSC Global Filesystem (NGF) and HPSS. If an unscheduled power outage were to crash NGF—which is mounted on all NERSC production systems and holds up to 70 TB of data—new data that had not yet been backed up might be lost, and previously backed up data could take a week to restore. Once the UPS is operational (scheduled for January), if an unscheduled power outage does happen, the UPS will allow a graceful shutdown of NERSC's critical storage disks and databases. And that added margin of safety will help NERSC staff and users keep their cool on hot summer afternoons.

John Bell Elected Chair of SIAM Computational Science and Engineering Group

John Bell, head of CRD's Center for Computational Sciences and Engineering, has been elected as the next chair of the Computational Science and Engineering activity group for the Society for Industrial and Applied Mathematics (SIAM). His term runs from Jan. 1, 2007, through Dec. 31, 2008. Bell and Phil Colella were co-recipients of the 2003 SIAM/ACM Prize in Computational Science and Engineering.

According to the SIAM Web site, the activity group fosters collaboration and interaction among applied mathematicians, computer scientists, domain scientists and engineers in those areas of research related to the theory, development, and use of computational technologies for the solution of important problems in science and engineering. The activity group promotes computational science and engineering as an academic discipline and promotes simulation as a mode of scientific discovery on the same level as theory and experiment. The group also organizes a biennial conference.

New Staff, New Assignments at NERSC

NERSC's Open Software and Programming Group has three new members:

  • Randy Kersnick is a Bay Area local with a background in Computer Science and web technologies. Randy has programming experience in the areas of web applications and information portals as well as database technologies.

  • Jeff Porter was the LBNL Nuclear Science Division's liaison to PDSF from 1996 to 1998. He then moved to Brookhaven National Lab to work as a physicist for the STAR experiment, managing STAR's database and online computing efforts, and in 2004 moved to Seattle to continue research with STAR at the University of Washington. Jeff works closely with the Open Science Grid's emerging software.

  • Shreyas Cholia has moved to OSP from the Mass Storage Group. Shreyas will work to bring grid technologies to the NERSC user community. He has been a developer in the HPSS community, where he was involved with the HPSS-GridFTP integration effort.

In addition, Aaron Garrett has moved from the Computer Operations and ESnet Support Group to the Networking, Security, Workstations and Servers Group, where he will be providing workstation and desktop support.

Job Postings: NERSC Seeking HPC Network Engineer and Systems Engineer

The NERSC Division has a job posting for a High Performance Computing Network Engineer. This person will be responsible for designing, building, maintaining and monitoring the 10 gigabit Ethernet NERSC local area networks, the 10 gigabit connection to the Internet, and the NERSC Global Filesystem (NGF) local Fibre Channel network. Other duties include serving as a member of a 24/7 on-call network response team; working closely with ESnet personnel to analyze and optimize WAN performance for NERSC's nationwide clients; working with NERSC clients to troubleshoot and improve the end-to-end performance of their applications; working closely with vendors to ensure that NERSC's future networking needs can be met; tracking regulatory requirements that affect networking and developing plans to meet them; and more. Find additional information at http://www.lbl.gov/CS/Careers/OpenPositions/NE019474.html.

NERSC is also looking for a High Performance Computational Systems Engineer to serve on the system administration team. This person will configure, administer and evaluate hardware and software, and will oversee the installation, testing and operation of large-scale computational systems, including job scheduling, system-wide functionality and tuning, in a 24/7 production environment. Other duties include detecting and diagnosing system problems, assisting NERSC users and staff in optimizing their use of the computational systems, and developing and maintaining up-to-date system procedures and documentation. More details can be found at http://www.lbl.gov/CS/Careers/OpenPositions/NE019454.html.

