
Postdoctoral Research Associate in HPC Resilience


Date: Aug 19, 2019

Location: Oak Ridge, TN, US, 37831

Company: Oak Ridge National Laboratory

Requisition Id 1063 

Pay Band: Postdoc

Relocation:  Domestic and International relocation services are available.




We are seeking a Postdoctoral Research Associate who will focus on research into resilience for high-performance parallel computers.  The position resides in the Computer Science Research Group in the Computer Science and Mathematics Division (CSMD) at Oak Ridge National Laboratory (ORNL).


The Computer Science Research (CSR) Group addresses challenges in technical computing at the largest scales, especially scientific and engineering modeling and simulation, from a computer science perspective.  We research, develop, and deploy solutions in parallel and distributed programming environments, system software, and the engineering of scientific software, with the goal of making current and future high-end computer systems more productive and more usable. Our work is motivated and validated by close collaboration with scientific application teams.


The CSR Group’s work in resilience includes (a) characterizing faults, errors, and failures in supercomputers, (b) analyzing and modeling the cost-benefit trade-off between performance, resilience, and power consumption for current and future-generation extreme-scale computing systems, (c) exploring the design space for optimizing this trade-off in future supercomputers, (d) tuning this trade-off dynamically at runtime in future systems, and (e) developing resilient parallel programming model solutions for future supercomputers. We have a particular interest in accelerated-node architectures, with a near-term focus on GPU accelerators, expanding over longer time horizons to include additional types of accelerators and generally more heterogeneous systems. Our work ranges from exploratory research to supporting the production environment of the Oak Ridge Leadership Computing Facility, currently home to Summit, the world’s most powerful supercomputer.


The successful candidate will become part of a team of researchers working on current- and next-generation (exascale) systems, engaging deeply with application software teams, vendors, and the broader community to design and develop responses to important challenges.


Major Duties/Responsibilities:


  • Modeling the performance, resilience, and power consumption of the wide-ranging resilience capabilities in supercomputer system software, programming models, libraries, and applications
  • Modeling and simulating coordination of resilience hardware and software components, either statically with policies or dynamically with a parallel programming model runtime
  • Rapid design, prototyping and evaluation of modeling and simulation tools for design space exploration
  • Developing an understanding of the performance, resilience and energy trade-off in future architectures
  • Maintaining an up-to-date understanding of other work in the field
  • Actively collaborating with industry, academia, government labs, and applications developers in a variety of venues
  • Conducting research and reporting results in open literature journals, technical reports, and at relevant conferences


Basic Qualifications:


  • A PhD in Computer Science or a closely related discipline, completed within the last five years
  • Background in fault tolerance for parallel and distributed computing systems
  • Demonstrated research experience in fault tolerance for parallel and distributed systems
  • Strong analytical, software development and programming skills
  • Experience with parallel and distributed computing


Preferred Qualifications:

  • Demonstrated experience working at the largest scales of computing
  • Expertise in fault tolerance approaches for supercomputer systems and applications
  • Experience with performance, resilience, and power consumption modeling of supercomputer systems and applications
  • Familiarity with parallel discrete event simulation
  • Understanding of pattern-based software design
  • An excellent record of productive and creative research as demonstrated by publications in peer-reviewed journals
  • Excellent, demonstrated written and oral communication skills, including the ability to communicate in English to a scientific audience
  • Effective interpersonal skills
  • Motivated self-starter with the ability to work independently and to participate creatively in collaborative and multi-disciplinary teams of researchers 
  • Ability to function well in a fast-paced research environment, set priorities to accomplish multiple tasks within deadlines, and adapt to ever-changing needs


Additional Information:

Applicants cannot have received their PhD more than five years prior to the date of application and must complete all degree requirements before starting their appointment. The appointment is for up to 24 months, with the potential for extension. Initial appointments and extensions are subject to performance and the availability of funding.


ORNL Ethics and Conduct:

As a member of the ORNL scientific community, you will be expected to commit to ORNL's Research Code of Conduct. Our full code of conduct and a statement by the Lab Director's office can be found here:


Benefits at ORNL:  UT-Battelle offers a quality benefits package, including a matching 401(k), contributory pension plan, paid vacation, and medical/dental plan options. Onsite amenities include a credit union, medical clinic, cafeteria, coffee stands, and fitness facilities.   


Relocation:  UT-Battelle offers a generous relocation package to ease the transition process. Domestic and international relocation assistance is available for certain positions. If invited to interview, be sure to ask your Recruiter (Talent Acquisition Partner) for details.


For more information about our benefits, working here, and living here, visit the “About” tab at




This position will remain open for a minimum of 5 days, after which it will close when a qualified candidate is identified and/or hired.

We accept Word (.doc, .docx), Adobe (unsecured .pdf), Rich Text Format (.rtf), and HTML (.htm, .html) up to 5MB in size. Resumes from third party vendors will not be accepted; these resumes will be deleted and the candidates submitted will not be considered for employment.

If you have trouble applying for a position, please email

ORNL is an equal opportunity employer. All qualified applicants, including individuals with disabilities and protected veterans, are encouraged to apply.  UT-Battelle is an E-Verify employer.

