Site Reliability Engineer

Date: Sep 11, 2021

Location: Oak Ridge, TN, US, 37830-8050

Company: Oak Ridge National Laboratory

Requisition ID: 6385

The Company

The National Center for Computational Sciences (NCCS) at the Oak Ridge National Laboratory (ORNL) is seeking highly qualified individuals to play a key role in improving the security, performance, and reliability of the NCCS computing infrastructure. The NCCS is a leadership computing facility providing high-performance computing resources for tackling scientific grand challenges.


The Team

The Platforms group is tasked with architecting and running Slate, our Kubernetes platform, which lets NCCS users and staff develop, manage, and deliver their own applications that integrate with NCCS HPC resources.


We strive to provide the best Kubernetes service for both our internal staff and our scientific users. We achieve this goal in part by dogfooding: we run all of the internal services we support on Kubernetes. We have great opportunities to help other staff develop their applications on the platform and to work with our outstanding scientific community as we bring Kubernetes to the HPC world.


We are at the intersection of container orchestration and HPC. Come help us build the bridge.


About you

We are looking for an experienced systems engineer who can write code and focuses on customer success. You manage infrastructure as code because automation lets you focus on the more difficult and rewarding problems. You love collaborating with others to come up with the best solution to a problem. You enjoy learning new technologies and pick them up quickly. You love CI/CD and GitOps. You probably have production experience with Kubernetes and Golang. You may have a GitHub account with cool projects. You may have technical leadership experience.


Tools we use: Kubernetes, OpenShift, Helm, Prometheus, RHEL, GitLab CI, Terraform, Puppet, Python, Golang



Major Duties/Responsibilities

  • Participate in an on-call rotation for off-hours support
  • Keep the Kubernetes platform reliable, available, and fast
  • Architect solutions that improve the reliability, scalability, performance, and efficiency of our services
  • Respond to, investigate, and fix service issues all the way from bare metal through the OS to the application code
  • Design, build, and maintain the infrastructure we need to support the NCCS
  • Work with our users to help them use Kubernetes
  • Write awesome documentation


Basic Requirements

  • At least a Bachelor’s degree in a scientific field or equivalent experience
  • At least four years’ experience as an SRE/Sysadmin/Systems engineer


Preferred Qualifications

Experience with Kubernetes, OpenShift, Helm, Prometheus, RHEL, Puppet, Python, Golang


Special Requirement

This position requires access to technology that is subject to export control requirements. Successful candidates must be qualified for such access without an export control license.



Battelle offers a generous relocation package to ease the transition process. Domestic and international relocation assistance is available for certain positions. If invited to interview, be sure to ask your Recruiter (Talent Acquisition Partner) for details.
For more information about our benefits, working here, and living here, visit the “About” tab at


This position will remain open for a minimum of 5 days, after which it will close when a qualified candidate is identified and/or hired.

We accept Word (.doc, .docx), Adobe (unsecured .pdf), Rich Text Format (.rtf), and HTML (.htm, .html) files up to 5 MB in size. Resumes from third-party vendors will not be accepted; these resumes will be deleted and the candidates submitted will not be considered for employment.

If you have trouble applying for a position, please email

ORNL is an equal opportunity employer. All qualified applicants, including individuals with disabilities and protected veterans, are encouraged to apply. UT-Battelle is an E-Verify employer.