Site Reliability Engineering (SRE) Manager
Eastern USA (Remote)
Paige is a software company helping pathologists and clinicians make faster, more informed diagnostic and treatment decisions by mining decades of data from the world’s experts in cancer care. We are leading a digital transformation in pathology by leveraging advanced Artificial Intelligence (AI) technology to create value for the oncology clinical team.
We are the first company to develop clinical-grade AI tools for the pathologist, which resulted in our receiving FDA breakthrough designation for our first product. Paige has also received FDA clearance for our digital viewer, FullFocus™. We have also established multiple relationships with biopharma, laboratory, and equipment manufacturers that enable Paige to develop an ecosystem ready to help patients receive better diagnoses and treatment.
We are seeking a Site Reliability Engineering (SRE) Manager. In this role, you will be responsible for leading a team of SREs working on the Paige platform and cloud infrastructure. We are looking for a leader who can empower teams to deliver high-quality products efficiently while incorporating all components and practices related to site reliability engineering. You will collaborate closely with extremely talented and committed individuals from the Engineering, Security, Quality, Regulatory, and Product groups.
This is an extraordinary opportunity to be part of a high-performing team and to pursue a life-changing mission.
- People management – Leading a team of skilled engineers across geographic locations.
- Setting SLOs, SLAs, and SLIs – Work with teams to define and implement service-level objectives, service-level agreements and service-level indicators for the product and infrastructure.
- Project planning and prioritization – Project planning, task prioritization, and developing the SRE roadmap using Agile and Scrum methodologies.
- Improving the on-call incident response process – Scaling and optimizing the overall on-call process.
- Improving service observability to define and capture necessary metrics and monitoring.
- Set strategic and operational goals and work with the team to deliver on them.
- Implement proactive monitoring, alerting, trend analysis, and self-healing systems.
- Participate in on-call rotations, driving restoration and repair of service-impacting issues.
- Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
- Contribute to product development and engineering as needed to ensure quality of service for highly available services.
- Take a command-and-control role as Incident Manager during critical incidents, focusing on minimizing MTTR and MTTD.
- Identify, evaluate, and execute preventive measures to minimize or avoid impact on the customer experience.
- Participate in After Action Reviews and facilitate Root Cause Analysis to drive Problem Records through to closure and prevent recurrence, including, but not limited to, resolution of product or service defects, design changes, infrastructure changes, and operational changes.
- You have a minimum of 5 years of management experience.
- You have a minimum of 3 years of experience implementing and managing site reliability engineering practices.
- You have a minimum of 5 years of hands-on experience with building and scaling AWS infrastructure.
- You have a minimum of 3 years of hands-on experience automating microservice deployments on Kubernetes.
- You have a minimum of 5 years of hands-on experience with infrastructure-as-code and configuration management tools such as Terraform and Salt.
- You have a minimum of 5 years of hands-on experience managing CI/CD tools such as GitLab or GitHub.
- You have a minimum of 5 years of hands-on experience with distributed source control systems such as Git.
- You have a minimum of 3 years of hands-on experience with observability solutions such as Datadog or New Relic.