Principal Site Reliability Engineer - AI Infrastructure Operations


Company Overview

Nscale is a leading provider of GPU cloud infrastructure specifically engineered for artificial intelligence (AI) applications. The company focuses on delivering high-performance and cost-effective solutions designed for both AI start-ups and large enterprises. Nscale not only simplifies the complexity associated with AI development but also empowers AI-focused organizations to achieve remarkable results in areas such as cost management, rapid innovation, and environmental sustainability. At Nscale, the culture revolves around continuous innovation, accountability, and excellence, encouraging all employees to take ownership of their work and contribute meaningfully to the company's technological advancements.

Position Overview

The Principal Site Reliability Engineer (SRE) is a pivotal role within the AI Infrastructure Operations team. It is a technical leadership position focused on ensuring the reliability and scalability of one of the industry's most demanding AI platforms. The position calls for someone who thinks systemically and can inspire and lead operational excellence across the organization. The role covers establishing reliability strategies, designing foundational systems, and improving operational practices across teams.

Key Responsibilities

In the role of Principal Site Reliability Engineer, you will be charged with several critical responsibilities:

  • Owning and evolving the long-term reliability strategy for Nscale's AI and HPC infrastructure.
  • Designing and leading the development of extensive control-plane systems, automation frameworks, and operational tools.
  • Defining reliability standards, SLO frameworks, and operational best practices for use across multiple operational teams.
  • Serving as a senior technical escalation point during critical incidents, guiding the resolution process and ensuring comprehensive fixes.
  • Identifying structural reliability risks and advancing cross-functional initiatives at the architectural level to mitigate those risks.
  • Collaborating closely with Engineering, Network Operations, and Fleet Operations to influence platform design and elevate operational maturity.
  • Mentoring both senior and mid-level engineers, enhancing the overall quality and efficacy of SRE practices.
  • Driving measurable improvements in availability, mean time to recovery (MTTR), cost efficiency, and operational scalability.

Required Skills

The position requires a high level of expertise and a strong background in managing complex infrastructure:

  • A minimum of 10 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering involving large-scale infrastructure.
  • Expert-level software engineering skills, with a strong track record of building production-grade automation and systems.
  • Profound knowledge of Linux, networking, and distributed systems design at scale.
  • Extensive experience in debugging and resolving issues across the hardware, OS, networking, and application layers.
  • Demonstrated leadership ability to guide technical initiatives across teams without direct authority, showcasing strong communication skills and a systems-thinking mindset.

Nice to Have

Although not mandatory, the following skills and experiences would be beneficial:

  • Hands-on experience with AI or HPC platforms, particularly dealing with GPUs, InfiniBand/RDMA interconnects, and workload schedulers like SLURM.
  • Familiarity with Kubernetes at scale and various cloud architectures (hybrid and bare-metal).
  • A history of delivering significant enhancements in reliability, scalability, or operational efficiency.

Salary Information

The base salary range for this position is $150,000 to $2,150,000 USD. Actual compensation can vary based on factors such as skill set, experience, education, and location. In addition to base salary, the role may include bonuses, equity, and participation in commission programs.

Benefits

Nscale emphasizes a collaborative and innovative work environment that values employee contributions. The company promises a highly competitive package, including benefits such as:

  • Medical, dental, and vision coverage.
  • Flexible paid time off.
  • Parental leave.
  • Retirement plan participation.

The company is deeply committed to creating an inclusive workplace and encourages applicants from diverse backgrounds, including people of color, the LGBTQ+ community, individuals with disabilities, and those from underrepresented socio-economic backgrounds.

Culture and Work Environment

Nscale takes a remote-first approach, reflecting a commitment to flexibility and work-life balance. Employees are encouraged to build their schedules around significant life moments, supporting a human-first workplace that fosters both productivity and well-being. The company also offers a dynamic progression plan tailored to individual ambitions, enabling employees to grow and innovate in their roles.

In summary, the Principal Site Reliability Engineer role at Nscale is an exceptional opportunity for professionals seeking to contribute to the cutting-edge field of AI infrastructure while enjoying a rewarding work environment, competitive salary, and ample opportunities for personal and professional growth.



This job offer was originally published on himalayas.app

Nscale

United States

Software development

Full-time

June 25, 2026



This job offer summary has been generated using automated technology. While we strive for accuracy, it may not always fully capture the nuances and details of the original job posting. We recommend reviewing the complete job listing before making any decisions or applications.