Principal Site Reliability Engineer

Nscale · United States · Remote

DevOps · $150,000 - $2,150,000 USD

From the original post

Company Overview

Nscale is a provider of GPU cloud infrastructure engineered specifically for artificial intelligence (AI) applications. The company focuses on delivering high-performance, cost-effective solutions for both AI start-ups and large enterprises. Nscale aims to reduce the complexity of AI development and to help AI-focused organizations improve cost management, speed of innovation, and environmental sustainability. The culture centers on continuous innovation, accountability, and excellence, and all employees are encouraged to take ownership of their work and contribute meaningfully to the company's technological advancements.

Position Overview

The Principal Site Reliability Engineer (SRE) role is pivotal within the AI Infrastructure Operations team. It is a technical leadership position focused on ensuring the reliability and scalability of one of the industry's most demanding AI platforms. The position calls for someone who thinks systemically and can inspire and lead operational excellence across the organization. The role covers establishing reliability strategy, designing foundational systems, and improving operational practices across teams.

Key Responsibilities

As a Principal Site Reliability Engineer, you will be responsible for:

- Owning and evolving the long-term reliability strategy for Nscale's AI and HPC infrastructure.
- Designing and leading the development of extensive control-plane systems, automation frameworks, and operational tools.
- Defining reliability standards, SLO frameworks, and operational best practices for use across multiple operational teams.
- Serving as a senior technical escalation point during critical incidents, guiding the resolution process and ensuring comprehensive fixes.
- Identifying structural reliability risks and advancing cross-functional, architecture-level initiatives to mitigate them.
- Collaborating closely with Engineering, Network Operations, and Fleet Operations to influence platform design and elevate operational maturity.
- Mentoring senior and mid-level engineers and raising the overall quality and efficacy of SRE practices.
- Driving measurable improvements in availability, mean time to recovery (MTTR), cost efficiency, and operational scalability.

Required Skills

The position requires deep expertise and an extensive background in managing complex infrastructure:

- A minimum of 10 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering involving large-scale infrastructure.
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems.
- Deep knowledge of Linux, networking, and distributed systems design at scale.
- Extensive experience debugging and resolving issues across the hardware, OS, networking, and application layers.
- Demonstrated ability to lead technical initiatives across teams without direct authority, with strong communication skills and a systems-thinking mindset.

Nice to Have

The following skills and experience are not mandatory but would be beneficial:

- Hands-on experience with AI or HPC platforms, particularly GPUs, InfiniBand/RDMA interconnects, and workload schedulers such as SLURM.
- Familiarity with Kubernetes at scale and with various cloud architectures (hybrid and bare-metal).
- A history of delivering significant improvements in reliability, scalability, or operational efficiency.

Salary Information

The salary range for this position is between $150,000 and $2,150,000 USD. Actual compensation can vary based on factors such as skill set, experience, education, and location. Alongside the base salary, the role may offer additional compensation such as bonuses, equity, and participation in commission programs.

Benefits

- Medical, dental, and vision coverage.
- Flexible paid time off.
- Parental leave.
- Retirement plan participation.

Culture and Work Environment

Nscale takes a remote-first approach, reflecting a commitment to flexibility and work-life balance. Employees are encouraged to build their schedules around significant life moments, supporting a human-first workplace that fosters both productivity and well-being.