Director, System Reliability Engineering

Microsoft
United States, Washington, Redmond
May 12, 2025
Overview
Microsoft Silicon, Cloud Hardware Infrastructure Engineering (SCHIE) is the team behind Microsoft's expanding Cloud Infrastructure and is responsible for powering Microsoft's "Intelligent Cloud" mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's more than 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally, with our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for passionate, high-energy engineers to help achieve that mission.
As Microsoft's Cloud business continues to grow, the ability to deploy new offerings and HW infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for Cloud infrastructure reliability and in improving the planning process, manufacturing, quality, delivery at scale, serviceability, and sustainability. We are looking for a System Reliability Engineering Leader with a passion for customer-focused solutions and the insight and industry knowledge to envision and implement future technical solutions that will optimize the Cloud infrastructure and its reliability.
We are looking for an experienced Director, System Reliability Engineering, who will be responsible for driving reliability performance across architecture, design, component and material selection, manufacturing, and integration of datacenter hardware, ensuring that all electrical, mechanical, thermal, environmental, transportation, and operational aspects, along with telemetry, diagnostics, and the SW/FW stack of the cloud solution, are optimized throughout the lifecycle of each cloud service. The candidate will interact with Engineering, Supply Chain, Sourcing, Manufacturing & Quality, Fleet Management, Datacenter Operations, and other internal and external stakeholders.

Responsibilities
Lead the design, implementation, and continuous improvement of reliability practices across our AI infrastructure. Ensure the performance, scalability, and resilience of AI systems in production environments
Lead the development and execution of system and component reliability engineering strategies for all Cloud platforms and services
Collaborate across HW and SW architecture, data engineering, and platform teams to ensure robust deployment of resilient solutions and services
Lead strategic innovations and develop processes that integrate industry practices to achieve high reliability and quality efficiently
Design and implement observability frameworks tailored to AI workloads
Drive incident response, root cause analysis, and postmortem processes for HW system outages or degradations
Establish and monitor SLAs (Availability, Node In Service, Time to Restore Availability) for all cloud services, ensuring alignment with business goals and product requirements
Foster a culture of reliability, automation, consistency of execution, and continuous improvement across engineering teams
Support manufacturing, datacenter operations, troubleshooting, and diagnostic methods to optimize cloud infrastructure reliability