Basic Qualifications
Bachelor's degree in Software Engineering or a related Science, Technology, Engineering, or Mathematics field, plus a minimum of 8 years of relevant experience; or Master's degree plus 6 years of relevant experience.
Responsibilities for this Position
What You'll Own
- SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions - not just measure uptime.
- Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.
- Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.
- Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."
- Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs - tokens, compute, storage. You ensure the platform scales without surprises.
- Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.
What You Won't Own
- Application development or AI model building - you ensure that what other teams build is operable; you don't build it
- Infrastructure provisioning - IT provides the infrastructure; you define what's needed and validate it works
- Business process decisions or backlog prioritization
What Makes This Role Different
- AI services have failure modes that traditional applications don't - model drift, token budget exhaustion, prompt injection, upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.
- You are applying SRE principles from scratch. There is no existing SRE practice to inherit - you will define it for the platform.
- Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."
Required Qualifications
- Bachelor's degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master's degree plus 3 years of experience
- Production SRE or DevOps experience - you have owned the reliability of systems that real users depended on, not just built CI/CD pipelines
- Hands-on experience with monitoring and observability tools - Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar. You have built dashboards and alerts that caught real problems.
- Strong scripting and automation skills - Python, Bash, infrastructure-as-code (Terraform, CloudFormation, or similar)
- Experience with containerized environments - Docker, Kubernetes, container orchestration at scale
- Experience defining and managing SLOs, error budgets, and incident response procedures in production
- U.S. citizenship required. Department of Defense Secret security clearance is required at time of hire.
Preferred Qualifications
- Experience with AI/ML production systems - model serving, inference monitoring, token cost tracking, or similar
- Multi-cloud experience (AWS, Azure, GCP) including cloud-native monitoring and logging services
- Experience building operational readiness review processes or production launch checklists
- Familiarity with Google SRE principles - you have read the book and applied the concepts, not just referenced them in interviews
- Experience in environments where reliability has compliance or safety implications - defense, healthcare, finance, or critical infrastructure
What Sets You Apart
- You think about failure before you think about features. Your first question about any new system is "how does this break?"
- You automate yourself out of toil. If you're doing the same thing twice, you write a script.
- You have said "not ready" to a team that wanted to ship, and you were right.
- You build monitoring that tells you what's wrong, not just that something is wrong.
- You write post-incident reviews that actually change how systems are built, not just how incidents are documented.
Details
- Remote - 100% telework
- 9/80 schedule
- Defense industry experience is not required
Target salary range: USD $142,696.00/Yr. - USD $158,303.00/Yr. This estimate represents the typical salary range for this position based on experience and other factors (geographic location, etc.). Actual pay may vary. This job posting will remain open until the position is filled.
Company Overview
General Dynamics Mission Systems (GDMS) engineers a diverse portfolio of high technology solutions, products and services that enable customers to successfully execute missions across all domains of operation. With a global team of 12,000+ top professionals, we partner with the best in industry to expand the bounds of innovation in the defense and scientific arenas. Given the nature of our work and who we are, we value trust, honesty, alignment and transparency. We offer highly competitive benefits and pride ourselves in being a great place to work with a shared sense of purpose. You will also enjoy a flexible work environment where contributions are recognized and rewarded. If who we are and what we do resonates with you, we invite you to join our high-performance team!
Equal Opportunity Employer / Individuals with Disabilities / Protected Veterans