Senior System Administrator

TISTA Science and Technology Corporation
life insurance, paid time off, paid holidays, tuition reimbursement, 401(k)
United States, Texas, Austin
Sep 05, 2025
Overview

Are you a Senior Systems Administrator who would like to have a positive impact for millions of people? If so, we may have an opportunity for you! TISTA associates enjoy above-industry healthcare benefits, remote working options, paid time off, training/certification opportunities, Healthcare Savings Account & Flexible Savings Account, paid life insurance, short-term & long-term disability, 401(k) match, tuition reimbursement, Employee Assistance Program, paid holidays, military leave, and much more!

Responsibilities

The Senior System Administrator/Site Reliability Engineer (SRE) in the VA's Enterprise Cloud is responsible for ensuring the resilience, performance, reliability, and compliance of mission-critical cloud services that support Veterans and VA stakeholders. This role bridges software engineering, systems engineering, and operations to deliver highly available, secure, and efficient cloud-based platforms aligned with VA's modernization strategy and federal compliance mandates, with a focus on reliability, performance, scalability, and automation.

Day-to-day tasks vary by organization and system, but the role's daily work generally falls into these categories:

Proactively monitor system health, availability, and performance using observability tools (e.g., Prometheus, Grafana, Datadog, Splunk).
Respond to alerts and incidents, triage issues, and perform root cause analysis (RCA).
Lead on-call rotations to ensure 24/7 uptime and quick recovery from outages.
Document incident reports and contribute to postmortems to prevent recurrence.

Automate manual operational tasks such as deployments, scaling, and configuration using tools like Ansible, Terraform, or Puppet.
Manage infrastructure as code (IaC) to ensure consistency across environments.
Optimize CI/CD pipelines for reliable and repeatable software delivery.
Build self-healing systems to minimize downtime.

Conduct load and stress testing to validate system performance under peak demand.
Establish and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
Identify and reduce sources of latency, bottlenecks, and single points of failure.
Work with development teams to design reliability, scalability, and fault tolerance into customer servers.

Patch operating systems, containers, and dependencies to address vulnerabilities.
Ensure compliance with organizational and regulatory requirements.
Implement access controls, secrets management, and least-privilege principles.

Monitor resource utilization (CPU, memory, storage, network) to anticipate scaling needs.
Plan for growth by forecasting demand and preparing infrastructure accordingly.
Optimize cloud costs by rightsizing instances, using autoscaling, and leveraging reserved/spot instances.

Partner with software engineers to embed reliability practices into development.
Mentor teams on best practices for observability, automation, and incident handling.
Participate in blameless postmortems and contribute to knowledge-sharing sessions.
Continuously evaluate new tools and technologies to improve system reliability.

Design, monitor, and maintain Customer Servers to meet VA's 99.9%+ uptime and SLA requirements across multi-cloud and hybrid environments.
Implement fault-tolerant and self-healing architectures leveraging automation.

Develop and manage observability frameworks (logging, metrics, tracing) to detect, respond to, and remediate incidents quickly.
Lead blameless postmortems and drive corrective actions to strengthen VAEC resilience.

Engineer scalable automation pipelines for provisioning, patching, and compliance (e.g., Ansible, Terraform, Puppet, GitHub Actions).
Reduce manual effort through self-service tools for operations teams.

Monitor and optimize application and infrastructure performance to meet demand from VA Medical Centers, Enterprise Data Warehouses, and end users.
Ensure latency, throughput, and resource utilization align with mission needs.

Integrate VA 6500, NIST 800-53, FedRAMP, and Zero Trust requirements into daily operations.
Partner with cybersecurity teams to enforce continuous ATO (cATO) practices and vulnerability remediation.

Collaborate with Release Management, Engineering, and Operations teams to improve change management, deployment pipelines, and reliability practices.
Drive the adoption of SRE principles (error budgets, SLIs, SLOs, SLAs) into VA's IT Service Management (ITSM) processes.

Operate across VA's Enterprise Cloud (VAEC), on-premises data centers, and hybrid platforms, ensuring seamless integration and interoperability.
Support workloads across AWS GovCloud, Microsoft Azure Government, and Oracle Cloud Infrastructure (OCI) where applicable.

Mission Assurance: Continuous availability of systems supporting Veterans' health, benefits, and administrative services.
Operational Efficiency: Automated and standardized cloud operations reduce manual risk and speed delivery.
Compliance Assurance: Alignment with VA 6500, NIST, and federal mandates, minimizing audit risks.
Veteran-Centered Reliability: Ensure services that Veterans depend on are consistently reliable, secure, and performant.

Qualifications

5 years of experience in Site Reliability Engineering, DevOps, or Systems Engineering
Strong experience with Linux/Unix systems administration and troubleshooting
Proficient with cloud platforms (AWS and/or Azure), especially in deploying production workloads
Deep understanding of monitoring, metrics, alerting, and observability
Proficient in designing, implementing, and managing automation solutions using Ansible
Experience with CI/CD tools (e.g., GitHub Actions, Jenkins, GitLab CI, Azure DevOps)
Hands-on experience with containers and orchestration (Docker, Kubernetes, EKS, AKS)
Familiarity with networking concepts (TCP/IP, DNS, TLS, VPCs, load balancing)
Solid understanding of the software development lifecycle (SDLC) and Agile methodologies
Comfortable participating in on-call rotations and handling high-priority incidents

Preferred Qualifications:

AWS Certified SysOps Administrator or DevOps Engineer
Linux Certified: Azure Administrator or DevOps Engineer Expert
Certified Kubernetes Administrator (CKA)
Experience in chaos engineering, capacity modeling, or SRE tooling
Excellent analytical and problem-solving skills
Ability to work in cross-functional teams and communicate effectively with developers, operations, and leadership
A strong bias for automation and self-healing systems
Ownership mindset with a commitment to reliability and continuous improvement

Education:

Bachelor's degree in computer science, electronics engineering, or a related technical discipline and 5+ years of work experience
Eight (8) years of additional relevant experience may be substituted for education (13 years total)

Clearance:

The ability to pass a Tier 4/HIGH Background Investigation

Location:

Department of Veterans Affairs (100% on-site)
Monday - Friday, 8:00 AM - 4:30 PM EST
Austin Information Technology Center (AITC)
1615 Woodward Street, Austin, TX 78741