Job Details

Staff Software Development Engineer

2026-05-17 | CVS Health | Idaho Falls, ID
Description:

Job Description:
- Define and implement enterprise-wide SRE practices, including SLIs, SLOs, error budgets, and reliability governance
- Drive a culture of reliability, automation, and continuous improvement across engineering teams
- Establish metrics-driven approaches to measure system health, availability, and performance
- Lead adoption of AIOps solutions to enable predictive monitoring, anomaly detection, and automated root cause analysis
- Integrate machine learning models and analytics into monitoring pipelines to proactively detect and prevent incidents
- Develop intelligent alerting systems to reduce noise and improve signal quality
- Architect and build scalable observability frameworks covering metrics, logs, traces, and events
- Define standards for instrumentation, telemetry collection, and distributed tracing
- Enable real-time insights into system performance across microservices and cloud-native architectures
- Lead incident response practices, including on-call readiness, RCA, postmortems, and continuous learning loops
- Build self-healing systems and automate remediation workflows to reduce Mean Time to Resolution (MTTR)
- Implement runbooks, playbooks, and automated escalations
- Develop internal platforms and tools for observability, monitoring, and performance optimization
- Integrate observability into CI/CD pipelines to enable proactive quality and reliability checks
- Drive infrastructure automation using IaC frameworks and GitOps principles
- Partner with engineering, platform, and product teams to embed reliability and observability into system design
- Mentor engineers and lead design reviews focused on scalability, resilience, and operability
- Influence enterprise architecture decisions and promote best practices across teams

Requirements:
- 5+ years of experience in software engineering, SRE, or production engineering in large-scale distributed systems
- Hands-on experience with observability tools such as AppDynamics, Grafana, Prometheus, Datadog, OpenTelemetry, or similar
- Experience with AIOps or intelligent monitoring platforms, including anomaly detection and event correlation
- Strong expertise in cloud platforms (AWS, Azure, or GCP) and cloud-native architectures (Kubernetes, containers, microservices)
- Proficiency in at least one programming language (e.g., Python, Java, Go)
- Strong understanding of distributed systems, resiliency patterns, and fault tolerance
- Experience implementing incident management, on-call processes, and root cause analysis
- Hands-on expertise with Infrastructure as Code (Terraform, ARM, CloudFormation) and CI/CD pipelines
- Experience using GenAI/automation tools and frameworks such as OpenAI, Copilot, Gemini, Claude, MCP, etc.
- Proven ability to design scalable, reliable, and observable systems

Benefits:
- Medical, dental, and vision coverage
- Paid time off
- Retirement savings options
- Wellness programs
- Other resources, based on eligibility


