Job Details

Staff Software Development Engineer

2026-05-17 CVS Health Idaho Falls, ID

Description:

Job Description:

- Define and implement enterprise-wide SRE practices, including SLIs, SLOs, error budgets, and reliability governance
- Drive a culture of reliability, automation, and continuous improvement across engineering teams
- Establish metrics-driven approaches to measure system health, availability, and performance
- Lead adoption of AIOps solutions to enable predictive monitoring, anomaly detection, and automated root cause analysis
- Integrate machine learning models and analytics into monitoring pipelines to proactively detect and prevent incidents
- Develop intelligent alerting systems to reduce noise and improve signal quality
- Architect and build scalable observability frameworks covering metrics, logs, traces, and events
- Define standards for instrumentation, telemetry collection, and distributed tracing
- Enable real-time insights into system performance across microservices and cloud-native architectures
- Lead incident response practices, including on-call readiness, root cause analysis (RCA), postmortems, and continuous learning loops
- Build self-healing systems and automate remediation workflows to reduce Mean Time to Resolution (MTTR)
- Implement runbooks, playbooks, and automated escalations
- Develop internal platforms and tools for observability, monitoring, and performance optimization
- Integrate observability into CI/CD pipelines to enable proactive quality and reliability checks
- Drive infrastructure automation using Infrastructure as Code (IaC) frameworks and GitOps principles
- Partner with engineering, platform, and product teams to embed reliability and observability into system design
- Mentor engineers and lead design reviews focused on scalability, resilience, and operability
- Influence enterprise architecture decisions and promote best practices across teams

Requirements:

- 5+ years of experience in software engineering, SRE, or production engineering in large-scale distributed systems
- Hands-on experience with observability tools such as AppDynamics, Grafana, Prometheus, Datadog, OpenTelemetry, or similar
- Experience with AIOps or intelligent monitoring platforms, including anomaly detection and event correlation
- Strong expertise in cloud platforms (AWS, Azure, or GCP) and cloud-native architectures (Kubernetes, containers, microservices)
- Proficiency in at least one programming language (e.g., Python, Java, Go)
- Strong understanding of distributed systems, resiliency patterns, and fault tolerance
- Experience implementing incident management, on-call processes, and root cause analysis
- Hands-on expertise with Infrastructure as Code (Terraform, ARM, CloudFormation) and CI/CD pipelines
- Experience using GenAI/automation tools and frameworks such as OpenAI, Copilot, Gemini, Claude, MCP, etc.
- Proven ability to design scalable, reliable, and observable systems

Benefits:

- Medical, dental, and vision coverage
- Paid time off
- Retirement savings options
- Wellness programs
- Other resources, based on eligibility
