About the role
Platform, DevOps & SRE (Primary Focus)
Infrastructure & Automation
Design and manage AWS infrastructure using Terraform (ECS, RDS, Redis, Kafka, networking, IAM).
Own service deployment patterns on ECS / Fargate.
Build safe, repeatable environments (dev, staging, prod).
Manage VPC architecture, service discovery, secrets, and access controls.
Reliability & Operations
Define and implement SLIs, SLOs, and error budgets.
Build alerting and incident response playbooks.
Improve system resilience against service crashes, network failures, dependency latency, and traffic spikes.
Lead incident response and postmortems.
Reduce MTTR through automation and tooling.
Observability
Implement structured logging, metrics, and distributed tracing.
Instrument services and infrastructure for performance and reliability visibility.
Own dashboards and alerts for critical systems.
CI/CD & Release Engineering
Build and maintain CI/CD pipelines.
Improve deployment safety (rollbacks, canary releases, and blue-green deployments where needed).
Standardize build and release workflows.
Enable high deployment velocity with operational safety.
Platform Tooling & Automation
Build internal tools for infra lifecycle management, cost monitoring, and scaling.
Automate provisioning, scaling, and recovery workflows.
Write Python and scripting utilities at the boundary between infrastructure and runtime systems.
Infrastructure Collaboration
Work with backend and platform teams to ensure services are production-ready and observable, improve deployment patterns and runtime configurations, and reduce operational risk in service design.
Provide reliability and scalability input during architecture reviews.
What We’re Looking For
Core Requirements
5 to 9 years of experience in DevOps, SRE, or platform engineering roles.
Strong hands-on experience with AWS (ECS/Fargate, RDS, networking, IAM).
Solid experience with Terraform in production systems.
Strong understanding of Linux, containers, and networking basics.
Proficiency in Python / Bash for automation, tooling, or infra services.
Experience running Redis, Kafka, and Postgres in production systems.
Strong debugging and incident-handling mindset.
SRE Mindset
Comfort owning systems end-to-end.
Ability to reason about failure modes, not just happy paths.
Bias toward automation over manual operations.
Nice to Have
Experience defining SLOs and alerting strategies.
Experience with high-throughput or real-time systems.
Exposure to geo-distributed systems or event-driven architectures.
Experience building internal developer platforms or golden paths.