We’re looking for a Senior Site Reliability Engineer to join Cloud Platform at one of the world’s biggest publishers. You will help us on the web application side with reliability, performance and scalability, as well as with observability practices and tools, CI/CD pipelines, and incident resolution and troubleshooting.
We aim to build a world-class operational organisation in which developers own their applications and are supported by site reliability engineers, so that together we provide the best experience to our customers.
About Cloud Platform
You’ll be joining the Cloud Platform group, which is responsible for three main areas: Site Reliability, Developer Experience and Operational Excellence. The group works across Europe, India and the US.
We are responsible for building and maintaining a global Kubernetes platform, built in AWS, for the deployment of applications that serve hundreds of Condé Nast websites around the world. You will work alongside the application development teams in Condé Nast’s London, New York and Bangalore headquarters to build a scalable, highly available, resilient platform for use by engineers around the world.
What we’re solving
As a group, our goal is to build a single global, highly resilient platform. As an SRE group, our main challenges are to maintain the stability of the platform, provide services that simplify the running of applications, and drive operational improvements as we expand globally.
You will have previous experience with:
• Significant development work in NodeJS, Python, Go or another language
• Understanding of software architecture and design
• Cloud engineering on AWS, including building fault-tolerant systems
• Infrastructure as code
• Driving operational quality through common SRE best practices
• Observability of applications, using monitoring, logging, tracing and alerting solutions
• Operating in production
• Performance and load testing
Technologies we use
• AWS - for most of our infrastructure
• Terraform - infrastructure as code
• Kubernetes - our platforms
• CircleCI - deployment
• Java - one of our deployment tools
• Go - our preferred language for building auxiliary services
• Datadog, Kibana, Elasticsearch, Splunk, Prometheus, Grafana - for observing our platforms
How we work
• Infrastructure as Code everywhere
• Pairing. We like knowledge sharing and upskilling
• Remote friendly. We work with engineers across time zones & locations
• On-call. Teams are responsible for their own apps
• Strong background in software development, having led projects or workstreams
• You love working with a global team and are flexible about adjusting your hours to accommodate different time zones as required
• You would like to develop a large-scale, cutting-edge platform to support the deployment of hundreds of applications
• Raise the bar on all work, and share best practices
• A desire to automate everything
• Used and supported Kubernetes in a production environment
• Experience with any of ELK, Grafana, Datadog or Splunk
• AWS or other cloud experience