Site Reliability Engineer
What we do:
Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. The company is headquartered in Los Angeles, California, with additional locations across the globe.
What you’ll do:
As a Site Reliability Engineer at Zefr, you’ll apply your expertise in cloud infrastructure, CI/CD, observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with Zefr's Engineering and Data Science teams to ensure that the infrastructure required for our services is robust, efficient, and scalable.
We’re looking for someone to combine their technical expertise with strong leadership and a passion for continuous improvement and innovation. By ensuring the continuous health and efficiency of our infrastructure, you will directly contribute to Zefr’s commitment to providing a consistently high-quality user experience. This is a role where we both expect to learn from you and have you learn from us!
Support and build systems and tools that enable other engineers to generate, deploy, and manage product features.
Deploy and support a multi-cloud microservice architecture delivered via GitHub Actions, Argo CD, and Kubernetes.
Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and systems/pipelines in AWS and GCP.
Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
Debug code at the application and infrastructure level.
Mature our CI/CD workflows and release process.
Maintain a forward-thinking approach, actively researching and proposing new solutions.
Propose and review Engineering Requests for Comments (RFCs) to drive engineering architecture and practices.
Technology Stack at Zefr:
Core Infrastructure & Cloud Platforms:
Cloud Providers: Google Cloud Platform (GCP), Amazon Web Services (AWS)
Infrastructure as Code (IaC): Terraform
Containerization & Orchestration: Docker, Kubernetes (experience with GKE and/or EKS expected), Helm, Kustomize
Service Mesh: Istio
CI/CD & Automation:
CI/CD Pipelines: GitHub Actions
GitOps / Continuous Delivery: Argo CD
Primary Scripting/Automation Language: Python
Observability & Monitoring:
Monitoring & Alerting: Prometheus, Datadog, PagerDuty
Telemetry Standards: OpenTelemetry
Application & Data Ecosystem (Supporting):
Application Languages/Frameworks: Python, FastAPI, Flask, Node.js, React
Data Streaming: Apache Kafka
Data Processing/Transformation: Pandas, dbt
Workflow Orchestration: Apache Airflow, Ray
Machine Learning Stack:
Serving: Triton Inference Server
MLOps/Experiment Tracking: Weights & Biases, DVC
Libraries/Frameworks: Transformers, HuggingFace
Model Optimization/Formats: ONNX, TensorRT
Data Stores & Databases:
Relational Databases: PostgreSQL (including managed versions like AWS Aurora, GCP Cloud SQL)
NoSQL Databases: DynamoDB
Search Databases: OpenSearch, Elasticsearch
Vector Databases: Qdrant
Caching: Redis
Data Warehousing: Snowflake
What we’re looking for:
4+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP or AWS required)
Production experience designing, managing, deploying, and maintaining container-based workloads on Kubernetes clusters
Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
Knowledge of IaC and configuration management tools (Terraform, OpenTofu, Crossplane, Pulumi, Ansible, CloudFormation)
Strong problem-solving experience, focusing on automation
Production experience with monitoring and observability tools (Prometheus, Grafana, Datadog, Thanos, New Relic, OpenTelemetry)
Understanding of cloud networking concepts (mesh networking, NAT, load balancers, SSL certificates and TLS termination, API gateways, proxies, etc.)
Strong written and verbal communication, organization, and documentation skills
Benefits:
At Zefr, we embrace a flexible work environment that empowers our team to do their best work—whether that’s from home, a favorite local spot, or our vibrant London office. While remote work is supported, we also value in-person connection and collaboration. Our team regularly comes together in our office space for brainstorming sessions, team-building, and shared moments that spark creativity and strengthen our culture.
Monthly allowance toward Health Care, Dental, Optical, Income Protection and Relevant Life
Pension Scheme with 3% contribution from the Company
Holidays: Total of 28 days per year (including UK Bank Holidays)
Flexible hybrid work schedule
Summer Fridays (we leave early)
Compensation:
The anticipated salary for this position is between £70,000 and £90,000. Within the range, individual pay is determined by factors such as job-related skills, experience, and relevant education or training. If your compensation expectations fall outside of this range, it may still be worth having a conversation.
Zefr is an equal opportunity employer that embraces diversity and inclusion in the workplace. We are committed to building a team that represents a variety of backgrounds, skills, and perspectives because we know this only makes us better. We strongly encourage women, persons of color, LGBTQIA+ individuals, persons with disabilities, members of ethnic minorities, foreign-born residents, and veterans to apply even if you do not meet 100% of the qualifications.
- Location:
- London, England, United Kingdom
- Salary:
- £150,000 - £200,000
- Category:
- Engineering