Site Reliability Engineer
We are looking for a highly skilled Engineer with expertise in Python programming, automation, and modern observability practices to help build and operate scalable distributed systems for an award-winning London Hedge Fund. This role sits at the intersection of platform engineering, AI tooling, and system reliability. You will design automation frameworks, develop AI-assisted engineering tools, and implement observability solutions that provide deep insights into complex distributed architectures.
Responsibilities
- Design, develop, and maintain robust automation solutions using Python.
- Build and maintain observability pipelines including metrics, logs, and traces across distributed systems.
- Develop internal AI-powered tools that enhance engineering productivity and operational intelligence.
- Implement monitoring, alerting, and diagnostics to improve system reliability, performance, and scalability.
- Integrate observability platforms with automation workflows and incident response systems.
- Collaborate with platform, infrastructure, data, and development teams to improve system visibility and operational maturity.
- Design tooling that enables proactive detection, analysis, and remediation of system issues across distributed environments.
- Contribute to architecture decisions around telemetry, AI-assisted debugging, and automation frameworks.
- Directly support business users and stakeholders with system analysis, problem management, and technical resolution.
Skills & Experience
- Strong professional experience with Python development in production environments.
- Proven experience building automation frameworks, scripts, and developer tooling.
- Strong experience working with distributed systems and large-scale service architectures.
- Hands-on experience working with Kubernetes in production environments.
- Deep understanding of observability practices, including metrics, logs, tracing, and telemetry pipelines.
- Experience integrating AI or machine learning tooling into engineering workflows.
- Strong understanding of APIs, microservices, and containerised environments.
- Experience with CI/CD pipelines and infrastructure automation.
- Ability to design scalable, maintainable engineering tools.
- Experience supporting business users directly, coordinating projects or problems with dev and infra teams, and owning projects.
Interesting Technologies
- Observability: OpenTelemetry, Prometheus, Grafana, Elastic Stack (ELK), Jaeger
- Automation & CI/CD: GitHub Actions, Jenkins, GitLab CI, Argo Workflows
- Distributed Systems & Messaging: Kafka, Redis, gRPC
Offer
- World-class technology environment (award-winning) with best-in-class engineering teams.
- Fast-paced, low-bureaucracy culture with a get-stuff-done mindset.
- Up to £150,000 base salary, plus a 50%-100% annual cash bonus. Pension, healthcare, gym, food, 30 days holiday, etc.
- 4 days onsite, 1 day working from home.
- The chance to shape the future of intelligent automation and operational insight in distributed platforms.
- Location: Greater London, England, United Kingdom
- Salary: £100,000 - £125,000
- Job Type: Full-time
- Category: Engineering