Site Reliability Engineer

Role: Site Reliability Engineer (SRE)
Location: London, UK (5 days onsite)
Job Type: Contract & Fixed-Term Employment
Domain: Banking / Finance / Trading Market Risk

Skills:
- SRE experience with Python-based applications (not Java)
- Exposure to cloud technologies
- Familiarity with the Athena ecosystem or similar (SecDB, Quartz)
- Trade Lifecycle / Market Risk / Risk platform experience

Experience: Minimum 8 years

SRE Role Description

We need an experienced SRE to focus predominantly on automation, optimization, and process re-engineering using AI for the Market Risk Platform. Success is measured by capacity created (toil eliminated, fewer manual steps, faster recovery, safer/faster changes), not by being the primary BAU support resource. Strong Python and provable agentic AI delivery are required.

Primary Objectives:
- Eliminate operational toil and recurring manual work through durable automation.
- Re-engineer support/change processes to reduce handoffs, approval friction, and rerun complexity.
- Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering.

Key Responsibilities (Automation & Process First)

Automation Engineering (Core)
- Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks.
- Create self-service capabilities for common requests (guard-railed, auditable, repeatable).
- Implement automation with safety: idempotency, dry-run modes, approval gates where needed, rollback/undo strategies, and clear audit trails.

Process Re-engineering (Core)
- Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time.
- Standardize runbooks/playbooks into executable workflows; reduce tribal knowledge via templates, checklists, and automated pre-flight controls.
- Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction).

Agentic AI
- Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change-risk checks, automated rerun orchestration).
- Put strong controls in place: scoped permissions, deterministic fallbacks, human-in-the-loop approvals for risky actions, evaluation harnesses, and measurable outcomes.
- Productionize with monitoring, logging, and post-incident learnings feeding back into the agent/tooling.

Observability (enablement for automation)

Required Skills & Experience
- Senior SRE experience on distributed systems and batch/intraday workloads in a production environment.
- Strong Python.
- Provable agentic AI experience showing tool integration, guardrails, evaluation approach, and measurable impact (toil reduction, MTTR reduction, alert reduction, etc.).
- Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics).
- Strong Linux and troubleshooting fundamentals across application/system/network layers.
- Experience working across mixed estates (on-prem VMs + cloud, with some Kubernetes exposure for operational monitoring/reruns).

Differentiators
- Exposure to Banking/Finance Market Risk domains.
- Experience with and knowledge of the Athena ecosystem or similar (SecDB, Quartz).
Location: London
Job Type: Full Time
