Lead AI Engineer - SRE, LLM Agents, Full-Stack Architecture
About the Role
We are seeking a Lead AI Engineer to design, build, and operate enterprise-grade agentic LLM systems, Model Context Protocol (MCP) servers, and production-grade RAG/GraphRAG pipelines within a regulated financial institution. The role blends AI architecture, site reliability engineering, full-stack development (Node.js/TypeScript and Python), and technical leadership to deliver secure, auditable AI automation.
Key Responsibilities
- Architect and deploy multi-agent LLM workflows that perform autonomous reasoning, planning, and secure tool execution within banking systems.
- Design and implement Model Context Protocol (MCP) servers for standardized context management between models, internal APIs, and external data sources.
- Build production Retrieval-Augmented Generation (RAG) and GraphRAG pipelines with full auditability grounded in enterprise financial data.
- Lead full-stack development using Node.js (TypeScript) and Python (FastAPI/Django) to expose RESTful and GraphQL APIs for LLM inference and agent actions.
- Implement AI observability using the ELK stack (Elasticsearch, Logstash, Kibana), focused on LLM-specific metrics such as latency, token usage, hallucination rates, and model drift.
- Apply SRE best practices to AI workloads: high availability, fault tolerance, incident playbooks, SLO/SLA management, CI/CD for models, shadow deployments, and rollback strategies.
- Automate infrastructure and deployments with Bash and Python scripts; establish prompt versioning, model drift detection, and automated evaluation pipelines.
- Mentor and lead engineers, influence cross-functional technical decisions, and drive AI-native development practices, including AI-assisted developer workflows (Cursor, GitHub Copilot).
Requirements
- Expert-level proficiency in Node.js (TypeScript/JavaScript) and Python; Bash scripting is mandatory.
- Deep understanding of LLM architectures, prompt engineering, fine-tuning techniques (LoRA/QLoRA), embeddings, and operating LLM applications in production.
- Hands-on experience with agentic frameworks and implementing MCP servers.
- Proven experience building RAG/GraphRAG pipelines and working with vector databases (Pinecone, Milvus, Weaviate).
- Extensive experience with ELK Stack for logging/observability and AI-specific metric tracking.
- Cloud-native architecture experience; Azure and AKS strongly preferred.
- Experience with enterprise AI tooling: Microsoft Copilot (Copilot Studio), Meta AI (Llama ecosystem), Google AI (Gemini, Vertex AI).
- 8+ years of progressive software engineering experience, including at least 3 years in a technical leadership or architectural role.
Banking & Compliance
- Knowledge of SOC 2 Type II principles, financial data classification, PII protection, and audit trails for AI outputs.
- Experience with secure credential management (Azure Key Vault, HashiCorp Vault), model governance (versioning, explainability), and zero-trust/least-privilege patterns.
Preferred Qualifications
- Experience integrating AI observability with OpenTelemetry.
- Elastic certifications or familiarity with Elastic Agent/Fleet.
- Prior financial services experience.
- Contributions to open-source AI/ML projects.
Location & Compensation
- Work Location: Hybrid remote in Toronto, ON (York District).
- Pay: $59,408.22 - $139,005.25 per year.