Lead AI Engineer - SRE, LLM Agents, Full-Stack Architecture
About the Role
We are seeking a Lead AI Engineer to design, build, and operate enterprise-grade agentic LLM systems, Model Context Protocol (MCP) servers, and production-grade RAG/GraphRAG pipelines within a regulated financial institution. The role blends AI architecture, site reliability engineering, full-stack development (Node.js/TypeScript and Python), and technical leadership to deliver secure, auditable AI automation.
Key Responsibilities
- Architect and deploy multi-agent LLM workflows that perform autonomous reasoning, planning, and secure tool execution within banking systems.
- Design and implement Model Context Protocol (MCP) servers for standardized context management between models, internal APIs, and external data sources.
- Build production Retrieval-Augmented Generation (RAG) and GraphRAG pipelines with full auditability grounded in enterprise financial data.
- Lead full-stack development using Node.js (TypeScript) and Python (FastAPI/Django) to expose RESTful and GraphQL APIs for LLM inference and agent actions.
- Implement AI observability using the ELK stack (Elasticsearch, Logstash, Kibana), focused on LLM-specific metrics such as latency, token usage, hallucination rates, and model drift.
- Apply SRE best practices to AI workloads: high availability, fault tolerance, incident playbooks, SLO/SLA management, CI/CD for models, shadow deployments, and rollback strategies.
- Automate infrastructure and deployments with Bash and Python scripts; establish prompt versioning, model drift detection, and automated evaluation pipelines.
- Mentor and lead engineers, influence cross-functional technical decisions, and drive AI-native development practices, including AI-assisted developer workflows (Cursor, GitHub Copilot).
Requirements
- Expert-level proficiency in Node.js (TypeScript/JavaScript) and Python; Bash scripting is mandatory.
- Deep understanding of LLM architectures, prompt engineering, fine-tuning techniques (LoRA/QLoRA), embeddings, and operating LLM applications in production.
- Hands-on experience with agentic frameworks and implementing MCP servers.
- Proven experience building RAG/GraphRAG pipelines and working with vector databases (Pinecone, Milvus, Weaviate).
- Extensive experience with ELK Stack for logging/observability and AI-specific metric tracking.
- Cloud-native architecture experience; Azure and AKS strongly preferred.
- Experience with enterprise AI tooling: Microsoft Copilot (Copilot Studio), Meta AI (Llama ecosystem), Google AI (Gemini, Vertex AI).
- 8+ years of progressive software engineering experience, including at least 3 years in a technical leadership or architectural role.
Banking & Compliance
- Knowledge of SOC 2 Type II principles, financial data classification, PII protection, and audit trails for AI outputs.
- Experience with secure credential management (Azure Key Vault, HashiCorp Vault), model governance (versioning, explainability), and zero-trust/least-privilege patterns.
Preferred Qualifications
- Experience integrating AI observability with OpenTelemetry.
- Elastic certifications or familiarity with Elastic Agent/Fleet.
- Prior financial services experience.
- Contributions to open-source AI/ML projects.
Location & Compensation
- Work Location: Hybrid remote in Toronto, ON (York District).
- Pay: $59,408.22 - $139,005.25 per year.