Platform

3-day production workload + AI enablement for a global bank

An intensive three-day on-site engagement: pair-build a production AWS workload with the customer's engineering team while running AI enablement sessions for engineers and business leaders. Architecture, code, and knowledge transfer, all in three days.

Global financial services (anonymised) | Software Development Engineer (AWS) | Jun 2024
On-site days: 3
Workload status: Production-ready
Stakeholders engaged: Engineering + business

Problem

A global financial services customer had a working AWS proof-of-concept they couldn't move into production. The blockers were familiar: networking that had not been hardened, no IaC, no CI/CD, a pending security review, and ambiguous data flows. The clock mattered: their delivery deadline was only weeks away.

In parallel, their engineering organisation and several business stakeholders wanted a serious grounding in modern AI engineering: what's real, what's hype, what they could actually build on Bedrock without burning cycles.

We had three days on-site to do both.

Process

I shaped the week as a pair-build alongside enablement, not a presentation followed by a workshop:

Day 1 — Build morning, AI orientation afternoon. Morning: I sat with the customer's two senior engineers and we walked the existing PoC against a production-readiness checklist. Networking, secrets, observability, data classification. We decided what stayed, what was rebuilt, what got deferred to a second phase. Afternoon: a 2-hour session for the broader engineering org on Bedrock — what models exist, what they're good at, what RAG actually means in practice, where guardrails matter. Hands-on with the AWS console, real prompts.

Day 2 — Pair-coding. We rebuilt the workload with proper Terraform modules, Step Functions for the long-running orchestration the PoC handled badly, and a CI pipeline through CodeBuild. Two of the customer's engineers paired with me throughout — by lunchtime they were leading sections, with me reviewing.

Day 3 — Production hardening + leadership session. Morning: security review prep, IAM least-privilege, observability through CloudWatch + X-Ray, runbooks for the on-call team. We mock-ran an incident scenario. Afternoon: a closed session with engineering directors and a small group of business leaders on how AI investment lands inside an organisation — what to fund, how to staff, what governance to put in place. Concrete, not abstract.

Outcome

The workload left the engagement production-ready: hardened, in code, in CI, with the customer's engineers fluent in every part. They went into security review the following week and passed. Two engineers have since become the internal champions for further AI work.

The business-leader session reportedly fed directly into next-year's AI investment plan — but I don't have the receipts on that one.

For engineers: Technical Deep Dive

What the workload actually was

A document-intelligence pipeline: customer documents land in S3, and Step Functions orchestrates extraction (Textract), classification (Bedrock), validation (a rules engine in Lambda), and routing to downstream systems. The PoC version was a single sprawling Lambda with hand-rolled retries. The rebuilt version was structured around the failure modes.

S3 (incoming) → EventBridge
   │
   ▼
Step Functions:
   ├─ Map state per document
   │   ├─ Textract async (with token-bucket throttle)
   │   ├─ Bedrock classification (with retry on throttling)
   │   ├─ Rules validation (Lambda)
   │   └─ Route: pass / hold / reject
   └─ Aggregate state machine: batch summary → SNS

DynamoDB (audit trail) ← every state transition
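
The diagram mentions a token-bucket throttle in front of the Textract call. How that throttle was built is not described here, so the following is only a minimal in-process sketch of the technique. The class name, the `rate` and `capacity` values, and the injectable `clock` are all assumptions for illustration. A throttle shared across concurrent Map iterations would need its state in a shared store rather than in process memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`.

    Hypothetical helper: not the code used in the engagement.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = capacity        # start full so an initial burst is allowed
        self.updated = clock()

    def _refill(self) -> None:
        """Add the tokens earned since the last call, capped at capacity."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def seconds_until_available(self, cost: float = 1.0) -> float:
        """How long a caller should wait before `cost` tokens are available."""
        self._refill()
        missing = cost - self.tokens
        return max(0.0, missing / self.rate)
```

A caller that gets `False` from `try_acquire()` can wait for `seconds_until_available()` and try again, or raise a retryable error and let the orchestrator's backoff handle the delay.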

Why Step Functions over a single Lambda

A single Lambda hit three problems on the PoC:

  1. Long-tail timeouts. Textract async jobs could take minutes, and with everything running inside one invocation, Lambda's 15-minute ceiling meant a share of documents always failed.
  2. No granular retries. A failure in classification re-processed the whole document, paying for Textract a second time.
  3. No audit visibility. Compliance asked "where is document X right now?" and the answer was a CloudWatch log search.

Step Functions solved all three. Each state was idempotent, retries were per-step with exponential backoff, and the execution history, with every state transition also written to DynamoDB, served as the audit trail.
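
To make "per-step retries with exponential backoff" concrete, here is a rough sketch that builds an Amazon States Language definition as a Python dict. It is not the definition from the engagement. The state names, the placeholder task resources, the `$.documents` and `$.validation.outcome` paths, and the retry numbers are all assumptions, and the aggregate/SNS step and the DynamoDB audit writes are left out for brevity.

```python
import json

# Hypothetical placeholders: the real task resources are not given in this write-up.
TEXTRACT_TASK = "REPLACE_WITH_TEXTRACT_TASK_RESOURCE"
BEDROCK_TASK = "REPLACE_WITH_BEDROCK_TASK_RESOURCE"
RULES_TASK = "REPLACE_WITH_RULES_LAMBDA_RESOURCE"


def retry_policy(errors, interval_seconds=2, max_attempts=4, backoff_rate=2.0):
    """Build one ASL Retry entry. The numbers are illustrative defaults."""
    return {
        "ErrorEquals": list(errors),
        "IntervalSeconds": interval_seconds,  # wait before the first retry
        "MaxAttempts": max_attempts,          # maximum number of retries
        "BackoffRate": backoff_rate,          # multiplier applied per retry
    }


def retry_delays(policy):
    """Seconds waited before each retry under the policy (ignoring jitter)."""
    return [policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
            for attempt in range(policy["MaxAttempts"])]


def task_state(resource, next_state, retry_errors):
    """A Task state that retries on its own, without re-running earlier steps."""
    return {
        "Type": "Task",
        "Resource": resource,
        "Retry": [retry_policy(retry_errors)],
        "Next": next_state,
    }


def build_definition():
    # States.TaskFailed is broad; narrower error names would be better in practice.
    per_document = {
        "StartAt": "Extract",
        "States": {
            "Extract": task_state(TEXTRACT_TASK, "Classify", ["States.TaskFailed"]),
            "Classify": task_state(BEDROCK_TASK, "Validate", ["States.TaskFailed"]),
            "Validate": task_state(RULES_TASK, "Route", ["States.TaskFailed"]),
            "Route": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.validation.outcome",
                     "StringEquals": "pass", "Next": "Accepted"},
                    {"Variable": "$.validation.outcome",
                     "StringEquals": "hold", "Next": "Held"},
                ],
                "Default": "Rejected",
            },
            "Accepted": {"Type": "Succeed"},
            "Held": {"Type": "Succeed"},
            "Rejected": {"Type": "Succeed"},
        },
    }
    return {
        "Comment": "Sketch only: one Map iteration per document",
        "StartAt": "PerDocument",
        "States": {
            "PerDocument": {
                "Type": "Map",
                "ItemsPath": "$.documents",
                "MaxConcurrency": 5,  # assumed value
                "ItemProcessor": {"ProcessorConfig": {"Mode": "INLINE"},
                                  **per_document},
                "End": True,
            }
        },
    }
```

`print(json.dumps(build_definition(), indent=2))` renders the JSON. Because the retry sits on each Task, a throttled classification call retries only that step, and the Textract result from the previous state is not paid for twice.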

Security posture

Production-readiness for a regulated customer meant:

  • VPC endpoints for every AWS service the workload touched (no internet egress)
  • KMS-encrypted DynamoDB and S3, with separate keys per data classification
  • IAM roles built from aws-iam-policy-validator output, not hand-written
  • Per-environment Terraform workspaces (dev / staging / prod) with prod requiring a separate approval gate
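
As an illustration of the first two bullets, the sketch below builds an S3 bucket policy that denies requests not arriving through a given VPC endpoint and denies uploads that do not name the expected KMS key. The engagement expressed its infrastructure in Terraform, and the actual policies are not reproduced here. The function name, bucket name, endpoint ID, and key ARN are hypothetical, and Python is used only to show the shape of the JSON.

```python
import json


def bucket_policy(bucket_name: str, vpc_endpoint_id: str, kms_key_arn: str) -> dict:
    """Build a deny-by-default bucket policy for one data classification.

    Hypothetical helper; all argument values are supplied by the caller.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any request that does not come through the VPC endpoint.
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            },
            {
                # Deny uploads that do not name this classification's KMS key.
                # Requests that omit the encryption header are denied as well.
                "Sid": "DenyUploadsWithoutExpectedKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            },
        ],
    }
```

For example, `json.dumps(bucket_policy("example-incoming-docs", "vpce-0123456789abcdef0", "arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE"), indent=2)` produces the policy document. A blanket deny like the first statement also blocks console and pipeline access from outside the endpoint, so it needs to be rolled out with care.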

What I deferred

  • Multi-region failover. The customer's RTO didn't require it yet, and adding it during a 3-day rebuild would have been reckless.
  • Custom model fine-tuning. The PoC team had been considering it; we benchmarked stock Bedrock against their data and found accuracy already met the bar. Saved them ~6 weeks.

Trade-offs

The pace of the engagement worked because the customer's engineers were strong and senior leadership cleared everyone's calendars. It would not have worked with a less capable team or a stakeholder still in discovery mode. Honest selling was important: this wasn't a magic three-day fix, it was an intense pair-build against a clear scope.