Building Real-World Agentic AI Systems

A full-day masterclass at WeAreDevelopers Berlin. You will go from a single AI agent to a small system of agents that use tools, remember context, ask a human before risky actions, and hand work to each other, all grounded in one running story.

Abstract

Most "AI agent" demos fall apart the moment they meet the real world: missing context, fuzzy memory, unsafe tool calls, and no human oversight. This workshop is about the engineering that makes agents dependable. You will build, run, and inspect agents against a (semi-)realistic scenario, and you will see exactly where things break and how to make them safe. That's what she said.

New to the show?

That is not a problem. You have eight hours of workshop time to catch up. No one will judge you for doing the research on Netflix, or you can just check the Lore page to get up to speed.

The day promise

By the end of the day you will be able to reason about and build with these seven ideas. Each is defined in plain language here and explored hands-on across the day. Some ideas pair up in a single segment; for example, tool calling is designed and used alongside MCP.

| Idea | Plain-language definition |
|---|---|
| Agent | A program that uses a language model to decide what to do next, often calling tools, to reach a goal. |
| Context engineering | Deliberately choosing what information the model sees (and what it doesn't) so it makes good decisions. |
| Memory | What an agent remembers across steps or sessions, and, just as importantly, what it should forget. |
| MCP (Model Context Protocol) | An open standard for connecting agents to tools and data sources through a common interface. |
| Tool calling | Letting an agent run a defined function (with typed inputs and outputs) to act on the world. |
| HITL (Human-in-the-loop) | A point where a person reviews and approves what the agent proposes before it takes effect. |
| A2A (Agent-to-Agent) | One agent handing a task to another, with a clear contract for what comes back. |
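
To make the Agent and Tool calling definitions concrete, here is a minimal Python sketch. It is not the workshop's implementation: `fake_model`, `get_package_status`, the `PKG-...` ids, and the in-memory data are all hypothetical stand-ins. A real agent would ask a language model to choose the tool instead of scanning the question for a keyword.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PackageStatus:
    """Typed output of the tool."""
    package_id: str
    status: str


# Hypothetical in-memory data standing in for the supplied package manager app.
PACKAGES = {"PKG-001": "in_transit", "PKG-002": "damaged"}


def get_package_status(package_id: str) -> PackageStatus:
    """A tool: a defined function with a typed input and a typed output."""
    return PackageStatus(package_id, PACKAGES.get(package_id, "unknown"))


# The agent can only call tools that are registered here.
TOOLS: dict[str, Callable[[str], PackageStatus]] = {
    "get_package_status": get_package_status,
}


def fake_model(question: str) -> Optional[tuple[str, str]]:
    """Stand-in for a language model (assumption): returns (tool name, argument)."""
    for word in question.split():
        token = word.strip("?.,!").upper()
        if token.startswith("PKG-"):
            return ("get_package_status", token)
    return None  # the "model" found nothing it can act on


def run_agent(question: str) -> str:
    """One decide-then-act step: ask the model what to do, then run the tool."""
    decision = fake_model(question)
    if decision is None:
        return "I need a package id to help with that."
    tool_name, argument = decision
    result = TOOLS[tool_name](argument)
    return f"{result.package_id} is {result.status}"
```

Calling `run_agent("Where is PKG-002?")` routes through the registered tool and reports the stored status. A question without a package id gets the fallback reply and no tool is called.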

New to these terms?

That's expected. Every acronym is defined again the first time it appears in each segment, and the References page links to the official source for each topic.

What you will build

Working in the Dunder Mifflin scenario, you will incrementally build:

  1. A first agent that answers questions about packages and orders.
  2. A context map that decides what the package manager agent should and shouldn't see.
  3. An MCP design that maps package capabilities to resources, prompts, and tools.
  4. A running MCP server exposing the package tools, validated standalone.
  5. A first package manager agent connected to the MCP server, with a defined role and boundaries.
  6. A native function tool plus an explicit agent-to-agent handoff to a RegionalManagerAgent.
  7. A human-in-the-loop review step that pauses for confirmation before a customer-impacting action (like changing a damaged package's status).
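
The human-in-the-loop step in item 7 can be pictured as a gate between what the agent proposes and what actually changes. The sketch below is illustrative only and rests on assumptions: `ProposedAction`, `apply_with_review`, and the `approve` callback are hypothetical names, and the dictionary stands in for the supplied package manager's data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    """What the agent wants to do, described before anything is changed."""
    package_id: str
    new_status: str
    reason: str


def apply_with_review(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    store: dict[str, str],
) -> bool:
    """Apply the action only if the reviewer approves it.

    `approve` is where the human sits (assumption): it could be a console
    prompt, a web form, or a chat message. Here it is any callable that
    returns True or False.
    """
    if not approve(action):
        return False  # rejected: the store is left untouched
    store[action.package_id] = action.new_status
    return True


def console_reviewer(action: ProposedAction) -> bool:
    """Example reviewer that asks on the terminal."""
    answer = input(
        f"Set {action.package_id} to '{action.new_status}' "
        f"({action.reason})? [y/N] "
    )
    return answer.strip().lower() == "y"
```

Passing the reviewer in as a callable keeps the gate testable: in a test you can substitute `lambda action: False` and check that nothing changed, while in a live session `console_reviewer` pauses for a real answer.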

The scenario

The workshop uses one running story: the Dunder Mifflin package manager, a fictional system for a paper company's warehouse and delivery operation (inspired by The Office).

Across the day you work with these recurring elements:

  • Packages, orders, and routes moving through the warehouse.
  • Labels and package conditions that must be read and checked.
  • Managers, drivers, and approvals that gate what happens next.

Recurring roles you will meet:

| Role | What they do |
|---|---|
| Darryl | Coordinates warehouse work and hands tasks to agents. |
| RegionalManagerAgent | Makes higher-level decisions and approves escalations. |
| PackageLabelParser | A tool that turns a messy label into structured fields. |
| PackageConditionChecker | A tool that flags damaged or risky packages. |
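
As a rough illustration of what "turns a messy label into structured fields" could mean for PackageLabelParser, here is a small Python sketch. The label layout (a `PKG-` id, a `to:` destination, a weight in kg, separated by `|` or `;`), the `ParsedLabel` fields, and the `parse_label` function are all assumptions made for this example; the workshop's actual tool may look quite different.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedLabel:
    """Structured fields; None means the field was not found on the label."""
    package_id: Optional[str]
    destination: Optional[str]
    weight_kg: Optional[float]


def parse_label(raw: str) -> ParsedLabel:
    """Extract fields from a free-text label (hypothetical label format)."""
    text = " ".join(raw.split())  # collapse stray whitespace and line breaks

    package = re.search(r"\bPKG-(\d+)\b", text, re.IGNORECASE)
    destination = re.search(r"\bto:\s*([^|;]+)", text, re.IGNORECASE)
    weight = re.search(r"(\d+(?:\.\d+)?)\s*kg\b", text, re.IGNORECASE)

    return ParsedLabel(
        package_id=f"PKG-{package.group(1)}" if package else None,
        destination=destination.group(1).strip() if destination else None,
        weight_kg=float(weight.group(1)) if weight else None,
    )
```

Returning `None` for a missing field, rather than guessing, gives a downstream step such as a condition check or a human review something explicit to react to.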

The app is supplied, not built here

The package manager application itself is supplied or referenced for you. These materials are about the agentic AI concepts around it; you will not build the package manager from scratch.

How to use this site

Facilitators: start at the Facilitator notes.

What we will not cover

  • A deep theoretical survey of every agent framework; we go deep on a few practical patterns instead.
  • Production-scale infrastructure and load testing.