Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration