This is something of an orthogonal answer, but I think Brooks didn't go about his idea the right way. That is, subsumption architecture is one in which the 'autopilot' is replaced by a more sophisticated system when necessary. (All pieces receive the raw sensory inputs, and output actions, some of which turn off or on other systems.)
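To make that description concrete, here is a minimal sketch of the arrangement as I've characterized it, not Brooks's actual implementation. Every layer reads the same raw sensor dictionary, and a higher layer that fires overrides whatever the lower layers proposed. The names `wander`, `avoid`, `subsumption_step`, and the `obstacle_distance` key are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

# Raw sensor readings, shared unchanged by every layer (assumed format).
Sensors = Dict[str, float]
# A behavior proposes an action, or None if it has nothing to say.
Behavior = Callable[[Sensors], Optional[str]]


def wander(sensors: Sensors) -> Optional[str]:
    """Lowest layer, the 'autopilot': always proposes moving forward."""
    return "forward"


def avoid(sensors: Sensors) -> Optional[str]:
    """Higher layer: fires only when an obstacle is close."""
    if sensors["obstacle_distance"] < 0.5:
        return "turn_left"
    return None


def subsumption_step(sensors: Sensors, layers: List[Behavior]) -> Optional[str]:
    """Run every layer on the raw sensors; the highest layer that fires wins.

    `layers` is ordered from lowest to highest priority.
    """
    action = None
    for behavior in layers:
        proposal = behavior(sensors)
        if proposal is not None:
            action = proposal  # a higher layer replaces the lower one's output
    return action
```

Calling `subsumption_step({"obstacle_distance": 0.2}, [wander, avoid])` yields the `avoid` layer's action, while a distant obstacle leaves the `wander` autopilot in charge. Note that no layer's output is ever a target for another layer; they only compete for the actuators.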
But a better approach is the normal hierarchical control approach, in which the target of a lower level system is the output of a higher level system. That is, the targeted joint angle of a robot leg is determined by the system that is trying to optimize the velocity, which is determined by a system that is trying to optimize the trajectory, which is determined by a system that is trying to optimize the target position, and so on.
This allows for increasing levels of complexity while maintaining detail and system reusability.
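As a rough illustration of that chaining, here is a toy two-level cascade for a point moving in one dimension, where the position loop's output becomes the velocity loop's target. This is only a sketch under my own assumptions: the proportional controllers, the gains, the limits, and the names `PController` and `simulate` are all hypothetical, and a real legged robot would have more levels (joint angles, trajectory, and so on).

```python
from dataclasses import dataclass


@dataclass
class PController:
    """Proportional controller with a saturated output (illustrative only)."""
    gain: float
    limit: float

    def output(self, target: float, measured: float) -> float:
        command = self.gain * (target - measured)
        # Clamp the command to the range [-limit, +limit].
        return max(-self.limit, min(self.limit, command))


def simulate(goal: float, steps: int = 2000, dt: float = 0.01):
    """Drive a 1-D point toward `goal` with two stacked control loops."""
    position_loop = PController(gain=1.0, limit=2.0)   # outputs a velocity target
    velocity_loop = PController(gain=5.0, limit=10.0)  # outputs an acceleration
    x, v = 0.0, 0.0
    for _ in range(steps):
        # Higher level: its output is the lower level's target.
        v_target = position_loop.output(goal, x)
        # Lower level: tracks whatever target it is handed.
        a = velocity_loop.output(v_target, v)
        # Simple Euler integration of the point's motion.
        v += a * dt
        x += v * dt
    return x, v
```

With these made-up gains, `simulate(3.0)` settles close to position 3 with velocity near zero. The velocity loop knows nothing about goals; it could be reused unchanged under a different higher-level system, which is the reusability point above.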
That said, I don't think you actually need what one would naively call 'embodied cognition' in order to get the bottom-up hierarchy of competencies that Brooks is right to point towards. The core feature is a wide array of inputs and outputs, understood in a hierarchical fashion that allows systems to be chained together vertically. I think you could get a functional general intelligence whose only inputs and outputs go through an Ethernet cable, and which has nothing like a traditional body to actuate or sense through. (This is a claim that the hierarchical structure is what matters, not the content of what we use that structure for.)
(The main place to look for more, I think, is actually a book about human cognition, William T. Powers's Behavior: The Control of Perception.)