The term agent can have slightly different meanings, abilities or instantiations depending on the context. However, given the purpose of this website, I will use the definition of agent commonly used in artificial intelligence.
An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
For more details regarding the definition of an agent in AI, see my answer to the question "What is an agent in Artificial Intelligence?".
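To make the definition above more concrete, here is a minimal Python sketch of an agent that perceives its environment through a sensor and acts upon it through an actuator. The names `Agent`, `ThermostatAgent`, `perceive`, `act` and `step` are hypothetical and chosen only for illustration, and the environment is assumed to be a plain dictionary.

```python
from abc import ABC, abstractmethod


class Agent(ABC):
    """Hypothetical interface: an agent perceives and then acts."""

    @abstractmethod
    def perceive(self, environment):
        """Sensor: read a percept from the environment."""

    @abstractmethod
    def act(self, percept, environment):
        """Actuator: change the environment based on the percept."""


class ThermostatAgent(Agent):
    """Toy agent that heats a room until a target temperature is reached."""

    def __init__(self, target):
        self.target = target

    def perceive(self, environment):
        # The "sensor" simply reads the current temperature.
        return environment["temperature"]

    def act(self, percept, environment):
        # The "actuator" raises the temperature by one degree if needed.
        if percept < self.target:
            environment["temperature"] += 1
            return "heat"
        return "idle"


def step(agent, environment):
    """Run one perceive-act cycle and return the chosen action."""
    percept = agent.perceive(environment)
    return agent.act(percept, environment)
```

Calling `step(ThermostatAgent(20), {"temperature": 18})` repeatedly would heat the toy environment until the target is reached, after which the agent stays idle.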
A multi-agent system is a system composed of multiple agents that interact with an environment. See Multi-Agent Systems: A survey (2018) for a more exhaustive overview of the field.
Multimodal interaction (MI) refers to the interaction with a system (e.g. a computer) using multiple modalities (e.g. speech or gestures). For example, we can usually interact with a laptop using a keyboard and a touchpad (or mouse), so the keyboard and the touchpad are two different modalities used to interact with the computer. MI could thus be considered a sub-field of human-computer interaction.
Conceptually, an agent could be associated with each modality provided by a multimodal system, so a system that provides multimodal interaction could indeed be a multi-agent system. See, for example, A Multi-Agent based Multimodal System Adaptive to the User’s Interaction Context (2011).
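As a rough illustration of this idea (and not the architecture of the cited paper), the following Python sketch associates one agent with each modality. All names here (`Event`, `Environment`, `ModalityAgent`, `run`) are hypothetical, and the shared environment is assumed to be just a list of pending input events plus a log of handled ones.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    modality: str  # e.g. "keyboard" or "touchpad"
    payload: str   # e.g. a key or a gesture name


@dataclass
class Environment:
    events: list = field(default_factory=list)  # pending user input
    log: list = field(default_factory=list)     # effects produced by agents


class ModalityAgent:
    """Hypothetical agent responsible for a single input modality."""

    def __init__(self, modality):
        self.modality = modality

    def perceive(self, env):
        # Each agent only senses the events of its own modality.
        return [e for e in env.events if e.modality == self.modality]

    def act(self, percepts, env):
        # Acting here just means recording that the event was handled.
        for event in percepts:
            env.log.append(f"{self.modality}: {event.payload}")


def run(agents, env):
    """Let every agent perceive and act once, then clear handled events."""
    for agent in agents:
        agent.act(agent.perceive(env), env)
    env.events.clear()
```

In this sketch, a keyboard agent and a touchpad agent share the same environment, so together they form a very simple multi-agent system that provides multimodal interaction.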