Where does linguistic structure come from? Recent gesture elicitation studies have indicated that constituent order (corresponding to for instance subject–verb–object, or SVO in English) may be heavily influenced by human cognitive biases constraining gesture production and transmission. Here we explore the alternative hypothesis that syntactic patterns are motivated by multiple environmental and social–interactional constraints that are external to the cognitive domain. In three experiments, we systematically investigate different motivations for structure in the gestural communication of simple transitive events. The first experiment indicates that, if participants communicate about different types of events, manipulation events (e.g. someone throwing a cake) and construction events (e.g. someone baking a cake), they spontaneously and systematically produce different constituent orders, SOV and SVO respectively, thus following the principle of structural iconicity. The second experiment shows that participants’ choice of constituent order is also reliably influenced by social–interactional forces of interactive alignment, that is, the tendency to re-use an interlocutor’s previous choice of constituent order, thus potentially overriding affordances for iconicity. Lastly, the third experiment finds that the relative frequency distribution of referent event types motivates the stabilization and conventionalization of a single constituent order for the communication of different types of events. Together, our results demonstrate that constituent order in emerging gestural communication systems is shaped and stabilized in response to multiple external environmental and social factors: structural iconicity, interactive alignment and distributional frequency.