Disclaimer: This essay was planned and written in collaboration with Claude Sonnet 4 and Gemini 2.5 Pro.
The human voice presents a fundamental puzzle. It is at once a biological phenomenon, governed by the physical laws of acoustics and the specific anatomy of a speaker, and a profoundly social artifact, shaped by culture, learning, and the conscious expression of identity. The way we perceive a voice reflects this duality; we instantly and holistically apprehend not just its linguistic content but a rich tapestry of information about the speaker, such as their gender, social class, and age. Standard acoustic analysis describes the physical properties of a sound wave, but it fails to provide a satisfactory explanation for this immediate and integrated perceptual experience. It describes the ingredients of the signal without revealing the cognitive recipe the brain uses to interpret it. In this essay I’ll argue that the Motor Theory of speech perception offers a compelling mechanism to bridge this explanatory gap. Using gender as a case study, I contend that we perceive the social character of a voice by using our own motor system to actively simulate the articulatory gestures required to produce it. This is a type of naturalised empathy, an automatic relating-to and understanding of another through their voice. This process is actively shaped and primed by the listener’s own social knowledge and expectations based on the immediate context, grounding the complex phenomenon of vocal identity in an embodied, interactive, and deeply relational framework.
To understand the perception of voice, we need to first understand its production. The human vocal apparatus functions as a sophisticated instrument, comprising a source and a resonance structure. The source is the larynx, where the vocal folds vibrate as air passes from the lungs, creating a simple tone. The rate of this vibration determines the sound’s fundamental frequency, which we perceive as pitch. The physical dimensions of the vocal folds, which are typically longer and thicker in the male sex following puberty, directly influence the possible range of fundamental frequencies. This anatomical difference is a primary contributor to the generally lower-pitched voices of men compared to women. Yet pitch is only one dimension of the sound. The basic tone from the larynx then travels through the vocal tract—the pharynx, oral cavity, and nasal cavity—which acts as a modifier. This tract, by its specific shape and size, resonates at certain frequencies, amplifying them while dampening others. These resonant peaks are known as formants, and their specific pattern gives vowels their distinct character and contributes most significantly to a voice’s unique timbre. Like the vocal folds, the vocal tract also shows typical sex-related differences; male vocal tracts are, on average, longer than female tracts. This greater length systematically lowers all formant frequencies, a crucial acoustic marker that allows a listener to distinguish between voices even when they are producing the same pitch. This explains why a male singing in falsetto still sounds distinctly male; while he matches the fundamental frequency of a female singer, the formant structure, dictated by his longer vocal tract, remains different. These physical facts provide the biological baseline, the acoustic potentialities that anatomy makes available. They do not, however, tell the whole story. Just as we must learn to play an instrument, we must also learn to use our own voice.
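The systematic effect of vocal tract length on formant frequencies can be illustrated with a standard textbook simplification: treating the tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd quarter-wavelength multiples. The tract lengths used below (roughly 17.5 cm for a typical adult male tract, 14.5 cm for a typical adult female tract) are round illustrative figures, not measurements, and real vocal tracts are far from uniform tubes.

```python
# Quarter-wavelength resonator: a rough model of the vocal tract as a
# uniform tube closed at one end (glottis) and open at the other (lips).
# Its resonant (formant) frequencies are F_n = (2n - 1) * c / (4 * L).
SPEED_OF_SOUND = 350.0  # m/s, approximate value in warm, humid air


def formants(tract_length_m, n=3):
    """First n formant frequencies (Hz) of a uniform tube of given length."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * tract_length_m)
            for k in range(1, n + 1)]


# Illustrative average tract lengths in metres.
male = formants(0.175)    # ~17.5 cm
female = formants(0.145)  # ~14.5 cm

print([round(f) for f in male])    # -> [500, 1500, 2500]
print([round(f) for f in female])  # every formant is higher
```

Even in this crude model, the shorter tube raises every formant by the same proportional factor, which is exactly the kind of global shift in resonance structure that lets a listener distinguish the falsetto singer from the female singer at the same fundamental frequency.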
The full expression of a gendered voice is enabled by a set of learned skills that transcends anatomical determinism. While the fundamental and formant frequencies provide a physical canvas, the most salient markers of gender in speech are often behaviours learned through social immersion. Intonation patterns—the melodic contours of speech—differ significantly across gendered styles of speaking. The use of pitch range, articulatory precision, speech rate, and qualities like breathiness are not direct consequences of anatomy but are acquired social codes. There is no better illustration of this principle than the practice of voice training for gender affirmation. A transgender woman, for instance, cannot change the fundamental length of her vocal tract, but through dedicated practice, she can learn a new set of motor habits to fundamentally alter the sound she produces. This training focuses on mastering new articulatory gestures: raising the larynx within the throat, shifting the tongue’s position forward in the mouth, and controlling airflow to achieve a “brighter,” more feminine resonance. She learns to speak using a different part of her pitch range and to adopt the intonational patterns culturally associated with femininity. This is a conscious process of overwriting old motor programs and instantiating new ones. The result is a change in the acoustic output, particularly the formant frequencies, which is achieved through action, not anatomical change.
So far we have described the physical nature of the voice: the role of anatomy and action in its production, and its composition as a fundamental pitch and a series of formant frequencies. But what accounts for our perception of a particular formant distribution as “feminine” or “masculine”? Where does this social meaning come from?
Answering this question requires moving beyond a simple analytical model of perception. It is tempting to assume that the brain functions like a scientist, taking in the acoustic signal and decomposing it into a list of its constituent parts—measuring the fundamental frequency, calculating the formant spacing, and checking these values against a mental database. Such a model, however, confuses the description of the stimulus with the mechanism of perception. That it is fruitful for scientific practice to understand the voice in this way by no means implies that this is how the brain processes it. A more powerful explanation is found in the Motor Theory of speech perception. In its modern form, this theory proposes that we perceive speech not by passively receiving sound but by actively engaging the neural pathways associated with the articulatory gestures that produce it. The brain understands a sound by covertly modeling the actions required to make that sound. This is not a conscious or metaphorical process, but a tangible neurological one. Neuroscientific evidence supports this through the Dual-Stream Model of speech perception. After initial processing in the auditory cortex, vocal information is handled by two interconnected pathways. The ventral stream, running along the temporal lobe, is concerned with comprehension, the “what” of the signal, including identifying it as a voice in the first place. The dorsal stream connects these auditory areas to motor regions in the frontal lobe, including the premotor cortex. This pathway provides the neural architecture for a sound-to-action mapping, allowing the brain to determine “how” the sound was made.
This theoretical framework provides the crucial bridge from acoustics to social meaning. When we apply the Motor Theory to the perception of gender, a new picture emerges. A listener perceives a “feminine” voice not by abstractly calculating formant frequencies, but by simulating the physical gesture of raising the larynx and positioning the tongue forward in the oral cavity. They perceive “breathiness” not as a measure of spectral noise, but by simulating the motor act of speaking with incomplete vocal fold closure. The perception is of a unified, holistic gesture, a single complex action whose various acoustic consequences are apprehended as one. This process can be understood as a form of naturalised empathy. We understand the speaker’s expression because, on a sub-personal, neural level, we feel what it is like to perform that action. Our own body and its motor potential become the reference point for understanding another’s expression. This also reframes the concept of vocal prototypes. An individual’s internal “prototype” of a voice is not an abstract list of features but their own baseline set of habitual motor programs for speech. We perceive difference—an accent, a sociolect, or a gendered style—by registering the deviation between the motor program we are currently simulating and our own default way of speaking. The meaning of that difference is derived from a lifetime of social learning, where we have unconsciously associated specific articulatory gestures with specific social identities.
This model, however, requires a final, critical layer of nuance. The listener is not a neutral simulator, faithfully re-enacting the speaker’s gestures in a vacuum. The brain actively shapes perception with expectations drawn from prior knowledge and context, a phenomenon known as top-down processing. A listener uses every available cue—visual, situational, and social—to generate a prediction about the person they are encountering, preparing them to receive the voice in a particular way. This has profound implications for the multimodality of gender perception. The visual appearance of a speaker, for example, “primes” the listener’s motor system. If a listener visually categorises a speaker as male, their brain prepares to simulate the articulatory gestures it associates with male speakers. This top-down expectation actively shapes how the bottom-up acoustic signal is interpreted. This framework acknowledges, and helps to explain, the social complexities of “passing” for many transgender people. If a trans person’s voice—the acoustic data—does not neatly align with the listener’s visually-primed motor simulation, the brain registers a conflict, a perceptual incongruity. The simulation fails to resolve smoothly. The resulting perception may not be of a clear gender category, but of ambiguity, “unnaturalness,” or a voice that seems to possess contradictory features. This means that the exact same acoustic signal can be simulated and perceived differently depending on the listener’s expectations. A voice heard on the telephone, devoid of visual cues, might be simulated and perceived as female, while the very same voice, when paired with a body the listener reads as male, could trigger a different, conflicting simulation. The “success” of a vocal performance is therefore not an objective property of the speaker’s voice alone. It is a dynamic, co-created phenomenon that arises in the context-dependent mind of the active listener.
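The claim that identical acoustics can yield different percepts under different expectations can be sketched as a toy Bayesian model. This is my own illustrative construction, not a model drawn from the essay’s framework or the perception literature, and all the probabilities in it are made-up numbers: the listener combines a prior set by visual context with the likelihood of the acoustic evidence, and the same evidence lands on different sides of the category boundary depending on the prior.

```python
# Toy Bayesian cue combination: the same acoustic evidence is categorised
# differently depending on the prior set by visual context.
# All probabilities are illustrative, not fitted to perceptual data.


def posterior_female(prior_female, lik_female, lik_male):
    """P(female | acoustics) by Bayes' rule over a binary category."""
    numerator = prior_female * lik_female
    denominator = numerator + (1 - prior_female) * lik_male
    return numerator / denominator


# An ambiguous voice: the acoustics only weakly favour "female".
lik_f, lik_m = 0.6, 0.4

phone = posterior_female(0.5, lik_f, lik_m)       # telephone: neutral prior
in_person = posterior_female(0.15, lik_f, lik_m)  # visually read as male

print(round(phone, 2))      # -> 0.6: heard as female
print(round(in_person, 2))  # below 0.5: the same voice now reads as male
```

The point of the sketch is structural, not quantitative: the acoustic likelihoods are held fixed across both calls, so the flip in categorisation is produced entirely by the visually induced prior, mirroring the telephone example above.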
In conclusion, the puzzle of the voice—part biological instrument, part social performance—finds its resolution not in the simple measurements of acoustics, but in the dynamic reality of embodied interaction. The Motor Theory of speech perception provides a powerful framework, revealing that to hear a voice is to, in a sense, enact it. We understand the gendered quality of a sound by covertly simulating the physical gestures that made it, a process of naturalised empathy that uses our own body as the primary tool for understanding another’s. But this is no simple mimicry; our simulation is primed by expectation and filtered through a lifetime of social learning, making the perception of vocal identity a profoundly relational act. The voice, then, is not a fixed signal transmitted from speaker to listener, but a fluid, co-created event.