Machine Listening

Improvisation and Control

Interactive (music) systems

A young, white man is sitting at a white desk. In front of him is a light grey DECstation 5000/200 computer and a black microphone. A simple melody plays. He begins tapping out a rhythm on the desk with his hands. As he modifies his tempo, the system responds. He is improvising. The machine is listening.

Is the man playing the computer - like an instrument? Or is he playing with the computer - as in a duet? Is he even playing at all?

The man appears slightly bored, pretending not to be aware of his own performance, exploring the limited freedom offered to him by the machine, which tirelessly repeats the melody again and again, infinitely. We are watching a breakthrough moment in human-computer interaction: the computer is doing what the man wants. But still, the man can only want what the machine can do.

The fantasy of an easy, natural interface between man and computer is captured in a diagram by Andrey Yershov in 1964 titled the ‘director agent model of interaction’. The man is meant to be in charge. But look at the diagram. We can start anywhere we like. Information cycles around and around, in a constant state of transformation from sound to voltage to colored light to wet synapses. All of these possibilities are contained in the schematic figure of the arrow. Which is the director here? And which the agent? The diagram itself cycled between the pages of different publications, including The Architecture Machine, a 1970 book by Nicholas Negroponte.¹ Negroponte had set up the Architecture Machine Group at MIT in 1967, which eventually led to his creation of the MIT Media Lab.

A state-of-the-art, light grey machine is sitting on a white desk. A camera is pointed at it, focused on it. The camera zooms out to reveal a young, white man. Why are we seeing this moment? Why is the camera there to witness it? Judging by the DECstation, the year is probably 1992 or 1993, the location is definitely the MIT Media Lab, and we are looking through a small window into the “demo or die” culture that Negroponte famously instigated there. Demos could excite the general public and impress important visitors. They could attract corporate and government money. The colossal “machine listening” apparatus that we know today has its roots in thousands of demos like this one.

In the 80s and 90s, the demo was a prefigurative device also familiar to the music world. This man, like many of those he worked with, moved between these worlds. He was an engineer, but also a musician. He had come to MIT, in fact, for a PhD in the “Music and Cognition” group at the Experimental Music Studio, which had been founded by the composer Barry Vercoe in 1973 and was absorbed into the Media Lab from the very start in 1985.

One of the first students to join this group was another musician-engineer named Robert Rowe. Rowe’s doctoral thesis ‘Machine Listening and Composing: Making Sense of Music with Cooperating Real-Time Agents’ seems to be one of the earliest uses of the phrase ‘machine listening’ in print. ‘A primary goal of this thesis,’ Rowe writes, ‘has been to fashion a computer program able to listen to music’.² The term ‘machine listening’ would go on to be taken up widely in computer music circles following the publication of a book based on Rowe’s thesis, in 1993.³

The following year the “Music and Cognition” group rebranded.

“I have been unhappy with ‘music (and) cognition’ for some time. It’s not even supposed to describe our group; it was the name of a larger entity including Barry, Tod, Marvin, Ken and Pattie that was dissolved almost two years ago. But I’ve shied away from the issue for fear of something worse. I like Machine Listening a lot. I’ve also thought about Auditory Processing, and I try to get the second floor to describe my demos as Machine Audition. I’m not sure of the precise shades of connotation of the different words, except I’m pretty confident that having ‘music’ in the title has a big impact on people’s preconceptions, one I’d rather overcome.”⁴

So what began, for Rowe, as a term to describe the so-called ‘analytic layer’ of an ‘interactive music system’³ became the name of a new research group at MIT and something of a catchall to describe diverse forms of emerging computational auditory analysis, increasingly involving big data and machine learning techniques. As the term wound its way through the computer music literature, it also followed researchers at MIT as they left, finding its way into funding applications and the vocabularies of new centers at new institutions.

Here is one such application, by a Professor at Columbia named Dan Ellis. This is the man sitting at the desk and the author of the email we just read. Today he works at Google, for their ‘Sound Understanding Team’. As Stewart Brand once put it, ‘The world of the Media Lab and the media lab of the world are busily shaping each other.’⁵

Google’s ‘Sound Understanding Team’ is responsible, among other things, for AudioSet, a collection of over 2 million ten-second YouTube excerpts totaling some 6 thousand hours of audio, all labelled with a ‘vocabulary of 527 sound event categories’. AudioSet’s purpose is to train Google’s ‘Deep Learning systems’ on the vast and expanding YouTube archive, so that, eventually, they will be able to ‘label hundreds or thousands of different sound events in real-world recordings with a time resolution better than one second – just as human listeners can recognize and relate the sounds they hear’.⁶

AudioSet includes 7,000 examples tagged as ‘Classical music’, nearly 5,000 of ‘jazz’, some 3,000 examples of ‘accordion music’ and another 3,000 files tagged ‘music of Africa’. There are 6,000 videos of ‘exciting music’, and 1,737 that are labelled ‘scary’.

In AudioSet’s ‘ontology’, ‘human sounds’, for instance, is broken down into categories like ‘respiratory sounds’, ‘human group action’, ‘heartbeats’, and, of course, ‘speech’, which can be ‘shouted’, ‘screamed’ or ‘whispered’. AudioSet includes 1,500 examples of ‘crumpling and crinkling’, 127 examples of toothbrushing, 4,000 examples of ‘gunshots’ and 8,500 ‘sirens’.

This is the world of machine listening we inhabit today: distributed across proliferating smart speakers, voice assistants, and other interactive listening systems, it attempts to understand and analyse not just what we say, but how and where we say it, along with the sonic and musical environments we move through and are moved by. Machine listening is not only becoming ubiquitous, but increasingly omnivorous too.

Jessica Feldman’s essay, “The Problem of the Adjective,” describes a further frontier.⁷ Affective listening software tunes in to barely perceptible vocal inflections, which are “uncontrollable, unintended, and habitual” – but for the machine signify the “emotions, intentions, desires, fears… of the speaker - in short, the soul.” Can the machine listen to our soul? Of course not, but what does it hear, what does it do when it tries? And how will we act when confronted with an instrument intent on listening so deeply?

Rainbow Family

A young, black man is sitting at a white desk. In front of him is a bank of Apple II computers, connected to a trio of Yamaha DX-7 synthesisers, and four performers on stage. A woman playing an upright bass starts strumming out a rhythm and the system responds; then the soprano sax, followed by the rest of the ensemble. They are improvising. The machine is listening. The machine is improvising. They are listening.

This is the 1984 premiere of George Lewis’ Rainbow Family at IRCAM, Paris, then - as now - a global center for avant-garde composition, computer music and electro-acoustic research. Robert Rowe spent time there in the 1980s, and Lewis would perform at the debut of Rowe’s interactive music system, Cypher, at MIT in 1988. The concert was called Hyperinstruments.³

Rainbow Family is the first of Lewis’ ‘interactive virtual orchestra’ pieces, in which software designed by Lewis both responds to the sounds of the human performers and operates independently, according to its own internal processes. There is no ‘score’. And Lewis is not, in the language of European ‘art music’, the piece’s ‘composer’. Instead, Rainbow Family comprises ‘multiple parallel streams of music generation, emanating from both the computers and the humans - a non-hierarchical, improvisational, subject-subject model of discourse.’⁸

This is not an accident. It is a deliberate aesthetic, technical and political strategy by Lewis, to produce ‘a kind of computer music-making embodying African-American aesthetics and musical practices’;⁸ a form of human-computer collaboration embodying similar ideals to those of the African American musicians’ collective AACM - the Association for the Advancement of Creative Musicians. The group was founded in Chicago in 1965 and, for Paul Steinbeck, it remains ‘the most significant collective organization in the history of jazz and experimental music.’⁹

Lewis would later call this AACM-inspired aesthetic ‘multidominance’. The idea is developed from an essay by the artist and critic Robert L. Douglas.¹⁰ Lewis writes:

By way of introduction to his theory, Douglas recalls from his art-student days that interviews with “most African-American artists with Eurocentric art training will reveal that they received similar instructions, such as ‘tone down your colors, too many colors’”. Apparently, these “helpful” pedagogical interventions were presented as somehow universal and transcendent, rather than as emanating from a particular culturally or historically situated worldview, or as based in networks of political or social power. Douglas, in observing that “such culturally narrow aesthetic views would have separated us altogether from our rich African heritage if we had accepted them without question,” goes on to compare this aspect of Eurocentric art training to Eurocentric music training, which in his view does not equip its students to hear music with multidominant rhythmic and melodic elements as anything but “noise,” “frenzy” or perhaps “chaos”.⁸

When we listen to Rainbow Family then, Lewis doesn’t want us to hear synchronicity, harmony, or even polyphony. He wants us to hear multidominance: as both an aesthetic and a political value, expressed and encapsulated now partly as code. This is a model of human-computer interaction premised on formal equality, difference, independence, and commonality of purpose; a system that is, Lewis explains,

‘happy to listen to you and dialog with you, or sometimes ignore you, but the conceptual aspect of it is that it’s pretty autonomous. You can’t tell it what to do …. So improvisation becomes a negotiation where you have to work with [the system] rather than just be in control.’¹¹

‘In African American music there is always an instrumentality connected with sounds; you make sounds for pedagogical purposes, to embody history or to tell stories, and so on.’¹²

If Rainbow Family is pedagogy then its lesson is surely that computers, and machine listeners in particular, are part of these stories too; that they already were as early as the 1980s. What is being contested in fact is machine listening’s soul, the aesthetic and political ideals it both expresses and reproduces, years before the term first began to circulate at and around MIT.

What would an alternate machine listening system modeled along Lewis’ lines be like? The becoming-Rainbow-Family of the world? The world as rainbow family, in which the interdependence of man and machine is acknowledged and embraced, but also rooted in a concern for racial justice and traditions of Afrofuturism? How might this idea of multidominance play out at scale, laced through our homes and cities? How would such a world be designed? How compatible would it be with the current regime of pervasive surveillance, data extraction and colonialism, capital accumulation, automation and control? What social relations would a system like this work to facilitate and produce? What would it sound like? And what risks would it too run?

DARPA improv

In this YouTube video, a white woman plays guitar with an Artificial Intelligence. The computer listens and responds. Once again, we find ourselves in a version of Yershov’s diagram. It is almost as if she is having a conversation with the AI. She plays, and it listens and responds. She listens to its response and plays some more. This improvised conversation is like the exchange between scientists and aliens in Close Encounters of the Third Kind. It might be playful, or maybe antagonistic - we don’t and can’t know the meaning even if we could participate in the dialog.

The woman playing the guitar also created the alien AI that she improvises with. Still, she can’t know the meaning of what it says or what she says in response. The fact that she can’t know the meaning and she can’t know what response her playing will provoke is precisely what excites her. She wants to create things that she can’t quite control. She doesn’t play her AI the way she might play a piano or her guitar; she plays with it.

This woman and this AI and this amateurish video are part of a research project called MUSICA, short for Musical Interactive Collaborative Agent. The project is funded by DARPA, the US Defense Advanced Research Projects Agency, which also funded some of the early breakthroughs in automatic speech recognition beginning in the 1970s. MUSICA is part of DARPA’s ‘Communicating with Computers’ program. MUSICA has two parts: one is called “Composition by Conversation”; the other “Jazz and Musical Improvisation”.

So the question is: why? Why is DARPA into Jazz? Why is it so interested in machines that improvise? What does it think it will learn? What does DARPA imagine improvisation will help it do?

The contemporary battlefield is nonlinear. It is often urban. Commanders no longer concentrate their forces along one line or at one point, but disperse them into a 360-degree battlefield. The individual soldier might experience anxiety, even a sense of isolation. They don’t simply follow orders, but communicate and flexibly coordinate with other isolated, anxious soldiers. They improvise.

Soldiers improvise with each other, but also with intelligent machines. Another project by the same research group teaches AIs how to infer the internal states of their human teammates, solve problems collaboratively with them, and communicate with them in a socially-aware manner. The practice stage here is Minecraft, where soldiers-in-training can enter a besieged village and fight a zombie, and are invited to reflect intermittently on their emotional state.

Many actors of seemingly different politics have an ideological and tactical investment in improvisation. Improvisation as freedom. Improvisation as multidominance. Improvisation as counter-hegemony; the opposite of control. Mattin has written about “this supposedly self-inherent critical potential of improvisation,” and points instead to the way “improvisers embody the precarious qualities of contemporary labor.” This is improvisation as corporate agility. Improvisation as zero-hours contracts. Improvisation as moving fast and breaking things.

The ability to go off script, off score, to innovate in the moment with what is at hand, becomes a way both of accumulating capital and surviving in the labor market. Evidently for DARPA, improvisation is also the future of combat, so it is keen to mine the communicative virtuosity of jazz improvisation for whatever secrets it may hold.

In addition to jazz robots, DARPA has two other programs, Improv and Improv2, in which hobbyists (or research labs pretending to be hobbyists) try to create military-grade weapons using readily available software and off-the-shelf technology. Improvisation is both a technique and a generative source of knowledge to extract.

But one thing DARPA’s Improv program manager says reminds us that their improvisational imaginary has real constraints: “DARPA’s in the surprise business and part of our goal is to prevent surprise.” In time, there is no more need for human input in the ensemble. The machine improvises with itself.

Resources

Nicholas Negroponte, The Architecture Machine (1970)
Robert Rowe, Machine Listening and Composing: Making Sense of Music with Cooperating Real-Time Agents, doctoral thesis (MIT Press, 1991)
Robert Rowe, Interactive Music Systems: Machine Listening and Composing (MIT Press, 1993)
Archived email exchange between Dan Ellis and Michael Casey, 28 March 1994. According to Casey, “Dan suggested “Machine Audition”, to which I responded that term “audition” was not widely used outside of hearing sciences and medicine, and that it could be a confusing name for a group that was known for working on music–think “music audition”. I believe we discussed the word hearing, but I–we?–thought it implied passivity as in “hearing aid”, and instead I suggested the name “machine listening” because it had connotations of attention and intelligence, concepts that were of interest to us all at that time. That is what I remember.”
Stewart Brand, The Media Lab: Inventing the Future at MIT (1987)
Gemmeke et al., Audioset: An Ontology and Human-Labelled Dataset for Audio Events
Jessica Feldman, “The Problem of the Adjective”
George Lewis, ‘Too Many Notes’, Leonardo Music Journal (2000), pp. 33–39, 36–37.
Paul Steinbeck, ‘George Lewis’ Voyager’, The Routledge Companion to Jazz Studies (2018), 261–270, 261.
Robert L. Douglas, “Formalizing an African-American Aesthetic,” New Art Examiner (June/Summer 1991) pp. 18–24.
George Lewis, quoted in Jeff Parker, “George Lewis.” BOMB, no. 93 (Fall 2005): 82–88, 85.
George Lewis, quoted in Lawrence Casserley, “Person to … Person?” Resonance Magazine (1997)