Text and Speech


On the Differences between Text and Speech

0.0 MARY: One of the arguments we have been making through this grammar has been that text and speech are very different from each other, so different in fact that we can no longer use the word “language.” In this video we are going to bring these arguments together into one place, because they have been scattered across the other videos and the two volumes of our transpositional grammar.

0.35 MARY: In our rough visual map of forms of meaning, we have put text beside image, and image beside space. Like image, text is laid out in a two-dimensional spatial array. Like image, it excels in carrying meaning across different times and spaces. Its purpose is asynchronous participation in representation, communication, and interpretation. Text also has a certain kind of spatial three-dimensionality. This happens in the old apparatuses of tables of contents, indexes, and library catalogues, and now in the digital apparatus of hypertext.

1.18 MARY: I want to define speech as spontaneous oral and aural participation in meaning. With text, you can look around for resources when lost for words; spontaneous speech must rely on memory. Where text covers distance, speech is immediate. Where text is asynchronous meaning, speech is synchronous participation in representation, communication, and interpretation. Where text is ordered in space, speech, like all sounds, is ordered relentlessly in time. It is not possible to recover from having misspoken without declaring that one has; something just spoken cannot be unspoken. And unlike text, speech is orchestrated or conducted with gesture, and so is closely aligned with embodied presence.

Take what you would think was the easy task of finding the beginning and end of a unit of meaning. You find the beginnings and ends of text in space (for instance, sentences, paragraphs, the covers of books, the front doors or home pages of libraries), but the beginnings and ends of speech you find in time (for instance, an utterance or a conversation). This makes speech more like music, or episodes of embodied action, than text, which is more like framed pictures, or buildings with their spatially defined facades and entry points.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 95-96.

2.51 MARY: These radical differences can be traced back to the human sensorium. In their canonical forms, text can be purely a matter of sight, and speech purely a matter of sound, though of course our meanings are all the more powerful when speech is accompanied by gesture, and text by image.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 12-15.

3.12 MARY: In other words, in their most elementary forms, speech and text could hardly be more different. Not that we don’t perform a lot of transposition between text and speech. These transpositions are in the nature of multimodality itself. Here are just a few examples of this multimodal practice, of the transpositions we make between text and speech. My first example…

  • Reference: Cope, Bill and Mary Kalantzis, 2020, Making Sense: Reference, Agency and Structure in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 11-13.

3.35 MARY: Reading aloud sounds out text, but as Halliday has pointed out, the grammar of text is radically different from the grammar of speech. Spontaneous speech is a succession of clauses, irretrievably trapped in the unforgiving arrow of time, rambling, reiterative, redundant, full of self-corrections, and sometimes even with embarrassing Freudian slips. Reading a text aloud is to speak something that has been written, where the sentence is the most important structural unit, not the clause. What you are hearing from me now, we have written, because for a short video that says a lot, we needed this careful, essentially spatial arrangement of our meaning in order to lay out our thoughts clearly. What you see here on the screen is simply not possible in spontaneous speech.

4.32 MARY: And another example: direct, quoted speech in prose attempts to capture the differences between prose and spoken language – in the sound of dialect, in italicized phonic emphases, in the transliteration of faltering. But even for a master of this transposition, Mark Twain, such multimodal craft is difficult and only ever partial. Novels, plays and film scripts are carefully written, ordered and then re-ordered spatially. They are feigned speech.

  • Reference: Cope, Bill and Mary Kalantzis, 2020, Making Sense: Reference, Agency and Structure in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 30-33.

5.06 MARY: And another example: speaking at a distance over the telephone. Just like the husband in the picture, the friend with whom this housewife is speaking is at a distance. As in text, she has to be explicit about things that are not in the listener’s spatial frame. And the deictics of place are different – if the other person were in the same room, the speaker’s saying “here” would be a shared presence, but on the phone the speaker needs to transpose “here” into “there.” In other words, the telephone has in some respects made speech become like text. It is a multimodal transposition.

So too, giving a video lecture or writing a novel that includes direct speech are both practices of multimodal transposition. But the journey travelled is long, perhaps the longest of all form-to-form transpositions. This is why “literacy” becomes such an important focus of schooling, as well as an enormous challenge for learners.

However, in alphabetic languages, literacy programs often begin with the premise that text is a transliteration of the sounds of speech. Phonics programs that focus on this transliteration are not only simplistic; in some profound respects, they simply miss the mark.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, p. 80.

6.33 MARY: Indeed, the process of reading itself cannot fluidly and meaningfully sound out every letter. So, we learn the look of semantic units, and our eyes jump not letter to letter, but from one meaning unit to another. Each of these successive jumps is called a saccade. As we become fluent, sounding out letters all but disappears, and reading becomes principally a visual process that focuses on clusters of letters as ideographs, not letters as phonemes.

7.06 BILL: Now let me illustrate the laboriously visual process of writing text, with the example of one particularly famous speech. Here, Abraham Lincoln is speaking at Gettysburg, Pennsylvania, several months after a devastatingly brutal battle that proved to be a turning point in the US Civil War. The final, officially published version of his speech was just 271 words long, and took little more than two minutes to deliver.

For so few words, Lincoln wrote numerous drafts, the first in the White House, then more on the train on his way to Gettysburg, then several after the speech for a published version. Five versions of Lincoln’s speech in his own handwriting remain today.

7.49 BILL: Here’s what is considered to be the second version of the remaining five drafts. You can see that there are many crossings out. Writing a speech is a difficult, multimodal task, constrained by the affordances of scribed text as a visual medium. The more you change it, the messier it gets, which means that, at a certain point, you need to write it out again. Writing is a multilinear process – as you rework it, you see connections across the text that require careful revision.

8.24 BILL: Here is what is thought to be the third copy, or at least the third of the copies extant today. It is neatly written with no crossings out. Of the changes made by this version, one is very controversial – the words “under God” have been written into his description of the nation that Lincoln hoped would be reborn after this terrible war.

Lincoln did not speak of his religious views – if he was not an atheist, at the very least his views were unconventional. Did someone suggest he add these words? Did he add them himself? If so, what did they mean? It’s hard to know, but this is not the point we are making here.

I just want to use this example to contrast writing with canonical speech, not a carefully scripted speech that has been written. Canonical speech is in-person and unforgivingly linear. If the speech had been purely extemporaneous, Lincoln might have forgotten God, or he could have added at the end, “Oh, by the way, I forgot to mention God – let’s not forget God.”

9.25 BILL: Because we are interested in the materiality of media, we’ve put a cursor in the text where Lincoln would have added “under God” if he had been writing on a word processor. There’s a big difference with digital text work – the text is less resistant to change. Not five versions, or however many versions of the Gettysburg Address there actually were; now there can be as many versions as there are keystrokes.

The linear processes of speech have become more multilinear than ever, more than ever a matter of spatial design. Text work has become even more like image work, and even less like extemporaneous speech. The distance travelled in the transposition between speech and text has become even larger.

10.05 BILL: And now, updating the speech for the era of digital text, here is a PowerPoint version created by the artificial intelligence researcher Peter Norvig, reworked using the Microsoft PowerPoint AutoContent Wizard for a business presentation. This becomes a strangely hybrid text-speech transposition. Because purely extemporaneous speech is not to be trusted, we support the transposition from text to speech visually with slides. Norvig scripts what might have been Lincoln’s introduction.

“Good morning. Just a second while I get this connection to work. Do I press this button here? Function-F7? No, that's not right. Hmmm. Maybe I'll have to reboot. Hold on a minute. Um, my name is Abe Lincoln and I’m your president. While we’re waiting, I want to thank Judge David Wills, chairman of the committee supervising the dedication of the Gettysburg cemetery. It’s great to be here, Dave, and you and the committee are doing a great job. … Oh - is it ready? OK, here we go.”

11.12 BILL: I have jumped now to the bullet points at the end of Norvig’s version of the speech. It looks like God doesn’t fit into the genre of a PowerPoint business presentation, either. But now we can see the hierarchical structure of the meaning that may be lost just in the hearing. Presentations with PowerPoint are peculiar, multimodal transpositions that cross the ever-greater distance between the forms of speech and text.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 160-61.

11.36 BILL: I want to conclude this video with a number of ways in which speech and text are becoming more different from each other in the era of the digital manufacture of meaning. One we have just seen: text entry on the keyboard of a digital device. Because digital text is far less resistant to change than hand scribing, the text formation process can be more multilinear; it can be conceived in a more systematically hierarchical rather than linear way. In this respect, text is becoming even more a matter of spatial design, and so even more different from speech.

12.10 BILL: Another aspect of the digital era is the practical unspeakability of so many texts—computer code, mathematical formulae, tables, long urls, and serial or product numbers, for instance. By unspeakability, we mean speech in its canonical form, spontaneous oral/aural speech from memory. We may be able to read out some of these texts, even with some difficulty, but this is a practice of multimodal transposition.

12.39 BILL: Then there is the universal scripting system of Unicode, where there are very few phonemes but a growing repertoire of standardized ideographs. Here again, text is becoming more like image than speech.

  • Reference: Cope, Bill and Mary Kalantzis, 2020, Making Sense: Reference, Agency and Structure in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 23-25.
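The point about Unicode can be glimpsed in the standard itself, which assigns every character a code point and a standardized name. A minimal sketch in Python (our illustration, not from the video): an alphabetic letter corresponds roughly to a phoneme, while a CJK ideograph or a pictographic symbol is a standardized visual unit of meaning with no single sound attached.

```python
import unicodedata

# Unicode names make the contrast visible: a phonemic letter versus
# standardized ideographic and pictographic characters.
samples = ["a", "水", "☺"]
names = {ch: unicodedata.name(ch) for ch in samples}
for ch, name in names.items():
    print(f"U+{ord(ch):04X}  {name}")
# "a" is LATIN SMALL LETTER A; "水" is CJK UNIFIED IDEOGRAPH-6C34;
# "☺" is WHITE SMILING FACE.
```

All three are handled identically by the encoding, which is precisely why the repertoire of ideographs can keep growing alongside the alphabets.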

12.52 BILL: And with hypertext, textual architectures are becoming more spatialized than ever. Here is a page of notes written by Ted Nelson, the person who coined this word, where the connections need to be conceived spatially and hierarchically.

These are just some of the ways in which text-speech transpositions are becoming more and more extended in the digital era. The distance that must be travelled is becoming greater.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 173-74.

13.17 BILL: But here is a paradox: in an unprecedented way, digital media facilitate the creation and rendering of text, image, sound, and speech using the same devices. So even though the forms of text and speech are pulling away from each other, for the most practical of reasons related to these new media, they are coming together in new and hybrid forms.

Typed message threads are conversation-like; they are text with an unrecoverable linearity. On the other hand, recorded speech can be edited, affording a multilinearity in its creation that is like text. Video lectures can feign extemporaneous speech. There is a myriad of these ever-changing, multimodal forms of participation in meaning. The task of a multimodal grammar is to trace the processes of transposition that are in each case involved.

  • Reference: Kalantzis, Mary and Bill Cope, 2020, Adding Sense: Context and Interest in a Grammar of Multimodal Meaning, Cambridge UK, Cambridge University Press, pp. 169-70.