Buxton, W. (1995). Speech, Language & Audition. Chapter 8 in R.M. Baecker, J. Grudin, W. Buxton and S. Greenberg (Eds.) (1995). Readings in Human Computer Interaction: Toward the Year 2000. San Francisco: Morgan Kaufmann Publishers.

Chapter 8: Speech, Language & Audition

Introduction

While natural language and the audio channel are the primary means of human-to-human communication, they are little used between human and machine. This is an area in transition, however. In this chapter, we investigate user interface issues associated with both speech and nonspeech audio, and natural language technologies.

When one considers communicating with sound, speech generally comes first to mind. Envisionments of future systems often involve people conversing with their computers in the same way as with a friend. The interaction with the computer Hal in the film 2001: A Space Odyssey is one example. Another is the Knowledge Navigator system envisionment video, by Apple (1992).

As we will see in this chapter, conversing so freely with our computer in the near future is not too likely. Nor, as we shall see, is it necessarily desirable. However, what we will see is that there are some real benefits to be gained in using speech, especially when coupled with other modalities of interaction, such as pointing and gesturing.

The audio channel is not restricted to speech, however. It is suited to a wide variety of audio-based interaction. We see examples of this in everyday activities, such as responding to a ringing telephone or whistling for a taxi. Similarly, in many computer-mediated interactions, nonspeech audio can play an important role. Example applications are process control in factories and flight management systems in aircraft.

If audio is such a powerful and well known mode of communication, then why is it not more commonly used in human-computer interaction? Sometimes it is because audio is simply inappropriate in the given context. For example, we can only be attentive to one stream of spoken instructions at a time. Consequently, in cases where we need to control more than one process simultaneously, speech (alone) is generally not an effective mode of communication. Like all other modes of interaction, audio has strengths and weaknesses that need to be understood. Until this understanding is forthcoming, audio will continue to be neglected. Finally, audio input and output have presented technical and financial difficulties which discouraged their use in most applications. Recently, however, these problems have been greatly reduced, and audio is now a feasible design alternative.

Besides providing a basic review and some pointers to the literature, we have four main objectives in this chapter:

to relate work in natural language to that in speech;
to try to put speech I/O in some realistic perspective;
to show how there is more to the audio channel than speech;
to emphasize that audio-based communication with computers (including speech) is now technologically and financially feasible.

To conclude this introduction, the reader is directed to the list of video examples found following the References/Bibliography. All of these examples are issued on the SIGGRAPH Video Review, and are available from the Association for Computing Machinery. They provide a valuable and important supplement to the chapter contents, for both student and teacher.

Speech and Natural Language Interfaces

Introduction

That people interact differently with each other than they do with computers is clear. Many believe that by changing how we interact with machines to resemble more closely how we interact with people, we will arrive at some kind of HCI panacea. Such a belief underlies much of the hope for speech and natural language technologies.

In the next part of this chapter, we will investigate speech and natural language technologies and their use in interacting with machines. What we will see is that for some applications, the technologies are both practical and useful. However, we will also see that the ability to converse with your computer, as with a friend or colleague, is neither technologically feasible, nor - in many cases - desirable.

One important theme in what follows is that one of the main benefits that comes out of speaking to your computer is to communicate with another person, not the computer. Computer-mediated human-human communication is one of the more valuable speech technologies available. We see this, for example, in applications such as voice mail, and voice annotation of documents. In neither of these applications is speech recognition or synthesis required.

A second theme that emerges from this chapter is that speech is often far more effective when coupled with other modalities of interaction, especially gestures such as pointing and marking. Just as face-to-face conversation is much richer due to the ability to see the accompanying body language, so is it the case when interacting with, or via, computers.

Speech has a number of properties that need to be taken into account by the designer. For example, while it is faster to speak than to write, it is faster to read than to listen to speech. This is good for input, bad for retrieval. Speech is effective in communicating many concepts, but spatial and temporal relationships are often far better articulated with gestures or markings. Speech, and audio in general, is ubiquitous, whereas the visual channel is localized. That is, I have to be looking at something to see it, whereas my telephone can be heard anywhere. If the call is for me, this is an advantage. If not, it is an intrusion.

In the orchestra, there is no single "right" instrument. Composers choose the instruments that will give voice to their ideas depending on their suitability to express that particular idea. Every instrument is best for something and worst for something else. So is it with speech and natural language technologies. They can be immensely powerful when effectively used, and a disaster when not. Sometimes they are best used "solo," and other times in counterpoint or harmony with other interaction techniques. Our hope in the rest of this chapter is to provide the reader with some of the basic background that will enable better decisions in orchestrating the human-computer dialogue.

Stored Speech

One of the most powerful uses of speech is when it is applied to computer-mediated human-human communication, rather than HCI. In this case, there is typically no need for computer recognition, since it is a human, not a machine, for whom the message is intended. Speech is stored in some kind of computer memory, and later retrieved for playback or further processing.

Loosely interpreted, this may be the most common computer application of all, exceeding even word processors and spreadsheets. I say this since virtually all telephone answering machines (including voice mail) fall into this category.

Few people think of the audio cassette of their answering machine as a "file system," or their answering machine as an email server, but that is what they are. It is just that these systems are physically, logically and conceptually remote from our conventional workstations. This need not be so, however.

Likewise, dictating machines can form the model for a new generation of small mobile speech capture and filing devices. This is precisely the premise behind the innovative VoiceNotes system of Stifelman, Arons, Schmandt and Hulteen (1993).

Integrating voice store and forward capability into our systems can take many other forms. In the rest of this section, we will look at a few, and see how they can enhance the value that the technology brings.

Our next example is a voice messaging system developed by IBM (Gould & Boies, 1983, 1984). This system was the subject of the first Case Study in Baecker and Buxton (1987). The design challenge discussed in these papers is the provision of a system for editing, filing, retrieval and distribution of voice messages, using the telephone keypad for control, and the handset for voice I/O. The human factors of supporting this functionality using such limited I/O resources are the key to this study. The user-centred iterative methodology employed is a good example for designs using other technologies as well.

The video of the Olympic Messaging System (Gould, 1985) shows another manifestation of the technology described in those papers. However, in this case, the user interface had to work for users who had no training, came from many different cultures, and spoke a number of different languages. It is a highly recommended case study in design.

While going beyond simple answering machines, the system described by Gould and Boies was still telephone-centric. It was a forerunner to the voice messaging systems which are becoming commonplace today.

Xerox PARC's Etherphone system (Zellweger, Terry & Swinehart, 1988; Vin, Zellweger, Swinehart & Rangan, 1991) is an example of how this basic functionality can be integrated into a workstation environment. The system is demonstrated in video form (Zellweger et al., 1990), and it lays much of the foundation for the workstation as voice server.

One of the main benefits of voice store and forward comes in integrating voice messaging with other document types, such as written text. In so doing, one moves from voice mail to voice annotation. This is a capability pioneered by systems like the Etherphone, and it is starting to find its way into popular PC word processors.

Chalfonte, Fish & Kraut (1991) is an important study in this regard. They compared the relative effectiveness of written vs. voice annotations of written documents. In their study, written annotations appeared on the paper document itself and voice annotations were spoken onto an audio cassette. The main finding was that there was a significant difference in the kinds of annotations for which voice was used compared to writing. Voice was used mainly for global, or high-level comments, whereas written annotations lent themselves more to local, low-level comments.

One conclusion that could be drawn from this is that both kinds of annotations should be supported. There is one caveat to this, however. The lack of any way of easily anchoring voice annotations to specific locations in the document may have prejudiced against using voice for more localized comments. This weakness is an artifact of the technology used in the study. It would be interesting to repeat this study using a system that would let you "attach" voice annotations to specific locations on the document, as well as write them on the manuscript. This, and more, is enabled by our next example, Wang's Freestyle.

Freestyle (Levine & Ehrlich, 1991) enabled users to annotate electronic documents by writing using an electronic stylus, or by voice. Most significantly, one could point, mark and speak simultaneously, and have the result captured for later review. The resulting document, with the dynamics of voice and gesture intact, could then be returned or distributed using electronic mail. The elegance of this system is its minimalist design: there is no use of voice or character recognition. Its effectiveness comes through matching the media to the task. In addition to static marks and conventional voice messaging, it recognized the importance of synchronized voice and gesture in annotation.

Freestyle is the subject of Case Study C, in this volume. The reference given above, Levine and Ehrlich (1991), appears as a reading in that case study. In addition, a comprehensive video demonstration of the system is provided in Francik (1989).

One of the problems encountered with speech files, however, is searching and retrieving information. One can visually skim written documents, and search them by content. How one would skim spoken messages is less clear. However, Arons (1992, 1993) has laid the groundwork for developing systems that provide this capability to some degree. Likewise, there is some hope for being able to perform content-based search and retrieval from within speech files, without full speech recognition.

Current research in word spotting (Wilcox, Smith & Bush, 1992) may well permit users to speak a "key" and have the file searched for an occurrence. Certainly there is some kind of recognition going on with such systems, but it is not speech recognition as discussed later in this chapter. Current systems still have limitations. They are not speaker independent, for example (that is, the key must be uttered by the same speaker who recorded the file being searched). A demonstration of the recent state of the art of such systems is presented in the video by Wilcox, Smith and Bush (1992).

Speech Synthesis

Speech synthesis is perhaps the speech technology most available to the designer and to the user. It is standard equipment on a number of personal computers.

As most commonly found, this technology takes ASCII text (such as a document from a word processor) as input, and outputs speech. It typically does so by following a set of production rules, without dictionary look-up or context checking. (Therefore, it cannot determine whether "wind" should be pronounced as in "I will wind my watch" or as in "The wind is blowing.")

So-called text-to-speech technology is efficient and rather effective. Most often, the speech is synthesized. This can be done in software or hardware, and is relatively economical. In a few cases, however, such as the directory information systems used by most North American telephone companies, recordings of a real human voice are used. In the directory assistance example, the digits are spoken, stored, and then recalled for playback. The difficulty is in achieving smooth speech by splicing together numbers recorded in isolation. This is achieved by recording several versions of each digit, and then choosing the one that will result in the most natural transition with the digits preceding and following it. The justification of such specialized systems must generally be based on the need for high quality natural sounding speech, not cost, since they are generally expensive and difficult to implement.
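
The selection among several recorded versions of each digit can be pictured as a small shortest-path problem. The following is only a sketch under stated assumptions: each recording is reduced to a hypothetical pair (starting pitch, ending pitch), and the "unnaturalness" of a splice is assumed to be the pitch jump across it. Real systems would use richer acoustic measures; the names join_cost and choose_versions are illustrative, not taken from any deployed system.

```python
from typing import Dict, List, Tuple

# Hypothetical description of one recording: (start_pitch_hz, end_pitch_hz).
Recording = Tuple[float, float]


def join_cost(prev: Recording, nxt: Recording) -> float:
    """Assumed measure of how unnatural a splice sounds: the pitch jump."""
    return abs(prev[1] - nxt[0])


def choose_versions(digits: str, bank: Dict[str, List[Recording]]) -> List[int]:
    """Pick one recorded version per digit so the total join cost is minimal."""
    # best[v] = (cost of the cheapest sequence ending in version v, that sequence)
    best = [(0.0, [v]) for v in range(len(bank[digits[0]]))]
    for pos in range(1, len(digits)):
        prev_recs = bank[digits[pos - 1]]
        new_best = []
        for v, rec in enumerate(bank[digits[pos]]):
            # Extend whichever earlier sequence splices most smoothly onto rec.
            cost, path = min(
                (c + join_cost(prev_recs[p[-1]], rec), p) for c, p in best
            )
            new_best.append((cost, path + [v]))
        best = new_best
    return min(best)[1]


# Made-up recordings: the digit "2" exists in three versions.
bank = {
    "4": [(100, 110)],
    "2": [(150, 140), (112, 118), (90, 95)],
    "7": [(120, 100)],
}
chosen = choose_versions("427", bank)
```

Here the middle digit is chosen by looking at both neighbours, which is the point made above: the best version of a digit depends on what precedes and follows it.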

The reading by O'Malley (1990) gives a good overview of the current state of the art of text-to-speech technologies.

Text-to-speech has been used in a number of applications. Schmandt (1984) developed a system called mailtalk, which enabled users to retrieve their email by having it "read" to them over the phone, rather than logging in to a system via computer.

Text-to-speech can also be used to deliver stored system messages and on-line help. Of course, this could also often be implemented using stored voice messages. However, since a sentence written in ASCII is much more compact than a recorded sentence (even with modern compression), the text-to-speech solution is far more economical (assuming reasonable quality). Perhaps even more to the point, if the messages are synthesized, they can be composed (as distinct from synthesized) on the fly. Hence, the system can be far more flexible and adaptable than one made up of fixed prerecorded messages.
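
A rough calculation makes the difference in storage concrete. The figures below are assumptions for illustration only: telephone-quality audio sampled at 8,000 samples per second with one byte per sample, a four-second utterance, and an 8:1 compression ratio.

```python
def text_bytes(sentence: str) -> int:
    """Storage for a message kept as ASCII text: one byte per character."""
    return len(sentence.encode("ascii"))


def speech_bytes(
    seconds: float,
    sample_rate_hz: int = 8000,      # assumed telephone-quality sampling rate
    bytes_per_sample: int = 1,       # assumed 8-bit samples
    compression_ratio: float = 1.0,  # 1.0 means uncompressed
) -> int:
    """Storage for the same message kept as recorded speech."""
    return int(seconds * sample_rate_hz * bytes_per_sample / compression_ratio)


message = "The file could not be saved because the disk is full."
as_text = text_bytes(message)
as_speech = speech_bytes(4.0)                               # assumed 4 s to say it
as_compressed = speech_bytes(4.0, compression_ratio=8.0)    # assumed 8:1 codec
print(as_text, as_speech, as_compressed)
```

Under these assumptions the recording takes 32,000 bytes uncompressed and 4,000 bytes compressed, while the text takes one byte per character, a few dozen bytes in all.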

The quality of synthetic speech is variable. All too typically, at the low end, the speech output sounds like your prototypical robot, speaking in a near monotone. Paying attention to speech quality is important. Moody, Joost & Rodman (1987) discuss the effects of various types of speech output on speaker recognition. Pisoni, Nusbaum and Greene (1985) show that the demands on the voice quality vary with application. Some of the human factors of voice quality in text-to-speech systems are discussed in Thomas and Rosson (1984). That the actual voice quality is an important factor in how systems are accepted by users is shown in Rosson and Cecala (1986).

In addition to the reading by O'Malley, Kaplan and Lerner (1985) give a good overview of speech synthesis systems, the applicable technologies, and their use in applications. Lee and Lochovsky (1983) provide another, more expanded, overview. Klatt (1987) provides a good review of text-to-speech for English.

Speaker Recognition

While not in the mainstream of speech recognition, the identification of individuals from their speech is an active area of research that has relevance in the field of human-computer interaction. There are two aspects of interest:

speaker verification: test against single template, verify it is X

speaker recognition: test against a set of n possibilities (including "no match") to identify speaker

The former has relevance in applications such as security. An example would be using a spoken utterance as one's password (more accurately "passvoice") to log on to a system, or enter a secured area.
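
The difference between the two tasks can be sketched in a few lines. This is a minimal illustration, assuming that each speaker is represented by a stored template of numeric voice features and that closeness is measured by Euclidean distance against a fixed threshold. The feature values, the threshold and the function names are all hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(sample: Sequence[float], claimed: Sequence[float], threshold: float) -> bool:
    """Speaker verification: test against the single template of the claimed identity."""
    return distance(sample, claimed) <= threshold


def identify(
    sample: Sequence[float],
    templates: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Speaker recognition: test against n templates; None means "no match"."""
    if not templates:
        return None
    name = min(templates, key=lambda n: distance(sample, templates[n]))
    return name if distance(sample, templates[name]) <= threshold else None


# Made-up two-dimensional "voice features" for two enrolled speakers.
templates = {"jane": [1.0, 2.0], "john": [5.0, 5.0]}
utterance = [1.1, 2.1]
```

With these values, verify(utterance, templates["jane"], 0.5) accepts the claim, identify(utterance, templates, 0.5) returns "jane", and an utterance far from every template yields None.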

Speaker recognition also has several potential applications. One example would be to keep track of who is speaking at any given time during a videotaped meeting. Supplementing the video with this information would support queries such as, "Show me the segment when Jane started to speak after John, about 5 minutes into the meeting." Such techniques have been used as tools in studying collaborative work (Kurtenbach & Buxton, 1994). An important observation with this application is that it has utility even if the content of the speech is not recognized.
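
A query of this kind needs only a log of who spoke when, not what was said. The sketch below assumes such a log has already been produced by a speaker recognizer; the Turn record and the find_turn_after function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Optional


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds from the start of the meeting
    end: float


def find_turn_after(
    turns: List[Turn], speaker: str, previous: str, near: float
) -> Optional[Turn]:
    """Find the turn by `speaker` that directly follows one by `previous`
    and starts closest to `near` seconds into the meeting."""
    ordered = sorted(turns, key=lambda t: t.start)
    candidates = [
        cur
        for prev, cur in zip(ordered, ordered[1:])
        if prev.speaker == previous and cur.speaker == speaker
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: abs(t.start - near))


# Made-up log of speaker turns.
log = [
    Turn("John", 0, 60),
    Turn("Jane", 60, 100),
    Turn("Ann", 100, 280),
    Turn("John", 280, 310),
    Turn("Jane", 310, 400),
]
segment = find_turn_after(log, "Jane", "John", near=5 * 60)
```

The returned segment gives the start and end times needed to cue the videotape.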

The reading by Peacocke and Graf (1990) contains some discussion of speaker recognition and verification techniques. For further information, see also Doddington (1985), which gives a survey of the field, and has a good bibliography.

Speech Recognition

Speech recognition systems do just what their name suggests: they recognize spoken words. Speech recognition is often confused with natural language understanding. Speech recognition detects words from speech. However, the recognition system has no idea what those words mean. It only knows that they are words and what words they are. To be of any use, these words must be passed on to higher level software for syntactic and semantic analysis. If the spoken words happen to be in the form of natural language (for example English), then they may be passed on to a natural language understanding system (although, as we will see in our discussion of Natural Language, this is generally not the case).

It is important to understand that the "words" that make up the vocabulary that can be recognized by such systems need not be words in the normal sense. Rather, such systems typically work by matching the acoustic pattern of an acoustic signal with the features of a stored template. Within the confines of the system's resolution, these signals could be words, short phrases, whistles, or any other discernible acoustic signal or utterance. (Of course, since they are intended for speech, the pattern matching heuristics are optimized to work on the characteristic features of speech signals.)
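
One classic way to match an incoming signal against stored templates is dynamic time warping, which tolerates an utterance being spoken faster or slower than the template. The chapter does not name this technique; it is used here only as a familiar example of template matching. The sketch assumes each signal has been reduced to a short sequence of numbers (one feature per time frame), and the labels and values are made up.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b frame repeated
                cost[i][j - 1],      # b advances, a frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(signal: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the stored template closest to the signal."""
    return min(templates, key=lambda label: dtw_distance(signal, templates[label]))


# Made-up templates: the vocabulary need not consist of words.
templates = {"yes": [1, 3, 5, 3, 1], "whistle": [9, 9, 9, 9]}
slow_yes = [1, 1, 3, 5, 5, 3, 1]  # the same contour, spoken more slowly
label = recognize(slow_yes, templates)
```

Note that the recognizer returns only a label. It has no notion of what "yes" means, and a whistle is matched in exactly the same way as a word.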

Speech recognition systems are not yet in widespread use, although workstations and personal computers are now being delivered with built-in recognizers. Such systems are very relevant to users who are physically disabled (such as quadriplegics), or whose eyes or hands are unavailable, either because they are occupied in performing some other task (such as driving a vehicle) or because of special clothing or working conditions. (Examples would be a surgeon who cannot touch a keyboard for fear of contamination, or a mechanic whose hands are too greasy to use a keyboard effectively.)

Speech recognition is significantly more difficult than synthesis. Systems vary along a number of dimensions:

Speaker dependent vs independent: Does the system have to be trained separately for each different user? At this point, nearly all available systems are speaker dependent and require training.
Size of vocabulary: How many words can the system recognize? Low end systems recognize tens of words while state-of-the-art high-end systems can recognize on the order of 50,000. Increasing the vocabulary can expand the utility of a system. However, it also raises the capital cost and the cost that need be invested in training the system. With some systems, the cost of this training increases with vocabulary size. With others, however, training time is a constant, independent of vocabulary size. The system developed for IBM by Jelinik (1985), is an example.
Isolated word vs continuous speech recognition: One of the hardest problems in speech recognition is determining when one word ends and the next one begins. In order to side-step the problem, most systems force the user to issue single word-at-a-time commands. Typically, words must be separated by a gap on the order of 300 milliseconds. Since this is unnatural, speech recognition systems that require multi-word commands may require special training on the part of the users. One perspective on this is presented in Biermann et al. (1985).
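
The reason a pause between words side-steps the segmentation problem can be seen in a small sketch. It assumes the signal has already been reduced to one energy value per 10-millisecond frame, and that a frame counts as silence when its energy falls at or below a threshold. The frame length, the threshold and the function name are illustrative assumptions; only the 300-millisecond gap comes from the text.

```python
from typing import List, Optional, Tuple


def split_isolated_words(
    energies: List[float],
    frame_ms: int = 10,               # assumed frame length
    silence_threshold: float = 0.1,   # assumed energy level treated as silence
    min_gap_ms: int = 300,            # pause required between words
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) for each isolated word."""
    min_gap_frames = min_gap_ms // frame_ms
    words: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first voiced frame of the current word
    last_voiced = -1             # most recent voiced frame seen
    for i, energy in enumerate(energies):
        if energy <= silence_threshold:
            continue
        # A long enough run of silent frames closes the previous word.
        if start is not None and i - last_voiced - 1 >= min_gap_frames:
            words.append((start, last_voiced + 1))
            start = None
        if start is None:
            start = i
        last_voiced = i
    if start is not None:
        words.append((start, last_voiced + 1))
    return words


# Made-up signal: a word, a 300 ms pause, then a word containing a 100 ms dip.
signal = [0.0] * 5 + [1.0] * 20 + [0.0] * 30 + [1.0] * 10 + [0.0] * 10 + [1.0] * 5
boundaries = split_isolated_words(signal)
```

The 100-millisecond dip is too short to count as a word boundary, so the last two bursts are treated as one word. In continuous speech there are often no such pauses at all, which is why this simple approach does not carry over.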

The reading by Peacocke and Graf (1990) provides a good introduction to the field. See also Das and Nadas (1992) and Levinson and Liberman (1981). The overview of the panel session on voice-based communication which took place at CHI+GI '87 (Aucella et al., 1987) presents a fairly broad perspective on the topic in a relatively small amount of space. The comments of Robin Kinkead, for example, speak to the issues that must be taken into account when designing such systems:

Continuous versus discrete speech is a non-issue, and yet it is the most hotly debated, claimed and counter-claimed and misunderstood variable in speech recognition. Before the advent of functional large vocabulary systems, it was frequently cited as the single largest barrier to the acceptance of speech recognition by the public. The real issues of accuracy, repeatability of performance, vocabulary size, cost, flexibility, amount of training needed, ease of modification, location of microphone, complete voice control over the system and performance in variable conditions were frequently dismissed in favor of debate over what has been shown (in studies, in use and in demonstrations) to be a minor point: whether one must speak with pauses between words or not. (Kinkead, in Aucella et al., 1987, p. 42)

One of the best single sources for an overview of speech recognition is the book of readings by Waibel and Lee (1990). There are a number of interesting articles on speech recognition in a special issue of IEEE Computer (Kamel, 1990), and in the Proceedings of the IEEE (November, 1985). In the latter, for example, Jelinik (1985) describes a recognition system developed by IBM that has a 5000 word vocabulary. In the same issue, Levinson (1985) addresses structural issues affecting speech recognition, and Zue (1985) discusses some knowledge-based techniques.

While we are waiting for speech recognition to come of age, there is some debate as to how useful it will actually be. There are a number of empirical studies of speech recognition systems. Karl, Pettey and Shneiderman (1993) did a comparison of speech versus mouse specification of commands to a word processor. Subjects reacted favourably to the speech interface. Overall, it was 18% faster at task performance than using the mouse. However, users had real concerns with problems of accuracy, response time, and the inadequacy of the feedback to their actions. (This is a theme repeated in a number of other studies.)

Perhaps the most interesting result of the Karl et al. study was one which was unexpected by the authors. What they found was that the use of speech to issue commands interfered with short-term memory tasks which constituted part of the experimental task.

During the experiment, subjects had to build a table of symbols. For each, subjects would select and copy the symbol from a list, memorize its description, page down to the table, paste in the symbol, then type its description from memory. Using voice for the cut and paste commands interfered with the task of remembering the description.

In retrospect, this result should not be a surprise, since it is consistent with earlier findings in the human factors literature. Wickens, Mountford and Schreiner (1981) showed how different modalities of interaction were appropriate for different classes of messages. The likely reason that speech interfered with the memory task, while using the mouse did not, is that speech and linguistic memory compete for the same cognitive resources. The mouse-issued commands use a different, non-competitive channel.

The lesson for user interface designers is that one has to pay careful attention to the full range of tasks to be performed, and what resources (motor, sensory and cognitive) each consumes. A key rationale for multimodal interfaces is to enable tasks to be distributed over noncompetitive channels. Clearly, in the experiment by Karl et al., this was not the case.

Other studies of potential interest are those by Rudnicky, Sakamoto and Polifroni (1990), which studied voice interaction in a spreadsheet task, Franzke, Marx, Roberts and Engelbeck (1993), Martin (1989) and Allen (1983).

Finally, there are a number of important earlier studies by Gould and his associates: Gould (1982), Gould and Alfaro (1984) and Gould, Conti and Hovanyecz (1983). This last study investigates the potential of composing letters with a listening typewriter. Its results suggest that large-vocabulary isolated-word recognition may provide the basis for a useful listening typewriter.

Perhaps as interesting to the student of human-computer interaction is the "Wizard of Oz" methodology that this study used. Since a practical listening typewriter was not available, the authors simulated one by having the subjects' speech transcribed by a human, and echoed back to the subject on a CRT. The study stands as a good example of using limited resources to test the validity of an idea before making a heavy investment in its development. Newell, Arnott, Dye and Cairns (1991) present an improved environment for performing this class of study. The reader is also directed to Hauptman (1989), which uses a similar technique to investigate voice in combination with gesture in working with 3D graphical objects.

Natural Language Recognition

Natural language understanding and generation would appear to be primarily an HCI issue. When we examine the literature, however, this turns out to be only partially true. The applications being studied are not mainly interactive human-machine dialogues. Rather, they are mostly concerned with three classes of application:

automatic translation
database query
information retrieval

Obviously all have a human interface component. However, natural language human-human dialogues do not generally consist of grammatically correct complete sentences. They involve interruption, misunderstanding detection and conversational repair mechanisms. These are issues that are peripheral to the applications mentioned, yet should be a major part of any truly natural language interface to interactive applications. Natural language understanding and recognition, as represented in the literature, fall much more within the domain of Artificial Intelligence than HCI. (See Peria & Grosz, 1993, for a summary of recent work, published as a special volume of the journal Artificial Intelligence, on Natural Language Processing.)

Natural language systems are distinct from speech-based systems in that the input and output are ASCII text rather than speech. Readers might naturally assume that speech recognition and synthesis systems have a natural language system underneath them. (This is especially true for those familiar with computer language compiler design.) However, as the reading by White (1990) points out, this is not normally the case. Speech understanding systems more typically operate by phrase or pattern matching, rather than by any deep underlying model of natural language. White discusses how speech and natural language technologies might be combined in order to improve the performance of future systems. Central to his approach is a technique known as Hidden Markov Models (HMM). This is a topic which is expanded upon in the reading by Peacocke and Graf (1990), and should be studied by anyone interested in recognition technologies.

One of the key points emphasized by White is the importance of knowledge from various sources in contributing to the meaning of an utterance. These include:

Syntactic: having to do with grammar, or structure;
Prosodic: having to do with inflection, stress, and other aspects of articulation;
Pragmatic: having to do with the situated context within which the utterance takes place - issues such as location, time, cultural practices, speakers, hearers, surroundings, etc.;
Semantic: having to do with the meaning of words.

Each has an important role to play, and consequently, a true natural language understanding system must have a uniform method for representing and communicating knowledge from each of these sources.

Of these knowledge sources, it is pragmatics which is currently the hardest for systems to collect from the environment automatically. This is an area where the technologies of ubiquitous computing (discussed in the reading by Weiser in Chapter 14) may help in the future.

How "Natural" are current natural language systems? Ogden and Sorknes (1987), for example, investigated the performance of a commercially available natural language interface for database queries, used by users with no formal query training. In their study, only 28% of their first questions resulted in a correct response. Perhaps even more significant, 16% of the problems were thought to have been answered correctly, but were not. The authors make the point that to operate any system, users must learn two things:

what the system is capable of
the language that enables them to invoke those capabilities.

What they draw from this is that with conventional interfaces, one learns the capabilities as one learns the language. However, with a natural language understanding system, this is less so, since the language is already "known." Despite this, one still has to learn the system's capabilities and limitations. So, even though the language may be known, there is still a user interface issue for which training must be provided.

As mentioned earlier in this section, information retrieval is one application that has received a lot of attention in the natural language literature. While this is not an area that we have the time to focus on, the reader is directed to the proceedings of a series of DARPA sponsored conferences on "message understanding" (MUC, 1991, 1993). These report on work directed at a very specific task: to read a large body of wire service-like electronic messages in English, identify those that deal with a specific topic (Latin American terrorism), and provide a standard form summarizing key information for each. While not dealing with foreground interaction, per se, this aspect of natural language technology will have increasing relevance to users as distributed on-line information systems become more widely available.

Natural Language Generation

Frequently, messages from computers have come in the form of "natural language" phrases or sentences describing errors, options, or status, for example. What is important to recognize is that such messages are mostly precomposed and stored, either as speech or text.

While the language used is "natural," these do not count as "natural language" systems in the sense that the term is usually intended. There is no natural language system underlying such messages. Hence, there is no way that new messages can be automatically composed to accommodate new circumstances. If the message is not in the original repertoire, it cannot be uttered.

This is clearly a problem from at least two perspectives. First, there is the problem of completeness. What if all circumstances are not anticipated? Second, there is the potential that the stored set of all possible messages may be much larger than a language processor that could generate the same messages on the fly.
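
The second point can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a natural language generator: it uses a single fill-in pattern with made-up names, days and hours, which falls far short of the knowledge-based generation discussed in this section. It serves only to show how quickly a stored repertoire grows compared with the procedure that could produce it.

```python
from itertools import product

# Hypothetical domain for a calendar program.
PEOPLE = ["James", "Jane", "John"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
HOURS = range(8, 18)  # appointments starting on the hour, 8:00 through 17:00


def compose(person: str, day: str, hour: int) -> str:
    """Compose one message on demand from the pieces of the situation."""
    return f"You are meeting {person} on {day} at {hour}:00."


# A system restricted to precomposed messages must store every combination.
repertoire = [compose(p, d, h) for p, d, h in product(PEOPLE, DAYS, HOURS)]
stored_characters = sum(len(message) for message in repertoire)
print(len(repertoire), stored_characters)
```

Three people, seven days and ten hours already require 210 stored messages, and each new person, day or hour multiplies the total, whereas the composing function stays the same size. Adding a fourth person also exposes the completeness problem: the stored repertoire cannot utter a message about someone it was not built for.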

The generation of messages on the fly is the heart of natural language generation in interactive systems. This is especially true when the output goes beyond the provision of system messages and begins to include more "conversational" interfaces. Imagine, for example, a calendar management program that used natural language input and output.

Consider generating responses to queries such as, "What am I doing next Thursday?", or "When can I meet James for an hour next week?". Notice that these are not questions for which there are stock answers that can be prerecorded, or which follow simple templates. As with natural language understanding, generating appropriate responses requires access to deeper knowledge sources about the domain and context.

The literatures on natural language generation and understanding largely overlap. As with recognition, this literature is mainly rooted in artificial intelligence. See the references already cited. In addition, Fedder (1990) presents a brief coverage of the importance of the mappings between domain and language concepts in natural language generation.

Speech, Gesture and Multimodal Interaction

So-called "Natural Language" systems are not always the most "natural" way to interact (Buxton, 1990; Lee & Zeevat, 1990). Often, gestures can convey the intent of an utterance much more naturally than speech or written language. Consider expressing concepts of time or place. Pointing and selecting one object from among many similar ones is often more "natural" than trying to identify it by describing it verbally. On the other hand, it is often much easier to say "All red triangles" than to point and select every red triangle from among a large set of objects. Both demonstrative and descriptive techniques are valuable, and the "natural" language for each is different. Perhaps the greatest benefit comes when both speech and gesture can be used together.

Wang Freestyle, discussed earlier (and in Case Study C), is a wonderful example of the power of combining speech and gesture, even when there is no recognition of either by the machine.

The pioneering example of the use of speech and gesture, and still one of the most compelling examples, is Put That There (Bolt, 1984). The video demonstration of this system (Schmandt, et al., 1987) is highly recommended. Put That There was a system that enabled users to interact with objects on a wall-sized map using voice input combined with pointing. In addition, the system provided feedback both graphically (on the map) and through speech, using text-to-speech technology. Some of the benefits of the system can be understood from the name. For example, the pronoun "that" and adverb "there" can only be understood (in the absence of additional context) when accompanied by gesture. Without the gesture, the vocabulary of the speech recognizer would have to include the proper nouns for all selectable objects, and the simple adverb "there" would have to be replaced by a much more complicated clause. Hence, by taking the multimodal approach that they did, the dialogue is more succinct for the user, and much less complicated for the system.
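
The way pointing resolves "that" and "there" can be sketched in a few lines. This is a minimal illustration only, not the implementation of Put That There: the timestamped word list, the PointingEvent record and the nearest-in-time pairing rule are all assumptions made here for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class PointingEvent:
    time: float   # seconds since the start of the utterance
    target: str   # what the pointer was over (an object or a map location)


# Words that cannot be interpreted without an accompanying gesture.
DEICTIC_WORDS = {"that", "there"}


def resolve_deictics(words, pointing):
    """Replace each deictic word with the target pointed at closest in time.

    words:    list of (word, time) pairs from a hypothetical speech recognizer.
    pointing: list of PointingEvent from a hypothetical pointing tracker.
    Deictic words are left unchanged if no pointing events are available.
    """
    resolved = []
    for word, t in words:
        if word.lower() in DEICTIC_WORDS and pointing:
            nearest = min(pointing, key=lambda p: abs(p.time - t))
            resolved.append(nearest.target)
        else:
            resolved.append(word)
    return resolved


utterance = [("put", 0.0), ("that", 0.4), ("there", 1.2)]
gestures = [
    PointingEvent(0.5, "<blue freighter>"),
    PointingEvent(1.3, "<map cell (12, 40)>"),
]
print(resolve_deictics(utterance, gestures))
```

Note that the recognizer's vocabulary here needs only "put," "that" and "there"; the names of the objects and places come entirely from the gesture channel.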

Others have used voice and pointing, following the path laid by Put That There. Salisbury et al. (1990), for example, describe a UI for the AWACS defense system that uses speech recognition and speech output in combination with a mouse to point at targets.

Hauptmann (1989) ran a Wizard of Oz study of people performing seven different operations on a wireframe cube seen on a display. Three different modalities were tested: voice alone, gesture alone, and voice and gesture together. More than twice as many subjects preferred to use voice and gesture together as preferred either voice alone or gesture alone. As well, it was seen that subjects used a rather limited vocabulary, indicating that within restricted domains, small vocabulary recognition systems may be sufficient.

Weimer and Ganapathy (1992) describe a prototype system that starts to integrate ideas discussed by Hauptmann into a working system. They describe a system in which users interact with a 3D world seen through stereo glasses by using speech and hand gesture as input. In this Virtual Reality type system, hand gesturing implements a virtual control panel.

Koons, Sparrell and Thorisson (1993) present a prototype system for interacting with a map database using speech, hand gesture and eye tracking. Thorisson, Koons and Bolt (1992) provide a video demonstration of parts of this system. The Koons et al. paper presents an excellent discussion of methods and issues that arise in interpreting simultaneous multi-modal inputs. A key problem discussed is how to build up a single meaning from input from various sources.

A valuable part of the discussion concerns going beyond the use of simple pointing gestures, such as seen in Bolt (1983). Using the taxonomy of gestures presented by Rimé and Schiaratura (1991), they discuss how systems could interpret the following:

Symbolic gestures: These are gestures that, within the culture, have come to have a single meaning. The "OK" gesture is one such example. However, recognizing American Sign Language would also fall into this category.
Deictic gestures: These are the types of gestures most generally seen in HCI. These are the gestures of pointing, or otherwise directing the listener's attention to specific events or objects in the environment.
Iconic gestures: As the name suggests, these gestures are used to convey information about the size, shape or orientation of the object of discourse.
Pantomimic gestures: These are gestures typically used in showing the use or movement of some invisible tool or object in the speaker's hand.

A key observation made by the authors is that only the first class, symbolic gestures, can be interpreted alone, without further context. Either this context has to be provided sequentially by another gesture or action (as is the case in DM systems, or GUI's), or, by another channel such as voice, working in concert with the gesture. The former is typical of current methods of HCI. The latter is far more reflective of communication in the everyday world.

Applications

There are a number of papers addressing the issue of integrating speech-based interaction into more conventional GUI's and window managers. Some of this work, such as Mynatt and Edwards (1992), uses both speech and nonspeech audio. Their main concern is providing access to users who have various disabilities. Schmandt, Hindus, Ackerman and Manandhar (1990) and Schmandt, Ackerman and Hindus (1990) both discuss issues extending the X windowing environment to support voice interactions. Kamel, Emami and Eckert (1990) discuss an architecture that would enable a workstation to support a range of applications such as a telephone manager, screen telephone, answering machine, calendar program and voice editor. Their concern is with the tool kits that bind the applications with the various hardware components required.

For fairly obvious reasons, a lot of work on voice-based interfaces and applications has centred around telephony. While the telephone has traditionally been perceived as distinct from the computer, today the distinction is becoming increasingly blurred. Voice is now being used to access services over the phone (Lennig, 1990). Nakatsu (1990) describes a system for delivering banking services via the telephone. This system uses both speech synthesis and recognition. Wattenbarger, Garbeg, Halpern and Lively (1993) provide a good discussion of the human-factors issues involved in telephone-based speech systems.

Lawrence and Stewart (1990) and Sola and Shepard (1990) discuss telephone attendant support and dialing services, respectively. Brennan, et al. (1991) summarizes a panel in which a number of people involved in telephony applications discuss their experience with voice-based applications.

It may well be, however, that the applications that will have the most impact will be ones that are not centred around "familiar" technologies such as the workstation or the telephone. With emerging miniaturization and wireless technologies, a new range of information/communication devices is emerging. In the coming world of "ubiquitous computing" (see the reading by Weiser in Chapter 14), we will likely increasingly see the emergence of small "intimate" computing devices that have a strong speech component. The VoiceNotes hand-held voice note taker (Stifelman, Arons, Schmandt & Hulteen, 1993) is a good example of this class of technology. Schmandt (1993) provides an excellent discussion of the transition from the desktop to mobile based use of voice in computing.

Other Sources

The August 1990 issue of IEEE Computer was a special issue devoted to the voice in computing (Kamel, 1990). It is a good, reasonably current collection. Simpson et al. (1985) is an older (but still relevant) human factors perspective of speech-based interaction, and Schmandt (1993) is an excellent overview.

Cowley (1990) and Cowley and Jones (1993) are videos which provide a human-factors guide to the use of speech.

The Encyclopedia of Artificial Intelligence (Shapiro, 1992) has entries for a number of topics covered above, including Natural Language Generation, Natural Language Understanding, Discourse Understanding and Speech Understanding. These are written mainly from an AI rather than HCI perspective, but are useful to anyone trying to obtain a deeper understanding of the issues involved.

In terms of collections, the proceedings of the annual DARPA Workshop on Speech and Natural Language (DARPA, 1989-92) presents a rich package of papers dealing with theory, technology and applications. It is a very efficient way to obtain an overview of research activities. The proceedings of the annual IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is another rich concentrated source.

Finally, those interested in a source of suppliers for speech-related technologies are referred to Speech Technology magazine and their 1992 buyer's guide (Media Dimensions, 1992).

Nonspeech Audio

Introduction

Unless we are hearing impaired, we have all used nonspeech audio cues on a daily basis all of our lives. Crossing the street, answering the phone, diagnosing problems with our car engine, and whistling for our dog are all common examples. Despite the wide-spread use of this rich mode of interaction in other aspects of our lives, it has had little impact on how we interact with computers. This need not be so. There are significant potential benefits to be reaped by developing our capabilities in this regard.

The audio messages in such systems generally fall into one of three categories:

alarms and warning systems
status and monitoring indicators
encoded messages and data

Alarms and warning systems certainly dominate this class of communication. However, video games illustrate the potential of nonspeech audio to effectively communicate higher-level messages. Just compare an expert player's score when the audio is turned on with the score when the audio is turned off. The likely significant drop in score is an indication that the audio conveys strategically critical information, and is more than just an acoustic "lollipop."

Perhaps the most common application of nonspeech audio has been in alarms and warning systems. To be effective, however, the meaning of each signal must be known to the intended listener. Just like any other language, this is a learned vocabulary. We are not born knowing the meaning of a fog horn, fire alarm, or police siren. If audio cues are to be used in interactive systems, then their design is of utmost importance. As graphic design is to effective icons, so is acoustic design to effective auditory signs, or earcons.

Learning & Remembering

One way to assist users in remembering the meaning of nonspeech audio signals is by metaphor - have the meaning associated with a sound correspond to the meaning of similar sounds in the everyday world. However, the extent to which existing acoustic signs can be exploited remains to be seen. On the one hand, the application of fog horn and fire alarm sounds to computer applications is rather limited. On the other hand, Gaver (1986) has presented some compelling examples which make effective use of existing "world knowledge" of the acoustic environment. One such example makes use of our association of reverberation with empty space. What he proposes is that if there was a reverberant "clunk" when we saved a file, then the amount of reverberation would provide a good cue as to how much free space was left on the disk. Similarly, on the Apple Macintosh, for example, placing a file into the "trash can" could be accompanied by an appropriate "tinny crash".
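
The reverberation cue can be made concrete with a small sketch. The linear mapping and the two-second maximum below are assumptions chosen for illustration, not values proposed by Gaver, and the function name is hypothetical. Only the mapping is shown; producing the reverberant sound itself would require an audio system not covered here.

```python
import shutil


def reverb_seconds(free_bytes, total_bytes, max_reverb=2.0):
    """Map free disk space onto a reverberation time in seconds.

    More free space gives a longer reverberation tail, in the same way
    that an emptier room sounds more reverberant. The linear mapping and
    the default maximum are illustrative assumptions.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = min(max(free_bytes / total_bytes, 0.0), 1.0)
    return max_reverb * free_fraction


if __name__ == "__main__":
    # Query the disk holding the current directory and report the cue
    # that would accompany the "clunk" of a file save.
    usage = shutil.disk_usage(".")
    print(f"reverberation: {reverb_seconds(usage.free, usage.total):.2f} s")
```

A nearly full disk would thus produce a dry, dead-sounding "clunk," while a nearly empty one would ring on for a second or two.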

On first impression, such uses of the audio channel to provide feedback may seem frivolous or unnecessary. However, as soon as we consider the special needs of the visually impaired or those working in critical applications where such encoding can reduce the risk of error, the value of the approach is clear.

Musical & Everyday Listening

There are at least two approaches to nonspeech audio cues, as expressed by Gaver:

musical listening
everyday listening

In musical listening, the "message" is derived from the relationships among the acoustical components of the sound, such as its pitch, timing, timbre, etc. In everyday listening, what one hears is not perceived in this kind of analytical way. What one hears is the source of the sound, not its acoustic attributes. For example, if one hears a door slam, in everyday listening, one pays attention to the fact it was a door, perhaps how big it was, what material it was made from (wood or metal?), and how hard it was slammed. The musical listening analysis of the same thing would deal with issues like: was it a long or short sound, low or high pitched, loud or soft? The auditory icons described in the reading by Gaver and Smith (1990) are an excellent example of auditory design targeting everyday listening.

The raw materials of auditory design based on musical listening are as follows. These are the musical equivalent to colour and line-type in graphic design:

pitch: the primary basis for traditional melody;
rhythm: relative changes in the timing of the attacks of successive events;
tempo: the speed of events;
dynamics: the relative loudness of events (static or varying);
timbre: the difference in spectral content and energy over time that differentiates a saxophone from a flute;
location: where the sound is coming from.

Consider the use of Leitmotiv in Wagner's operas, or Prokofiev's use of musical themes to identify the characters of Peter and the Wolf. First, they show that humans can learn and make effective use of cues based on these musical parameters. Without trying to trivialize Wagner, we have seen the same type of audio cue effectively applied in the video game Pacman. On the other hand, the example of Wagner's operas also serves to show that if you are not intimately familiar with the themes, all of the references will be lost on you. Some researchers, such as Blattner, Sumikawa and Greenberg (1989), have proposed using just such motivic encoding for messages in user interfaces. While the ideas warrant exploring, it must also be pointed out that there is little experimental evidence validating this approach (a notable exception being the study by Brewster, Wright & Edwards, 1993, which showed some benefit of messages employing melodic encoding).

Testing and Validation

Testing and validation are important. While missing a few references in an opera may not be life threatening, the same is not true in piloting an aircraft or monitoring a nuclear power plant. In the Three Mile Island power plant crisis, for example, over 60 different auditory warning systems were activated (Sanders and McCormick, 1987, p. 155).

Momtahan, Hétu and Tansley (1993), for example, performed a study on the audibility and identification of auditory alarms in the operating room (OR) and intensive care unit (ICU) of a hospital. Staff were able to identify only a mean of 10 to 15 of the 26 alarms found in the OR. Only a mean of 9 to 14 of the 23 alarms found in the ICU were identified correctly by nurses who worked there. Only in the OR was the ability to identify alarms positively correlated with their importance.

The biggest problem leading to the results of this study is that there were simply too many alarms in these critical areas, and those that were there were poorly designed. Not only were they hard to distinguish on their own, but some were "masked" (that is, inaudible) due to the noise of machinery (such as a surgical drill) used in the space in normal practice. Due to the design, some alarms actually masked other, simultaneously occurring alarms. Whereas sound can be a valuable channel of communication, this study shows, only too clearly, the risk of bad design and implementation.

Nielsen and Schaefer (1993) studied the effect of nonspeech audio in a computer paint program on users who were between 70 and 75 years old. In the program studied, they found that the sound contributed nothing positive, and served to confuse some of the users. Clearly, the use of sound is no panacea.

If audio cues are to be employed, they must be clear and easily differentiable. They can be effective, but like all parts of the system, they require careful design and testing.

Human Factors

There is ample human-factors literature from which the designer can obtain useful guidelines about the use of nonspeech audio cues. Three excellent sources are Deatherage (1972), Kantowitz and Sorkin (1983) and Sanders and McCormick (1987). Each uses the term audio display to describe this use of the audio channel. Figure 1 summarizes Deatherage's view as to when to use audio displays rather than visual ones.

Use auditory presentation if:	Use visual presentation if:
1. The message is simple.	1. The message is complex.
2. The message is short.	2. The message is long.
3. The message will not be referred to later.	3. The message will be referred to later.
4. The message deals with events in time.	4. The message deals with location in space.
5. The message calls for immediate action.	5. The message does not call for immediate action.
6. The visual system of the person is overburdened.	6. The auditory system of the person is overburdened.
7. The receiving location is too bright, or dark-adaptation integrity is necessary.	7. The receiving location is too noisy.
8. The person's job requires him to move about continually.	8. The person's job allows him to remain in one position.

Figure 1: When to Use Audio or Visual Displays.

Guidelines for determining which of the audio or visual channel to use in displaying information (from Deatherage, 1972, p. 124).

Perhaps the prime attribute of computer-generated audio output is that messages can be conveyed without making use of the visual channel. Visual messages must be seen to be understood. Audio messages are received regardless of where one is looking. This is of particular importance in cases where the visual channel is focused elsewhere, or where the task does not require constant visual monitoring. When the amount of information to be conveyed is high and pushing the visual channel to the limits, the audio channel can also be used to carry some of the information, thereby reducing the overall load.

Applications

Some of the earliest work using nonspeech audio was in data display, or "auralization." This can be considered the acoustic equivalent of scientific visualization. Typically, multidimensional data was presented acoustically by having each dimension of a datum map onto one of the dimensions of a sound. This type of technique was used to perform analysis of variance and to present "sound graphs," for example.
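
The mapping of data dimensions onto sound dimensions can be sketched as follows. The assignment of dimensions (first to pitch, second to duration, third to loudness) and the output ranges are assumptions chosen for illustration; they are not the mappings used by any of the systems cited here.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] onto [out_lo, out_hi].

    Values outside [lo, hi] are clamped to the nearest end of the range.
    """
    if hi == lo:
        return out_lo
    frac = (min(max(value, lo), hi) - lo) / (hi - lo)
    return out_lo + frac * (out_hi - out_lo)


def datum_to_sound(datum, ranges):
    """Map one three-dimensional datum onto the parameters of one sound event.

    datum:  (x, y, z) values of a single data point.
    ranges: [(x_min, x_max), (y_min, y_max), (z_min, z_max)] for the data set.
    """
    x, y, z = datum
    return {
        "pitch_hz": scale(x, *ranges[0], 220.0, 880.0),   # two octaves
        "duration_s": scale(y, *ranges[1], 0.1, 0.5),
        "loudness": scale(z, *ranges[2], 0.2, 1.0),       # fraction of full volume
    }


ranges = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
for point in [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (10.0, 10.0, 10.0)]:
    print(point, datum_to_sound(point, ranges))
```

Played in sequence, the resulting events let a listener hear trends and outliers in much the same way that a scatter plot lets a viewer see them.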

Some of the earlier work is summarized by Bly et al. (1985). Frysinger (1990) is an expanded description of early work summarized in Bly et al. See also Lunney and Morrison (1990), who describe their system for presenting the infrared spectra of various chemical compounds using such techniques. (Like many other researchers in the field, their work was motivated by making scientific data accessible to those with visual impairments.)

Smith, Bergeron and Grinstein (1990) describe an interactive system that uses graphical and acoustical "textures" to represent data sets. This differs from the Lunney and Morrison work, where the computer presented sounds and the user then performed recognition. With Smith et al., one moves the cursor over a surface and hears a texture, not isolated sound events. The effect is very different, and does not employ musical listening.

Audio interfaces clearly have strong relevance for those with impaired vision. But note that we all are visually impaired to some degree when working with computers. This is especially true in collaborative work at a distance, where one cannot always visually monitor the activities of one's collaborators. This is one of the areas where sound has the most potential. The reading by Gaver and Smith (1990) explores some of this design space. Gaver, Smith and O'Shea (1991) presents another excellent exploration of this domain.

Pragmatics of Sound

In the past, one of the biggest problems in exploring the use of audio signals was a logistical one. In her research, for example, Bly had to build special hardware and interface it to her computer. Some important recent developments have changed this situation dramatically. The change is a technological one which has resulted from the music industry adopting a standard protocol for interfacing electronic sound synthesis and processing equipment to computers. This standard is known as MIDI, the Musical Instrument Digital Interface. As a result of this standard, there is a wide range of equipment and interfaces readily available to the researcher who wants to study this area. In addition to the specification (IMA, 1983), an excellent general introduction to MIDI can be found in Loy (1985). Cummings and Milano (1986) give a brief introduction to MIDI and provide valuable pointers to suppliers of MIDI interfaces and related hardware.
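
To give a sense of how compact the protocol is, a MIDI Note On message consists of just three bytes: a status byte (0x90 plus the channel number), a note number and a velocity, with Note Off using status 0x80. The sketch below only builds these byte sequences. The function names are illustrative and belong to no particular library, and actually sending the bytes to a synthesizer requires a MIDI interface and driver that are not shown.

```python
NOTE_ON = 0x90    # status nibble for Note On
NOTE_OFF = 0x80   # status nibble for Note Off


def _check(channel, note, velocity):
    """Validate the ranges allowed by the three-byte channel voice messages."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0..15")
    if not 0 <= note <= 127:
        raise ValueError("note must be 0..127")
    if not 0 <= velocity <= 127:
        raise ValueError("velocity must be 0..127")


def note_on(channel, note, velocity):
    """Build the three bytes of a Note On message."""
    _check(channel, note, velocity)
    return bytes([NOTE_ON | channel, note, velocity])


def note_off(channel, note, velocity=0):
    """Build the three bytes of a Note Off message."""
    _check(channel, note, velocity)
    return bytes([NOTE_OFF | channel, note, velocity])


# A two-note alert: middle C (note 60) followed by the G above it (note 67).
alert = [
    note_on(0, 60, 100), note_off(0, 60),
    note_on(0, 67, 100), note_off(0, 67),
]
print([message.hex() for message in alert])
```

The timing between messages, which determines the rhythm and tempo of the cue, is left to the sending program.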

As personal computers become ever more powerful, and memory cheaper, one needs to rely less on peripherals to produce sounds. Digital recordings of sounds can be stored in memory and played back on demand in the same way as stored speech messages. This is especially appropriate for "everyday" auditory icons, such as used by Gaver.

Complex sounds can also be synthesized "on the fly" in real time from some compact representation. Gaver (1993) discusses some new and potent techniques for doing so. This approach is analogous to the use of text-to-speech techniques.

Finally, Wenzel et al. (1993) summarizes a panel discussion that tries to relate hardware requirements to perceptual performance. The discussion brings together a number of perspectives of benefit to the designer of nonspeech audio systems.

Other Sources

There are collections to which readers interested in pursuing the topic of nonspeech audio can turn. One is a special issue of the journal Human Computer Interaction devoted to the topic (Buxton, 1989). Another is the proceedings of a special SPIE conference (Farrell, 1990). Finally, there are the proceedings of the first International Conference on Auditory Display, ICAD (Kramer, 1994).

Perception and Psychoacoustics

In the preceding, we have discussed the importance of design in the use of acoustic stimuli to communicate information. One of the main resources to be aware of in pursuing such design is the available literature on psychoacoustics and the psychology of music.

Psychoacoustics tells us a great deal about the relationship between perception and the physical properties of acoustic signals. Music and the psychology of music tell us a lot about the human's ability to compose and understand higher level sonic structures. In particular, the literature is quite extensive in addressing issues such as the perception of pitch, duration, and loudness of acoustic signals. It is also fairly good at providing an understanding of masking, the phenomenon of one sound (for example, noise) obscuring another (such as an alarm or a voice). Information at this level is available in any first-year psychology textbook. Those looking for more detailed information are referred to Scharf and Buus (1986), Scharf and Houtsma (1986), Hawkins and Presson (1986), Evans and Wilson (1977), Tobias (1970), and Carterette and Friedman (1978).

One aspect of perception which is not covered in the above references is the topic of signal detection theory. This self-descriptive topic has been an important part of the classical human factors approach to audio signals. References which address the topic at an introductory level are Sanders and McCormick (1987), Deatherage (1972), and Kantowitz and Sorkin (1983).

Under a different name, acoustic design has had a thriving life as music. While music perception is not a part of main-stream human factors, it does have something to contribute. In particular, classic psychoacoustics has dealt primarily with simple stimuli. Music, on the other hand, is concerned with larger structures. Hence, melodic recognition and the perception and understanding of simultaneously sounding auditory streams (as in counterpoint) are of great relevance to audio's use in the human-computer interface. As a reference to this aspect of auditory perception, therefore, we recommend Deutsch (1982, 1986) and Roederer (1975).

The audio channel is an underutilized resource in human-computer interaction. Many users may feel that it is just not worth developing when compared with other areas of research. However, one need only consider the position of visually impaired users to realize that the need is compelling. It becomes even more so when we finally come to the realization that in some (often critical) situations, all of us are visually impaired, perhaps due to the visual system being saturated, loss of light or age. This is an area that deserves and must receive more attention. With MIDI and the widespread availability of personal computers, there is now no excuse for not doing so.

Readings

O'Malley, M. (1990). Text-to-Speech Conversion Technology, IEEE Computer, 23(8), 17-23.

Peacocke, R. & Graf, D. (1990). An Introduction to Speech and Speaker Recognition, IEEE Computer, 23(8), 26-33.

White, G. (1990). Natural Language Understanding and Speech Recognition, Communications of the ACM, 33(8), 72-82.

Gaver, W. & Smith, R. (1990). Auditory icons in large-scale collaborative environments. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90, Elsevier Science Publishers B.V. (North-Holland), 735-740.

References/Bibliography

Allen, R.B. (1983). Composition and Editing of Spoken Letters, International Journal of Man-Machine Studies, 19(2), 181-193.

Allen, J. (1985). A Perspective on Man-Machine Communication by Speech, Proceedings of the IEEE, 73(11), 1541-1550.

Arons, B. (1992). Techniques, Perception, and Applications of Time-Compressed Speech. Proceedings of the American Voice I/O Society Conference, AVIOS '92, 169-177.

Arons, B. (1993). SpeechSkimmer: Interactively Skimming Recorded Speech, Proceedings of UIST'93, 187-196.

Aucella, A., Kinkead, R., Schmandt, C. & Wichansky, A. (1987). Voice: Technology Searching for Communication Needs, Panel Summary, Proceedings of CHI + GI '87, 41-44.

Baecker, R. & Buxton, W. (Eds.)(1987). Readings in Human Computer Interaction: A Multidisciplinary Approach, Los Altos, CA.: Morgan Kaufmann Inc.

Bailey, P. (1984). Speech Communication: The Problem and Some Solutions, in A. Monk (Ed.). Fundamentals of Human-Computer Interaction, London: Academic Press, 193-220.

Biermann, A., Rodman, R., Rubin, D. & Heidlage, J. (1985). Natural Language with Discrete Speech as a Mode for Human-to-Machine Communication, Communications of the ACM, 28(6), 628-636.

Blattner, M., Sumikawa, D. & Greenberg, R. (1989). Earcons and icons: Their structure and common design principles. Human-Computer Interaction 4(1), Spring 1989.

Bly, S. et al. (1985). Communicating with Sound, Proceedings of CHI'85, 115-119.

Bolt, R. A. (1984). The Human interface: Where People and Computers Meet. London: Lifetime Learning Publications.

Brennan, P. et al. (1991). Should we or Shouldn't we use Spoken Commands in Voice Interfaces? Panel Session in Proceedings of CHI'91, 369-372.

Brewster, S., Wright, P. & Edwards, A. (1993). An Evaluation of Earcons for Use in Auditory Human-Computer Interfaces. Proceedings of INTERCHI'93, 222-227.

Buxton, W. (1989). Introduction to this Special Issue on Non-Speech Audio. Human-Computer Interaction, 4(1), 1-9.

Buxton, W. (1990). The Natural Language of Interaction: A Perspective on Non-Verbal Dialogues. In Laurel, B. (Ed.). The Art of Human-Computer Interface Design, Reading, MA: Addison-Wesley. 405-416.

Buxton, W. (1993). HCI and the inadequacies of Direct Manipulation systems. SIGCHI Bulletin, 25(1), 21-22.

Carterette, E. & Friedman, M. (Eds.) (1978). Hearing, Handbook of Perception, Volume IV, New York: Academic Press.

Chalfonte, B., Fish, R. & Kraut, R. (1991). Expressive Richness: A Comparison of Speech and Text as Media for Revision, Proceedings of CHI'91, 21-26.

Cohen, J. (1993). XX, Proceedings of INTERCHI'93, XX-XX.

Cohen, P. (1992). The Role of Natural Language in a Multimodal Interface, Proceedings of UIST'92, 143-149.

Cummings, S. & Milano, D. (1986). Computer to MIDI Interfaces. Keyboard Magazine, January 1986, 41-44.

DARPA (1989) Proceedings of the 1989 DARPA Workshop on Speech and Natural Language. San Francisco: Morgan Kaufmann.

DARPA (1990) Proceedings of the 1990 DARPA Workshop on Speech and Natural Language. San Francisco: Morgan Kaufmann.

DARPA (1991) Proceedings of the 1991 DARPA Workshop on Speech and Natural Language. San Francisco: Morgan Kaufmann.

DARPA (1992) Proceedings of the 1992 DARPA Workshop on Speech and Natural Language. San Francisco: Morgan Kaufmann.

Das, S. & Nadas, A. (1992). The Power of Speech, Byte, 17(4), 151-160.

Deatherage, B. H. (1972). Auditory and Other Sensory Forms of Information Presentation. In H. P. Van Cott & R. G. Kinkade (Eds), Human Engineering Guide to Equipment Design (Revised Edition). Washington: U.S. Government Printing Office.

Deutsch, D. (Ed.) (1982). The Psychology of Music, New York: Academic Press.

Deutsch, D. (1986). Auditory Pattern Recognition, in K.R. Boff, L. Kaufman & J.P. Thomas (Eds.), Handbook of Perception and Human Performance, Volume II, New York: John Wiley and Sons, 32.1-32.49.

DiGiano, C. & Baecker, R. (1992). Program auralization: sound enhancements to the programming environment. Proceedings of Graphics Interface '92, 44-52.

Doddington, G.R. (1985). Speaker Recognition: Identifying People by their Voices, Proceedings of the IEEE, 73(11), 1651-1664.

Edwards, A. (1989). Soundtrack: an auditory interface for blind users. Human-Computer Interaction 4(1), 45-66.

Evans, E. & Wilson, J. (Eds.) (1977). Psychophysics and Physiology of Hearing. New York: Academic Press.

Fallside, F. & Woods, W. (1985). Computer Speech Processing, Englewood Cliffs: Prentice-Hall.

Farrell, E. (Ed.)(1990). Extracting meaning from complex data: processing, display, interaction. Proceedings of the SPIE, Vol 1259.

Fedder, L. (1990), Recent Approaches to Natural Language Generation. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 801-805.

Flanagan, J.L. (1976). Computers that Talk and Listen: Man-Machine Communication by Voice, Proceedings of the IEEE, 64(4), 405-415.

Franzke, M., Marx, A., Roberts, T. & Engelbeck, G. (1993). Is Speech Recognition Usable? An Exploration of the Usability of a Speech-Based Voice Mail Interface, SIGCHI Bulletin, 25(3), 49-51.

Frysinger, S.P. (1990). Applied research in auditory data representation. In E. Farrell (Ed.). Extracting meaning from complex data: processing, display, interaction. Proceedings of the SPIE, Vol 1259, 130-139.

Gaver, W. (1986). Auditory Icons: Using Sound in Computer Interfaces, Human-Computer Interaction, 2(2), 167-177.

Gaver, W. (1989). The SonicFinder: An interface that uses auditory icons. Human-Computer Interaction 4(1), 67-94.

Gaver, W. (1993). Synthesizing Auditory Icons, Proceedings of INTERCHI'93, 228-235.

Gaver, W., Smith, R. & O'Shea, T. (1991). Effective Sounds in Complex Systems: The ARKola Simulation, Proceedings of CHI'91, 85-90.

Gould, J.D. (1982). Writing and Speaking Letters and Messages, International Journal of Man-Machine Studies, 16(2), 147-171.

Gould, J.D. & Alfaro, L. (1984). Revising Documents with Text Editors, Handwriting Recognition Systems, and Speech-Recognition Systems, Human Factors, 26(4), 391-406.

Gould, J. & Boies, S.J. (1983). Human Factors Challenges in Creating a Principal Support Office System - The Speech Filing Approach. ACM Transactions on Office Information Systems 1(4), 273-298.

Gould, J.D. & Boies, S.J. (1984). Speech Filing: An Office System for Principals, IBM Systems Journal, 23(1), 65-81.

Gould, J.D., Conti, J. & Hovanyecz, T. (1983). Composing Letters with a Simulated Listening Typewriter, Communications of the ACM, 26(4), 295-308.

Grinstein, G. & Smith, S. (1990). The perceptualization of scientific data. In E. Farrell (Ed.). Extracting meaning from complex data: processing, display, interaction. Proceedings of the SPIE, Vol 1259, 190-199.

Hawkins, H.L. & Presson, J.C. (1986). Auditory Information Processing, in K.R. Boff, L. Kaufman & J.P. Thomas (Eds.), Handbook of Perception and Human Performance, Volume II, New York: John Wiley and Sons, 26.1-26.64.

Hauptmann, A.G. (1989). Speech and Gestures for Graphic Image Manipulation. Proceedings of CHI'89, 241-245.

IMA (1983). MIDI Musical Instrument Digital Interface Specification 1.0, IMA, 11857 Hartsook St., North Hollywood, CA, 91607, USA.

Jelinek, F. (1985). The Development of an Experimental Discrete Dictation Recognizer, Proceedings of the IEEE, 73(11), 1616-1624.

Jusczyk, P.W. (1986). Speech Perception, in K.R. Boff, L. Kaufman & J.P. Thomas (Eds.), Handbook of Perception and Human Performance, Volume II, New York: John Wiley and Sons, 27.1-27.57.

Kamel, R. (Ed.) (1990). Voice in Computing, Special issue of IEEE Computer, 23(8), August 1990.

Kamel, R., Emami, K. & Eckert, R. (1990). PX: Supporting Voice in Workstations, IEEE Computer, 23(8), 73-80.

Kantowitz, B. & Sorkin, R. (1983). Human Factors: Understanding People-System Relationships, New York: John Wiley & Sons.

Kaplan & Lerner (1985). Realism in Synthetic Speech, IEEE Spectrum, 22(4), 32-37.

Karl, L., Pettey, M. & Shneiderman, B. (1993). Speech versus Mouse Commands for Word Processing: An Empirical Evaluation, International Journal of Man-Machine Studies, 39(4), 667-687.

Klatt, D.H. (1987). Review of Text-to-Speech Conversion for English. Journal of the Acoustical Society of America, 82, 737-783.

Koons, D., Sparrell, C. & Thorisson, K. (1993). Integrating Simultaneous Input from Speech, Gaze and Hand Gestures. In M. Maybury (Ed.), Intelligent Multimedia Interfaces, Menlo Park, CA.: AAAI Press / MIT Press, 257-276.

Kramer, G. (Ed.)(1994). Auditory Display: The Proceedings of ICAD'92, The International Conference on Auditory Display, Santa Fe Institute Studies in the Sciences of Complexity, Proceedings Volume XVIII. Reading, MA.: Addison-Wesley.

Kurtenbach, G. & Buxton, W. (1994). User learning and performance with marking menus. Proceedings of CHI '94, Boston, April 24-28.

Lawrence, D. & Stuart, R. (1990). Case Study of Development of a User Interface for a Voice Activated Dialling Service. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 773-777.

Lazzaro, J. (1992). Even as we Speak, Byte, 17(4), 165-172.

Lee, D. L. & Lochovsky, F. H. (1983). Voice Response Systems, ACM Computing Surveys, 15(4), 351-374.

Lee, J. & Zeevat, H. (1990). Integrating Natural Language and Graphics in Dialogue. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 479-484.

Lennig, M. (1990). Putting Speech Recognition to Work in the Telephone Network, IEEE Computer, 23(8), 35-41.

Levine, S.R. & Ehrlich, S.F. (1991). The Freestyle system: a design perspective. In A. Klinger (Ed.). Human-Machine Interactive Systems. New York: Plenum Press, 3-21.

Levinson, S. E. & Liberman, M. (1981). Speech Recognition by Computer, Scientific American, 244(4), 64-76.

Loy, G. (1985). Musicians Make a Standard: The MIDI Phenomenon, Computer Music Journal, 9(4), 8-26.

Ludwig, L., Pincever, N. & Cohen, M. (1990). Extending the notion of a window system to audio. IEEE Computer, 23(8), 66-72.

Lunney, D. & Morrison, R.C. (1990). Auditory presentation of experimental data. In E. Farrell (Ed.). Extracting meaning from complex data: processing, display, interaction. Proceedings of the SPIE, Vol 1259, 140-146.

Martin, G.L. (1989). The Utility of Speech in Human-Computer Interfaces. International Journal of Man-Machine Studies, 30(4), 355-375.

Media Dimensions (1992). 1992 Voice Systems Applications Buyer's Guide, Speech Technology, February/March 1992, 59-86.

Michaelis, P.R. & Wiggins, R.H. (1982). A Human Factors Engineer's Introduction to Speech Synthesizers, in A. Badre & B. Shneiderman (Eds.). Directions in Human/Computer Interaction, Norwood, N.J., Ablex Publishing Corp., 149-178.

Momtahan, K., Hétu, R. & Tansley, B. (1993). Audibility and Identification of Auditory Alarms in the Operating Room and Intensive Care Unit, Ergonomics, 36(10), 1159-1176.

Moody, T., Joost, M. & Rodman, R. (1987). The Effects of Various Types of Speech Output on Listener Comprehension Rates, in H.J. Bullinger & B. Shackel (Eds.) Human-Computer Interaction - INTERACT '87, Amsterdam: Elsevier (North Holland), 573-578.

MUC (1991). Proceedings of the Third Message Understanding Conference (MUC-3). San Francisco: Morgan Kaufmann.

MUC (1992). Proceedings of the Fourth Message Understanding Conference (MUC-4). San Francisco: Morgan Kaufmann.

Mynatt, E. & Edwards, K. (1992). Mapping GUIs to Auditory Interfaces, Proceedings of UIST'92, 61-70.

Nakatsu, R. (1990). Anser: An Application of Speech Technology to the Japanese Banking Industry, IEEE Computer, 23(8), 43-48.

Newell, A., Arnott, J., Dye, R. & Cairns, A. (1991). A Full-Speed Listening Typewriter Simulation, International Journal of Man-Machine Studies, 35(2), 119-131.

Nielsen, J. & Schaefer, L. (1993). Sound Effects as an Interface Element for Older Users, Behaviour & Information Technology, 12(4), 208-215.

Ogden, W. & Sorknes, A. (1987). What do Users Say to their Natural Language Interface?, in H.J. Bullinger & B. Shackel (Eds.) Human-Computer Interaction - INTERACT '87, Amsterdam: Elsevier (North Holland), 561-566.

O'Malley, M. (1990). Text-to-Speech Conversion Technology, IEEE Computer, 23(8), 17-23.

Peacocke, R. & Graf, D. (1990). An Introduction to Speech and Speaker Recognition, IEEE Computer, 23(8), 26-33.

Pereira, F. & Grosz, B. (Eds.)(1993). Special Volume on Natural Language Processing, Artificial Intelligence 63(1-2).

Pisoni, D.B., Nusbaum, H.C. & Greene, B.G. (1985). Perception of Synthetic Speech Generated by Rule, Proceedings of the IEEE, 73(11), 1665-1671.

Rich, E. (1984). Natural-Language Interfaces, IEEE Computer, 39-47.

Rimé, B. & Schiaratura, L. (1991). Gesture and Speech. In R.S. Feldman & B. Rimé (Eds.). Fundamentals of Nonverbal Behaviour. New York: Press Syndicate of the University of Cambridge. 239-281.

Rosson, M.B. & Cecala, A.J. (1986). Designing a Quality Voice: An Analysis of Listeners' Reactions to Synthetic Voices, Proceedings of CHI'86, 192-197.

Rudnicky, A. & Hauptmann, A. (1992). Multimodal Interaction in Speech Systems, in M. Blattner & R. Dannenberg, (Eds.), Multimedia Interface Design, Reading, MA.: Addison-Wesley, 147-171.

Rudnicky, A., Sakamoto, M. & Polifroni, J. (1990). Spoken Language Interaction in a Spreadsheet Task. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 767-772.

Salisbury, M., Hendrickson, J., Lammers, T., Fu, C. & Moody, S. (1990) Talk and Draw: Bundling Speech and Graphics, IEEE Computer, 23(8), 59-65.

Sanders, M. S. & McCormick, E. J. (1987). Human Factors in Engineering and Design (6th Edition), New York: McGraw-Hill.

Scharf, B. & Buus, S. (1986). Audition I: Stimulus, Physiology, Thresholds, in K.R. Boff, L. Kaufman & J.P. Thomas (Eds.), Handbook of Perception and Human Performance, Volume I, New York: John Wiley and Sons, 14.1-14.71.

Scharf, B. & Houtsma, A.J.M. (1986). Audition II: Loudness, Pitch, Localization, Aural Distortion, Pathology, in K.R. Boff, L. Kaufman & J.P. Thomas (Eds.), Handbook of Perception and Human Performance, Volume I, New York: John Wiley and Sons, 15.1-15.60.

Schmandt, C. (1984). Speech Synthesis Gives Voiced Access to an Electronic Mail System. Speech Technology, August-September, 66-88.

Schmandt, C. (1985). Voice Communication with Computers, in H.R. Hartson (Ed.), Advances in Human-Computer Interaction Volume I, Norwood, N.J.: Ablex Publishing, 133-160.

Schmandt, C. (1993). From Desktop Audio to Mobile Access: Opportunities for Voice in Computing. In H.R. Hartson & D. Hix (Eds.), Advances in Human-Computer Interaction Volume 4, Norwood, N.J.: Ablex Publishing, 251-283.

Schmandt, C., Ackerman, M. & Hindus, D. (1990). Augmenting a Window System with Speech Input, IEEE Computer, 23(8), 50-56.

Schmandt, C., Hindus, D., Ackerman, M. & Manandhar, S. (1990). Observations on Using Speech Input for Window Navigation. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 787-793.

Schmandt, C. & McKenna, M. (1988). An Audio and Telephone Server for Multi-Media Workstations. Proceedings of the 2nd IEEE Conference on Computer Workstations, 150-159.

Shapiro, S.C. (Ed.)(1992). Encyclopedia of Artificial Intelligence (2nd Edition), Vol. 1 & 2. New York: John Wiley & Sons.

Simpson, C.A., McCauley, M.E., Roland, E.F., Ruth, J.C. & Williges, B.H. (1985). System Design for Speech Recognition and Generation, Human Factors, 27(2), 115-141.

Sola, I. & Shepard, D. (1990). A Voice Recognition Interface for a Telecommunications Basic Business Group Attendant Console. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 779-75.

Smith, S., Bergeron, R.D. & Grinstein, G.G. (1990). Stereoscopic and surface sound generation for exploratory data analysis. Proceedings of CHI'90, ACM Conference on Human Factors of Computing Systems, 125-132.

Stifelman, L., Arons, B., Schmandt, C. & Hulteen, E. (1993). VoiceNotes: A Speech Interface for a Hand-Held Notetaker, Proceedings of INTERCHI'93, 179-186.

Strathmeyer, C. (1990). Voice in Computing: An Overview of Available Technologies, IEEE Computer, 23(8), 10-15.

Thomas, J. & Rosson, M. B. (1984). Human Factors and Synthetic Speech, Proceedings of the Human Factors Society, Volume 2, 763-767.

Thorisson, K., Koons, D. & Bolt, R. (1992). Multi-Modal Natural Dialogue. Proceedings of CHI'92, 653-654.

Tobias, J. (Ed.) (1970). Foundations of Modern Auditory Theory, Vol I & II, New York: Academic Press.

Vin, H., Zellweger, P., Swinehart, D. & Rangan, V. (1991). Multimedia Conferencing in the Etherphone Environment. IEEE Computer, 24(10), 69-79.

Waibel, A. & Lee, K. (Eds.)(1990). Readings in Speech Recognition. San Francisco: Morgan Kaufmann.

Wallich, P. (1987). Putting Speech Recognizers to Work. IEEE Spectrum, 24(4), 55-57.

Waterworth, J. (1984). Speech Communication: How to Use It, in A. Monk (Ed.), Fundamentals of Human-Computer Interaction, London: Academic Press, 221-236.

Wattenbarger, B., Garberg, R., Halpern, E. & Lively, B. (1993). Serving Customers with Automatic Speech Recognition - Human-Factors Issues, AT&T Technical Journal, 72(3), 28-41.

Weimer, D. & Ganapathy (1989). A Synthetic Visual Environment with Hand Gesturing and Voice Input, Proceedings of CHI'89, 235-240.

Wenzel, E. et al. (1993). Perceptual vs. Hardware Performance in Advanced Acoustic Interface Design, Panel session in Proceedings of INTERCHI'93, 363-366.

White, G. (1990). Natural Language Understanding and Speech Recognition, Communications of the ACM, 33(8), 72-82.

Wickens, C.D., Mountford, S.J. & Schreiner, W. (1981). Multiple Resources, Task-Hemispheric Integrity, and Individual Differences in Time Sharing. Human Factors, 23, 211-230.

Wilcox, L., Smith, I. & Bush, M. (1992). Wordspotting for Voice Editing and Audio Indexing. Proceedings of CHI'92, 655-656.

Zajicek, M. & Hewitt, J. (1990). An Investigation into the use of Error Recovery Dialogues in a User Interface Management System for Speech Recognition. In D. Diaper et al. (Eds), Human-Computer Interaction - INTERACT '90. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 755-760.

Zellweger, P., Terry, D. & Swinehart, D. (1988). An Overview of the Etherphone System and its Applications. Proceedings of the 2nd IEEE Conference on Computer Workstations, 160-168.

Zue, V.W. (1985). The Use of Speech Knowledge in Automatic Speech Recognition, Proceedings of the IEEE, 73(11), 1602-1615.

Video Examples

Apple Computer (1992). The Knowledge Navigator. SIGGRAPH Video Review 79, New York: ACM.

Arons, B. (1993). Hyperspeech, SIGGRAPH Video Review 88, New York: ACM.

Cowley, C. (1990). A Human-Factors Guide to Computer Speech. SIGGRAPH Video Review 65, New York: ACM.

Cowley, C. & Jones, D. (1993). Talking to Machines, SIGGRAPH Video Review 88, New York: ACM.

Curtis, G. (1985). Preparing a Meal. SIGGRAPH Video Review 19, New York: ACM.

Davis, J. (1989). Direction Assistance. SIGGRAPH Video Review 48, New York: ACM.

Ensor, J.R. (1989). Rapport. SIGGRAPH Video Review 45, New York: ACM.

Francik, E. (1989). Wang Freestyle. SIGGRAPH Video Review 45, New York: ACM.

Gould, J. (1985). Olympic Messaging System. SIGGRAPH Video Review 19, New York: ACM.

Rudnicky, A. (1990). Spoken Language Interfaces: The OM System. SIGGRAPH Video Review 64, New York: ACM.

Schmandt, C. et al. (1984). Put That There. SIGGRAPH Video Review 13, New York: ACM.

Schmandt, C. et al. (1987). Conversational Desktop. SIGGRAPH Video Review 27, New York: ACM.

Thorisson, K., Koons, D. & Bolt, R. (1992). Multi-Modal Natural Dialogue. SIGGRAPH Video Review 76, New York: ACM.

Weimer, D. (1990). Three Dimension Interfaces in Shared Environments. SIGGRAPH Video Review 55, New York: ACM.

Wilcox, L., Smith, I. & Bush, M. (1992). Wordspotting for Voice Editing and Audio Indexing. SIGGRAPH Video Review 76, New York: ACM.

Wong, P. (1984). Rapid Prototyping Using FLAIR. SIGGRAPH Video Review 12, New York: ACM.

Zellweger, P. et al. (1990). More Voice Applications in Cedar. SIGGRAPH Video Review 59, New York: ACM.