Buxton, W. (1988). The Natural Language of Interaction: A Perspective on Non-Verbal Dialogues, INFOR: Canadian Journal of Operations Research and Information Processing, 26(4), 428-438.

also published in:

Buxton, W. (1990). The Natural Language of Interaction: A Perspective on Non-Verbal  Dialogues. In Laurel, B. (Ed.). The Art of Human-Computer Interface Design, Reading, MA: Addison-Wesley. 405-416.


The "Natural" Language of Interaction:
A Perspective on Non-Verbal Dialogues

Bill Buxton




ABSTRACT

The argument is made that the concept of "natural language understanding systems" should be extended to include non-verbal dialogues. The claim is made that such dialogues are, in many ways, more natural than those based on words. Furthermore, it is argued that the hopes for verbal natural language systems are out of proportion, especially when compared with the potential of systems that could understand natural non-verbal dialogue. The benefits of non-verbal natural language systems can be delivered by technology available today. In general, the benefits will most likely exceed those of verbal interfaces (if and when they ever become generally available).

This is a revision of a paper that previously appeared under the same title in Proceedings of CIPS '87, Intelligent Integration, Edmonton, Canadian Information Processing Society, 311-316, and in INFOR: Canadian Journal of Operations Research and Information Processing, 26(4), 428-438.


INTRODUCTION

There is little dispute that the user interface is a bottle-neck restricting the potential of today's computational and communications technologies. When we begin to look for solutions to this problem, however, consensus evaporates. Researchers and users all have their own view of how we should interact with computers, and each of these views is different. If there is a thread of consistency, however, it is generally in the view that user interfaces should be more "natural." Within the AI community, especially, this translates into "natural language understanding systems" which are put forward as the great panacea that will free us from all of our current problems.

The question is, is this hope realistic? The answer, we believe, lies very much in what is meant by "natural language".

What is normally meant by the term is the ability to converse using a language like English or German. When conversing with a machine, such conversations may be coupled with speech understanding and synthesis, or may involve typing using a more conventional keyboard and CRT. Regardless, our personal view is that the benefits and applicability of such systems will be limited, due largely to the imprecise and verbose nature of such language.

But we do not want to argue that point, since it has too much in common with arguments against motherhood or about politics and religion. More importantly, it is secondary to our principal thesis: that this class of conversation represents only a small part of the full range of natural language.

We argue that there is a rich and potent gestural language which is at least as "natural" as verbal language, and which - in the short and long term - may have a more important impact on facilitating human-computer interaction. And, despite its neglect, we argue that this type of language can be supported by existing technology, and so we can reap the potential benefits immediately.

ANOTHER VIEW OF NATURAL LANGUAGE

There is probably little argument that verbal language coexists with a rich variety of manual gestures (that seems to increase in range as one approaches the Mediterranean Sea). The real question is, what does this have to do with computers, much less with natural language? What we are going to argue is that such gestures are part of a non-verbal vocabulary which is natural, and is a language capable of efficiently communicating powerful concepts to a computer.

The burden of proof, therefore, is to establish the communicative potential of such a language, and to show that it is, in fact, natural. Our approach is to argue by demonstration and by example. We will provide some concrete demonstrations of how such language can be used, and argue that it is natural in the sense that users come to the system with the basic requisite communication skills already in place. Our main hope is that we may be able to cause some researchers to rethink their priorities, and direct more attention to this aspect of interaction than has previously been the case.

THE MACINTOSH AS VICTIM

Before going too much further, it is probably worth making a few comments about my approach and my examples. As will become pretty evident, the Apple Macintosh takes a bit of a beating in what follows. Some readers may view this as evidence of contempt for its design. From my perspective, it is evidence of respect. Let me explain.

User interface design today is plagued by an unhealthy degree of complacency. We live in a world of copy-cat, unimaginative interface products, where major manufacturers put out double-page colour spreads in Business Week announcing "... our system's user interface is as easy to use as the most easy to use microcomputer [i.e., the Mac]" (or words to that effect). In what other industry can you get away with stating that you're as good as the competition, rather than better? Especially when the competition's product is over 4 years old!

The best ideas are the most dangerous, since they take hold and are the hardest to change. Hence, the Macintosh is dangerous to the progress of user interfaces precisely because it was so well done! Designers seem to be viewing it as a measure of success rather than as a point of departure. Consequently, it runs the risk of becoming the Cobol of the 90's.

I criticize the Macintosh precisely because it is held in such high regard. It is one of the few worthy targets. If I can make the point that there are other design options, and that these options appear to have a lot of potential, then I may help wake people up to the fact that it is an unworthy design objective to aim for anything less than trying to do to the Macintosh what the Macintosh did to the previous state-of-the-art.

Enough of the sermon. Let's leave the pulpit for some concrete examples.

OF PROOF-READERS AND FOOTBALL COACHES

Let us start off with a "real" computer-relevant example. Of all applications, perhaps the human factors of text editing have been the most studied. Within text editors (and command-line interfaces in general), perhaps no linguistic construct has received more attention and been more problematic than that of a verb that has both a direct and an indirect object.

The classic examples of this construct, largely because they are so ubiquitous, are the move and copy operations. In fact, the problems posed by verbs having two different types of operand are such that in programs like MacWrite, move and copy have each been replaced by two operations (cut-and-paste and copy-and-paste, respectively). Each step of these compound operations has only one explicit operand (the second, implicit "magic" operand being the clipboard). The sketch after Figure 1 makes this decomposition concrete.
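
To make the contrast concrete, here is an illustrative sketch (it is not the code of any real editor, and all names are invented) showing a single two-operand move verb alongside its decomposition into cut and paste, each with one explicit operand and the clipboard as the implicit second operand.

```python
# Illustrative sketch (not the code of any real editor) contrasting a single
# two-operand "move" verb with the MacWrite-style decomposition into
# cut-and-paste via an implicit clipboard operand.

document = list("the quick brown fox")
clipboard = []                       # the implicit, "magic" second operand

def move(src, dst):
    """One verb, two explicit operands: a direct object (the text spanned
    by src) and an indirect object (the destination index dst)."""
    start, end = src
    chunk = document[start:end]
    del document[start:end]
    if dst > end:
        dst -= (end - start)         # account for the text just removed
    document[dst:dst] = chunk

def cut(src):
    """One explicit operand; the second (the clipboard) is implicit."""
    start, end = src
    clipboard[:] = document[start:end]
    del document[start:end]

def paste(dst):
    """Again one explicit operand; the clipboard supplies the other."""
    document[dst:dst] = clipboard

# move((4, 10), 0) has the same effect as the pair cut((4, 10)); paste(0):
# both turn "the quick brown fox" into "quick the brown fox".
```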


(a) Proof-Reader's "Move" Symbol

(b) The 49er Double Scrape
Figure 1: Similar notations for two types of information.

Figure 1(a) illustrates the use of proof-reader's notation to specify a spatial relationship within a document. Figure 1(b) uses a similar notation to specify a spatial relationship over time in the context of a football play.


What is clear, however, to anyone who has ever annotated a document, or seen a football playbook, is that there exists an alternative notation (read "language") for expressing these concepts. This is characterized by the gestural proof-reader's move symbol. Whether intended in the spatial or temporal sense, the notation is clear, succinct, and is known independently of (but is understandable by) computers.

Figure 1(a) shows how the proof-reader's move symbol can be used to specify a verb with two objects (direct and indirect) without any ambiguity. Figure 1(b) illustrates the use of essentially the same notational language in a very different context. This time it is being used to illustrate plays in a football game. Despite the context, the notation is clear and unambiguous, will virtually never result in an error of syntax, and is known to the user before a computer is ever used. And, it can be used to articulate concepts that users traditionally have a great deal of trouble expressing.

Could this type of notation be the "natural" language for this type of concept? More to the point, do any of today's "state-of-the-art" user interfaces even begin to let us find out?

NATURAL LANGUAGES ARE LEARNED

The capacity for language is one of the things that distinguishes humans from the other animal species. Having said that, nobody would argue that humans are born with language. Even "natural" languages are learned. Anyone who has tried to learn a foreign language (a language that is "natural" to others) knows this. We are considered a native speaker if and when we have developed fluency in the language by the time we are required to draw upon those language skills.

If we orient our discussion around computers, then the same rules apply. A language could be considered "natural" if, upon approaching the computer, the typical user already has language skills adequate for expressing desired concepts in a rich, succinct, fluent, and articulate manner.

By this definition, most methods of interacting with computers are anything but natural. But is there an untapped resource there, a language resource that users bring to the system on first encounter that could provide the basis for such "natural" dialogues?

Yes!

WHAT'S NATURAL TO YOU IS FOREIGN TO ME

One of the problems with arguments in favour of natural language interfaces is the unspoken implication that such systems will be universally accessible. But even if we restrict ourselves to the consideration of verbal language, we must accept the reality of foreign languages. German, for example, is different from English in both vocabulary and syntax: there is no universally agreed placement of verbs in sentences, for instance.

The point to this train of thought, which is a continuation of the previous comments about languages being learned, is that so-called natural languages are only natural to those who have learned them. All others are foreign.

If we start to consider non-verbal forms of communication, the same thing holds true. The graphic artist's language of using an airbrush, for example, is foreign to the house painter. Similarly, the architectural draftsperson has a language which includes the use of a drafting machine in combination with a pencil. Each of these "languages" is natural (albeit learned) for the profession.

But the argument will now be raised that I am playing with words, and that what I am talking about are specialized skills developed for the practice of particular professions, or domains of endeavor. But how is that different from verbal language? What is conventional verbal language if not a highly learned skill developed to enable one to communicate about various domains of knowledge?

Where all of this is heading is the observation that the notion of a universally understood natural language is naive and not very useful. Each language has special strengths and weaknesses in its ability to communicate particular concepts. Languages can be natural or foreign to concepts as well as speakers. A true "natural language" system is only achieved when the language employed is natural to the task, and the person engaged in that task is a native speaker. But perhaps the most important concept underlying this is acknowledging that naturalness is domain specific, and we must, therefore, support a variety of natural languages.

WHERE THERE'S LANGUAGE THERE MUST BE PHRASES

Let us accept, therefore, that there is a world of natural non-verbal languages out there waiting to be tapped for the purpose of improved human-computer interaction. Then where are the constructs and conventions that we find in verbal language? Are there, for example, concepts such as sentences and phrases?

Let us look at one of our favorite examples (Buxton, 1986b). Consider making a selection using a pop-up menu. Conceptually, you are doing one thing: making a choice. But if we look more closely, there is a lot more going on (as the underlying parser would tell you if it knew how). You are "uttering" a complex sentence which includes:

  1. invoking the menu (depressing the mouse button);
  2. navigating to the desired item (moving the mouse while the button is held down);
  3. making the selection and concluding the transaction (releasing the button).

While you generate each of these tokens, you are not aware of the mechanics of doing so. The reason is that from the moment that you depress the mouse button, you are in a non-neutral state of tension (in your finger). While you may make a semantic error (make the wrong selection), everything in the system is biased towards the fact that the only reasonable action to conclude the transaction is to release your finger. There is virtually no cognitive overhead in determining the mechanics of the interaction. Tension is used to bind together the tokens of the transaction just as musical tension binds together the notes in a phrase. This is a well-designed interaction, and if anything deserves to be called natural, this does.
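
The structure of that phrase can be made explicit with a small sketch (hypothetical names, not drawn from any particular toolkit): the button-down event opens the phrase, motion while the button is held merely refines the highlighted item, and the button-up event is the only token that can close the phrase, so an error of syntax is effectively impossible.

```python
# A minimal sketch of a pop-up menu "phrase" parser. All names are
# hypothetical; the point is only that the button press opens the phrase,
# motion refines it, and the release is the sole way to close it.

class PopupMenu:
    def __init__(self, items):
        self.items = items
        self.open = False          # are we inside the phrase?
        self.highlighted = None    # item currently under the cursor

    def button_down(self, x, y):
        """Start of the phrase: muscular tension begins here."""
        self.open = True
        self.highlighted = self.item_at(x, y)

    def mouse_move(self, x, y):
        """Middle of the phrase: only meaningful while the button is held."""
        if self.open:
            self.highlighted = self.item_at(x, y)

    def button_up(self, x, y):
        """End of the phrase: releasing the button is the only way out,
        so an error of syntax is effectively impossible."""
        if not self.open:
            return None
        self.open = False
        return self.item_at(x, y)  # may miss (a semantic error), never a syntax error

    def item_at(self, x, y):
        # Trivial hit test: one item per 20-pixel row below the menu origin.
        row = y // 20
        return self.items[row] if 0 <= row < len(self.items) else None

# e.g. menu = PopupMenu(["cut", "copy", "paste"])
#      menu.button_down(0, 0); menu.mouse_move(0, 45)
#      menu.button_up(0, 45)    # -> "paste"
```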

But let us go one step further. If appropriate phrasing of gestural languages can be used to reduce or eliminate errors of syntax (as in the pop-up menu example), can we find instances where lack of phrasing permits us to predict errors that will be made? The question is rhetorical, and we can find an instance even within our pop-up menu example.

Consider the case where the item being selected from the menu is a verb, like cut, which requires a direct object. Within a text editor, for example, specifying the region of text to be cut has no gestural continuity (i.e., is not articulated in the same "phrase") with the selection of the cut operator itself. Consequently, we can predict (and easily verify by observation) a common error in systems that use this technique: that users will invoke the verb before they have selected the direct object. As a result, they must restart, select the text, then reselect cut.

To find the inverse situation, where the use of phrasing enables us to predict where this type of error will not occur, we need look no further than the proof-reader's symbol example discussed earlier. The argument is, if the language is fluid and natural, it will permit complete concepts to be articulated in fluid connected phrases. That is natural, and if such a criterion is followed in language design, learning time and error rates will drop while efficiency will improve.

ON HEAD-TAPPING AND STOMACH-RUBBING

If there is anything that makes a language natural, as the previous discussion emphasized, it is the notion of fluidity, continuity, and phrasing. Let us push this a little farther, using an example that moves us even further from verbal language.

One of the problems of verbal language, especially written language, is that it is single threaded. We can only parse one stream of words at a time. Many people would argue that anything other than that is unnatural, and would be akin to the awkward party trick of rubbing your stomach while tapping your head. The logic (so called) seems to be based on the belief that since we can only speak and read one stream of words at a time, languages based on multiple streams are unnatural (after all, if God had wanted us to communicate in multiple streams, she would have given us two mouths).

But this argument is so easy to refute that it is almost embarrassing to have to do so. Imagine, if you will, a voice activated automobile. (If this is too hard, imagine yourself as an instructor with a student driver.) Your task is to talk the car down London's Edgware Road, around Marble Arch, and along Park Lane. If anything will convince you that verbal language is unnatural in some contexts - even spoken and coupled with the ultimate speech understanding system (a human being, in the case of the student driver) - this will. The single stream of verbal instructions does not have the bandwidth to simultaneously give the requisite instructions for steering, gear shifting, braking, and accelerating.

The AI pundits will, of course, say that the solution here is to couple the natural language system with an expert system that knows how to drive. Fine. Then replace the student driver with an expert, but one who has never driven in London before and go out at rush hour. The odds are still less than 50:50 of making it around Marble Arch. If you aren't in an accident, you will be bound to cause one. QED

There are some things which verbal language is singularly unsuited for. For our purposes, we will characterize these as tasks where the single threaded nature of such language causes us to violate the principles of continuity and fluidity in phrasing, as outlined above. That is what was happening in the car driving example, what would happen if one tried to talk a pianist through a concerto, and what does happen in many common computer applications, which are - likewise - inappropriately based on single-threaded dialogues.

FROM MARBLE ARCH TO MACWRITE

We can use the Apple Macintosh, in particular the well-known word processor MacWrite, to illustrate our statements about continuity and multi-threaded dialogues. The example is based on an experiment undertaken by myself and Brad Myers (Buxton and Myers, 1986).

The study was motivated by the observation that a lot of time using window-based WYSIWYG text editors was spent switching between editing text and navigating through the document being edited. In the direct manipulation systems that we were interested in, this switching took the form of the mouse being used alternately to select text in the document and to navigate by manipulating the scroll bar and scroll arrows at the side of the screen.

This type of task switching occurs when the text that one wants to select is off screen. Hence, what is conceptually a "select text" task becomes a compound "navigate / select text" task. Since the two component tasks (navigate and select) were independent, we decided to design an alternative interface in which each task was assigned to a separate hand.

Assuming right-handed users, the right hand manipulated the mouse and performed the text selection task. The left hand performed the navigation task (using two touch sensitive strips: one that permitted smooth scrolling, the other that jumped to the same relative position in the document as the point touched on the strip).
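
To make the two navigation mappings concrete, here is a minimal sketch (with assumed numbers; it is not the code of the original experiment): the first strip behaves as a relative controller whose finger motion scrolls the document smoothly, while the second is an absolute controller whose touch point maps onto the corresponding relative position in the document.

```python
# Sketch of the two navigation mappings described above (assumed parameters,
# not the original implementation).

DOC_LENGTH_LINES = 2000     # hypothetical document length
WINDOW_LINES = 40           # hypothetical window height
STRIP_LENGTH = 1.0          # strip positions normalised to 0.0 .. 1.0

def scroll_relative(top_line, finger_delta, gain=200):
    """Smooth-scrolling strip: finger motion is a *relative* displacement.
    finger_delta is the change in normalised finger position since the
    last sample; gain converts it into lines of scroll."""
    top_line += finger_delta * gain
    return max(0, min(DOC_LENGTH_LINES - WINDOW_LINES, int(top_line)))

def jump_absolute(finger_pos):
    """Jump strip: the touch point is an *absolute* position, mapped to the
    same relative position in the document (touch the middle of the strip,
    see the middle of the document)."""
    finger_pos = max(0.0, min(STRIP_LENGTH, finger_pos))
    return int((finger_pos / STRIP_LENGTH) * (DOC_LENGTH_LINES - WINDOW_LINES))

# Dragging a finger 10% of the way along the smooth strip nudges the view,
# while touching the jump strip three quarters of the way down goes straight
# to the corresponding part of the document.
view = scroll_relative(top_line=100, finger_delta=0.10)   # -> 120
view = jump_absolute(finger_pos=0.75)                     # -> 1470
```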

We implemented the new interface in an environment that copied MacWrite. What we saw was a dramatic improvement in the performance of experts and novices. In particular, we saw that by simply changing the interface, not only did performance improve, but the performance gap between experts and novices was narrowed. Perhaps most important, it was clear that using both hands in this manner caused no problem for experts or for novices. Clearly they had the requisite motor skills before ever approaching the computer, and the mapping of the skills employed to the task was an appropriate one.

Why did this multi-handed, multi-threaded approach work? One reason is that each hand was always in "home position" for its respective task (assuming the hands were not on the keyboard; if they were, each hand could be positioned on its own device about as fast as one hand can be placed on the mouse, which is the normal case). Hence, the flow of each task was uninterrupted, preserving the continuity of each.

The improvements in efficiency can be predicted by simple time-motion analysis, such as is obtainable using the Keystroke-Level Model of Card, Moran and Newell (1980). Since each hand is in home position, no time is spent acquiring the scroll gadgets or moving back to the text portion of the screen. With both hands simultaneously available, the user can (and did) "mouse ahead" to where the text will appear while still scrolling it into view. That users spontaneously used such optimal strategies is a good argument for the naturalness of the mode of interaction.
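
For example, a back-of-the-envelope comparison using the standard Keystroke-Level Model operator estimates (roughly 1.1 s per pointing act, 0.1 s per button press or release, 1.35 s per mental preparation) shows where the savings come from; the task breakdown below is an illustrative assumption of mine, not a figure taken from the study.

```python
# Back-of-the-envelope Keystroke-Level Model comparison (Card, Moran &
# Newell, 1980). The operator times are the standard published estimates;
# the task breakdowns are illustrative assumptions, not data from the
# Buxton & Myers study.

P = 1.10   # point at a target with the mouse (seconds)
B = 0.10   # press or release the mouse button
M = 1.35   # mental preparation

# One-handed method: acquire the scroll widget, scroll, then re-acquire
# the newly visible text and select it - all with the same hand.
one_handed = (M + P + B        # point at the scroll widget and press
              + P + B          # drag / release to bring the text into view
              + M + P + B      # re-acquire the text and press
              + P + B)         # drag over the text and release

# Two-handed method: the left hand already rests on the scroll strip, so
# navigation needs no pointing act, and the right hand can "mouse ahead"
# to where the text will appear while the document is still scrolling.
# (The max() models that overlap; the classic serial KLM would simply add.)
two_handed = (M
              + max(B + B,     # left hand: touch and release the strip
                    P + B)     # right hand, overlapped: point at the text and press
              + P + B)         # right hand: drag over the text and release

print(f"one-handed: {one_handed:.2f} s, two-handed: {two_handed:.2f} s")
# -> roughly 7.5 s versus 3.8 s under these assumed breakdowns
```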

ANOTHER HANDWAVING EXAMPLE

Is this just one of those interesting idiosyncratic examples, or are there some generally applicable principles here? We clearly believe the latter. Most "modern" direct manipulation systems appear to have been designed for Napoleon, or people of his ilk, who want to keep one hand tucked away for no apparent purpose.

But everyday observation gives us numerous examples where people assign a secondary task to their non-dominant hand (or feet) in order to avoid interrupting the flow of some other task being undertaken by the dominant hand. While this is common in day-to-day tasks, the single-threaded nature of systems biases greatly against using these same techniques when conversing with a computer. Consequently, in situations where this approach is appropriate, single-threaded dialogues (and this includes verbal natural language understanding systems) are anything but natural.

But it is not just a case of time-motion efficiency at play here; it is also a matter of system complexity itself. If a single device must be time multiplexed across multiple functions, there is a penalty in complexity as well as time. This can be illustrated by another example taken from the Macintosh program MacPaint.

Like the previous example, this one involves navigation and switching between functions using the mouse. In this case, the tasks are inking (drawing with the paint brush), and navigating (moving the document under the window).

Figure 2: Grabbing the page in MacPaint.

In order to expose different parts of the "page" in MacPaint, one selects the hand icon from the menu on the left, then uses the hand to drag the page under the window (using the mouse).


In this example, imagine that you are painting the duck shown in Fig. 2. Having finished the head, you now want to paint the body. However, there is not enough of the page exposed in the window. The solution is, then, to move the appropriate part of the page under the window. We then go on painting.

Let us work through this in more detail, contrasting two different styles of interaction.

  1. Assume that our initial state is that we have been painting (having selected the "brush" icon in the menu).
  2. Assume that our goal (target end state) is to paint on a part of the page not visible in the window.
  3. Our strategy is to move the page under the window until the desired part is visible, and then resume painting.
  4. The official method, as dictated by MacPaint:
     - Move the mouse to the menu and select the hand icon.
     - Move the mouse back over the page and drag the page under the window until the desired part is visible.
     - Move the mouse back to the menu and reselect the brush icon.
     - Move the mouse back over the page and resume painting.
  5. Our multi-handed, multi-threaded method: Assume that the position of the page under the window is connected to a trackball which is manipulated by the left hand, while painting is carried out (as always) with the mouse in the right hand. The revised method is:
     - With the left hand, roll the trackball until the desired part of the page is visible in the window.
     - Resume painting with the mouse in the right hand.


It is clear that the second method involves far fewer steps, and is far more respectful of the continuity of the primary task, painting. It is likely that it is easier to learn, less prone to error, and much faster to perform (we don't make this claim outright because there can be other influences when the larger context is considered, and we have not performed the study). It also means that the non-intuitive hand icon can be eliminated from the menu, reducing complexity in yet another way. It is, within the context of the example, clearly a more natural way to perform the transaction.
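
The difference can also be seen at the level of input dispatch. The following sketch (a hypothetical event loop; it is not MacPaint's actual architecture, and all names are invented) shows why the mode switch, and with it the hand icon, disappears when navigation lives on a second device: trackball events simply move the viewport while mouse events keep painting.

```python
# Hypothetical two-device event loop for the painting example. Not
# MacPaint's actual architecture - only an illustration of why the mode
# switch (and the hand icon) disappears when navigation is assigned to a
# second device.

viewport = {"x": 0, "y": 0}      # position of the window over the page
canvas = set()                   # painted page coordinates

def handle_event(event):
    kind, data = event
    if kind == "trackball":               # left hand: secondary task
        dx, dy = data
        viewport["x"] += dx               # move the page under the window;
        viewport["y"] += dy               # painting state is untouched
    elif kind == "mouse_drag":            # right hand: primary task
        wx, wy = data                     # window coordinates
        canvas.add((wx + viewport["x"],   # paint in page coordinates
                    wy + viewport["y"]))

# The user can interleave (or overlap) the two threads freely:
for ev in [("mouse_drag", (10, 10)),      # paint the head
           ("trackball", (0, 30)),        # expose the body...
           ("mouse_drag", (12, 15))]:     # ...and keep painting
    handle_event(ev)

print(sorted(canvas))   # [(10, 10), (12, 45)]
```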

An important part of this example is the fact that the initial and final (goal) states are identical except for the position of the page in the window. In general, when this situation comes up during task analysis, bells should go off in the analyst's head prompting the question: if the start and goal state involve the performance of the same task, is the intermediate task a secondary task that can be assigned to the other hand? Can we preserve the fluidity and continuity of the primary task by so doing?

In many cases, the answer will be yes, and we are doing the user a disservice if we don't act on the observation.

IS IT THAT SIMPLE? LEARNING FROM HISTORY

To be fair, at this point it is worth addressing two specific questions:

At first glance, many of these ideas seem pretty good. Is it really that easy?

and

Are these ideas really new?

The answer to both questions is a qualified "no." Furthermore, the questions are very much related. Many of these ideas have shown potential, and some were demonstrated as long as twenty years ago. However, developing a good idea takes careful design and the right combination of ingredients.

While many of these ideas have been explored previously, the work was often ahead of the technology's ability to deliver to any broad population in a cost-effective form. There is also a human-nature issue: once an idea has been tried without follow-through, people frequently seem to adopt a "that's been tried, didn't catch on, so it must not have worked" attitude.

In addition, there are some really deep issues that remain to be solved to support a lot of this work, and certainly, the leap from a lab demo to a robust production system is non-trivial, regardless of the power or "correctness" of the concept. For example, gesture-based systems generally require a stylus and tablet whose responsiveness to subtle changes of nuance is beyond what can be delivered by most commercial tablets (and certainly any mouse). IBM, among others, has spent a lot of time trying to get a tablet/stylus combination with the right feel, amount of tip-switch travel, and linearity suitable for this type of interaction.

The last point (though not the least in importance) concerning the degree of acceptance of gestural and two-handed input - despite isolated compelling examples having been around for a long time - has to do with user testing. Demos there have been. Carefully run and documented user testing and evaluation, however, is notable only by its scarcity. The only published studies that I am aware of that attempted formal user testing of gestural interfaces are Rhyne & Wolf (1986), Wolf (1986), and Wolf & Morrel-Samuels (1987). The only published formal study of user tests of two-handed navigation-selection and scaling-positioning tasks that I am aware of is Buxton & Myers (1986).

One of the most illuminating, glaring (and depressing) points stemming from the above is the huge discrepancy (in computer terms) between the dates of the first prototypes and demonstrations (mid 60's - early 70's) and the dates of any of the published user testing (mid - late 80's). I think that this says a lot about why these techniques have not received their due attention. Change is always hard to bring about. Without testing, we neither get anywhere near an optimal understanding or implementation of the concept, nor do we get the data that would otherwise permit us to quantify the benefits. In short, we are simply handicapped in our ability to fight the inertia that is inevitable in arguing for any significant change.

SUMMARY

We have argued strongly for designers to adopt a mentality that considers non-verbal gestural modes of interaction as falling within the domain of natural languages. While verbal language likely has an important role to play in human-computer interaction, it is not going to be any type of general panacea.

What is clear is that different forms of interaction support the expression of different types of concepts. "Natural" language includes gestures. Gestures can be used to form clear fluid phrases, and multi-threaded gestures can capitalize on the capabilities of human performance to enable important concepts to be expressed in a clear, appropriate, and "natural" manner. These include concepts in which the threads are expressed simultaneously (as in driving a car, playing an instrument, or mousing ahead), or sequentially, where using a second thread enables us to avoid disrupting the continuity of some primary task by having the secondary task articulated using a different channel (as in the MacPaint example).

We believe that this notion of a natural language understanding system can bring about a significant improvement in the quality of human-computer interaction. To achieve this, however, requires a change in attitude on the part of researchers and designers. Hopefully the arguments made above will help bring such a change about.

ACKNOWLEDGEMENTS

The research reported in this paper has been undertaken at the University of Toronto, with the support of the Natural Sciences and Engineering Research Council of Canada, and at Rank Xerox's Cambridge EuroPARC facility in England. This support is gratefully acknowledged. We would like to acknowledge the helpful contributions made by Thomas Green and the comments of Rick Beach, Elizabeth Churchill, Michael Brook and Larry Tesler.

REFERENCES AND BIBLIOGRAPHY

Baecker, R. & Buxton, W. (1987). Readings in Human-Computer Interaction: A Multidisciplinary Approach, Los Altos, CA: Morgan Kaufmann.

Buxton, W. (1986a). Chunking and Phrasing and the Design of Human-Computer Dialogues, Proceedings of the IFIP World Computer Congress, Dublin, Ireland, September 1 - 5, 1986, 475 - 480.

Buxton, W. (1986b). There's More to Interaction than Meets the Eye: Some Issues in Manual Input, in D. Norman & S. Draper (Eds.), User Centred Systems Design: New Perspectives on Human-Computer Interaction, Hillsdale, NJ: Lawrence Erlbaum Associates, 319 - 337.

Buxton, W. (1983). Lexical and Pragmatic Considerations of Input Structures, Computer Graphics, 17(1), 31 - 37.

Buxton, W. (1982). An Informal Study of Selection-Positioning Tasks, Proceedings of Graphics Interface '82, 323 - 328.

Buxton, W. & Myers, B. (1986). A Study in Two-Handed Input, Proceedings of CHI'86, 321 - 326.

Card, S., Moran, T. & Newell, A. (1980). The Keystroke Level Model for User Performance Time with Interactive Systems, Communications of the ACM, 23(7), 396 - 410.

Rhyne, J.R. & Wolf, C.G. (1986). Gestural interfaces for information processing applications, Computer Science Technical Report RC 12179, IBM T.J. Watson Research Center, Distribution Services 73-F11, P.O. Box 218, Yorktown Heights, N.Y.

Wolf, C.G. (1986). Can People Use Gesture Commands? ACM SIGCHI Bulletin, 18(2), 73 - 74.

Wolf, C.G. & Morrel-Samuels, P. (1987). The use of hand-drawn gestures for text-editing, International Journal of Man-Machine Studies, 27, 91 - 102.

