Modalities
Inputs
Have you ever tried typing on a TV? The intended way is to take the TV remote and slowly, painfully press the arrow keys to move around an on-screen keyboard, selecting each letter one at a time. And for whatever reason, the keyboard is usually alphabetical rather than QWERTY.
Doing the exact same search on a physical keyboard is at least an order of magnitude faster. What is this insanity! Fundamentally, the computers inside the laptop and the TV are not that different; the cardinal sin here is the interface, and in particular the input interface.
You can judge how good an interface is using only two axes:
- Bitrate
- Accuracy
Maybe I'm being a little uncharitable towards BCIs wrt. bitrate, but this is my current assessment of them. Their aspiration, if Neuralink-style, is to sit far off the right end of the bitrate axis. I respect the ambition.
You can compensate for a lot of the innate deficiencies of an input interface using software. Good auto-correct and probabilistic keyboards are what make smartphone keyboards practical to use.
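The core idea behind that kind of software compensation can be sketched in a few lines. This is a toy noisy-channel corrector with a hypothetical word list I made up, not any real keyboard's implementation: prefer candidate words that are both close to the keystrokes (few edits) and common (high frequency).

```python
# Tiny hypothetical frequency list; a real keyboard ships a full language model.
WORD_FREQ = {"hello": 120, "help": 80, "held": 40, "hell": 15, "halo": 5}

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def autocorrect(typed: str) -> str:
    # Noisy-channel intuition: rank by edit distance first, then frequency.
    return min(WORD_FREQ, key=lambda w: (edit_distance(typed, w), -WORD_FREQ[w]))

print(autocorrect("helo"))  # prints "hello": one edit away, highest frequency
```

Several candidates here ("hello", "help", "held", "halo") are all one edit from the typo; the frequency prior is what breaks the tie, which is exactly why sloppy smartphone typing still mostly comes out right.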
Technology is only really useful if it's bidirectional. So you need some way to receive output from the device you are using.
Outputs
Of the five human senses, only three are used for computer output, and the distribution is extremely unequal. The same axes apply.
Touch
Mainly implemented through haptics. Because the bitrate is so low, it's really just used as a gimmick for the novelty factor.
Hearing
Computers have always been able to give low bandwidth, high accuracy alerts (think notification sounds), but recently, speech synthesis has blown the door open by significantly increasing the bitrate.
Sight
No contest: vision is the dominant human sense. And thus the vast majority of people interact with technology through a screen.
Voice In, Text Out
This is the best way to bitrate-max your computer use, because you are using the fastest input modality and output modality available to humans today (until we get brain chips).
I think people are still slow to realize how amazing an input voice has become for computers thanks to machine learning. Modern ASR models are so good and so fast that they beat even touch typing in raw speed. The tradeoff has traditionally been accuracy, but ASR accuracy has improved incredibly since Whisper. At the same time, we now have extremely capable LLMs that can parse these massive, semi-accurate streams of unstructured text and do something useful with them.
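A back-of-envelope sketch of the speed gap, using my own rough assumptions (~1 bit per character of English entropy, ~6 characters per word including the space, and typical paces for each method):

```python
# Rough information rates for different input methods.
# Assumptions: ~1 bit/char of English entropy, ~6 chars per word (incl. space).
BITS_PER_WORD = 6.0

def bitrate_bps(words_per_minute: float) -> float:
    """Approximate information rate in bits per second."""
    return words_per_minute * BITS_PER_WORD / 60.0

# Assumed paces: arrow-keying on a TV keyboard, touch typing, dictation.
for method, wpm in [("TV remote", 5), ("touch typing", 80), ("dictation", 150)]:
    print(f"{method:>12}: {bitrate_bps(wpm):4.1f} bits/s")
```

With these (debatable) numbers, dictation comes out roughly 2x touch typing and about 30x the TV remote, which matches the order-of-magnitude complaint above.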
When I dictate to LLMs, I don't even think about correcting the dictation result; I'm just trying to voice-dump as much context from my brain as I can. In-context learning lets them recognize when there's a transcription error while still grasping the overall gist of what I meant to say.
Right now, the biggest bottleneck to ubiquitous voice input is usage in public spaces; there is a social constraint on voice. The solutions are whispering and silent speech. Voice is also somewhat imprecise, so paired with it there will always be a few low-bitrate, ultra-high-precision input methods, keyboard shortcuts being one example.
While voice is a great input modality, I think you should be very careful about using it as a primary output modality, because vision is just so good, and audio forces an unskimmable time axis. To quote Karpathy: thank you to the giant GPU in your brain built for processing images very fast.
Most people can read a book much faster than they can listen to someone read it to them. There are also types of output naturally suited to GUIs or text, because the information can be taken in at a glance (parallelism): think dashboards, maps, code. Imagine even a very skilled human trying to explain those same interfaces using only voice.
In fact, not only does audio lose on bitrate, it's also less accurate! People often have to ask to hear something again, or they mishear words. No such problems with vision: just read it again. This is the price of temporality. So I am quite skeptical of voice taking over everything, and I think GUIs will always have their place because humans are optimized for vision.
In Defence of Voice Out
If audio always has both lower bitrate and accuracy than vision, is there ever a reason to use it as an output modality? Yes. Because while vision completely sweeps voice, it critically demands undivided attention.
This is a feature, not a bug, of foveated vision. There are simply too many photons bouncing around out there to make sense of them all at once. It means that if you are using a screen, you can't really be doing much else, given how important vision is. For instance, you cannot drive, it's socially frowned upon to stare at a screen while talking with others, and you can't enjoy scenery (which you have to look at). But if you are instead using hearing, you have suddenly freed up sight to do whatever important thing you need it to do.
The other advantage of bidirectional voice is that it is the most natural form of human interaction. It is the modality of choice for human-human communication. We've all learned it since we were kids. From a pure productivity standpoint you might want voice in, text out but for other use cases you might actually prefer voice in, voice out.
Computer interaction is now human interaction
Anyway, I think it is an exciting time for HCI, because computers are becoming more and more humanlike, and so designing human-computer interaction becomes closer to modelling human-human interaction. And what a world it shall be when interacting with technology feels as effortless and natural as interacting with another human.