We are entering the next transition in human-to-device interaction. Products from Amazon, Google, Apple, and their licensees already let you interact with networks by speaking (or, from my experience of people using these devices, shouting) questions and commands. One issue for the user, though, is that the device is reactive: it always needs to be ‘listening’, and what you say is filtered back at the companies’ servers, not in your home. This is great news for targeted marketing and the device companies’ bottom line, maybe not so much for your privacy. A more natural interaction, similar to one with another person, is to look at the device and then start speaking. This presents a non-trivial issue for the interaction device, as a vision-based data set can be several thousand times denser than audio. This density makes processing at the sensor or device level more difficult and expensive, because higher-performance processing chips are needed to deal with a conventional imager’s output. Alternatively, you could look to an architecture that pushes the video stream to the cloud; however, if you had several assistants in your home, each streaming video to detect whether you are looking at it, you would quickly overwhelm your home wireless network.
By looking at the evolution of other technologies, we can not only understand how these devices will morph but also leapfrog an evolutionary step to gain position on the competition. Numerous tech-based consumer products have evolved in a similar manner, shifting across five evolutionary stages, and this can provide a baseline for projecting technical growth within the image sensing space.
The first stage of evolution is ‘Centralized Processing’. Think early mainframe computers or early long-distance calling. Early long-distance calling required calling a centralized operator (a person), who would set up the call and call you back when the connection was made. It was expensive and not scalable, and the device in your hand had no processing. Early mainframes were large, expensive, multi-room-sized machines. They were deployed at the corporate level, were not very scalable, did not support cross-communication, and your interaction was through a dumb terminal. Many personal assistant devices are at a similar stage, as are most autonomous driving and augmented and virtual reality devices.
The next stage is ‘Spread Centralization’. For phones, this is the analog-landline-in-every-home stage: the long-distance call was direct but expensive, and the unit in your hand still had no real processing. From a computer perspective, this is the shift from IBM’s big iron to DEC’s VAX machines, moving from central corporate deployments to department-level deployments, with application-specific platforms and a dumb terminal. Some autonomous driving implementations, and devices like Microsoft’s HoloLens with its separate sensor-processing Holo-Chip, fall into this category.
The third growth stage is ‘Isolated Edge’ units. The PC and analog cell phones are in this stage. Here we make a complete shift from doing all the processing centrally to doing all the processing locally. The terminal goes from dumb to capable of specific functions. It is also the first stage where you move from sharing a single device to each person having their own. All decisions and processing happen right at the device, and there is little to no sharing of the results unless you have a good supply of floppy disks. From a sensor/interaction perspective, traffic cameras are a good example of devices in this stage. A car is detected and the light changes state. The camera doesn’t know, or care, what the traffic flow is on the intersecting street; it just feeds the car-present alert into a preset algorithm to change the lights.
The fourth stage is ‘Strong Edge with Limited Central Processing’. Early networked PCs and BlackBerrys are in this stage. Here the previous stage gains the ability to share information. The communication platform does little beyond moving data between the devices, such as sending emails and texts, and most of its processing is related to traffic flow. From a sensing perspective, battery-operated smart home security cameras and computer mice fit within this stage. The device senses predetermined occurrences, processes the information locally for key details, and then sends information or events across a network for viewing or, in the case of the mouse, to the PC to move the cursor on the screen.
The fifth stage, ‘Smart Edge with Shared Processing,’ combines the first or second stage with the third stage. The smartphone is a perfect example: here the PC and the phone have been combined. You can use many features of the device quite effectively even without a network connection (pictures, gaming, desktop apps, and so on). When connected to a network, you can use the device like a dumb terminal (phone calls), for heavy cloud processing (personal assistant), or for both local and remote cloud processing (Pokémon Go-like apps). For broadly adopted sensing applications, though, we are just starting to arrive here.
For head-mounted devices, an imager provides raw data to a processor for scene understanding, gesture capture, and gaze orientation. For the latter two applications, the output should be more like that of the computer mouse, where a result is sent off the unit for the central processor to react to. Instead, as mentioned earlier, Microsoft had to make a separate chip just to process the raw data. Your personal assistant embedded in the refrigerator door should understand automatically that you want to interact as you stop in front of the door: you should be identified, the viewing screen in the door should turn on, and the unit should be ready to take gesture commands. Your smart home security camera should be able to recognize the home’s residents and not record when they are in the room, but it should also be able to identify potentially at-risk behaviors and send data to a central processor for more complex processing when you are in the room and in a dangerous situation.
There are a few reasons why these chips don’t exist. The imaging industry has been focused on capturing pretty pictures: images and video that a person is meant to see. This means imager and camera designers have had to take human perception, and the limitations of the viewing media (screen or print), into account, and that is not always the best approach for a sensing application. For pretty pictures, you have the capture device (the imager) and the human-perception processing device (a logic chip, or a logic core in a larger processor). The imaging industry has evolved so that these two chips most often come from two separate companies, where the image sensor company provides raw data for the back-end human-perception processing company to adapt into the best representation of the scene. If you use a one-megapixel, 100-frame-per-second imager for gesture detection, the per-second bandwidth difference between the raw data and an identified gesture is more than 1.5 million to 1. Because of the way the industry has evolved, most gesture solutions are focused on pushing the best image quality from the sensor company so that the back-end processing company can make sense of the scene. Until result-out sensors of this type are developed, don’t expect to be able to wear your augmented or virtual reality headset for a full day.
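To make the size of that gap concrete, here is a rough back-of-the-envelope calculation. Only the one-megapixel resolution and the 100 frames per second come from the example above; the raw bit depth, the size of a gesture message, and the gesture report rate are illustrative assumptions, not figures from any particular sensor.

```python
# Back-of-the-envelope comparison of raw imager output with a
# "result-out" gesture stream. Values marked ASSUMPTION are illustrative.

PIXELS_PER_FRAME = 1_000_000   # one-megapixel imager (from the text)
FRAMES_PER_SECOND = 100        # from the text
BITS_PER_PIXEL = 10            # ASSUMPTION: raw bit depth per pixel

GESTURE_MESSAGE_BITS = 64      # ASSUMPTION: gesture ID, position, timestamp
GESTURES_PER_SECOND = 10       # ASSUMPTION: how often a result is reported


def raw_bits_per_second() -> int:
    """Bandwidth needed to ship every raw pixel off the sensor."""
    return PIXELS_PER_FRAME * FRAMES_PER_SECOND * BITS_PER_PIXEL


def gesture_bits_per_second() -> int:
    """Bandwidth needed if the sensor only reports identified gestures."""
    return GESTURE_MESSAGE_BITS * GESTURES_PER_SECOND


def bandwidth_ratio() -> float:
    """How many times larger the raw stream is than the gesture stream."""
    return raw_bits_per_second() / gesture_bits_per_second()


if __name__ == "__main__":
    print(f"raw stream:     {raw_bits_per_second() / 1e9:.1f} Gbit/s")
    print(f"gesture stream: {gesture_bits_per_second()} bit/s")
    print(f"ratio:          about {bandwidth_ratio():,.0f} to 1")
```

Under these assumptions the raw stream is 1 Gbit/s against 640 bit/s of gesture results, a ratio of roughly 1.56 million to 1. Other bit depths or report rates will move the number, but the gap stays very large for any plausible choice.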
Processing raw image data takes a lot of power. Just finding the objects of interest for the application is non-trivial, never mind trying to make sense of what the object is doing. Power in a chip typically translates to heat, and heat creates significant noise in an image sensor, so merging a conventional sensor and processor is problematic. To reduce the power used, you need to reduce the complexity or amount of data to be processed. The first step is to take a long look at whether the pretty-picture sensor architecture is the right one for processing. An imager with a non-linear response to light might be ideal for a sensing application, but for a pretty-picture application it tends not to look good, so effort on this has lagged behind work on improving perceived image quality. The readout from a sensor is frame-based because the display is frame-based, but for some applications all I care about is what is moving relative to the camera and when it is moving. Filtering the moving from the non-moving information in the imager can reduce the data bandwidth by over 1,000 times, and drive even bigger reductions in system bandwidth and power. Different monitoring modes, where the data and frame rate are reduced, as well as windowing the output to reduce the frame size, can all be implemented, but this all needs to be done intelligently: the design needs to approach the application as if it were in technology evolution stage 5. The chip needs some level of intelligence to minimize the data set, and the downstream processing needs to support and update for unknown conditions, as well as adapt the data being collected based on the functions that need to be performed for that particular application at that particular time.
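The motion filtering described above can be illustrated with a small software sketch based on simple frame differencing. This is only one possible way to separate moving from non-moving information, and a real sensor would do it in hardware rather than in Python. The function names, the threshold value, and the toy scene are all hypothetical and chosen for illustration.

```python
# Minimal sketch of motion filtering by frame differencing.
# Frames are lists of rows of integer pixel values.

from typing import List, Tuple

Frame = List[List[int]]
Event = Tuple[int, int, int]  # (row, column, new pixel value)

THRESHOLD = 8  # ASSUMPTION: smallest change treated as motion rather than noise


def changed_pixels(previous: Frame, current: Frame,
                   threshold: int = THRESHOLD) -> List[Event]:
    """Return only the pixels whose value changed by more than `threshold`."""
    events: List[Event] = []
    for r, (prev_row, cur_row) in enumerate(zip(previous, current)):
        for c, (old, new) in enumerate(zip(prev_row, cur_row)):
            if abs(new - old) > threshold:
                events.append((r, c, new))
    return events


def reduction_factor(frame: Frame, events: List[Event]) -> float:
    """Ratio of pixels in the full frame to pixels actually reported."""
    total = sum(len(row) for row in frame)
    return total / len(events) if events else float("inf")


if __name__ == "__main__":
    # Toy scene: a static 100x100 background with a 3x3 object appearing.
    before = [[50] * 100 for _ in range(100)]
    after = [row[:] for row in before]
    for r in range(10, 13):
        for c in range(20, 23):
            after[r][c] = 200

    events = changed_pixels(before, after)
    print(len(events), "changed pixels out of", 100 * 100)
    print(f"pixel reduction: about {reduction_factor(after, events):.0f}x")
```

The sketch counts pixels only. Each reported pixel also carries its coordinates, so the saving in bits is somewhat smaller than the saving in pixels, and the real figure depends heavily on how much of the scene is moving.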
Many sensing applications have enough volume to warrant the creation of a dedicated chip, and we are seeing the leading edge of the evolutionary path of sensing technology; however, these chips are tending to follow each of the stages laid out previously. The opportunity is to recognize the pattern and leap far enough ahead of the evolution to provide very significant benefits, without leaping so far that you outstrip the ecosystem’s ability to absorb the new approach.