10 Conclusions

This dissertation has demonstrated that a computer's awareness of people can be improved through audiovisual sensors by effectively exploiting the sensing and processing resources already available. Rather than expending large amounts of computation to extract detailed features from a single sensor type, the approach presented capitalizes on the relationships between two different sensor types. The detection of human activity was demonstrated by leveraging the spatial properties of color and sound, with only modest processing required. The sensor processing techniques presented in this dissertation scale linearly with the number of data samples captured by the system (e.g., only one or two passes through each image are required), whereas pattern-search-based processing methods incur higher-order increases in computation.
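
To make the complexity claim concrete, the following minimal sketch (in Python with NumPy; the per-channel threshold and the names color_centroid, lo, and hi are illustrative stand-ins, not the color model actually used in this work) shows the kind of single-pass computation involved: one scan over the image suffices to locate the centroid of color-matching pixels, so the cost grows linearly with the number of pixels and no iterative pattern search is needed.

    # A sketch of a single-pass, linear-time color scan.  The simple
    # per-channel threshold stands in for the actual color model used
    # in this work; 'lo' and 'hi' are illustrative parameters.
    import numpy as np

    def color_centroid(image, lo, hi):
        """One pass over an H x W x 3 image: centroid of the pixels
        whose channels all lie in [lo, hi].  O(H*W) time, no search."""
        mask = np.all((image >= lo) & (image <= hi), axis=-1)
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None                    # no matching pixels found
        return float(xs.mean()), float(ys.mean())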

While the simplicity and limited levels of representation used are beneficial for speed, they also increase the likelihood of occasional misinterpretation of sensor data. This was compensated for by reactive control methods that depend on frequent measurements rather than high-level information and therefore recover quickly from mistakes. The fuzzy behavior-based control system rapidly maps sensor information to control actions, and competent real-time camera activity emerges despite the absence of high-level planning.
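
As an illustration of this kind of mapping, the sketch below implements a fuzzy controller of the flavor described; the membership shapes, rule outputs, and names (tri, pan_rate) are assumptions for illustration, not the actual rule base of this work. A detected subject's horizontal offset is fuzzified into "left", "centered", and "right" memberships, and a pan rate is produced by a weighted average of the rule outputs.

    # A minimal fuzzy-control sketch: map a target's horizontal offset
    # (normalized to [-1, 1]) to a camera pan rate in [-1, 1].
    def tri(x, a, b, c):
        """Triangular membership function rising from a, peaking at b,
        falling to zero at c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    def pan_rate(offset):
        # Fuzzify: degree to which the target is left, centered, right.
        left     = tri(offset, -1.5, -1.0, 0.0)
        centered = tri(offset, -0.5,  0.0, 0.5)
        right    = tri(offset,  0.0,  1.0, 1.5)
        # Illustrative rules: pan toward the target, hold if centered.
        rules = [(left, -1.0), (centered, 0.0), (right, 1.0)]
        total = sum(weight for weight, _ in rules)
        if total == 0.0:
            return 0.0
        # Defuzzify with a weighted average of the rule outputs.
        return sum(weight * out for weight, out in rules) / total

Because pan_rate is recomputed from fresh measurements on every frame, a momentary misinterpretation of the sensor data perturbs the camera only briefly, which is the recovery property the reactive approach relies on.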


10.1 Contributions

This research has extended and combined the work of others in computer vision and sound localization, and has applied behavioral control and active perception concepts to potentially mainstream computer applications. The most significant contributions of the work are broken down as follows:

10.2 Limitations

The color and sound processing approaches described in this dissertation are not free of caveats; care must be taken in how these methods are applied, and their limitations must be understood when building similar systems. The most serious limitations discovered so far for these techniques are listed below, along with recommendations on how to address them.

10.3 Obstacles to Performance Testing

Evaluating the effectiveness of intelligent sensor-based systems is itself a major area of research in artificial intelligence, as quantifying system performance is not a simple task. First, real-world sensor information (especially audio and video) is so rich that it is often difficult to represent with a finite set of performance benchmarks. For face detection, for example, it may be possible to construct a reasonably representative sample of the set of all possible faces that may need to be detected (covering races, expressions, hair styles, and so on), but how would one model the ensemble of all possible backgrounds and lighting conditions? Assumptions built into a given model may not hold in all situations. Second, different systems are often designed to accomplish different goals and make different assumptions about their operating conditions, making direct comparison impossible. Finally, systems that actively control what they sense (such as mobile robots and active cameras) cannot be presented with exactly the same sensory stimuli on repeated trials, since the stimuli are affected directly by the output behavior; this makes it nearly impossible to separate the performance of the sensory perception system from that of the control system. In mobile robotics, one method that has been used to compare systems directly is to hold competitions designed around specific tasks [81], where the overall performance of each system is judged according to specific metrics. It is hoped that future work on automatic camera control behaviors for human-computer interaction will permit comparisons with the methods presented in this dissertation.


10.4 Recommendations for Future Work

At the time of this writing, the personal computer is rapidly evolving from a data manipulation device into a mainstream communications appliance. With the worldwide increase in networked users and applications, expectations are high that a convergence of telephone, television, multimedia, and data communications technologies will soon make applications like videoconferencing as common as word processing. Many, if not most, machines in homes and businesses will be capable of performing the sensing tasks described in this work. At the same time, users will expect friendlier, easier-to-use machines: they will not want to type their names to log in, but will expect to be recognized on sight, and even to dictate commands and documents orally. The demand for intelligent sensing of people is increasing while the technology steadily becomes cheaper and more widely available.

Personal computers sold today are powerful enough to take speech dictation, videoconference over a network, recognize faces, or track people's positions and activities, but only one of these at a time. Tomorrow, they will have the computational speed to do all of these things simultaneously, and the challenge for sensor-based machine intelligence will then be to exploit these capabilities together. Today, multimedia data is piped from one peripheral to another, often with one application holding exclusive rights to the data and locking out all other processes that might benefit from it. The management of shared real-time multimedia data for multiple processing and decision-making agents is an important area of research: multimedia hardware and operating systems must evolve to schedule real-time tasks properly and to distribute them appropriately among processing resources. Another obstacle to the synergistic use of multiple sensor types is the diversity of disciplines required (computer vision, speech recognition, videoconferencing, and so on). A single company may not have the resources to develop all the parts of an intelligent sensing system, so the question arises of how hardware and software components created by different companies can share data effectively. The model of data integration and interdependence between agents will have a significant impact on the capabilities of the system as a whole.
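
One plausible shape for such non-exclusive sharing, sketched below under assumed names (FrameBroker is hypothetical; this is not an API from this work or from any existing operating system), is a small publish/subscribe broker: a capture process publishes each video frame once, and any number of consumer agents receive it without claiming exclusive access to the device.

    # A sketch of non-exclusive sharing of a real-time media stream:
    # one producer, many consumer agents.  Names are illustrative.
    import queue
    import threading

    class FrameBroker:
        """Fan out each captured frame to every subscribed agent."""
        def __init__(self):
            self._subscribers = []
            self._lock = threading.Lock()

        def subscribe(self, maxsize=2):
            # A small bounded queue: a slow consumer drops frames
            # rather than stalling the capture loop.
            q = queue.Queue(maxsize=maxsize)
            with self._lock:
                self._subscribers.append(q)
            return q

        def publish(self, frame):
            with self._lock:
                for q in self._subscribers:
                    try:
                        q.put_nowait(frame)
                    except queue.Full:
                        pass  # consumer is lagging; skip this frame

In use, the capture loop would call publish(frame) once per frame, while each agent (face detector, tracker, videoconference encoder) reads from its own subscribe() queue at its own rate; the bounded queues keep one slow agent from degrading the real-time behavior of the rest.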

Although component technologies for intelligent sensing have been demonstrated, the realization of a cognizant computer like "Hal" is nowhere in sight. Before attempting such leaps, practical solutions to the integration of two or more sensing modalities at a time should be investigated, to discover where the bottlenecks, incompatibilities, and trade-offs lie. For example, speech recognition and sound localization can be combined in two ways: first, acoustic beamforming can reinforce the speech signal for improved recognition of words; second, recognized words can be associated with a direction of origin so that the speaker's location and identity may be ascertained. To implement the first scenario, the digital stream of sound samples produced by the beamforming process must be fed into the input of the speech recognition process. While this is simple to engineer, the hardware and software abstractions currently in practice do not provide the hooks required unless all of the components are produced by the same organization, or two organizations agree on a common interface. In the second scenario, the direction of the sound may change over time, so each recognized word must be accompanied by an accurate time stamp of when it was actually uttered. If the recognizer delivers results almost immediately, time stamps may not be required, but for slower or deferred recognition they become essential. The speech recognition component must therefore cooperate with the real-time needs of the rest of the system. None of these problems is difficult to solve from a technical standpoint, but from a practical engineering standpoint they are important to consider early in the process of system integration, especially when re-using component technologies provided by other parties.
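
The second scenario reduces to a timestamp-matching problem. The sketch below, with hypothetical names (DirectionLog and its methods are not drawn from any actual recognizer or localizer interface), keeps a short history of direction estimates and pairs each recognized word with the estimate nearest to its utterance time.

    # A sketch of associating recognized words with sound direction
    # via time stamps.  DirectionLog and its entries are illustrative.
    import bisect

    class DirectionLog:
        """Time-ordered history of (time, bearing) estimates from the
        sound localizer, kept short enough for real-time use."""
        def __init__(self, max_entries=200):
            self._times = []
            self._bearings = []
            self._max = max_entries

        def record(self, t, bearing):
            self._times.append(t)
            self._bearings.append(bearing)
            if len(self._times) > self._max:    # discard the oldest
                del self._times[0], self._bearings[0]

        def bearing_at(self, t):
            """Bearing estimate closest in time to t, or None."""
            if not self._times:
                return None
            i = bisect.bisect_left(self._times, t)
            nearby = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
            j = min(nearby, key=lambda k: abs(self._times[k] - t))
            return self._bearings[j]

When the recognizer reports a (word, utterance_time) pair, the word's direction of origin is simply bearing_at(utterance_time); deferred recognition results still resolve correctly as long as the log's history covers them.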

The combination of face detection, face recognition, sound localization, speech recognition, speech synthesis, and natural language understanding promises to revolutionize the way humans and computers interact. The only way to achieve this vision is to experiment with new ideas about how these activities inter-relate, and how drastically different disciplines may merge to serve practical applications. Today, the platform required for such research, a multimedia computer, is within reach of almost any student, which makes the prospects for innovation very good indeed.