The technology concerned with computational understanding and use of the information present in visual images. In part, computer vision is analogous to the transformation of visual sensation into visual perception in biological vision. For this reason the motivation, objectives, formulation, and methodology of computer vision frequently intersect with knowledge about their counterparts in biological vision. However, the goal of computer vision is primarily to enable engineering systems to model and manipulate the environment by using visual sensing. See also: Vision
Sensing and image formation
Computer vision begins with the acquisition of images. A camera produces a grid of samples of the light received from different directions in the scene. The position within the grid where a scene point is imaged is determined by the perspective transformation. The amount of light recorded by the sensor from a certain scene point depends upon the type of lighting, the reflection characteristics and orientation of the surface being imaged, and the location and spectral sensitivity of the sensor.
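As a rough illustration of the perspective transformation, the following sketch uses the common pinhole-camera model to map a scene point, given in camera coordinates, to grid coordinates; the focal length and image center are assumed, illustrative values.

```python
import numpy as np

def project_point(point_3d, focal_length=500.0, center=(320.0, 240.0)):
    """Pinhole perspective projection of a scene point (camera coordinates,
    Z axis pointing into the scene) onto image grid coordinates (u, v)."""
    X, Y, Z = point_3d
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = focal_length * X / Z + center[0]  # horizontal pixel coordinate
    v = focal_length * Y / Z + center[1]  # vertical pixel coordinate
    return u, v

# A point 2 m in front of the camera and 0.5 m to the right:
print(project_point((0.5, 0.0, 2.0)))  # -> (445.0, 240.0)
```

Note that depth Z appears only in the denominators, which is why a single image cannot recover it: all scene points along a ray project to the same grid position.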
Segmentation
The objective of this early stage of processing is to compress the huge amount of image detail by identifying and representing those aspects of image structure that are salient for later stages of interpretation. Typically, this is accomplished by detecting homogeneous regions in the image or, equivalently, their edges. It must be done across all degrees of regional homogeneity and all region sizes, yielding a multiscale segmentation. Further, small regions may group together to form a larger structure seen as a homogeneous texture, which may define another basis for characterizing the homogeneity of a segment. The result of segmentation is a partitioning of the image such that each part is homogeneous in some salient property relative to its surroundings.
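A minimal sketch of region and edge detection at a single scale follows, using finite differences to mark pixels where brightness changes sharply; the threshold is an illustrative assumption, and a multiscale segmentation would repeat this after smoothing the image at several scales.

```python
import numpy as np

def edge_map(image, threshold=0.2):
    """Mark pixels where the brightness gradient is large; the remaining
    pixels form (approximately) homogeneous regions. Finite differences
    stand in here for a proper edge operator."""
    gy, gx = np.gradient(image.astype(float))  # gradients along rows, columns
    magnitude = np.hypot(gx, gy)               # gradient magnitude per pixel
    return magnitude > threshold

# Toy image: a bright square on a dark background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
print(edge_map(img).astype(int))  # 1s trace the square's boundary
```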
Inferring three-dimensional structure
One central objective of image interpretation is to infer the three-dimensional (3D) structure of the scene from images that are only two-dimensional (2D). The missing third dimension necessitates that assumptions be made about the scene so that the image information can be extrapolated into a three-dimensional description. Interpretation exploits a variety of three-dimensional cues present in the image. These cues may occur in a single image taken by one camera at a single time instant, a sequence of images acquired by one camera over a time interval, a set of images taken by multiple cameras from different positions or angles at a single time instant, or a time sequence of images taken by multiple cameras. In each case, the main task is to devise algorithms that can estimate three-dimensional structural parameters from image-based measurements. Some examples of such cues will now be discussed.
The boundaries of image regions, or the curves composing a line-drawing image, reveal scene characteristics such as the extent and shape of an object and the occlusion between objects.
The nature of gradual spatial variation of image values within a region is related to surface shape characteristics, such as whether the surface is convex or concave, planar or curved.
Variation in the coarseness of image texture is indicative of how the corresponding texture surface is oriented in the scene. For example, the distant flowers in an image of a large field of flowers are packed more densely together than the closer ones.
If two cameras are placed at different locations and orientations (like the human eyes), then the coordinates at which a given scene point projects in the two images differ, and this disparity is trigonometrically related to the three-dimensional position of the scene point. The variation of disparity across the image can be used to estimate the three-dimensional configuration of surfaces in the scene. Three or more cameras can also be used.
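For the simple special case of two identical cameras with parallel optical axes, the trigonometry reduces to Z = fB/d, where f is the focal length, B the baseline between the cameras, and d the measured disparity. The sketch below illustrates this; the focal length and baseline are assumed values for a hypothetical rig.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth of a scene point from stereo disparity, for two parallel
    cameras: Z = f * B / d (f and d in pixels, baseline B in meters)."""
    if disparity_px <= 0:
        raise ValueError("zero disparity corresponds to a point at infinity")
    return focal_length_px * baseline_m / disparity_px

# Assumed rig: 700-pixel focal length, 10-cm baseline.
print(depth_from_disparity(20.0, 700.0, 0.10))  # -> 3.5 (meters)
print(depth_from_disparity(70.0, 700.0, 0.10))  # -> 1.0 (larger disparity, nearer)
```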
If there is relative motion between the scene and the camera, the image data consists of a dynamic image sequence. Each of the above-mentioned cues then provides added information, since observations about the temporal behavior of the cue are available in addition to its spatial properties. For example, the apparent motion of image points over time, called optical flow and computed from the temporal variation in image values, can be used to estimate the relative motion, surface shape, and layout of the objects. A moving object gives rise to a decreasing amount of flow as its distance from the camera increases. Similarly, moving image curves, such as those corresponding to orientation discontinuities on an object surface or the silhouette of a rotating object, yield a wealth of information about the shape of an object as well as about its motion. Sequences may be taken by multiple cameras simultaneously, making surface estimation easier because a larger amount of data is available, that is, the spatial disparity as well as its variation over time.
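The relation between flow and distance can be made concrete for the simple case of a camera translating parallel to the image plane, where the flow of a point is f·T/Z (focal length f, camera speed T, depth Z), so depth follows from measured flow; the numerical values below are illustrative assumptions.

```python
def depth_from_flow(flow_px_per_s, focal_length_px, camera_speed_m_per_s):
    """Depth from optical flow for a camera translating parallel to the
    image plane: flow = f * T / Z, hence Z = f * T / flow. Distant points
    move slowly in the image; nearby points move quickly."""
    if flow_px_per_s <= 0:
        raise ValueError("zero flow corresponds to a point at infinity")
    return focal_length_px * camera_speed_m_per_s / flow_px_per_s

# Assumed values: 700-pixel focal length, camera moving at 1 m/s.
print(depth_from_flow(50.0, 700.0, 1.0))   # -> 14.0 (meters)
print(depth_from_flow(350.0, 700.0, 1.0))  # -> 2.0 (faster flow, nearer)
```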
Active image acquisition
If the cameras maintain a single, fixed geometry or optical configuration, the amount of additional information extracted from successive images depends on the changes occurring in the scene, due, for example, to relative motion. It is possible to maximize the increment in scene information obtained from successive images by dynamically reconfiguring the cameras so that the cues in the new images are most informative. Thus, a partial interpretation of the scene is used to dynamically control the sensing parameters so that each stage of image acquisition adds the most to the interpretation. For example, if a scene is too bright, the aperture size may be reduced; if the object is not in sharp focus, the focus setting may be changed; and if the object is not well placed with respect to both cameras in stereo analysis, the cameras may be verged to fixate on the object, all of these actions happening simultaneously (see illustration). As a certain part of the scene is satisfactorily interpreted, the results of the interpretation may be used to determine where to point the cameras next, and even to suggest how to analyze these new parts so as to obtain the best final interpretation in the minimum time.
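One step of such a control loop might look like the following sketch, in which a partial interpretation of the current image drives the sensing parameters for the next image; all field names, thresholds, and gains here are hypothetical illustrations rather than any standard interface.

```python
def adjust_sensing(observation, settings):
    """Use a partial interpretation of the current image (observation)
    to reconfigure the sensor (settings) before the next image is taken.
    All names and constants are illustrative assumptions."""
    if observation["mean_brightness"] > 0.9:     # scene too bright:
        settings["aperture"] *= 0.5              # reduce the aperture
    if observation["sharpness"] < 0.3:           # object out of focus:
        settings["focus"] += 0.1                 # step the focus setting
    if abs(observation["stereo_offset"]) > 10:   # object poorly placed in one view:
        settings["vergence"] -= 0.001 * observation["stereo_offset"]  # verge to fixate
    return settings

print(adjust_sensing(
    {"mean_brightness": 0.95, "sharpness": 0.2, "stereo_offset": 40},
    {"aperture": 8.0, "focus": 1.0, "vergence": 0.0}))
```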
Representation
The two-dimensional structure of an image or the three-dimensional structure of a scene must be represented so that the structural properties required for various tasks are easily accessible. For example, the hierarchical two-dimensional structure of an image may be represented through a pyramid data structure, which records the recursive embedding of image regions at different scales. Each region's shape and homogeneity characteristics may themselves be suitably coded. Alternatively, the image may be recursively split into parts in some fixed way (for example, into quadrants) until each part is homogeneous. This approach leads to a tree data structure, as sketched below. Analogous to two dimensions, the three-dimensional structures estimated from the image-based cues may be used to define three-dimensional representations. The shape of a three-dimensional volume or object may be represented by its three-dimensional axis and the manner in which the cross section about the axis changes along the axis. Analogous to the two-dimensional case, three-dimensional space may also be recursively divided into octants to obtain a tree description of the occupancy of space by objects.
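The following is a minimal sketch of the recursive quadrant-splitting idea; the homogeneity test (brightness range below a threshold) is an illustrative choice. The octant-based description of three-dimensional space is the direct analog, splitting a volume into eight subvolumes instead of four quadrants.

```python
import numpy as np

def quadtree(image, threshold=0.05):
    """Recursively split an image into quadrants until each part is
    homogeneous (its brightness range falls below a threshold). Leaves
    carry the mean value of their region; internal nodes carry four subtrees."""
    region = np.asarray(image, dtype=float)
    if region.max() - region.min() <= threshold or min(region.shape) == 1:
        return float(region.mean())                 # homogeneous leaf
    h, w = region.shape[0] // 2, region.shape[1] // 2
    return [quadtree(region[:h, :w], threshold),    # upper-left quadrant
            quadtree(region[:h, w:], threshold),    # upper-right
            quadtree(region[h:, :w], threshold),    # lower-left
            quadtree(region[h:, w:], threshold)]    # lower-right

img = np.zeros((4, 4))
img[:2, :2] = 1.0                                   # one bright quadrant
print(quadtree(img))                                # -> [1.0, 0.0, 0.0, 0.0]
```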
Recognition
A second central objective of image interpretation is to recognize the scene contents. Recognition involves identifying an object based on a variety of criteria. It may involve identifying an object in the image as one seen before. A simple example is where the object's appearance, such as its color and shape, is compared with that of known, previously seen objects. A more complex example is where the identity of the object depends on whether it can serve a certain function, for example, drinking (to be recognized as a cup) or sitting (to be recognized as a chair). This requires reasoning from the various image attributes and the derived three-dimensional characteristics to assess whether a given object meets the criteria of being a cup or a chair. Recognition, therefore, may require extensive knowledge representation, reasoning, and information retrieval.
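The simple appearance-based case can be sketched as follows, comparing a crude descriptor of the new image against a gallery of previously seen objects; the brightness-histogram descriptor, the nearest-neighbor comparison, and the object names are all illustrative assumptions.

```python
import numpy as np

def appearance(image, bins=8):
    """A crude appearance descriptor: a normalized brightness histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def recognize(image, known_objects):
    """Identify an object by finding the previously seen object whose
    appearance descriptor is nearest (in L1 distance) to the query's."""
    query = appearance(image)
    return min(known_objects,
               key=lambda name: np.abs(query - known_objects[name]).sum())

# Hypothetical gallery of previously seen objects (dark cup, bright plate).
rng = np.random.default_rng(0)
gallery = {"cup": appearance(rng.uniform(0.0, 0.4, (16, 16))),
           "plate": appearance(rng.uniform(0.6, 1.0, (16, 16)))}
print(recognize(rng.uniform(0.0, 0.4, (16, 16)), gallery))  # -> "cup"
```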
Visual learning
Visual learning aims to identify relationships between image characteristics and a result based upon them, such as a recognition decision or a motor action. For example, the essentials of the visual appearance of a person may be learned so that whenever the person reappears in the scene, his or her identity is recognized despite changes in lighting or in the observer's viewpoint. Alternatively, the continuous use of visual feedback to perform a task may be learned. For example, a person trying to pick up a suitcase moves his or her hand toward it while continuously using vision to control and correct the hand's configuration and motion, so as to reach the desired location and grip.
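The visual-feedback loop that such learning tunes can be sketched as a simple proportional controller: at each step, vision measures the remaining error between hand and target, and a correction proportional to that error is applied. The 2D positions, gain, and tolerance are illustrative assumptions; learning would correspond to adjusting such parameters from experience.

```python
def visual_servo(hand, target, gain=0.3, tolerance=0.01):
    """Move a hand toward a target under continuous visual feedback:
    each iteration measures the error (as vision would) and applies a
    proportional correction until the target is reached."""
    while True:
        error = [t - h for t, h in zip(target, hand)]   # measured by vision
        if max(abs(e) for e in error) < tolerance:
            return hand                                  # close enough: done
        hand = [h + gain * e for h, e in zip(hand, error)]

print(visual_servo([0.0, 0.0], [1.0, 0.5]))  # converges near (1.0, 0.5)
```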
Applications
Vision is the predominant sense in humans, and it can play a correspondingly central role in computer-aided systems. In manufacturing, vision-based sensing and interpretation systems help in automatic inspection, such as identification of cracks, holes, and surface roughness; counting of objects; and alignment of parts. Computer vision helps in proper manipulation of an object, for example, in automatic assembly, automatic painting of a car, and automatic welding. Autonomous navigation, used, for example, in delivering material on a cluttered factory floor, has much to gain from vision, improving on the fixed, rigid paths taken by vehicles that follow magnetic tracks prelaid on the floor. Recognition of symptoms, for example in a chest x-ray, is important for medical diagnosis. Classification of satellite pictures of the Earth's surface to identify vegetation, water, and crop types is another important function. Automatic visual detection of storm formations and movements of weather patterns is crucial for analyzing the huge amounts of global weather data that constantly pour in from sensors. See also: Artificial intelligence; Character recognition; Computer graphics; Intelligent machine; Robotics