© Academic Press (1980). This is the author's version of the work. It is posted here by permission of Academic Press for your personal use. Not for redistribution. The definitive version was published in Computer Graphics and Image Processing 13, 1980, pp298-333.
This paper describes an image understanding system that can recognize photographs of the city of Pittsburgh. It uses an effective and efficient artificial intelligence search technique called Locus. The system builds a three-dimensional model of the city and uses information from hypothesized two-dimensional views to label images. It currently achieves less than 20% error when labeling photographs, given a knowledge base of over fifty objects. In addition, the system can determine the angle of view around the city with approximately 40 degrees of error.
The paper formulates image understanding as a problem of search; shows how Locus search can be used to label images; describes the many sources of knowledge used in the interpretation; explores extensions to the use of knowledge; and presents some experimental results. New issues presented here include the multi-dimensionality of Locus search and techniques for automatically acquiring scene knowledge.
This paper is about Argos, a computer program that can understand images. The term "image understanding" means that the program can identify the major components of a scene by using knowledge about the structure of the scene. For example, Argos is able to use a map of downtown Pittsburgh to aid in the identification of the major buildings, rivers, and other objects in photographs of the city.
The technique that Argos uses to identify, or label, these images is called Locus search. Locus is a powerful search technique that uses a recursively defined evaluation function to scan an image while using a network of knowledge about the scene to constrain the search. After the image has been scanned once, Locus is able to extract picture labels from the results of the search.
One of the significant features of Locus search is its ability to use many diverse sources of scene knowledge when labeling images. Not only is it able to use information about the visual appearance of the buildings, but it also knows their size, shape, and physical environment. In general, it is felt that Locus search is a useful technique in applications where knowledge is used to attain goals.
Argos constitutes the first use of Locus search in the multi-dimensional environment of images. Some of the new issues that arise from this implementation are the modification of the Markov assumption used by Locus and the hierarchical application of knowledge networks. These changes are necessary due to the topological complexity of the signal (i.e. images are two-dimensional) and the immense amount of knowledge needed by a complex vision system.
Image understanding is defined as the application of knowledge to the task of interpreting an image. This application consists of finding a match between the knowledge and the image. Once the match has been made, the image is both interpreted and understood. It is interpreted because it has been linked to the knowledge and therefore is labeled with knowledge-based descriptions. The image is also understood because it is possible to extrapolate the interpretation and derive information not specifically in the image.
In general, the image interpretation process has three components: the image, the knowledge, and the match. The image is often refered to as the signal and the knowledge is the symbol. The match is therefore a signal-to-symbol match. The signal has been treated differently by the various image interpretation systems in existence. The simplest signal representation is the pixel, where each cell in the image is matched to the knowledge base. This representation is typically used alone [11, 17, 35] and is sometimes used in conjunction with other forms [4]. A higher representational form is the segment which is an irregularly shaped collection of pixels [4, 5, 12, 32, 38]. Another high-level representation of images is the line and/or curve [4, 21, 27, 37]. Locus can deal with any representational form. Initial experiments were with pixels, but the Argos system currently labels segmented images.
Representation of symbolic knowledge is a topic that is widely debated. A common way to represent pictorial knowledge is by listing those 2-D relationships that can exist in the image [5, 11, 12, 21, 35, 37]. More complicated forms are also used which combine the relational information into grammars [17] and semantic networks [4, 32, 38]. Locus works with the simple form of knowledge: a set of image relations that are organized in such a way that arbitrarily complex knowledge can be employed.
The important aspect of an image interpretation system, that which ties together the signal and the symbol and which typically dictates the nature of these, is the matching algorithm. The simplest match is a brute force comparison of the signal to all symbolic forms [27]. Systems with complex signal and symbol often use data-driven matching techniques [4, 32]. Standard forms of tree search are also common, for example, heuristic search [5], best-first search [11, 38], bottom-up parsing [17], and dynamic programming search [12]. Iterative relaxation, where the symbolic constraints are repeatedly re-evaluated until the labels converge over the image, is becomming increasingly popular [21, 35, 37]. Locus search is a refinement on the standard tree search techniques that combines near optimality with high efficiency.
Locus was first used in the Harpy speech understanding system [20]. Harpy uses this technique on the recognition of spoken utterances and is currently the best speech recognition system in existence. Prior to Harpy, Dragon [3] used a breadth first graph searching technique to attain the same goal.
Dragon is the theoretical grandfather of Argos. It uses a probabilistic function of a Markov model to find an optimal path through a knowledge network. The technique is very similar to the Viterbi Algorithm [13]. Harpy relaxes the formality of the probabilistic function and demonstrates that heuristics can be used to improve the search. Argos goes one step further by modifying the Markov assumption so that the multi-dimensionality of images can be handled. Instead of requiring that a first-order Markov system rely on only one previous node in the search tree, Argos implements an "adjacency first-order" Markov system that relies on all surrounding nodes in the image.
The next section discusses Locus search in detail. It starts by explaining how image understanding can be formulated as a problem of search. From there, it describes the organization of the knowledge networks that are used in Argos. Following that is a discussion of the low-level system in Argos: the techniques used to give signal characteristics to PPEs. Finally comes a detailed explanation of the search and an example that ties it all together.
Section 3 discusses how many different knowledge sources can be used with Locus search. In addition to expounding upon object adjacency and image pre-segmentation, the section discusses the use of object size, shape, and location. This section is also interesting because all of the knowledge covered is automatically extracted, generalized, and used.
Section 4 discusses how knowledge can be organized hierarchically to infinitely expand the power of Locus. Most of this section is speculation, but it does conclude with a review of where Argos stands in its hierarchical use of knowledge.
Section 5 describes the initial experimental results that were obtained with Argos. The system is currently able to label fifteen scenes of the city of Pittsburgh with 20% error at the pixel level. It can also determine the angle of view around the city with an average error of 40°. Considering the complexity of the knowledge and the images (see Figure 1), this is quite good.
![]() ![]() These pictures are two of the fifteen photographs of the city of Pittsburgh taken during the winter of 1977-78 (Notice that the river in the foreground of Training Scene 2 is frozen). |
For a system to be able to match its symbolic knowledge to a visual signal, it must be able to compare and relate these two items. Locus relates signal and symbol by formulating both with the same basic unit. This unit is called the Primitive Picture Element.
The Primitive Picture Element, or PPE, is the basic building block of both the signal and the symbol. Every sensed image can be described with PPEs since each pixel can be given a unique PPE label. Similarly, the general knowledge of a scene (sky above river under bridge, etc.) can be described with a network of objects, all of which also have PPE labels. PPEs can be thought of as the smallest units of representation that exist for the micro-world of the image recognition task. Alternately, they can be thought of as the largest object that is both homogeneous to the signal and homogeneous to the symbol. Simply, the PPE is the label that Locus uses when interpreting images.
The choice of PPEs varies with the micro-world being explored. It is dependent on the level of detail in the knowledge base and on the ability of the system to optically distinguish different parts of an object. If, for example, one wishes to determine whether an image is a city scene or an office scene, then neither the sensed data nor the symbolic representation need be very complex. The PPEs sky, building, road, wall, desk, and floor would suffice. Note that each of these PPEs is comprised of optically similar pixels in the sensed image and represents adequate symbolic information to determine the scene type.
Given that a set of PPEs is to be used to interpret an image, a number of questions immediately come to mind. Two obvious ones are how the sensed image is to be compared with PPEs and how the PPEs are to be organized as a symbolic knowledge base. More important is the issue of how these two forms are matched. The next two sections discuss how PPEs are organized symbolically and optically. The section after that describes the match process, which of course is Locus search.
Imagine that a network is constructed whose nodes are all PPEs. The network contains an initial node and a terminal node between which all other nodes are somehow connected. Each node in the network is an item describing an object and each node connection is a fact about the two items it connects. Therefore, a complete network is a set of facts about PPEs and is called a knowledge network. In addition, a complete path from the initial node to the terminal node is a collection of knowledge that is consistent (i.e. all facts agree with each other). Imagine further that for every image which is M by N pixels in size, there exists a path of MxN nodes stretching from the initial node to the terminal node which forms a consistent "statement" about that image. The existence of a PPE node in a given position along this path is equivalent to a label assignment of that PPE's symbolic name to the pixel in that path position. The path, therefore, forms a symbolic description of the image, taken from the knowledge network.
Since a PPE path corresponds to a set of labels for an image, the image interpretation problem reduces to a search task. Knowledge for this search task comes from the network connections which guide the path selection. For example, assume that there are 50 PPEs in a network and that each one is connected to all of the others, yielding 2500 node connections. This represents no constraint on the network paths and therefore no knowledge. The network would then consist of an initial node, a terminal node, and one node for each of the 50 PPEs. If a sensed image which is 75 by 100 pixels in size is matched to this network, then the path through the 50 PPEs can take any of 50 choices at each step through the image. The number of different paths through the network would be 507500. If any of the 2500 node connections is removed, then the number of possible network paths is reduced. Knowledge, therefore, appears in the form of constraints on the interrelations of PPEs.
How are these constraints selected? One way is to build up the set of interrelations from a set of exemplar images. Since each network path corresponds to a legal image, the converse must also hold: each legal image has a unique network path. It is this converse form which defines how knowledge is encoded as a set of network constraints. The paths from the exemplar images are combined into an overall knowledge network. If these images are all of downtown Pittsburgh, a city which is surrounded by mountains, then they are going to have a backdrop of mountains below the backdrop of sky (except for pathological viewpoints which can be ignored in this example). This simple piece of knowledge manifests itself directly as a network constraint: the sky PPE may not adjoin any of the building PPEs because the mountain PPE intervenes.
In addition to specifying legal adjacencies, the network arcs can contain information about the nature of the adjacency. An arc can specify that the PPEs which it connects are vertically or horizontally adjacent. It can also specify more complicated relationships such as "containership" and some other knowledge such as object size. However, it will be shown that most complex relationships can be built from a series of simpler constraints.
The use of networks to store knowledge is nothing new. Semantic networks have been used for years to just such end. In fact, one might think that these PPE networks are nothing more than semantic networks. This last, however, is not true. Semantic networks have multiple levels of hierarchy which define knowledge. In a semantic network, there are depth relationships which describe sub-sets of knowledge. PPE networks however, have no depth. All knowledge must be specified fully at each network node. It might seem that this causes each node to use excessive amounts of storage, but in practice, it only requires a few extra pointers at each node.
Why are PPE networks structured so as to make them larger than necessary? The answer is found in the classical tradeoff of space for time. Single-level networks are much easier to search and PPE networks are no exception. What Locus loses in space, it gains in time while searching the PPE network. Search of a Locus network proceeds in the same way as images are laid out. Semantic networks do not have this explicit search order built into them, so they take longer to search because the control structure must be more complex.
Before the search process can be discussed, there is one more background issue that must be covered: the direct comparison of the sensed image to PPEs. This comparison is at the low-level (signal) end of vision research and is critical to any high-level (symbol) system such as Argos. However, since this research is concerned primarily with the high-level end, the discussion presented here contains no new ideas. In fact, it does not even pretend to represent the optimal techniques.
The images that Argos works with come digitized as 525x700 points of light. Each point is described by three 8-bit values from red, green, and blue sensors. To simplify labeling, Argos does not work with 525x700 images. Not only is it prohibitively expensive to search for a network path that is over 300,000 nodes long, but the interpretation being sought does not require such detail (see the list of PPEs in Tables 1 and 2 for an understanding of the desired level of detail). Argos reduces each image before labeling. When the system is working with segmented images, it typically divides the image into as many as 100 arbitrarily shaped segments (see Figure 2). When the system is labeling a regular grid, each image is reduced by a factor of 7 in each dimension before being processed. The 49 original pixels are combined by taking the median and the contrast (a function of the fourth moment of the distribution curve [34]). Median provides a good low texture measure and contrast is good for areas of high texture. Thus, each reduced pixel is represented by a six element feature vector: (Median of Red, Median of Green, Median of Blue, Contrast of Red, Contrast of Green, Contrast of Blue). The reduction to 75x100 images makes the search more manageable. It is the largest reduction that can be performed without damaging a human's ability to recognize the images at the desired level of detail. In addition, the 7x7 window allows the contrast operator to distinguish many of the buildings from each other.
This table shows all of the primitive picture elements (PPEs) that were selected for the object identification labeling of the Pittsburgh images. The symbol column is the identifier used in the labelings in Fig. 12. The training image column reports which of the seven training images were used to extract signal-to-symbol matches for that PPE. Note that four of the PPEs (Pittsburgh National Bank Operations Building, Jenkins Arcade Bldg., Miscellaneous Bridges, and Federal Bldg.) did not appear in any of the photographs, and were therefore not trained. Note also that the Blue Cross Bldg. and the Grant Bldg. were trained from test image 5 because they only appeared in that image. |
This table shows all of the primitive picture elements (PPEs) that were selected for the view angle identification task of the Pittsburgh images.
One additional consideration must be given to the feature vector templates: they should be able to accurately describe objects that have multiple appearances in the training images. Argos solves this problem by creating multiple feature vector templates for objects with multiple appearances. The decision to create an additional template is made automatically when the standard deviation of more than two feature vector elements exceeds 20% of their possible range. If more than two elements are this diverse, the feature vector is split in the middle of the dynamic range of the most diverse element. When extracting feature vectors from seven training images, Argos generates as many as three templates for each object, although one template typically suffices. Given six elements in the feature vector, a distance metric must be found for the pixel-to-PPE likelihood calculation. Argos uses a metric that is commonly found in vision systems: weighted-Euclidean distance (Mahalanobis distance). The selection of weighting factors is done by computing the mean and standard deviation for all pixels in the region. Then, those pixels that are more than 1.5 standard deviations from the mean are rejected. This rejection of atypical pixels allows for human error in the training and prunes data points which may be unfairly contributing to the metric. The mean and standard deviation are then re-calculated and the adjusted standard deviation is used as a weighting factor for the adjusted mean (which becomes the template value). The use of standard deviation as a weighting factor on the mean is common practice because it is an indicator of the consistency of the data and therefore the usefulness of the mean value in classifying that data. The conversion of standard deviation to weighting factors is done by inverting the six standard deviation values for a given PPE and then linearly scaling them so that they sum to 1. Since feature vector components with high standard deviations indicate erratic quality on the part of that component, a high standard deviation will yield a low weight which will reduce the system's dependence on that component.
2.3 Locus SearchThere is only one step remaining in this labeling process: search. The knowledge environment has been formulated as a network and techniques have been described for matching the sensed image to that network. The search process must now find the "correct" set of labels for the sensed image. As earlier discussion mentioned, the combinatorics of labeling an image are tremendous. Argos deals with images that are 75 by 100 pixels and contain a minimum of 50 PPEs. If no knowledge about region relationships is available, then the branching factor for a 50 PPE network is 50 (i.e. PPEs can branch 50 ways to all other PPEs) and the number of possible labelings is 507500, or about 1012750. When knowledge is applied, the branching factor reduces significantly. For example, a typical 50 PPE network with knowledge has a branching factor of about four which reduces the number of possible labelings to 47500, or about 104500. Thus, the application of knowledge is the most powerful tool available in reducing the search space. Nothing else can reduce the number of alternatives by over 8000 orders of magnitude! However, there is still much reduction to be done before a single, correct path is selected, because even 104500 paths are too many for a machine to examine. Locus search is able to perform that reduction and extract a single labeling from the knowledge-reduced search space. Locus search is different from standard breadth-first search. It rejects the notion that any final path decisions must be made before the entire image has been scanned. Recall that Locus is a beam search: it steps through the knowledge network and the image selecting many "attractive" PPE options at each pixel. These attractive options comprise the beam: a pruned search tree which contains a list of near-miss alternatives around the best path. Beam selection is based on a combination of signal and symbol factors in a recursively defined environment that links these factors in all parts of the image. There is enough latitude in the beam to allow many possible labelings to be stored. It is only after the entire image has been examined that a single path is selected from the beam. The final labeling is therefore "the best of the best". It is found without time-consuming backtracking and it is highly accurate because it delays decision making until the entire image has been examined. Thus, Locus proceeds in two steps. The examination of the image and construction of a beam of alternatives is called the Forward Pass. The selection of a single near-optimal path from that beam is called the Backtrace.
2.3.1 The Forward PassThe forward pass of Locus search proceeds in breadth order through the knowledge network creating a beam-like search tree as it goes. Each level of depth in the tree corresponds to a pixel or segment in the sensed image and each knowledge network PPE which is placed in the beam at a depth level is a possible label assignment for that pixel. The following sub-sections discuss the various aspects of the forward pass.
2.3.1.1 Order of SearchThe order in which the image pixels are scanned is raster (left-to-right, top-to-bottom). For the 75x100 pixel images, this means that depth level 1 of the beam corresponds to the pixel in the upper-left-hand corner, depth level 100 is the upper-right-hand corner, depth level 7401 is the lower left-hand corner, etc. For pre-segmented images, the segments are ordered by raster according to the position of the segment centroid. This order of search may appear at first to be detrimental to the quality of labeling. It will be shown, however, that when using Locus, the order of search is of minor importance as long as the backtrace follows the reverse order of the forward pass. Recall that each knowledge network has an initial PPE and a terminal PPE. For the two-dimensional image task, there are actually four knowledge network PPEs which cover the initial position. These PPEs are the four image edges: top, bottom, left, and right. Locus knowledge networks can use these PPEs to help constrain region placement within the image (for example, sky is at the top of the image). Note, however, that these constraining PPEs can be used to not constrain simply because Locus imposes no fixed rules on region placement within the scene: it leaves that as an option. The current system allows the edge PPEs to adjoin any other PPE, thus removing all dependence on the field of view.
2.3.1.2. Path LikelihoodsDuring the forward pass, determination of likely PPEs for the beam is based on the computation of a path likelihood for the path that arrives at that PPE. The path likelihood is defined recursively in terms of the path likelihoods of all parent PPE nodes in the beam. It is this path likelihood value which is used to build the beam and guide the forward pass. It uses three pieces of information: the optical match of the pixel to the PPE, the path likelihoods of previous PPEs, and the transition likelihoods of arriving from those previous PPEs. Formally:
where Pi,j is the path likelihood of PPE i at depth level j (position j of the signal); Oi,j is the optical match of PPE i to the pixel at position j; Δ(d) is the depth adjacency function which offsets the current beam depth j to a "previous" adjacent depth j + Δ(d) in the two-dimensional direction d; and Tk,i,d is the transition likelihood of traveling from PPE k to PPE i in direction d. The need for the Δ function is explained by the fact that the task is image interpretation which is two-dimensional. Since depth within the beam is one-dimensional, some concept of physical adjacency must be incorporated. This is done with the Δ function which determines, for a given two-dimensional direction, what two depth levels are physically adjacent in the image. For example, in a 75x100 pixel image, Δ(above) = -100 and Δ(left) = -1. This is because of the raster order of scan: a pixel to the left is 1 back in the beam whereas a pixel to the top is 100 back on the previous raster line. Pre-segmented images have an arbitrary number of adjacencies so the Δ function can extend as far back as the start of the image. In unsegmented images, Argos uses four primary directions of adjacency: left, above, upper-left, and upper-right. These four directions, which are the range of d, have four opposite directions which together allow all horizontal, vertical, and diagonal relationships to be defined within the 3x3 matrix of pixels surrounding the current point. Note that the opposite directions are not explicitly used because they are the simple converse. In pre-segmented images, adjacency is much different. The directions of adjacency cannot be limited to an upper-left semi-circle since two segments can border on all sides of each other. Thus, at least eight directions of adjacency are required to retain the same amount of knowledge. However, the pre-segmentation version of Argos does not use diagonal directions of adjacency. This is because segments which are diagonally adjacent can also be considered to adjoin horizontally and/or vertically. The result is that pre-segmentation Argos has four directions of adjacency: left, above, right, and below. These directions are used to match the segment adjacency to the network adjacency when determining the transition likelihood (T). Unlike unsegmented Argos, these directions can be combined when describing segment adjacency. Thus two segments can adjoin in any of 16 directions. Even complex relationships like "containment" [19] can be incorporated in this scheme. A more sophisticated adjacency system for pre-segmented images would be able to account for the continuum of border classes. It might be reasonable to work the border type into the above path likelihood equation. In addition to considering the angle of a border, its texture (smooth or jagged) could be considered.
2.3.1.3. RecombinationThe path likelihood equation introduces an interesting heuristic used in Locus search: recombination of nodes in the beam that have the same PPE label. If the beam were actually represented as a true search tree with totally independent paths then any entry would uniquely identify its lineage back to the root of the tree. In the image understanding world, where search trees are phenomenally large, these independent paths are too expensive to retain. Therefore all paths at a given depth level which transit to the same PPE in the knowledge network are merged by Locus into one node in the beam. Thus, the beam resembles a Viterbi trellis [13]. It is for this reason that the "maximum" factor exists in the path likelihood equation: a pointer is saved to the best previous beam entry with this PPE label and all other connections are rejected. In a true search tree, the "maximum" factor is unnecessary because there is only one parent entry for each child. Beam entries have many parents, only one of which can be saved. There is one more detail that should be mentioned about the path likelihood equation: the "average" clause. This clause indicates that path likelihoods from all directions of adjacency are being considered. The likelihoods are averaged to show that they contribute equally to the overall path likelihood. As a side constraint, the unsegmented version of Argos insists that an adjacency connection exist in all four directions. The pre-segmentation version of Argos relaxes this constraint and simply penalizes the path likelihood of any PPE that does not have transitions from all neighbors. This allowance is due to the potentially large number of neighbors that a segment can have.
2.3.1.4. Transition LikelihoodsMore reduction in the complexity of the path likelihood equation can be obtained by selecting the transition likelihoods from one of two values: 0 or 1. A value of 0 indicates that it is not feasible to transit from PPE k to PPE i in direction d. A value of 1 means that the transition is allowed (i.e. PPE k and PPE i may adjoin each other in the d direction). The transition likelihoods are therefore the knowledge network constraints and comprise the major knowledge element of the path likelihood equation. Argos actually implements the transition likelihoods as one of three values: 0 for a disallowed transition; 0.1 for an allowable transition from one PPE to another; and 0.9 for an allowable transition from a PPE to itself. Self-transitioning is a necessary special case in the unsegmented system since most PPEs cover a large area of pixels. The increased likelihood of transitioning to the same PPE is a form of knowledge which Argos uses to label city scenes. The value of 0.9 is taken, without experimental verification, from the Harpy speech implementation of Locus. Pre-segmentation Argos also uses differing intra-state and inter-state transition likelihoods, but the reverse values are used. This is a knowledge source which implies that the segments form complete objects in the image and should not self-transit.
2.3.1.5. The BeamThe beam is the search tree. It is the collection of pixel-to-PPE pairings that is accumulated on the forward pass. It recursively generates itself using the path likelihood equation to select new members. However, not all of the path likelihoods produced by the equation are needed to produce the beam. This is because Locus resembles a first-order Markov system, so the only likelihood values that are needed are in the immediate neighborhood of the pixel being computed. In the unsegmented system, anything more than 1 scan-line back (100 depth levels back) can be discarded because path likelihoods exist only to compute other path likelihoods. The pre-segmentation system, however, must keep all of the likelihoods since strange segment adjacencies may require arbitrary likelihoods from earlier in the forward pass. The important information in the beam is the inter-level connections. Whenever a path likelihood is computed, there is an "optimal parent" in each direction of adjacency. This optimal parent is considered to be a major contributor to the path likelihood at the current pixel or depth level. It can be seen in the path likelihood equation as the maximum k for each direction d. The collection of optimal parents is what makes up the beam.
2.3.1.6. PruningYet another distinguishing feature of Locus is its pruning. It has been mentioned that Locus selects the best few PPEs and saves them in the beam. There is a fixed maximum size at each level of the beam which is considerably smaller than the number of PPEs. Therefore, only those PPEs with the highest likelihoods are allowed in the beam. Those that cannot fit are usually not important in the overall labeling scheme because there are many other beam entries with higher likelihoods. In addition to fixing the beam size, Locus maintains a likelihood threshold below which a PPE will be pruned regardless of beam space availability. This threshold, which is used in Harpy Locus, is dynamic in that it is relative to the best likelihood at the particular depth level. If the best PPE has likelihood value P, then all other PPEs with likelihoods below P - threshold will be pruned. It has been found that a threshold of 100 (out of a possible range of 256, see next subsection) will prune on the order of 60% of the PPEs at each depth level. This is a hefty cut yet it does not damage the labeling quality.
2.3.1.7. NormalizationOne problem that arises when computing the path likelihoods is precision. In a true Markov system, all of the probabilities at a given depth level must sum to 1. In Locus, however, there is much pruning and re-combination of likelihoods. The result is that the values degenerate as the image is scanned. To prevent this, the likelihood values at a pixel are normalized once they have all been computed. The most promising PPE at the level is given the likelihood value 1 and all others are linearly scaled to fall below that. This normalization causes no damage to the algorithm because Locus is only concerned with one single depth level as it relates to another. If all PPEs at a level are normalized together, then the relative results are the same. As an added bonus, the normalization allows the path likelihoods to be stored with very few bits of precision. Argos uses negative log values of all likelihoods so that it can add instead of multiplying. Also, it is able to store these as integral values in only 8 bits of precision. Thus, the likelihood value 1.0 is stored as a 0 and the likelihood value 0.0 is stored as 255. All fractions between 0.0 and 1.0 are represented as an integer between 0 and 255.
2.3.2. The BacktraceAfter the entire image has been scanned, all that remains is the beam: stretching out for as many depth levels as there are pixels or segments in the image, and containing PPE connections from the forward pass. The backtrace scans the beam in reverse order (i.e. from the bottom of the beam to the top) and produces a labeling of the image. The labeling, of course, consists of a series of PPE assignments to every pixel in the image. The inquisitive reader will ask why the beam is scanned in reverse order. The answer is: to obtain a unique labeling. Notice that the beam contains, for each entry, a pointer to its optimal parent at each previous depth level. This optimal parent pointer is a direct by-product of the path likelihood equation. Therefore, when scanning the beam backwards, each child which is selected will identify exactly one parent at a next higher level. If the beam were scanned forwards, there would be ambiguity when a parent had more than one child: which child should be selected next? The backtrace is necessary in this form of search tree evaluation because of the re-combination that occurs when multiple entries at a depth level join into a single one at the next depth level. If there were no re-combination, then the optimal PPE entry at the bottom depth level would uniquely define a path back through the search tree without the need for a beam or any parent pointers. However, the combinatorics of search without re-combination are prohibitive in the image understanding task, so Locus re-combines and does a backtrace. Notice that the backtrace is extremely fast. It performs no search or other involved computation. In fact, the backtrace is linearly bounded in time to the size of the image. Even with the complications that are about to be discussed, it retains these qualities.
2.3.2.1. Conflicts: The ProblemConflicts arise in the backtrace when multiple beam pointers from the forward pass disagree. This is a subtle problem that is once again caused by topology: the task is two-dimensional but the beam is one-dimensional. It is the solution to the dimensionality problem that causes problems in the backtrace. To handle two-dimensional data on the forward pass, there is an optimal parent in each of the directions of adjacency. If the forward pass saves multiple parents for every child then, by symmetry, there must be multiple children for every parent on the backtrace. Thus, two-dimensionality presents a problem of conflicts: when the children select different parent labels. Although it would be nice to have unanimous PPE selection, this is not always the case. Conflict resolution is an interesting problem that has not been fully explored. Therefore, the following discussion contains some techniques and observations, but no definitive solution.
2.3.2.2. Conflicts: Some AnswersThere are two ways to resolve conflicts. The obvious one is to select one of the candidates from the child PPEs through some heuristic. The surreptitious solution is to throw out a pixel or segment when it has a conflict. This latter technique is not as bad as it sounds for a few reasons. First, there is no rule which says that every pixel must be labeled. The second reason for leaving conflict points unlabeled is in direct support of the first: experience has shown that conflicts arise only at region borders or in areas where there is inadequate training. When Argos changes PPEs, it gets confused for a pixel or two but soon picks up the scent and continues faithfully. Given that conflicts can be resolved by arbitration or rejection, Argos chooses to do both. It employs a few simple heuristics to resolve conflict types that it understands, and then rejects the pixel if the conflict persists. Leaving a pixel unlabeled sounds simple enough, but it causes other problems in the backtrace. One problem is that there are no parent pointers emanating from an unlabeled pixel. In the unsegmented Argos system, when an unlabeled pixel is skipped, all parent pointers are extended through it. Although no pointers emanate from unlabeled pixels, every subsequent pixel still gets four pointers into it because there is a labeled pixel at some distance in all four directions. In pre-segmentation Argos, unlabeled segments do not affect the backtrace: they are simply ignored. Since the number of child segments varies anyway, it does not matter if there are one or two less children available when it comes time to label a segment. It should not be presumed, however, that unlabeled segments are harmless. Conflicts indicate inconsistency in the interpretation and should be resolved whenever intelligently possible. The most powerful, and least obvious conflict resolution technique is simple knowledge constraint checking. The child pixels have been labeled already so their parent PPE selections should be consistent with all legal adjacency rules in the relational knowledge network. It is simple enough to check each candidate for the parent position against its children and reject those candidates that are inconsistent. If all but one candidate are rejected, then the conflict has been resolved. If there are still multiple candidates, or, if all of the candidates are rejected, then some other resolution technique must be employed. Before moving on to other resolution strategies, it is worthwhile to stop and see why the above technique is able to function. After all, aren't all beam entries generated from the path likelihood equation? And doesn't that equation use only legal PPE relationships? So how can invalid relationships arise? The answer has to do with the reverse order of scanning and the topology. X may adjoin y, and y may adjoin z, but x doesn't have to adjoin z. Even though they may end up touching in two-dimensions, they do not have to be legally adjacent. Recall that the beam contains many likely label candidates which may not all be consistent with each other. The next technique used in conflict resolution is voting. It has been found that if one candidate outnumbers all the others, then that candidate should be chosen for the pixel label. Voting is easily implemented and aids in labeling quality. The only other resolution technique that was explored is directional preference. This last-resort heuristic is used only when the above two techniques fail. What it does is to select candidates solely on the basis of their adjacency direction. The assumption is that certain directions aid the labeling more strongly than others. When this assumption was tested out in the unsegmented system, it was found that the diagonal directions seem to be slightly stronger than the horizontal and vertical directions. However, the advantage is not strong enough to be significant. In addition, use of this technique implies that all conflicts can be resolved since this technique cannot fail. Inaccuracies in labeling rise faster than accuracies when conflict points are forced to be labeled. Therefore, directional preference is not used. There are a number of conflict resolution techniques that were not explored. An example of one of these is the use of optical matches. This factor can be used in conjunction with other techniques (i.e. as a weighting factor for voting) or by itself. In fact, any knowledge that is used in the forward pass can also be applied in the backtrace to help resolve conflicts.
2.4. Advantages of Locus SearchLocus performs very well in labeling images. The technique of delaying decision until the backtrace pass allows a globally near-optimal labeling to be selected. "Globally near-optimal" means that every knowledge constraint which has not been pruned is able to be applied to every PPE relation in the scene. Of course, pruning does allow for the possibility of missing the globally optimal path. However, this happens too seldom to make the added cost of exploring all paths worthwhile. The reason that Locus search yields globally near-optimal results is simple: it resembles a Markov system, and Markov systems yield globally optimal results. In fact, if no pruning, normalization, or node recombination were done, then Locus would be a Markov system. It has been found, through extensive experience with Locus systems, that these heuristics do little damage to the Markov nature of the search. Thus the quality is preserved. Yet another advantage of Locus is search order independence. It was mentioned earlier that the raster order of search is not detrimental to labeling quality. In a Markov system, all that matters is that every point be examined: they are all related to each other regardless of order. In Locus, all that matters is that every point be examined and that the backtrace reverse the order of the forward pass. Given that Locus implements a Markov system (which, within heuristic bounds, it does) then it becomes marginally important how the image is scanned. In Argos, the raster scanning order is done for two reasons: to provide a consistently high number of neighbors in the context of the search (i.e. every pixel in the forward pass or backtrace is guaranteed to have neighbors that have already been examined) and to demonstrate that the order of search is not significant. It would be feasible to add search order as a knowledge source. This would require the addition of a depth level factor in the path likelihood equation (in the normalization phase). Then the search could start from one or many "points of interest" in the sensed image and proceed outward (perhaps in a spiral). These points of interest could be selected on the the basis of their signal features (i.e. bright areas) or on the basis of some external knowledge about the image. Another advantage of Locus search is speed. Locus uses no backtracking or other unbounded computation. The beam constrains all search possibilities to a reasonable size as it follows many parallel paths through the search tree. In fact, the varying sized beam can be functionally compared with the varying sized stack in standard backtracking searches: both list the current best paths through the search tree. The difference is that the beam is horizontal and the stack is vertical (when search trees are viewed as top-to-bottom, not left-to-right). It is this difference which allows Locus to limit the beam size without damaging the search.
2.5. ExampleA good way to finish up this section is with an example. This example is, of necessity, a toy use of Locus but it will make use of much that has been presented in this section to show how it all fits together. Those readers who are satisfied with the details may skip to section 3. This example concerns the labeling of a very small satellite picture of a fictitious agrarian nation. The nation's chief products are Alfalfa, Barley, and Corn (sometimes refered to in this example as A, B, and C). Due to soil conditions, winds, and antiquated laws, farmers always plant their crops in fixed positions on their land. The law states that Barley and Corn must be planted, but Alfalfa is optional. The soil conditions mandate that Corn always be planted to the east of Barley, and winds force farmers who choose to plant Alfalfa to do so in the north-west corner of their field. The Locus knowledge network in Figure 3 is compiled from the above information. It uses only north and west in its relationships. Needless to say, there are three PPEs: alfalfa, barley, and corn. Together with the four image-edge PPEs, they define the knowledge network. The goal is to label the satellite photographs and identify the crop types.
The satellite photographs are not very large: four pixels across and two pixels high (unsegmented). However, with a total of 8 pixels and 3 choices of a label for each pixel, there are still over 6000 possible labelings of each satellite photograph. It was mentioned earlier that knowledge does most of the reduction of the search space and this example is no exception. In fact, the knowledge-reduced search space actually contains the ten labelings shown in Figure 4.
2.5.1. SignalThe signal side of this example is as follows: there is one sensor in the satellite which generates brightness values in the range of 1 to 10. Therefore, the feature vector templates are simply one number which describes each PPE. Let us assume that the typical alfalfa field registers 4 on this sensor, barley registers 9, and corn registers 3. For a distance metric, it is adequate to choose an unweighted subtraction measure. In particular, the formula
will yield a fraction between 0 and 1 which indicates the likelihood that PPE i is matched to image pixel j. Figure 5 shows the sample photograph to be labeled and the optical matches for each PPE.
2.5.2. The SearchThe search through the image now begins. The first step is the forward pass which starts with the pixel at position (1, 1) in the upper-left corner. The path likelihood for alfalfa at pixel (1, 1) is derived as follows:
Notice that the initial PPEs are defined to have the likelihood value 1.0 and that all transition likelihoods are simply 1 or 0 depending on the PPE connection. By similar computation, Pbarley,(1,1) = 0.4 and Pcorn,(1,1) is undefined. This last path likelihood is undefined because the transitions are not allowed in the WEST direction. Once both valid PPEs have been calculated for pixel (1, 1), the beam is updated. In this case, barley is pruned because its path likelihood is too low. (In this example, anything .5 or more from the best likelihood is pruned. Barley has the value 0.4.) The remaining value is normalized to 1.0. At pixel (1, 2), the three PPEs are evaluated again. Observe the path likelihood of barley at pixel (1, 2):
At pixel (1, 2), corn is again pruned because there is no barley to the west of it. Therefore, only two PPEs are valid, neither of which is pruned. At pixel (1, 3), all three PPEs are retained in the beam. This is the computation of barley at (1, 3):
Figure 6 shows the status of the search tree after the top row of the image has been scanned. The two numbers above each PPE node in the search tree indicate the path likelihood values before and after normalization. Observe that certain paths are rejected because there are more optimal parents. The forward pass now advances to the second row and scans it from left to right (west to east). Pixel (2, 1) has only two choices: alfalfa (normalized likelihood 0.667) and barley (likelihood 1.0). Pixel (2, 2) also has these two possibilities because corn was not allowed at pixel (1, 2). Lets examine the likelihood calculation for barley at pixel (2, 2). It is quite complex:
The completed forward pass, with row two, is shown in Figure 7. Notice that row one has been simplified to its pruned normalized paths. Figure 8 shows the beam as it appears at the end of the forward pass, constructed from unpruned optimal paths. The backtrace starts at pixel (2, 4). The terminal PPE (arrow entering from the bottom) indicates the corn label for this pixel because it had the highest path likelihood on the forward pass. From there, the corn at (2, 3) is selected (just follow the arrows) and barley is selected at (2, 2) and (2, 1). Next, pixel (1, 4) is labeled and corn is easily picked. At pixel (1, 3), however, there is a conflict. The child at (2, 3) recommends corn but the child at (1, 4) recommends barley. The conflict is resolved by observing that the barley candidate would be inconsistent with the corn to the south of it, whereas the corn candidate is consistent with both children. There are no other conflicts in the backtrace and the final two pixels are quickly and unanimously selected to be barley and alfalfa. The final labeling is shown in Figure 9.
2.5.3. CommentsA dangerous situation arose in the above labeling which could have had disastrous results. The pixel at (2, 4) was correctly labeled corn but could have been labeled alfalfa. If alfalfa had been stronger, the entire scene would have been labeled alfalfa and that is an illegal labeling. How could this happen? The toy nature of the example helped to encourage the possibility of illegal labelings. With only two directions of constraint, there was very little knowledge guidance during the search. By the addition of two more directions, NORTH-WEST and NORTH-EAST, this problem is completely avoided because alfalfa is then denied adjacency to the right side in the NORTH-EAST direction. In addition, the use of minimum path knowledge, which limits PPE proximity to edges based on the number of legal transitions required to adjoin the edge, would have pruned many PPE nodes in this example. However this knowledge is seldom used in larger images. The bottom line is: the more knowledge, the better. In fact, even the four directions that Argos uses are inadequate for complex labeling tasks. It's not that Locus is weak, only its knowledge is weak. The next section discusses other forms of knowledge that are available to enhance Locus search.
3. KNOWLEDGEThus far, Locus search uses only one kind of knowledge: region adjacencies. This adjacency knowledge forms the basic structure of the network that Locus searches, but it is not the only knowledge that can be brought to bear. In fact, as this section will show, there appears to be no major limitation to the knowledge that can be used in Locus search. The knowledge sources described in this section are all acquired automatically from training images. Although the techniques for this acquisition are not generalized learning procedures, they are interesting in terms of visual knowledge maniuplation.
3.1. Adjacency KnowledgeBefore examining additional forms of knowledge, it is appropriate to discuss the exact method used to acquire region adjacency knowledge. Locus networks contain the extracted adjacency knowledge from many hypothesized views of an internal model of the scene. The internal model, which in the case of Pittsburgh city scenes is constructed from street maps, is a three-dimensional model of the city which can be used to generate all possible views of the city. To build a knowledge network, each selected view (or hypothesis) is generated and added as a separate path. The non-obvious aspect of building a network from multiple views is that a region which appears on different views is sometimes represented with separate PPEs and at other times is merged into one PPE. The decision to merge is based on a similarity measure between the different views of the region. Three measures were proposed for the merging of multiple instances of the same region from different hypothesized views. Each algorithm allows regions to combine if a certain percentage of their adjacencies are identical. The algorithms differ only in the amount of consideration they give to the various descriptive aspects of the regions such as shape and size. The first algorithm uses no descriptive information. It merely compares the local adjacencies and merges the regions into a single PPE if the percentage of their identical adjacencies is above the threshold for merging (experimentally determined to be 10%). The second algorithm depends highly on location and shape information. Not only must the two regions being considered for merging have a fixed percentage of identical adjacencies (as in the first algorithm), but the centers of their minimum-bounding-rectangles must be within the same percentage of the image size. In addition to using this location information, the second algorithm requires that the minimum and maximum dimensional shape limits (see Section 3.4) be within tolerance to each other. By tolerance is meant that the smaller value must be at least the fixed percentage of the larger value. The third algorithm uses the same technique as the second algorithm except that it relaxes the restrictions on shape. Only maximum dimensional shape is compared, and only in the horizontal and vertical directions (not in the diagonal directions). The effective result is that the sizes of the minimum-bounding-rectangles are compared. This last reduction algorithm was consistently found to be superior to the others in all of the Argos tasks. Although some knowledge is necessarily lost when multiple instances of the same region are merged into one PPE, it is mandatory that the networks be reduced or else each hypothesized view will be stored as a separate path through the network. This not only makes the networks too large to search, but forces the beam to contain at least one path for every hypothesis, if that hypothesis is to be considered in the labeling. It has been found that frequent region merging produces good labeling results because little information is lost and the search is much easier. The remainder of this section discusses how the location, size, and shape of a region can be used to aid the Locus labeling process. Even pre-segmentation, where arbitrarily shaped areas are presented for labeling, can be considered to be a source of knowledge.
3.2. PresegmentationArgos does not explicitly segment: it labels. Segmentation is the division of an image into distinct areas, each of which is a separate object in the image. Labeling is the identification of each object. Many image understanding systems segment their image first and then label the segments [32, 38]. Argos is able to take pre-segmented images and label the segments, but it is also able to take unsegmented images and "post-segment" them. What this means is that it is possible to view the labeled output as a segmentation since the labels are grouped into distinct clusters within the image. Pre-segmentation, therefore, is the use of images that have been segmented (by some unknown but reliable algorithm) before Argos labels. Some reasonable choices of pre-segmentation algorithms are clustering [26], and other multispectral classifications [18, 30]. Argos is currently running with a clustering algorithm that is derived from Ohlander's work [33]. The use of pre-segmentation has the advantage of time and space saving because there are many fewer nodes in the search tree, typically two orders of magnitude fewer. In addition, there is increased accuracy when using pre-segmentation since smaller numbers of search tree nodes can be constrained better. The only disadvantage of pre-segmentation knowledge is that care must be taken to ensure that enough segments are found. It is only marginally harmful if there are too many segments because the labeler can simply give multiple segments the same label, but if there are too few segments, then there will be continuity gaps in the adjacency and the search will fail.
3.3. Location KnowledgeEveryone knows that mountains are usually found in the top part of an image. This sort of location knowledge is separate from adjacency: it is absolute position as opposed to relative position. The use of location knowledge is quite easy and requires only two steps: learning the knowledge and using the knowledge. Learning location knowledge is done at the same time as the learning of region adjacencies. Each hypothesized image that is combined into the knowledge network has a minimum-bounding-rectangle (MBR) drawn around each region. When regions from two hypothesized images are merged, their MBRs are combined into a new MBR that includes both regions. The resulting knowledge that is extracted is an MBR for each PPE in the network. In some cases, the resulting MBR does not appear to contain useful information. If, for example, a building appears in many different places across the hypothesized images, then the knowledge contained in the combined MBR will not constrain that building's location. But that is a form of knowledge which says that the building may appear anywhere. It would be possible to extract an exact template of the pixels occupied by each region. When regions merged, the new template would be formed from the union of the old templates. However, this much detail is not necessary for Locus since the hypothesized views are imprecise at the outset. Using location knowledge is easier than obtaining it. In the unsegmented system, each pixel that is being evaluated during the forward pass is either in the MBR of a PPE or it isn't. If it is in the MBR, then no action is taken. If it is outside, then a penalty is added to the path likelihood equation for that PPE-pixel entry in the beam. The pre-segmentation version of Argos uses the segment centroid to determine location violation in the same manner. The penalty gets stronger as the pixel gets farther from the MBR of the PPE. Thus, Locus doesn't reject a label assignment if it violates location knowledge, but the likelihood begins to decrease as a pixel strays from its expected location. Unless there are other factors in the path likelihood equation that override the location penalty, the path will soon be pruned from the beam. This is how Locus search allows imprecise knowledge to be used without completely destroying the labeling.
3.4. Shape KnowledgeShape is hard to define because it has many aspects. This section will discuss two methods of describing shape. The first is better suited to pre-segmented images, and the second works best with unsegmented images. When Argos uses pre-segmented images, it must match the shape of an area of the image to the expected shape in the knowledge network. It is not sufficient to pre-compute the shape of each segment in the image since segments may combine during the search due to self-transitions in the network. Argos must be able to dynamically evaluate the shape of a group of segments and compare it to the shape specified by the PPE which is labeling these segments. Although there are many choices of shape descriptors including moment invariants [16] and bit masks, four simple shape measures were selected which make use of the segment perimeter and area. The simplest shape measure used by pre-segmentation Argos is fractional fill. This is the ratio of the area of the segment to the area of its minimum bounding rectangle. Another measure of shape is called compactness. This is the ratio of the perimeter squared to the segment area. Both of these measures distinguish compact segments from "loose" segments. Two other shape descriptors that are used are orientation and elongation [28]. These are derived from the first moment of the Fourier transform and indicate the angle of an elongated segment and the amount of elongation. The shape measures used by pre-segmentation Argos are matched with the same weighted Euclidean distance metric that was described previously. The results of the distance metric are used as a penalty in the path likelihood equation, thus allowing shape to collaborate with the other knowledge sources. The other shape descriptor, which is used in unsegmentation Argos, is dimensional shape: the specification of region shape by the use of pixel dimensions along various axes ("region a is from 5 to 10 pixels wide, from 15 to 20 pixels tall, etc."). The axes used are the four directions of adjacency and the information used is the minimum and maximum number of pixels that may appear along these axes. Dimensional shape is well suited to descriptions of regular objects like buildings. However it fails in any attempt to describe complex shape like a skyline. It will be shown, however, that even a skyline can be described to Locus search. Dimensional shape is easy to extract from the hypothesized views but its accumulation can easily lead to useless information if the hypothesized views are taken too literally. Therefore, Argos automatically rejects abnormal range values during the accumulation of dimensional shape. For example, suppose a building has one pointed turret above a wide main structure. Rejection of the abnormally narrow turret point would prevent the dimensional shape description from having a horizontal lower dimension of 1 when there is really only one place in the region that is so narrow. Dimensional shape is easy to implement in the forward pass of Locus search. All that is needed is four counters with every beam entry that specify the dimension (up to that point in the raster scan) of that PPE. When a PPE transits to a different PPE and the former one has insufficient dimension, the transition is penalized. Similarly, a PPE that transits to itself too often and exceeds the maximum dimension is penalized. Like location knowledge, the penalties become more severe as the distance from the dimension limit increases. Shape measures occasionally miss an important feature of a segment because they describe shape in such abstract terms. There is a very simple solution to this problem which requires that all regions with complex shape be broken down into multiple regions that are less complex and are adjacency constrained to form the overall shape. Of course, this technique requires extra network space and search time, so it should only be used for pathological shapes. However, it does allow arbitrarily shaped objects to be described.
3.5. Size KnowledgeSize is usually separate from shape. However, the dimensional shape measure used by Argos also specifies information about size. It is still possible to use size knowledge even in conjunction with dimensional shape knowledge. This is done in unsegmented Argos by defining size as an upper limit on the number of pixels in a region. PPEs that transit to themselves too often will violate the size limit and penalize their path likelihood. Unfortunately, size is not effective for Argos when labeling city scenes. The following explanation is offered. When the location and dimensional shape of a region have already been specified, size knowledge serves only to prevent the region from "filling out" its prescribed area. Since Locus works in raster scan from top to bottom, only the excess pixels near the bottom of the region will be penalized because they are the first pixels that exceed the maximum size. Therefore, size knowledge is useful in defining regions that are large near the top and get smaller near the bottom because it penalizes the bottom area pixels and forces recognition to "slim down". A quick look at the pictures in Figure 1, however, will convince anyone that most of the regions in this domain are shaped exactly the opposite: large on the bottom and small on the top. Therefore, size knowledge only serves to reject regions that would otherwise be correctly labeled. There are other ways to implement size knowledge that have not yet been explored. Size can be used in the backtrace to resolve conflicts in much the same manner as shape. In addition, it is possible to implement relative size knowledge that is able to make statements like a is twice as large as b. The implementation of this would require that the knowledge network have access to registers during the search in a similar manner to Augmented Transition Networks [39].
3.6. Knowledge ConstraintThe knowledge sources that are used in Argos are not used independently. They are all descriptive aspects of regions, so they work together. Argos has location, dimensional shape, and size implemented. The search is more effective if these three are constrained to a hierarchical scheme where lower knowledge sources in the hierarchy are only accumulated if higher knowledge sources are not being violated. The highest knowledge source is location because it constrains region identification most; the lowest knowledge source is size. Implementation of this hierarchy implies that dimensional shape information be accumulated only at pixels that fall within the minimum bounding rectangle of the PPE. Similarly, size information is only accumulated at pixels that fall within the dimensional shape bounds. The need for this hierarchy is apparent because there is no point in including incorrectly positioned pixels in the computation of shape: it can only damage the shape measure. Similarly, inclusion of pixels that violate shape bounds will adversely affect the use of size knowledge.
3.7. ConclusionSome forms of knowledge can be used with Locus more effectively than others. Since Argos is designed to be efficient, it does not attempt to incorporate knowledge schemes that are unnatural for Locus search. However, it is still felt that any knowledge can be used in some form. This section discussed region knowledge that can be put into a network. The next section discusses how to make a series of networks that bring more complex knowledge to bear on the labeling of images.
4. KNOWLEDGE HIERARCHIESUp to this point, all discussion about Argos has centered around the labeling of Pittsburgh city scenes. What about scenes of other cities such as New York? What about non-city scenes? There is a continuum of knowledge in the world and Pittsburgh is just one piece of it. The assumption that has been made is that each micro-world requires a separately designed knowledge network. This chapter is concerned with the techniques that can be used to label arbitrary scenes without the need for manual selection of networks. Although few of these techniques have been implemented, they provide some direction for future research. The first thing that should be pointed out is that there is a reasonable upper limit to the amount of knowledge that a network can hold. Although current computer technology imposes its own limits, there are theoretical upper limits that should also be observed. Therefore, it is not reasonable to build one large network with all available knowledge. In particular, the continuum of knowledge can be divided into hierarchical levels and a network should always bridge two of these levels. This bridging effectively applies knowledge that has been acquired at a higher hierarchical level to the acquisition of knowledge at a lower level. For the purposes of this discussion, all image knowledge will be divided into three hierarchical levels. The top level is the scene level and contains all of the scene types that can be distinguished by the image understanding system. Another way to think of this level is as the schema or frame level. Below this is the viewpoint level which is concerned with the angle and distance of view within the given scene. At the bottom is the object level which, given the environment and viewing position, concerns itself with correct identification of the objects in the scene. These levels of knowledge are not the typical levels that other image understanding systems recognize [4, 32, 38], that is, pixel within object within region within scene, etc. Locus networks combine those "minor" levels into a uniform network structure. Argos hierarchies are organized along general-to-specific lines (scene to view to object). The hierarchy spans the full depth of knowledge and is organized to extract a varying level of understanding from the entire image. The three levels of hierarchy that Argos recognizes span quite a lot of knowledge, especially the top and bottom levels which are infinitely extendible in their respective directions. For example, the object level can attempt to identify the buildings in a city scene or, at a lower level of detail, all of the windows in all of the buildings or, at a higher level, just the major features of the scene such as sky and mountains. Similarly, the scene level can identify a scene as "city of Pittsburgh," "city," or just "outdoor scene." There are even many distinctions to the middle category (viewpoint) such as view angle and view distance. The rest of this section discusses how Argos could use hierarchies of knowledge to completely identify scenes. At the end is a discussion of where the system currently stands in its hierarchical use of knowledge.
4.1. Today Pittsburgh, Tomorrow the WorldThe ultimate source of knowledge is the internal model. This model is an actual three-dimensional description of the scene to be viewed. Since the assumption behind the use of knowledge hierarchies is that there is too much knowledge to be completely specified by one network, it will be assumed that the model is very large, perhaps a representation of the entire United States. Before discussing the use of very large models, it is appropriate to mention their creation. This paper will not address the issue of automatic model generation or learning, but it would be possible to attach a feedback loop onto Argos which would use labeled knowledge to modify the existing model. More sophisticated learning systems could build additions to the model solely from examples found in the existing structure. Since learning is outside of the scope of this paper, it will be assumed that the scene models are built by hand. Hierarchical use of model knowledge always starts at the most general end of the hierarchy with identification of the scene type. Even humans start at this point by first determining the "gist" of the scene [2]. This scene type identification process involves running the image through a pass of Locus search with a knowledge network that contains only the major components of each scene type in the model. For example, if the model contains twenty cities, then the knowledge network used to identify the correct city might contain PPEs only for those major regions of each city that distinguish it from another. Pittsburgh would have river and mountain PPEs and perhaps one for the U.S. Steel Bldg which is a landmark. Other cities might employ ocean, countryside, or even smog PPEs. Even the shape of the skyline can be used to distinguish cities. Once the search is run on this network, it is easy for the machine to examine the labeled output and determine the selected city. With city knowledge in hand, the machine can generate a more detailed network of the identified city which explores the scene at a lower level in the knowledge hierarchy. Note that there is no restriction on the number of Locus search iterations that can be done at a given level of hierarchy. If the world model contains 2000 cities, it might be reasonable to break the scene-type hierarchical level into two passes of Locus search; one using a network to classify them into major types (cities with tall buildings, cities near water, etc.) and then another pass with a more detailed sub-network to select from the identified types of cities. Locus networks, therefore, form a tree structure that spans the knowledge hierarchy. In less complex world models it might be possible to pre-compute all of these networks and store them for possible use in the hierarchy traversal. In more complex models, these networks must be generated by the machine at each decision point in the identification process. Extremely complex models have the interesting property that the lower (more specific) levels of knowledge networks all look the same. For example, once the proper city and building has been identified, the details of the building are very similar to the details of any other building. These models no longer have a tree structure to their knowledge hierarchy: it is now a graph structure that re-combines knowledge paths at the lower levels. Add to this observation the fact that there is bound to be noise and error in the graph searching process, and a fascinating observation is made: Locus search can be used to guide the selection of knowledge networks. All of the efficient, non-backtracking search can be used at a much higher level: the Primitive Knowledge Network level. So Locus search can be used both for labeling pixels within images and for labeling images within world models. Is there a higher dimension that Locus can run on? It is possible to envision hierarchies of world models which vary according to the intent of the model creator. A city planner would build a world model that reflects the populated vs. unpopulated areas of cities because the planner is concerned with growth trends. A military tactician would build a model that contains only strategic points, be they telephone switching centers or harbor docks. So there is a higher level of knowledge that sits on top of the hierarchies discussed in this section. This level deals with the actual semantics of an image and can indisputably be called "image understanding."
4.2. Knowledge Used by ArgosNeedless to say, Argos is not able to analyze any scene in terms of arbitrary levels of understanding. In fact, Argos is barely able to transmit results from one level of the knowledge hierarchy to another. It does demonstrate the use of hierarchical knowledge with a simple task. The task that was selected was highly dependent on the images that were available. Since the fifteen images were all of Pittsburgh from five different views, the obvious hierarchical tasks were view angle identification followed by object identification. View angle identification consists of building a network with many views of the city. This task used 24 views at 15° intervals around the city. Prior to network construction, the number of significant objects in the city was reduced from 58 to 16 (see Table 2). This is because the view angle task is more general than the knowledge base which was selected for the object identification task. The selection of 16 PPEs was done automatically by examining the 24 machine-generated views of the city and counting the number of times each building contributed to the skyline. Any building which formed part of the skyline in over 50% of the views was kept as a separate PPE. The other buildings were eliminated from the list and merged into the "Miscellaneous Buildings" PPE. To determine the angle of view for each photograph of the city, Argos labeled each image and extracted the most popular angle from the labels (note that each label includes information about the original view or views from which it came). Argos typically selected one or two of the 24 angles and these values were in error by an average of 41°. The next level of the knowledge hierarchy is the object identification level. Presumably, Argos should be able to use the predicted angles from the view angle runs and use that knowledge to improve the object identification. A number of schemes were tried, none of which yielded significant improvement in the object recognition. The first experiment in knowledge hierarchy traversal involved building new networks using the those hypothesized views that were predicted by the view angle task. This sounds like a good thing to do, but it constrains the search too tightly. Also, if there is any error in the predicted view angle, then the search is totally destroyed because the global path is guaranteed to be missing from the knowledge network. The next step involved using the error of the view angle task as a factor in building the object identification networks. Instead of networks built only from the predicted view, Argos built networks from the predicted view plus a range of 45° on either side (three 15° intervals). Thus the average view recognition error was used to determine a range of hypothesized views in the object identification task. This still refused to yield worthwhile results because the network used in object identification needs to be rich in transitions and this one has a small branching factor. A completely different approach was then tried. The predicted view angles were implemented as a knowledge source that penalizes transitions to PPEs that are not from the proper view. For example, if a photograph was estimated by the view angle identification task to have been taken from 120°, then the object identification task would penalize transitions to PPEs that are not tagged with angles between 75° and 165° (remember the 45° error). This technique not only yields the best results, but it has the added advantage that the object identification network does not have to be re-built for each image. The only disadvantage is that the object identification network is fairly large since it must contain knowledge about the entire view angle identification task. The surface has only been scratched with respect to hierarchies of knowledge. Future results can be expected to be much better. The next section discusses the precise environment of the testing that was done with Argos. In spite of many noisy and inaccurate sources of knowledge, Argos performs well in the labeling of real-world scenes.
5. EXPERIMENTSA series of experiments were made with the Argos image understanding system. Written in the SAIL language and running on a PDP-10 computer, the system has three main parts: the image subsystem, the knowledge subsystem, and the match subsystem. The image subsystem of Argos consists of a series of programs that reduce large images to a workable size and accumulate useful feature vectors for each pixel. In addition, this subsystem allows interactive segmentation and labeling of the sensed scenes for use in result evaluation and for the extraction of feature vector templates. The knowledge subsystem creates networks. It contains programs for building the city model from maps, sketching views of the city model, extracting knowledge from the views, and building networks from the extracted knowledge. The match subsystem is the program that uses Locus search to label images. It takes signal and signal-to-symbol data from the image subsystem and matches that with networks from the knowledge subsystem to generate labelings of images. In the beginning, there were fifteen pictures taken from five different vantage points about the city. These images were segmented and labeled by untrained humans (i.e. people who were not familiar with Argos and who, in many cases, were not initially able to identify the buildings in each picture). The purpose of this human labeling was quite simple: a metric with which to measure of the quality of Argos labeling. Once the images were labeled, a set of 58 PPEs was selected for the object identification task. At this time, seven of the fifteen pictures were chosen as training data and were used to extract signal-to-symbol information. Table 1 lists the 58 PPEs and which of the images they were trained on. It was decided that the training images would be the only ones used in all subsequent experiments. It would not be until Argos was completely tuned that the other eight images, the test images, would run. Next came the knowledge engineering step. A map of the city was digitized to produce the computer map in Figure 10. Elevation information was obtained very crudely by telephoning building superintendents and various architects. Note that this inaccuracy, as well as other inaccuracies in Argos, was not refined for optimal results both because of the non-production environment of Argos and the ability of Locus to handle noisy data. This three-dimensional model of the city is constructed entirely of planar surfaces. Although there are more sophisticated representation techniques [24, 29] this technique serves Argos adequately.
With a machine model of the city, "hypothesized" views could be made (see Fig. 11 for an example). The amount and type of views varied with the task. Region adjacency, location, shape, and size knowledge was extracted from the hypothesized views and a knowledge network was built. Argos was now ready to run.
The system has many parameters which must be tuned for best performance. Here are some experimental results.
Once tuned, Argos labeled the seven training images and eight test images. These runs used unsegmented images and performed an object identification task. Prior to labeling evaluation, all of the output of Argos is smoothed with three separate operations. The first step is simple smoothing [9] which takes any single pixel that is surrounded by many pixels of a different label and changes the label on the center pixel to that of its environment. The second step throws out any region with less than eight pixels. These pixels become unlabeled because they do not form a large enough area to have meaning. The third step fills in any unlabeled holes in a region. The unlabeled area must be completely surrounded by pixels of the same PPE to be filled in with that label. Since there are no "containership" relationships in the Pittsburgh images, this operation serves only to fill gaps in the labeling. The overall average labeling quality is 73% correct by area. If the two close-up scenes (pathological views that were not properly trained) are removed from these statistics, then the training scenes achieve 81% accuracy and the test scenes achieve 79% accuracy. Overall accuracy on thirteen of the Pittsburgh city scenes is 80%. Figure 12 shows a sample labeling output.
The version of Argos which uses pre-segmented images ran with many of the same parameter settings as the unsegmented system. The task, however, was different. Instead of training on the existing images to obtain high labeling accuracy, the pre-segmented system implemented the two-level knowledge hierarchy discussed in the previous section. For each image, there is a view angle identification run and an object identification run. The view angle identification task had an average recognition error of 30° on training images and 51° on test images with an overall average error of 41°. The object identification was comprable to the unsegmented system.
6. DISCUSSIONThe graph structure representation of Locus is a natural outgrowth of work in languages [1], graph representations [15], and syntax directed pattern recognition [8, 14, 23]. The Locus approach principally differs in the use of the network representation. It rejects the notion that image recognition is best viewed as a problem in parsing. Given the error and uncertainty associated with the decisions, the problem tends to be not one of deciding whether a given pattern is parsable but rather one of search, i.e., deciding which of the many acceptable alternative paths represents the near-optimal interpretation. The view that the problem of image recognition is one of constraint satisfying search has been gaining increasing acceptance [31, 35, 37]. Locus also subscribes to this viewpoint and differs mainly from the other efforts in the representation of constraints and the method of search. The realization that it is important to introduce some measure of the degree of uncertainty into the interpretation process is reflected in the papers by Fischler and Elschlager [12], Feldman and Yakimovsky [11], and the probabilistic relaxation methods [31, 35]. Locus is able to handle search in the presence of error and uncertainty in a natural and straightforward manner provided all knowledge and constraints are represented in terms of a graph structure. The best-first search given by the A* algorithm [25] and the breadth-first graph search of the dynamic programming algorithms [6, 19] provide alternative approaches to optimal graph search problems. The beam search technique of the Locus model provides a minimal effort near-optimal solution. It appears to be effective in cases where the evaluation function is dependent on an external signal source and where a large number of decisions are related to each other as they attempt to provide alternative interpretations of the same signal segment.
7. CONCLUSIONArgos is important for a few reasons: the modification of Locus search for image understanding, some techniques for automatically acquiring scene knowledge, and the development of a working image understanding system. Each of these contributions encompasses a number of smaller results and observations. Modification of Locus involves a heuristic to handle the two-dimensionality of images. Implementation for the speech task used a linear search tree that could be built and backtraced in a well-defined manner. The first-order Markov nature of Locus allowed each element of this tree to be dependent on the one previous element in a strict recursive manner. In Argos, this first-order Markov assumption must be changed by the complexity of the signal. Locus now becomes an "adjacency first-order" Markov system that uses all surrounding elements both on the forward pass and the backtrace. Modification of the forward pass involves the inclusion of all adjacency directions in the path likelihood equation. This is done by averaging each direction's contribution. In images that are broken down into a grid of pixels, there are four neighbors to be considered (assuming a raster scan and physical adjacency of all neighbors). Images that are segmented into varying sized and shaped areas have a varying number of neighbors located at arbitrary positions in the search tree. The backtrace also runs into issues of image topology. Because there are a varying number of neighbors in the forward pass, the backtrace follows a varying number of paths as it climbs the search tree. These paths frequently join together with differing results, generating conflicts in the global path. A number of local conflict resolution techniques were explored and some global methods were proposed. In addition to the basic Locus search, this paper has explored the addition of various knowledge sources. Using a three-dimensional model to generate hypothesized two-dimensional views, Argos is able to extract object location, size, and shape. In addition, it is possible to hypothesize the two-dimensional views in a hierarchical order starting at scene selection and proceeding to viewpoint selection and finally object identification. It is believed that this hierarchy can be used to apply massive amounts of knowledge to the interpretation of images. The final result is, of course, the output of Argos. The system was trained on seven images of downtown Pittsburgh. Another eight were saved for test purposes after the system was tuned. The training images were labeled with only 18% error on a pixel basis. The test images were 22% incorrect. In addition, Argos was able to correctly identify the major components of almost every image (only three of the fifteen showed a lack of "understanding" by the system). In a view angle selection task, the system identified the angle of view with an average error of 41°. Argos has therefore demonstrated the applicability of Locus search to the image understanding task. Although much work lies ahead for image understanding by computer, experience with Argos adds another step to our understanding of the application of knowledge, and ultimately to our understanding of human vision and the human brain.
REFERENCES
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||