This report is an updated version of the 1978 doctoral
thesis with the same name.
It includes additional research on the use of
pre-segmented images and hierarchies of knowledge.
This work was supported by the Advanced Research
Projects Agency of the Office of the Secretary
of Defense under contract number F44620-73-C-0074
and is monitored by the Air Force Office of
Scientific Research.
ARGOS is an image understanding system. It builds a three-dimensional model of the task domain and uses hypothesized two-dimensional views of the model to label images. It currently achieves less than 20% error by area when labeling real-world (city of Pittsburgh) photographs with a knowledge base of over fifty objects. In addition, the system can determine the angle of view around the city with approximately 40 degrees of error.
The labeling technique used by ARGOS is called Locus search. Locus is a non-backtracking graph search technique in which a beam of near-miss alternatives around the best path are extended in parallel through the graph. After the graph has been searched in breadth-first order, the beam of possibilities is examined in reverse order to extract a near-optimal path. This path defines a labeling of the image and is only sub-optimal because of the pruning heuristics used in the beam creation.
Locus search has been used in the interpretation of speech (Lowerre, 1976). Its implementation in the image understanding task requires major modifications due to the non-linearity of the signal. Instead of implementing a form of a first-order Markov search which relies on only one previous node in the beam (as the speech system does), ARGOS implements an "adjacency first-order" Markov system that relies on all surrounding nodes in the physical image.
This thesis formulates image understanding as a problem of search; shows how Locus search can be used to label images; describes the many sources of knowledge used in the interpretation; shows how knowledge represented as a network can be used to constrain the search; explores extensions to the use of knowledge; and presents the experimental results of ARGOS. Its main contributions are the demonstration that Locus search can be used for image understanding and the exploration of issues involved in this use.
Many researchers use this space to thank the countless and otherwise nameless people who have helped in the completion of their thesis. Parents, wives, and children are high on the list along with fellow researchers. But this very important space is often used to thank first-grade teachers, janitors, and summer camp counselors. Occasionally you get a guy who thanks his rabbit.
In my case, there is only one person to thank because that person's efforts were of major significance in producing this thesis. To thank anyone else would reduce the importance of this person's efforts. He was responsible for every stage of my graduate education and will command my respect for years to come. And so, before moving on to the thesis, I have only one thing to say. Thank you, Raj.
his thesis is about ARGOS, a computer program that can
understand images.
The term "image understanding" means that the
program can identify the major components of a scene
by using knowledge about the structure of
the scene.
For example, ARGOS is able to use a map of downtown
Pittsburgh to aid in the identification of the
major buildings, rivers, etc. in photographs of
the city.
The technique that ARGOS uses to identify, or label, these images is called Locus search. Locus is a powerful search technique that uses a recursively defined evaluation function to scan an image while using networks of knowledge about the scene to constrain the search. After the image has been scanned once, Locus is able to extract picture labels from the results of the search.
One of the significant features of Locus search is its ability to use many diverse sources of scene knowledge when labeling images. Not only is it able to use information about the positions of the buildings, but it also knows their size, shape, and environment. In general, it is expected that Locus search can be a useful technique in applications where knowledge is used to attain goals.
This thesis, therefore, has two contributions. First, it presents a detailed discussion of Locus search as it applies to the image interpretation problem. Some of the new issues that arise from this implementation are the modification of the Markov assumption used by Locus and the hierarchical application of knowledge networks. These changes are necessary due to the topological complexity of the signal (i.e. images are two-dimensional) and the immense amount of knowledge needed by a complex vision system. The second and more significant contribution is ARGOS: a working image understanding system that can deal with real-world scenes. ARGOS is also important because it works from an internal three-dimensional model of the task environment.
Image understanding is defined as the application of knowledge to the task of interpreting an image. This application consists of finding a match between the knowledge and the image. Once the match has been made, the image is both interpreted and understood. It is interpreted because it has been linked to the knowledge and therefore is labeled with knowledge-based descriptions. The image is also understood because it is possible to extrapolate the interpretation knowledge.
For example, ARGOS has knowledge about Pittsburgh. To understand a new image, it must find a match between that image and the Pittsburgh knowledge. This match will consist of pairings that link areas of the image with objects in the knowledge (i.e. "the area at the top is sky, the area at the bottom is Monongahela River, the area to the left of center is Hilton Hotel, etc."). With the results of this match, it is possible to delve farther into the knowledge and make statements that are not explicitly in the image ("this view is from Mount Washington"). It is this ability that makes ARGOS an image understanding system.
In general, the image interpretation process has three components: the image, the knowledge, and the match. The image is often refered to as the signal and the knowledge is the symbol. The match is therefore a signal-to-symbol match. The next section examines a number of image interpretation systems in these terms.
The table below lists a few image interpretation and image understanding systems in terms of their signal, symbol, and match. Each system is described more fully in the next sections.
| Who | Signal | Symbol | Match |
| Barrow and Popplestone | Segment | Relational Network | Heuristic Search |
| Fischler and Elschlager | Segment | Relational Network | Linear Embedding (Dynamic Programming) |
| Feldman and Yakimovsky | Pixel | Relational Probabilities | Best-First |
| Tenenbaum and Barrow(IGS) | Pixel | Relations | Relaxation |
| Sakai et. al. | Segment | Semantic Network (Knowledge Blocks) | Data-driven Search |
| Keng and Fu | Bit Pattern | Grammar | Bottom-up Parse |
| Perkins | Lines and Curves | Lines and Curves | Brute Force |
| Waltz | Lines and Shadows | Junction Dictionary | Relaxation |
| Mackworth (MAPSEE) | Lines | Relational Network | Relaxation |
| Ballard et. al. | Lines, Segments, Pixels, etc. | Semantic Network | Arbitrary |
| Williams et. al. (VISIONS) | Segments | Semantic Network | Best-First |
| Rubin (ARGOS) | Pixel or Segment | Relational Network | Beam Search |
Barrow and Popplestone (1971) divide their images into segments. A coffee cup, for example, is broken down into three segments: the outside of the cup, the inside of the cup, and the hole in the handle. These three segments compose the signal description of the cup.
The symbolic knowledge consists of a relational network. Nodes in the network are region names and arcs that connect nodes are "facts" about the regions. For example, a knowledge network that describes the previously mentioned coffee cup will have three nodes for the inside, outside, and hole regions. The arc which connects the outside region and the hole region will contain the fact that the outside region completely surrounds the hole region. In addition, each node contains descriptive knowledge which enables the system to help identify the proper segment (i.e. the hole region is a small and highly compact segment).
The most obvious match process requires the evaluation of all of the segments against each node in the knowledge network. Even with only three segments and three nodes there is a possibility of 33=27 different ways to interpret the image. Therefore the match technique is important not only for accuracy but also for space and time savings. Barrow and Popplestone use the heuristic search technique which views the space of segment-node alternatives as a search tree. It then searches this tree using details from the knowledge network to guide the search. Assume that the "trunk" of the search tree branches three ways to indicate the three pairings: (segment 1 is inside), (segment 1 is outside), and (segment 1 is hole). Each of these branches divides three ways to list the possibilities for segment 2 and each of the nine segment 2 branches splits three more ways for the segment 3 options. If, however, segment 2 is surrounded by segment 1 in the image, then the search process need not examine all nine possibilities for segment 2, just those possibilities which satisfy the knowledge constraint of containment. Thus, heuristic search is able to prune large portions of the search tree by examining only the match possibilities suggested by the symbolic knowledge.
The disadvantage of heuristic search is that it still must examine all feasible possibilities. Locus search goes one step further in its pruning by evaluating each path and examining only the most likely of the feasible alternatives.
Fischler and Elschlager (1973) also use image segments and relational knowledge networks. The networks consist of numerical "springs" which connect region nodes. These springs must be compressed and expanded by the match process to find the overall set of labels that has the least spring "tension". The match process uses a form of dynamic programming (Bellman and Dreyfus, 1962) which does not suffer from the combinatorial explosion of tree searching. Their match, called the Linear Embedding Algorithm, prunes all but a constant number of search tree nodes at each level. This match may fail to find the globally optimal interpretation, but it is efficient. In addition, it allows for noisy data by accepting highly "stretched" springs in the final labeling.
The reduction of the search space to a linear growth is the same philosophy used by Locus. The difference is that Locus uses a dynamic threshold to admit the most likely nodes within the fixed growth limit. Therefore, growth of the search space is variable but linearly bounded.
Feldman and Yakimovsky (1974) use relational probability knowledge to guide the isolation and identification of segments. The image starts as a collection of picture elements (pixels) which are organized as a rectangular grid of points. The system combines pixels into segments by breaking down the boundaries between areas of the image that are similar. It uses a probabilistic utility function of the signal and the symbol to compute the most probable boundary to break. This best-first control structure proceeds until a stopping criteria is reached at which time the image is segmented and labeled. The most interesting aspect of this system is its use of Bayesian decision theory in computing the utility function.
Locus does not explicitly segment images. It labels each point purely on the basis of context knowledge. The final labeling may be used to define a segmentation by observing groups of similar labels, but that is only a by-product.
Another difference between this work and Locus is formality. Locus does not follow the rules of decision theory very closely: it prunes and approximates frequently. The result is a pseudo-statistical system that is only formal within heuristic bounds.
Tenenbaum and Barrow (1976) apply relational knowledge to the segmentation and interpretation task. Their system starts by crudely segmenting on the basis of pixel similarity and then making a set of labels for each segment. It then applies iterative relaxation to get a consistent label a refined segmentation of the image. Relaxation works by repeatedly evaluating the junctions of segments and revising segment borders and labels on the basis of relational constraints. Each pass of the relaxation operation revises the list of possible segment labels on the basis of the surrounding segment label lists. For example, if a large segment with the label options wall and door surrounds a small segment with the label options window and doorknob, and one iteration of relaxation eliminates the door option, then the next iteration will eliminate the doorknob option. Thus, global knowledge can propagate through the image as the relaxation iterates. The final labeling is obtained when the relaxation converges and yields no further change to the interpretation.
The problem with relaxation techniques is they do not guarantee convergence and so do not have well defined termination criteria.
Sakai et. al. (1976) label image segments using a sophisticated knowledge network that can contain procedural code and arbitrary knowledge constraints. These "knowledge blocks" are applied first to key segments in the image. Since the knowledge blocks can have control sections, they do the bulk of the search. Thus, instantiation of a knowledge block can lead to the hypothesis, verification, and rejection of other knowledge blocks. The control procedure simply runs through the list of untested blocks until there are no more, at which time the interpretation is complete. Although this system is fairly sophisticated, it suffers from the complexity of its data which must be carefully constructed.
Keng and Fu (1976) have a system which interprets satellite images by matching the binary patterns in windows of the image to templates in a grammar-style knowledge base. Each window of 8x8 points in the image is compared with every known template in the grammar. Windows which evoke multiple templates are reduced with the grammar to a unique label. If, for example, a window contains two intersecting lines, each of which matches a highway template, then the grammar will generate a highway intersection interpretation for the window. The system uses no global constraining and relies heavily on its knowledge of the sensor characteristics of Landsat satellites.
At General Motors, Perkins (1977) identifies parts on a conveyor belt by comparing their outlines to those in the knowledge base. Outlines consist of curves and straight lines which he calls concurves. A scoring algorithm counts the number of concurves in the image which match those in known parts. Since the system looks for relative scores, it can accept noisy data and occluded objects. The system is even able to extrapolate on partial matches and predict the complete placement of occluded parts and parts that are only partially in the field of view. However, the task domain is limited and the system would suffer from combinatorial explosion if it handled complex scenes. This is because it uses no heuristics to limit the search space: identification requires the comparison of all concurves in the image with all concurves in the knowledge base.
Waltz (1975) matches knowledge in a list of rules to an image that consists of lines. In addition, each segment enclosed by lines can be shaded (a shadow) or unshaded. The knowledge base contains alternative interpretations for the various line junction types. Match is done with a form of relaxation that iteratively compares junction interpretations at the ends of each line segment. After the system has converged, the junction interpretations are used to label the surfaces and explain the shadows.
The system works in the blocks-world domain where perfect lighting makes shadows trivial to distinguish. ARGOS uses real-world scenes which are full of shadows that are hard to detect. However, the goal of Waltz's work is to explain shadows whereas the goal of ARGOS is to explore a search technique, so there is no fair basis for comparison.
Mackworth (1977) uses relaxation on a relational knowledge network. His input signal is a "sketchmap" which consists of chains of lines. Some chains are closed and form lakes and islands. Others are open and form rivers and roads. The system initially makes many weak pairings of labels to chains and then refines the pairings using a network consistency algorithm. This algorithm simply re-evaluates the pairing options based on each chain's environment. The system eventually converges on an interpretation of the image. It is interesting to note that an initially noisy image can be interpreted many times by using the final results of the first interpretation to refine the sketchmap. The entire process is then repeated for a better overall interpretation.
Ballard et. al. (1977) use a semantic network that describes the scene at several levels (region relations, pixel adjacencies, line lengths, etc.). Each image is also described at many levels as a collection of lines, segments, volumes, etc. The user of the system codes a query procedure using a control structure of his own choosing which explores the semantic network and builds an instantiation of the network that corresponds to the image. This instantiation, called a "sketchmap" (not to be confused with Mackworth's sketchmaps) is an interpretation that links parts of the image to parts of the semantic network. An example of a query procedure that has been built is a rib matcher for chest X-rays. This procedure uses cost functions to numerically determine the best match of signal-to-symbol. Although it would be possible to code Locus in this framework, very few alternative search strategies have actually been explored.
Williams et. al. (1977) describe an ambitious effort to apply multiple sources of knowledge to a segmented image. Their knowledge base is a semantic network of knowledge sources that are linked hierarchically in the same manner as the Hearsay II speech system (Erman and Lesser, 1975). It deals with the image on many levels of representation including lines, segments, and frames (Minsky, 1975). Control is divided among the many knowledge sources which repeatedly hypothesize and verify signal-to-symbol matches in a best-first search order. There are even knowledge sources that guide other knowledge sources by focusing the system's attention.
Locus attempts to unify many knowledge sources into one so that a single control structure can be applied. It rejects the notion that multiple sources of knowledge must be used independently.
ARGOS uses Locus search to interpret images. Locus was first used in the Harpy speech understanding system (Lowerre, 1976). Harpy uses this technique on the recognition of spoken utterances and is currently the best speech recognition system in existence. Prior to Harpy, Dragon (Baker, 1975) used a breadth first graph searching technique to attain the same goal.
Dragon is the theoretical grandfather of ARGOS. It uses a probabilistic function of a Markov model to find an optimal path through a knowledge network. The technique is very similar to the Viterbi Algorithm (Forney, 1973). Harpy relaxes the formality of the probabilistic function and demonstrates that heuristics can be used to improve the search. ARGOS goes one step further by modifying the Markov assumption so that the multi-dimensionality of images can be handled. Instead of requiring that a first-order Markov system rely on only one previous node in the search tree, ARGOS implements an "adjacency first-order" Markov system that relies on all surrounding nodes in the image.
Before proceeding, it is useful to stop for a moment and examine ARGOS in the light of the signal-to-symbol match paradigm of the previous section. In Locus, the symbolic knowledge is composed of units called Primitive Picture Elements (PPEs). These PPEs are organized into relational networks which specify knowledge about a scene. In addition, PPEs have signal characteristics associated with them so that they may be directly compared with parts of the image. ARGOS does not place any constraint on the nature of the signal, which is why it is able to interpret both pre-segmented images and unsegmented images.

Although the signal is frequently refered to as a pixel, it should be understood that ARGOS can deal with either pixels or arbitrarily shaped segments.
A simple example of PPE selection is in order. Assume that Locus must match a satellite image of a field to a knowledge base which contains information about crop locations. Locus would define each crop to be a PPE so that a relational network could be built describing the crop locations. Similarly, each of the crop PPEs would have signal characteristics that allow it to be matched with the image (i.e. "the alfalfa PPE registers 4 on a brightness scale of 1 to 10"). Thus, PPEs are the common ground between signal and symbol.
The match aspect of Locus is a two-pass search process that explores the space of signal-to-symbol pairings. The first pass of the search, called the forward pass, constructs a highly pruned tree of alternative pairings. Each level of this search tree corresponds to a different pixel in the image, and the various nodes at each level are the alternative PPE labels for that pixel. The selection of tree entries (which are likely PPEs for a given pixel) is based on a recursively defined evaluation function. This function uses knowledge from the network, the image, and the immediate environment of the search tree to determine a uniform evaluation of a signal-to-symbol match. The completed search tree contains a multitude of pixel-to-PPE matches, each of which identifies a path of "optimal" pixel-to-PPE matches above it in the tree. The second pass of Locus search, called the backtrace, simply re-examines the search tree from the bottom up, gathering the optimal pixel-to-PPE matches into a unique labeling of the image. Because of the Markov nature of the forward pass, the backtrace is able to bring global constraint to bear in the final selection of image labels.
It is interesting to note that the search tree built by the forward pass is so highly pruned that it resembles a varying sized beam of alternatives and is sometimes called the "beam". The beam is functionally equivalent to the stack of alternatives generated by standard backtracking search algorithms: they both list the options that are under consideration. However, beam search is superior to backtracking methods because it avoids thrashing behavior.
ARGOS interprets pictures of downtown Pittsburgh. Fifteen of these pictures, reproduced in Appendix I, were taken from five vantage points around the city. To enhance variability, some were shot with a standard 50mm. lens and others were taken with a 28mm. wide-angle lens. Seven were used for training and the other eight were saved for test purposes. In the following table of images, the column labeled "Number of Objects" is the number of different regions that were identified during human labeling.
| Image | Vantage Point | Lens | Number of Objects |
| Training 1 | Northwest | 28mm. | 17 |
| Training 2 | Northwest | 28mm. | 19 |
| Training 3 | Northeast | 28mm. | 17 |
| Training 4 | West | 50mm. | 26 |
| Training 5 | West | 28mm. | 27 |
| Training 6 | Downtown | 28mm. | 9 |
| Training 7 | Southwest | 28mm. | 27 |
| Test 1 | Northwest | 50mm. | 16 |
| Test 2 | Northwest | 28mm. | 17 |
| Test 3 | Northwest | 50mm. | 15 |
| Test 4 | Northeast | 50mm. | 15 |
| Test 5 | Southwest | 50mm. | 13 |
| Test 6 | Southwest | 50mm. | 19 |
| Test 7 | Southwest | 50mm. | 21 |
| Test 8 | Downtown | 28mm. | 7 |
Each image was originally digitized into a rectangular grid of 700 pixels across and 525 pixels down. However, ARGOS interprets reduced versions of those images that are 75 by 100 pixels in size. It can also interpret images that have been divided into arbitrarily shaped segments.
The system runs on a PDP-KL10 computer and requires approximately 100,000 words of 36-bit memory. The largest part of this space is used to store the parallel search paths (typically 25 paths). It takes about five minutes of processor time to label one 75 by 100 image, but less than a minute to label a pre-segmented image.
The next chapter discusses Locus search in great detail. It starts by explaining how image understanding can be formulated as a problem of search. From there, it describes the organization of the knowledge networks that are used in ARGOS. Following that is a discussion of the low-level system in ARGOS: the techniques used to give signal characteristics to PPEs. Finally comes a detailed explanation of the search and an example of the entire process.
Chapter 3 discusses how many different knowledge sources can be used with Locus search. In addition to expounding upon object adjacency and image pre-segmentation, the chapter discusses the use of object size, shape, and location.
Chapter 4 discusses how knowledge can be organized hierarchically to infinitely expand the power of Locus. Most of this chapter is speculation, but it does conclude with a review of where ARGOS stands in its hierarchical use of knowledge.
Chapter 5 describes the initial experimental results that were obtained with ARGOS. The system is currently able to label the fifteen city scenes with 20% error at the pixel level. It can also determine the angle of view around the city with an average error of 40°. Considering the complexity of the knowledge and the images, this is quite good.
Chapter 6 concludes the thesis by evaluating the results of ARGOS.
CHAPTER 2: LOCUS SEARCH |
![]() |
his chapter describes the Locus model of search
as it is used in the Image Understanding task.
Locus is an efficient, non-backtracking search technique
that has proved itself successful in the Harpy speech
understanding system (Lowerre & Reddy, 1977).
It is able to provide near-optimal solutions to the network
search problem
in time that is linearly bounded by the complexity
of the input signal.
This chapter discusses the search technique in detail. The first section describes the representation used by Locus to describe images and knowledge networks.
The next section is a discussion of the knowledge networks that are employed. Coupled with this discussion is an explanation of why image interpretation is treated as a problem of search.
Following that is a discussion of the low-level processing that is done in ARGOS. This thesis is not specifically concerned with the low-level aspect of image interpretation, so the discussion is somewhat cursory.
An explanation of the search process follows next. Since application of the search technique to image interpretation is one of the main contributions of this thesis, the discussion is quite long and contains many examples.
The chapter closes with an example of Locus search.
Image understanding is the application of knowledge to the task of interpreting images. As Chapter 1 pointed out, this process requires the matching of two forms of data: the sensed data which is the signal and the knowledge structure which is the symbol. The signal is the raw form: points of light pieced together to comprise the image. For a computer, the points of light are grouped together into segments or picture elements (pixels) which are describable with a set of numbers called a feature vector. These numbers are taken directly from camera sensors and tell such values as the intensity of red light, blue light, and green light.
The symbolic form of an image is the knowledge representation of the scene. This form of knowledge is much more compact than the signal, so it is easier to store. Symbolic representations take the form of networks of information: the nodes of the networks represent "items" and the inter-connections of the nodes represent "facts" or qualifications about the items.
Before a machine can match a signal to a symbol, it must be able to compare the two forms. Some systems use a third representation to do this (Ballard et. al., 1977). ARGOS does just the opposite: instead of creating a new representation, it combines the existing ones so that the signal and the symbol are all describable with the same units. These units are called Primitive Picture Elements.
The Primitive Picture Element, or PPE, is the basic building block of both the signal and the symbol. Every sensed image can be described with PPEs since each pixel can be given a unique PPE label. Similarly, the general knowledge of a scene (sky above river under bridge, etc.) can be described with a network, all of whose nodes are PPEs. PPEs can be thought of as the smallest units of representation that exist for the micro-world of the image task. Alternately, they can be thought of as the largest object that is both homogeneous to the signal and homogeneous to the symbol. For example, look at training scene 6 in Appendix I. The building in the center is the Pennsylvania State Office Building and it has a lobby that looks much different than the rest of the structure. Proper identification of this building would therefore require two PPEs called State-office and State-office-lobby because, although they are symbolically homogeneous (both State Office), they are not homogeneous to the signal (they look different). Similarly, observe the three cross-shaped buildings, one of which is obscuring the other two, on the right side of in test scene 2. Although all three look the same, they are assigned different PPEs because they are symbolically different (from left to right, they are Gateway Two, Gateway One, and Gateway Three). To sum it up, the PPE is the label that Locus uses when interpreting images.
The choice of PPEs varies with the micro-world being explored. It is dependent on the level of detail in the knowledge base and on the ability of the system to optically distinguish different parts of an object. If, for example, one wishes to determine whether an image is a city scene or an office scene, then neither the sensed data nor the symbolic representation need be very complex. The PPEs sky, building, road, wall, desk, and floor would suffice. Note that each of these PPEs is comprised of optically similar pixels in the sensed image and represents adequate symbolic information to determine the scene type.
If, however, a finer level of scene detail were to be interpreted, then the selection of PPEs would have to be more complex. The following PPEs might be selected for the detailed analysis of a Pittsburgh city scene:
| Hilton awning | Commonwealth Place |
| Hilton conference area | Liberty Avenue East |
| Hilton elevator shaft | Liberty Avenue West |
| Hilton main structure | Liberty Avenue Island |
| Hilton windows | Blvd. of the Allies |
| Hilton doors | Mountains to the North |
| Allegheny River | Mountains to the south |
| Sky | Park |
| etc. | |
For a more complete discussion of PPE selection as it relates to knowledge hierarchies, see Chapter 4.
Given that a set of PPEs is to be used to interpret an image, a number of questions immediately come to mind. Two obvious ones are how the sensed data is to be compared with PPEs and how the PPEs are to be organized as a symbolic knowledge base. More important is the issue of how these two forms are matched. Sections 2.2 and 2.3 discuss how PPEs are organized symbolically and optically. Section 2.4 describes the match process, which of course is Locus search.
Imagine that a network is constructed whose nodes are all PPEs. The network contains an initial node and a terminal node between which all other nodes are somehow connected. Each node is an item in the network and each node connection is a fact about the two items it connects. Therefore, a complete network is a set of facts about PPEs and is called a knowledge network. In addition, a complete path from the initial node to the terminal node is a collection of knowledge that is consistent (i.e. all facts agree with each other). The network might look like this:

Imagine further that for every image which is M by N pixels in size, there exists a path of MxN nodes stretching from the initial node to the terminal node which forms a consistent "statement" about that image 1 The existence of a PPE node in a given position along the path is equivalent to a label assignment of that PPE's symbolic name to the pixel in that path position. The path, therefore, forms a symbolic description of the image, taken from the knowledge network.
Since a PPE path corresponds to a set of labels for an image, the image interpretation problem reduces to a search task. Knowledge for this search task comes from the network connections which guide the path selection. For example, assume that there are 50 PPEs in a network and that each one is connected to all of the others, yielding 2500 node connections. This represents no constraint on the network paths and therefore no knowledge. The network would then consist of an initial node, a terminal node, and one node for each of the 50 PPEs. If a sensed image which is 75 by 100 pixels in size is matched to this network, then the path through the 50 PPEs can take any of 50 choices at each step through the image. The number of different paths through the network would be 507500. If one of the 2500 node connections is removed, then the number of possible network paths is reduced. Knowledge, therefore, appears in the form of constraints on the interrelations of PPEs.
How are these constraints selected? One way is to build up the set of interrelations from a set of "expected" images. Since each network path corresponds to a legal image, the converse must also hold: each legal image has a unique network path. It is this converse form which defines how knowledge is encoded as a set of network constraints. Each image in the expected micro-world is combined into an overall knowledge network. If the expected images are all of downtown Pittsburgh, a city which is surrounded by mountains, then all images of the city are going to have a backdrop of mountains below the backdrop of sky (except for pathological viewpoints which can be ignored in this example). This simple piece of knowledge manifests itself directly as a network constraint: the sky PPE may not adjoin any of the building PPEs because the mountain PPE intervenes.
In addition to specifying legal adjacencies, the network arcs can contain information about the kind of adjacencies. An arc can specify that the PPEs which it connects are vertically or horizontally adjacent. It can also specify more complicated relationships such as "containership" and some other knowledge such as object size. However, as the discussion of shape in Chapter 3 will show more fully, most complex relationships can be built from a series of simpler constraints.
The use of networks to store knowledge is nothing new. Semantic networks have been used for years to just such end. In fact, one might think that these PPE networks are nothing more than semantic networks. This last, however, is not true. Semantic networks have multiple levels of hierarchy which define knowledge. A semantic network of a city scene might look like this:

In the above semantic network, all depth relationships (such as the one between backdrop and sky) are "within" by default and all nodes at a given breadth level (such as the one between backdrop and buildings) are unrelated unless specified. PPE networks however, have no hierarchy of "within". All relationships must be specified explicitly. A Locus version of the above network would be represented with the seven PPEs taken from the lowest level of the semantic network. Notice that the layout of the PPE constraints implies that the top of the image is on the left "side" of the network:

The PPE network differs in two ways from the semantic network. First, all information about a node is encoded at the node site, and not at some "higher level" of the network. This is because there is only one level of the network. It might seem that this causes each node to use excessive amounts of storage, but in practice, it only requires a few extra pointers at each node. The second difference between PPE networks and semantic networks is size. Since semantic networks can share low-level descriptions, they can often save space with common sub-networks. PPE networks cannot do this because of their uni-level structure. As a result, PPE networks tend to be quite large. This is one of the reasons that world knowledge is broken down into multiple networks instead of being placed into one knowledge structure (see Chapter 4).
Why are PPE networks structured so as to make them excessively large? The answer is found in the classical tradeoff of space for time. Single-level networks are much easier to search and PPE networks are no exception. What Locus loses in space, it gains in time while searching the PPE network. Search of a Locus network proceeds in the same way as images are laid out. Semantic networks do not have this explicit search order built into them, so they take longer to search because the control structure must be smarter.
Before the search process can be discussed, there is one more background issue that must be covered: the direct comparison of the sensed data to PPEs. This comparison is at the low-level end of vision research and is critical to any high-level system such as ARGOS. However, since this thesis is concerned only with the high-level end, the discussion presented here contains no new research. In fact, it does not even pretend to represent the optimal techniques.
Implicit in the definition of a PPE is the iconic homogeneity factor: every pixel 2 that belongs to a PPE class has similar sensor parameters. For example, the red, green, and blue sensors could be used to define the sky PPE as a feature vector with a high blue component and low red/green components. Then, every pixel which meets these specifications could be classified as sky. These two steps (defining feature-vector templates and classification of pixels from the templates) comprise the low-level system of ARGOS.
It is interesting to note that in experiments where this low-level system was replaced by a "perfect match" the overall labeling error was cut in half. This perfect match took its knowledge directly from the labels that are used in evaluating results. Therefore it performed even better than a human would: it knew about quirks and errors in the evaluation process. However, this experiment still served to show that a better low-level system would improve ARGOS results.
In vision systems today, there is a glut of information available to describe an image. The images that ARGOS works with come digitized as 525x700 points of light. Each point is described by three 8-bit values from red, green, and blue sensors. Using red, green, and blue, it is possible to obtain hue, intensity, and saturation, or, as the color television industry in America does, obtain Y, I, and Q (Price, 1976). Both of these derived forms cover the color spectrum. Every point of light, therefore, is available as 9 different sensor values.
To simplify labeling, ARGOS does not work with 525x700 images. Not only is it prohibitively expensive to search trees with over 300,000 depth levels, but the interpretation being sought does not require such detail (see the list of PPEs in Appendix II for an understanding of the desired level of detail). ARGOS reduces each image before labeling. When the system is working with segmented images, it typically divides the image into as many as 100 arbitrarily shaped segments (see Appendices VII and VIII). When the system is labeling a regular grid, each image is reduced by a factor of 7 in each dimension before being processed. This reduction to 75x100 images makes the search more manageable. It is the largest reduction that can be performed without damaging a human's ability to recognize the images at the desired level of detail. In addition, the 7x7 window allows the texture operators (described below) to distinguish many of the buildings from each other.
The selection of feature vector templates now has two aspects: the choice of sensors and the choice of reduction operations to perform on the points of light in each window. The reduction operators that were explored were: mean, mode, median, standard deviation, Z.C.C. (zero crossing count: a micro-edge detector from Price, 1976), and contrast (a function of the 4th moment of the distribution curve from Tamura et. al., 1977). The list could go on forever but this thesis cannot. With 9 sensors and 6 possible operators on each sensor, the feature vector has a potential size of 54. Space limitations in ARGOS reduced that number to 6.
There is no adequate justification for the limit of 6 elements in the feature vector. For that matter, there is no proof that 6 are needed. The selection of the feature vector components was done experimentally by evaluating the labeling quality of each component on a training image. Although more formal techniques exist (Tou and Gonzalez, 1974) this method was chosen for simplicity. It was also felt that more robust selection criteria were not needed in the non-production environment of this thesis.
Feature vector selection proceeded as follows: a training image was labeled by an unbiased person, feature-templates were generated for each sensor/reduction pair, then each feature-template was tested on its ability to reproduce the initial labeling. The table below shows the percentage of the image that was labeled correctly for each reduction operator and each sensor. Note, for example, that the mean value from the blue sensor was the most accurate because when each PPE was classified solely in terms of that template, 44% of the label assignments made were correct.
| Z.C.C. | Mean | S.D. | Mode | Median | Contrast | |
|---|---|---|---|---|---|---|
| Red | 29% | 22% | 30% | 22% | 29% | 9% |
| Green | 27% | 22% | 31% | 24% | 36% | 8% |
| Blue | 22% | 44% | 40% | 37% | 42% | 8% |
| Hue | 23% | 7% | 18% | 27% | 42% | 11% |
| Intensity | 26% | 23% | 31% | 26% | 36% | 8% |
| Saturation | 24% | 23% | 24% | 21% | 19% | 7% |
| Y | 31% | 21% | 30% | 22% | 35% | 8% |
| I | 30% | 20% | 26% | 17% | 28% | 11% |
| Q | 22% | 30% | 37% | 31% | 29% | 7% |
From the above table, plus additional information about how well each sensor did for various parts of the scene, six feature vector elements were chosen. It was decided first that since Hue/Intensity/Saturation and Y/I/Q are simple re-combinations of Red/Green/Blue, the former could safely be rejected in favor of the latter. This is also supported by the fact that humans perceive color as Red/Green/Blue so ARGOS should do the same if it wants to best mimic human perception. The next choice is among the operators. Median, a good low-texture operator, performs best according to the above table. Since city scenes have many high-texture objects, some other operator is also needed.
It is interesting to note that the statistics in the above table are of no help in selecting a high-texture operator since the test scene had much low-texture area, thus causing high-texture operators to score badly. In particular, it was noticed that contrast was best at labeling the high-texture areas, even though the above table indicates that it is the worst operator. Therefore, the final feature vector selection contains these sensors and operators:
| Median of Red | Contrast of Red |
| Median of Green | Contrast of Green |
| Median of Blue | Contrast of Blue |
One additional consideration must be given to the feature vector templates: they should be able to accurately describe objects that have multiple appearances in the training images. For example, assume that two images show a building alternately in the shade and in direct sunlight. A single feature vector for these two views would attempt to combine both appearances and, in so doing, destroy the ability of the feature vector to label either view. Although more sophisticated image operators could be used to distinguish shadows, the point here is that objects may vary in appearance and this variance must be handled properly.
ARGOS solves this problem by creating multiple feature vector templates for objects with multiple appearances. The decision to create an additional template is made automatically when the standard deviation of more than two feature vector elements exceeds 20% of their possible range. If, for example, the Median of Blue range is 100, and the standard deviation of five instances of a building is over 20 in the Median of Blue position, then that element of the feature vector is too diverse. If more than two elements are too diverse, the feature vector is split in the middle of the dynamic range of the most diverse element. When training on seven images, ARGOS generates as many as three feature vector templates for each object, although one template typically suffices.
Now that six elements have been selected for the feature vector, a distance metric must be found for the pixel-to-PPE likelihood calculation. A distance metric is simply a way of determining the likelihood that a given pixel in the sensed image belongs to a PPE class. The metric provides what can be called an "optical match" between signal and symbol: it tells how close they are to looking the same. The simplest distance metric is subtraction where the optical match is based on the difference between the PPE's feature vector template and the pixel's feature vector. For example, if the sky PPE has the template values: (red median=0, green median=0, blue median=10, red contrast=0, green contrast=0, blue contrast=0) for the feature vector selected in the previous section (which indicates strong, solid blue) and a sensed pixel has the feature vector (1, 2, 7, 4, 2, 0) then the subtraction distance metric yields the vector (1, 2, 3, 4, 2, 0). An optical match can be computed using a percentage of the worst-case distance vector (which would be (10, 10, 10, 10, 10, 10) in this case).
ARGOS uses a slightly more complex metric that is commonly found in vision systems: weighted-Euclidean distance. Euclidean distance is, of course, the square root of the sum of the squares of the distances. In the above example, the Euclidean distance would be [12 + 22 + 32 + 42 + 22 + 02].5 = 5.8. ARGOS goes one step further: each component of the feature vector template is weighted so that unimportant components will not drag the optical match value down. The weights are a six-tuple which might, for the sky example, be (.18, .18, .1, .18, .18, .18), indicating that the low-contrast (median) blue sensor is less constrained, but all others contribute equally to the likelihood calculations. In this case, the weighting allows the sky PPE to be deep blue or light blue, but no red, green, or high-texture. The weighted-Euclidean distance metric then becomes [(12)(.18) + (22)(.18) + (32)(.1) + (42)(.18) + (22)(.18) + (02)(.18)].5 = 2.2.
The selection of weighting factors is another badly justified algorithm in ARGOS. The problem is that at the root of all selection is a human who must train the machine on each PPE. The training is done by interactively outlining examples of each PPE region. From these regions, the machine is told to select feature vectors for the PPEs. It begins its selection by computing the mean and standard deviation for all pixels in the region. Then, those pixels that are more than 1.5 standard deviations from the mean are rejected. This rejection of atypical pixels allows for human error in the labeling and prunes data points which may be unfairly contributing to the metric. The mean and standard deviation are then re-calculated and the adjusted standard deviation is used as a weighting factor for the adjusted mean (which becomes the template value). The use of standard deviation as a weighting factor on the mean is common practice because it is an indicator of the consistency of the data and therefore the usefulness of the mean value in classifying that data. The conversion of standard deviation to weighting factors is done by inverting the six standard deviation values for a given region and then linearly scaling them so that they sum to 1. Since feature vector components with high standard deviations indicate erratic quality on the part of that component, a high standard deviation will yield a low weight which will reduce the system's dependence on that component.
There is only one step remaining in this labeling process: search. The knowledge environment has been formulated as a network and techniques have been devised for matching the sensed image to that network. The search process must now find the "correct" set of labels for the sensed image.
As the discussion of PPEs mentioned, the combinatorics of labeling an image are tremendous. ARGOS deals with images that are 75 by 100 pixels and contain a minimum of 50 PPEs. If no knowledge about region relationships is available, then the branching factor for a 50 PPE network is 50 (i.e. PPEs can branch 50 ways to all other PPEs) and the number of possible labelings is 507500, or about 1012750. When knowledge is applied, the branching factor reduces significantly. For example, a typical 50 PPE network with knowledge has a branching factor of about four which reduces the number of possible labelings to 47500, or about 104500. Thus, the application of knowledge is the most powerful tool available in reducing the search space. No other technique can reduce the number of labelings by over 8000 orders of magnitude! However, there is still much reduction to be done before a single, optimal path is selected, because even 104500 paths are too many for a computer to examine. Locus search is able to perform that reduction and extract a single labeling from the knowledge-reduced search space.
Finding the optimal path through a graph is a classical search problem with many alternative search strategies (Nilsson, 1971). What distinguishes ARGOS from other systems is its use of Locus search. Locus is a beam search heuristic in which all except a beam of near-miss alternatives around the best path are pruned from the search tree at each decision point. This reduces the exponential number of paths to explore without requiring backtracking or any iterative search.
The remainder of this section describes the basic issues of search which are necessary to the understanding of Locus. More advanced readers will want to skip ahead in order to avoid sleep.
Search involves the creation of a search tree from the signal and the symbol. This creation is a simple expansion of the possible paths through a knowledge network. Just as each knowledge network path has a node for every image pixel, so every level of depth through a search tree corresponds to a different image pixel. Thus, a knowledge network node which is visited ten times in the labeling of an image will appear on ten levels of the search tree.
To fully understand Locus search, a description of some simpler search strategies is needed. Recall the 75x100 pixel image that is matched to a knowledge network with a branching factor of four. The search tree contains 47500 paths which are laid out like this:

A single path must be selected which labels the image. A simple breadth ordered evaluation starts by examining the four possible PPEs at pixel 1 and selecting the best. The search tree then looks like this:

Next, the four choices at pixel 2 are evaluated and the search is reduced further:

After 7500 choices of four pixel labels, the entire image is labeled. This evaluation technique is fast and extremely inaccurate.
Inaccuracies in the above search scheme arise from the order in which PPE possibilities are pruned. When the four PPEs at the first pixel are reduced to one, 75% of the search tree is pruned. At this early point in the tree, it is definitely unwise to prune that many paths. A good path could be pruned simply because it "gets off to a bad start" (i.e. is not well matched to its first pixel). In addition, once a path is chosen, the search is "stuck" with whatever is below that node in the tree. If, for example, the four choices at pixel 200 are all bad, the search is still forced to choose one and proceed.
The typical solution is to add backtracking. Backtracking is a method of rejecting all choices at a pixel and backing up to the previous pixel for a re-selection of the correct path. This backing up can step as far up the tree as it likes. The problems with backtracking are numerous. First, how is it decided that all choices are bad? The machine doesn't know what's on the other paths so it can't be sure. It has already been established that the machine can't look at every point in the search tree, so some absolute measure must be available of the goodness of a path. Another problem with backtracking is time: the search might be near the end of the tree and find out that it has to back-up to the beginning to re-do a faulty decision. This could happen easily in test scene 3 (see Appendix I): while labeling from the top to the bottom, the machine might incorrectly identify the smoke-stack as a building. The mistake would not be discovered until the building at the base of the smoke-stack was scanned. The machine would have to backtrack to the top of the smoke-stack and re-do all of its work. Backtracking can thrash around for quite a long time and cause many pixels to be evaluated repeatedly.
Other techniques add more bells and whistles to the search. State information can be saved to make the backtracking less painful; multiple paths can be explored (to a limited extent); different search-orders can be used; and various evaluation functions can be applied. For the image understanding task, where search trees are enormous, a much better technique must be used.
Locus search is different from standard breadth-first search. It rejects the notion that any final path decisions must be made before the entire image has been scanned. Recall that Locus is a beam search: it steps through the knowledge network and the image selecting many "attractive" paths at each pixel. These attractive paths comprise the beam: a pruned search tree which contains a list of near-miss alternatives around the best path. Beam selection is based on a combination of low-level (signal) factors and high-level (symbol) factors in a recursively defined environment that links these factors in all parts of the image. There is enough latitude in the beam to allow many optimal labelings to be stored. It is only after the entire image has been examined that a single path is selected from the beam. The final labeling is therefore "the best of the best". It is found without time-consuming backtracking and it is highly accurate because it delays decision making until the entire beam has been examined.
Observe the search tree of the previous section as Locus searches it. The tree has a depth of 7500 and a branching factor of four. Locus prunes a varying number of branches at each depth level. The amount of pruning is determined both by beam eligibility and beam capacity. Let us assume that two paths are rejected at pixel 1. The remaining two are placed in the beam:

There are now 8 possibilities at pixel 2. Five are selected for the beam:

Note that it is relatively easy to include many paths in the beam at this early point in the tree. However, as the tree expands, it will be necessary to make severe reductions in the percentage of paths that are saved. When the entire image has been scanned and the beam has been selected, the tree might look like this:

This completes what is known as the forward pass of the search. No unique path selections have been made, yet the entire beam has been built. All that is left is to assemble a final labeling. This step, known as the backtrace, is simply a backwards scan of the paths that were placed in the beam by the forward pass. The backtrace bears no relation to backtracking: its function is to proceed directly through the beam in reverse order without recursion, iteration, search, or any other extended computation. The backtrace is not unlike the painter who, after meticulously laboring on a painting, steps back to observe her work. The forward pass is the painter's effort; the backtrace is her final evaluation.
The forward pass of Locus search proceeds in breadth order through the knowledge network creating a beam-like search tree as it goes. Each level of depth in the tree corresponds to a pixel in the sensed image and each knowledge network PPE at a depth level is a possible label assignment for that pixel. The following sub-sections discuss the various aspects of the forward pass.
The order in which the image pixels are scanned is raster (left-to-right, top-to-bottom). For the 75x100 pixel images, this means that depth level 1 of the search tree corresponds to the pixel in the upper-left-hand corner, depth level 100 is the upper-right-hand corner, depth level 7401 is the lower left-hand corner, etc. For pre-segmented images, the segments are ordered by raster according to the position of the segment centroid 3. This order of search may appear at first to be detrimental to the quality of labeling. It will be shown, however, that when using Locus, the order of search is of minor importance as long as the backtrace follows the reverse order of the forward pass.
Recall that each knowledge network has an initial PPE and a terminal PPE. For the two-dimensional image task, there are actually four knowledge network PPEs which cover the initial position. These PPEs are the four image edges: top, bottom, left, and right. Locus knowledge networks can use these PPEs to help constrain region placement within the image (for example, sky is at the top of the image). Note, however, that these constraining PPEs can be used to not constrain simply because Locus imposes no fixed rules on region placement within the scene: it leaves that as an option.
During the forward pass, determination of likely PPEs is based on the computation of a path likelihood for the search tree path that arrives at that PPE. The path likelihood is defined recursively in terms of the path likelihoods of all parent PPEs in the search tree 4. It is this path likelihood value which is used to build the search tree and guide the forward pass. It uses three pieces of information: the optical match of the pixel to the PPE; the path likelihoods of previous PPEs; and the transition likelihoods of arriving from those previous PPEs. Formally:
where Pi,j is the path likelihood of PPE i for depth level j (position j of the signal); Oi,j is the optical match of PPE i to the pixel at position j; ∂(d) is the depth adjacency function which offsets the current search tree depth (j) to a "previous" adjacent depth (j+∂(d)) in the two-dimensional direction d; and Tk,i,d is the transition likelihood of traveling from PPE k to PPE i in direction d (explained further in sub-section 2.4.3.4). The following diagram illustrates the physical placement of these values in an unsegmented image:

The need for the ∂ function is explained by the fact that the task is image interpretation which is two-dimensional. Since depth within the search tree is one-dimensional, some concept of pixel adjacency must be incorporated. This is done with the ∂ function which determines, for a given two-dimensional direction, what two depth levels are physically adjacent in the image. For example, in a 75x100 image, ∂(above) = -100 and ∂(left) = -1. This is because of the raster order of scan: a pixel to the left is 1 back in the search tree whereas a pixel to the top is 100 back on the previous raster line. Pre-segmented images have an arbitrary number of adjacencies so the ∂ function can extend as far back as the start of the image.
In unsegmented images, ARGOS uses four primary directions of adjacency: left, above, upper-left, and upper-right. These four directions, which are the range of d, have four opposite directions which together allow all horizontal, vertical, and diagonal relationships to be defined within the 3x3 matrix of pixels surrounding the current point. Note that the opposite directions are not explicitly used because, for example, the relationship "A below B" can easily be expressed as "B above A".
In pre-segmented images, adjacency is much different. The directions of adjacency cannot be limited to an upper-left semi-circle since the previous segment can border on all sides of the current segment. Thus, at least eight directions of adjacency are required to retain the same amount of knowledge. However, the pre-segmentation version of ARGOS does not use diagonal directions of adjacency. This is because segments which are diagonally adjacent can also be considered to adjoin horizontally and/or vertically. The result is that pre-segmentation ARGOS has the following four directions of adjacency: left, above, right, and below. These directions are used to match the segment adjacency to the network adjacency when determining the transition likelihood. Unlike unsegmented ARGOS, these directions of adjacency can be combined when describing segment adjacency. Thus two segments can adjoin in any of 16 directions (left, right, above, below, above and left, above and right, etc). Even complex relationships like "containment" (Levine, 1977) can be incorporated in this scheme.
A more sophisticated adjacency system for pre-segmented images would be able to account for the continuum of border classes. It might be reasonable to work the border type into the path likelihood equation. For example, if the Gulf Bldg comes to a point at a 60° angle, then the transition likelihood to a Gulf Bldg PPE would be penalized more as the angle of the segment border strays from -60° or +60° (the two angles that define the pinnacle of the building). In addition to considering the angle of a border, its texture (smooth or jagged) could be considered.
The path likelihood equation introduces an interesting heuristic used in Locus search: recombination of PPE nodes in the search tree. If search trees were actually represented as totally independent paths (which section 2.4.2 leads one to believe) then any PPE-pixel node would uniquely identify its lineage back to the root of the tree. In the image understanding world, where search trees are phenomenally large, these independent paths are too expensive to retain. Therefore all paths at a given depth level which transit to the same PPE in the knowledge network are merged by Locus into one PPE node in the search tree. For example, if both sky and mountain PPEs can transit to the river PPE, then the left-hand figure below shows the true search tree for this transition and the right-hand figure shows the Locus search tree which combines the river PPEs.

It is for this reason that the "maximum" factor exists in the path likelihood equation: the best previous PPE is saved, along with its path likelihood; all others are rejected. In a true search tree, the "maximum" factor is unnecessary because there is only one parent PPE for each child. Locus's rejection of less likely paths means that, in actuality, one of the transitions in the above Locus search tree will be ignored after both have been calculated. The rejection of a path is equivalent to the rejection of the entire path leading up to this point from the beginning of the search tree. However, the rejection is not damaging since there is now a better path to this point.
There is one more detail that should be mentioned about the path likelihood equation: the "average" clause. This clause indicates that path likelihoods from all directions of adjacency are being considered. The likelihoods are averaged to show that they contribute equally to the overall path likelihood. As a side constraint, the unsegmented version of ARGOS insists that a parent exist in all four directions. For example, if three out of four pixel neighbors have beam entries that can legally transit to the river PPE, but there are no entries in the fourth pixel neighbor that allow this, then the river PPE will not be placed in the beam at this pixel. The pre-segmentation version of ARGOS relaxes this constraint and simply penalizes any PPE that does not have transitions from all neighbors. This allowance is due to the potentially large number of neighbors that a segment can have.
More reduction in the complexity of the path likelihood equation can be obtained by selecting the transition likelihoods from one of two values: 0 or 1. A value of 0 indicates that it is not feasible to transit from PPE k to PPE i in direction d. A value of 1 means that the transition is allowed (i.e. PPE k and PPE i may adjoin each other in the d direction). The transition likelihoods are therefore the knowledge network constraints and comprise the major knowledge element of the path likelihood equation.
ARGOS actually implements the transition likelihoods as one of three values: 0 for a disallowed transition; 0.1 for an allowable transition from one PPE to another; and 0.9 for an allowable transition from a PPE to itself. Self-transitioning is a necessary special case in the unsegmented system since most PPEs cover a large area of pixels. The increased likelihood of transitioning to the same PPE is a form of knowledge which ARGOS uses to label city scenes. The value of 0.9 is taken, without experimental verification, from the Harpy speech implementation of Locus. Pre-segmentation ARGOS also uses differing intra-state and inter-state transition likelihoods, but the reverse values are used. This is a knowledge source which implies that the segments form complete objects in the image and should not self-transit.
The beam is the search tree. It is the collection of pixel-to-PPE pairings that is accumulated on the forward pass. It recursively generates itself using the path likelihood equation to select new members. However, not all of the path likelihoods produced by the equation are saved in the beam. This is because Locus resembles a first-order Markov system, so the only likelihood values that are needed are for the ∂ calculations in the immediate neighborhood of the pixel being computed. In the unsegmented system, anything more than 1 scan-line back (100 depth levels back) can be discarded because path likelihoods exist only to compute other path likelihoods. The pre-segmentation system, however, must keep all of the likelihoods since strange segment adjacencies may require arbitrary values from the beam.
The important information in the beam is the PPE connections. Whenever a path likelihood is computed, there is an "optimal parent" in each direction of adjacency. This optimal parent is considered to be a major contributor to the path likelihood at the current pixel or depth level. It can be seen in the path likelihood equation as the maximum k for each direction d. The collection of optimal parents is what makes up the beam.
Yet another distinguishing feature of Locus is its pruning. It has been mentioned that Locus selects the best few PPE connections and saves them in the beam. There is a fixed maximum size at each level of the beam which is considerably smaller than the number of PPEs. Therefore, only those PPEs with the highest likelihoods are allowed in the beam. Those that cannot fit are usually not important in the overall labeling scheme because there are many other beam entries with higher likelihoods.
In addition to fixing the beam size, Locus maintains a threshold below which a likelihood will be pruned regardless of beam space availability. This threshold, which is used in Harpy Locus, is dynamic in that it is relative to the best likelihood at the particular depth level. If the best PPE has likelihood value P, then all other PPEs with likelihoods below P-threshold will be pruned. It has been found that a threshold of 100 (out of a possible range of 256, see next subsection) will prune on the order of 60% of the PPEs at each depth level. This is a hefty cut yet it does not damage the labeling quality.
One problem that arises when computing the path likelihoods is precision. In a true Markov system, all of the probabilities 5. at a given depth level must sum to 1. In Locus, however, there is much pruning and re-combination of likelihoods. The result is that the values degenerate as the network is scanned. To prevent this, the likelihood values at a pixel are normalized once they have all been computed. The most promising PPE at the level is given the likelihood value 1 and all others are linearly scaled to fall below that. This normalization causes no damage to the algorithm because Locus is only concerned with one single depth level as it relates to another. If all PPEs at a level are normalized together, then the relative results are the same.
As an added bonus, the normalization allows the path likelihoods to be stored with very few bits of precision. ARGOS uses negative log values of all likelihoods so that it can add instead of multiplying. Also, it is able to store these as integral values in only 8 bits of precision. Thus, the likelihood value 1.0 is stored as a 0 and the likelihood value 0.0 is stored as 255. All fractions between 0.0 and 1.0 are represented as an integer between 0 and 255.
After the entire image has been scanned, all that remains is the beam: stretching out for as many depth levels as there are pixels in the image, and containing PPE connections from the forward pass. The backtrace scans the beam in reverse order (i.e. from the bottom of the search tree to the top) and produces a labeling of the image. The labeling, of course, consists of a series of PPE assignments to every pixel in the image.
The inquisitive reader will ask why the beam is scanned in reverse order. The answer is: to obtain a unique labeling. Notice that the beam contains, for each PPE entry, a pointer to its optimal parent PPE. This optimal parent pointer is a direct by-product of the path likelihood equation. Therefore, when scanning the beam backwards, each child PPE which is selected will identify exactly one parent PPE at the next higher level. If the beam were scanned forwards, there would be ambiguity when a parent had more than one child: which child should be selected next? Observe the following beam:

There are six depth levels, plus the terminal PPE 6. Notice that the beam has exactly one path defined by following the optimal parent pointers (arrows) from the terminal PPE. The same cannot be said for the other order. The backtrace is necessary in this form of search tree evaluation because of the re-combination that occurs when two PPEs at a depth level join into a third at the next depth level. If there were no re-combination, then the optimal PPE at the bottom depth level would uniquely define a path back through the search tree without the need for a beam or any parent pointers. However, the combinatorics of search without re-combination are prohibitive in the image understanding task, so Locus re-combines and does a backtrace.
Notice that the backtrace is extremely fast. It performs no search or other involved computation. In fact, the backtrace is linearly bounded in time to the size of the image. Even with the complications that are about to be discussed, it retains these qualities.
Conflicts arise in the backtrace when multiple beam pointers from the forward pass disagree. This is a subtle problem that is once again caused by topology: the task is two-dimensional but the beam is one-dimensional. The solution to the dimensionality problem is simple but it introduces problems in the backtrace.
To handle two-dimensional data on the forward pass, there is an optimal parent in each of the directions of adjacency. This was mentioned in section 2.4.3.2 when the path likelihood equation was presented. If the forward pass saves d parents for every child then, by symmetry, there must be d child PPEs for every parent on the backtrace. It looks like this in the unsegmented system where d=1:

In the above diagram, the backtrace is about to label pixel P and the four child PPEs have already been scanned (remember: the backtrace proceeds in reverse raster order, from right-to-left, bottom-to-top). Each child pixel has been assigned a PPE label, therefore each child pixel has something to say about its parents in four directions (observe the tangle of arrows). Notice that although there is only one optimal parent for every child, there are four directions of adjacency, so there is an optimal parent in each direction. Parent pointers connect two levels of the search tree and every PPE has parents at four other levels, therefore four parent pointers. If, for each child pixel, c (1 ≤ c ≤ 4), there are four parents OPc[d] (d taken from direction key above), then the four PPE label choices for pixel P are OP1[UL], OP2[T], OP3[UR], and OP4[L].
Clearly, two-dimensionality presents a problem of conflicts: when the children select different parent labels. Although it would be nice to have unanimous PPE selection, this is not always the case. Conflict resolution is an interesting problem that has not been fully explored. Therefore, the following discussion contains some techniques and observations, but no definitive solution.
There are two ways to resolve conflicts. The obvious one is to select one of the candidates from the child PPEs through some heuristic. The surreptitious solution is to throw out a pixel when it has a conflict. This latter technique is not as bad as it sounds for a few reasons. First, there is no rule which says that every pixel must be labeled. Humans don't assign interpretations to every point in an image, so why must ARGOS? In effect, the system can say, "this point is confusing: it could be any of these objects, I can't tell. I'll leave it undecided." The second reason for leaving conflict points unlabeled is in direct support of the first: experience has shown that conflicts arise only at region borders or in areas where there is inadequate training. When ARGOS changes PPEs, it gets confused for a pixel or two but soon picks up the scent and continues faithfully.
Given that conflicts can be resolved by arbitration or rejection, ARGOS chooses to do both. It employs a few simple heuristics to resolve conflict types that it understands, and then rejects the pixel if the conflict persists. Leaving a pixel unlabeled sounds simple enough, but it causes other problems in the backtrace. One problem is that there are no parent pointers emanating from an unlabeled pixel.
In the unsegmented ARGOS, when an unlabeled pixel is skipped, all parent pointers are extended around it. Although no pointers emanate from unlabeled pixels, this allows every subsequent pixel to get four pointers into it by having a labeled pixel at some distance in all four directions. The diagram below illustrates the arrangement of child pixels used when an unlabeled entry is left. Notice that the second child is passing through the unlabeled pixel.

In pre-segmentation ARGOS, unlabeled segments do not affect the backtrace: they are simply ignored. Since the number of child segments varies anyway, it does not matter if there are one or two less children available when it comes time to label a segment. It should not be presumed, however, that unlabeled segments are harmless. Conflicts indicate inconsistency in the interpretation and should be resolved whenever intelligently possible.
The most powerful, and least obvious conflict resolution technique is simple knowledge constraint checking. The child pixels have been labeled already so their parent PPE selections should be consistent with all legal adjacency rules in the relational knowledge network. It is simple enough to check each candidate for the parent position against its children and reject those candidates that are inconsistent. If all but one candidate are rejected, then the conflict has been resolved. If there are still multiple candidates, or, if all of the candidates are rejected, then some other resolution technique must be employed.
Before moving on to other resolution strategies, it is worthwhile to stop and see why the above technique is able to function. After all, aren't all beam entries generated from the path likelihood equation? And doesn't that equation use only legal PPE relationships? So how can invalid relationships arise? The answer has to do with the reverse order of scanning and the topology. X may adjoin y, and y may adjoin z, but x doesn't have to adjoin z. Even though they may end up touching in two-dimensions, they do not have to be legally adjacent. Don't forget that the beam contains many likely label candidates which may not all be consistent with each other. For a good demonstration of how this type of conflict can arise, see the example in section 2.5.
The next technique used in conflict resolution is voting. It has been found that if one candidate outnumbers all the others, then that candidate should be chosen for the pixel label. Voting is easily implemented and aids in labeling quality.
The only other resolution technique that was explored is directional preference. This last-resort heuristic is used only when the above two techniques fail. What it does is to select candidates solely on the basis of their adjacency direction. The assumption is that certain directions aid the labeling more strongly than others. When this assumption was tested out in the unsegmented system, it was found that the diagonal directions seem to be slightly stronger than the horizontal and vertical directions. However, the advantage is not strong enough to be significant. In addition, use of this technique implies that all conflicts can be resolved since this technique cannot fail. Inaccuracies in labeling rise faster than accuracies when conflict points are forced to be labeled. Therefore, directional preference is not used.
There are a number of conflict resolution techniques that were not explored. An example of one of these is the use of optical matches. This factor can be used in conjunction with other techniques (i.e. as a weighting factor for voting) or by itself. In fact, any knowledge that is used in the forward pass can also be applied in the backtrace to help resolve conflicts.
Locus performs very well in labeling images. The technique of delaying decision until the backtrace pass allows a globally near-optimal labeling to be selected. "Globally near-optimal" means that every knowledge constraint which has not been pruned is able to be applied to every PPE relation in the scene. For example, the discovery of a boat in the lower-left corner can affect the selection of a boat house in the upper-right corner or anywhere else. Of course, pruning does allow for the possibility of missing the globally optimal path. However, this happens too seldom to make the added cost of exploring all nodes worthwhile.
The reason that Locus search yields globally near-optimal results is simple: it resembles a Markov system, and Markov systems yield globally optimal results. In fact, if no pruning, normalization, or node recombination were done, then Locus would be a Markov system. It has been found, through extensive experience with Locus systems, that these heuristics do little damage to the Markov nature of the search. Thus the quality is preserved.
Yet another advantage of Locus search is reduced order dependence. It was mentioned earlier that the raster order of search is not detrimental to labeling quality. In a Markov system, all that matters is that every point be examined: they are all related to each other regardless of order. In Locus, all that matters is that every point be examined and that the backtrace reverse the order of the forward pass. Given that Locus implements a Markov system (which, within heuristic bounds, it does) then it becomes marginally important how the image is scanned. In ARGOS, the raster scanning order is done for two reasons: to provide a consistently high number of neighbors in the context of the search (i.e. every pixel in the forward pass or backtrace is guaranteed to have neighbors that have already been examined) and to demonstrate that the order of search is not significant.
It would be feasible to add search order as a knowledge source. This would require the addition of a depth level factor in the path likelihood equation (perhaps somewhere in the normalization phase). Then the search could start from one or many "points of interest" in the sensed image and proceed outward (perhaps in a spiral). These points of interest could be selected on the the basis of their signal features (i.e. bright areas) or on the basis of some knowledge about the image (perhaps from a higher knowledge hierarchy level).
Another advantage of Locus search is speed. Locus uses no backtracking or other unbounded computation. The beam constrains all search possibilities to a reasonable size as it follows many parallel paths through the search tree. In fact, the varying sized beam can be functionally compared with the varying sized stack in standard backtracking searches: both list the current best paths through the search tree. The difference is that the beam is horizontal and the stack is vertical (when search trees are viewed as top-to-bottom, not left-to-right). It is this directionality which allows Locus to limit the beam size without damaging the search.
A good way to finish up this chapter is with an example. This example is, of necessity, a toy use of Locus but it will make use of much that has been presented in this chapter to show how it all fits together. Those readers who are satisfied with the details of Locus may skip to the next chapter.
This example concerns the labeling of a very small satellite picture of a fictitious agrarian nation. The nation's chief products are Alfalfa, Barley, and Corn (sometimes refered to in this example as A, B, and C). Due to soil conditions, winds, and antiquated laws, farmers always plant their crops in fixed positions on their land. The law states that Barley and Corn must be planted, but Alfalfa is optional. The soil conditions mandate that Corn always be planted to the east of Barley, and winds force farmers who choose to plant Alfalfa to do so in the north-west corner of their field.
A Locus knowledge network is compiled from the above information. It uses only north and west in its relationships and looks like this:

Needless to say, there are three PPEs: alfalfa, barley, and corn. Together with the four image-edge PPEs, they define the knowledge network. The goal is to label the satellite photographs and identify the crop types.
The satellite photographs are not very large: four pixels across and two pixels high (unsegmented). However, with a total of 8 pixels and 3 choices of a label for each pixel, there are still over 6000 possible labelings of each satellite photograph. It was mentioned earlier that knowledge does most of the reduction of the search space and this example is no exception. In fact, the knowledge-reduced search space actually contains only the following ten possible labelings:

The low-level side of this example is as follows: there is one sensor in the satellite which generates brightness values in the range of 1 to 10. Therefore, the feature vector templates are simply one number which describes each PPE. Let us assume that the typical alfalfa field registers 4 on this sensor, barley registers 9, and corn registers 3.
For a distance metric, it is adequate to choose an unweighted subtraction measure. In particular, the formula Oi,j ≡ 1 - |i-j| / 10 will yield a fraction between 0 and 1 which indicates the likelihood that PPE i is matched to image pixel j. In our example, the sample photograph to be labeled looks like this:

so therefore the optical matches for each PPE at each pixel are:

The search through the image now begins. The first step is the forward pass which starts with the pixel at position (1, 1) in the upper-left corner. The path likelihood for alfalfa at pixel (1, 1) is derived as follows:
| Palfalfa,(1,1) | ≡ | Oalfalfa,(1,1) x AVERAGE[ | (Ptop,(initial) x Ttop,alfalfa,NORTH) and |
| (Pleft,(initial) x Tleft,alfalfa,WEST)] | |||
| = | 0.9 x AVERAGE[(1.0 x 1.0) and (1.0 x 1.0)] | ||
| = | 0.9 | ||
Notice that the initial PPEs are defined to have the likelihood value 1.0 and that all transition likelihoods are simply 1 or 0 depending on the PPE connection. By similar computation, Pbarley,(1,1) = 0.4 and Pcorn,(1,1) is undefined. This last path likelihood is undefined because the transitions are not allowed in the WEST direction. Once both valid PPEs have been calculated for pixel (1, 1), the beam is updated. In this case, barley is pruned because its path likelihood is too low. (In this example, anything .5 or more from the best likelihood is pruned. Barley has the value 0.4.) The remaining value is normalized to 1.0.
At pixel (1,2), the three PPEs are evaluated again. Observe the path likelihood of barley at pixel (1, 2):
| Pbarley,(1,2) | ≡ | Obarley,(1,2) x AVERAGE[ | (Ptop,(initial) x Ttop,barley,NORTH) and |
| (Palfalfa,(1,1) x Talfalfa,barley,WEST)] | |||
| = | 0.9 x AVERAGE[(1.0 x 1.0) and (1.0 x 1.0)] | ||
| = | 0.9 | ||
At pixel (1, 2), corn is again pruned because there is no barley to the west of it. Therefore, only two PPEs are valid, neither of which is pruned. At pixel (1, 3), all three PPEs are retained in the beam. This is the computation of barley at (1, 3):
| Pbarley,(1,3) | ≡ | Obarley,(1,3) x AVERAGE{ | (Ptop,(initial) x Ttop,barley,NORTH) and | |
| Max[ | (Palfalfa,(1,2) x Talfalfa,barley,WEST), | |||
| (Pbarley,(1,2) x Tbarley,barley,WEST)]} | ||||
| = | 0.9 x AVERAGE{(1.0 x 1.0) and Max[(.667 x 1.0), (1.0 x 1.0)]} | |||
| = | 0.9 x AVERAGE{1.0 and 1.0} | |||
| = | 0.9 | |||
The diagram below shows the status of the search tree after the top row of the image has been scanned:

The two numbers above each PPE node in the search tree indicate the path likelihood values before and after normalization. Observe that certain paths are rejected (R) because there are more optimal parents. The forward pass now advances to the second row and scans it from left to right (west to east). Pixel (2, 1) has only two choices: alfalfa (normalized likelihood 0.667) and barley (likelihood 1.0). Pixel (2, 2) also has these two possibilities because corn was not allowed at pixel (1, 2). Lets examine the likelihood calculation for barley at pixel (2, 2). It is quite complex:
| Pbarley,(2,2) | ≡ | Obarley,(2,2) x AVERAGE{ | Max[ | (Palfalfa,(1,2) x Talfalfa,barley,NORTH), |
| (Pbarley,(1,2) x Tbarley,barley,NORTH)] and | ||||
| Max[ | (Palfalfa,(2,1) x Talfalfa,barley,WEST), | |||
| (Pbarley,(2,1) x Tbarley,barley,WEST)]} | ||||
| = | 0.7 x AVERAGE{ | Max[ | (0.667 x 1.0), (1.0 x 1.0)] and | |
| Max[ | (0.667 x 1.0), (1.0 x 1.0)]} | |||
| = | 0.7 x AVERAGE{1.0 and 1.0} | |||
| = | 0.7 | |||
The completed forward pass, with row two, is shown below. Notice that row one has been simplified to its pruned normalized paths.

At the end of the forward pass, the beam, which was constructed from unpruned optimal paths, looks like this:

The backtrace starts at pixel (2, 4). The terminal PPE (arrow entering from the bottom) indicates the corn label for this pixel because it had the highest path likelihood on the forward pass. From there, the corn at (2, 3) is selected (just follow the arrows) and barley is selected at (2, 2) and (2, 1). Next, pixel (1, 4) is labeled and corn is easily picked. At pixel (1, 3), however, there is a conflict. The child at (2, 3) recommends corn but the child at (1, 4) recommends barley. The conflict is resolved by observing that the barley candidate would be inconsistent with the corn to the south of it, whereas the corn candidate is consistent with both children. There are no other conflicts in the backtrace and the final two pixels are quickly and unanimously selected to be barley and alfalfa. The final labeling is:

A dangerous situation arose in the above labeling which could have had disastrous results. The pixel at (2, 4) was correctly labeled corn but could have been labeled alfalfa. If alfalfa had been stronger, the entire scene would have been labeled alfalfa and that is an illegal labeling. How could that happen?
The toy nature of the example helped to encourage the possibility of illegal labelings. With only two directions of constraint, there was very little knowledge guidance during the search. By the addition of two more directions, NORTH-WEST and NORTH-EAST, this problem is completely avoided because alfalfa is then denied adjacency to the right side in the NORTH-EAST direction. In addition, the use of minimum path knowledge, which limits PPE proximity to edges based on the number of legal transitions required to adjoin the edge, would have pruned many PPE nodes in this example. However this knowledge is seldom used in larger images.
The bottom line is: the more knowledge, the better. In fact, even the four directions that ARGOS uses are inadequate for complex labeling tasks. It's not that Locus is weak, only its knowledge is weak. The next chapter discusses other forms of knowledge that are available to enhance Locus search.
CHAPTER 3: KNOWLEDGE |
![]() |
ocus search, as the last chapter presented it, uses
only one kind of knowledge: region adjacencies.
This adjacency knowledge forms the basic
structure of the network that Locus searches, but
it is not the only knowledge that can be brought
to bear.
In fact, as this chapter will show, there appears to be no major
limitation to the knowledge that can be used
in Locus search.
Before examining additional forms of knowledge, it is appropriate to discuss the exact method used to acquire region adjacency knowledge. Locus networks contain the extracted adjacency knowledge from many hypothesized views of an internal model of the scene. The internal model, which in the case of Pittsburgh city scenes is constructed from street maps, is a three-dimensional model of the city which can be used to generate all possible views of the city. To build a knowledge network, selected views (or hypotheses) are generated and added as separate paths 7.
The non-obvious aspect of building a network from multiple views is that a region which appears on different views is sometimes represented with separate PPEs and at other times is merged into one PPE. The decision to merge is based on a similarity measure between the PPEs. Observe the following two views of the Alcoa Building:
![]() |
Alcoa1 Alcoa1 Alcoa2 Alcoa2 Alcoa2 |
below leftof below leftof rightof |
U.S. Steel U.S. Steel Gimbles Gimbles Gimbles |
In the figure above, the Alcoa Bldg appears in front of the U.S. Steel Bldg in one view and in front of Gimbels Dept. Store in another view. It is represented with two PPEs in the knowledge network that reflect the five adjacencies listed above. This is because the two views happen to be from opposite directions and there are no common adjacencies between the two Alcoa Bldg regions. If, however, two views show the Alcoa Bldg alternately to the left of the U.S. Steel Bldg and to the right of the U.S. Steel Bldg as in the following diagram, then these two instances of the Alcoa Bldg are likely to be merged into one PPE which has the transitions shown.
![]() |
Alcoa Alcoa Alcoa |
below leftof rightof |
U.S. Steel U.S. Steel U.S. Steel |
The decision to merge PPEs from different views is based on many factors including the relative size and position of the two PPE regions. For a detailed discussion of the merging algorithm, see section 5.2.1 on network size reduction.
Although some knowledge is necessarily lost when regions are merged into one PPE, it is mandatory that the networks be reduced or else each hypothesized view will be stored as a separate path through the network. This not only makes the networks too large to search, but forces the beam to contain at least one path for every view, if that view is to be considered in the labeling. It has been found that frequent region merging produces optimal labeling results because little information is lost and the search is much easier (see section 5.2.2).
The remainder of this chapter discusses how the location, size, and shape of a region can be used to aid the Locus labeling process. Even pre-segmentation, where arbitrarily shaped areas are presented for labeling, can be considered to be a source of knowledge.
ARGOS does not explicitly segment: it labels. Segmentation is the division of an image into distinct areas, each of which is a separate object in the image. Labeling is the identification of each object. Many image understanding systems segment their image first and then label the segments (Sakai et. al., 1976; Williams et. al., 1977). ARGOS is able to take pre-segmented images and label the segments, but it is also able to take unsegmented images and "post-segment" them. What this means is that it is possible to view the labeled output as a segmentation since the labels are grouped into distinct clusters within the image.
Pre-segmentation, therefore, is the use of images that have been segmented (by some unknown but reliable algorithm) before ARGOS labels. Some reasonable choices of pre-segmentation algorithms are clustering (Ohlander, 1975), and other multispectral classifications (Kettig and Landgrebe, 1976; Rodd, 1972). ARGOS is currently running with a clustering algorithm that is derived from Ohlander's work (Shafer and Kanade, 1978). The use of pre-segmentation has the advantage of time and space saving because there are many fewer nodes in the search tree, typically two orders of magnitude fewer. In addition, there is increased accuracy when using pre-segmentation since smaller numbers of search tree nodes can be constrained better. The only disadvantage of pre-segmentation knowledge is that care must be taken to ensure that enough segments are found. It is only marginally harmful if there are too many segments because the labeler can simply give multiple segments the same label, but if there are too few segments, then there will be continuity gaps in the adjacency and the search will fail.
Everyone knows that mountains are usually found in the top part of an image. This sort of location knowledge is separate from adjacency: it is absolute position as opposed to relative position. The use of location knowledge is quite easy and requires only two steps: learning the knowledge and using the knowledge.
Learning location knowledge is done at the same time as the learning of region adjacencies. Each hypothesized image that is combined into the knowledge network has a minimum-bounding-rectangle (MBR) drawn around each region. A minimum-bounding-rectangle is simply the rectangle formed by lines parallel to the X and Y axes which pass through the smallest X, largest X, smallest Y, and largest Y co-ordinate in the object. Thus, it is the smallest box that can be drawn around the object which is parallel to the X and Y axes. When regions from two hypothesized images are merged, their MBRs are combined into a new MBR that includes both regions. The resulting knowledge that is extracted is an MBR for each PPE. In some cases, the MBR contains useful information and in other cases it doesn't. If, for example, a building appears in many different places across the hypothesized images, then the knowledge contained in the combined MBR will not constrain the building's location. But that is a form of knowledge which says: this building may appear anywhere.
It should be noted that the proximity of the MBRs is used as a factor when deciding whether or not to merge two regions. This prevents the creation of networks that have meaningless location information.
It would be possible to extract an exact template of the pixels occupied by each region. When regions merged, the new template would be formed from the union of the old templates. However, this much detail is not necessary for Locus since the hypothesized views are imprecise at the outset.
Using location knowledge is easier than obtaining it. In the unsegmented system, each pixel that is being evaluated during the forward pass is either in the MBR of a PPE or it isn't. If it is in the MBR, then no action is taken. If it is outside, then a penalty is added to the path likelihood equation for that PPE-pixel node in the search tree. The pre-segmentation version of ARGOS uses the segment centroid to determine location violation in the same manner. The penalty gets stronger as the pixel gets farther from the MBR of the PPE. Thus, Locus doesn't reject a label assignment if it violates location knowledge, but the likelihood begins to decrease as a pixel strays from its expected location. Unless there are other factors in the path likelihood equation that override the location penalty, the path will soon be pruned from the beam. This is how Locus search allows imprecise knowledge to be used without completely destroying the labeling.
Shape is hard to define because it has many aspects. This section will discuss a number of methods of describing shape. Some are better suited to pre-segmented images, others work best with unsegmented images, and a few are totally useless to ARGOS.
When ARGOS uses pre-segmented images, it must match the shape of an area of the image to the expected shape in the knowledge network. It is not sufficient to pre-compute the shape of each segment in the image since segments may combine during the search due to self-transitions in the network. ARGOS must be able to dynamically evaluate the shape of a group of segments and compare it to the shape specified by the PPE which is labeling these segments. Although there are many choices of shape descriptors including bit masks and moment invariants (Hu, 1961), four simple shape measures were selected which make use of the segment perimeter and area.
The simplest shape measure used by pre-segmentation ARGOS is fractional fill. This is the ratio of the area of the segment to the area of its minimum bounding rectangle. Another measure of shape is called compactness. This is the ratio of the perimeter squared to the segment area. Both of these measures distinguish compact segments from "loose" segments. Two other shape descriptors that are used are orientation and elongation. These are derived from the first moment of the Fourier transform and indicate the angle of an elongated segment and the amount of elongation. For more detail, see the thesis of Keith Price (Price, 1977).
The shape measures used by pre-segmentation ARGOS are matched with the same weighted Euclidean distance metric that was described in section 2.3.2. The only detail that must be observed during this process is the fact that orientation is cyclic, so the distance between 10° and 350° is 20°, not 340°. The results of the distance metric are used as a penalty in the path likelihood equation, thus allowing shape to collaborate with the other knowledge sources.
The above set of shape measures does not completely describe a segment's shape. The only perfect shape descriptor is a copy of the desired shape. It would not be terribly expensive to retain a template for each PPE in the knowledge network which would describe the exact shape of the expected region. Unfortunately, it is difficult to apply this kind of knowledge to Locus search because it is hard to determine where in the template a given search tree node is. For example, notice the following template for a cross-shaped PPE region:

This template describes a region with 20 points. If an isolated instance of that PPE is detected during the forward pass, which element of the template should that pixel be associated with: 1 or 5? It is not enough to remember all previous instances of that PPE in the beam because, after the first row of the template has been scanned, there will be uncertainty about where to place the pattern unless it appears in the beam exactly twice. Matching irregularly shaped segments is even harder. The multiple option philosophy of Locus indicates that some uncertainty is expected in the forward pass so it doesn't know what to do with precise knowledge. Besides, the use of extensive "look-back" in the search is contrary to the heuristics used in Locus because of the amount of time and space needed. And, of course, there is the problem of what template to store for PPEs that have been created from merged views of a region. The template method of storing shape knowledge is nice, but effective use of it in Locus search is not well understood.
It might be possible to incorporate shape templates into the backtrace. These templates would be used to help resolve conflicts by rejecting options that do not fit. It would be easier to work the precise knowledge of a template into the backtrace because there is no uncertainty: each PPE selected by the backtrace is unique to that pixel position.
A more natural shape descriptor for the forward pass (which is used in unsegmentation ARGOS) is dimensional shape: the specification of shape by the use of PPE dimensions at various angles. In the above cross example, the dimensions of the PPE region would be taken along the four directions of adjacency. The horizontal dimension would be 2-to-6 pixels, meaning that any horizontal slice through the region contains at least 2 and at most 6 pixels in a row. The vertical dimension is the same, and the two diagonals are both 1-to-4 pixels. Dimensional shape is well suited to descriptions of regular objects like buildings. However it fails in any attempt to describe complex shape like a skyline. It will be shown, however, that even a skyline can be described to Locus search.
Dimensional shape is easy to extract from the hypothesized views and, of course, is used as a factor in region mergings. Note, however, that the accumulation of dimensional shape can easily lead to useless information if the hypothesized views are taken too literally. Therefore, ARGOS rejects abnormal range values during the accumulation of dimensional shape. For example, suppose a building has one pointed turret above a wide main structure. Rejection of the abnormally narrow turret point would prevent the PPE description from having a horizontal lower dimension of 1 when there is really only one point in the PPE that is so narrow.
Dimensional shape is easy to implement in the forward pass of Locus search. All that is needed is four counters with every beam entry that specify the dimension (up to that point in the raster scan) of that PPE. Whenever a PPE is added to the beam, its dimensions are simply set to be one greater than the four surrounding PPE dimensions in the beam. When a PPE transits to a different PPE and the former one has insufficient dimension, the transition is penalized. Similarly, a PPE that transits to itself too often and exceeds the maximum dimension is penalized. Like location knowledge, the penalties become more severe as the distance from the dimension limit increases.
Shape measures occasionally miss an important feature of a segment because they describe shape in such abstract terms. There is a very simple solution to this problem which requires that all regions with complex shape be broken down into multiple regions that are less complex and are adjacency constrained to form the overall shape. Returning again to the cross example, the region can be broken down into five PPEs, each a 2x2 square which is easily describable with simple shape measures. These five PPEs can then be constrained so that they adjoin in a cross pattern with one in the center and four surrounding it. Of course, this technique requires extra network space and search time, so it should only be used for pathological shapes. However, it does allow arbitrarily shaped objects to be described.
Size is usually separate from shape. However, the cross example in the previous section shows that the shape measure used by ARGOS also specifies information about the size of the cross. This section discusses how independent size knowledge can still be used, even with dimensional shape knowledge.
Recall that dimensional shape describes the cross as a segment with horizontal and vertical dimensions of 2-to-6 and diagonal dimensions of 1-to-4. The following region fits that shape:

Size knowledge can prevent the above region from matching this description: it has twenty-four pixels and the cross has only twenty 8. Size is defined as an upper limit on the number of pixels in a region. PPEs that transit to themselves too often will violate the size limit and penalize their path likelihood. Like location and shape, size is extracted from the hypothesized images and used as a factor in network mergings.
Unfortunately, size is not effective for ARGOS when labeling city scenes. It will be shown in section 5.2.3 that size knowledge of this type works against correct labeling because of a quirk interaction between the search order and the nature of object shapes in natural scenes. However, this type of knowledge is still useful for Locus search in general.
There are other ways to implement size knowledge that have not yet been explored. Size can be used in the backtrace to resolve conflicts in much the same manner as shape. In addition, it is possible to implement relative size knowledge that is able to make statements like "a is twice as large as b". The implementation of this would require that the knowledge network have access to registers during the search. Various network nodes would store knowledge in these registers for use by other nodes. This scheme is similar to Augmented Transition Networks (Woods, 1970).
The knowledge sources that are used in ARGOS are not used independently. They are all descriptive aspects of regions, so they should work together. ARGOS currently has location, shape, and size implemented. It is reasonable to expect that the search is more effective if these three are constrained to a hierarchical scheme where lower knowledge sources in the hierarchy are only accumulated if higher knowledge sources are not being violated. The highest knowledge source is location because it constrains region identification most; the lowest knowledge source is size.
Implementation of this hierarchy implies that shape information be accumulated only for nodes in the search tree that fall within the minimum bounding rectangle of the PPE. At the other end of the hierarchy, size information is only accumulated for nodes in the search tree that fall within the shape bounds. The need for this hierarchy is apparent because there is no point in including incorrectly positioned pixels in the computation of shape: it can only damage the shape measure. Similarly, inclusion of pixels that violate shape bounds will adversely affect the use of size knowledge.
In experiments, the use of knowledge constraint was not found to give significant improvements. However, ARGOS continues to use it for the slim advantage that it does yield.
Some forms of knowledge can be used with Locus more effectively than others. Since ARGOS is designed to be efficient, it does not attempt to incorporate knowledge schemes that are unnatural for Locus search. However, it is still felt that any knowledge can be used in some form.
This chapter discussed region knowledge that can be put into a network. The next chapter discusses how to make a series of networks that bring more complex knowledge to bear on the labeling of images.
p to this point, all discussion about ARGOS has centered
around the labeling of Pittsburgh city scenes.
A knowledge network has been described which can identify
various views of the city and selected images have been
successfully labeled.
What about scenes of other cities such as New York?
What about non-city scenes?
There is a continuum of knowledge in the world
and Pittsburgh is just one piece of it.
The assumption that has been made is that each micro-world
requires a separately designed knowledge network.
This chapter is concerned with the techniques that can
be used to label arbitrary scenes without the need for
manual selection of networks.
Although few of these techniques have been implemented, they
provide some direction for future research.
The first thing that should be pointed out is that there is a reasonable upper limit to the amount of knowledge that a network can hold. Although current computer technology imposes its own limits, there are theoretical upper limits that should also be observed. Therefore, it is not reasonable to build one large network with all available knowledge. In particular, the continuum of knowledge can be divided into hierarchical levels and a network should always bridge two of these levels. This bridging effectively applies knowledge that has been acquired at a higher hierarchical level to the acquisition of knowledge at a lower level.
For the purposes of this discussion, all image knowledge will be divided into three hierarchical levels. The top level is the scene level and contains all of the scene types that can be distinguished by the image understanding system. Another way to think of this level is as the schema or frame level. Examples of scene types are: city scene, office scene, satellite scene, etc. Below this level is the viewpoint level which is concerned with the angle and distance of view. At the bottom is the object level which, given the environment and viewing position, concerns itself with correct identification of the objects in the scene.
These levels of knowledge are not the typical levels that other image understanding systems (such as Sakai et. al., 1976; Ballard et. al., 1977; Williams et. al., 1977) recognize: pixel within object within region within scene, etc. Locus networks combine those "minor" levels into a uniform structure. ARGOS hierarchies are organized along general-to-specific lines (scene to view to object). The hierarchy spans the full depth of knowledge and is organized to extract a varying level of understanding from the entire image.
The three levels of hierarchy that ARGOS recognizes span quite a lot of knowledge, especially the top and bottom levels which are infinitely extendible in their respective directions. For example, the object level can attempt to identify the buildings in a city scene or, at a lower level of detail, all of the windows in all of the buildings 9 or, at a higher level, just the major features of the scene such as sky and mountains. Similarly, the scene level can identify a scene as "city of Pittsburgh," "city," or just "outdoor scene." There are even many distinctions to the middle category (viewpoint) such as view angle and view distance.
The rest of this chapter discusses how ARGOS could use hierarchies of knowledge to completely identify scenes. At the end is a discussion of where ARGOS currently stands in its hierarchical use of knowledge.
As the last chapter discussed, the ultimate source of knowledge is the internal model. This model is an actual three-dimensional description of the scene to be viewed. Since the assumption behind the use of knowledge hierarchies is that there is too much knowledge to be completely specified by one network, it will be assumed that the model is very large, perhaps a representation of the entire United States.
Before discussing the use of very large models, it is appropriate to mention their creation. ARGOS does not address the issue of automatic model generation or learning, but it would be possible to attach a feedback loop onto ARGOS which would use labeled knowledge to modify the existing model. More sophisticated learning systems could build additions to the model solely from examples found in the existing structure, but that is completely outside of the scope of this thesis. Therefore it must be assumed that the scene models are built by hand.
Hierarchical use of model knowledge always starts at the most general end of the hierarchy with identification of the scene type. Even humans start with this point of the knowledge hierarchy by first determining the "gist" of the scene (Akin and Reddy, 1976). This scene type identification process involves running the scene through a pass of Locus search with a knowledge network that contains only the major components of each scene type in the model. For example, if the model contains twenty cities, then the knowledge network used to identify the correct city might contain PPEs only for those major regions of each city that distinguish it from another. Pittsburgh would have river and mountain PPEs and perhaps one for the U.S. Steel Bldg which is a landmark. Other cities might employ ocean, countryside, or even smog PPEs. Even the shape of the skyline can be used to distinguish cities.
Once the search is run on this network, it is easy for the machine to examine the labeled output and determine the selected city. This is because each PPE is tagged with the name of the city or view that generated it 10. With city knowledge in hand, the machine can generate a more detailed network of the identified city which explores the scene at a lower level in the knowledge hierarchy. For example, 50 different views of the identified city can be hypothesized to form the next level network which would extract view angle and/or view distance. Note that there is no restriction on the number of Locus search iterations that can be done at a given level of hierarchy. If the world model contains 2000 cities, it might be reasonable to break the scene-type hierarchical level into two passes of Locus search; one using a network to classify them into major types (cities with tall buildings, cities near water, etc.) and then another pass with a more detailed sub-network to select from the identified types of cities.
Locus networks, therefore, form a tree structure that spans the knowledge hierarchy. In less complex world models it might be possible to pre-compute all of the networks and store them for possible use in the hierarchy traversal. In more complex models, these networks must be generated by the machine at each decision point in the identification process.
Extremely complex models have the interesting property that the lower (more specific) levels of knowledge networks all look the same. For example, once the proper city and building has been identified, the details of the building are very similar to the details of any other building. These models no longer have a tree structure to their knowledge hierarchy: it is now a network structure that re-combines knowledge paths at the lower levels. Add to this observation the fact that there is bound to be noise and error in the hierarchy searching process, and a fascinating observation is made: Locus search can be used to guide the selection of knowledge networks. All of the efficient, non-backtracking search can be used at a much higher level: the Primitive Knowledge Network level. So Locus search can be used both for labeling pixels within images and for labeling images within world models.
Is there a higher dimension that Locus can run on? It is possible to envision hierarchies of world models which vary according to the intent of the model creator. A city planner would build a world model that reflects the populated vs. unpopulated areas of cities because the planner is concerned with growth trends. A military tactician would build a model that contains only strategic points, be they telephone switching centers or harbor docks. So there is a higher level of knowledge that sits on top of the hierarchies discussed in this chapter. This level deals with the actual semantics of an image and can indisputably be called "image understanding."
Needless to say, ARGOS is not able to analyze any scene in terms of arbitrary levels of understanding. In fact, ARGOS is barely able to transmit results from one level of the knowledge hierarchy to another. It does demonstrate the use of hierarchical knowledge with a simple task. The task that was selected was highly dependent on the images that were available. Since the fifteen images were all of Pittsburgh from five different views, the obvious hierarchical tasks were view angle identification followed by object identification.
View angle identification consists of building a network with many views of the city. This task used 24 views at 15° intervals around the city. Prior to network construction, the number of significant objects on the city was reduced from 58 to 16 (see Appendix II). This is because the view angle task is more general than the knowledge base which was selected for the object identification task. The selection of 16 PPEs was done automatically by examining the 24 machine-generated views of the city and counting the number of times each building contributed to the skyline. Any building which formed part of the skyline in over 50% of the views was kept as a separate PPE. The other buildings were eliminated from the list and merged into the Miscellaneous Buildings PPE.
To determine the angle of view for each photograph of the city, ARGOS labeled each image and extracted the most popular angle from the labels (recall that each label includes information about the original view or views from which it came). ARGOS typically selected one or two angles and the values were in error by an average of 30° for the training images and 50° for the test images.
The next level of the knowledge hierarchy is the object identification level. Presumably, ARGOS should be able to use the predicted angles from the view angle runs and use that knowledge to improve the object identification. A number of schemes were tried, none of which yielded significant improvement in the object recognition.
The first experiment in knowledge hierarchy traversal invol