Figure 1 shows a variety of graphics pipelines. Existing graphics architectures (a) support a single context through a single command processor. This architecture can be extended to multiple command processors by synchronizing immediately between the command processors to determine a valid command sequence, and then maintaining that sequence through the rest of the system as usual (b). Alternatively, synchronization may be pushed later in the pipeline. In (c), for example, synchronization is performed between the geometry processors, which allows the command processors to run more freely (assuming sufficient buffering). Argus currently implements this version as well as a version in which synchronization is performed within the pixel processors.
A high-level view of a single Argus pipeline (one application processor) is shown in Figure 2. A single processor issues immediate-mode graphics commands to the library. These fine-grained commands are streamed into contiguous memory with as little processing by the host CPU as possible. At fixed intervals a small amount of state information is written out so that each block of the command data can be processed independently (here the blocks are A, B, C, etc.), and an entry for the block is placed in the geometry-processing queue.
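The block-splitting step can be illustrated with a minimal Python sketch. It is not the Argus implementation: the names `CommandStream`, `Block`, `issue`, and `flush`, the block size of four commands, and the choice of a single `color` entry as the "state information" are all assumptions made for illustration. The sketch only shows the idea that each block carries a snapshot of the state in effect at its start, so it can be processed without replaying earlier blocks.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Block:
    seq: int      # position of the block in the original command sequence
    state: dict   # snapshot of the graphics state at the start of the block
    start: int    # index of the first command of the block in the stream
    end: int      # one past the index of the last command of the block


class CommandStream:
    """Hypothetical stand-in for the command buffer described above."""

    def __init__(self, block_size=4):
        self.block_size = block_size        # assumed fixed interval, in commands
        self.commands = []                  # stands in for contiguous memory
        self.state = {"color": (1.0, 1.0, 1.0)}
        self.geometry_queue = deque()       # entries consumed by geometry processing
        self._block_start = 0
        self._block_state = dict(self.state)
        self._next_seq = 0

    def issue(self, op, *args):
        # Append the command with minimal processing; only track the state
        # needed to seed the next block's snapshot.
        self.commands.append((op, args))
        if op == "color":
            self.state["color"] = args
        if len(self.commands) - self._block_start == self.block_size:
            self._close_block()

    def _close_block(self):
        end = len(self.commands)
        if end == self._block_start:
            return  # nothing to close
        self.geometry_queue.append(
            Block(self._next_seq, self._block_state, self._block_start, end)
        )
        self._next_seq += 1
        self._block_start = end
        self._block_state = dict(self.state)  # snapshot for the next block

    def flush(self):
        # Close a partially filled block, e.g. at the end of a frame.
        self._close_block()
```

A geometry worker that pops a `Block` can start from `block.state` and interpret `commands[block.start:block.end]` on its own, which is what makes the blocks independent units of work.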
Blocks of commands are popped from the queue on demand by the polygon processors. When a block is completed, the resulting block of screen-space triangle data is entered onto the pixel-processing queue of each tile it overlaps, but not in order of arrival, since the order of arrival may be incorrect if one polygon processor finishes a block faster than another. Instead, blocks are bucket sorted by entering them into the slots in the queue that correspond to their original sequence. A 'reclaim' pointer is maintained which indicates the extent in the sequence for which all geometry processing is complete. The pixel processors can process filled slots and skip empty slots up to this indicator without fear of skipping over a late arrival.
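The slot-and-reclaim scheme can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the Argus code: `SlotQueue`, `GeometryTracker`, `complete`, and `drain` are hypothetical names, slots are kept in a dictionary keyed by sequence number, and the locking or atomic operations that a real multithreaded implementation would need are omitted.

```python
class SlotQueue:
    """Pixel-processing queue for one tile; slots are indexed by block sequence."""

    def __init__(self):
        self.slots = {}    # seq -> triangle data that overlaps this tile
        self.reclaim = 0   # geometry processing is complete for all seq < reclaim
        self.cursor = 0    # next slot the pixel processor will examine

    def insert(self, seq, triangles):
        # Bucket sort: the slot is chosen by original sequence, not arrival time.
        self.slots[seq] = triangles

    def drain(self):
        # Process filled slots and skip empty ones, but never pass the
        # reclaim pointer, so a late arrival cannot be skipped.
        out = []
        while self.cursor < self.reclaim:
            triangles = self.slots.pop(self.cursor, None)
            if triangles is not None:
                out.append((self.cursor, triangles))
            self.cursor += 1
        return out


class GeometryTracker:
    """Advances the reclaim pointer as blocks finish, possibly out of order."""

    def __init__(self, queues):
        self.queues = queues   # tile id -> SlotQueue
        self.done = set()      # finished blocks at or beyond the reclaim pointer
        self.reclaim = 0

    def complete(self, seq, per_tile):
        # per_tile maps each overlapped tile to the block's triangles for it.
        for tile, triangles in per_tile.items():
            self.queues[tile].insert(seq, triangles)
        self.done.add(seq)
        # The pointer only moves past a contiguous run of finished blocks.
        while self.reclaim in self.done:
            self.done.remove(self.reclaim)
            self.reclaim += 1
        for queue in self.queues.values():
            queue.reclaim = self.reclaim
```

In this sketch an empty slot below the reclaim pointer simply means that the block produced no triangles for that tile, which is why it is safe to skip.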
The amount of parallelism within the system can be quite large and quite heterogeneous. Application processing, polygon processing and pixel processing are interleaved on all available CPUs to avoid any dead time as various constraints (ordering constraints, synchronization constraints, or just empty/full queues) cause blocking. To manage this complexity more easily, the Argus system is heavily multithreaded using an optimized internal thread system.
Argus currently implements several pixel-processor load-balancing algorithms, including static tile distribution, algorithms based on frame-to-frame coherence, and hardest-tile-first dynamic tile 'stealing'. Load balancing of the pixel processors is an ongoing research topic -- more information is available here.
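As a rough illustration of the hardest-tile-first idea, the sketch below simulates idle pixel processors repeatedly taking the most expensive remaining tile. It is one plausible reading of the phrase, not the algorithm Argus uses: the function name `hardest_tile_first`, the per-tile cost estimates (which might, for instance, come from the previous frame), and the shared pool of pending tiles are all assumptions.

```python
import heapq


def hardest_tile_first(tile_costs, num_processors):
    """Simulate idle processors taking the most expensive remaining tile.

    tile_costs maps a tile id to an estimated cost (hypothetical units).
    Returns (assignment, makespan), where assignment maps a processor id
    to the list of tiles it rendered, in order.
    """
    # Hardest tiles first; ties are broken by tile id for determinism.
    pending = sorted(tile_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    # Min-heap of (time at which the processor becomes idle, processor id).
    idle_at = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(idle_at)
    assignment = {p: [] for p in range(num_processors)}
    for tile, cost in pending:
        # The processor that becomes idle first takes the hardest tile left.
        t, p = heapq.heappop(idle_at)
        assignment[p].append(tile)
        heapq.heappush(idle_at, (t + cost, p))
    makespan = max(t for t, _ in idle_at)
    return assignment, makespan
```

Handing out the expensive tiles early leaves the cheap ones to fill in gaps at the end of the frame, which tends to even out the finishing times of the processors when the cost estimates are reasonable.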