Communication forms the backbone of parallel graphics, allowing multiple functional units to cooperate to render images. The cost of this communication, in both system resources and money, is the primary limit to parallelism. We examine the use of object and image parallelism and describe architectures in terms of the sorting communication that connects these forms of parallelism. We introduce an extended taxonomy of parallel graphics architecture that more fully distinguishes architectures based on their sorting communication, paying particular attention to the difference between sorting fragments after rasterization and sorting samples after fragments are merged with the framebuffer. In addition to sorting, we introduce three new forms of communication: distribution, routing, and texturing. Distribution connects object-parallel pipeline stages, routing connects image-parallel pipeline stages, and texturing connects untextured fragments with texture memory. All of these types of communication allow the parallelism of successive pipeline stages to be decoupled, and thus load-balanced. We generalize communication to include not only interconnect, which provides communication across space, but also memory, which functions as communication across time. We examine a number of architectures from this communication-centric perspective and discuss the limits to their scalability. We draw conclusions about the limits of both image parallelism and broadcast communication, and suggest architectures that avoid these limitations.
We describe a new parallel graphics architecture called "Pomegranate," which is designed around efficient and scalable communication. Pomegranate provides scalable input bandwidth, triangle rate, pixel rate, texture memory, and display bandwidth. The basic unit of scalability is a single graphics pipeline, and up to 64 such units may be combined. Pomegranate's scalability is achieved with a novel "sort-everywhere" architecture that distributes work in a balanced fashion at every stage of the pipeline, keeping the amount of work performed by each pipeline uniform as the system scales. The use of one-to-one communication instead of broadcast, together with a carefully balanced distribution of work, allows a scalable network based on high-speed point-to-point links to be used for communication between the pipelines. Pomegranate provides one interface per pipeline for issuing ordered, immediate-mode rendering commands and supports a parallel API that allows multiprocessor applications to exactly order drawing commands from each interface. A detailed hardware simulation demonstrates performance on next-generation workloads. Pomegranate operates at 87-99% parallel efficiency with 64 pipelines, for a simulated performance of up to 1.10 billion triangles per second and 21.9 billion pixels per second.
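As a reading aid for the efficiency figures above, the following is a minimal sketch of how parallel efficiency is conventionally defined: the speedup over a single pipeline divided by the number of pipelines. The function names and the example rates passed to `parallel_efficiency` are hypothetical and are not taken from the simulation. Only the 64-pipeline count and the two aggregate rates come from the text.

```python
def parallel_efficiency(rate_one: float, rate_n: float, n: int) -> float:
    """Conventional definition: speedup over one pipeline, divided by n.

    rate_one is the throughput of a single pipeline and rate_n the
    aggregate throughput of n pipelines (both hypothetical inputs here).
    """
    if n <= 0 or rate_one <= 0:
        raise ValueError("n and rate_one must be positive")
    return rate_n / (n * rate_one)


def per_pipeline_rate(aggregate_rate: float, n: int) -> float:
    """Average throughput each pipeline contributes to an aggregate rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    return aggregate_rate / n


PIPELINES = 64            # maximum configuration stated in the text
PEAK_TRIANGLES = 1.10e9   # triangles per second, from the text
PEAK_PIXELS = 21.9e9      # pixels per second, from the text

# Plain division of the reported aggregates by the pipeline count.
avg_triangles = per_pipeline_rate(PEAK_TRIANGLES, PIPELINES)
avg_pixels = per_pipeline_rate(PEAK_PIXELS, PIPELINES)

# Hypothetical illustration: 64 pipelines delivering 60x the
# single-pipeline rate would be running at 93.75% efficiency.
example_efficiency = parallel_efficiency(1.0, 60.0, PIPELINES)

if __name__ == "__main__":
    print(f"average triangles/s per pipeline: {avg_triangles:,.0f}")
    print(f"average pixels/s per pipeline:    {avg_pixels:,.0f}")
    print(f"example efficiency:               {example_efficiency:.2%}")
```

The per-pipeline averages are simple divisions of the reported peak rates and should not be read as measured single-pipeline performance, which the text does not state.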