The goal of this project is to develop new high performance graphics architectures for emerging parallel, data intensive systems.
Scalable computer technology is available and many important applications have been parallelized and run successfully on such machines. However, comparable scalable technology for graphics systems is not available. Given the importance of visualization to many military and commercial applications, e.g. nuclear stockpile stewardship, weather prediction, image analysis and target identification, flight simulators and distributed simulation environments, we believe that it is imperative to create a scalable graphics technology base. Our research focuses on building such scalable graphics technology, providing the ability to produce and manipulate imagery with several orders of magnitude more performance than currently available.
The key technical challenge is to find the right mix of hardware and software to support graphics and imaging efficiently on such machines. We plan on tackling this question by implementing a prototype graphics system that achieves at least 64-way scalable performance at over 90% efficiency. There are two key elements to our approach: data parallel rendering algorithms and appropriate architectural support in software and hardware.
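To make the scaling target concrete, the short sketch below computes what "64-way scalable performance at over 90% efficiency" implies for speedup. It assumes the conventional definition of parallel efficiency (speedup divided by the number of units), which the text does not spell out; the function names are illustrative only.

```python
def parallel_efficiency(t_single: float, t_parallel: float, n_units: int) -> float:
    """Speedup over one unit divided by the unit count; 1.0 is perfect linear scaling."""
    speedup = t_single / t_parallel
    return speedup / n_units


def required_speedup(n_units: int, efficiency: float) -> float:
    """Minimum speedup needed to reach a given efficiency on n_units."""
    return n_units * efficiency


# Stated target: 64 units at 90% efficiency (assumed definition, see above).
target = required_speedup(64, 0.90)
print(target)  # 57.6
```

Under this assumption, a 64-unit system would need to run a workload at least 57.6 times faster than a single unit to meet the goal.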
Traditionally, graphics systems have been built in a pipelined fashion, as was done in the original flight simulators created under DARPA funding. More recently, however, high-end workstations such as the Silicon Graphics InfiniteReality use small numbers (tens) of processors in a hybrid data parallel/pipelined architecture. Data parallelism exists in two forms: object parallelism (multiple graphics primitives) and image parallelism (multiple pixels). In this project we take this trend of increased data parallelism to its logical conclusion. Since graphics systems typically manipulate many objects and pixels, they map well onto data parallel architectures and programming models. Since this is by far the greatest source of parallelism in graphics systems, we believe it is the key to building a software graphics system on a parallel computer and, further, that it will be the most common architecture underlying high performance graphics systems of the future. The second key feature of our approach is to provide appropriate hardware to efficiently support image display and low-level imaging operations such as z-buffering and compositing.
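As a rough illustration of how object parallelism and z-buffered compositing fit together, the sketch below has each node render its own subset of objects into a full-size image of (color, depth) pairs, then merges those images by keeping the nearest fragment at every pixel. The data layout and the name `composite_depth` are assumptions made for this example, not the project's actual interfaces.

```python
import math


def composite_depth(partials):
    """Merge per-node images of (color, depth), keeping the nearest fragment per pixel."""
    n_pixels = len(partials[0])
    out = [(None, math.inf)] * n_pixels
    for image in partials:
        for i, (color, depth) in enumerate(image):
            if depth < out[i][1]:  # standard z-buffer test: smaller depth wins
                out[i] = (color, depth)
    return out


# Two hypothetical nodes, each holding a different object, over a 3-pixel image.
node_a = [("red", 0.5), (None, math.inf), ("red", 0.9)]
node_b = [("blue", 0.7), ("blue", 0.2), ("blue", 0.4)]
final = composite_depth([node_a, node_b])
print(final)  # [('red', 0.5), ('blue', 0.2), ('blue', 0.4)]
```

Because the depth test is order-independent for opaque surfaces, the nodes can render without coordinating and the merge can happen in any order.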
This project has several components:
Pomegranate is a parallel hardware architecture for polygon rendering that provides scalable input bandwidth, triangle rate, pixel rate, texture memory and display bandwidth while maintaining an immediate-mode interface. The basic unit of scalability is a single graphics pipeline, and up to 64 such units may be combined. Pomegranate's scalability is achieved with a novel "sort-everywhere" architecture that distributes work in a balanced fashion at every stage of the pipeline, keeping the amount of work performed by each pipeline uniform as the system scales. Because of the balanced distribution, a scalable network based on high-speed point-to-point links can be used for communicating between the pipelines. Pomegranate uses the network to load balance triangle and fragment work independently, to provide a shared texture memory and to provide a scalable display system. The architecture provides one interface per pipeline for issuing ordered, immediate-mode rendering commands, and supports a parallel API that allows multiprocessor applications to exactly order drawing commands from each interface.
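The value of balanced distribution can be seen in a toy comparison. The sketch below counts how many pixels of one screen rectangle each pipeline would process under two hypothetical pixel-to-pipeline mappings: a static partition into vertical strips, and a fine-grained interleave. Both mappings, the constants, and the function names are assumptions chosen for illustration; this is not Pomegranate's actual distribution scheme.

```python
from collections import Counter

N_PIPELINES = 4     # hypothetical system size
SCREEN_WIDTH = 64   # hypothetical screen width in pixels


def fragment_load(owner, x0, y0, width, height):
    """Count how many pixels of a screen rectangle each pipeline would process."""
    load = Counter()
    for y in range(y0, y0 + height):
        for x in range(x0, x0 + width):
            load[owner(x, y)] += 1
    return load


def strip_owner(x, y):
    """Static partition: each pipeline owns one vertical strip of the screen."""
    return x // (SCREEN_WIDTH // N_PIPELINES)


def interleaved_owner(x, y):
    """Fine-grained interleave: neighboring pixels go to different pipelines."""
    return (x + y) % N_PIPELINES


# A 16x16 block of fragments in the top-left corner of the screen.
strip_load = fragment_load(strip_owner, 0, 0, 16, 16)
interleaved_load = fragment_load(interleaved_owner, 0, 0, 16, 16)
print(dict(strip_load))         # {0: 256}
print(sorted(interleaved_load.items()))  # [(0, 64), (1, 64), (2, 64), (3, 64)]
```

With the static partition, a single pipeline receives all 256 fragments; with the interleave, each of the four pipelines receives 64, regardless of where the block lies on screen.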
The Lightning-2 pixel interconnect will enable us to build a scalable rendering system from multiple graphics accelerators. Its major features are as follows. Each Lightning-2 module will accept 4 input streams and generate 8 output streams. Each input and output stream will use a new commercial video standard, DVI (Digital Visual Interface). Lightning-2 modules may be tiled in a 2D array. We have decided to build 8 Lightning-2 modules; this will allow us to connect a 32-node graphics system to an 8-projector video wall. We have signed an agreement with Intel Corporation to fabricate the boards.
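The 32-node, 8-projector figure follows from the per-module stream counts if the 8 modules are stacked so that their inputs add up while they share the same 8 output streams. That stacking arrangement is an assumption made here to reproduce the arithmetic; the text only says that modules may be tiled in a 2D array.

```python
INPUTS_PER_MODULE = 4   # from the text
OUTPUTS_PER_MODULE = 8  # from the text


def stacked_capacity(n_modules: int) -> tuple[int, int]:
    """(inputs, outputs) for modules stacked in one column (assumed layout).

    Assumption: each added module contributes its own inputs, while all
    modules in the column drive the same set of output streams.
    """
    return n_modules * INPUTS_PER_MODULE, OUTPUTS_PER_MODULE


inputs, outputs = stacked_capacity(8)
print(inputs, outputs)  # 32 8
```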
Real-time graphics hardware is rapidly becoming capable of rendering images using advanced texturing, shading, and lighting models. However, it is time-consuming and difficult to implement complex rendering algorithms on this hardware, because the programmer must configure the hardware at a low level. We have built a real-time system with a programmable-shading language that provides an abstraction layer between the programmer and the hardware. The most important feature of our shading language is its capability to specify calculations at various stages in the graphics pipeline, in particular, at primitive groups, vertices and fragments. In our current system, per-vertex calculations are implemented on the CPU. When programmable transformation and lighting hardware becomes available, per-vertex calculations could be implemented on this hardware. In either case, per-vertex computations can use a wider variety of operations and data types than per-fragment computations, due to limitations of fragment hardware. Since per-vertex computations are performed less often than per-fragment computations, changing some computations from per-fragment to per-vertex can improve rendering performance.
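The performance argument can be illustrated with a small sketch that shades a span of fragments between two vertices in two ways: evaluating a diffuse term at the two vertices and interpolating the result, or interpolating the normal and evaluating the term at every fragment. This is plain Python, not the project's shading language, and all names are hypothetical. The normals are deliberately left unnormalized after interpolation so that the computation is linear; only in that case do the two approaches give the same answer.

```python
def diffuse(normal, light):
    """Lambertian term: clamped dot product of two 3-vectors."""
    return max(0.0, sum(n * l for n, l in zip(normal, light)))


def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t


def shade_span_per_vertex(n0, n1, light, n_fragments):
    """Evaluate at the 2 vertices, then interpolate the scalar result."""
    c0, c1 = diffuse(n0, light), diffuse(n1, light)
    colors = [lerp(c0, c1, i / (n_fragments - 1)) for i in range(n_fragments)]
    return colors, 2  # number of diffuse() evaluations


def shade_span_per_fragment(n0, n1, light, n_fragments):
    """Interpolate the normal, then evaluate at every fragment."""
    colors = []
    for i in range(n_fragments):
        t = i / (n_fragments - 1)
        normal = tuple(lerp(a, b, t) for a, b in zip(n0, n1))
        colors.append(diffuse(normal, light))
    return colors, n_fragments


light = (0.0, 0.0, 1.0)
n0, n1 = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
vertex_colors, vertex_evals = shade_span_per_vertex(n0, n1, light, 5)
fragment_colors, fragment_evals = shade_span_per_fragment(n0, n1, light, 5)
print(vertex_evals, fragment_evals)  # 2 5
```

For nonlinear computations, such as lighting with renormalized normals or specular highlights, moving work to the vertices changes the image, so the choice trades accuracy for speed.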
The SHARP system consists of many different data managers, databases, and functional units, all organized in a feedback loop. Each ray passes through four stages: first, it is scheduled to be traced; second, it is tested for intersections using a gridded acceleration data structure; third, it is passed to the shading processor; and fourth, the results are summed to form the final image. The system is implemented in two architectural simulators. The first simulator presents an idealized architecture with zero-latency memory accesses and no bus contention. It is used primarily for rapid prototyping of new ideas and for gathering statistics to support our analysis of the ray tracing computation. The second simulator is the Smart Memories simulator. This simulator provides detailed information about memory, bus, and cache behavior for the SHARP system as mapped onto the Smart Memories architecture. These statistics will be used to determine the behavior and feasibility of a hardware-based ray tracer.
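The four stages can be sketched as a queue-driven loop. The code below is a minimal software illustration, not SHARP itself: it uses a linear scan over spheres in place of the gridded acceleration structure, orthographic primary rays, a single Lambertian term for shading, and a hypothetical one-sphere scene. The shading stage returns a list of secondary rays to feed back into the queue, which is always empty in this sketch.

```python
import math
from collections import deque

SPHERES = [((0.0, 0.0, 5.0), 1.0)]  # (center, radius); hypothetical scene
LIGHT_DIR = (0.0, 0.0, -1.0)        # unit vector pointing toward the light


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def intersect(origin, direction):
    """Stage 2: nearest hit as (t, normal), or None. Linear scan, not a grid."""
    best = None
    for center, radius in SPHERES:
        oc = sub(origin, center)
        b = dot(oc, direction)
        disc = b * b - (dot(oc, oc) - radius * radius)
        if disc < 0.0:
            continue  # ray misses this sphere
        t = -b - math.sqrt(disc)
        if t > 0.0 and (best is None or t < best[0]):
            hit = tuple(o + t * d for o, d in zip(origin, direction))
            normal = tuple(x / radius for x in sub(hit, center))
            best = (t, normal)
    return best


def shade(hit):
    """Stage 3: Lambertian term plus secondary rays (none in this sketch)."""
    if hit is None:
        return 0.0, []
    _, normal = hit
    return max(0.0, dot(normal, LIGHT_DIR)), []


def render(width, height, spacing=2.0):
    image = [[0.0] * width for _ in range(height)]
    queue = deque()
    # Stage 1: schedule one orthographic primary ray per pixel.
    for py in range(height):
        for px in range(width):
            origin = ((px - width // 2) * spacing, (py - height // 2) * spacing, 0.0)
            queue.append((px, py, origin, (0.0, 0.0, 1.0)))
    while queue:
        px, py, origin, direction = queue.popleft()
        value, spawned = shade(intersect(origin, direction))
        image[py][px] += value  # Stage 4: sum results into the image
        queue.extend(spawned)   # feedback: secondary rays re-enter the queue
    return image


image = render(3, 3)
```

In this scene only the center ray hits the sphere, so the center pixel receives the full diffuse contribution and every other pixel stays at zero.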
Previous successful components to this project include:
This project is sponsored by DARPA ITO. Additional support has been provided by Digital, Intel, NVIDIA, and SGI.