Date: Fri, 27 Aug 2010 08:45:30 -0700
From: [ Stanford Registrar ]
Subject: Registrar Approval Mail

Congratulations! Your dissertation/thesis id 0000000545 has been approved by the Office of the University Registrar. Please find the comments as: Congratulations!. The following are the details:
Name of the Student: Jeremy Sugerman
Academic Career: GR
Academic Program: CS
Academic Plan: CS-PHD
Milestone title: Programming Many-Core Systems with GRAMPS

Naturally, Pat's final comments were changes to the figures. Equally naturally, coercing LaTeX into complying took far longer than the simple nature of the requests. Here's the official pdf (slated to appear here once the University / Library digests it). As Jeff warned, the post-processing to insert the copyright and signature pages made the bookmarks and metadata a little wonky, so here's the final presubmission pdf, too.
Just shy of seven years, all told (September 2003 - August 2010). Now I have to figure out what I'm going to be when I grow up. Or at least find a job or something.
Despite my paranoia, and contingency plans, the Student Services Center had no qualms about accepting standard acid-free printer paper for the two required hard-copy pages. Which is good, since with electronic submission for the thesis itself, I certainly hope they just toss the title page, and scan and toss the signature page. Now I just need to make the final edits and secure Pat's final sign-off.
University Oral Examination: Friday, May 14, 2:15pm, Gates 104

Programming Many-Core Systems with GRAMPS
Jeremy Sugerman
Advised by Pat Hanrahan
Department of Computer Science
Stanford University

2:15 PM, Friday, May 14, 2010 (Refreshments at 2:00 PM)
Gates building, room 104

Abstract: The era of increased performance from faster single cores and optimized single core programs is ending. Instead, a major aspect of the improvements to new processors is more parallelism: multiple threads per core and multiple cores per processor. In addition, highly parallel GPU cores, initially developed for shading, are increasingly being adopted to offload and augment conventional CPUs. Processor vendors are already discussing chips that combine CPU and GPU cores. These trends are leading to heterogeneous, commodity, many-core platforms that have excellent *potential* performance, but also (not-so-excellent) significant *actual* complexity. In both research and industry run-time systems, domain-specific languages, and more generally, parallel programming models, are becoming the tools to contain this complexity and deliver this performance. In this talk I will present GRAMPS, a programming model for these heterogeneous, commodity, many-core systems that expresses programs as graphs of thread- and data-parallel stages communicating via queues. I will show how we built applications from domains including interactive graphics, MapReduce, physical simulation, and image processing in GRAMPS and describe GRAMPS implementations for three different architectures: two simulated future rendering platforms and one current multicore x86 machine. In these cases, GRAMPS efficiently finds and exploits the available parallelism while containing the footprint / buffering required for the queues. Finally, I will discuss how GRAMPS's scheduling compares to three representatives of popular programming models: task-stealing, breadth-first, and static scheduling. GRAMPS's dynamic scheduling provides good load-balance with low overhead and its knowledge from the application graph gives it multiple advantages in managing footprint.
Anyhow, despite this, I grudgingly put together a poster for the Computer Forum poster session last week (ppt) (pdf). I've ducked it for the past few years because it conveniently always falls on Wednesday, but now that I no longer go to VMware, it seemed like I should present rather than look for another excuse.
# Note: Sometimes a thing must be done, even if of dubious merit,
# just to prove it is possible. Thus it is with this (awesome) list
# comprehension. And people like to complain about perl programs...
data = dict([(k, [ (r['testName'], r['cyclesPerCore'] / 100000)
                   for r in allBySched[k] if r['numCores'] == maxCores ])
             for k in allBySched ])

That's actual production code for pulling out the appropriate subset from the complete test results (a dictionary whose entries are lists of inner-dictionaries; each inner-dictionary contains all the parsed stats from one run) to plot total execution time for each test as a function of scheduler mode.
Earlier, I caught myself typing initial experiments (into the interpreter interactively) to add layers of reduce() and some choice lambdas to "streamline" other pieces even more. Luckily, I stopped before the fevered muttering got too strong. As entertaining as it is to think in functional language constructs, the resulting code rapidly attains write-only status. I mean, I'm uncertain whether I'll still understand the snippet above when I wake up in the morning.
Anyhow, I have to say that the Python compiler / interpreter is a lot grouchier than perl. I think it stems from two (correlated) factors: a designer with an academic interest in programming languages and first-class exceptions as an intrinsic language feature from the start. Perl would attempt to do what I want in the name of getting the job done. Python wants to teach me a little something about programming and happens to have this nice exception mechanism sitting around with which to thwack me if I attempt expedience (or sloppiness, depending on the eye of the beholder). On the other hand, the syntax, especially for nested types (e.g., dictionaries-of-dictionaries, arrays-of-dicts), is nicer. I do miss the closing curly braces, though. It looks weird to leave occasionally deeply indented lines hanging there terminating a function or a class. I can make things look nicer to my eye with a terminal 'return' to close functions, but that's probably not idiomatic python.
And the language nearly lost me, no matter how hip it is, when I went to use the C ternary operator (which tends to happen within the first 20 lines of code I write in any language), failed, and read through the history of rationalizations that ultimately led to Python's alternative ternary syntax. I would probably have walked away with no regrets if it were still the pre-2.5 days of ternary hand-wringing and quibbling.
Of course, the universe is not without some conservation of grief. It appears that the default experience remotely displaying GLX applications back to an X server running the nvidia binary drivers is for the client GLX code to fail to find any usable visuals and blow up ("couldn't find RGB GLX visual or fbconfig"). The entertaining part, though, is that all that's really happening is that at least one member of the nvidia / X.org / mesa trio is lying about itself in a way that prevents software (indirect) rendering from kicking in. When I explicitly set LIBGL_ALWAYS_INDIRECT=1 then everything works just like it used to. The Internet indicates many users have bounced off this problem with varying degrees of recognition and success. I haven't found any maintainers (distribution or among the trio) who acknowledge the problem even enough to point blame at someone else, let alone adjust their code so that the right thing happens. Ah, well.
After some testing, I glanced back at my code and discovered g++ had been happily compiling:
if (maxRecord != 0 and maxRecord < entries.size()) {
    entries.resize(maxRecord);
}

without any complaint. Not only that, my editor had been happily highlighting 'and' as a reserved word. After some puzzled exclamations and skepticism, I satisfied myself that g++ would accept 'and' in place of && and gcc would not. After still more digging (producing a search query involving the word 'and' that produces useful results is nontrivial), I learned the result above: this is all by design. Maybe the C++ language designers want closer source compatibility with scripting languages. Or more ways to torture the heroic non-native English speakers who already endure highly anglocentric programming languages (what about internationalization? Clearly my compiler should look for different && synonyms depending on my locale...).
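For the record, here's a minimal standalone demonstration (a toy, not the project code): g++ compiles it cleanly because 'and' and 'not' are standard C++ alternative tokens, while plain C only gets them as macros if you include <iso646.h>.

// Toy demonstration of C++ alternative tokens; compiles as-is with g++.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> entries(10, 42);
    std::size_t maxRecord = 4;
    if (maxRecord != 0 and maxRecord < entries.size()) {
        entries.resize(maxRecord);
    }
    std::printf("%zu entries remain\n", entries.size());
    return not entries.empty() ? 0 : 1;
}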
Maybe a future language revision will accept "x equals one plus one." (I verified that 'plus' is not (yet) a keyword). And then we'll have to be careful about mocking COBOL programs. Or SQL queries. At least APL will still be thoroughly mockable.
I now know where Ubuntu chooses to package the man pages for all the core pthread_* functions, but NOT pthreads(7) or a number of support functions: manpages-posix-dev. Of course. To aid discoverability, the package has this (uselessly) pithy description:
Description: Manual pages about using a POSIX system for development
 These man pages describe the POSIX programming interface, including these two sections:
 7 = POSIX header files (with 7posix extension)
 3 = POSIX library calls (with 3posix extension)

Grr. I suspect Ubuntu inherited this particular partitioning of content from Debian where it was employed for the sort of pedantically correct and unreasonably stupid reasons of which that project is so fond.
And yes, after dodging it for ages, I find myself working on a pthreads-derived GRAMPS runtime. I blame Christos.
I also put together an abbreviated (read: harder to follow) version that I presented at the PPL retreat the next day: (ppt) (pdf). That reception was much more reserved. It's partially setting-- a large room full of people more in listening (plus checking e-mail on their laptops) mode-- and partially cramming the whole thing into 20-ish minutes.
Mike asked how GRAMPS compares to TBB, Grand Central Dispatch, and the general emerging workqueue based toolkits. I had my suspicions and superficial answer at the time. Since then, I've read through a lot of the documentation for both and realized how thoroughly different they are (from GRAMPS-- they are very similar to each other). So much so that I need to think about how best to relate them at all (since that is a very valid-seeming question). Also, good grief do the task scheduler operations in TBB require excruciating micromanagement boiler-plate! It is very evident that their intent is to be the invisible underpinning of the TBB parallel data structures and a last-resort escape hatch rather than a primary API for users.
The first reflected a flaw in the simulator / a flaw in my assumption that execution units are the only measure of forward progress. For some applications, the last work can be consumed in the middle of the pipeline (even without loops). For example, any time the last triangle in a scene is backfacing or completely occluded, the rasterization pipeline does its last work at RAST: it eats the triangle and emits no fragments. At this point, RAST completes and GRAMPS sends NOMORE along all its outputs. If the downstream stages were threads, then they would explicitly consume the NOMORE and finish. When the downstream stages are shaders, however, GRAMPS itself consumes the NOMORE and propagates it downstream. Since the run-time is built into the simulator, no simulated instructions are issued and the simulator shows no forward progress for as many cycles as there are shader stages between the last stage to do work and the bottom of the pipeline. I could fix this by enhancing the scheduler's handling of NOMORE for shader stages, but that seems like a poor optimization / complexity tradeoff for a tiny penalty (single cycle) for a rare occurrence (once per stage over the lifetime of the application), so I just added 10 cycles worth of tolerance into the deadlock detector.
The second flaw was less expected (I had some notion from things I have seen in logs before that termination sometimes took a couple of cycles to be recognized)-- one of the sanity tests had a run of more than 98 thousand cycles without making forward progress, after which it noticed it was done and exited cleanly. It turned out that the micro-core scheduler never made any calls to GrCheckForDoneOrDeadlock(). Thus, GPU-like GRAMPS did not notice termination for execution graphs with cycles until the tier-1 scheduling period elapsed for the fat core (every million or so cycles). Oops.
With these changes, I was able to use in-place combine shaders to implement a combine stage in the map-reduce histogram. Actually, I was pleasantly surprised that my scheme for propagating the final results fit seamlessly into the same interface the reduce stage already expected, so inserting combine was completely transparent.
... One disappointing thing is that the run-time currently appears to cope very poorly with histogram-combine. That is, it's a marginal footprint savings over reduce and it's way slower. I'll look into it more when I'm back on campus, but one nuisance problem is that GrTier1CanPreReserve() always iterates through subqueues starting from zero. When there's a really deep subQueue with a high index (as happens to the histogram bucket for the brightest pixels), it gets starved until all the lower numbered buckets are completely empty. This (a) makes it really deep and (b) can lead to deadlock if it actually fills and enough map instances block trying to push more. Bottom line: a good reminder that push coalescing is really poor with queue sets and that always scanning a list from the front has fairness problems.
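For what it's worth, the generic fix I have in mind for the fairness half is just to rotate where the scan starts rather than always beginning at subqueue zero. A hedged sketch, not the actual GrTier1CanPreReserve() code; all names here are hypothetical:

// Hedged sketch of round-robin subqueue scanning so a high-index, deep
// subqueue can't be starved by lower-numbered ones. Names are hypothetical.
struct SubQueueScanner {
    int numSubQueues;
    int nextStart;   // rotates so every subqueue eventually gets to go first

    // Return the first subqueue index satisfying 'pred', scanning round-robin.
    template <typename Pred>
    int FindSubQueue(Pred pred) {
        for (int i = 0; i < numSubQueues; i++) {
            int idx = (nextStart + i) % numSubQueues;
            if (pred(idx)) {
                nextStart = (idx + 1) % numSubQueues;  // resume after the winner
                return idx;
            }
        }
        return -1;  // nothing eligible right now
    }
};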
Bottom line: I can now slide a combine stage transparently into a map-reduce app (when the reduction operation supports it) and GRAMPS can do the appropriate parallel partial reduction. But there are some dubious, suboptimal paths that currently get tickled in the run-time along the way.
OpenCL right now is, not necessarily intentionally but nevertheless indisputably, extremely short-term / low horizon. It's at its best standardizing the status quo (or at least the status quo of what I suspect the GPU vendors have designed, even though neither Larrabee nor whatever AMD/ATI's OpenCL capable GPU will be called exist publicly). It's very weak-- somewhere between flawed, naive, and just weird-- when trying to be more future looking and future proof. I think that's mostly fine, but suspect some of the wackier things in it will become the OpenCL equivalent of OpenGL's built-in spline evaluators: cute legacy niches that every implementation is stuck supporting, but are never used in practice. I have a fairly extensive set of impressions and future guidance for OpenCL after digging into the conceptual / specification side of things at SIGGRAPH and hacking on some simple apps last week. Maybe I'll take some time to write them up and post them.
As before, a few minor surprises showed up in testing. There's a significant startup cost associated with AMD's OpenCL implementation. On my machine, the app takes at least 80 milliseconds even with a tiny work size. Interestingly, even as I scale the work size up to the point where it takes 80 milliseconds just to run the kernel (as reported by clGetEventProfilingInfo()), the entire app itself only takes 100 milliseconds. Years ago, when we were writing gpubench and testing Brook, we saw that behavior launching kernels on a GPU and explained it as a fixed host side driver cost plus a variable asynchronous GPU cost. That makes somewhat less sense when everything's running on the host / CPUs and thus cycles are zero sum.
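For reference, the kernel time I'm quoting comes from event profiling, roughly like this (a hedged sketch of the measurement, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and the event came back from clEnqueueNDRangeKernel()):

#include <stdio.h>
#include <CL/cl.h>

// Hedged sketch: report device-side kernel time from a profiling-enabled event.
static void ReportKernelTime(cl_event kernelEvent) {
    cl_ulong startNs = 0, endNs = 0;
    clWaitForEvents(1, &kernelEvent);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_START,
                            sizeof(startNs), &startNs, NULL);
    clGetEventProfilingInfo(kernelEvent, CL_PROFILING_COMMAND_END,
                            sizeof(endNs), &endNs, NULL);
    printf("kernel time: %.3f ms\n", (endNs - startNs) / 1e6);
}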
The other surprise was that valgrind gets mildly unhappy at the AMD OpenCL library when I run my simple test. It's intermittent and never shows up with extremely small work sizes (1 - 16 entries), but regularly appears at 256 instances or higher. Valgrind complains of a number of invalid 4-byte reads and writes deep in the guts of the library (e.g., a bunch of different offsets inside cpu::WorkGroupOperation::execute()). I wish I knew enough about what it was doing to have an idea of whether this is somehow my fault.
I'm also curious whether it's attempting to enforce my WRITE_ONLY buffer or if I could actually read from it without problems. And I wonder what's going on with alignment: the demos explicitly align their allocations, but I do not. I wonder if it's a performance win, total superstition, or a correctness requirement I'm violating without the library enforcing it.
OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.

This makes sense. Without this escape, a GPU or accelerator would be stuck using PCI-e or other shared memory to access the buffer object and it would be too slow to be useful. Where I wish the spec would chime in is when the implementation is required to flush the cached copy back to the host. The entire point of CL_MEM_USE_HOST_PTR, from my perspective, is a syntactic simplification to save boiler-plate calls to copy my data in and out. I would like that to mean that as long as I make sure to wait for all kernels accessing the buffer to finish, my host code can directly consume its pointer to the data. I.e., any device side cache is filled on launch and flushed on completion so that host code is free to do arbitrary things outside of that time. If correctness requires explicit additional synchronization or calls to clEnqueueMapBuffer then CL_MEM_USE_HOST_PTR is useless as a syntactic aid and seems effectively useless overall (I can't construct any compelling performance optimizations it enables under those conditions). I just wish the specification would rule one way or the other. It's such a natural question and surprising omission.
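Concretely, here is the usage I want CL_MEM_USE_HOST_PTR to license (a hedged sketch, not code from my test app; whether the final direct read is actually guaranteed is exactly what the spec leaves open):

#include <CL/cl.h>

// Hedged sketch: wrap an existing host array, run the kernel, wait for it,
// then read the array directly with no clEnqueueReadBuffer / MapBuffer calls.
static float FirstResult(cl_context ctx, cl_command_queue q, cl_kernel k,
                         float *hostData, size_t count) {
    cl_int status;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                count * sizeof(float), hostData, &status);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    cl_event done;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &count, NULL, 0, NULL, &done);
    clWaitForEvents(1, &done);        // every kernel touching 'buf' has finished

    clReleaseEvent(done);
    clReleaseMemObject(buf);
    return hostData[0];               // the direct read I wish the spec blessed outright
}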
(Note: one could construct an argument that it has an additional advantage of conserving host address space, but (a) even huge device memories are pretty small in the grand scheme of things and (b) in the age of 64-bit hosts, address space is even more plentiful).
For test only: Expires on Wed Sep 30 00:00:00 2009
Found 1 platform(s).
platform[(nil)]: profile: FULL_PROFILE
platform[(nil)]: version: OpenCL 1.0 ATI-Stream-v2.0-beta2
platform[(nil)]: name: ATI Stream
platform[(nil)]: vendor: Advanced Micro Devices
platform[(nil)]: extensions:
platform[(nil)]: Found 1 device(s).
device[0x9ac2d56]: NAME: Intel(R) Pentium(R) D CPU 3.00GHz
device[0x9ac2d56]: VENDOR: GenuineIntel
device[0x9ac2d56]: PROFILE: FULL_PROFILE
device[0x9ac2d56]: VERSION: OpenCL 1.0
device[0x9ac2d56]: EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store
device[0x9ac2d56]: DRIVER_VERSION: 1.0
device[0x9ac2d56]: Type: CPU
device[0x9ac2d56]: EXECUTION_CAPABILITIES: Kernel
device[0x9ac2d56]: GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
device[0x9ac2d56]: CL_DEVICE_LOCAL_MEM_TYPE: Global (2)
device[0x9ac2d56]: SINGLE_FP_CONFIG: 0x7
device[0x9ac2d56]: QUEUE_PROPERTIES: 0x2
device[0x9ac2d56]: VENDOR_ID: 4098
device[0x9ac2d56]: MAX_COMPUTE_UNITS: 2
device[0x9ac2d56]: MAX_WORK_ITEM_DIMENSIONS: 3
device[0x9ac2d56]: MAX_WORK_GROUP_SIZE: 1024
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_CHAR: 16
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_SHORT: 8
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_INT: 4
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_LONG: 2
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_FLOAT: 4
device[0x9ac2d56]: PREFERRED_VECTOR_WIDTH_DOUBLE: 0
device[0x9ac2d56]: MAX_CLOCK_FREQUENCY: 2992
device[0x9ac2d56]: ADDRESS_BITS: 32
device[0x9ac2d56]: MAX_MEM_ALLOC_SIZE: 1073741824
device[0x9ac2d56]: IMAGE_SUPPORT: 0
device[0x9ac2d56]: MAX_READ_IMAGE_ARGS: 0
device[0x9ac2d56]: MAX_WRITE_IMAGE_ARGS: 0
device[0x9ac2d56]: IMAGE2D_MAX_WIDTH: 0
device[0x9ac2d56]: IMAGE2D_MAX_HEIGHT: 0
device[0x9ac2d56]: IMAGE3D_MAX_WIDTH: 0
device[0x9ac2d56]: IMAGE3D_MAX_HEIGHT: 0
device[0x9ac2d56]: IMAGE3D_MAX_DEPTH: 0
device[0x9ac2d56]: MAX_SAMPLERS: 0
device[0x9ac2d56]: MAX_PARAMETER_SIZE: 4096
device[0x9ac2d56]: MEM_BASE_ADDR_ALIGN: 1024
device[0x9ac2d56]: MIN_DATA_TYPE_ALIGN_SIZE: 128
device[0x9ac2d56]: GLOBAL_MEM_CACHELINE_SIZE: 64
device[0x9ac2d56]: GLOBAL_MEM_CACHE_SIZE: 65536
device[0x9ac2d56]: GLOBAL_MEM_SIZE: 1073741824
device[0x9ac2d56]: MAX_CONSTANT_BUFFER_SIZE: 65536
device[0x9ac2d56]: MAX_CONSTANT_ARGS: 8
device[0x9ac2d56]: LOCAL_MEM_SIZE: 32768
device[0x9ac2d56]: ERROR_CORRECTION_SUPPORT: 0
device[0x9ac2d56]: PROFILING_TIMER_RESOLUTION: 1
device[0x9ac2d56]: ENDIAN_LITTLE: 1
device[0x9ac2d56]: AVAILABLE: 1
device[0x9ac2d56]: COMPILER_AVAILABLE: 1

A few surprises popped up. As is immediately visible, AMD doesn't bother returning a valid platform ID. Instead they rely upon the "implementation-defined" semantics for NULL. Fair enough, it's their implementation. More surprisingly, a number of device parameters turned out to be 64 bit quantities. Rather than sort it out, I just always use a long long for numeric properties (it's not as if peak integer performance is critical, so simpler code wins). Oh, and valgrind informs me that AMD's OpenCL libraries leak memory (though thankfully my super simple program itself does not).
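For what it's worth, the "just use a long long" dodge looks roughly like this (a hedged sketch, not the exact dumper code; it leans on the device being little-endian, which the ENDIAN_LITTLE line above confirms):

#include <CL/cl.h>

// Hedged sketch: zero the destination and let clGetDeviceInfo fill however
// many bytes the numeric property actually is. Only sensible on a
// little-endian host, which this one is.
static long long DeviceInfoAsLL(cl_device_id dev, cl_device_info what) {
    long long value = 0;  // zeroed so 4-byte properties leave the high bytes clean
    clGetDeviceInfo(dev, what, sizeof(value), &value, NULL);
    return value;
}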
I wonder if I can find someone who has the NVIDIA OpenCL drivers to compare results. Or I could dig up an NVIDIA GPU, thereby completing the irony (running AMD's OpenCL run-time on a machine with Intel CPUs and an NVIDIA GPU) and also freeing me from my BrokenGL versus broken ATI driver installation grief.
/*
 * Create a program long enough to bind the specified kernel and then
 * decrement its ref count immediately. This should work fine unless
 * the kernel object both depends on the program object and
 * clCreateKernel() fails to call clRetainProgram() internally.
 */
src = FileToString(inputFile);
srcLen = strlen(src);
program = clCreateProgramWithSource(ctx, 1, &src, &srcLen, &status);
clBuildProgram(program, 1, device_list, NULL, NULL, NULL);
kernel = clCreateKernel(program, kernelName, &status);
clReleaseProgram(program);
return kernel;

The specification is silent on this point. It'd be pointlessly burdensome to require the application to keep around a handle to a program object it's never going to use again just to free it later (or to use clGetKernelInfo() to fish it out at cleanup time). Of course, this is the same specification that returns success / failure as an out-param rather than a return value, so burdening the programmer is clearly not a significant concern.
Hmm. Actually, it occurs to me to wonder about this question more generally. For example, can I clReleaseContext() after the last time I explicitly refer to my context and rely upon implicit reference counts? Once I have a command queue and my kernels, my program is done with explicit references to the context. I'd like to release it then and forget about it, but obviously I don't want the run-time to free it until all the dependent objects are dead.
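In code, the pattern I'd like blessed is something like this (a hedged sketch; BuildKernelFromFile() and the "histogram" names are hypothetical stand-ins for the helper above, and whether the early release is actually safe is exactly the open question):

#include <CL/cl.h>

// Hypothetical helper along the lines of the snippet above.
cl_kernel BuildKernelFromFile(cl_context ctx, cl_device_id *devs,
                              const char *file, const char *kernelName);

// Hedged sketch: release the context as soon as the queue and kernel exist,
// and hope their implicit references keep it alive until they die.
static void SetupThenForgetContext(cl_context ctx, cl_device_id dev,
                                   cl_command_queue *queueOut,
                                   cl_kernel *kernelOut) {
    cl_int status;
    *queueOut = clCreateCommandQueue(ctx, dev, 0, &status);
    *kernelOut = BuildKernelFromFile(ctx, &dev, "histogram.cl", "histo_map");
    clReleaseContext(ctx);  // done with explicit uses; do the queue + kernel suffice?
}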
The ironic thing, or perhaps just the disappointing thing, is that this is a specification predicated on the existence of nontrivial compiler technology, which is leveraged all over the place for the kernels themselves. Would it have been so hard to do something similar for setup? (I say this with full fear of the ludicrous templated C++ bindings that are already making their way from the abyss to this more innocent world).
In other news, there are a couple of clear places the OpenCL authors were feeling a bit punchy and let loose in the footnotes. Two fine examples, back to back (emphasis added in both cases):
28. The user is cautioned that for some usages, e.g. mad(a, b, -a*b), the definition of mad() is loose enough that almost any result is allowed from mad() for some values of a and b.
29. Frequently vector operations need n + 1 bits temporarily to calculate a result. The rhadd instruction gives you an extra bit without needing to upsample and downsample. This can be a profound performance win.
Important safety tips, thanks Egon (possibly even profoundly important, if you know what I mean).
Bonus hilarity: The top of the Developer Central Registration form says, "Membership in AMD Developer Central is free, but the benefits are priceless." The word 'benefits' is a hyperlink and, curious to learn about the awesomeness that awaited, I clicked on it. The result, a page saying "Error - Page Not Found" and search links to all the pages that included the word benefits. On the other hand, if it turns out this broken link is intentional commentary from some atypically jaded AMD employee on just how priceless the benefits of registration are, I want to give him or her a gold star.
Update: Sigh. It turns out I can predict the future. I already have an AMD Developer Central account. Whose username and password I've already forgotten. Hooray for the password reset / recovery form. Ridiculous. Well, time to download all the packages and cache them locally before I forget and have to repeat this rigamarole.
Anyhow, as I've wandered away from daily GPGPU programming, my requirements for graphics cards have dropped even lower. I just want multi-mon and basic OpenGL to work. I needed a graphics card for a new machine and we had a big stack of ATI boards lying around (in nice shiny boxes, even) and no NVIDIA cards. This should have been a hint, but I hoped for the best. Ubuntu installed whatever it wanted (the open source radeon drivers) and things appeared to work. Except that under closer scrutiny, GL was pretty flawed. Anything double buffered in GLUT never refreshed, basic line drawing was inordinately chunky and CPU heavy, anti-aliasing was woefully broken / not present, etc. I shrugged and chalked up another victory for ATI.
At SIGGRAPH, though, Mike was bragging about how the ATI driver people have ground-up rebuilt their OpenGL drivers and totally rounded the corner on quality and also had their own proprietary Linux drivers. I was highly dubious, but decided to take him at his word. So, first I asked Ubuntu to install the proprietary (fglrx) drivers. That was exciting-- it produced an X server that crashed on startup and left my display unusable. Logging in remotely, I discovered that I had to manually uninstall the open source radeon drivers to avoid some conflict that apparently was beyond the ability of the maintainers to express with the packaging system (the same packaging system that somehow handles all the rest of Ubuntu, including the proprietary NVIDIA drivers, without problems). Hooray. Except that it didn't work. The driver didn't automatically claim my card and when I explicitly told X to use the fglrx driver, the driver found no supported devices. /usr/bin/aticonfig told me the same thing. Now, this isn't some shady engineering sample board we had lying around the lab. It came out of a shiny, glossy, official ATI-branded box and lspci identifies it as "ATI Technologies Inc R520GL [FireGL V7200]".
But, whatever. I uninstalled the various Ubuntu packages and went to the ATI/AMD website. I told it I had a FireGL V7200 and it told me it had just what I needed. I'm somewhat dubious of installing core modules outside the management of the packaging system because if (who am I kidding, when) it fails, cleanup will be messy. Luckily, I don't have to worry about it. Because the ATI installer doesn't get anywhere near that far. Just after decompressing the driver, it tells me:
Error: ./default_policy.sh does not support version default:v2:i686:lib::none:2.6.28-14-server; make sure that the version is being correctly set by --iscurrentdistro

Right, then. I actually have a reasonable idea of what that means, but what I don't have is any motivation to try and make it work. I mean, seriously, NVIDIA solved the binary driver problem years ago. It went through some teething and rough times, but now it just works. And when it doesn't just work, it just says that it failed or even offers a reason why. I mean, I'm at least as fond of broken error message spew from careless shell scripts as anyone else alive (or at least anyone within five years of my age), but that's just embarrassing.
So, in summary, it's back to BrokenGL courtesy of the open source driver, my long-standing prejudices have a bit more anecdotal grist for their mill, and I've got my eye out for spare NVIDIA hardware. I wonder if Larrabee will be this much fun.
// FIXME(kayvonf): with finite size push queues or loops, inspect
// bits stay on and we could deadlock
if (grGlobals.sched.stageMaskInspect.Any())
    return;

The second circumstance is that the code only produces visualization information on scheduling changes. A deadlocked application never changes, so no further records are logged as soon as things deadlock. The only giveaway is that the cycle count reported by grampsviz stops increasing and doesn't match the text in the log. Finally, there appears to be some randomness, or at least hard-to-predict behavior, associated with which inputs deadlock with an inspect bit on and which don't. Hence the proper detection of deadlock on other images. Now I just have to figure out what, if anything, to do about it (the only possibility is to discover / report better errors, so there's limited upside in spending too long fussing. I do like to fuss, though. And I dislike pathological failure handling.)
Of course, it actually is sort of good news because the need for a huge log file was a symptom of a deeper flaw. I handed my histogram app a 512x512 image and then went and patiently did other things while it ran. Only it ran and ran. And ran. And ran. And eventually got killed for logging more than two gigs. Sometimes I don't think very deeply when in scientific method mode, so I responded by turning off logging and assuming the logging was also slowing things down. After rather a while I gave up on that and tried a 256x256 image. After rather a while my brain actually engaged and I took a step back. I timed the 64x64 processing (barely more than a second) and decided that there was no way the slowdown I was seeing at 256x256 (only 16x the data) was remotely reasonable. I looked at the grampsviz output from a run I'd killed prematurely and realized that one of the subqueues was actually filling (presumably the one for zero or 256 since those seem to be the two big spikes in the images I have at hand).
Now, in and of itself, a subqueue filling is a problem, but a problem that should produce deadlock and catch the notice of the deadlock detector. I became suspicious, ratcheted the queue sizes way down (10 slots per subqueue/bucket), and tried the tiny images again. Happily and sadly, the deadlock detector promptly fired. In proper Brownian motion debugging, I upped the queue sizes to 100. Still deadlock (160 of the 256 pixels in the 16x16 image are dark enough to quantize to a luma of zero). But then I tried the 64x64 image and discovered that I had successfully coaxed it into reproducing the bad state the larger images were showing with larger bin sizes. Now I just have to figure out what's going on. I'm sure (or at least strongly suspect) something fishy is going on in the accounting for shader instances that block on push, potentially in conjunction with queue sets or dynamically instanced queue sets. I probably also want a distinct indication in grampsviz for shader instances that are blocked but still tying up a thread slot.
Alex asked a good question: What is Map-Reduce bad at? Map-Reduce is here to stay, so it's valuable to delineate both what it handles readily and for what it is ill-suited. Kunle offered matrix multiply as a weakness. I only commented about graphics, since those are the only apps I would say I know well enough. D3D/OpenGL style pipelines are going to look appealing from a high level, but be awful quagmires without some extensions to enforce ordering. Order pervades them at too many fundamental levels and tagging everything with a sequence number, buffering, and reordering at the end is anathema to how they get performance. REYES, on the other hand, seems nicely suited. In all cases, the current implicit split and partition phases in Map-Reduce align with the graphics fixed function / non-data parallel stages and I think it's possible a good implementation would want something more stateful than a simple function callback, but I'm not sure.
In other news, I stumped Amazon. A search for 'paperback bookcase' returns no furniture. Indeed, it returns nothing other than books. Bookcases optimized for the storage of paperbacks seemed like an obvious niche to me (and a wider search online produces some hits), but apparently none of the Amazon vendors agree.
Frustratingly, lots of little nuisances arose in what seemed like a minor task. First, I wanted combine to be a shader since it's logically as lightweight and stateless as any shader instance (the whole appeal of the GPGPU parallel reduction model). However, partial reduction inherently involves consuming two inputs and the current shader instance environment is entirely organized around processing single elements. I double checked and reminded myself that the combine shader sanity test I'd written had cheated and used a buffer for its accumulation of partial results. With a sparse and dynamic key space, that doesn't seem as kosher.
This didn't (doesn't) seem insurmountable, but prompted me to try to write a thread stage version of combine first. The code was straightforward: bind the same queue as input and output, reserve two packets on the input side and one on the output side, combine the two inputs into the output, and commit. Unfortunately, this promptly deadlocked. It's not exactly a real deadlock, but it's close enough to be trouble. The intermediate-pairs queue now has two producers: map and combine. Additionally, combine creates a cycle since it also consumes from intermediate-pairs. The semantic combine really wants is, "read two at a time from intermediate-pairs as long as map is running, then get a short read once there's only one packet left and map is done." That's a little too nuanced for the current deadlock detector, though. Instead, combine eventually blocks on a two-packet input reservation, map is done, and the system deadlocks. I suspect a reasonable general rule is that a stage that produces to a queue it also consumes should still receive NO_MORE if all its upstream producers complete, but I haven't thought it through in detail yet.
Anyhow, these cascading nuisances have prompted me to wonder if the right solution is a formal filter/partial-reduction shader stage type. In conjunction with queue sets, I'm pretty sure I can fake it with instanced thread stages that accumulate their partial results internally rather than constantly returning them to the intermediate-pairs queue, but that precludes creating multiple parallel instances that all handle the same subqueue for e.g., a logarithmic reduction.
Update: Indeed, faking it with instanced thread stages that keep the partial result on their stack works correctly. No surprise.
My preliminary thinking is that rather than compacting in local store and copying into the queue, there are at least some situations where the push compaction should just be done in-place in each subqueue and the dispatch thread should maintain an array of windows per subqueue (or a cache of windows per some large number of subqueues). It absolutely makes sense in the case of reduce, where the downstream stage isn't going to run until all data is available, so there's no pressure to get data committed early. I think it makes sense more generally whenever pushing to an unordered queue. It may cause fragmentation in the internal queue implementation, but since there's no head-of-line blocking in unordered queues, the increase in packet utilization / density should be the more significant effect.
There are many morals to this tale, but honestly, the schadenfreude factor is really swamping them all for me. I've just smiled as the various confused, frantic, and resigned emails have gone around. Well, and I tried to help Solomon a little, but quickly established that there wasn't much to be done. For my part, I didn't even notice any hiccups, though I suppose I would have if I'd tried to write this earlier or checkin outstanding changes.
------------------------------------------------------------------------
r1000 | yoel | 2009-04-03 16:40:43 -0700 (Fri, 03 Apr 2009) | 27 lines

Make visualization of dynamic instancing actually work.

All my prior changes to dump and visualize workloads with dynamic instancing were braindead. In their defense, they didn't segfault and they worked if the graph only had a single queue. That turns out not to be terribly helpful, though.

Now I've added enough information to each VizRecord to determine the correct starting offset of each queue in the global list of subqueues and fixed up the surrounding code to honour it. The grampsviz images for histogram make sense and provide value now.
...

Oops.
In other high quality facilities news, GCafe stalled for a good five minutes this afternoon. The reconstructed story: the building carpets are being cleaned in the evenings this week. Last night they cleaned the third floor. For highly dubious reasons, the same circuit the janitors use to power their equipment happens to be the one that drives the plasma television and projector in the graphics lab conference room. Shampooing the carpets blew the circuit breaker (which it apparently does routinely), the janitors moved to another plug (also apparently routine), and we were stuck figuring out what had happened and then tracking down the breaker panel in question (over next to the water fountains, nowhere near the conference room). Or rather, John was stuck doing that while we gave up and used a portable projector someone had.
Here's something that has happened several times: a programmer asks me to intervene in some debate he is having with a program manager. "Who is going to write the code?" I asked. "I am..." "OK, who checks things into source control?" "Me, I guess, ..." "So what's the problem, exactly?" I asked. "You have absolute control over the state of each and every bit in the final product. What else do you need? A tiara?" You see, it turns out that this system puts the burden on the program manager to persuade the programmer, because at some point, the program manager runs the risk that the programmer will give up and just do whatever the heck the programmer feels like.

I encounter this dysfunction / dissatisfaction with shocking regularity. Programmers, senior well-respected programmers, express feelings of powerlessness in the face of product management, program management, regular vanilla management, etc. I always challenge them back that such powerlessness is only possible if they abdicate their inherent last word on product development. Fundamentally, engineers take their cues and strongest influence from their peers and leads. Those are the people surrounding them all day for advice, discussion, code reviews, mentoring, etc. They may have a weekly one on one with their managers and see him or her in a group meeting or two, but that's an order of magnitude less contact and social influence. Product management, program management, etc. are all even further away. Moreover, the whole reason "the organization" gives engineers fancy titles (tech lead, architect, distinguished engineer, staff engineer, principal engineer, etc.) is to convey both a recognition and trust of their judgement as well as an expectation that they will use it (I'll note that some, but definitely not all, of the time when I approach the managers, PMs, etc. about the flip side of such situations, they express an eagerness for the engineers to step up and show more leadership. It's the absence of that leadership that has them knowingly doing an uncomfortable job of covering rather than leaving a total vacuum). I think I'll start incorporating, "What else do you need? A tiara?" in my future exhortations to dismayed engineers that they can empower themselves simply by ceasing to abdicate their influence.
I want to first apologize for my intrusion. My name is ---- and I am a Staffing Specialist with NVIDIA Incorporated (Nasdaq: NVDA). I'm working with Senior Directors of one of NVIDIA's most visible and prestigious teams. We are looking for a CUDA DEVTECH ENGINEER #1136737 ...

I actually receive so few unsolicited recruiting emails that I still paid enough attention to judge this a pretty lame effort. There were clearly manual steps to this discovery / screening process that produced my name and the best he can do is devtech? Without even mentioning the other CUDA related openings NVIDIA's web page lists? I guess he only gets paid for hiring devtech, no matter how unlikely the fit. And that's seriously all the cover letter / introduction he can spare? It actually makes the email less appealing than if it were omitted entirely. The NVIDIA marketing youtube links in his signature are a nice touch, though.
*Technically, CPAN has a Parallel-MapReduce, but like many fledgling efforts it manages to be both massively over-engineered / complex and not actually able to express the usage I want. Alas.
Anyhow, after the PPL meeting today, we again tilted at the windmill of parallel taxonomies and required building blocks. I frequently complain that producer-consumer gets overlooked when the parallelism in an application is superficially reduced to what's "task" and what's "data" parallel, and that building blocks for the producer-consumer case are a requirement for any lowest-level parallelism substrate, so just talking about tasks and vectors misses out. Pat justifiably complained that producer-consumer is orthogonal to the abstract academic definitions of task and data parallelism and has prompted me to reevaluate my phrasing once again. Here's what I ultimately wrote:
I suppose I'd rather take a step back from a single task/data dichotomy and say that the taxonomy of significant parallel primitives includes four things:
I've also drawn a few generalizations about how things seem to work out 'in practice':
On a vaguely related note, I think I have finally wrapped my head around what Pat calls a data parallel domain specific language. The longstanding stumbling block has been my underappreciation of the role of the lambda functions / element-wise filters applied with every map, reduce, over, group-by, etc. operation and Pat's insistence on calling it data parallel. I see it as a domain specific language for any parallel program that can be expressed as a static computation graph (or expression tree if you perceive each stage as an operator the way Pat does). It's really the ultimate generalization of all computation graph based systems. Where streaming, at least as it has been adopted, only includes 'map' shaders and GRAMPS supports 'map + push' shaders and threads (the degenerate case of applying a trivial vector operator to a single-element vector and putting all the work in the lambda / filter function), this language supports a whole spectrum of distinctly typed stages. I'm curious whether it's significantly more expressive than GRAMPS (or map-reduce, or any of a number of other systems that have consciously reduced the space). If it isn't, then it's really a poster-child domain specific language. The whole thing can be implemented on a more minimal set of primitives while exposing richer higher level primitives upwards. If it is, it would be interesting to know if there's a viable fast implementation of the missing part or whether it inherently doesn't map well even to the bare metal.
The problem is minor. It's unsettling, however, to discover that extensively used and seemingly correct code is wrong. I would have taken a not-implemented in stride, but this manifestation has more of the "I found a compiler bug" feel where I become suspicious of everything and everyone. Luckily there are meds for that.
#define PRODUCEFN(fn) \
    static void fn(void); \
    void (*_MRProduceFn)(void) = fn; \
    void fn(void)

#define MAPFN(fn, input) \
    static void fn(const void *input); \
    void (*_MRMapFn)(const void *input) = fn; \
    void fn(const void *input)

#define COMBINEFN(fn, key, v1, v2) \
    static void fn(const void *key, const void *v1, const void *v2); \
    void (*_MRCombineFn)(const void *key, const void *v1, const void *v2) = fn; \
    void fn(const void *key, const void *v1, const void *v2)

#define REDUCEFN(fn, key, values, num) \
    static void fn(const void *key, const void *values, int num); \
    void (*_MRReduceFn)(const void *key, const void *values, int num) = fn; \
    void fn(const void *key, const void *values, int num)

Hopefully this means I'll think up a clever way to do it on the way home tonight. Otherwise I'll have to actually clean it up and live with it. And that doesn't count the weak symbol definitions I threw into the bottom of the header file to make it work out when an app doesn't define all the entry points.
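For context, here's roughly how an app uses them (a hedged example with hypothetical function names and elided bodies): each macro forward-declares the function, registers it through the corresponding _MR*Fn pointer, and leaves a definition whose body follows immediately.

// Hedged usage example; 'histogramMap' and 'histogramReduce' are hypothetical.
MAPFN(histogramMap, input)
{
    // ... bucket the input element and emit intermediate (key, count) pairs ...
    (void)input;
}

REDUCEFN(histogramReduce, key, values, num)
{
    // ... fold the 'num' partial counts for 'key' into a final total ...
    (void)key; (void)values; (void)num;
}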
Ian had some good comments when we talked more after dinner and I've even post facto modified my slides a little in response. Mendel asked me if GRAMPS supported task parallelism and I realized that it fits really well. Task parallelism is just an execution graph where what's queued is activation records for sub-tasks rather than actual data. Similarly, data parallelism is just an execution graph where all that flows through each queue is a single NO_MORE/ALL_DONE as kernels complete. Pat liked my characterization of task-parallel as Divide and data-parallel as Conquer (which, by the way, I claim is truly accurate, not just a snappy rhetorical device).
I think we explicitly agreed that from the CUDA/source code perspective there was no way to tell whether the compiler or the hardware implemented the vector masks to deliver a scalar environment, but he also emphatically stated (1) that the two were fundamentally different and (2) that they were disinclined to tell us why the two were different or offer any insights into choosing one design over the other. He stated that any further how or why details got into NVIDIA 'secret sauce' and it made no business sense to reveal anything further since it would help their competitors. I found that very surprising-- how do you persuade people of the merits of an architecture without complementing 'what' with 'why'? Well, perhaps not pure end users, but surely a community of designers and researchers examining a variety of choices not only to use, but to which to contribute, for which to build, etc.
My experience and instinct argue there should be a qualitatively meaningful middle ground for providing useful high level information without giving away competitive advantage. Is there truly a trade secret that is so simple, so central, and yet so unpatentable that protecting it prevents offering any insights? I wouldn't have expected that, and it makes it very awkward to satisfy my honest and good faith intellectual curiosity about design choices and the corollary lessons to be learned.
(A note: SIMT does offer the potential for an implementation that can be discerned from threaded vector. There is no inherent limitation that only one program counter within a warp execute at once. As Kurt pointed out last year, one can easily conceive of a SIMT implementation that can dispatch multiple different instructions per cycle in the event of a divergent warp. It remains an interesting question whether such an implementation immediately trends towards as many independent instructions as threads in a warp or whether a warp still reflects an interesting execution coherence grouping.)
11 November 2008:
Now, obviously one could implement exactly the same thing with a parallel for() loop (use the loop index as the thread ID and paste in the same kernel body). I also think the resource footprint and practical communication constraints imposed by current GPU shader core architectures are a good match for the expectations embodied by the conversational shorthand, "basic parallel for() loop". However, I understand how an argument that CUDA is more than that can be defended on marginally technical grounds by a particularly dogmatic partisan.
Note that despite that, I think CUDA as a programming model has actually demonstrated itself a valuable contribution to commodity parallel programming. Despite my teasing and longstanding dismissiveness, I sincerely hope Intel implements a "scalarizing shader/kernel compiler" that runs CUDA-like kernels on Larrabee and NVIDIA, Intel, or someone makes a run-time available for out-of-order x86 cores, too. For the class of apps that it admits, it offers a relatively simple abstraction to developers that admits highly parallel and efficient implementations on widely available hardware. Deep conceptual work or not, that sounds significant to me.
Upon consideration, this seems to reflect two different divergences from my world view: operation on in-place/persistent data and the ephemerality of tasks. The former is easily shown with an example from cloth simulation. Such simulations (or at least, many such simulations) are based on a preset point-spring system. The first kernel of each time step visits all the points, performs local force computations, and computes each point's proposed new position and velocity. In that sense, it's like a vertex or fragment shader. However, unlike transformed vertices or shaded fragments, the grid of points persists statically across every time step. Transformed vertices are queued and discarded after rasterization, but "transformed" cloth points are the input in the next pass. Additionally, they're on a preset, pre-allocated grid that is not even repacked, let alone variable length. Thus, it makes vastly more sense to assume that a task, or set of tasks, will operate in-place on the grid rather than consume it and queue output downstream. In contrast, when a kernel (or set of tasks) generates a variable number of outputs or transient intermediate results, dynamic queueing handles allocation, and scheduling the producing and consuming tasks co-residently saves a huge amount of bandwidth and storage by not spilling all the intermediate data and recirculating it. Given the longstanding emphasis on iterative simulations and solvers, and the relative inexpense of spilling to memory in past networked clusters compared to current chip multi-processors, I can see how co-resident producer-consumer parallelism has been overlooked.
The other divergence I see is that other parallel systems-- task or data-parallel-- assume tasks are ephemeral. Specifically, they assume that preemption is never necessary: with only minor/short-lived disturbances, load balancing can be achieved by waiting for tasks to finish and appropriately placing new ones. This has a major consequence: no task-local statefulness. That's an abstract mouthful that deserves an example. Consider a kernel, task, or set of collaborating tasks that want to perform an operation that spans multiple input elements: e.g., computing the median of all inputs, sorting fragments in depth order and resolving transparency, or even just caching the last N inputs or intermediate results for some purpose. There are two straightforward choices for scheduling such functions: defer launching them until the system is certain all of their input is ready, or launch as soon as the first input arrives but deschedule/preempt when more input is temporarily unavailable. Only the former is compatible with ephemeral tasks, but it requires potentially huge bandwidth and storage to spill and recirculate (much like the variable output problem above).
There's a third choice that advocates of task systems (or users who happen to be stuck with them) suggest: fake the latter by blocking the input into chunks and running a series of sub-tasks that compute a partial reduction and spill enough internal state for the next task to pick up and continue. There is no shortage of reasons to complain about this "solution", mostly the standard event/continuation warts: sub-task management and bookkeeping ends up dumped on the application writer, the internal/continuation state needs to be synchronized with some ad hoc locks, barriers, etc. I prefer a simpler objection, though: preemption of threads is not a difficult problem and is one long since solved (it is even routinely built into processor cores these days). Why inflict ugly program transformations on the programmer to fake a semantic that is so easily implemented?
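To make the objection concrete, here's a hedged sketch of that chunked workaround (all names hypothetical, no particular task API): each ephemeral sub-task folds in one chunk and spills its running state for a successor to resume, which is exactly the bookkeeping that lands on the application writer.

// Hedged sketch of the continuation-style workaround: a reduction over a long
// stream faked as a chain of ephemeral sub-tasks. Names are hypothetical.
struct PartialState {
    double accum;      // running reduction value (a sum, for brevity)
    size_t consumed;   // how far into the stream we've gotten
};

// One sub-task: fold a single chunk into the carried state. In a real task
// system the caller would then enqueue a successor task bound to the returned
// state (and synchronize access to it), rather than simply looping.
static PartialState ReduceChunk(PartialState carried, const float *chunk, size_t n) {
    for (size_t i = 0; i < n; i++) {
        carried.accum += chunk[i];
    }
    carried.consumed += n;
    return carried;    // the spilled internal state the next sub-task resumes from
}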
I think both of these "weird" (to me) expectations reflect a history emphasizing networked clusters or parallel supercomputers that were rich (at the time) in per-node resources. And likely also a heritage of running offline applications. In both cases, the relative costs of spilling even megabytes of intermediate buffering against seconds (or minutes or hours) of kernel computation time are acceptable. At real-time and interactive rates, there is just no such slack: GPU designers tell me they have been off-chip bandwidth limited for multiple generations and they already go to extreme lengths never to spill intermediate data from on-chip FIFOs. I suspect that at real-time rates, the ephemerality assumption for load-balancing can also become questionable. It clearly holds for shader-like tiny thread instances, but I remain skeptical about launching oodles of short-lived tasks rather than a few longer lived tasks that dynamically produce and consume on queues. I realize the two models are duals and it is possible the compiler and programming language folks can transform between them effectively, but it seems more likely to map the latter onto actual hardware with reasonable overheads. I am also unsure what the most comprehensible task-dual is to specifying queue/buffer lengths (obviously a limit on task spawn/fan-out, but since tasks cannot just block if they try to spawn while at the limit, it seems potentially ugly).
One concluding note: one can get a lot of mileage out of dependency chains of ephemeral tasks which only communicate at entry and exit. That is, these models evolved for good reasons that still hold. For that matter, CUDA is an even more constrained model-- there is not (yet) an explicit dependency mechanism and the tasks are confined to trivial data-parallel kernels-- and it is widely gaining adoption by people who find it fits pieces of their workload (or can be shoehorned into roughly fitting and the performance plus simplicity trade-offs appeal). I have even found that GRAMPS needs an analog of the dependency/barrier mechanism for the cases where it makes sense to operate in place, or where a reduction needs to know when it has seen all relevant input. However, I am surprised by the apparent blind-spot to recognizing the locality advantages of producer-consumer parallelism between co-resident threads. It seems like a straightforward extension to task parallelism, whereas trying to infer it automatically and reconstruct it via task-fusion and program transformation seems very difficult.
Any programming model brings generality (if it didn't, it would be an application, not a programming model). And, generality is directly opposed to the programming model/run-time/hardware being able to optimize automatically. Instead, I would claim that any programming model that has been successful/popular has included explicit mechanisms for its power users (i.e., savvy developers) to tune, guide, and/or optimize. Even programming models that emphasize ease over performance-- e.g., many scripting languages and probably all managed code environments-- develop extensive practices and features for enhancing performance. These show up consistently as the same thing: guides for the programming model to exploit its own performance features better. Map-Reduce implementations tend to let power users control the number of map and reduce tasks and the partitioning in case the best effort black box strategy for load balancing can be improved. The many-core versions expose a knob for appropriate batch sizes so that users can tune based on the (non-input) working set sizes of their map functions. Similarly, GRAMPS lets users specify maximum queue sizes as explicit load balancing hints to the scheduler. CUDA makes explicit the size of thread group launches because it's a correctness requirement for some use cases of the shared cache, but it's also an optimization parameter for balancing the number of groups versus group size in many others.
A completely separate observation: GRAMPS is not alone in leaving it a separate problem to write good kernels. All of the parallel programming models about which I'm reading are focused on structuring the skeleton of the application for performance and abdicating a discussion or ownership of optimizing kernels (whether they're called kernels, shaders, mapper/reducers, tasks, etc.). This is also entirely natural: when the focus is on obtaining speedup through parallelism, obtaining speedup in the various sequential instances is an orthogonal problem. The sequential bits are also common ground, richly mined by the existing work in the context of serial execution. Natural or not, it is nice to see it confirmed de facto (none state it outright) by other work in the area.
Update (4 Dec 2008): We've received confirmation that the Industrial Practice track dissolved and our submission is a regular part of the conference. Poof.
The situation's a little different for the GPU-like configuration. Its scheduler is already synchronous, which amounts to queuing less work, so it isn't very susceptible to the head-of-line blocking scenario above. That said, it develops a pipeline bubble I can't currently explain with confidence. Here is the queue and core occupancy for a 1024x1024 rendering of the courtyard scene. I don't fully understand the bubble that develops about 60% of the way through in conjunction with the late spike in RO work. That's downstream of the critical weirdness, though. There are occasional short bursts of vertex work surrounded by fragment and blend work, such as the one where the cursor is in the image. Given the priorities, that vertex work should have been scheduled as soon as it appeared, but I can't explain why it only appeared then. Even if it were backlogged on a pending scoreboard, each thread has run both blend and fragment work, and those should have been pre-empted. Order is always the whipping boy for weird things, but I haven't pieced together the full narrative yet.
It's worth noting that the lazy scheduler has exactly the footprint trade-off against the eager scheduler that one would predict. Here is the same occupancy picture for the eager scheduler. It goes through far more vertex-fragment-blend cycles and has pipeline bubbles from exhausting all fragment work while not being able to fill the machine with vertex work alone. The thing is, lazy does not have significantly better thread slot occupancy and (unsurprisingly) it has a much higher queue footprint. The lazy scheduler gets 95.3% occupancy with a 5.5 meg worst case queue footprint. The eager scheduler gets 95.0% -- almost the same -- but has a 1.3 meg footprint. Clearly I need to understand the bubble in the lazy scheduler. It may also be that eager will be unfairly successful so long as scheduling is 'free'.
Interesting. Apparently, at least some versions of elinks can use SpiderMonkey to do limited Javascript. It's too bad elinks is so much clunkier looking than lynx (in fairness, it tries to replicate a full-fledged renderer when that's actually counter to my interests).
Apparently, it worked: Pat declared at the end that when I came in as a systems person he had wondered if I'd ever be a graphics person, but that I'd just demonstrated I was legitimately a graphics person. I wonder if this means I should stop telling people who ask about my Ph.D. program that I'm pretending to be a graphics person.
Oh, and printing from Vista sucks.
The G80 with its 128 scalar arithmetic units running at 1.3 GHz should deliver over 160 GFlops, meaning that _tracing one ray costs about 10,000 cycles_ [!! emphasis added]. We suspect the main bottleneck to be the large number of registers in the compiled code, which limits the occupancy of the GPU to less than 33%. Unfortunately, although the program requires much less registers, the CUDA compiler is not yet mature enough and cannot aid in reducing their count. An option would be to rewrite the whole CUDA code in PTX intermediate assembly.
The unfortunate CUDA compiler is an old familiar whipping boy of mine. I wonder if they've figured out yet that the intermediate assembly doesn't help with register allocation (it's entirely done by the backend compilation to the final form) or if NVIDIA has relented on that stance. I resorted to using the shared cache as an excess register file and manually spilling my ray data there while computing intersections. That saved a huge number of registers without limiting parallelism (since I was still slightly register bound, not shared space bound). And of course, I have to wonder what happens when they discover that no amount of parallelism can hide the divergent execution penalty within a SIMD group, warp, or whatever they're called these days.
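For the curious, the manual spill was roughly the following (a from-memory sketch, not the actual kernel; the Ray struct, BLOCK size, and the ray-plane stand-in for the real triangle test are all illustrative): park each thread's ray in shared memory for the duration of the intersection loop so its registers can be reused for the intersection math. Whether the compiler really keeps the ray out of registers depends on it not hoisting the shared loads, so take this as the idea rather than a guarantee.

    // Sketch of using shared memory as an overflow register file for per-thread
    // ray data during intersection. All names and sizes here are illustrative.
    struct Ray { float ox, oy, oz, dx, dy, dz; };

    #define BLOCK 64   // threads per block; sizes the shared "spill" area

    __global__ void intersect(const Ray* rays, int nrays,
                              const float4* planes, int nplanes, float* out_t) {
        __shared__ Ray spilled[BLOCK];            // one slot per thread
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= nrays) return;
        spilled[threadIdx.x] = rays[tid];         // ray lives here, not in registers

        float best = 1e30f;
        for (int i = 0; i < nplanes; ++i) {
            // Re-read the ray from shared memory; registers only hold transients.
            const Ray& r = spilled[threadIdx.x];
            float4 pl = planes[i];                // plane n.p = w, standing in for
                                                  // the real triangle test
            float denom = r.dx * pl.x + r.dy * pl.y + r.dz * pl.z;
            if (denom == 0.0f) continue;
            float t = (pl.w - (r.ox * pl.x + r.oy * pl.y + r.oz * pl.z)) / denom;
            if (t > 0.0f && t < best) best = t;
        }
        out_t[tid] = best;
    }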
The step I call "rasterization" has a variety of other names in the REYES and RenderMan literature, but it is entirely rasterization! A collection of surface patches in object space are transformed to image space and sampled according to pixel (and subpixel) locations. OpenGL and Direct3D call the results fragments while RenderMan calls the results visible points, but they are the same results of the same process. Credit to Kurt for hammering home that not only is this "just like rasterization", it simply is rasterization. Note that the REYES rasterizer is differently complex than the GL/D3D rasterizer. The REYES rasterizer is simpler because all it ever gets are shading grids (effectively quad strips) and they're very regular. The REYES rasterizer is more complicated because it supports interpolated primitives, i.e., grids that have an associated parameterized location whose parameter varies among different samples. This is REYES' "stochastic sampling", but in practice it seems to me to be specifically interpolation rather than a more general scheme. It's also absolutely fundamental to REYES, as opposed to a rare corner case that can be marginalized or ignored.
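To pin down what I mean by interpolation, here is a toy sketch (my own, not lifted from any REYES implementation; MicroQuad, covers(), and the shutter-open/close representation are all made up for illustration): each micropolygon carries image-space positions at shutter open and shutter close, each subpixel sample carries its own time, and the inside test runs against the quad interpolated to that sample's time.

    // Toy sketch of REYES-style "stochastic" (really, interpolated) sampling:
    // a micropolygon quad stored at shutter open and close, tested against a
    // subpixel sample that carries its own time. Names are illustrative.
    #include <cstdio>

    struct Vec2 { float x, y; };

    static Vec2 lerp(Vec2 a, Vec2 b, float t) {
        return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y) };
    }

    // Signed area: positive when p is to the left of edge a->b (CCW winding).
    static float edge(Vec2 a, Vec2 b, Vec2 p) {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    struct MicroQuad {
        Vec2 open[4];    // image-space corners at shutter open
        Vec2 close[4];   // image-space corners at shutter close
    };

    // Does the sample (position + time) fall inside the quad at that time?
    static bool covers(const MicroQuad& q, Vec2 sample, float time) {
        Vec2 v[4];
        for (int i = 0; i < 4; ++i) v[i] = lerp(q.open[i], q.close[i], time);
        bool inside = true;
        for (int i = 0; i < 4; ++i)
            inside = inside && edge(v[i], v[(i + 1) % 4], sample) >= 0.0f;
        return inside;
    }

    int main() {
        // A unit quad that moves two pixels to the right over the shutter.
        MicroQuad q = {{{0,0},{1,0},{1,1},{0,1}}, {{2,0},{3,0},{3,1},{2,1}}};
        Vec2 s = {2.5f, 0.5f};
        std::printf("t=0.0: %d  t=1.0: %d\n", covers(q, s, 0.0f), covers(q, s, 1.0f));
        return 0;
    }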
Lpics strikes me as weird because they took this framework and inserted themselves logically in the middle. They split out just the lighting portion of shading and converted it to run in image space instead of object space. Then they turned off transparency so that visible points corresponded one-to-one with pixel locations. Then they turned off antialiasing so that REYES' surface shading corresponded one-to-one with a GPU fragment's image space shading. Within this framework, they took their modified subset of shading, transformed all the inputs to be in pixel space instead of object space, and ran them at the end on the GPU. I agree that it's clever and effective, but it does seem a little contorted. I suppose GPUs of that era basically mandated contorted in order to accomplish anything besides GL/D3D. The unfortunate and unexplained thing is the decision to run the GPU shaders in pixel space instead of object space. That inherently denies them the visibility supersampling that's so central to REYES. It doesn't seem like it would have been that hard to literally interpose in the middle instead of just logically. On DirectX 10 era GPUs, I'd actually take things a step further and cut the whole bottom of the REYES pipeline over to the GPU...
It sounds like Lightspeed chops things up more like I would envision. However, when Jonathan presented it here last school year, I didn't have the context to understand it well enough for my retention to be any good, so I only have offhand remarks from Kayvon and Jonathan to back that up.
It's an interesting perspective that matched a lot of my analysis while still having some divergence. I definitely believe and agree that the post-dicing REYES model is roughly equivalent to GL vertex shading and then rasterization without (or with trivial) fragment shading. While I agree that technically the GL evaluators resemble REYES dicing, I think the fact that they've essentially remained vestigial pieces of GL, ignored by the hardware and applications, makes the congruence weak. The REYES rasterizer is also a different creature from the GL rasterizer. The REYES "rasterizer" knows that its primitives are only quads whose area is on the order of a single pixel. However, it also has to support "stochastic sampling", which is really just a fancy way of saying it has to additionally sample in time or lens position for motion blur or depth of field. Arguably the biggest difference, though, is that the REYES rasterizer delivers order-independent transparency. It cannot commit any samples to a pixel until all samples are available and sorted in depth order (or at least all non-opaque samples). The driving motivation for the bucketing scheme in REYES / PRMan was to limit the duration over which samples for a pixel needed to be kept around uncommitted.
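To make the transparency point concrete, here is a toy sketch (mine, not PRMan's; SampleLocation, Hit, and resolve() are invented for illustration) of what "cannot commit until the bucket is done" means at one sample location: hits pile up unsorted, and only when the bucket finishes do they get depth-sorted and composited front to back.

    // Toy sketch of committing a pixel sample only once its bucket is finished:
    // hits accumulate unsorted, then get depth-sorted and composited with "over".
    #include <algorithm>
    #include <vector>
    #include <cstdio>

    struct Hit { float z; float r, g, b, a; };    // one transparent surface sample

    struct SampleLocation {
        std::vector<Hit> hits;                    // must be retained until bucket end

        void add(const Hit& h) { hits.push_back(h); }

        // Called only when the bucket is done: sort by depth, composite front to back.
        void resolve(float out[4]) {
            std::sort(hits.begin(), hits.end(),
                      [](const Hit& x, const Hit& y) { return x.z < y.z; });
            float r = 0, g = 0, b = 0, a = 0;
            for (const Hit& h : hits) {           // front-to-back "over"
                float w = (1.0f - a) * h.a;
                r += w * h.r; g += w * h.g; b += w * h.b; a += w;
                if (a >= 1.0f) break;             // fully opaque: later hits can't show
            }
            out[0] = r; out[1] = g; out[2] = b; out[3] = a;
        }
    };

    int main() {
        SampleLocation s;
        s.add({5.0f, 0, 0, 1, 0.5f});             // blue, farther
        s.add({1.0f, 1, 0, 0, 0.5f});             // red, nearer -- arrives later
        float rgba[4];
        s.resolve(rgba);
        std::printf("%.2f %.2f %.2f %.2f\n", rgba[0], rgba[1], rgba[2], rgba[3]);
        return 0;
    }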
Anyhow, after the Gates kick-off party and a ray tracing interlude with Solomon, Jonathan showed up. He's coyly extracting free plane tickets in exchange for enduring a job interview or two (secretly he thrives on the wheeling and dealing). He wanted to hear what we're up to, but Kayvon's running off to Seattle for the weekend and Mike's involved in delicate political wrangling over trade press articles on GPU programming (sucker). Since it's been the topic of the week, I ran through assemble threads, shader threads, and the whole fan-in / fan-out / vertex caching / data sharing situation quickly and then sprung "GPU REYES" on him, too.
We agreed that the two interesting questions are: is there a picture REYES can make that GL cannot (and vice versa, but restricting to photo-like pictures), and, in areas where the two are technically equivalent, what idioms / abilities are more naturally associated with REYES vs. GL? At the heart of both is an attempt to understand why, from a technical perspective, people want realtime REYES (because some definitely do), and what it is that they actually want. Clearly there's the fantasy of wanting PRMan renders at GPU rates, but just as clearly that's fantasy. Something we think is open and not fantasy is what implementation of REYES / subset of PRMan is practical for realtime / multicore / "GPU-like" execution if we successfully produce an architecture that generalizes GPUs enough to accommodate other alternatives.
At least I have a reading committee now. If I'd realized I was going to do it without a specific extended thesis proposal, I'd have submitted the form back in February. Ah well, at least my thinking is more organized now.