Anyhow, I've decided that key points to express (and there are so many, all mutual forward references, which is why picking the ones with which to open is such a tangle) include:
Superficially this seems like an unlikely article of faith to add to work queues, but the two are actually very intertwined. Producer / consumer queueing offers a very convenient way for the two styles of cores to transfer work back and forth at a fine granularity while preserving enough context to group and dispatch it efficiently. The queues are the very mechanism that breaks the batch-oriented, feed-forward pipeline into a multi-directional state machine, and one aspect of that change allows work to move between states serviced on compute cores and states serviced on conventional cores.
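To make that concrete, here's a toy sketch (not any particular system; every name in it is made up for illustration) of work items carrying a state tag, with one producer/consumer queue per state so that whichever flavour of core is best at a given state can drain that queue and hand the item on:

typedef enum { GEN_RAYS, TRAVERSE, INTERSECT, SHADE, DONE } Stage;

typedef struct WorkItem {
    Stage stage;                  /* which state of the machine this batch is in */
    void *payload;                /* rays, hits, shading context, ...            */
    struct WorkItem *next;
} WorkItem;

typedef struct { WorkItem *head, *tail; /* plus whatever locking you like */ } Queue;

Queue queues[DONE];               /* one producer/consumer queue per stage       */

void enqueue(Queue *q, WorkItem *w);   /* left undefined in this sketch          */

/* A compute core drains the stages it is good at (say TRAVERSE and INTERSECT)
 * and re-enqueues each item for whatever state comes next -- possibly one
 * serviced by a conventional core -- rather than pushing it forward through a
 * fixed feed-forward pipeline. */
void advance(WorkItem *item, Stage next)
{
    item->stage = next;
    if (next != DONE)
        enqueue(&queues[next], item);
}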
Commodity operating systems still retain the design principles developed when processor cycles were scarce and RAM was precious. These out-dated principles have led to performance/functionality trade-offs that are no longer needed or required; I have found that, far from impeding performance, features such as safety, consistency and energy-efficiency can often be added while improving performance over existing systems.

Too bad the talk itself is at 4:15PM on my birthday. Maybe I'll end up going anyways.
Personally, I think the biggest remaining challenge is still to articulate the framing. I really don't like the instinctive pigeon-holing that says we're talking about next generation GPU++ style designs and that everything has to be scrutinized and policed carefully for its negative impact on graphics. I don't care about graphics! That's intentionally a bit of an overstatement, but the fundamental point to me is actually about CPU evolution: the combination of a trend towards compute-dense cores (be they GPU, Niagara, Larrabee, Ageia, or whatever style) that intrinsically offer an order of magnitude compute win over CPUs in the same design budget and the universality of multi-core CPUs means that the "right" future CPU will be heterogeneous-- a number of conventional fat cores (single threaded / ILP performance isn't giving up its mindshare any time soon, whether or not one accepts that it truly matters technically) and a number of compute-maximizing cores.
It's really exactly the same argument as for on-chip FPUs or SSE. There are certainly always other things, even if only a tiny bit of extra caching, that could be eked out with that die area / transistor budget, but they're useful enough for applications that are important enough and the opportunity cost is small enough that they're universal on modern CPUs. I claim the same is on the cusp of being true for FLOP-dense cores and that the ubiquity advantage of making them part of the nominal notion of CPU and the ability to co-opt CPU - compute core coordination, scheduling, and communication into the ISA is huge. That's really basically my claim, along with a bunch of more concrete opportunities to use that architecture to tweak / change the givens of today's CPU plus compute accelerator model and so recover a lot of the flexibility and expressiveness, and spare a lot of the porting pain, that researchers currently give up or incur trying to do anything beyond toy programming of compute cores. Even without those improvements though, I think there's a huge insight (though calling it an insight when it seems so obvious to me feels unjustified) in just recognizing that pulling in a compute core as a first class aspect of the concept of CPU is a huge opportunity. And, at the same time, it's really not a radically different hardware change than continuing to bulk up the SSE units, make them more programmable, and devote a hardware thread to them.
He followed the talk with a much more concrete, NDA-concealed session that included a lot of lively debate and discussion, and nearly all the faculty who have offices on the third floor. Unfortunately, from my perspective, we got bogged down in the nitty gritty microarchitectural details of buses, cache organizations, etc. for a long time and the higher level architectural pieces got short shrift and the OS / applications level pieces were extremely rushed. It also makes it challenging to write up my impressions of the public talk more concretely without worrying about compartmentalizing the private content.
And nobody paid any attention to hybrid, even though they immediately admitted it was the way to go for making a system like this practical. Grr.
Unfortunately, I've also successfully persuaded him (and more importantly myself) that we need a Shader Model 4.0 interpreter as the first stage in our testbed. It's nice to have something concrete to develop, but in many ways I would rather continue to focus on maturing our ideas and writing before jumping into coding.
Reminder-- our off again, on again relationship with Gates 392 is on again this coming Tuesday. Jeff will tell us something that most likely will make us all question whether our jobs and research are morally fulfilling and Kayvon will feed us something to appropriately celebrate the mutual Texan pride he and Jeff share.
Unfortunately, the paper totally derailed the thesis topic pursuit and I was sluggish about recapturing my mindset. Conveniently, perhaps, Kathi decided to encourage me along by putting my registration on hold without warning at the start of December. That was a bit of a surprise, and turned out to be punishment for TAing with Kayvon thus also getting snagged while Mike and Daniel squeaked through. It effectively got my attention though, and Pat's. We had a series of frustrating conversations around topics I'd articulated and what feels like just an impedance mismatch in the way we approach things and understand words like claims, assumptions, hypothesis, etc. Ultimately we not only managed to understand each other, but he managed to talk me into a much more grandiose vision that I'd laid out and dismissed as too big and, more importantly, too much an argument over vision instead of something that could be demonstrated. It even converged nicely with Kayvon's new interests, so we've gotten the band back together, so to speak. Now I just need time to think and time to work. Hopefully that will be easier once the admissions committee work is done in a month. And we revise the ray tracing paper for its camera-ready deadline in two weeks.
9 June 2006:
Therefore, I've organized my thinking around potential thesis titles and accompanying premises. Pat and I spent a while yesterday discussing (arguing about) our perspectives and I have four to four and a half quite plausible topics (the half isn't quite so fully baked in that it's based on shading work into which I haven't yet dug deeply enough to be confident there's an answer, plus I go back and forth on how much I actually care personally about shading) and two titles that expressly are not worth pursuing (both overlap with my 'good' titles in the critical places and are ill-posed in other aspects).
The good news is four and a half valid thesis topics is plenty. The bad news is that they cover a wide scope and are sufficiently different that focusing on one means putting aside projects I really want to see completed. The other bad news is that, naturally, the most deep / visionary topics are also the ones least liable to actually be useful and the ones involving the most hand-waving and simplifications. By and large, that can be what one expects from an academic thesis, but a piece of me rebels and argues that vision and hand-waving are critical to have framing the choice of projects and topics, but that the projects and topics themselves should stand on their own as valuable incremental contributions even if the vision never comes to pass. This is tied to my conviction that it's always better to hire someone who can get his hands dirty and ship code than someone full of vision who always hand-waves the implementation details and thus produces projects that never actually deliver or work quite right. I wouldn't trust a bridge builder who pitched an amazing span but admitted he had no idea how to come up with materials that would actually tolerate the induced strain. You can argue that the vision will prompt lesser minds to solve the engineering details, but I don't buy it. I 'invent' an unlimited number of things that just aren't possible and if someone else later comes along and makes them possible, he deserves the credit, not me. I argue the same applies to at least my own thesis or I'll never be satisfied. This is also why I came to believe academia is a bad place for me, though I may be good for it in the same way that eating one's brussels sprouts is good for one.
26 May 2006:
I'd like to say that all of those rumours are massively overblown. They're spun around kernels of reality, but in many ways developing for Cell is just like developing for a chip multiprocessor (or any other SMP). If you don't want to involve the PPE core in your compute kernels (and you certainly don't need to) then there's write-once support code to spin up the 8 SPEs (or however many you want) and launch your apps. You write it once based on either the sample code or the tutorial and never look back. Occasionally, if your workload needs it, you add some very simple message passing for SPEs to signal the PPE when they need to be sent more work and the PPE to respond when more work is sent. Anyone who has ever written a work queue or used a socket for signalling can do this three days dead. It's no different for Cell than via Win32 events or BSD sockets. The APIs just have their own function names. Hooray.
Okay, so now you've gotten your SPE up and running. To be precise, just like starting a thread on any other OS, you've issued a library / system call that took a chunk of code (the ELF binary compiled for the SPE), an entry point (thread main), a void * with initial arguments, and some unimportant optional flags. What about the SPE code? You take gcc and hand it vanilla C code. Or, you take xlc and hand it vanilla C code if you think xlc is more elite than gcc (we don't. Other people around here do. It seems to vary according to personal taste and application). Okay, so it's not quite that easy. You can transparently use stack allocated memory and static / global arrays or objects that are small enough to fit in local store. You cannot transparently malloc huge chunks of memory or dereference pointers to large, system memory backed regions. However, you can mechanically convert every dereference of your big data structures into synchronous DMA and you'll come out with working code. If you're writing from scratch or have anything resembling an accessor function, this is near trivial. The DMA builtins actually take system memory pointers as their argument without any translation or anything. We hit this point with the ray tracer after a few days' messing around. And a major portion of that time was browsing reference material and one-time cobbling together of a Makefile with the correct include paths and what-not to run the cross-compile toolchain.
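For instance, the mechanical "turn a dereference into a synchronous DMA" step looks roughly like the sketch below. This assumes the spu_mfcio.h intrinsics (mfc_get and the tag-status calls); the Triangle layout and the system-memory address are placeholders, not our actual code.

#include <spu_mfcio.h>

typedef struct { float v0[4], v1[4], v2[4]; } Triangle;   /* hypothetical layout */

static Triangle ls_tri __attribute__((aligned(16)));

/* What used to be "tri = triangles[i]" becomes a blocking fetch into local store. */
Triangle *fetch_triangle(unsigned long long triangles_ea, int i)
{
    unsigned int tag = 0;
    mfc_get(&ls_tri,
            triangles_ea + (unsigned long long)i * sizeof(Triangle),
            sizeof(Triangle), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();       /* block until the transfer completes */
    return &ls_tri;
}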
That's it. Porting in a nutshell (I know, why is it in a nutshell and how do you get it out? Don't ask). But, but, but, you splutter, what about having to restructure your whole application? What about fitting your code + stack + data in 256KB? You don't have to restructure anything. You always can run on the PPE. If you want performance from the SPEs, you will have to multithread at least part of your application into at least 8 threads (or however many SPEs you want to use). However, if you want performance with any chip including current conventional CPUs, you have to multithread the computationally dense portion of your application. Multithreading for Cell isn't intrinsically different from multithreading for a "normal" CPU. So, while I'm happy to grant that having to multithread your code to get performance can be a pain, it's no special barrier unique to Cell. Now, as for fitting in 256KB, well, that's a ton of space. I'm clearly a child of the wrong era, as is about to become obvious, but once all your application data is excluded (it lives in system memory, not local store), 256KB is great! With four byte instructions (I have no idea what Cell's instruction width is, but 4 bytes is a fine proxy), splitting LS half and half instruction and data gets you 32K static instructions which is huge for a computational kernel. Moreover, dynamic linking just plain isn't hard and it naturally combines with overlays to allow arbitrarily large code executed in a fixed amount of space. That's too hard for you? Luckily only one guy needs to write it for Cell and we can all use it. However, that's really more for the future. The whole raycasting portion of our code compiles down to 60KB or so. Similarly, the stack size limitation just doesn't seem interesting to me. I've probably spent too long writing kernel code and other specialized code, but if you need more than 4KB, or maybe a whole 16KB, of stack space then you're not writing for performance. And if you don't want performance, get that code back on the PPE where it won't bother us. All told, in a fairly pessimistic scenario, you're left with 128KB or 96KB of space for data. If you're just replacing pointer dereferences with DMA that's tons of space. Actually, there's 2KB of register file (128 x 4-word wide) and that's probably enough.
What's that? Replacing pointer dereferences with DMA is unusably slow? So it is. If you grant my conservative assessment that there's 96KB of LS available for data then just use it as a simple cache. It takes a day or so to code a simple direct mapped cache (remember, we're discussing how hard it is to get code up and running on Cell. By this point, the naive code was working in the last paragraph and we're just looking for any cheap extra boost) and it's a tiny amount of extra instructions. That 96KB of cache is 6x the L1 on a Pentium 4 (and you've got 2KB of registers where the Pentium 4 has to use its L1 to compensate for its tiny handful of registers). So, in exchange for another day of porting (we're almost up to 1 week for 1 grad student who's easily distracted and a little lazy) not only have you ported your app, you've smoothed out the most unreasonable shortcut you used to get the port working.
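For the curious, the software cache really is about a day's worth of code. A rough sketch of the direct-mapped version, again assuming the synchronous mfc_get style above; the line size, line count, and layout here are made up for illustration:

#include <spu_mfcio.h>

#define LINE_BYTES 128
#define NUM_LINES  512                        /* 512 * 128B = 64KB of LS for data */

typedef struct {
    unsigned long long tag;                   /* system-memory address of the line */
    int valid;
    char data[LINE_BYTES] __attribute__((aligned(128)));
} CacheLine;

static CacheLine cache[NUM_LINES];

void *cached_read(unsigned long long ea)
{
    unsigned long long line_ea = ea & ~(unsigned long long)(LINE_BYTES - 1);
    int idx = (int)((line_ea / LINE_BYTES) % NUM_LINES);
    CacheLine *line = &cache[idx];

    if (!line->valid || line->tag != line_ea) {
        mfc_get(line->data, line_ea, LINE_BYTES, 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();            /* miss: blocking fill */
        line->tag = line_ea;
        line->valid = 1;
    }
    return line->data + (ea - line_ea);
}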
If your application doesn't benefit from caching then the bad news is implementing the simple cache won't help you on Cell. However, the worse news is that your execution time on normal CPUs is already as bad as the synchronous DMA version on Cell. Or else, you have some a priori knowledge of your algorithm that lets you prefetch or do some other contortions on normal CPUs to get performance up. In that case, there's some good news-- there are no games on the SPE-- you just tell the DMA controller what you want it to prefetch and it goes and does it. No funny "we may or may not honour the hint" prefetch or non-temporal write instructions and no irritating hardware memory hierarchy working to thwart you. Programmer say, Cell do. Seriously, if you are lucky enough to have one of the workloads whose access pattern is structured then Cell is just great. Rather than hinting (or tricking) a CPU into doing the right thing to get bandwidth to memory, you lay it out explicitly for the DMA controller and it happens.
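The "lay it out explicitly" version is the usual double-buffered DMA loop: start fetching chunk i+1 while working on chunk i. A hedged sketch, again on top of the spu_mfcio.h calls; CHUNK and do_work() are placeholders:

#include <spu_mfcio.h>

#define CHUNK 4096

extern void do_work(void *chunk, int bytes);        /* hypothetical per-chunk work */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process_all(unsigned long long ea, int nchunks)
{
    int i, cur = 0;

    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* kick off the first transfer */
    for (i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                        /* prefetch chunk i+1 ...       */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);               /* ... while waiting on chunk i */
        mfc_read_tag_status_all();
        do_work(buf[cur], CHUNK);
        cur = next;
    }
}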
Anyhow, bottom line, we had a working port of the ray tracer in a couple of days and a reasonable starting point to begin analysis and optimization within about a week. The development environment / tool chain is fine (it's not *awesome*, but it's gcc). And the code is pretty much normal multithreaded C code with some funny Cell specific calls instead of pthread or Win32 calls for the scaffolding. So stop the fear mongering. Thanks.
22 May 2006:
16 May 2006:
I ran a pre-checkin round of performance tests (with lazy disabled, but present in the code) and confirmed that the builder didn't get any slower (except for the cbox which is trivial enough that the overhead of the initial lazy root shows up as a fraction of a millisecond) and the ray casting times are actually faster (because of short circuiting on empty leaves). More rigorous numbers when I have more time.
Oh, that's reassuring. For a few minutes I thought I'd made the code slower somehow. Then I realized I wasn't testing 1024x1024 images and 256x256 images don't get as many rays per second (presumably from high packet inefficiency / reduced packet coherence). Whew.
They assumed that no matter what, the programmer has to take the hit of rewriting his monolithic code to be multi-core aware, but then tried to qualitatively assess the programmability of various systems with "general purpose multicore" at the maximum of ease. It certainly has seemed to me like once you suck up the hit of multi-threading your program, porting it to Cell's pretty straightforward. Other than the lack of an i-cache, the constrained LS is no big deal. Our software managed cache for the raytracer is both low overhead and extremely effective. There were also some weird claims about the available GFLOPS of various systems and nothing was consistently normalized either to per-core or per-chip. Ah well.
The most awkward part is that I'm still not sure what concretely they're proposing. Certain minor memory system changes are clear, but they've already demonstrated that they only get about a 20% boost from that, which isn't nearly enough to be competitive. There's some implicit assumption that general purpose CPU vendors will tack on lots of extra FPUs anyhow, so it's just a question of getting those fed. That's certainly not consistent with what I've seen to date (except in special purpose CPUs). And I don't know the cost. Since everything was per-core it's hard to get a sense of how many SPE-like things versus how many symmetric cores I could get on a given chip. Jayanth suggested a factor of 2 or 3 difference which makes me question the worth of the balance. I don't see what I get in exchange for having half or a third as many cores other than a branch predictor and a hardware cache. The branch predictor's not huge (and adding simple branch prediction to an SPE would be a trivial feature) and my software managed cache both has much more control (good for sophisticated programmers) and minimal performance impact (compared to a theoretical perfect full speed hardware cache the ray tracer runs at 94.5% speed). We're right around or over the point where, clock for clock, each SPE is faster than a Pentium 4, so cutting the number of SPEs by a factor of 2-3 in exchange for Pentium 4 cores is clearly a bad trade for us.
The most baffling part for me was the repeated statement (it wasn't even a claim, everyone seemed to take it as obvious) that Cell is a streaming processor. I don't get it. If a multicore CPU is threaded (or at least not streaming) then Cell is threaded (or at least not streaming). Fundamentally, I can have one SPE decompress my video, another deinterlace it, etc. and I can't do that on any streaming architecture I know. Certainly I can make all the SPEs work together in a streaming fashion, but that's true of any threaded architecture. The models are inclusive in that direction. But not in the other and Cell pretty clearly is threaded in the sense that it has 9 completely isolated and independent threads of execution that are available. Moreover, the memory system (and the DMA controller) allows arbitrary memory access patterns (with sufficient foreknowledge) instead of the a priori access patterns streaming demands. The whole concept of implementing a software managed cache in the heart of the inner loop of the raytracer should make that clear. (Of course, the fact that you can software pipeline among the SPEs should also make that clear). This Cell = streaming meme is very strange.
15 May 2006:
Okay, bizarrely, that turns out to be completely untrue. When I rewrote the code to do the test at the bottom of refineNode() before recursing, all the big scenes got nontrivially slower (tens of milliseconds slower). That doesn't make much sense since there should be exactly the same amount of work per node (other than the root) still and near the bottom of the tree, we should save recursion. I guess the code may have gotten a tiny bit more branchy. Still, that's weird.
Bottom line: The infrastructure for the lazy builder doesn't impair the offline builder.
8 May 2006:
I'd sort of settled into 4 bundles per packet (16 ray packets) by default when experimenting with large packets. Because of awkwardness in my sampler, I'm stuck with power of two sized packets, so I can't explore the space too densely without fixing the sampler. Instead, I ran through the robots, kitchen, and bunny with 2, 4, and 8 bundles per packet. For the kitchen and bunny, 4 was definitely the highest point. For the robots, 8 was about 3% faster. When I tried them at 16 though, performance fell back down to the same level as 4. So, I guess I'll stick with 4 bundle packets for now.
Sad. I just went and tried the current version of our brook kd-tree code on the various boards in the lab and it doesn't work at all. The ATI board just goes into an infinite loop and the NV board produces the same sorts of artifacts as the board I have. I'm so glad I don't care very much about GPUs.
However, even once I correct the timing, the rasterization numbers are pretty good. They're pretty much a direct function of the number of triangles in the scene (no surprise since I shove all the primitives in a vertex array and dump them on the board). I've done some timing with and without GL_LIGHTING enabled. Without it, the cbox manages over 1350 fps down to 95 fps for the kitchen. With lighting, cbox only drops to 1310 fps, but the kitchen drops to 64 fps. Those are all at 1024x1024. Messing around, performance is clearly dependent on resolution, but not in any simple way. 512x512 looks to be a bit of a sweet spot for this particular board.
5 May 2006:
First, build times (release build, single run, no avg, no nothing). Note that 'build time' is the entire time it takes to run the constructor including packing the triangles into primitives, building the tree, and building all the precomputed triangle intersection numbers:
Now, raycasting times for 1024x1024 images, 4 bundle packets:
CPU Shading times are pretty awful, but then that's no surprise. I haven't done careful measurements against even the Brook shading code yet, though.
P.S. If you're curious, my X800XT can rasterize and do direct diffuse lighting at more than 1000 fps (equivalent to casting 1 GRays/s). That's fast enough that I feel obligated to double check the timing code, but it was accurate on the X300.
25 April 2006:
7 March 2006:
Of course, the bad news is that I haven't done any of it. Or seemingly much of anything else. It distinctly feels like the last few weeks have been juggling and treading water. It's already time to synthesize the new things we've read about and the ideas we have in flight and figure out what we're really going to do. I think I should just push forward with a rasterization based system and to heck with trying to fit it nicely into the existing raytracing framework. Oh, and of course, ATI still can't release a driver that works. We finally have one that returns a value other than zero for occlusion queries, but the full blown raytracer generates either impressively wrong results or infinite loops depending on configuration. I love the choice of fast but hopelessly broken versus unusably slow. Maybe someday we'll finally get that Cell and be at the mercy of a new vendor's software system (without even a mature API that can supposedly be used).
9 February 2006:
For reference, here's the reduction in packet intersections with perfect mailboxing. Mailboxing has a bigger relative benefit with packets than scalar intersection because the packets visit a few more nodes. This is with 256x256 scenes.
Given my belief that intersection is roughly half of the total ray casting time, I must be doing something wrong not to be able to capture any speedup on the bunny or robots.
2 February 2006:
It was good to get something down as a starting point (though really that's a distillation of a series of whiteboard "implementations" and discussions Tim and I had) and I've started accumulating additional slides and descriptions to have around for future presentations.
Now I should actually get more organized about writing directed code.
20 January 2006:
19 January 2006:
I do think that focus is beginning to emerge through, or despite, it all. Unfortunately the two outside sources whose feedback I sought are buried under SIGGRAPH submissions. On the upside, hopefully I'll get some interesting papers to read fairly soon. On the downside, no feedback for a while.
Interregnum: In the midst of that last paragraph, local network conditions melted down completely and after an hour or so, I gave up and left. I've finished out my sentence, but hence the abrupt ending. I *heart* NFS.
13 January 2006:
Now, imagine you have an n * m float array on the CPU named buffer[] that you streamRead() into an n x m stream named s<> (created via stream::create(m, n)). You might expect that buffer[ x + y * m ] is the same as s[x][y]. Nope. It is, however, the same as s[float2(x, y)] and the same as s[y][x]. That's right, s[float2(x, y)] == s[y][x] and not s[x][y]. I can't even really blame brook for that because it's really how cg and hlsl operate. I realize there's all sorts of "legacy context" that can be marshalled to defend this. I'm not interested. It remains confusing as hell and each time I go to use 2D gathers or iterators I have to write a little test app to remind myself.
To make matters worse, when using 2D streams as proxies for long 1D streams, the raytracer code sometimes holds the X width fixed at 1024 and varies Y based on total length and sometimes holds Y fixed at 1024 and varies X. Argh. After some quality time with the whiteboard, it seems pretty clear to me that holding X fixed makes the math simpler (regardless of which you hold fixed, the X width determines the 2D <-> 1D conversion so you want it to be a constant). It's also the one that seems more natural to me. Thus, of course, Y is the dimension held fixed for the one set of data beyond my direct control.
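For my own future reference, the convention I'm arguing for is just this (WIDTH is arbitrary; the point is that it's a compile-time constant):

#define WIDTH 1024   /* the fixed X dimension of the 2D proxy */

/* 1D element i lives at 2D address (x, y) ... */
static inline void to2D(int i, int *x, int *y) { *x = i % WIDTH; *y = i / WIDTH; }

/* ... and back. */
static inline int to1D(int x, int y) { return x + y * WIDTH; }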
12 January 2006:
Over break I came across a few interesting links. Intel posted an article on SSE raytracing back in November that I didn't spot until recently. It goes through some background and nearly the complete code to do traversal and intersection via compiler intrinsics. There's also the latest edition of Ray Tracing News with a bunch of material of varying degrees of interest (to me personally, of course).
And, I finally took the time to go through and read this Master's thesis on GPU raytracing with grids, kd-trees, and BVH's. I've known about it for a while and discussed possible reasons for his BVH so dramatically outperforming a kd-tree with Tim, but hadn't read it. After reading it, I'm a bit disappointed. While at a high level, he implemented kd-trees (or at least the same modified kd-tree algorithm we used), his implementation is suboptimal in lots of ways. Probably the two most important are that his tree nodes are a float4 (2x the size they should be and on a workload he demonstrates is memory bound) and that his implementation is a mixture of multipassing and pixel-shader 3.0 branching I found hard to follow. It certainly didn't sound like there was any z-culling, which is definitely going to vastly inflate the runtime and the branching / looping support in NV4x and G70 is pretty awful as I've described before and as GPUbench results demonstrate. The BVH implementation though, was quite clever.
16 December 2005:
15 December 2005:
As before, these numbers are just the time to walk the rays through the kd-tree (and intersect any primitives), not the time to shade, generate the eye rays, or anything else.
13 December 2005:
One odd thing-- the various bits of documentation encourage enabling flush to zero and denorms are zero mode as allowing faster math. However, it appears that having those enabled also opens me up to this input assist business.
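For reference, switching those modes on is just a couple of macros from the intrinsics headers (a sketch; assumes a compiler that ships pmmintrin.h for the DAZ macro):

#include <xmmintrin.h>
#include <pmmintrin.h>

void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* flush-to-zero      */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* denormals-are-zero */
}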
While I was still trying to decipher this, I hacked up the SSE intersection code (and constructor) so that the triangle data was laid out in a more SSE friendly way and I did the reads and shuffles myself instead of hiding them behind _mm_set1_ps(). _mm_set1_ps() actually does a great job if you're reading a single value. When you're reading 12, though, it's better to compress down to 3 128-bit reads and do the shuffles yourself.
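The difference, sketched (the 16-byte-aligned 12-float layout here is a stand-in, not our actual triangle format): _mm_set1_ps() costs a scalar load plus a shuffle per value, while three aligned 128-bit loads let you do the broadcasts yourself.

#include <xmmintrin.h>

/* the "one _mm_set1_ps per value" way: 12 scalar loads + 12 splats */
void broadcast_naive(const float *tri, __m128 out[12])
{
    int i;
    for (i = 0; i < 12; i++)
        out[i] = _mm_set1_ps(tri[i]);
}

/* three 128-bit loads, then splat each lane with an explicit shuffle */
void broadcast_shuffled(const float *tri, __m128 out[12])
{
    int v;
    for (v = 0; v < 3; v++) {
        __m128 q = _mm_load_ps(tri + 4 * v);
        out[4 * v + 0] = _mm_shuffle_ps(q, q, _MM_SHUFFLE(0, 0, 0, 0));
        out[4 * v + 1] = _mm_shuffle_ps(q, q, _MM_SHUFFLE(1, 1, 1, 1));
        out[4 * v + 2] = _mm_shuffle_ps(q, q, _MM_SHUFFLE(2, 2, 2, 2));
        out[4 * v + 3] = _mm_shuffle_ps(q, q, _MM_SHUFFLE(3, 3, 3, 3));
    }
}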
A final bit of bad news-- the input assist "impact" (a random scalar vtune produces to allow you to gauge how bad a problem is) gets a bunch worse when I switch from the cbox to the robots. Sigh.
9 December 2005:
I realized that now that I have per-packet intersection, I could turn mailboxing back on, so I added the code and checked the effects. The stat counters show reductions in intersection calculations, but the performance numbers aren't materially affected. Sad, and a little odd.
So, I augmented the stats (and repaired some counters that fell by the wayside when Tim redid the connection between the builder and the accelerator) out of curiosity. There are a lot of empty leaves in our trees. Most of the scenes look to average roughly two triangles per non-empty leaf, but the robots are way higher. Traversal of all the scenes looks to visit primarily empty leaves, which is also interesting.
8 December 2005:
6 December 2005:
2 December 2005:
This is on my 3.4 GHz Pentium 4 with hyperthreading enabled (but only a single-threaded application). The speedup on the cbox and glassner scenes makes me extra suspicious that the SSE-ification of intersection had such an impact for reasons other than raw compute since even in an ideal case it should have maxed out at making only the intersection portion a little less than 4x faster. It'll be interesting to see what impact smoothing the final edges has (prepacking the triangle intersection data better and propagating SSE packed hits end to end). And of course, there's always more optimal intersection routines like the one Ingo (Wald) describes in his thesis.
Also, given how heavily the SSE tracer uses compiler intrinsics that pretty much dictate the resulting assembly, I'm baffled that the debug SSE tracer is so much slower than the release version. With the C tracer, there's roughly a factor of 3 between them. The SSE tracer has a factor around 5. I guess the argument is that once you've truly pared the algorithm down to its essentials, introducing noise (or just debug-friendly assembly) has a larger relative impact. I was still surprised. For reference, in our Makefiles, debug and release mean:
1 December 2005:
For reference, I also benchmarked each of the routines. With release builds, the SSE generation makes 65536 rays in 1.13 msecs and the C version takes 5.18 msecs. That works out to be about 58 million eye rays per second for SSE and 13 million for C. Those numbers remain roughly accurate at 512x512 and 1024x1024 though there's a small falloff at 1024x1024 that I suspect is caused by memory pressure (created elsewhere in the system) as much as anything. They're also stable across scenes (which is good. I'd worry if the CPU could do identical math on certain values appreciably faster or slower than on others). I should note that I haven't heavily optimized either version, though the nature of reformulating the SSE version meant I manually hoisted some computations and precomputed some values. So these certainly aren't the best times possible, just a generic data point.
Wow! With SSE eye ray generation and thus more conveniently aligned and packed rays to the SSE traversal, the SSE packet traversal is finally faster than the straight C single ray traversal. And the robots scene is over a million rays per second. Looking at the code, there's some more I can shave by using the SSE representation to quickly detect invalid packets faster (or I could just eliminate them up front, but I'm too lazy) and I can squeeze out a couple of the branches in the inner loop(!) by assigning the new node index to near or far with conditional moves and guarding the stack push with a branch. Bears some testing.
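The conditional-move bit is just the standard SSE select idiom; noting it here so I remember what I meant (names are placeholders):

#include <xmmintrin.h>

/* Pick a or b per lane according to mask, where mask lanes are all-ones or
 * all-zeros (e.g. the result of _mm_cmplt_ps). No branch required. */
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}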
29 November 2005:
18 November 2005:
However, before diving too deeply in there, I've been working on the single ray intersection code. After staring at it for a while and quizzing my officemate on how dot and cross products intermingle, I pulled out a bunch of redundant computation. While I was in there, it seemed like a good time to pack intersection data in the same order as it's stored in the kd-tree nodes instead of randomly accessing into the array of vertices. All told, it definitely helped, but I didn't keep precise numbers.
After that, I decided to bite the bullet and experiment with mailboxing. The results were somewhat confusing. With perfect mailboxing (one mailbox per primitive that stores the id of the last ray it saw. If a ray ever encounters a primitive it's already seen, it just skips that primitive) and 256x256 scenes I get the following:
That seems like a pretty compelling demonstration that mailboxing is useful. At the same time, the cbox numbers don't make a lot of sense. With mailboxing, despite the fact that the mailboxing accomplishes nothing, it makes raycasting more than 5% faster?! I have no explanation. There's a similarly positive effect on the other scenes, but it makes more sense there. I wonder if the compiler's generating different code that has a weird impact.
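For the record, the "perfect mailboxing" scheme described above is only about this much code (the Primitive layout is a placeholder):

typedef struct {
    /* ... triangle data ... */
    int mailbox;            /* id of the last ray that tested this primitive */
} Primitive;

/* Returns nonzero if this ray already intersected this primitive in an earlier leaf. */
static inline int mailboxed(Primitive *p, int rayID)
{
    if (p->mailbox == rayID)
        return 1;           /* already tested; skip it */
    p->mailbox = rayID;
    return 0;
}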
And no doubt some more variations I'm condensing down.
17 November 2005:
Also, I cleared up my confusion on why the packet tracer saw such a higher MaxIntersections count (number of leaves visited by the packet that visited the most leaves). First off, I was scaling it by the packet size. Secondly though, I only track it across the whole packet, not across the rays within the packet, so imperfections in coherence can inflate the MaxIntersections count for a packet even if no ray actually is active in that many leaves. Pretty obvious in hindsight.
Precision has reared its ugly head again. Take the example of ray 5102 in a 256x256 cbox scene. With the ray at a time code, this ray traverses: far, far, near, near + push, near + push, pop, hit geometry. The ray is only in the cell where it hits geometry from tMin of 1098.60 to 1098.89. As part of an SSE packet, the same ray again starts off far, far, near, near + push, near + push, pop. However, this time it shows up in the node where it should hit geometry with a tMin of 1098.83 and a tMax of 1098.62. Since tMin is larger than tMax, the SSE tracer culls this ray and concludes it misses all geometry. I'm surprised to see that the SSE traversal has that much trouble relative to standard C floats. I know there are a variety of SSE math styles available at a trade off of speed for precision, but I didn't select any of those. The only saving grace is that with a SIMD+SSE triangle intersector, I'll compute those intersections even for 'inactive' rays (because it'll be effectively free). That would catch this case, but it wouldn't help in general (if precision caused us to stop traversing one node sooner).
At this point, the stats look pretty competitive-- the packet tracer isn't doing that much extra work:
Stats for module kdtreeCPU (cbox 256x256 single ray at a time)
kdtreeCPU:Traversals 730616
kdtreeCPU:TraverseLeaves 178118
kdtreeCPU:Intersections 140214
kdtreeCPU:Push 141189
kdtreeCPU:Pop 112582
kdtreeCPU:MaxIntersections 9
kdtreeCPU:MailboxHits 20
kdtreeCPU:MaxStackDepth 5
kdtreeCPU:TriMissTHit 195
kdtreeCPU:TriMissBary 73994
kdtreeCPU:TriHit 66025

Stats for module kdtreeCPU (cbox 256x256 sse 4 rays at a time)
kdtreeCPU:Traversals 750128
kdtreeCPU:TraverseLeaves 185968
kdtreeCPU:Intersections 140333
kdtreeCPU:Push 151096
kdtreeCPU:Pop 121252
kdtreeCPU:MaxIntersections 11
kdtreeCPU:MailboxHits 19
kdtreeCPU:MaxStackDepth 5
kdtreeCPU:MissGeometry 1
kdtreeCPU:MissBadPacket 1024
kdtreeCPU:HitEmptyStack 3
kdtreeCPU:TriMissTHit 253
kdtreeCPU:TriMissBary 74056
kdtreeCPU:TriHit 66024

The 'MissGeometry' and 'HitEmptyStack' stats are poor ray 5102 and its packet mates. The BadPacket count is the number of packets whose direction vectors didn't agree in sign in at least one direction and were traced ray at a time.
The bad news is that the packet tracer isn't dramatically faster than the single ray tracer. In fact, the debug version is consistently slower (at least, for cbox), but the release version is slightly faster (again, for cbox). With the bunny, even the release packet tracer is slightly slower than the release single ray tracer. There is a ray of hope in all of this (pardon the pun). If I remove all the leaf intersection code and just test raw traversal speed, the release sse tracer is 2x the release single tracer for the cbox and a dinky 20% faster for the bunny.
15 November 2005:
if (hits[ii].tHit > tMaxVec[ii]) { allDone = false; }
if (hits[ii].tHit < tMaxVec[ii]) { allDone = false; }

That's right, I would declare a packet not done if any ray's tHit were less than or greater than tMax. Clearly I meant tMin for the latter condition. And, as I thought about it some more, I didn't even need the latter condition. The good news is my performance is no longer pathological for big scenes (the effect of the bogus check was that no packet ever finished until it walked off the end of the whole kd-tree). So, more geometry led to bigger trees, which led to huge performance gaps.
Sadly, even that fix doesn't get me the legendary packet tracing speedup and I still see more intersections than the single ray at a time case. I worked through a simple example on the board and the wrinkle seems to come in determining when a packet is done. Here are two problematic scenarios: ray1 hits a primitive in the current cell. It's now done, but rays2-4 are still going. As long as the packet continues to traverse cells that ray1 would have traversed (if it hadn't hit anything) then my code will consider ray1 'live' and keep intersecting it with primitives, all of which will be rejected as worse hits than its current hit (after all, the current hit is the best possible hit, ray1 should be done).
Problem two: ray1 hits a primitive, but the packet keeps traversing because rays2-4 don't. Later, rays2-4 hit a primitive in a cell ray1 wouldn't have visited. That means, given the current algorithm, that ray1's current tMax is less than tMin which means tHit for hit1 is definitely less than tMax. It's hard to capture the proper tMax against which ray1's tHit should be tested since the algorithm goes out of its way to clobber tMax for inactive rays.
And now xenon has crashed while I was in the middle of mail trying to work out whether one of the stack values is truly redundant. Grr. I guess that means it's time to go home.
11 November 2005:
We're still doing vastly too many intersections (and, to some extent, traversals). I know my masking is very conservative right now, so hopefully tightening it up will fix the problem.
10 November 2005:
Wow! That's certainly an interesting effect. Sort of like looking at the Cornell Box through a screen door whose vertical hatching is wider than its horizontal hatching.
Okay, it's now significantly better (there's a big difference between the SSE instructions that only affect the first component and the ones that affect all four). However, I have to fudge tMax by a lot more (10% or so) to avoid the precision problems with the cbox and it does a whole lot more traversals and intersections than the ray-at-a-time version. I strongly suspect I'm mismanaging the masks. I know I need to get the size of stack elements down. And, I'm suspicious there's some other math problem-- possibly just putting an epsilon fudge factor into the intersection routine since I still see glitching on robots and glassner. Still, progress.
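The gotcha, for future me: the *_ss intrinsics only touch the low lane while the *_ps forms touch all four. A tiny illustration:

#include <xmmintrin.h>

void ss_vs_ps(__m128 a, __m128 b, __m128 *r_ss, __m128 *r_ps)
{
    *r_ss = _mm_min_ss(a, b);   /* lane 0 = min(a0, b0); lanes 1-3 copied from a */
    *r_ps = _mm_min_ps(a, b);   /* all four lanes get the min                    */
}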
8 November 2005:
Ever suspicious, I decided it was time to bulk up the brook writeQuery regression test. So, I augmented it to do the following:
Nothing special, right? Each drawing pass is wrapped in an occlusion query to count how many fragments are actually drawn. I expect the first pass to draw all size * size fragments, the second pass to draw threshold fragments (because of how the data is initialized) and the final pass to draw size * size - threshold fragments.
On ATI hardware, that's almost what happens. For smaller values of size (below about 490), everything works flawlessly. For larger values of size, only the first 240k or so fragments ever appear to be issued. The first occlusion query always reports roughly 240k and numbers reported by the final two passes sum to the 240k number.
On NVIDIA hardware, no such luck. As long as the depth test is enabled, all of the occlusion queries return garbage (around a few thousand fragments regardless of the specified size) for the two copies and zero fragments (or all fragments. It's random from run to run) for the shader with the fragment kills. Disabling the depth test produces correct results for the first two shaders, but since the third relies upon depth culling, it obviously doesn't work as intended (but it works as it should given that the depth test is disabled).
Here's a copy of the test binary for reference. Try running "./writeQuery
cbox backtrack 512x512 shortfixed = 2679537 r/s
cbox backtrack 512x512 float = 2672713 r/s
bunny backtrack 512x512 shortfixed = 378592 r/s
bunny backtrack 512x512 float = 376533 r/s
robots backtrack 512x512 shortfixed = 269271 r/s
robots backtrack 512x512 float = 269068 r/s
kitchen backtrack 512x512 shortfixed = 289405 r/s
kitchen backtrack 512x512 float = 290543 r/s

Bottom line: shortfixed was an optimization unjustified by the data. Stray aside-- those numbers are all significantly worse than the ones we report in our paper. I suspect I have the driver developers at ATI to thank for that (given how many driver versions I had to test to find the best ones before). Hooray.
4 November 2005:
Two corollary questions though: Why is this builder so much better than the pbrt builder / Are there minor changes that could be ported with a major impact? And, what effect do these kd-trees have on the GPU traversal / what's the right tree building strategy for a GPU targeted traversal?
3 November 2005:
Interesting. One consequence obvious in hindsight (but I had to look at the stat counters to realize it) is that there are a whole lot fewer traversals for the same scenes. For the kitchen numbers below, the default kd-tree makes 95183 traversals and the GPU-friendly one makes 67411. Now, traversal only needs to be 11.2x the speed of intersection in order for that to be a good trade. It's still not a good trade, at least for debug builds, but the default kd-tree is only about 5% slower instead of vastly slower (which is to say I didn't record the number, but it was tens of percentage points).
Stats for module kdtreeCPU
kdtreeCPU:Nodes 185069
kdtreeCPU:Splits 92534
kdtreeCPU:Leaves 92535
kdtreeCPU:EmptyLeaves 25250
kdtreeCPU:Traversals 236591
kdtreeCPU:Restarts 13081
kdtreeCPU:Intersections 31817

These are from the tree generated by the GPU-friendly rules:
Stats for module kdtreeCPU
kdtreeCPU:Nodes 107841
kdtreeCPU:Splits 53920
kdtreeCPU:Leaves 53921
kdtreeCPU:EmptyLeaves 10394
kdtreeCPU:Traversals 142922
kdtreeCPU:Restarts 11244
kdtreeCPU:Intersections 34296

So, the GPU-friendly tree causes a bit under 8% more intersections, but nearly a 40% reduction in traversals. In order for that to break even, let alone be a good trade, traversal needs to be nearly 38x as fast as intersection. I don't think I'm quite there...
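(For reference, the break-even arithmetic from those two stat dumps:

extra intersections: 34296 - 31817 = 2479
traversals saved:    236591 - 142922 = 93669
93669 / 2479 ≈ 37.8

so traversal needs to be about 38x as fast as intersection just to break even.)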
The "measure and adjust" phase is blocked behind my acquiring a reasonable profiler, but VTune has eval licenses while we see whether or not we can con Intel into donating some.
1 November 2005:
Well, now the code seems to be correct (modulo restart having some awkward boundary conditions), but still speckled. And still not so fast. Well, the release build gets over 1.5 million rays per second on the cbox, but it's down below 400k rays per second for the kitchen. Definitely time for some more optimizing.
An interesting note-- the cbox throughput halves when I switch to the GPU-friendly kd-tree construction parameters, but the kitchen performance goes up 20%. Clearly I'm still doing a poor job with deeper trees.
28 October 2005:
The CPU kdtree has landed. Initial version of CPU kdtree support. We use the same builder as the brook kdtree and then repack the results into our preferred format. I didn't implement a stack yet, so I use the restart algorithm instead. I'm almost certain I did something wrong because a number of the images are incorrect in places, especially if I tweak the various tree creation parameters. Specific changes:
- Compile LOG() and LOG_ONLY() out of release builds completely.
- Convert f3 from a struct with .[xyz] to an array of 3 floats. I toyed with making them a union for syntactic convenience and decided the nuisance of vector.components.x vs. vector.array[0] outweighed any potential clarity.
- Beef up the timing and stats reporting for the CPU tracer.
- Fix intersectRay() to take a const RayCPU& instead of copy-constructing a ray (oops).
- Pull the ray-triangle intersector into a header file so all the CPU accelerators can share it.
- Add F3_INV and F3_FROM_FLOAT3 / FLOAT3_FROM_F3 to facilitate conversions.
- All the kdtree code (obviously).
- Clean up some bogus status messages.
- Attempt to restore the Cornell Box lighting to a close approximation of its original form. The coloured lights were cool in a tie-dyed sort of way, but made it harder for me to spot problems in the image while debugging.
Hooray. Now I really need to nail down the incorrect results and then optimize it. Sadly, it's already as fast as the GPU versions as it stands.
27 October 2005:
25 October 2005:
14 October 2005:
Well, I got distracted and simplified the shading code instead. It was slightly relevant to my larger goal of reducing the amount of stream-state so that I can substitute in CPU-backed routines. Only slightly, though. Ah well, time to go not drink repeatedly filtered cheap vodka.
6 October 2005:
Tim ran the simple hacked up pixel-shader 3 kd-tree raytracer and it was in impressive territory, over 1 million rays per second. It's completely unoptimized, but a strong endorsement of looping and branching for hardware that has real branching granularity. The gpubench test isn't available since the board only supports pixel-shader 3 under D3D, but reputedly (and apparently, judging from results), it needs far less coherence than the 64x64 or 128x128 that the 6800 and 7800's need.
In other news, new employee orientation at nvidia wasn't a great fit for my role as a (very) part-time intern, but the HR people were friendly and helpful, I actually did have a cube and a machine, and we managed one long constructive briefing / conversation on technical issues before I turned into an erev Rosh Hashannah pumpkin.
Addendum: I forgot to mention that the nvidia IT people somehow managed to turn Tim into a total non-person in the process of converting him from summer intern into school year intern. They took away one of his machines, stole the video card from the other, and completely garbled his user credentials so he couldn't access anything and got really weird error messages. Most impressive.
26 September 2005:
6 June 2005:
Guess that means back to idly pondering the gpubench / shader analyzing paper.
31 May 2005:
27 May 2005:
19 April 2005:
15 April 2005:
We talked to the GROMACS guys yesterday and I looked over their code. It seems to be the worst sort of expedience-first programming and I suspect a lot of that derives from its roots in Ian just hacking both Brook and the .br files to try and get through his thesis. It certainly seems a stark contrast to the raytracer in terms of organization and readability (but that may be influenced by the fact that I know the raytracing code so well and chose the organization). In any event, I think it would be ideal to try and educate them a lot more on the constraints and strengths of the hardware (and underlying Brook implementations).
For whatever reason, Pat was interested in our chief grievances for getting good performance from the nvidia hardware, so I drafted my thoughts (this version has been tidied a little and made more precise after some comments from Kayvon).
12 April 2005:
11 April 2005:
Now I really, really, need to study for quals. Instead, I'll read GPU Gems 2 for a while.
7 April 2005:
My bandwidth gloom and doom of last spring has proven somewhat overstated. At least the ATI hardware is roughly 2 clocks per float pulled across the bus and the raytracing kernels aren't so far off. I guess the saving grace is that they aren't vectorized so they do so little math. We're actually compute bound on bruteforce, but only by about 1 instruction. Still, I'd anticipated being heartily bandwidth bound. Ironically, of all the kernels, the kd-tree downwards traversal kernel is the only one that's clearly compute bound. So much for fast and cheap traversal. The poor voxelgrid traversal is massively bandwidth bound, but part of that is ATI's commitment to throwing performance out the window if you have 3 or 4 render outputs.
Actually, one major repeated pain point in this whole thing has been getting stuck with the ATI hardware. I have to concede that it actually works, whereas the NV4x boards using D3D bluescreen running the kd-tree and just produce inexplicably erroneous ray-triangle intersections with the voxelgrid. However, everything about using it is a struggle. Last weekend I installed every driver from 4.9 through 5.3 looking for one that was stable and sane and each one caused at least one kernel to plummet or shoot up in performance (and it was all different kernels and all different improvements or harm). Daniel and Mike found that the 5.3 drivers cut their performance by more than a factor of 3x relative to 4.10! Today yielded 5.4 drivers and they're somewhere in the middle of the fray too. Don't even ask about GL, either. The Brook regression tests just plain won't run unless you back all the way off to 4.9 because of bugs in pbuffer handling. Oh, and numerical precision is apparently optional. I can ruefully shake my head at 16-bit mantissas and realize the implications, but we found an even more fun item when Tim was trying to determine why the restart tree was broken: sometimes <= just doesn't work. We needed to add an additional case to handle two values being equal because the board concluded that the "or equals" part of <= was just us joking around.
In other gpubench news, apparently some marketing people in the world have nothing better to do than web searches for gpubench results and then tamper with the URL until they find more.
28 March 2005:
GPUBench results now available and for no good reason the PCIe board appears to have about 2x the readback rate of the AGP board. No good reason meaning it's still only 200 MB/s so it's not like it's bus bound (as a matter of fact, it's about 10x away from being bus bound). So much for PCI express solving the world's readback and download problems.
And, the DX9 version of the raytracer even seems to run correctly. That'll be quite helpful for gathering non-timing data instead of constantly running out to the lab.
17 March 2005:
Looks like I'm not the only one viewing 'branching' with a jaundiced eye.
15 March 2005:
if (tex0[texCoord0] >= t) {
    discard;  /* kill this fragment */
}
result.color = tex0[texCoord0];

and wrap the whole thing in an occlusion query to determine how many fragments are below the threshold. It's a simple test exactly equivalent to my brook writeQuery test. Testing my NV41 shows that GL occlusion queries shouldn't be as slow as I've been seeing:
(* size  usecs per query  queries/sec *)
128    26.6539    37518  (* 8058 *)
256    63.2904    15800  (* 32505 *)
512    208.8024   4789   (* 130736 *)
1024   771.0500   1297   (* 524323 *)
2048   3070.4773  326    (* 2096241 *)

Unfortunately, testing the X800 produces all zeroes in the framebuffer / pbuffer and always returns a count of the resolution drawn. That's right, not only does it draw no samples, it claims it drew all of them. I wonder how I'm supposed to be issuing occlusion queries on ATI cards.
Apparently, strictly speaking, occlusion queries require the depth test be enabled. Who knew? Not my NV41, which doesn't mind if I leave depth off. The folks at ATI know, though, and reserve the right to report completely bogus results if depth is not present. Better still though, it takes exactly as long to report these incorrect results as to report correct ones. To top it all off, my GL occlusion queries there are an extra 30+ times slower than on NV41. It takes the card around 110 msecs to report incorrect results. The DX numbers are quick and, once I added depth, even correct.
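For completeness, the GL side of wrapping a drawing pass in an occlusion query looks roughly like this (ARB_occlusion_query, with the depth test left on to keep ATI honest; assumes the ARB entry points are already resolved and drawFullScreenQuad() is a stand-in for the real pass):

#include <GL/gl.h>
#include <GL/glext.h>

extern void drawFullScreenQuad(void);       /* hypothetical: runs the kill shader */

GLuint countFragmentsDrawn(void)
{
    GLuint query, passed = 0;

    glEnable(GL_DEPTH_TEST);                /* required for correct ATI results */
    glGenQueriesARB(1, &query);
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
    drawFullScreenQuad();
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);
    glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &passed);
    glDeleteQueriesARB(1, &query);
    return passed;
}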
14 March 2005:
Unfortunately the GL fast reduction / query is fairly slow compared to the D3D version (nearly 10x slower). I suspect it's the copy from the framebuffer of the (unneeded) output. Grr. The D3D version benefits immensely because not only are the queries so much faster, readback is so much slower.
Ah. Hacking it up to compile with gpubench instead of glut produced a working test. And, indeed, its barebones throughput numbers are quite impressive (over 1500 MB/sec on my NV41). Unfortunately, actually touching the data (i.e. read the contents of the buffer) halves throughput back down to the same ranges we've been seeing all along.
10 March 2005:
4 March 2005:
The grid is faster and more robust than it's ever been, but sadly bruteforce still has it beat. Poor poor missing earlyz.
3 March 2005:
That was more aggravating, but less justified, than the fact that the code that computed tMax would end up with negative infinity instead of positive infinity for parallel rays. That little trick (which was just a greater-than instead of greater-than-or-equals) caused the code to always traverse the wrong way its first time.
28 February 2005:
One big concern / nuisance in this whole thing is the miasma of hardware and driver issues. Currently only the DX version of Brook supports earlyz and occlusion queries. The latter is easy enough to add to GL, but useful earlyz really requires read-modify-write access to textures which, even with FBOs, we can't currently get on GL. However, the DX version of the raytracer is an instant bluescreen on NV4x with any acceleration structure and just a garbled image with bruteforce. However however, we only have the one X800 and it has a few quirks of its own (most notably precision). So we can either commit to DX, which will leave us with one board between two people (and the 9800s just fail fetching triangles for no good reason) or we can use the NV4x's we have. I sent some more mail off to our friends at nvidia and prepared demo versions of the bluescreen shader and my FBO demo. Hopefully it'll turn out to be some bug (heck, hopefully even our bug so we can fix it) that's easily resolved. A boy can hope, can't he?
24 February 2005:
Allocate texture texID and fill it with input data
for (i = 0; i < count; i++) {
    glBindTexture(GL_TEXTURE_RECTANGLE_EXT, texID);
    glFramebufferTexture2DEXT(... texID ...);
    /*
     * Rasterize a quad with a shader that writes:
     *   result.color = texture0[texCoord0] * 4;
     */
}

I was hoping the test would just work (i.e. after count iterations, the framebuffer and texture would contain 4^count times the original input). I was prepared for errors or garbage results. The nvidia driver developers still managed to catch me by surprise, though. The result of the above is that after count iterations (or any other number) the framebuffer contains exactly 4 times the original input and the input texture is unchanged. Somehow 'render to texture' became 'render to framebuffer, but skip copying back to texture'. If I modify the same test to allocate two textures and ping pong back and forth then everything works as expected. 'Tis a pity.
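For contrast, the two-texture ping-pong that does work looks roughly like this (EXT_framebuffer_object style; setup, shader binding, and error checks omitted, and the names are placeholders):

#include <GL/gl.h>
#include <GL/glext.h>

extern void drawQuad(void);       /* hypothetical: rasterizes the "* 4" shader */

GLuint tex[2];                    /* both GL_TEXTURE_RECTANGLE_EXT, same size  */

void iterate(GLuint fbo, int count)
{
    int i, src = 0, dst = 1;

    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    for (i = 0; i < count; i++) {
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_RECTANGLE_EXT, tex[dst], 0);
        glBindTexture(GL_TEXTURE_RECTANGLE_EXT, tex[src]);   /* read the other one */
        drawQuad();
        src ^= 1;
        dst ^= 1;
    }
}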
22 February 2005:
21 February 2005:
I did some more pushing on the branching code to get an idea of granularity and have some interesting graphs. As I'd already seen, earlyz is just way better than pixel-shader 3.0. Here are the results from my nv41 and the X800 in the lab. The NV41 seems to have an earlyz tile size of 4x4 (or possibly 2x4 since I only tested squares) while the X800 is 2x2 (or 2x1).
Now meet ps30 (NV41 only. ATI doesn't have any ps30 parts currently)! Not only is it slower in all cases (note the y-intercept when no pixels are being drawn and the value when 100% of the pixels are being drawn), but it requires huge coherence. My card only has 12 pipes, so even requiring all currently executing fragments to branch the same way wouldn't be too onerous. Instead, it looks like most (all?) of the pixels in flight all have to be taking the same branch. It'd be nice to run a test where both sides of the branch (i.e. taken and not-taken) do heavy math, just different heavy math. It's not so bad if that ends up taking the longer of the two sides' execution time. If it takes the sum though, ouch.
18 February 2005:
Unfortunately, playing with it interactively drives home exactly how much faster the 'bruteforce' approach is than the uniform grid for our scenes and the current code. The brute force approach can do even the glassner scene at about 10 fps while the grid is under 1 fps. Part of the delta is that currently the grid reads back each iteration to tell how many rays are live, but even if I hack it to only run the correct number of iterations without reading back, the grid is still probably 8-10x slower for the cornell box. Moral of the story: (1) we need better (and bigger) scenes and (2) the grid code needs some attention. The big nuisance with the grid is the lack of summary statistics (i.e. it's hard to count how many rays are traversing, intersecting, or shading without running an expensive reduction or reading back). They can probably be faked with occlusion queries, but it's unclear how cheaply. The more urgent need is more efficient traversal and intersection, and that probably means earlyz. Just packing some needless float4's into float1's may help a lot too (given the NV40 pathology with non-float1 textures and the texture cache).
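If we do end up faking the counts with occlusion queries, the GL side would presumably look something like this (ARB_occlusion_query, with renderRayStatePass() standing in for a pass whose fragment program KILs the dead rays so only live ones generate samples):

GLuint query;
glGenQueriesARB(1, &query);

glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
renderRayStatePass();       /* one fragment per ray; dead rays get KILed */
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

/* This blocks until the pass finishes; GL_QUERY_RESULT_AVAILABLE_ARB could
 * be polled instead to overlap the wait with other work. */
GLuint liveRays = 0;
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &liveRays);

glDeleteQueriesARB(1, &query);

Whether that's actually cheaper than the readback is exactly the open question.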
And now I really need something to talk about Tuesday. Or at least some structure to collect all the various irons I've got in the fire into some coherent, unified form.
15 February 2005:
14 February 2005:
GPUBench indicates that, in addition to being more correct, the drivers perform better in some nice ways. Their sequential texture fetch bandwidth about doubled and data-dependent random texture fetch went up by 4x! Fill rates for float2-4 buffers also doubled, but are still only half that of an X800. The raw numbers for the card finally beat a 9800XT across the board though...
9 February 2005:
Anyhow, I finally augmented the main script to generate early-z and pixel shader 3.0 (where available) conditional execution results and modified the instruction issue test so that it can test a range of sizes. The NV41 ramps up pretty quickly-- if you run 10 iterations of a 64 MUL kernel then you get nearly peak GFLOPS even at 64x64. Kayvon's suspicious that the ramp-up is much slower when texture fetches are involved, so he's going to write his own test and then we can make graphs of the ramp both with and without texture fetches.
3 February 2005:
Sadly, while I now have pbuffers, it looks like GL_TEXTURE_RECTANGLE_EXT isn't supported, which is going to be a big nuisance. Time to try the beta drivers and send some more email.
1 February 2005:
kernel void crashKernel(float4 in1<>, float gat1[],
                        out float4 out1<>, out float4 out2<>) {
    float val;

    val = gat1[0];
    out1 = in1;
    if (val < 0) {
        out2 = float4(0, 0, 1, 0);
    } else {
        out2 = float4(0, 1, 0, 1);
    }
}

int main(void) {
    float4 input<1>, out1<1>, out2<1>;
    float gat<1>;
    float4 out1_d[1];

    crashKernel(input, gat, out1, out2);
    streamWrite(out1, out1_d);
    return 0;
}

Hopefully the nice people at nvidia will be able to figure out what's going wrong and suggest a workaround.
31 January 2005:
"PARAM c[6] = { program.local[0..4],\n" " { 0, 1 } };\n" "TEMP R0;\n" "TEMP H0;\n" "TEMP RC;\n" "TEMP HC;\n" "MOVR R0.y, c[5].x;\n" "MOVR R0.x, c[3].z;\n" "TEX R0.x, R0, texture[9], RECT;\n" "SLTR H0.x, R0, c[5];\n" "TEX result.color, fragment.texcoord[1], texture[3], RECT;\n" "MOVRC HC.x, H0;\n" "MOVRC HC.w, H0.x;\n" "TEX result.color(EQ.w), fragment.texcoord[2], texture[4], RECT;\n" "MOVR result.color[1], c[5].xyxy;\n" "MOVR result.color[1](NE.x), c[5].xxyx;\n" "END \n"Another indication that either all results must be written conditionally or none.
virtual unsigned int getMaximumOutputCount() const { return 1; }

is (rightfully) picky about letting you override it. For example, the following:
virtual unsigned int getMaximumOutputCount();

does not suffice. Nor, of course, does the C++ compiler complain. After all, for all it knows, maybe you really did want to define a new, non-const method that just happens to have the same name as an inherited, but const, method. Happens every day. Anyhow, adding the word const to the latter instance unconfused the OpenGL Brook runtime when handling multiple-output shaders. Embarrassingly, the confused OpenGL runtime still generated less breakage than the allegedly legitimate D3D and OpenGL display drivers. (I didn't say it was embarrassing for the runtime.)
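In code form, the trap looks like this (class names invented; the 4 in the derived version is just an illustrative value):

class Runtime {
public:
    /* Base class: note the const. */
    virtual unsigned int getMaximumOutputCount() const { return 1; }
};

class OGLRuntime : public Runtime {
public:
    /* Without const, this would declare a brand new method that merely
     * hides the inherited one -- anything calling through a Runtime*
     * still sees the base version's 1: */
    /* virtual unsigned int getMaximumOutputCount(); */

    /* With const, it actually overrides: */
    virtual unsigned int getMaximumOutputCount() const { return 4; }
};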
Now, the meat. I decided I'd spent enough time chasing the dx9 BSOD for now and was going to try to figure out why the fp40 version of the raytracer generated a weirdly broken image. However, when I ran the fp40 version of the BSOD shader, Windows cheerfully informed me:
raytracer.exe has encountered a problem and needs to close. We are sorry for the inconvenience.

DevStudio told me moments later that I'd branched to hyperspace. Hours of debugging later, all of our code looks legitimate (modulo calling glDrawBuffersATI() twice back to back for fun) and glBegin(GL_TRIANGLES) causes the offending wild branch. Chalk up another kill for the display driver.
At this point, I resorted to tampering with the generated assembly. Here's what cgc produced (the #if's are my additions):
"PARAM c[6] = { program.local[0..4],\n" " { 0, 1 } };\n" "TEMP R0;\n" "TEMP RC;\n" "TEMP HC;\n" "MOVR R0.y, c[5].x;\n" "MOVR R0.x, c[3].z;\n" "TEX R0.x, R0, texture[9], RECT;\n" "SLTRC HC.x, R0, c[5];\n" #if 1 "TEX result.color, fragment.texcoord[1], texture[3], RECT;\n" #endif "MOVR result.color[1], c[5].xyxy;\n" #if 1 "MOVR result.color[1](NE.x), c[5].xxyx;\n" #endif "END \n"Intriguingly, changing either '#if 1' to an '#if 0' produces a shader that runs without hyperspacing the GL driver. Apparently a mixing predicated output writes with non-predicated ones is evil. Someone should mention this fact to either cgc or the driver authors. Or both.
Remaining unanswered questions: Is this the reason the fp40 and arb versions of the raytracer look incorrect? I.e. can this sort of problem lead to incorrect results instead of the more heavy-handed outright crash? Why do we crash? Is it just a driver bug or are we violating a hardware constraint? In the latter case, the failure mode mystifies me. Oh well, I can't think about that right now. If I do, I'll go crazy. I'll think about that tomorrow. After all... tomorrow is another day.
29 January 2005:
kernel void krnValidateIntersection(Ray ray<>, Hit candidateHit<>,
                                    float3 grid_min, float3 grid_vsize,
                                    float3 grid_dim, Hit prevHit<>,
                                    TraversalDataDyn travDataDyn<>,
                                    RayState oldRayState<>,
                                    GridTrilist triList[],
                                    out Hit hit<>, out RayState rayState<>) {
#if 1
    /*
     * This kernel will crash with the dx9 runtime
     */
    float triNum;
    triNum = triList[0];
    hit = candidateHit;
    if (triNum < 0) {
        SET_SHADING(rayState);
    } else {
        SET_INTERSECTING(rayState, 1);
    }
#else
    /*
     * This kernel will not crash with the dx9 runtime
     */
    float triNum;
    triNum = triList[0];
    if (triNum < 0) {
        hit = candidateHit;
        SET_SHADING(rayState);
    } else {
        hit = prevHit;
        SET_INTERSECTING(rayState, 1);
    }
#endif
}

The calling C++ code is literally just a kernel invocation and a streamWrite() of rayState. I ifdef'ed out the rest and it still bluescreens. Unfortunately, my attempts to replicate the crash by hacking up the multiple_outputs Brook regression test have all failed thus far. Ah well, out of time.
28 January 2005:
STOP: 0x0000008E (0xC0000005, 0xBF2936BD, 0xB466AE58, 0x0000000)
nv4_disp.dll - Address BF2936BD base at BF012000, DateStamp 4182eb4e

Probable English translation: "Device drivers actually have to check that their pointers are valid before dereferencing them?! I knew I forgot something." Making my buffer 100x as large doesn't help the situation either. Sigh. Guess that's enough for today. Good thing I keep letting Windows upload the crash dumps to its mothership. I'm sure some grateful support person at nvidia will welcome my data.
Since I'm a sucker, once everything was up and working, I decided it was finally time to take the ol' Realizm 100 for a spin. I pulled out the NV leafblower and replaced it with the inexplicably even larger 3dlabs card and went to work. (Aside, 3dlabs apparently has a "device driver" named Heidi:
The 3Dlabs Heidi Device Driver is an interface for the Wildcat Realizm driver and AutoCAD applications. The Heidi driver provides improved performance and compatibility with AutoCAD applications and the hardware acceleration of the Wildcat Realizm video card through the use of OpenGL.

this is not their display driver, just some extra driver in case you thrive on confusion when downloading software). Anyhow, it quickly became clear that GPUBench wasn't going to play well with others. Despite the fact that the contact for all of the OpenGL extensions involving floating point buffers and textures seems to have an @3dlabs email address, issuing actual ARB wgl and gl calls makes the driver angry.
So I tried the readback test, which only really requires that glReadPixels() works. Death. Even in fixed point mode. Then I decided to run each format one by one and discovered two things: GL_ABGR_EXT no supporto, but RGBA and BGRA readback is pretty darn fast (roughly 1 GB/s). Just when you're ready to believe that fast readback on AGP is a myth, it's 3dlabs to the rescue!
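For what it's worth, the readback measurement itself is about as simple as these tests get; roughly this (a sketch, not the GPUBench source, and readbackMBps/width/height/iterations are made-up names):

#include <ctime>
#include <vector>
#include <GL/gl.h>      /* (windows.h first on Windows) */

/* Time repeated glReadPixels() calls of a width x height RGBA float
 * surface and convert to MB/s.  Assumes a current GL context whose read
 * buffer is the surface being measured. */
double readbackMBps(int width, int height, int iterations)
{
    std::vector<float> pixels(width * height * 4);

    glFinish();                                   /* don't count pending work */
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, &pixels[0]);
    double seconds = double(clock() - start) / CLOCKS_PER_SEC;

    double megabytes = double(width) * height * 4 * sizeof(float)
                       * iterations / (1024.0 * 1024.0);
    return megabytes / seconds;
}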
27 January 2005: