jak-project/docs/scratch/shrub_asm.md
water111 b2ed9313bd
[graphics] First part of shrub extraction (#1258)

Co-authored-by: Tyler Wilding <xtvaser@gmail.com>
Co-authored-by: Tyler Wilding <xTVaser@users.noreply.github.com>
2022-03-28 18:14:25 -04:00

Shrub Renderer

The shrub renderer is part of the background system. Each level appears to have at most one drawable-tree-instance-shrub, which contains all of the shrubs in that level.

Because the shrub renderer is part of the background system, actual DMA generation happens in finish-background.

Original Design

In shrub, there are prototypes and instances. Each "prototype" defines a model (like a bush, tree, etc). Each "instance" is a particular placement of a prototype in the world.

Each "prototype" has 4 different geometries. Some of the geometries can be missing:

  • prototype-generic-shrub
  • prototype-shrubbery
  • prototype-trans-shrubbery
  • billboard

The first two are believed to contain the same data. The generic version exists because a shrub that is very close to the player and partially off-screen must be scissored, and only the generic renderer supports scissoring.

The prototype-trans-shrubbery allows shrubs to fade away. Its format is likely extremely similar to the others, or even identical.

The billboard is a single quad.

Effects:

  • Time of Day lighting. It looks like each "drawable-tree-instance-shrub" has a time-of-day color palette that is adjusted based on the time of day.
  • Per-instance time of day lighting. Each instance may use different colors.
  • Wind effect. This applies an additional transformation matrix per instance.

Our Design

We will ignore the prototype-generic-shrub - OpenGL will take care of scissoring for us.

Like with tfrag/tie, we will do the time of day interpolation in C++.

The shrubs without wind effect will be converted into a single giant mesh. Doing it as a single mesh reduces the number of draw calls, and the entire mesh can be left in GPU memory the whole time.
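
A minimal sketch of this de-instancing step, assuming a hypothetical `PackedInstance` layout: the prototype's vertices are transformed by each instance's matrix and appended to one shared vertex list. The fixed-point scales (rows 0 to 2 divided by 4096, row 3 multiplied by 64 and offset by the bsphere center) follow the annotations on the matrix unpacking code in `draw-inline-array-instance-shrub` later in these notes. The struct and function names are made up for illustration and are not the actual extractor code.

```cpp
#include <cstdint>
#include <vector>

struct Vec3 {
  float x, y, z;
};

// Hypothetical stand-in for the packed per-instance data.
struct PackedInstance {
  int16_t origin[4][4];  // 4 rows of 4 signed 16-bit values
  Vec3 bsphere_center;   // center of the instance's bounding sphere
};

// Three rotation/scale rows plus a translation row, all as floats.
struct InstanceMatrix {
  Vec3 rows[3];
  Vec3 trans;
};

// Assumed scales: rows 0-2 have 12 fractional bits, row 3 is scaled by 64
// and is stored relative to the bounding sphere center.
InstanceMatrix unpack_matrix(const PackedInstance& inst) {
  InstanceMatrix m;
  for (int r = 0; r < 3; r++) {
    m.rows[r] = {inst.origin[r][0] / 4096.0f, inst.origin[r][1] / 4096.0f,
                 inst.origin[r][2] / 4096.0f};
  }
  m.trans = {inst.origin[3][0] * 64.0f + inst.bsphere_center.x,
             inst.origin[3][1] * 64.0f + inst.bsphere_center.y,
             inst.origin[3][2] * 64.0f + inst.bsphere_center.z};
  return m;
}

// Row-vector convention: p' = x*row0 + y*row1 + z*row2 + trans.
Vec3 transform_point(const InstanceMatrix& m, const Vec3& p) {
  return {p.x * m.rows[0].x + p.y * m.rows[1].x + p.z * m.rows[2].x + m.trans.x,
          p.x * m.rows[0].y + p.y * m.rows[1].y + p.z * m.rows[2].y + m.trans.y,
          p.x * m.rows[0].z + p.y * m.rows[1].z + p.z * m.rows[2].z + m.trans.z};
}

// Append one placed copy of the prototype to the shared mesh.
void append_instance(std::vector<Vec3>& mesh,
                     const std::vector<Vec3>& proto_vertices,
                     const PackedInstance& inst) {
  const InstanceMatrix m = unpack_matrix(inst);
  for (const Vec3& v : proto_vertices) {
    mesh.push_back(transform_point(m, v));
  }
}
```

Calling `append_instance` once per non-wind instance yields a single vertex list that can be uploaded once and drawn without any per-instance matrix.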

The shrubs with wind effect will be drawn as individual instances, as different shrubs need different wind matrices. It's likely going to be similar to render_tree_wind.

The time-of-day effect will be done like in tfrag/tie. We will create a new time of day texture on each frame, based on the current time, and each vertex will index into a single large texture. This approach is nice because the interpolation/upload can be done in a single large batch.
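
The batch interpolation can be sketched roughly as below. This is only an illustration: it assumes each palette entry stores 8 colors (one per time-of-day palette) and that the caller supplies 8 blend weights, and `PaletteEntry`, `interp_time_of_day`, and `kNumTimes` are hypothetical names rather than the real renderer's.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

constexpr int kNumTimes = 8;  // assumed number of time-of-day palettes

struct Rgba8 {
  uint8_t r, g, b, a;
};

// One palette slot: the same color at each of the kNumTimes times of day.
struct PaletteEntry {
  std::array<Rgba8, kNumTimes> colors;
};

// Round to nearest and clamp to the 0-255 range.
static uint8_t to_u8(float v) {
  return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, v + 0.5f)));
}

// Builds the per-frame color texture: one texel per palette entry.
// Vertices would then look up their color by palette index.
std::vector<Rgba8> interp_time_of_day(const std::vector<PaletteEntry>& palette,
                                      const std::array<float, kNumTimes>& weights) {
  std::vector<Rgba8> texture;
  texture.reserve(palette.size());
  for (const PaletteEntry& entry : palette) {
    float r = 0.0f, g = 0.0f, b = 0.0f, a = 0.0f;
    for (int t = 0; t < kNumTimes; t++) {
      r += weights[t] * entry.colors[t].r;
      g += weights[t] * entry.colors[t].g;
      b += weights[t] * entry.colors[t].b;
      a += weights[t] * entry.colors[t].a;
    }
    texture.push_back({to_u8(r), to_u8(g), to_u8(b), to_u8(a)});
  }
  return texture;
}
```

Because the output is one flat array, the whole result can be uploaded as a single texture update each frame.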

Setup Before (in background.gc)

The shrub system doesn't use the precomputed visibility strings, so we can ignore them.

  • The background-upload-vu0 function loads vf16-vf31 with various math camera values.
  • The background-upload-vu0 function loads the background-vu0-block program to VU0 and runs the subroutine at 0.
  • The current level index (0 or 1) is stored in the scratchpad (as a terrain-context)
  • The time of day colors are calculated with time-of-day-interp-colors. The colors are stored in *instance-tie-work*. We can move this to C++ and do it faster.

After setup, the main function to generate DMA is draw-drawable-tree-instance-shrub. This function will be removed in the PC port. Instead, we will send the C++ code some data:

  • camera matrix
  • name of the level

draw-drawable-tree-instance-shrub

Basic outline

  • Reset the instance-shrub-work
  • Check if renderer is enabled
  • Call draw-inline-array-instance-shrub. Each prototype has a "bucket" containing a linked list of instances. This function adds the instances to the buckets.
  • Call draw-prototype-inline-array-shrub. This builds the final DMA list from the buckets.
  • Various performance counter things that we can ignore.
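
The two-pass structure in the outline can be sketched as follows, assuming the visible instances have already been found. The original keeps a linked list per prototype bucket; this sketch swaps that for a vector of indices, and every name in it is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct VisibleInstance {
  uint8_t bucket_index;  // which prototype this instance places
  int instance_id;       // illustrative handle for the instance data
};

// Pass 1: sort visible instances into their prototype's bucket.
std::vector<std::vector<int>> bucket_instances(
    const std::vector<VisibleInstance>& visible, std::size_t num_prototypes) {
  std::vector<std::vector<int>> buckets(num_prototypes);
  for (const VisibleInstance& inst : visible) {
    if (inst.bucket_index < num_prototypes) {
      buckets[inst.bucket_index].push_back(inst.instance_id);
    }
  }
  return buckets;
}

// Pass 2: walk the prototypes in order and emit (prototype, instance) draws.
// Grouping by prototype means each model's data is set up once per group.
std::vector<std::pair<std::size_t, int>> build_draw_list(
    const std::vector<std::vector<int>>& buckets) {
  std::vector<std::pair<std::size_t, int>> draws;
  for (std::size_t proto = 0; proto < buckets.size(); proto++) {
    for (int id : buckets[proto]) {
      draws.emplace_back(proto, id);
    }
  }
  return draws;
}
```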

draw-inline-array-instance-shrub

Args:

  • a0 dma buffer
  • a1 inline array of draw-node (a usual draw-node BVH with child type instance-shrubbery)
  • a2 length of this array
  • a3 inline array of prototype-bucket-shrub
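
Before the annotated assembly, here is a simplified sketch of the traversal it performs: an explicit stack of (node, length) frames with a maximum depth of 6, and a sphere-versus-plane test at each node. It leaves out the distance check, the scratchpad DMA batching, and the fact that the original tests individual instances later in a VU0 program. The plane convention (a point is inside when `n . p - d >= 0`) and all type and function names are assumptions made for illustration.

```cpp
#include <array>
#include <vector>

struct Sphere {
  float x, y, z, r;
};

struct Plane {
  float nx, ny, nz, d;  // assumed convention: inside when n . p - d >= 0
};

struct DrawNode {
  Sphere bsphere;
  bool children_are_instances;  // false: children are more draw nodes
  int first_child;              // index into nodes or instances
  int child_count;
};

constexpr int kMaxDepth = 6;  // fixed maximum tree depth

// Reject a sphere only if it lies fully behind at least one plane.
bool sphere_in_view(const Sphere& s, const std::array<Plane, 4>& planes) {
  for (const Plane& p : planes) {
    const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z - p.d;
    if (dist + s.r < 0.0f) return false;
  }
  return true;
}

std::vector<int> collect_visible_instances(const std::vector<DrawNode>& nodes,
                                           int root_first, int root_count,
                                           const std::vector<Sphere>& instances,
                                           const std::array<Plane, 4>& planes) {
  struct Frame {
    int node;    // next node to visit at this depth
    int length;  // nodes remaining at this depth
  };
  std::array<Frame, kMaxDepth> stack{};
  int depth = 0;
  stack[0] = {root_first, root_count};
  std::vector<int> visible;

  while (depth >= 0) {
    Frame& frame = stack[depth];
    if (frame.length == 0) {
      depth--;  // finished this level, return to the parent
      continue;
    }
    const DrawNode& node = nodes[frame.node];
    frame.node++;
    frame.length--;
    if (!sphere_in_view(node.bsphere, planes)) continue;

    if (node.children_are_instances) {
      for (int i = 0; i < node.child_count; i++) {
        const int idx = node.first_child + i;
        if (sphere_in_view(instances[idx], planes)) visible.push_back(idx);
      }
    } else if (depth + 1 < kMaxDepth) {
      depth++;  // descend into the children
      stack[depth] = {node.first_child, node.child_count};
    }
  }
  return visible;
}
```
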
B0: ;; block 0: one-time setup
L57:

;; Function prologue
    daddiu sp, sp, -32
    sd ra, 0(sp)
    sq gp, 16(sp)

    lui t3, 28672                      ;; t3 = 0x70000000, the scratchpad
    lw v1, 4(a0)                       ;; v1 = (-> dma-buf base). we'll be writing DMA data here.
    lui t2, 4096                       ;; t2 = 0x10000000 (used later)
    lui t1, 4096                       ;; t1 = 0x10000000 (used later)

;; this does some data cache stuff. we don't have to worry about it.
    sync.l
    cache dxwbin v1, 0
    sync.l
    cache dxwbin v1, 1
    sync.l


    lw t0, *instance-shrub-work*(s7)  ;; t0 = instance-shrub-work. This stores many temporary variables.
    ori t5, t2, 54272                 ;; t5 = 0x1000D400 (DMA SPR_TO register)
    sw a0, 6524(t0)                   ;; stash dma-buf argument in instance-shrub-work.dma-buffer
    ori a0, t1, 53248                 ;; a0 = 0x1000D000 (DMA SPR_FROM register)
    lw t2, *wind-work*(s7)            ;; t2 = *wind-work*

;; note on crazy scratchpad stuff.
;; to get faster speed, it is useful to have both the input (instances) and output (DMA data) stored
;; in the scratchpad.  However, the scratchpad is not big enough to store everything.

;; they divide the scratchpad in 4:
;; 0-5200 is one "instance" buffer
;; 5200-10400 is the other "instance" buffer
;; 10400-12448 is one "out" buffer
;; 12448-end is the other "out" buffer.
;; This code reads instance data from one instance buffer and writes DMA data to one out buffer.
;; While this is happening, the SPR_TO/SPR_FROM channels will be copying the next instances to
;; the other instance buffer, and copying the output dma back into the dma-buf.
;; Once they are done, the buffers will be swapped. So there is continuous copying and processing.

;; I will use notation like spad.instance-buf and spad.out-buf to indicate the scratchpad buffers.
;; There are two instance buffers, and we don't have to really care which one they are using -
;; we can assume that they implemented double buffering properly.

    ori t1, t3, 10416                 ;; t1 = spad.out-buf (high buffer)
    sw r0, 6544(t0)                   ;; instance-work.chains = 0
    ;; Note on "stack"
    ;; this draw-node tree is... a tree.
    ;; this drawing function traverses the tree.
    ;; in order to traverse a tree, you need something like a stack.
    ;; the tree has a fixed max depth of 6
    ;; The node/length fields of the instance-shrub-work are this stack.
    ;; t4 is the "stack pointer". It points to instance-shrub-work + 4*depth.
    ;; Then you can access at the normal offsets of node/length to access the correct
    ;; slot for your stack frame.

    or t4, t0, r0                     ;; t4 = instance-work (the "stack pointer", starting at the base of the stack)
    lqc2 vf3, 6064(t0)                ;; vf3 = instance-work.constants (128, 1.0, 0.0, fog0)
    sw t5, 6412(t0)                   ;; instance-work.to-spr = 0x1000D400 (just stashing this here for later)
    ori t6, t3, 16                    ;; t6 = spad.instance-buf (low buffer)
    addiu t7, r0, 720                 ;; t7 = 720
    sw a3, 6476(t0)                   ;; instance-work.prototypes = the input inline array of prototypes
    addiu t3, r0, 0                   ;; t3 = 0
    sw a3, 6404(t0)                   ;; instance-work.bucket-ptr = the input inline array of prototypes
    addiu a3, r0, 0                   ;; a3 = 0
    sw a1, 6428(t4)                   ;; instance-work.node = the input draw node. (note, we're using t4 here)
    or t3, t1, r0                     ;; t3 = spad.out-buf
    sw a2, 6452(t4)                   ;; instance-work.length = the input length (num draw nodes at this level)
    addiu a1, r0, -1                  ;; a1 = -1
    sw t7, 6516(t0)                   ;; instance-work.current-shrub-near-packet = 720 (?) 
    daddiu t7, t0, 48                 ;; t7 = instance-work.chaina
    sw t6, 6408(t0)                   ;; instance-work.src-ptr = spad.instance-buf
    daddiu a2, t0, 176                ;; a2 = instance-work.chainb
    sw t6, 6388(t0)                   ;; instance-work.instance-ptr = spad.instance-buf
    daddiu t6, r0, -64                ;; t6 = -64
    sw t5, 6412(t0)                   ;; instance-work.to-spr = 0x1000D400 (oops, did it twice)
    ;; note on alignment.
    ;; the instance-shrub-work object is only 16-byte aligned.
    ;; but, for some reason, they want these chaina/chainb things to be 64 byte aligned.
    ;; they put a 48 byte "dummy" field before them, then AND the address with -64 to get aligned versions.
    ;; I'll call these aligned versions chaina-aligned/chainb-aligned
    and t5, t7, t6                    ;; t5 = chaina-aligned
    sw a0, 6416(t0)                   ;; instance-work.from-spr = 0x1000D000
    and a2, a2, t6                    ;; a2 = chainb-aligned
    sw t5, 6392(t0)                   ;; instance-work.chain-ptr = chaina-aligned
    addiu t5, r0, -1                  ;; t5 = -1
    sw a2, 6396(t0)                   ;; instance-work.chain-ptr-next = chainb-aligned
    sll r0, r0, 0                     ;; nop
    sw t4, 6400(t0)                   ;; instance-work.stack-ptr = t4 (right now, at base)
    sll r0, r0, 0                     ;; nop
    sw t5, 6540(t0)                   ;; instance-work.last-shrubs = -1
    sll r0, r0, 0                     ;; nop
    sw r0, 6548(t0)                   ;; instance-work.flags = 0
    sll r0, r0, 0                     ;; nop
    sw r0, 6560(t0)                   ;; instance-work.inst-count = 0
    sll r0, r0, 0                     ;; nop
    sw r0, 6556(t0)                   ;; instance-work.node-count = 0

;; Note on vcallms 17: this is a tiny program that loads vf registers.
;; plane0-plane3 are the culling planes (in normal world coordinates)
;; vf24-vf27 use the camera-rot matrix. This confusingly also includes the
;; translation, but does not include the projection matrix.
;; each vector is just the z component of that camera vector repeated 4 times
;; (it's computed in the vcallms 0 of background-upload-vu0)
;;  lq.xyzw vf16, 0(vi00)      |  nop      ;; plane0                
;;  lq.xyzw vf17, 1(vi00)      |  nop      ;; plane1                
;;  lq.xyzw vf18, 2(vi00)      |  nop      ;; plane2                
;;  lq.xyzw vf19, 3(vi00)      |  nop      ;; plane3                
;;  lq.xyzw vf24, 12(vi00)     |  nop      ;; [cam-rot0.z cam-rot0.z cam-rot0.z cam-rot0.z]
;;  lq.xyzw vf25, 13(vi00)     |  nop      ;; same but cam-rot1                  
;;  lq.xyzw vf26, 14(vi00)     |  nop :e                   
;;  lq.xyzw vf27, 15(vi00)     |  nop    
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
B1:
L58: ;; LOOP TOP. We reach here when we want to explore a new draw node.
    vcallms 17        ;; set up vf registers
    lw t4, 6400(t0)   ;; t4 = instance-work.stack-ptr
    addiu t5, r0, 7   ;; t5 = 7 (remaining instances in group. we find up to 7 visible instances)
    lw a2, 6392(t0)   ;; a2 = instance-work.chain-ptr
    sll r0, r0, 0     ;; nops, I guess to wait for the vu program?
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0

;; starting here, we're looking for a node that we can draw.
;; this is doing "sphere in view frustum" culling through the BVH tree
;; it will exit once it's found the next visible thing to draw.
;; the details here are:
;; - normal "can we see the sphere?" check
;; - also a distance from the camera check. If we fail that, skip.
;; - this builds DMA, but not drawing DMA. It builds DMA to upload the thing to the scratchpad.
;; once we find it, go to L63.
B2:
L59:
    dsubu t7, t4, t0 ;; t7 = stack-ptr - base: 0 at the root, positive when deeper, negative once we've popped past the root
    lw t6, 6452(t4)  ;; t6 = length at current stack frame
    bltz t7, L63     ;; popped past the root: nothing left to explore, go process what we've collected.
    lw t8, 6428(t4)  ;; t8 = node

;; if we get here, this stack frame is valid. We have no idea yet if its nodes are visible or not
    beq t6, r0, L62  ;; if no nodes, skip!
    lqc2 vf2, 12(t8) ;; vf2 = bsphere of the node

;; note that this code assumes we're deep enough to find instance-shrubs.
;; and sets up DMA to DMA them to the scratchpad for later processing.
;; but, we might have only found draw-nodes.
;; this is okay. The DMA we set up here will only be used if we actually find instance-shrubs.
;; we also set up the stack for more draw nodes. Again, it's okay because we'll only actually increment
;; the stack pointer if we find out that there are more levels.
;; the bsphere culling code for draw nodes/instances are identical, so that part
;; can be used in either case.
B4:
    sll r0, r0, 0                     ;; nop
    lqc2 vf6, -4(t8)                  ;; vf6.w = distance of the node. (other stuff is junk I think)
    vmulax.xyzw acc, vf16, vf2        ;; sphere in view frustum (will eventually put result in vf4)
    lbu t6, 3(t8)                     ;; t6 = node flags
    vmadday.xyzw acc, vf17, vf2       ;; sphere in view frustum
    lw t7, 4(t8)                      ;; t7 = node child
    vmaddaz.xyzw acc, vf18, vf2       ;; sphere in view frustum
    lbu t8, 2(t8)                     ;; t8 = node child count
    vmsubaw.xyzw acc, vf19, vf0       ;; sphere in view frustum
    lq t9, 6016(t0)                   ;; t9 = instance-work.dma-ref
    vmaddw.xyzw vf4, vf1, vf2         ;; sphere in view frustum (done!, vf4 now has signed distance from planes)
    sw t7, 6432(t4)                   ;; place child on stack
    vmulaw.xyzw acc, vf1, vf6         ;; acc = [dist, dist, dist, dist]
    sw t8, 6456(t4)                   ;; place child's length on stack
    vmsubax.xyzw acc, vf24, vf2       ;; dist calc (note, just for computing z)
    sq t9, 0(a2)                      ;; store dma-ref in chain-ptr
    vmsubay.xyzw acc, vf25, vf2       ;; more dist calc
    daddiu t9, t7, -4                 ;; t9 = node minus type tag
    vmsubaz.xyzw acc, vf26, vf2       ;; more dist calc
    sll t7, t8, 2                     ;; t7 = num children * 4
    qmfc2.i ra, vf4                   ;; ra = sphere/plane signed distances
    addu t7, t7, t8                   ;; t7 = num children * 5
    vmsubaw.xyzw acc, vf27, vf0       ;; more dist calc
    sw t9, 4(a2)                      ;; store address of draw nodes in the dma tag
    vmaddw.xyzw vf7, vf1, vf2         ;; finish dist calc
    sw t8, 8(a2)                      ;; stash the child count after the dma tag (space unused)
    pcgtw t8, r0, ra                  ;; check signed distance to planes
    lw t9, 6452(t4)                   ;; t9 = current stack length
    ppach ra, r0, t8                  ;; pack so signed distance compares are in lower 64
    lw t8, 6428(t4)                   ;; t8 = node
    bne ra, r0, L61                   ;; branch on reject
    sb t7, 0(a2)                      ;; store qwc in chain

;; if we reach here, we passed the sphere in view check
B5:
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu t7, t9, -1    ;; t7 = stack length - 1
    qmfc2.i t9, vf7      ;; t9 = dist check result
    daddiu t8, t8, 32    ;; advance to next node (assuming draw nodes)
    sll r0, r0, 0
    bltz t9, L61         ;; branch if failed dist check
    sll r0, r0, 0

B6:
    beq t6, r0, L60      ;; check if we actually reached the instances (0 = instances).
    sll r0, r0, 0        ;; 
B7:
    beq r0, r0, L59      ;; didn't reach instances. need to go deeper in tree!
    daddiu t4, t4, 4     ;; increase stack depth. branch will find visible things.

;; if we reach here:
;; - we've reached leaves (instances)
;; - the instance is visible
;; - we have a chain set up to DMA it to the scratchpad.
B8:
L60:
    daddiu a2, a2, 16  ;; advance dma building pointer (looks like we have room for up to 8)
    sw t7, 6452(t4)    ;; decrement stack length (we're done with this one)
    daddiu t5, t5, -1  ;; decrement instance count (counts down from 7, we can only do 7 in a group)
    sw t8, 6428(t4)    ;; increment node in stack
    blez t5, L63       ;; goto L63 if we're full for this group
    dsubu t6, t4, t0   ;; check if we're at the root still

B9:
    bgtz t7, L59       ;; not full, more at this level.
    sll r0, r0, 0

B10:
    blez t6, L63       ;; if we're at the root of the tree and the length is zero, we're done, draw what we have.
    daddiu t4, t4, -4  ;; "return" and decrement sp (go up a level, we finished exploring this one)

;; common "advance to next based on stack"
;; we might have to return multiple levels, and this loop here does this.
B11:
L61:
    sll r0, r0, 0
    lw t7, 6452(t4) ;; t7 = length
    sll r0, r0, 0
    lw t6, 6428(t4) ;; t6 = node
    daddiu t7, t7, -1 ;; dec
    dsubu t8, t4, t0 ;; depth check
    daddiu t6, t6, 32 ;; inc node
    sw t7, 6452(t4)   ;; store len
    bgtz t7, L59      ;; keep going if not done (break out of returning loop)
    sw t6, 6428(t4)   ;; store node

B12:
    blez t8, L63     ;; draw if we're at the end.
    sll r0, r0, 0

B13:
L62:
    beq r0, r0, L61 ;; reloop in the return loop
    daddiu t4, t4, -4 ;; ascend one level

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; DMA TO SPR
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; if we reach here, we've got a chain set up that will send visible instances to the SPR.
B14:
L63:
    sll r0, r0, 0     ;; nop
    sw t4, 6400(t0)   ;; store draw node stack pointer in instance-shrub-work
    sll r0, r0, 0     ;; nop
    lw t5, 6392(t0)   ;; t5 = instance-work.chain-ptr (the start of the visible instance chain we just made)
    sll r0, r0, 0     ;; nop
    lw t4, 6412(t0)   ;; t4 = instance-work.to-spr (EE DMA control register address)
    beq t5, a2, L66   ;; will be equal if we didn't have any DMA
    lq t5, 6032(t0)   ;; dma-end (an 'end packet)

;; if we get here, we actually have data to send

;; these two blocks just wait until any in-progress to-sprs finish.
;; every iteration of the loop increments the "wait-to-spr" counter
;; (they likely tuned this code to reduce waits by moving stuff around)
B15:
L64:
    lw t6, 0(t4)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi t6, t6, 256
    sll r0, r0, 0
    beq t6, r0, L65
    sll r0, r0, 0

B16:
    sll r0, r0, 0
    lw t6, 6568(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu t6, t6, 1
    sll r0, r0, 0
    sw t6, 6568(t0)
    beq r0, r0, L64
    sll r0, r0, 0

;; when we get here, there is no in-progress to-spr transfer

B17:
L65:
    sll r0, r0, 0        ;; nop
    lw t6, 6544(t0)      ;; t6 = instance-work.chains (just a counter of how many spad uploads we do)
    sll r0, r0, 0        ;; nop
    sq t5, 0(a2)         ;; store the end DMA tag (must go at the end of the DMA transfer)
    lw t5, 6392(t0)      ;; t5 = instance-work.chain-ptr (start of the DMA chain) 
    addiu a2, r0, 324    ;; a2 = 324 (constant to start DMA)
    lw t7, 6396(t0)      ;; t7 = instance-work.chain-ptr-next (to-spr chain dma mem is double buffered)
    ori t8, r0, 65535    ;; t8 = 65535
    sw t5, 6396(t0)      ;; instance-work.chain-ptr-next = chain-ptr (swap!)
    daddiu t6, t6, 1     ;; increment chain count
    sw t7, 6392(t0)      ;; instance-work.chain-ptr = chain-ptr-next (swap!)
    or t7, t5, r0        ;; t7 = chain for next time
    sll r0, r0, 0        ;; nop
    sw t6, 6544(t0)      ;; write back incremented chain count
    sll r0, r0, 0        ;; nop
    lw t6, 6388(t0)      ;; t6 = instance-work.instance-ptr (the scratchpad destination for the instance)
    sync.l
    cache dxwbin t7, 0   ;; write back the data (required before DMAing, EE DMA bypasses CPU caches)
    sync.l
    cache dxwbin t7, 1
    sync.l
    daddiu t7, t7, 64
    sync.l
    cache dxwbin t7, 0
    sync.l
    cache dxwbin t7, 1
    sync.l
    sw t6, 128(t4)      ;; set up destination addr in DMA register
    sw t5, 48(t4)       ;; set up source addr
    xori t5, t6, 5232   ;; toggle destination pointer (scratchpad destinations are double buffered)
    sw r0, 32(t4)       ;; set qwc = 0 (I think it's ignored in chain mode)
    sync.l
    sw a2, 0(t4)        ;; start transfer!
    sync.l
    sll r0, r0, 0
    sw t5, 6408(t0) ;; store instance-work.src-ptr
    beq r0, r0, L68 ;; always go to L68!
    sw t5, 6388(t0) ;; store instance-work.instance-ptr (starting a new block, so equal to src-ptr)

;; if we reach here, it's because we didn't have any more visible instances.
;; we have two cases:
;; 1). we have stuff in scratchpad (the other buffer) waiting to be drawn.
;; 2). nothing was visible, so we have nothing in scratchpad.
;; we can tell these two cases from the sign of the a1 flag.
B18:
L66:
    bltz a1, L98    ;; goto end (L98) if the flag is negative
    lw a2, 6388(t0) ;; a2 = instance-work.instance-ptr.

B19:
    sll r0, r0, 0
    sw r0, 6540(t0)  ;; instance-work.last-shrubs = 0
    sll r0, r0, 0
    xori a2, a2, 5232  ;; flip spad buffer (the last group isn't double buffered)
    sll r0, r0, 0
    sw a2, 6408(t0)    ;; store src-ptr
    sll r0, r0, 0
    sw a2, 6388(t0)    ;; store instance-ptr

;; dma sync - make sure the last to-spr is done.
B20:
L67:
    lw a2, 0(t4)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi a2, a2, 256
    sll r0, r0, 0
    beq a2, r0, L68
    sll r0, r0, 0

B21:
    sll r0, r0, 0
    lw a2, 6568(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu a2, a2, 1
    sll r0, r0, 0
    sw a2, 6568(t0)
    beq r0, r0, L67
    sll r0, r0, 0

;; the details of the from-spr are unknown, but it seems like setting a1 flag > 0 is used to indicate
;; that we have some pending stuff in spad that we have to copy back.
B22:
L68:
    bgez a1, L93    ;; if we have stuff, go to some later spad dma code
    lw a2, 6408(t0) ;; a2 = instance-work.src-ptr

B23:
    beq r0, r0, L58       ;; nope, we're done, go to loop top
    addiu a1, r0, 10000   ;; but, remember we just did a dma sync for to. So we do have more work to do.
                          ;; ideally we'll find more visible stuff and add to what we have now.
                          ;; but if we don't, we set this flag to >0 to indicate that we have
                          ;; stuff that we still need to process.

;; we reach here once we have visible instances in the scratchpad.
;; but, before we can process them, we have to make sure the output buffer
;; in the scratchpad has enough room.
;; If not, we do a DMA transfer back to RAM (to the dma-buf passed in)
;; this is copying completed VU1 DMA data.
B24:
L69:
    daddiu t4, a3, -106  ;; a3 counts quadwords already in the out buf; flush once it exceeds 106
    lqc2 vf2, 16(a2)     ;; vf2 = bsphere of the first instance (they start prepping for the instance loop here...)
    blez t4, L72         ;; goto L72 if we have enough room in spr
    lbu t4, 6(a2)        ;; t4 = instance.bucket-index (loaded as a u8, maybe only up to 255 buckets/tree?)

;; next three blocks wait for from-spr to finish. Need to do this before
;; starting the next from-spr transfer
B25:
    sll r0, r0, 0
    lw a0, 6416(t0)
    sll r0, r0, 0
    sll r0, r0, 0
B26:
L70:
    lw t3, 0(a0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi t3, t3, 256
    sll r0, r0, 0
    beq t3, r0, L71
    sll r0, r0, 0

B27:
    sll r0, r0, 0
    lw t3, 6564(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu t3, t3, 1
    sll r0, r0, 0
    sw t3, 6564(t0)
    beq r0, r0, L70
    sll r0, r0, 0

;; start from-spr and swap output data buffers
B28:
L71:
    sw t1, 128(a0)
    xori t1, t1, 6144 ;; swap buffer
    sw v1, 16(a0)
    sll t3, a3, 4     ;; t3 = a3 * 16 (size in bytes; a3 is a quadword count, it is stored as the qwc below)
    addu v1, v1, t3   ;; v1 is the next dma-buf output address (maybe needed for refs in upcoming DMA build)
    or t3, t1, r0
    sw a3, 32(a0)
    addiu a3, r0, 256
    sw a3, 0(a0)      ;; start!
    addiu a3, r0, 0   ;; reset count
   

;; if we reach here, we're finally ready to process the instance.
;; one cool trick they do here is to build 
B29:
L72:
    vcallms 33        ;; see backround-vu0-result.txt. This program does the sphere in view and distance checks.
                      ;; the result is stored in vf04/vf06 and vi02
    lw t5, 6548(t0)   ;; t5 = instance-work.flags (was initialized to 0)
    beq a1, t4, L74   ;; if we're using the same prototype as last time, skip ahead a bit.
    daddiu t6, a1, -10000

B30:
    beq t6, r0, L73
    lw a1, 6404(t0)

B31:                ;; I think this only runs on the very first run.
    sll r0, r0, 0   ;; it copies the last/next/counts of instance-work to the first thing in the proto bucket array
    lq t5, 6336(t0)
    sll r0, r0, 0
    lq t6, 6352(t0)
    sll r0, r0, 0
    lq t7, 6368(t0)
    sll r0, r0, 0
    sq t5, 92(a1)
    sll r0, r0, 0
    sq t6, 60(a1)
    sll r0, r0, 0
    sq t7, 76(a1)
B32:
L73:
    or a1, t4, r0      ;; a1 = current prototype idx (remember it for next time)
    lw t5, 6476(t0)    ;; t5 = prototypes array
    addiu t6, r0, 112  ;; t6 = 112 
    sq r0, 6336(t0)    ;; work.lasts = 0
    multu3 t4, t4, t6  ;; multiply for array access
    sq r0, 6352(t0)    ;; work.nexts = 0
    daddu t4, t5, t4   ;; t4 = ptr to bucket
    sq r0, 6368(t0)    ;; work.counts = 0
    sll r0, r0, 0      ;; nop
    sw t4, 6404(t0)    ;; store bucket in work.bucket-ptr
    sll r0, r0, 0      ;; nop
    lw t5, 4(t4)       ;; t5 = bucket flags
    sll r0, r0, 0      ;; nop
    lqc2 vf15, 44(t4)  ;; vf15 = lengths
    andi t5, t5, 1     ;; t5 = flag & 1
    lqc2 vf14, 28(t4)  ;; vf14 = near/mid/far plane
    vmul.xyz vf15, vf15, vf3 ;; vf15 = lengths * some constants?
    sw t5, 6548(t0)    ;; store flags in instance-work.flags

;; from here on, it looks like we jump to L92 if we reject the instance
;; NOTE: starting here is the matrix stuff.
;; we'll need to understand this to "de-instance" the non-wind instances
;; and to implement wind in C++
B33:
L74:
    bne t5, r0, L92  ;; check flags & 1. This flag is only set from the debug menu (see dm-enable-instance-func)
                     ;; and it's just used to disable a specific prototype for debugging.
    ld t5, 56(a2)    ;; loading the origin matrix (4x 16-bit integers/row) (this is the last row)

B34:
    sll r0, r0, 0
    ld t4, 32(a2)      ;; t4 = row 0
    pextlh t5, t5, r0  ;; unpack row 3 to u32's (effectively shifts left 16)
    ld t6, 40(a2)      ;; t6 = row 1
    psraw t7, t5, 10   ;; t7 = shift row 3 right by 10 (two shifts equivalent to shift left by 6 and sign extend)
    ld t5, 48(a2)      ;; t5 = row 2
    pextlh t8, t4, r0  ;; t8 = row 0 to u32's
    lhu t4, 8(a2)      ;; t4 = instance.color-indices (I think an offset in the tree's palette, different from TIE)
    psraw t8, t8, 16   ;; t8 = shift row 0 right by 16 (two shifts equivalent to just sign extending)
    lq t9, 64(a2)      ;; t9 = instance.flat-normal
    pextlh t6, t6, r0  ;; t6 = row 1 unpacked
    qmtc2.ni vf13, t7  ;; vf13 = row 3
    psraw t6, t6, 16   ;; t6 = row 1 shifted
    qmtc2.ni vf18, t9  ;; vf18 = instance.flat-normal
    pextlh t5, t5, r0  ;; t5 = row 2 unpacked
    qmtc2.ni vf10, t8  ;; vf10 = row 0
    psraw t5, t5, 16   ;; t5 = row 2 shifted
    qmtc2.ni vf11, t6  ;; vf11 = row 1
    daddu t4, t4, t0   ;; t4 = color data - 304
    qmtc2.ni vf12, t5  ;; vf12  = row 2
    sll r0, r0, 0
    cfc2.i t5, vi1         ;; t5 = vis result.
    vitof0.xyzw vf13, vf13 ;; vf13 = row 3, as floats
    lw t6, 304(t4)     ;; t6 = rgba for this instance (8888 format)
    bne t5, r0, L92    ;; possibly reject this instance.
    lq t4, 6080(t0)    ;; t4 = color constants (some hacky int to float stuff here)

B35:
    pextlb t5, r0, t6        ;; t5 = unpacked rgba to u16's
    lqc2 vf4, 6096(t0)       ;; vf4 = hmge-d
    pextlh t5, r0, t5        ;; t5 = unpacked rgba to u32's
    lqc2 vf25, 6176(t0)      ;; vf25 = min-dist (interesting...)
    vsub.xyzw vf9, vf6, vf14 ;; vf6 is the "dist" of the draw node?
    sll r0, r0, 0
    psllw t6, t5, 8          ;; t6 = multiply colors by 256
    mfc1 r0, f31
    paddw t4, t6, t4         ;; t4 = colors + color constants
    mfc1 r0, f31
    vmula.xyzw acc, vf1, vf3 ;; 
    sll r0, r0, 0
    vmsub.xyzw vf9, vf9, vf15
    sq t5, 6160(t0)          ;; stash bb color
    vadd.xyz vf13, vf13, vf2 ;; same bsphere origin trick as tie
    sq t4, 6144(t0)          ;; store floating point color
    vsubw.xyzw vf8, vf6, vf2 ;; distance compensate for bsphere radius
    sll r0, r0, 0
    vitof12.xyzw vf10, vf10  ;; row 0 as floats
    sll r0, r0, 0
    vmini.xyzw vf9, vf9, vf3    ;; dist crap
    lw t4, 6404(t0)             ;; t4 = bucket-ptr
    vadd.xyz vf18, vf18, vf13   ;; flat-normal + real-origin
    sll r0, r0, 0
    vmulax.xyzw acc, vf28, vf13 ;; 
    lw t4, 24(t4)               ;; geom3
    vmadday.xyzw acc, vf29, vf13
    sll r0, r0, 0
    vmaxx.xyzw vf9, vf9, vf0
    sll r0, r0, 0
    vmaddaz.xyzw acc, vf30, vf13
    sll r0, r0, 0
    vmaddw.xyzw vf5, vf31, vf0 ;; vf5.w is inverse distance from camera, I think
    sll r0, r0, 0
    vitof12.xyzw vf11, vf11    ;; vf11 = row 1 floats
    sll r0, r0, 0
    vftoi0.xyzw vf19, vf9      ;; distance stuff
    sll r0, r0, 0
    vmini.xyzw vf25, vf8, vf25 ;; apply min dist
    sll r0, r0, 0
    vsubz.xyzw vf4, vf8, vf4   ;; apply hmge
    addiu t5, r0, 128          ;; ?? t5 = 128
    vitof12.xyzw vf12, vf12    ;; vf12 = row 2 float
    addiu t6, r0, 255          ;; ?? t6 = 255
    vmulw.y vf9, vf9, vf15     ;; multiply by lengths
    sll r0, r0, 0
    sll r0, r0, 0
    qmfc2.i t7, vf19           ;; integer dist compare
    vdiv Q, vf3.w, vf5.w       ;; compute Q here, I guess
    sll r0, r0, 0
    and t6, t7, t6
    sll r0, r0, 0
    dsubu t7, t5, t6
    sw t6, 6156(t0)           ;; adjusted color for fade out.
    beq t5, t6, L80           ;; branch if don't try billboard, I think?
    sqc2 vf25, 6176(t0)

B36:
    beq t4, r0, L75           ;; don't do billboard if we don't have it
    sw t7, 6172(t0)

B37:
;;;;;;;;;;;;;;;
;; BILLBOARD
;;;;;;;;;;;;;;;
    vmulax.xyzw acc, vf28, vf18
    lq t4, 5104(t0)
    vmadday.xyzw acc, vf29, vf18
    lq t5, 5120(t0)
    vmaddaz.xyzw acc, vf30, vf18
    lw t6, 6348(t0)
    vmaddw.xyzw vf18, vf31, vf0
    lw t7, 6364(t0)
    sll t8, a3, 4
    lqc2 vf8, 6112(t0)
    addu t8, t8, v1
    lqc2 vf7, 64(a2)
    vmulaq.xyz acc, vf5, Q
    lq a2, 6160(t0)
    vmulaw.w acc, vf5, vf0
    movz t6, t8, t6
    vmadd.xyzw vf5, vf1, vf8
    lhu t9, 6374(t0)
    vmulq.w vf19, vf7, Q
    sll r0, r0, 0
    daddiu t9, t9, 1
    lqc2 vf6, 5136(t0)
    vmulq.xyzw vf26, vf1, Q
    sw t6, 6348(t0)
    vmulq.xyzw vf27, vf1, Q
    sw t8, 6364(t0)
    vnop
    sll r0, r0, 0
    vmaxz.w vf5, vf5, vf6
    sh t9, 6374(t0)
    vdiv Q, vf3.w, vf18.w
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf10
    sq t4, 0(t3)
    vaddx.x vf26, vf0, vf0
    sq t5, 16(t3)
    vminiw.w vf5, vf5, vf6
    sq a2, 48(t3)
    vmadday.xyzw acc, vf21, vf10
    sq a2, 96(t3)
    vmaddz.xyzw vf10, vf22, vf10
    sq a2, 144(t3)
    vmulaw.w acc, vf18, vf0
    sq a2, 192(t3)
    vmulaq.xyz acc, vf18, Q
    sw t7, 4(t3)
    vmadd.xyzw vf18, vf1, vf8
    sll r0, r0, 0
    vmulq.w vf8, vf7, Q
    sll r0, r0, 0
    vmulq.xyzw vf24, vf1, Q
    sll r0, r0, 0
    vmulq.xyzw vf25, vf1, Q
    sll r0, r0, 0
    vmaxz.w vf18, vf18, vf6
    sll r0, r0, 0
    vadd.xy vf24, vf0, vf0
    sll r0, r0, 0
    vaddy.y vf25, vf0, vf0
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf11
    sll r0, r0, 0
    vminiw.w vf18, vf18, vf6
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf11
    sll r0, r0, 0
    vmaddz.xyzw vf11, vf22, vf11
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf12
    sll r0, r0, 0
    vsub.xyzw vf16, vf18, vf5
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf12
    sll r0, r0, 0
    vmaddz.xyzw vf12, vf22, vf12
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf13
    sll r0, r0, 0
    vaddy.y vf16, vf16, vf16
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf13
    sll r0, r0, 0
    vmaddaz.xyzw acc, vf22, vf13
    sll r0, r0, 0
    vmaddw.xyzw vf13, vf23, vf0
    sll r0, r0, 0
    vmul.xy vf17, vf16, vf16
    sll r0, r0, 0
    sll r0, r0, 0
    sqc2 vf24, 32(t3)
    sll r0, r0, 0
    sqc2 vf25, 80(t3)
    sll r0, r0, 0
    sqc2 vf26, 128(t3)
    vaddy.x vf17, vf17, vf17
    sll r0, r0, 0
    sll r0, r0, 0
    sqc2 vf27, 176(t3)
    vmulw.xyzw vf2, vf18, vf0
    sll r0, r0, 0
    vmulw.xyzw vf4, vf18, vf0
    sll r0, r0, 0
    vrsqrt Q, vf0.w, vf17.x
    sll r0, r0, 0
    sll r0, r0, 0
    vwaitq
    vmulq.xy vf17, vf16, Q
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    vsuby.x vf16, vf0, vf17
    sll r0, r0, 0
    vaddx.y vf16, vf0, vf17
    sll r0, r0, 0
    sll r0, r0, 0
    sqc2 vf10, 240(t3)
    sll r0, r0, 0
    sqc2 vf11, 256(t3)
    vmulw.xy vf8, vf16, vf8
    sll r0, r0, 0
    vmulw.xy vf19, vf16, vf19
    sll r0, r0, 0
    sll r0, r0, 0
    lq a2, 6144(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    vmul.xy vf8, vf8, vf6
    sll r0, r0, 0
    vmul.xy vf19, vf19, vf6
    sll r0, r0, 0
    vmulw.xyzw vf6, vf5, vf0
    sll r0, r0, 0
    vmulw.xyzw vf7, vf5, vf0
    sq a2, 304(t3)
    vadd.xy vf2, vf18, vf8
    sll r0, r0, 0
    vsub.xy vf4, vf18, vf8
    sll r0, r0, 0
    vadd.xy vf6, vf5, vf19
    sll r0, r0, 0
    vsub.xy vf7, vf5, vf19
    sll r0, r0, 0
    vftoi4.xyzw vf2, vf2
    sll r0, r0, 0
    vftoi4.xyzw vf4, vf4
    daddiu t3, t3, 224
    vftoi4.xyzw vf6, vf6
    daddiu a3, a3, 14
    vftoi4.xyzw vf7, vf7
    lw a2, 6156(t0)
    sll r0, r0, 0
    sqc2 vf2, -160(t3)
    sll r0, r0, 0
    sqc2 vf4, -112(t3)
    sll r0, r0, 0
    sqc2 vf6, -64(t3)
    beq a2, r0, L92
    sqc2 vf7, -16(t3)

B38:
    beq r0, r0, L76
    sll r0, r0, 0

B39:
L75:
    beq t6, r0, L92
    vmulax.xyzw acc, vf20, vf10

B40:
    vmadday.xyzw acc, vf21, vf10
    lq a2, 6144(t0)
    vmaddz.xyzw vf10, vf22, vf10
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf11
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf11
    sll r0, r0, 0
    vmaddz.xyzw vf11, vf22, vf11
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf12
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf12
    sll r0, r0, 0
    vmaddz.xyzw vf12, vf22, vf12
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf13
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf13
    sll r0, r0, 0
    vmaddaz.xyzw acc, vf22, vf13
    sll r0, r0, 0
    vmaddw.xyzw vf13, vf23, vf0
    sq a2, 80(t3)
    sll r0, r0, 0
    sqc2 vf10, 16(t3)
    sll r0, r0, 0
    sqc2 vf11, 32(t3)
B41:
L76:
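    sll a2, a3, 4
    lhu t4, 6380(t0)    ;; t4 = current index (wraps at 20, see below)
    addu t5, a2, v1     ;; t5 = v1 + (a3 * 16), where this block will land, I think
    lhu t7, 6372(t0)
    sll t6, t4, 4
    lw a2, 6360(t0)
    daddu t8, t6, t0
    lw t6, 6344(t0)
    daddiu t7, t7, 1
    lq t8, 4400(t8)     ;; t8 = quadword at 4400 + (index * 16)
    daddiu a3, a3, 6    ;; this block is 6 quadwords
    sh t7, 6372(t0)     ;; increment the counter at 6372
    daddiu t7, t4, 1
    sq t8, 0(t3)
    daddiu t8, t7, -20
    sqc2 vf12, 48(t3)
    movz t7, r0, t8     ;; wrap the index back to 0 when it reaches 20
    sqc2 vf13, 64(t3)
    daddiu t8, t4, -10
    sh t7, 6380(t0)
    daddiu t3, t3, 96   ;; advance output by 96 bytes (6 quadwords)
    sw a2, -92(t3)      ;; word 1 of the block just written = old 6360(t0)
    beq t4, r0, L77     ;; index 0 goes to L77 (index 10 does too, via B42)
    sw t5, 6360(t0)     ;; 6360(t0) = t5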

B42:
    bne t8, r0, L78
    sll r0, r0, 0

B43:
L77:
    sll r0, r0, 0
    lq t4, 5040(t0)
    sll r0, r0, 0
    lq t7, 5056(t0)
    sll r0, r0, 0
    sw t5, 6344(t0)
    sll r0, r0, 0
    movz t4, t7, t6
    daddiu a3, a3, 1
    sq t4, 0(t3)
    sll r0, r0, 0
    sw a2, 4(t3)
    beq r0, r0, L92
    daddiu t3, t3, 16

B44:
L78:
    daddiu t5, t4, -9
    sll r0, r0, 0
    beq t5, r0, L79     ;; index 9 goes to L79
    daddiu t4, t4, -19

B45:
    bne t4, r0, L92     ;; so does index 19, everything else is done
    sll r0, r0, 0

B46:
L79:
    sll r0, r0, 0
    sll t4, t7, 4
    sll r0, r0, 0
    daddu t4, t4, t0
    daddiu a3, a3, 1
    lq t4, 4720(t4)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sq t4, 0(t3)
    sll r0, r0, 0
    sw a2, 4(t3)
    beq r0, r0, L92
    daddiu t3, t3, 16


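;; I think this is the end of billboard.
;; L80 is the target of the "don't try billboard" branch above and starts with the wind update.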
B47:
L80:
    sll r0, r0, 0
    lw t4, 1324(t2) ;; t4 = wind time (from global wind work)
    sll r0, r0, 0
    lhu t5, 62(a2)  ;; t5 = wind-index of the instance
    sll r0, r0, 0
    lw a2, 6384(t0)     ;; a2 = wind-vectors
    dsll t6, t5, 4      ;; t6 = t5 * 16
    lqc2 vf19, 6048(t0) ;; vf19 = wind-const
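    daddu a2, a2, t6    ;; a2 = wind-vectors + (wind-index * 16)
    daddu t4, t5, t4    ;; t4 = wind-time + wind-index
    andi t5, t4, 63     ;; t5 = (wind-time + wind-index) & 63
    ld t4, 8(a2)        ;; t4 = upper 64 bits of this instance's wind vector
    sll t6, t5, 4       ;; t6 = ((wind-time + wind-index) & 63) * 16
    ld t5, 0(a2)        ;; t5 = lower 64 bits of this instance's wind vector
    addu t7, t6, t2     ;; t7 = wind work + t6
    qmfc2.i t6, vf4
    pextlw t4, r0, t4   ;; spread the two words of t4 into x and z (y, w = 0)
    lqc2 vf16, 12(t7)   ;; vf16 = entry t5 of a 64-entry table in wind work (probably)
    pextlw t5, r0, t5   ;; same for t5
    qmtc2.i vf18, t4
    sll r0, r0, 0
    qmtc2.i vf17, t5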
    vmula.xyzw acc, vf16, vf1
    sll r0, r0, 0
    vmsubax.xyzw acc, vf18, vf19
    sll r0, r0, 0
    vmsuby.xyzw vf16, vf17, vf19
    sll r0, r0, 0
    pcgtw t5, r0, t6
    mfc1 r0, f31
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    lqc2 vf24, 6208(t0)
    vmulaz.xyzw acc, vf16, vf19
    sll r0, r0, 0
    vmadd.xyzw vf18, vf1, vf18
    sll r0, r0, 0
    sll r0, r0, 0
    lqc2 vf25, 6224(t0)
    sll r0, r0, 0
    lqc2 vf26, 6240(t0)
    sll r0, r0, 0
    lqc2 vf27, 6256(t0)
    vmulaz.xyzw acc, vf18, vf19
    sll r0, r0, 0
    vmadd.xyzw vf17, vf17, vf1
    sll r0, r0, 0
    vmulax.xyzw acc, vf24, vf2
    sll r0, r0, 0
    vmadday.xyzw acc, vf25, vf2
    sll r0, r0, 0
    vmaddaz.xyzw acc, vf26, vf2
    sll r0, r0, 0
    vminiw.xyzw vf17, vf17, vf0
    sll r0, r0, 0
    vmsubaw.xyzw acc, vf27, vf0
    sll r0, r0, 0
    vmsubw.xyzw vf24, vf1, vf2
    sll r0, r0, 0
    sll r0, r0, 0
    qmfc2.i t4, vf18
    vmaxw.xyzw vf27, vf17, vf19
    sll r0, r0, 0
    ppacw t4, r0, t4    ;; pack x and z of vf18 into the lower 64 bits
    mfc1 r0, f31
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    qmfc2.i t6, vf24
    vmuly.xyzw vf27, vf27, vf9
    sll r0, r0, 0
    pcgtw t6, r0, t6
    mfc1 r0, f31
    ppach t6, r0, t6
    mfc1 r0, f31
    vmulax.yw acc, vf0, vf0
    sll r0, r0, 0
    vmulay.xz acc, vf27, vf10
    sll r0, r0, 0
    vmadd.xyzw vf10, vf1, vf10
    sll r0, r0, 0
    or t5, t6, t5
    qmfc2.i t6, vf27
    vmulax.yw acc, vf0, vf0
    lw t7, 6552(t0)
    vmulay.xz acc, vf27, vf11
    sll r0, r0, 0
    vmadd.xyzw vf11, vf1, vf11
    sll r0, r0, 0
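    bne t7, s7, L81     ;; if 6552(t0) is not #f (s7), skip the wind write-back
    ppacw t6, r0, t6    ;; pack x and z of vf27 into the lower 64 bits

B48:
    vmulax.yw acc, vf0, vf0
    sd t4, 8(a2)        ;; store updated wind vector (upper 64 bits)
    vmulay.xz acc, vf27, vf12
    sd t6, 0(a2)        ;; store updated wind vector (lower 64 bits)
    bne t5, r0, L86     ;; nonzero flags: go to L86 (the generic path, I think)
    vmadd.xyzw vf12, vf1, vf12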

B49:
    beq r0, r0, L82
    sll r0, r0, 0

B50:
L81:
    vmulax.yw acc, vf0, vf0
    sll r0, r0, 0
    vmulay.xz acc, vf27, vf12
    sll r0, r0, 0
    bne t5, r0, L86
    vmadd.xyzw vf12, vf1, vf12

B51:
L82:
    vmulax.xyzw acc, vf20, vf10
    lq a2, 6144(t0)
    vmadday.xyzw acc, vf21, vf10
    sll r0, r0, 0
    vmaddz.xyzw vf10, vf22, vf10
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf11
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf11
    sll r0, r0, 0
    vmaddz.xyzw vf11, vf22, vf11
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf12
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf12
    sll r0, r0, 0
    vmaddz.xyzw vf12, vf22, vf12
    sll r0, r0, 0
    vmulax.xyzw acc, vf20, vf13
    sll r0, r0, 0
    vmadday.xyzw acc, vf21, vf13
    sll r0, r0, 0
    vmaddaz.xyzw acc, vf22, vf13
    sll r0, r0, 0
    vmaddw.xyzw vf13, vf23, vf0
    sq a2, 80(t3)
    sll r0, r0, 0
    sqc2 vf10, 16(t3)
    sll r0, r0, 0
    sqc2 vf11, 32(t3)
    ;; same as L76 above, but using the state at 6340/6356/6370/6378
    ;; instead of 6344/6360/6372/6380
    sll a2, a3, 4
    lhu t4, 6378(t0)
    addu t5, a2, v1
    lhu t7, 6370(t0)
    sll t6, t4, 4
    lw a2, 6356(t0)
    daddu t8, t6, t0
    lw t6, 6340(t0)
    daddiu t7, t7, 1
    lq t8, 4400(t8)
    daddiu a3, a3, 6
    sh t7, 6370(t0)
    daddiu t7, t4, 1
    sq t8, 0(t3)
    daddiu t8, t7, -20
    sqc2 vf12, 48(t3)
    movz t7, r0, t8
    sqc2 vf13, 64(t3)
    daddiu t8, t4, -10
    sh t7, 6378(t0)
    daddiu t3, t3, 96
    sw a2, -92(t3)
    beq t4, r0, L83
    sw t5, 6356(t0)

B52:
    bne t8, r0, L84
    sll r0, r0, 0

B53:
L83:
    sll r0, r0, 0
    lq t4, 5040(t0)
    sll r0, r0, 0
    lq t7, 5056(t0)
    sll r0, r0, 0
    sw t5, 6340(t0)
    sll r0, r0, 0
    movz t4, t7, t6
    daddiu a3, a3, 1
    sq t4, 0(t3)
    sll r0, r0, 0
    sw a2, 4(t3)
    beq r0, r0, L92
    daddiu t3, t3, 16

B54:
L84:
    daddiu t5, t4, -9
    sll r0, r0, 0
    beq t5, r0, L85
    daddiu t4, t4, -19

B55:
    bne t4, r0, L92
    sll r0, r0, 0

B56:
L85:
    sll r0, r0, 0
    sll t4, t7, 4
    sll r0, r0, 0
    daddu t4, t4, t0
    daddiu a3, a3, 1
    lq t4, 4720(t4)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sq t4, 0(t3)
    sll r0, r0, 0
    sw a2, 4(t3)
    beq r0, r0, L92
    daddiu t3, t3, 16

B57:
L86:
    vmulax.xyzw acc, vf28, vf10
    lqc2 vf24, 6160(t0)
    vmadday.xyzw acc, vf29, vf10
    sll r0, r0, 0
    vmaddz.xyzw vf10, vf30, vf10
    sll r0, r0, 0
    vmulax.xyzw acc, vf28, vf11
    sll r0, r0, 0
    vmadday.xyzw acc, vf29, vf11
    lhu t4, 6536(t0)
    vmaddz.xyzw vf11, vf30, vf11
    lw a2, 6404(t0)
    vmulax.xyzw acc, vf28, vf12
    daddiu t8, t4, 1
    vmadday.xyzw acc, vf29, vf12
    sh t8, 6536(t0)
    vmaddz.xyzw vf12, vf30, vf12
    lw t4, 12(a2) ;; load the generic geometry?
    vmulax.xyzw acc, vf28, vf13
    lw t5, 6532(t0)
    vmadday.xyzw acc, vf29, vf13
    lh t6, 2(t4)                    ;; generic frag count.
    vmaddaz.xyzw acc, vf30, vf13
    lw a2, 6528(t0)
    vmaddw.xyzw vf13, vf31, vf0
    lw t7, 6516(t0)
    vitof0.xyz vf24, vf24
    sh t8, 6368(t0)
B58:                           ;; generic loop
L87:
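    daddiu t8, a3, -115
    sll r0, r0, 0
    blez t8, L90              ;; only flush the buffer once a3 is over 115
    lw t8, 28(t4)             ;; load the frag

B59:                          ;; dma (a0 looks like a DMA channel: CHCR at 0, MADR at 16, QWC at 32, SADR at 128)
L88:
    lw t3, 0(a0)              ;; read CHCR
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi t3, t3, 256          ;; STR bit: is the previous transfer still running?
    sll r0, r0, 0
    beq t3, r0, L89
    sll r0, r0, 0

B60:
    sll r0, r0, 0
    lw t3, 6564(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu t3, t3, 1          ;; count iterations spent waiting
    sll r0, r0, 0
    sw t3, 6564(t0)
    beq r0, r0, L88
    sll r0, r0, 0

B61:
L89:
    sw t1, 128(a0)            ;; SADR = current buffer
    xori t1, t1, 6144         ;; switch to the other buffer
    sw v1, 16(a0)             ;; MADR = v1
    sll t3, a3, 4
    addu v1, v1, t3           ;; v1 += a3 * 16
    or t3, t1, r0             ;; t3 = start of the new buffer
    sw a3, 32(a0)             ;; QWC = a3
    addiu a3, r0, 256
    sw a3, 0(a0)              ;; CHCR = 256 (STR), start the transfer
    addiu a3, r0, 0           ;; reset the quadword count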
B62:
L90:
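    daddu t9, t7, t0
    addiu t7, t7, -144        ;; step back by 144 bytes for next time
    daddiu t4, t4, 4          ;; next frag pointer
    daddiu t9, t9, 5152       ;; t9 = t0 + 5152 + (old t7)
    bgez t7, L91
    lq ra, 0(t9)

B63:
    sll r0, r0, 0
    addiu t7, r0, 720         ;; went below 0, wrap around to 720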
B64:
L91:
    sll r0, r0, 0
    sw t5, 84(t9)
    sll t5, a3, 4
    sq ra, 0(t3)
    addu t5, t5, v1
    sqc2 vf10, 16(t3)
    movz a2, t5, a2
    sqc2 vf11, 32(t3)
    daddiu a3, a3, 12
    sqc2 vf12, 48(t3)
    sll r0, r0, 0
    lw ra, 4(t8)            ;; ra = vtx-cnt
    sll r0, r0, 0
    sqc2 vf13, 64(t3)
    sll r0, r0, 0
    sqc2 vf24, 80(t3)
    sll r0, r0, 0
    sw ra, 96(t3)
    sll r0, r0, 0
    lw ra, 12(t8)           ;; ra = cnt
    sll r0, r0, 0
    lbu gp, 8(t8)           ;; gp = cnt-qwc
    sll r0, r0, 0
    sw ra, 20(t9)
    sll r0, r0, 0
    sb gp, 16(t9)
    sll r0, r0, 0
    sb gp, 30(t9)
    sll r0, r0, 0
    lw ra, 24(t8)          ;; ra = stq
    sll r0, r0, 0
    lbu gp, 11(t8)         ;; gp = stq-qwc
    sll r0, r0, 0
    sw ra, 36(t9)
    sll r0, r0, 0
    sb gp, 32(t9)
    sll r0, r0, 0
    lw ra, 20(t8)          ;; ra = col
    sll r0, r0, 0
    lbu gp, 10(t8)         ;; gp = col-qwc
    sll r0, r0, 0
    sw ra, 52(t9)
    sll r0, r0, 0
    sb gp, 48(t9)
    sll r0, r0, 0
    lw ra, 16(t8)          ;; ra = vtx
    sll r0, r0, 0
    lbu gp, 9(t8)          ;; gp = vtx-qwc
    sll r0, r0, 0
    sw ra, 68(t9)
    sll r0, r0, 0
    sb gp, 64(t9)
    sll r0, r0, 0
    lw t8, 4(t8)           ;; t8 = vtx-cnt (overwrites the frag pointer)
    sll r0, r0, 0
    lq ra, 16(t9)
    sll r0, r0, 0
    sb t8, 46(t9)
    sll r0, r0, 0
    sb t8, 62(t9)
    sll r0, r0, 0
    sb t8, 78(t9)
    sll r0, r0, 0
    sq ra, 112(t3)
    sll r0, r0, 0
    lq t8, 32(t9)
    sll r0, r0, 0
    lq ra, 48(t9)
    sll r0, r0, 0
    sq t8, 128(t3)
    sll r0, r0, 0
    sq ra, 144(t3)
    sll r0, r0, 0
    lq t8, 64(t9)
    sll r0, r0, 0
    lq t9, 80(t9)
    sll r0, r0, 0
    sq t8, 160(t3)
    daddiu t3, t3, 192
    sq t9, -16(t3)
    daddiu t6, t6, -1
    sll r0, r0, 0
    bgtz t6, L87
    sll r0, r0, 0

B65:
    sll r0, r0, 0
    sw t7, 6516(t0)
    lui t4, 4096
    sw t5, 6532(t0)
    ori t4, t4, 54272
    sw a2, 6528(t0)
    sll r0, r0, 0
    sll r0, r0, 0
B66:
L92:
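    vcallms 25
    lw a2, 6408(t0)
    sll r0, r0, 0
    lw t4, 6420(t0)           ;; t4 = remaining count
    daddiu a2, a2, 80         ;; advance to the next instance (80 bytes each), I think
    sll r0, r0, 0
    daddiu t4, t4, -1
    sw a2, 6408(t0)
    bgtz t4, L69              ;; loop while the count is positive
    sw t4, 6420(t0)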

B67:
L93:
    sll r0, r0, 0
    lw t4, 8(a2)
    daddiu a2, a2, 16
    lw t5, 6540(t0)
    sll r0, r0, 0
    sw a2, 6408(t0)
    bne t4, r0, L69
    sw t4, 6420(t0)

B68:
    bne t5, r0, L58
    sll r0, r0, 0

B69:
    sll r0, r0, 0
    lw a1, 6404(t0)
    sll r0, r0, 0
    lq a2, 6336(t0)
    sll r0, r0, 0
    lq t2, 6352(t0)
    sll r0, r0, 0
    lq t3, 6368(t0)
    sll r0, r0, 0
    sq a2, 92(a1)
    sll r0, r0, 0
    sq t2, 60(a1)
    sll r0, r0, 0
    sq t3, 76(a1)
    beq a3, r0, L96
    sll r0, r0, 0

B70:
    sll r0, r0, 0
    lw a0, 6416(t0)
    sll r0, r0, 0
    sll r0, r0, 0
B71:
L94:
    lw a1, 0(a0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi a1, a1, 256
    sll r0, r0, 0
    beq a1, r0, L95
    sll r0, r0, 0

B72:
    sll r0, r0, 0
    lw a1, 6564(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu a1, a1, 1
    sll r0, r0, 0
    sw a1, 6564(t0)
    beq r0, r0, L94
    sll r0, r0, 0

B73:
L95:
    sw v1, 16(a0)
    sll a1, a3, 4
    sw t1, 128(a0)
    xori a2, t1, 6144
    addu v1, v1, a1
    or a1, a2, r0
    sw a3, 32(a0)
    addiu a1, r0, 256
    sw a1, 0(a0)
    addiu a1, r0, 0
B74:
L96:
    lw a1, 0(a0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    andi a1, a1, 256
    sll r0, r0, 0
    beq a1, r0, L97
    sll r0, r0, 0

B75:
    sll r0, r0, 0
    lw a1, 6564(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    daddiu a1, a1, 1
    sll r0, r0, 0
    sw a1, 6564(t0)
    beq r0, r0, L96
    sll r0, r0, 0

B76:
L97:
    lw a0, 6524(t0)
    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0
    sw v1, 4(a0)
    sll r0, r0, 0
B77:
L98:
    or v0, r0, r0             ;; return 0
    ld ra, 0(sp)
    lq gp, 16(sp)
    jr ra
    daddiu sp, sp, 32         ;; (delay slot) pop the stack frame

    sll r0, r0, 0
    sll r0, r0, 0
    sll r0, r0, 0