jak-project/docs/progress-notes/jak2/emerc.md
water111 0fcc7eb8e9
[merc2] Support emerc (#2147)
This adds environment mapping support to `Merc2`, and turns it on for
Jak 1 and Jak 2.

- The performance is much better
- Jak 1 can be toggled back to the old behavior with `(set! *emerc-hack*
#f)`. The new environment mapping is identical to the old one everywhere
I checked.
- Jak 1 still falls back to generic for ripple/texscroll/blerc/eyes -
there's still no dynamic texture or vertex updating support. The eye
detection stuff will sometimes flag stuff as eyes which is not eyes,
which is fine, but means that generic will be used in some places where
emerc could be used. For example, the shiny plates on jak's arm will be
drawn with generic because jak has eyes.
- Jak 2 hasn't been checked super carefully against PCSX2 yet.
- Jak 2 still isn't technically using emerc, but instead putting emerc
models in the merc bucket.
- The interface to merc is a lot different now and totally custom
OpenGOAL DMA code. The original merc drawing asm doesn't run anymore.
- The FR3 format changed
- Something funky going on with foreground lighting in escape, but
doesn't seem to be related to this change?

Performance comparison, jak 1, in likely the most generic-merc heavy
spot:

![image](https://user-images.githubusercontent.com/48171810/213882718-feb2ab59-95a9-44a2-b0e5-95fba860c7b0.png)

![image](https://user-images.githubusercontent.com/48171810/213882736-8dbbf4c9-6bbf-4d0b-96ce-78d63274660c.png)
2023-01-22 18:30:31 -05:00


Emerc

Outline

It's one of two renderers used for foreground + environment mapping. There's also a generic + merc (mercneric) renderer.

As far as I know, the supported effects are:

  • skinning, with up to 3 bones influencing each vertex, and per-vertex specification of bone weights
  • up to 3 directional lights, plus an ambient light
  • vertex colors
  • texturing
  • texture-based environment mapping (done per vertex, not fragment)

Our hope is to port the emerc renderer to PC, then use it for all rendering for envmapped foreground objects. I believe that emerc will be easier to understand than mercneric. The hope is that either emerc can be used for all models, or once we understand emerc, it will be straightforward to convert mercneric-only models to work with PC emerc.

The mercneric renderer handles partially offscreen stuff, and is believed to be slower than emerc. However, mercneric may use less VU1 time, in exchange for more EE time.

As far as I can tell, the game uses emerc only if all three of these conditions are true:

  • emerc effect bit is set in the model, indicating it can use emerc.
  • we're an actor spawned by scene-player
  • we're not in a frame range specified by scissor-frame in the scene info
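
The three conditions above can be sketched as a single predicate. This is only a sketch of my reading of the selection logic, not the game's code; the struct, its field names, and the function name are hypothetical.

```cpp
#include <cassert>

// Hypothetical inputs; the names are illustrative, not the game's types.
struct EmercSelectInput {
  bool has_emerc_effect_bit;     // emerc effect bit is set in the model
  bool spawned_by_scene_player;  // the actor was spawned by scene-player
  bool in_scissor_frame_range;   // current frame is inside a scissor-frame range
};

// Returns true if emerc would be picked; otherwise mercneric is assumed.
bool should_use_emerc(const EmercSelectInput& in) {
  return in.has_emerc_effect_bit && in.spawned_by_scene_player &&
         !in.in_scissor_frame_range;
}

// Example: a cutscene actor inside a scissor-frame range falls back to mercneric.
void example_emerc_selection() {
  assert(should_use_emerc({true, true, false}));
  assert(!should_use_emerc({true, true, true}));
}
```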

The emerc bit is only set on high-resolution cutscene models. Most of the time, there are no frames specified in scissor-frame. This makes sense: the actors are usually onscreen during cutscenes, and emerc seems quite tolerant of partially offscreen characters. (Similar story in Jak 1: they were aggressive about letting merc draw offscreen instead of clipping triangles, likely because the clipping pipeline is so much slower.)

In very rare cases, they manually specified a frame range for a character who is mostly offscreen (like daxter's feet are visible in frame 2324 of city-krew-collection-intro), and then the character is rendered with mercneric.

My guess is that they just used emerc by default everywhere. If a cutscene character is partially offscreen/behind the camera in a bad way that causes GS coordinates to overflow, this would draw garbage triangles, and they would manually annotate the frame range where this happened.

Review of how all this gets called

Setup

  • A level containing entity-actors is loaded
  • The level-update method (called once per frame) in entity.gc calls birth! on entity-actors that are visible and eligible to be spawned
  • The newly created actor process is initialized by calling init-from-entity!, which is a method that all objects must implement.
  • This method will eventually call initialize-skeleton, a method of the parent process-drawable class.
  • This method creates a draw-control with skeleton-group->draw-control
  • This method calls setup-cspace-and-add
  • This method adds the process drawable to *foreground-draw-engine*, a list of processes to be drawn.
  • The connection uses function add-process-drawable, which just calls the dma-add-func of the draw-control, which is dma-add-process-drawable by default

Per-Frame Draw

  • Game-objects are responsible for calling ja-post, or adding themselves to the matrix-engine list, or somehow coming up with joint transforms.

  • main loop in main.gc calls (*draw-hook*), which points to real-main-draw-hook. This function generates all DMA data for drawing.

  • foreground-engine-execute

    • foreground-init (doesn't do anything emerc-related)
    • calls execute-connections on the engine, which runs dma-add-process-drawable for each object
      • various stuff for shadows/picking lights
      • generates vu-lights (light values in VU-friendly format)
      • picks LOD based on distances
      • sets texture masks to indicate to the texture system which LODs of which textures will be used
      • determines if close-to-screen culling is needed.
      • call foreground-draw
        • add an entry to the *bone-calculation-list* to tell it to compute skinning matrices.
        • rotate lights to camera frame (note that merc only gets a perspective transform, transforming to camera frame is done in skinning calc to avoid a full affine transform on VU1)
        • there's some confusing logic for the renderer selection, but in the end it populates merc-effect-bucket-info including a color and a few flags.
        • calls foreground-emerc, which generates DMA data for emerc (asm func)
  • foreground-execute-cpu-vu0-engines

    • runs bones, modifying the above DMA data to contain skinning matrices computed from joints.
  • display-frame-finish called after all drawing

    • Calls emerc-vu1-init-buffers, which adds some init data to all used emerc buckets.

Emerc DMA Generation

The call in GOAL:

(set! dma-ptr (foreground-emerc dc (-> (scratchpad-object foreground-work) regs mtxs) dma-ptr 29 19))

The arguments are:

  • draw-control, which contains settings for drawing, and the actual merc geometry (called geo)
  • a pointer to the "matrix area", which will contain skinning matrices computed by bones
  • dma-ptr, a pointer to the DMA buffer to write data to
  • 29, 19, likely addresses in the VU1 microprogram to start execution. Typically there is one program for the first run of the renderer, which initializes some VU1 registers/data memory, and then a slightly shorter program that skips the init step.

Before the asm, the rough breakdown is:

  • a draw-control stores 4 geos, one for each lod (some may be unpopulated)
  • Each geo is a merc-ctrl, which is an entire model
  • Each merc-ctrl is made up of merc-effects
  • Each merc-effect is made up of "fragment"s. Each fragment has a frag-geo (actual data needed in VU1) and frag-ctrl (metadata describing how to upload data to VU1)
  • Each fragment has a few types of data:
    • unsigned-four: containing weights (u8), rgba (u8), addresses for crosscopy/samecopy. Unpacked [u8x4] to [u32x4] by VIF on upload to VU1.
    • lump-four: containing vertex data. Unpacked [u8x4] to [u32x4 + some_magic_constant] by VIF on upload to VU1. This unpack magically converts integers to floats.
    • fp data: containing a header, and "shaders" (giftags for setting up textures/settings). Copied directly by VIF.
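
The "magic" int-to-float conversion can be checked on the CPU. The sketch below emulates one component of a V4-8 unpack with row add, using the two constant ROW values that appear in the DMA generation (0x47800000 and 0x4b010000). The function names are made up for illustration, and what VU1 does with the result afterwards (presumably subtracting the float offset) is my assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret 32 bits as an IEEE-754 single-precision float.
float bits_to_float(uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Emulate one component of the unpack: zero-extend the u8, then do an
// integer add of the ROW value. No int->float conversion happens here.
float unpack_u8_with_row(uint8_t value, uint32_t row) {
  return bits_to_float(row + static_cast<uint32_t>(value));
}

void example_row_add() {
  // 0x4b010000 is 8454144.0f, where one mantissa step is exactly 1.0,
  // so adding the integer v yields the float 8454144 + v.
  assert(unpack_u8_with_row(5, 0x4b010000) == 8454149.0f);
  // 0x47800000 is 65536.0f, where one mantissa step is 1/128,
  // so adding the integer v yields 65536 + v / 128.
  assert(unpack_u8_with_row(128, 0x47800000) == 65537.0f);
}
```

The float offset is a constant, so a single float subtract on VU1 would recover the original value as a float without a dedicated int-to-float instruction.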

The calling function foreground-draw sets flags (per effect) in the merc-bucket-info array. All emerc stuff gets merc-path set to 1.

High-level description of what it does. Note that this is simplified from the assembly version, which combines some of the DMA transfers shown here. Also, this does not actually run any DMA or microprograms; it just generates a DMA chain that will do this later. On the next game frame, the giant DMA chain generated by all renderers will be submitted, and all of these transfers will run.

// get the merc control for our level of detail (selected in drawable.gc)
MercControl& mc = draw_control.lod_set[draw_control.cur_lod].geo;

// loop over each "effect" in the merc control.
// The "effect" is the grouping for what can be sent to one renderer or another
for (int effect_idx = 0; effect_idx < mc.header.effect_count; effect_idx++) {
    MercEffect& merc_effect = mc.effect[effect_idx]; // merc data in the art group
    MercBucketInfo& merc_effect_info = gForeground.merc_bucket_info[effect_idx]; // settings generated by foreground-draw

    if (merc_effect_info.disable_draw) {
        continue; // skip if disabled
    }

    if (merc_effect_info.merc_path != 1) {
        continue; // skip if not emerc (1 means emerc here)
    }

    // where we started writing dma for this effect
    u8* effect_dma_start = dma_ptr;

    // the source data (stored in the art group) that we'll be sending.
    u8* source_ptr = merc_effect.frag_geo;

    // loop over fragments
    for (int frag_idx = 0; frag_idx < merc_effect.frag_count; frag_idx++) {
        MercFragmentControl& frag_ctrl = merc_effect.frag_ctrl[frag_idx];
        // set the ROW register of the VIF.
        // when kRowAdd flag is given, the VIF will add these 4 values to each component of each quadword it writes out.
        // This is used as part of the process to go from u8's to floats
        // (they do some cool magic where they don't actually do int->float, they just add integers with VIF and
        //  do float math on VU1 and it works out somehow)
        dma_ptr = generate_vif_strow(dma_ptr, mc.header.st_vif_add, mc.header.st_vif_add, 0x47800000, 0x4b010000);

        // number of quadwords (16-byte words) in EE memory of unsigned_four data to send
        // unsigned_four data is stored as [u8, u8, u8, u8] and unpacked to [u32, u32, u32, u32].
        // the count variable is in units of 4 values. (4 bytes in EE memory, 16 bytes in VU1 memory)
        int u4_qwc_in_ee_mem = (frag_ctrl.unsigned_four_count + 3) / 4;

        int dest_addr_qw = 140;

        dma_ptr = generate_vif_unpack(dma_ptr,
            kUnpackV4_8,             // unpack [u8, u8, u8, u8] to [u32, u32, u32, u32]
            kUnsigned,               // zero extend when unpacking
            dest_addr_qw,            // VU1 data address (in quadwords)
            kUseTop,                 // add value of TOP register to destination (VU1 program controls destination)
            source_ptr,              // source pointer
            u4_qwc_in_ee_mem,        // number of QW to transfer from EE memory
            frag_ctrl.unsigned_four_count,  // number of QW written to VU1 memory
            kNoRow                   // do not add row
            );
        // note: to write 7 QW of data, they would have this in EE memory
        // (each vN is one 4-byte [u8, u8, u8, u8] group):
        // [v0, v1, v2, v3] (16 bytes)
        // [v4, v5, v6, XX] (16 bytes)
        // they would transfer 2 QW to VIF (including 1 padding group)
        // but you can tell VIF to unpack only 7 QW, and it will discard the padding.

        // advance source pointer to the next data (lump data)
        source_ptr += u4_qwc_in_ee_mem * 16;

        // advance dest pointer.
        dest_addr_qw += frag_ctrl.unsigned_four_count;

        // lump 4 is unpacked from [u8, u8, u8, u8] to [u32 + rx, u32 + ry, u32 + rz, u32 + rw]
        // where [rx, ry, rz, rw] are specified in ROW set above.
        int l4_qwc_in_ee_mem = (frag_ctrl.lump_four_count + 3) / 4;

        dma_ptr = generate_vif_unpack(dma_ptr,
            kUnpackV4_8,             // unpack [u8, u8, u8, u8] to [u32, u32, u32, u32]
            kUnsigned,               // zero extend when unpacking
            dest_addr_qw,            // VU1 data address (in quadwords)
            kUseTop,                 // add value of TOP register to destination (VU1 program controls destination)
            source_ptr,              // source pointer
            l4_qwc_in_ee_mem,        // number of QW to transfer from EE memory
            frag_ctrl.lump_four_count,  // number of QW written to VU1 memory
            kAddRow                  // add the row value
            );

        // advance source pointer to the next data (fp data)
        source_ptr += l4_qwc_in_ee_mem * 16;

        // advance dest pointer.
        dest_addr_qw += frag_ctrl.lump_four_count;

        // send fp data.
        dma_ptr = generate_vif_unpack(dma_ptr,
            kUnpackV4_32,            // just plain memcpy to VU1 memory
            kSigned,                 // no effect? they set it explicitly always, not sure why.
            dest_addr_qw,            // VU1 data address (in quadwords)
            kUseTop,                 // add value of TOP register to destination (VU1 program controls destination)
            source_ptr,              // source pointer
            frag_ctrl.fp_qwc,        // number of QW to transfer from EE memory
            frag_ctrl.fp_qwc,        // number of QW written to VU1 memory
            kNoRow                   // don't add the row value
            );

        // advance source pointer
        source_ptr += frag_ctrl.fp_qwc * 16;

        // there's some special data shared between all fragments. We put this DMA after the DMA
        // for the first fragment as an optimization. We can write the first fragment of this effect
        // to VU1 data memory while VU1 is processing the last fragment of the previous effect.
        // This is ok because the per-fragment data is double buffered (controlled with the TOP register)
        // However, the shared data is not double buffered, and we must wait for the previous effect
        // to be fully done before transferring. We want to delay this as long as possible, so we
        // transfer the first per-fragment data of this effect before this part.
        if (frag_idx == 0) {
            // sneak some more data in lights
            auto lights = gForeground.merc_bucket_info.lights;
            // the asm computes 0x3f85026b - ignore_alpha and stores it in the w of the first light qw
            lights.qws[0].w = merc_effect_info.ignore_alpha ? 0x3f85026a : 0x3f85026b;
            // copy the 7 qw of lights to the dma buffer now, setting up a transfer for them to go
            // to address 132 in VU1 (no TOP).
            // the previous code sets up these lights in VU format (vu-lights).

            dma_ptr = dma_memcpy_to_buffer_then_vu1(dma_ptr, 132, &lights, 7);
            // copy these 4 values to address 139 (copying them to the dma-buffer now)
            dma_ptr = dma_copy_to_buffer_then_vu1(dma_ptr, 139, mc.header.xyz_scale, mc.header.st_magic, mc.header.st_out_a, mc.header.st_out_b);

            // emerc new transfer - copying 1 qw color_fade (u8's unpacked to u32)
            dma_ptr = dma_copy_to_buffer_then_vu1(dma_ptr, 118, unpack_u8_to_u32(merc_effect_info.color_fade));

            AdgifShader* envmap_shader = DefaultEnvmapShader;
            if (merc_effect.extra_info && merc_effect.extra_info->shader_offset) { // nonzero check
                // shader_offset is in quadwords, relative to the start of the extra info
                envmap_shader = (AdgifShader*)(((u8*)merc_effect.extra_info) + 16 * merc_effect.extra_info->shader_offset);
            }

            // 5 qw envmap shader
            dma_ptr = dma_copy_to_buffer_then_vu1(dma_ptr, 119, envmap_shader, 5 * 16);
        }


        // fragments will (most of the time) need new matrix data.
        // there are some cases where they can reuse some matrix data from previous fragments in the same
        // effect, so it's possible for there to be no matrices to transfer. But usually there are some
        for (int mat_xfer = 0; mat_xfer < frag_ctrl.mat_xfer_count; mat_xfer++) {
            auto& info = frag_ctrl.mat_dest_data[mat_xfer];
            dma_ptr = dma_transfer_matrix(dma_ptr, info.matrix_dest, matrix_mem + sizeof(MercMatrix) * info.matrix_number);
        }

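        // finally, call program. In the asm, the first fragment of an effect runs program-addr-2
        // (19 in the GOAL call, the effect init, which continues into the per-fragment program),
        // and later fragments run program-addr-1 (29).
        dma_ptr = dma_mscal(dma_ptr, frag_idx == 0 ? program_addr_2 : program_addr_1);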


    }

    // a bunch of bucket patching crap
}
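
The size bookkeeping for the two V4-8 unpacks (unsigned-four and lump-four) can be sketched in isolation. The struct and function names below are hypothetical; only the `(count + 3) / 4` rounding is taken from the pseudocode above.

```cpp
#include <cassert>

// Sizes for one V4-8 unpack. A "group" is one 4-byte [u8, u8, u8, u8] entry
// in EE memory, which becomes one full quadword in VU1 memory.
struct UnpackSizes {
  int ee_qwc;          // quadwords read from EE memory (4 groups each)
  int vu_qwc;          // quadwords written to VU1 memory (1 per group)
  int padding_groups;  // unused groups at the end of the last EE quadword
};

UnpackSizes v4_8_unpack_sizes(int group_count) {
  int ee_qwc = (group_count + 3) / 4;  // round up to whole quadwords
  return {ee_qwc, group_count, ee_qwc * 4 - group_count};
}

void example_unpack_sizes() {
  // The 7 QW case from the note above: 2 QW read, 7 QW written, 1 padding group.
  UnpackSizes s = v4_8_unpack_sizes(7);
  assert(s.ee_qwc == 2 && s.vu_qwc == 7 && s.padding_groups == 1);
}
```

The source pointer advances by `ee_qwc * 16` bytes, while the VU1 destination advances by `vu_qwc` quadwords, which is why the pseudocode tracks the two separately.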

The actual asm:

L101:                      ;; function prologue
    daddiu sp, sp, -128
    sd ra, 0(sp)
    sq s0, 16(sp)
    sq s1, 32(sp)
    sq s2, 48(sp)
    sq s3, 64(sp)
    sq s4, 80(sp)
    sq s5, 96(sp)
    sq gp, 112(sp)

    ;; one-time setup for this "merc-control". A merc-control is a model (at a particular lod)
    ;; for a process-drawable.
    ;; using dc as the input draw-control (a constant)
    ;; using mc as (-> dc lod-set (-> dc cur-lod) geo), the merc-control we're drawing (a constant)
    ;; using t8 = mep as (-> mc effect <n>), one of the merc-effects in the merc-control (variable)
    ;; using t7 = mec (merc-effect counter), the number of remaining merc-effects
    ;; using t9 = mebp as (-> *foreground* merc-bucket-info effect <n>), one of the merc-bucket-info's filled out by
    ;;  the calling function, containing per-effect settings.
B0:
    or t7, a3, r0         ;; t7 = program-addr-1
    or v1, t0, r0         ;; v1 = program-addr-2
    lui t0, 4096          ;; t0 = 0x10000000
    lui t1, 18304         ;; t1 = 0x47800000
    daddiu t0, t0, 1      ;; t0 = 0x10000001
    dsll32 t1, t1, 0      ;; t1 = 0x47800000'00000000
    lui a3, 12288         ;; a3 = 0x30000000
    lui t8, 19201         ;; t8 = 0x4b010000
    pcpyld t0, a3, t0     ;; t0 = 0x00000000'30000000'00000000'10000001 (STROW)
    lbu a3, 78(a0)        ;; a3 = (-> dc cur-lod)
    pcpyld t1, t8, t1     ;; t1 = 0x00000000'4b010000'47800000'00000000
    lui t2, 28160         ;; t2 = 0x6e000000
    addiu t8, r0, 8       ;; t8 = 8
    multu3 a3, a3, t8     ;; a3 = (* 8 (-> dc cur-lod))
    lui t3, 1280          ;; t3 = 0x05000000
    lui t4, 27648         ;; t4 = 0x6c000000
    dsll32 t2, t2, 0      ;; t2 = 0x6e000000'00000000
    dsll32 t4, t4, 0      ;; t4 = 0x6c000000'00000000
    daddu t4, t4, t3      ;; t4 = 0x6c000000'05000000
    daddu t3, t2, t3      ;; t3 = 0x6e000000'05000000
    daddiu t3, t3, 1      ;; t3 = 0x6e000000'05000001
    daddu a0, a3, a0      ;; a0 = (+ dc (* 8 (-> dc cur-lod)))
    pcpyld t2, t2, r0     ;; t2 = 0x6e000000'00000000'00000000'00000000 (unpack-v4-8, no change to row)
    lw a0, 28(a0)         ;; a0 = (-> dc lod-set (-> dc cur-lod) geo) ;; a merc-ctrl
    pcpyld t3, t3, r0     ;; t3 = 0x6e000000'05000001'00000000'00000000 (unpack-v4-8, row add)
    pcpyld t4, t4, r0     ;; t4 = 0x6c000000'05000000'00000000'00000000 (unpack-v4-32, disable row add)
    lui t5, 12288         ;; t5 = 0x30000000
    lui t6, 4096          ;; t6 = 0x10000000
    daddiu t5, t5, 7      ;; t5 = 0x30000007
    lui t8, 5120          ;; t8 = 0x14000000
    lui a3, 27655         ;; a3 = 0x6c070000
    daddu t7, t8, t7      ;; t7 = 0x14000000 + program-addr-1
    dsll32 a3, a3, 0      ;; a3 = 0x6c070000'00000000
    dsll32 t8, t7, 0      ;; t8 = (0x14000000 + program-addr-1) << 32
    pcpyld t5, a3, t5     ;; t5 = 0x6c070000'00000000'00000000'30000007
    lwu t7, 52(a0)        ;; t7 = (-> mc effect-count)
    pcpyld t6, t8, t6     ;; t6 = ((0x14000000 + program-addr-1) << 32) << 64 + 0x00000000'10000000
    daddiu t8, a0, 156    ;; t8 = (-> mc effect 0) = mep "merc effect pointer"
    beq t7, r0, L109      ;; branch if there's no effects (I think this is buggy and jumps to the wrong spot)
    lw a3, *foreground*(s7) ;; a3 = *foreground*

B1:
    daddiu t9, a3, 2508 ;; t9 = (-> *foreground* merc-bucket-info effect 0)
B2:

   ;; TOP of per-effect loop
   ;; (I've marked lines with stats if they are just for computing statistics)
L102:
    lbu a3, 6(t9)              ;; a3 = (-> mebp disable-draw)
    or ra, a2, r0              ;; ra = start-of-dma-for-this-effect
    lbu gp, 4(t9)              ;; gp = (-> mebp merc-path)
    bne a3, r0, L109           ;; jump to next effect if this is disabled.
    lw a3, *merc-global-stats*(s7) ;; a3 = mgs

B3:
    daddiu a3, a3, 16          ;; a3 = (-> *merc-global-stats* emerc)
    daddiu gp, gp, -1          ;; check if `merc-path` is 1, skip this effect if it's something else
    sll r0, r0, 0
    bne gp, r0, L109
    lhu s4, 2(a3)              ;; stats.fragments

B4:
    lhu s3, 18(t8)             ;; s3 = (-> mep frag-count)
    lwu gp, 4(a3)              ;; stats
    lhu s5, 22(t8)             ;; s5 = (-> mep tri-count)
    daddu s4, s4, s3           ;; stats
    lwu s3, 8(a3)              ;; stats
    lhu s2, 24(t8)             ;; s2 = (-> mep dvert-count)
    daddu gp, gp, s5           ;; stats
    sh s4, 2(a3)               ;; stats
    sw gp, 4(a3)               ;; stats
    daddu s5, s3, s2           ;; stats
    lwu t2, 0(t8)              ;; t2 = (-> mep frag-geo)
    lwu gp, 4(t8)              ;; gp = (-> mep frag-ctrl)
    lui s4, 12288              ;; 0x30000000
    dsll32 t2, t2, 0           ;; (-> mep frag-geo) << 32
    sw s5, 8(a3)               ;; stats
    or t2, t2, s4              ;; t2 = ((-> mep frag-geo) << 32) + 0x30000000 (upper 64-bits still have dma tmpl)
    lhu s5, 18(t8)             ;; s5 = (-> mep frag-count)
    addiu s4, r0, 0            ;; s4 = 0
    beq s5, r0, L109           ;; skip to next effect if no frags in this effect.
    sll r0, r0, 0

B5:
    sll r0, r0, 0

    ;; top of per-fragment loop.
    ;; s4 = current-frag-idx
    ;; s5 = num-frags
    ;; a2 = dma-ptr
    ;; DMA memory layout
    ;;      lower-bits                               higher bits
    ;; 0   [dmatag-lower, dmatag-upper, strow-viftag, ROW_X      ] ;; transfer 1 qw, immediately after this
    ;; 1   [ROW_Y       , ROW_Z       , ROW_W       , nop-viftag ] ;; the qw transferred by 0
    ;; 2   [dmatag-lower, dmatag-upper, nop         , unpack-v4-8] ;; (unsigned4's)
    ;; 3   [dmatag-lower, dmatag-upper, strow 1     , unpack-v4-8] ;; lumps
    ;; 4   [dmatag-lower, dmatag-upper, strow 0     , unpack-v4-32]
B6:
L103:
    lbu s0, 0(gp)             ;; s0 = frag-ctrl.unsigned-four-count (number of 4xu8's in memory)
    sll r0, r0, 0
    lbu s2, 1(gp)             ;; s2 = frag-ctrl.lump-four-count
    xori s1, r0, 49292        ;; s1 = 0xc08c
    lbu s3, 2(gp)             ;; s3 = frag-ctrl.fp-qwc
    daddiu v0, s0, 3          ;; v0 = u4count + 3
    lw a3, 44(a0)             ;; a3 = header.st-vif-add
    srl v0, v0, 2             ;; v0 = (u4count + 3) / 4
    sq t0, 0(a2)              ;; set DMA qw 0 (dmatag-strow only)
    xor t2, t2, v0            ;; set dma qwc
    sq t2, 32(a2)             ;; store dma line 2.
    xor t2, t2, v0            ;; unset dma qwc
    sh s1, 44(a2)             ;; set addr for unpack (tops + unsigned bits)
    daddu s1, s1, s0          ;; update dest addr for next unpack
    sb s0, 46(a2)             ;; set qwc for unpack
    dsll32 s0, v0, 4          ;; s0 = (u4-ee-qwc << 36)
    daddu t3, t2, s0          ;; t3 = dma-tag templ
    daddiu s0, s2, 3          ;; s0 = l4c + 3
    sw a3, 12(a2)             ;; ROW_X = header.st-vif-add
    srl s0, s0, 2             ;; s0 /= 4
    sq t1, 16(a2)             ;; ROW_Z, W
    xor t3, t3, s0            ;; set dma qwc
    sq t3, 48(a2)             ;; store dma templ 3
    xor t3, t3, s0            ;; unset dma qwc
    sh s1, 60(a2)             ;; set vif unpack
    daddu s1, s1, s2          ;; next dest
    sb s2, 62(a2)             ;; store.
    dsll32 s2, s0, 4          ;; s2 = dma-src-inc shifted
    sw a3, 16(a2)             ;; ROW Y
    daddu t4, t3, s2          ;; unpack-v4-32 tmpl
    xor t4, t4, s3            ;; set qwc in dma tmpl
    xori a3, s1, 16384        ;; clear the unsigned bit (0x4000) for the v4-32 unpack
    sq t4, 64(a2)             ;; store dma 4
    xor t4, t4, s3            ;; unset qwc
    sb s3, 78(a2)             ;; set qwc in unpack
    dsll32 s3, s3, 4          ;; qwc -> bytes
    sh a3, 76(a2)             ;; set unpack
    daddu t2, t4, s3          ;; ?? (maybe reset t2 tmpl)
    lbu s3, 3(gp)             ;; s3 = mat-xfer-count
    daddiu gp, gp, 4          ;; next fragment control
    bne s4, r0, L105          ;; do B7, B8, B9, B10 only on first fragment
    daddiu a2, a2, 80         ;; advance DMA ptr.

B7:
    sd t6, 0(a2)              ;; weirdo dma generation code (somebody had too much fun here)
    addiu s2, r0, 8           ;; transfer 8 qw
    sd t6, 8(a2)              ;; more weird crap
    lui a3, 27656             ;; 0x6c08
    sb s2, 0(a2)              ;; transfer 8 qw
    daddiu a3, a3, 132        ;; dest address 132 (8 qw: lights at 132-138, header at 139)
    lw s2, *foreground*(s7)   ;; fg
    daddiu s2, s2, 2384       ;; s2 = merc-bucket-info array
    sw a3, 12(a2)             ;; unpack to 132
    lq a3, 0(s2)              ;; a3 = lights 0
    lq s1, 16(s2)             ;; s1 = lights 1
    lq s0, 32(s2)             ;; s0 = lights 2
    lq v0, 48(s2)             ;; v0 = lights 3
    sq a3, 16(a2)             ;; store lights
    sq s1, 32(a2)
    sq s0, 48(a2)
    sq v0, 64(a2)
    lq a3, 64(s2)             ;; lights again
    lq s1, 80(s2)
    lq s0, 96(s2)             ;; lights 6
    lui v0, 16261
    lq s2, 28(a0)
    daddiu v0, v0, 619        ;; v0 = 0x3f85026b
    sq a3, 80(a2)             ;; light store
    lbu a3, 5(t9)             ;; a3 = ignore-alpha
    sq s1, 96(a2)             ;; lights
    sq s0, 112(a2)            ;; last lights
    dsubu a3, v0, a3          ;; compute ignore alpha
    sq s2, 128(a2)            ;; header
    sw a3, 28(a2)             ;; w of the first light qw (lights start at 16(a2))
    daddiu a2, a2, 144        ;; inc dma
    sd t6, 0(a2)
    addiu s2, r0, 6
    sd t6, 8(a2)
    lui a3, 27654             ;; 0x6C06
    sb s2, 0(a2)
    daddiu a3, a3, 118
    sw a3, 12(a2)       ;; unpack 6 qw to 118 (118-123, ending just before 124)
    lw a3, 0(t9)        ;; a3 = color fade
    pextlb a3, r0, a3   ;; unpack u8 to u32's
    pextlh a3, r0, a3
    sq a3, 16(a2)       ;; store color fade
    lw a3, *default-envmap-shader*(s7) ;; envmap ptr.
    lw s2, 28(t8)       ;; merc-extra-info
    beq s2, r0, L104
    sll r0, r0, 0

B8:
    lbu s1, 1(s2)
    beq s1, r0, L104
    sll r0, r0, 0

B9:
    sll a3, s1, 4
    addu a3, s2, a3
B10:
L104:
    lq s2, 0(a3) ;; copy shader to dma buff
    lq s1, 16(a3)
    lq s0, 32(a3)
    lq v0, 48(a3)
    lq a3, 64(a3)
    sq s2, 32(a2)
    sq s1, 48(a2)
    sq s0, 64(a2)
    sq v0, 80(a2)
    sq a3, 96(a2)
    daddiu a2, a2, 112

    ;; after first time per-effect stuff
B11:
L105:
    beq s3, r0, L107
    addiu s2, r0, 128 ;; s2 = 128 (matrix size)

B12:
    lbu a3, 0(gp) ;; get mat number
    sll r0, r0, 0
B13:
L106:
    multu3 s1, a3, s2 ;; s1 = matrix offset in ee world
    sq t5, 0(a2)      ;; mat transfer tmplate
    lbu s0, 1(gp)     ;; mat dest
    daddiu gp, gp, 2  ;; gp = next mat transfer
    lbu a3, 0(gp)     ;; a3 = next matrix offset
    daddiu s3, s3, -1 ;; dec remaining
    sb s0, 12(a2)     ;; store dest
    daddiu a2, a2, 16 ;; inc dma
    daddu s1, s1, a1  ;; compute matrix pointer
    sll r0, r0, 0
    bne s3, r0, L106
    sw s1, -12(a2)

B14:
L107:
    sq t6, 0(a2)
    daddiu a2, a2, 16
    bne s4, r0, L108
    daddiu s4, s4, 1

B15:
    or a3, v1, r0 ;; first fragment only: run program-addr-2 instead of program-addr-1 from the template
    sb a3, -4(a2)
B16:
L108:
    bne s4, s5, L103 ;; loop frag
    sll r0, r0, 0

B17: ;; patching crap, based on texture index now. should document eventually...
    lui s5, 28672
    lbu a3, 26(t8)
    addiu gp, r0, 48
    lw s5, 52(s5)
    mult3 a3, a3, gp
    sll r0, r0, 0
    daddu a3, s5, a3
    sll r0, r0, 0
    lw gp, 12(a3)
    sll r0, r0, 0
    lw s5, 16(a3)
    lui s4, 8192
    sq r0, 0(a2)
    movz gp, ra, gp
    sw s4, 0(a2)
    or s4, a2, r0
    sw gp, 12(a3)
    daddiu a2, a2, 16
    beq s5, r0, L109
    sw s4, 16(a3)

B18:
    sll r0, r0, 0
    sw ra, 4(s5)
B19:
L109:
    daddiu t8, t8, 32
    daddiu t9, t9, 8
    daddiu t7, t7, -1
    bne t7, r0, L102 ;; loop effect
    sll r0, r0, 0

B20:
    or v0, a2, r0
    ld ra, 0(sp)
    lq gp, 112(sp)
    lq s5, 96(sp)
    lq s4, 80(sp)
    lq s3, 64(sp)
    lq s2, 48(sp)
    lq s1, 32(sp)
    lq s0, 16(sp)
    jr ra
    daddiu sp, sp, 128

    sll r0, r0, 0
    sll r0, r0, 0
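
Most of the constants built in B0 are VIF codes. The sketch below packs them the way I understand the VIF code layout (8-bit command, 8-bit count, 16-bit immediate) and checks the result against values visible in the asm comments. The helper names are hypothetical, and the bit meanings (0x4000 = unsigned, 0x8000 = add TOP) are inferred from the 0xc08c constant and the unpack settings described earlier, so treat them as assumptions.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kVifMscal = 0x14;        // call VU1 microprogram
constexpr uint32_t kVifUnpackV4_32 = 0x6c;  // plain copy
constexpr uint32_t kVifUnpackV4_8 = 0x6e;   // [u8 x 4] -> [u32 x 4]

// VIF code layout: [cmd:8][num:8][immediate:16].
constexpr uint32_t vif_code(uint32_t cmd, uint32_t num, uint32_t imm) {
  return (cmd << 24) | ((num & 0xffu) << 16) | (imm & 0xffffu);
}

// UNPACK immediate: destination quadword address in the low bits,
// plus the unsigned and add-TOP flags (assumed bit positions).
constexpr uint32_t unpack_imm(uint32_t addr_qw, bool use_top, bool is_unsigned) {
  return (addr_qw & 0x3ffu) | (is_unsigned ? 0x4000u : 0u) | (use_top ? 0x8000u : 0u);
}

void example_vif_codes() {
  // unsigned-four unpack: address 140, TOP-relative, unsigned -> low half 0xc08c
  assert(vif_code(kVifUnpackV4_8, 0, unpack_imm(140, true, true)) == 0x6e00c08cu);
  // lights + header: 8 qw to absolute address 132
  assert(vif_code(kVifUnpackV4_32, 8, unpack_imm(132, false, false)) == 0x6c080084u);
  // mscal to program address 29
  assert(vif_code(kVifMscal, 0, 29) == 0x1400001du);
}
```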

Summary of above

Overall, it's very similar to merc. There's some extra data transferred:

  • the "low memory" stuff setup in emerc.gc is 1 QW longer (an extra "unperspect QW")
  • the rgba color-fade is transferred to non-double-buffered memory (like lights) (1 QW, unpack to u32's)
  • 5 QW shader for envmapping (either *default-envmap-shader* or one provided in merc extra info)
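
As a sanity check, the non-double-buffered VU1 addresses used by the DMA generation above tile without overlap. The constant names below are made up for illustration; the addresses and sizes are read off the asm and the per-frame init program. The helper mirrors what the pextlb/pextlh pair does to color-fade.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// VU1 quadword addresses (absolute, no TOP offset). Names are hypothetical.
constexpr int kColorFadeAddr = 118;     // 1 qw, emerc only
constexpr int kEnvmapShaderAddr = 119;  // 5 qw, emerc only
constexpr int kUnperspectAddr = 124;    // 1 qw, stored by the per-frame init program
constexpr int kLightsAddr = 132;        // 7 qw
constexpr int kHeaderAddr = 139;        // 1 qw (xyz-scale, st-magic, st-out-a, st-out-b)

static_assert(kColorFadeAddr + 1 == kEnvmapShaderAddr, "color fade is followed by the shader");
static_assert(kEnvmapShaderAddr + 5 == kUnperspectAddr, "shader ends where unperspect starts");
static_assert(kLightsAddr + 7 == kHeaderAddr, "lights are followed by the header");

// Zero-extend each byte of a packed rgba word to its own u32 (byte 0 first).
std::array<uint32_t, 4> unpack_u8_to_u32(uint32_t rgba) {
  return {rgba & 0xffu, (rgba >> 8) & 0xffu, (rgba >> 16) & 0xffu, (rgba >> 24) & 0xffu};
}

void example_color_fade() {
  std::array<uint32_t, 4> c = unpack_u8_to_u32(0x80402010u);
  assert(c[0] == 0x10 && c[1] == 0x20 && c[2] == 0x40 && c[3] == 0x80);
}
```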

emerc data appears backward compatible with merc, which makes sense:

  • emerc falls back to merc if it's too far away to envmap
  • we put emerc stuff through merc code (blending and stuff is wrong, but the geometry comes out right)

The promising thing is that we don't seem to need much extra information to do environment mapping. I kinda thought we'd need another set of texture coordinates, but I don't see where that enters yet.

If all we need is the shader, plus tint values, it would be easy to do this for any model that succeeds with merc.

EMERC VU1 constants

Triangle strip giftag - same as normal merc exactly (in the normal no-alpha case)

(set! (-> s5-0 tri-strip-gif tag)
      (new 'static 'gif-tag64
        :pre #x1
        :prim (new 'static 'gs-prim :prim (gs-prim-type tri-strip) :iip #x1 :tme #x1 :fge #x1)
        :nreg #x3
        )
      )
(set! (-> s5-0 tri-strip-gif regs)
      (new 'static 'gif-tag-regs :regs0 (gif-reg-id st) :regs1 (gif-reg-id rgbaq) :regs2 (gif-reg-id xyzf2))
      )
;; word 3 gets set to #x303e4000

Program list

  • 0: per-frame init
  • 19: effect init
  • 29: process frag

Memory map

All in 16-byte quadword addresses.

Low memory: after DMA
[0    ] : tri-strip-gif (st, rgbaq, xyzf2), no abe, 0x303e4000 in word 3, same as merc.
[1    ] : adgif-shader giftag (giftag for 5 a+d's)
[2    ] : hvdf-offset
[3 - 6] : perspective matrix (only perspective project, no rotation/translation). 3 gets set to persp_vector
[7    ] : fog (pfog0, fog-min, fog-max, 0.0)
[8    ] : unperspect (1/P(0, 0), 1/P(1, 1), 0.5, 1/P(2, 3))

Low memory: after inits (both frame and effect)
[0    ] : tri-strip-gif (st, rgbaq, xyzf2), no abe, 0x303e4000 in word 3, same as merc.
[1    ] : adgif-shader giftag (giftag for 5 a+d's)
[2    ] : hvdf-offset
[3    ] : P_mult = [low.P(0, 0), low.P(1, 1), low.P(2, 2), low.P(2, 3)]
[4    ] : P_add = [low.P(3, 0), low.P(3, 1), low.P(3, 2), low.P(3, 3)]
[5    ] : P_mult_scale = P_mult * header.xyz-scale
[7    ] : fog (pfog0, fog-min, fog-max, 0.0)
[8    ] : unperspect (1/P(0, 0), 1/P(1, 1), 0.5, 1/P(2, 3))
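
The derived low-memory vectors can be written out directly from the formulas in the map. This is a sketch under assumptions: `P[a][b]` simply mirrors the `P(a, b)` notation above (I'm not claiming a particular row/column convention), and the type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // P[a][b] corresponds to P(a, b) in the memory map

// unperspect = (1/P(0, 0), 1/P(1, 1), 0.5, 1/P(2, 3))
Vec4 make_unperspect(const Mat4& P) {
  return {1.0f / P[0][0], 1.0f / P[1][1], 0.5f, 1.0f / P[2][3]};
}

// P_mult = [P(0, 0), P(1, 1), P(2, 2), P(2, 3)]
Vec4 make_p_mult(const Mat4& P) {
  return {P[0][0], P[1][1], P[2][2], P[2][3]};
}

// P_add = [P(3, 0), P(3, 1), P(3, 2), P(3, 3)]
Vec4 make_p_add(const Mat4& P) {
  return P[3];
}

// P_mult_scale = P_mult * header.xyz-scale (all four components scaled)
Vec4 make_p_mult_scale(const Vec4& p_mult, float xyz_scale) {
  return {p_mult[0] * xyz_scale, p_mult[1] * xyz_scale, p_mult[2] * xyz_scale,
          p_mult[3] * xyz_scale};
}

void example_low_memory() {
  Mat4 P{};  // zero-initialized; fill in only the entries used below
  P[0][0] = 2.0f;
  P[1][1] = 4.0f;
  P[2][2] = 1.0f;
  P[2][3] = 0.5f;
  Vec4 u = make_unperspect(P);
  assert(u[0] == 0.5f && u[1] == 0.25f && u[2] == 0.5f && u[3] == 2.0f);
}
```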

Summary of math

The "transformed vertex" refers to the vertex before perspective divide, and pfog0 multiply. The "transformed normal" is the rotated normal, after normalization.

vf08 = transformed
vf23 = unperspect
vf14 = rgba-fade
vf24 = normal st

mul.xyzw vf09, vf08, vf23 ;; do unperspect

subw.z vf10, vf10, vf00 ;; subtract 1 from z

addw.z vf09, vf00, vf09 ;; xyww the unperspected thing

mul.xyz vf15, vf09, vf10 ;;

adday.xyzw vf15, vf15

maddz.x vf15, vf21, vf15

div Q, vf15.x, vf10.z

mulaw.xyzw ACC, vf09, vf00

mul.xyzw vf09, vf08, vf23

madd.xyzw vf10, vf10, Q

eleng.xyz P, vf10

mfp.w vf10, P

div Q, vf23.z, vf10.w

addaz.xyzw vf00, vf23

madd.xyzw vf10, vf10, Q

mulz.xy vf24, vf10, vf24 ;; mul tex by q

;; new rgba
sq.xyzw vf14, 443(vi10)

;;
vf24

VU1 Program: init (per frame)

  lq.xyzw vf01, 7(vi00)      |  nop
  lq.xyzw vf25, 3(vi00)      |  nop
  lq.xyzw vf26, 4(vi00)      |  nop
  lq.xyzw vf27, 5(vi00)      |  nop
  lq.xyzw vf28, 6(vi00)      |  nop
  lq.xyzw vf08, 8(vi00)      |  nop
  mr32.xyzw vf01, vf01       |  nop
  move.y vf25, vf26          |  nop
  move.zw vf25, vf27         |  nop
  sq.xyzw vf25, 3(vi00)      |  nop
  sq.xyzw vf08, 124(vi00)    |  nop
  2048.0                     |  nop :i
  255.0                      |  maxi.x vf17, vf00, I :i
  -65537.0                   |  maxi.y vf17, vf00, I :i
  mr32.xyzw vf02, vf01       |  minii.z vf17, vf00, I
  lq.xyzw vf22, 2(vi00)      |  minii.z vf18, vf00, I
  0.003921569                |  minii.z vf19, vf00, I :i
  sq.xyzw vf28, 4(vi00)      |  minii.w vf29, vf00, I :e
  mr32.xyzw vf03, vf02       |  nop

Simplified code (??'s are either garbage, or some value that isn't important later on). Leaving out stores to low memory documented in the Memory Map section.

vf01 = [??, ??, ??, low.pfog0]
vf02 = [??, ??, ??, low.fog_min]
vf03 = [??, ??, ??, low.fog_max]
vf17 = [2048., 255., -65537., ??]
vf22 = low_in.hvdf_offset
vf29 = [??, ??, ??, 1/255]
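
The three `mr32` instructions are what move the fog values into the w fields. A small sketch of that rotation, assuming `mr32` rotates the four fields by one (x takes y, y takes z, z takes w, w takes x); the fog numbers are made up:

```python
def mr32(v):
    """Emulate VU `mr32`: dest = (src.y, src.z, src.w, src.x)."""
    x, y, z, w = v
    return (y, z, w, x)

# Slot [7] of low memory: (pfog0, fog-min, fog-max, 0.0), sample values only.
fog = (100.0, 10.0, 200.0, 0.0)

vf01 = mr32(fog)    # w = pfog0
vf02 = mr32(vf01)   # w = fog-min
vf03 = mr32(vf02)   # w = fog-max
```

The xyz fields end up holding rotated leftovers, which is why they show up as `??` in the simplified code.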

VU1 Program: init (per effect)

Note that this runs straight into the per-frag program, which matches the note on the frag == 0 case in the DMA generation part.

  lq.xyzw vf25, 139(vi00)    |  nop
  lq.xyzw vf26, 3(vi00)      |  nop
  lq.xyz vf01, 132(vi00)     |  nop
  lq.xyz vf02, 133(vi00)     |  nop
  lq.xyz vf03, 134(vi00)     |  addy.xy vf19, vf00, vf25
  lq.xyzw vf04, 135(vi00)    |  mulx.xyzw vf26, vf26, vf25
  lq.xyzw vf05, 136(vi00)    |  nop
  lq.xyzw vf06, 137(vi00)    |  nop
  lq.xyzw vf07, 138(vi00)    |  nop
  sq.xyzw vf26, 5(vi00)      |  nop ;; P_mult_scale store.

Simplified code (note: some of these values are only set later)

vf25 = [xyz-scale, st-magic, st-out-a, st-out-b];
vf26 = low.P_mult * xyz-scale;
vf01 = [lt0.xyz, pfog0]
vf02 = [lt1.xyz, fog-min]
vf03 = [lt2.xyz, fog-max]
vf19 = [st-magic, st-magic, -65537, xyz-add.z];
vf04 = lt0_color;
vf05 = lt1_color;
vf06 = lt2_color;
vf07 = ambient color (my guess: it is added unweighted at the end of the lighting sum);
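
Reading the vertex loop further down, these registers seem to be combined like this: vf01 to vf03 multiply the normal (`mulax`/`madday`/`maddz`), the result is clamped at zero, it then weights the three light colors, vf07 is added unweighted, and the sum is multiplied by the vertex rgba and clamped to 255. The sketch below is my reading of that arithmetic, not a verified port. The function and argument names are hypothetical, and treating vf07 as the ambient term is an assumption.

```python
def light_vertex(normal, lt_rows, lt_colors, ambient, rgba):
    """Per-vertex lighting as it appears in the loop (hypothetical helper)."""
    nx, ny, nz = normal
    r0, r1, r2 = lt_rows  # xyz of vf01, vf02, vf03
    # mulax / madday / maddz: vf01 * n.x + vf02 * n.y + vf03 * n.z
    d = [r0[i] * nx + r1[i] * ny + r2[i] * nz for i in range(3)]
    # maxx against vf00.x, which is 0.0
    d = [max(v, 0.0) for v in d]
    c0, c1, c2 = lt_colors  # vf04, vf05, vf06
    out = []
    for ch in range(4):
        # mulax / madday / maddaz, then maddw with vf00.w = 1.0 (assumed ambient)
        lit = c0[ch] * d[0] + c1[ch] * d[1] + c2[ch] * d[2] + ambient[ch]
        # mul by rgba, miniy against vf17.y = 255, then ftoi0
        out.append(int(min(lit * rgba[ch], 255.0)))
    return out

# Made-up inputs: only the first light weight ends up positive.
color = light_vertex(
    normal=(0.0, 0.0, 1.0),
    lt_rows=((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, -1.0, 0.0)),
    lt_colors=((1.0, 0.5, 0.25, 0.0), (9.0, 9.0, 9.0, 9.0), (1.0, 1.0, 1.0, 1.0)),
    ambient=(0.25, 0.25, 0.25, 1.0),
    rgba=(255.0, 128.0, 128.0, 128.0),
)
```

One thing this makes visible: because the normal is multiplied column by column, weight `i` is the dot product of the normal with (vf01[i], vf02[i], vf03[i]), so the light directions would have to be stored transposed for the weights to be per-light dot products.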

VU1 Program: per-fragment, pre-looping init

;; reg setup stuff
  lq.xyzw vf28, 139(vi00)    |  minix.xyzw vf15, vf00, vf00    ;; vf28 = merc-ctrl-header, vf15 = [0, 0, 0, 0]
  xtop vi15                  |  nop                            ;; vi15 = 0 (output buffer)
  iaddiu vi12, vi15, 0x8c    |  nop                            ;; vi12 = xtop + 140 (merc-byte-header, u4)
  nop                        |  nop                            ;; in merc was a branch for st-a/st-b select.
  ilwr.w vi03, vi12          |  maxz.xy vf18, vf00, vf28       ;; set vf18.xy = [st-out-a, st-out-a] (for a buffer)
  iaddiu vi15, vi00, 0x173   |  nop                            ;; vi15 = xtop + 371
  lq.xyzw vf14, 0(vi00)      |  nop                            ;; vf14 = tri-strip-gif-tag
  nop                        |  nop                            ;; in merc was fadeout
  iadd vi03, vi03, vi12      |  nop                            ;; st-output location = st-out-a + xtop + 140
  ilwr.w vi09, vi03          |  nop          ;; vi09 = fp-header u8's [shader-cnt, kick-off, kick-step, hword-cnt]
  lqi.xyzw vf27, vi03        |  nop          ;; vf27 = xyz-add
  ilw.x vi04, 1(vi12)        |  nop          ;; vi04 = mat1-cnt
  iaddiu vi05, vi00, 0x7f    |  addw.xyz vf15, vf15, vf00  ;; vf15 = [1, 1, 1, 0], vi05 = 0x7f
  iand vi09, vi09, vi05      |  nop                        ;; mask to get vi09 = shader-cnt
  ilw.y vi06, 1(vi12)        |  miniz.w vf19, vf00, vf27   ;; setup vf19, vi06 = mat2-cnt
  nop                        |  miniy.w vf18, vf00, vf27   ;; setup vf18, merc had branch for no strips.
  ilwr.z vi01, vi12          |  minix.w vf17, vf00, vf27   ;; vi01 = lump-off

;;  vf17 = [2048, 255, -65537, xyz-add.x]
;;  vf18 = [st-out-a, st-out-a, -65537, xyz-add.y] (emerc always uses a; merc picked a or b based on xtop)
;;  vf19 = [st-magic, st-magic, -65537, xyz-add.z]

;; shader setup (not envmap)
  lq.xyzw vf13, 1(vi00)      |  nop     ;; vf13 = adgif gif tag.
  ilwr.w vi02, vi03          |  nop     ;; vi02 = shader control word 0 (dest offset)
  lqi.xyzw vf08, vi03        |  nop     ;; load shader data
  lqi.xyzw vf09, vi03        |  nop
  lqi.xyzw vf10, vi03        |  nop
  lqi.xyzw vf11, vi03        |  nop
  lqi.xyzw vf12, vi03        |  nop
  iadd vi02, vi02, vi15      |  nop     ;; compute destination
  mtir vi08, vf09.w          |  nop     ;; eop stuff (not sure this makes sense in 1-shader emerc)
  sqi.xyzw vf13, vi02        |  nop     ;; store adgif gif tag
  sqi.xyzw vf08, vi02        |  nop     ;; shader store 1
  sqi.xyzw vf09, vi02        |  nop     ;; shader store 2
  mfir.x vf14, vi08          |  nop     ;; set eop bit in giftag template
  sqi.xyzw vf10, vi02        |  nop     ;; shader store 3
  sqi.xyzw vf11, vi02        |  nop     ;; shader store 4
  sqi.xyzw vf12, vi02        |  nop     ;; shader store 5
  sq.xyzw vf14, 0(vi02)      |  nop     ;; store end giftag

;; matrix warmup
  lq.xyzw vf28, 3(vi00)      |  nop     ;; vf28 = persp-diag
  ilw.y vi08, 3(vi12)        |  nop     ;; vi08 = mat-slot.0
  lq.xyzw vf16, 5(vi00)      |  nop     ;; vf16 = scaled-persp-diag
  lq.xyzw vf20, 4(vi00)      |  nop     ;; vf20 = persp-off
  ilw.z vi09, 3(vi12)        |  mul.xyzw vf27, vf28, vf15  ;; vf27 = [pdx, pdy, pdz, 0], vi09 = mat-slot.1
  ior vi11, vi08, vi00       |  mul.xyzw vf28, vf28, vf00  ;; vf28 = [0, 0, 0, pdw], vi11 = vi08 = mat-slot.0
  ibeq vi00, vi08, L2        |  mul.xyzw vf15, vf16, vf15  ;; vf15 = [spdx, spdy, spdz, 0], skip if slot = 0
  iaddi vi13, vi12, 0x3      |  mul.xyzw vf16, vf16, vf00  ;; vi13 = mat-slot-ptr, vf16 = [0, 0, 0, spdw]
  • mostly same as merc
  • always picks st-a, merc had a branch here based on state of xtop.
  • no fade out flag stuff
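
The "shader setup" part of the listing boils down to a 7-quadword copy into the output buffer. Here is a sketch of it with quadwords modeled as lists of four 32-bit words. The helper names and sample data are invented. The 16-bit behavior of `ilwr`/`mtir` and the sign extension in `mfir` are how I understand those instructions.

```python
def sign_extend16(v):
    """Sign-extend the low 16 bits of v, as `mfir` does with a VI register."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v

def place_shader(out, adgif_tag, shader_qws, strip_tag, base=0x173):
    """Copy one shader into `out`; returns the index of the trailing giftag."""
    dest_offset = shader_qws[0][3] & 0xFFFF       # ilwr.w vi02, vi03
    dst = base + dest_offset                      # iadd vi02, vi02, vi15
    out[dst] = list(adgif_tag)                    # adgif giftag first
    for i, qw in enumerate(shader_qws):           # then the 5 a+d quadwords
        out[dst + 1 + i] = list(qw)
    tag = list(strip_tag)
    # mtir vi08, vf09.w then mfir.x vf14, vi08: x word of the strip giftag
    tag[0] = sign_extend16(shader_qws[1][3]) & 0xFFFFFFFF
    out[dst + 6] = tag                            # sq.xyzw vf14, 0(vi02)
    return dst + 6

# Made-up data; 0x303e4000 in word 3 is taken from the memory map.
out = [None] * 1024
adgif_tag = [0x1000, 0x2000, 0x3000, 0x4000]
strip_tag = [0, 0, 0, 0x303E4000]
shader = [[10, 11, 12, 7], [20, 21, 22, 0x8000],
          [30, 31, 32, 33], [40, 41, 42, 43], [50, 51, 52, 53]]
end = place_shader(out, adgif_tag, shader, strip_tag)
```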

Matrix multiply loop

Premultiplies uploaded matrices by perspective. Only does matrices that were uploaded this time. Same as merc, so skipping.
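
Since the loop itself is skipped here, this is a sketch of what "premultiply by perspective" can look like given the sparse layout implied by `P_mult` and `P_add`. It assumes a row-vector convention and that `P` is zero everywhere except the elements listed in the memory map. I haven't checked it against the actual instruction sequence, and the function names are made up.

```python
def premultiply_sparse(M, p_mult, p_add):
    """M * P using only the 8 nonzero entries of P (rows of M are [x, y, z, w])."""
    out = []
    for r in M:
        out.append([
            r[0] * p_mult[0] + r[3] * p_add[0],
            r[1] * p_mult[1] + r[3] * p_add[1],
            r[2] * p_mult[2] + r[3] * p_add[2],
            r[2] * p_mult[3] + r[3] * p_add[3],  # w comes from z via P(2, 3)
        ])
    return out

def matmul(A, B):
    """Plain 4x4 matrix product, used only as a reference."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Made-up integer data so the comparison is exact.
p_mult = [2, 3, 4, 5]
p_add = [0, 0, 6, 0]
P = [
    [p_mult[0], 0, 0, 0],
    [0, p_mult[1], 0, 0],
    [0, 0, p_mult[2], p_mult[3]],
    p_add,
]
M = [[1, 2, 3, 0], [4, 5, 6, 0], [7, 8, 9, 0], [10, 11, 12, 1]]
sparse = premultiply_sparse(M, p_mult, p_add)
full = matmul(M, P)
```

The point of the sparse form is that each row costs a couple of multiply-adds instead of a full 4x4 product, which would fit the split into `[pdx, pdy, pdz, 0]` and `[0, 0, 0, pdw]` done in the matrix warmup.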

The rest of it

  • Transformed vertex (before perspective divide and pfog0 multiply) is stored back over lump[2]
  • Transformed normal is stored over rgba
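
To make the two stored-back values concrete, here is a sketch of the per-vertex transform for the single-matrix case. It assumes quadwords 0 to 3 of a matrix slot transform the position and quadwords 4 to 6 rotate the normal, which is how the loads in the loop below look to me. The helper names and the sample slot are made up.

```python
import math

def transform_position(slot, pos):
    """pos.x * row0 + pos.y * row1 + pos.z * row2 + row3 (mulaw/maddaw/maddw/add)."""
    return [slot[0][i] * pos[0] + slot[1][i] * pos[1] + slot[2][i] * pos[2] + slot[3][i]
            for i in range(4)]

def transform_normal(slot, nrm):
    """Rotate with rows 4-6, then scale by 1/length (erleng + mulw)."""
    r = [slot[4][i] * nrm[0] + slot[5][i] * nrm[1] + slot[6][i] * nrm[2]
         for i in range(3)]
    inv_len = 1.0 / math.sqrt(r[0] * r[0] + r[1] * r[1] + r[2] * r[2])
    return [c * inv_len for c in r]

# Made-up slot: identity plus a translation, and a normal matrix that swaps x and y.
slot = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [10.0, 20.0, 30.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
stored_vertex = transform_position(slot, (1.0, 2.0, 3.0))  # written over lump[2]
stored_normal = transform_normal(slot, (3.0, 0.0, 4.0))    # written over rgba
```

In the real program the matrix has already been premultiplied by perspective, so `stored_vertex` is in pre-divide clip-like space rather than camera space.
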
L2: (L14 in og merc)
;; Pipelining Start for vertex transform
  ilw.x vi02, 3(vi12)        |  nop  ;; vi02 = perc-off
  ibeq vi00, vi04, L13       |  nop  ;; goto L13 if mat1 count is 0
  iadd vi01, vi01, vi12      |  nop  ;; vi01 = lump.

;; Pipelining start for matrix 1's
  ilwr.x vi08, vi01          |  nop ;; vi08 = lump[0].x = mat-0?
  lqi.xyzw vf08, vi01        |  nop
  lqi.xyzw vf11, vi01        |  nop
  lqi.xyzw vf14, vi01        |  nop ;; vf14 = lump[2] = [texs, text, nrmz, posz]
  lq.xyz vf29, 4(vi08)       |  nop
  lq.xyz vf30, 5(vi08)       |  add.zw vf08, vf08, vf17
  lq.xyzw vf31, 6(vi08)      |  add.xyzw vf11, vf11, vf18
  iaddi vi04, vi04, -0x1     |  add.xyzw vf14, vf14, vf19
  iadd vi02, vi02, vi12      |  nop
  lqi.xyzw vf24, vi02        |  mulaz.xyzw ACC, vf29, vf08
  mtir vi10, vf11.x          |  maddaz.xyzw ACC, vf30, vf11
  mtir vi13, vf11.y          |  maddz.xyz vf11, vf31, vf14
  lq.xyzw vf25, 0(vi08)      |  nop
  lq.xyzw vf26, 1(vi08)      |  itof0.xyzw vf24, vf24
  lq.xyzw vf27, 2(vi08)      |  nop
  erleng.xyz P, vf11         |  nop
  lq.xyzw vf28, 3(vi08)      |  mulaw.xyzw ACC, vf25, vf08
  nop                        |  maddaw.xyzw ACC, vf26, vf11 ;; modified from merc, no mercprime crap
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  lqi.xyzw vf09, vi01        |  nop
  ilwr.y vi03, vi12          |  nop
  ilw.z vi07, 1(vi12)        |  nop
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  nop
  mtir vi08, vf09.x          |  nop ;; mercprime stuff in og.

  ;; CHANGE: transformed vf08 (pre perspective divide, pfog mult)
  ;; is stored back over lump[2] (texs, text, nrmz, posz)
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01

  iadd vi03, vi03, vi12      |  nop
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  iadd vi04, vi04, vi03      |  add.xyzw vf12, vf12, vf18
  lq.xyz vf29, 4(vi08)       |  add.xyzw vf15, vf15, vf19
  lq.xyz vf30, 5(vi08)       |  nop
  iadd vi06, vi06, vi04      |  nop
  lq.xyzw vf31, 6(vi08)      |  nop
  lq.xyzw vf25, 0(vi08)      |  nop
  lq.xyzw vf26, 1(vi08)      |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  nop
  lq.xyzw vf27, 2(vi08)      |  nop
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22  ;; load rgba, hvdf offset
  iadd vi07, vi07, vi06      |  mulaz.xyzw ACC, vf29, vf09
  lq.xyzw vf28, 3(vi08)      |  maddaz.xyzw ACC, vf30, vf12
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  nop                        |  nop
  1024.0                     |  miniw.w vf08, vf08, vf03 :i
  nop                        |  mulaw.xyzw ACC, vf25, vf09  ;; modified, no mercprime branch
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20  ;;
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I       ;; like mercprime path (L82 in og merc)
  3072.0                     |  nop :i
  nop                        |  minii.xy vf08, vf08, I

;; CHANGE store back normal over RGBA.
  sq.xyzw vf11, -1(vi03)     |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  lqi.xyzw vf10, vi01        |  mulax.xyzw ACC, vf01, vf11
  ibne vi04, vi03, L4        |  madday.xyzw ACC, vf02, vf11 ;; branch to L4, pipelined mat 1
  nop                        |  maddz.xyzw vf11, vf03, vf11
  ibne vi06, vi03, L17       |  nop
  nop                        |  nop
  b L52                      |  nop
  nop                        |  nop

  ;; pipelined mat 1 loop start
L3: (L16 in og)
  sq.xyzw vf11, -1(vi03)     |  nop                             ;; normal store back
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i   ;; mercprime crap
  lqi.xyzw vf10, vi01        |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
  ;; pipelined mat 1 entry point
L4: (L17 in og)
  lqi.xyzw vf13, vi01        |  add.xyzw vf09, vf09, vf28
  lqi.xyzw vf16, vi01        |  maxw.w vf08, vf08, vf02
  mtir vi08, vf10.x          |  itof0.xyzw vf23, vf23
  ilw.y vi09, -9(vi01)       |  maxx.xyzw vf11, vf11, vf00
  sq.xyzw vf09, -4(vi01)     |  miniw.w vf09, vf09, vf01
  div Q, vf01.w, vf09.w      |  add.zw vf10, vf10, vf17
  move.xyzw vf21, vf08       |  add.xyzw vf13, vf13, vf18
  lq.xyz vf29, 4(vi08)       |  add.xyzw vf16, vf16, vf19
  lq.xyz vf30, 5(vi08)       |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L5             |  madday.xyzw ACC, vf05, vf11
  lq.xyzw vf31, 6(vi08)      |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L5: (L18 in og)
  lq.xyzw vf25, 0(vi08)      |  maddw.xyzw vf11, vf07, vf00
  lq.xyzw vf26, 1(vi08)      |  mul.xyz vf09, vf09, Q
  mtir vi12, vf13.x          |  mul.xyzw vf15, vf15, Q
  mtir vi15, vf13.y          |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf27, 2(vi08)      |  mul.xyzw vf11, vf11, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf09, vf09, vf22
  ibne vi00, vi09, L6        |  mulaz.xyzw ACC, vf29, vf10
  sq.xyzw vf21, 2(vi10)      |  maddaz.xyzw ACC, vf30, vf13
  nop                        |  ftoi4.xyzw vf21, vf08
L6: (L19 in og)
  mfp.w vf20, P              |  maddz.xyz vf13, vf31, vf16
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  miniw.w vf09, vf09, vf03
  sq.xyzw vf21, 2(vi13)      |  mulaw.xyzw ACC, vf25, vf10
  lq.xyzw vf28, 3(vi08)      |  mulw.xyzw vf12, vf12, vf20
  1024.0                     |  ftoi0.xyzw vf11, vf11 :i
  erleng.xyz P, vf13         |  maxi.xy vf09, vf09, I
  ibne vi04, vi03, L7        |  maddaw.xyzw ACC, vf26, vf13
  mr32.z vf16, vf00          |  maddw.xyzw vf10, vf27, vf16
  ibne vi06, vi03, L22       |  nop
  ilw.y vi09, -6(vi01)       |  nop
  ibne vi07, vi03, L57       |  nop
  nop                        |  nop
  b L67                      |  nop
  nop                        |  nop
L7: (L20 in og)
  sq.xyzw vf12, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  lqi.xyzw vf08, vi01        |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  lqi.xyzw vf11, vi01        |  add.xyzw vf10, vf10, vf28
  lqi.xyzw vf14, vi01        |  maxw.w vf09, vf09, vf02
  mtir vi08, vf08.x          |  itof0.xyzw vf23, vf23
  ilw.y vi09, -9(vi01)       |  maxx.xyzw vf12, vf12, vf00
  sq.xyzw vf10, -4(vi01)     |  miniw.w vf10, vf10, vf01
  div Q, vf01.w, vf10.w      |  add.zw vf08, vf08, vf17
  move.xyzw vf21, vf09       |  add.xyzw vf11, vf11, vf18
  lq.xyz vf29, 4(vi08)       |  add.xyzw vf14, vf14, vf19
  lq.xyz vf30, 5(vi08)       |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L8             |  madday.xyzw ACC, vf05, vf12
  lq.xyzw vf31, 6(vi08)      |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L8: (L21 in og)
  lq.xyzw vf25, 0(vi08)      |  maddw.xyzw vf12, vf07, vf00
  lq.xyzw vf26, 1(vi08)      |  mul.xyz vf10, vf10, Q
  mtir vi10, vf11.x          |  mul.xyzw vf16, vf16, Q
  mtir vi13, vf11.y          |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf27, 2(vi08)      |  mul.xyzw vf12, vf12, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf10, vf10, vf22
  ibne vi00, vi09, L9        |  mulaz.xyzw ACC, vf29, vf08
  sq.xyzw vf21, 2(vi11)      |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  ftoi4.xyzw vf21, vf09
L9: (L22 in og)
  mfp.w vf20, P              |  maddz.xyz vf11, vf31, vf14
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  miniw.w vf10, vf10, vf03
  sq.xyzw vf21, 2(vi14)      |  mulaw.xyzw ACC, vf25, vf08
  lq.xyzw vf28, 3(vi08)      |  mulw.xyzw vf13, vf13, vf20
  1024.0                     |  ftoi0.xyzw vf12, vf12 :i
  erleng.xyz P, vf11         |  maxi.xy vf10, vf10, I
  ibne vi04, vi03, L10       |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  ibne vi06, vi03, L27       |  nop
  ilw.y vi09, -6(vi01)       |  nop
  ibne vi07, vi03, L62       |  nop
  nop                        |  nop
  b L72                      |  nop
  nop                        |  nop
L10: (L23 in og)
  sq.xyzw vf13, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  lqi.xyzw vf09, vi01        |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  maxw.w vf10, vf10, vf02
  mtir vi08, vf09.x          |  itof0.xyzw vf23, vf23
  ilw.y vi09, -9(vi01)       |  maxx.xyzw vf13, vf13, vf00
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  move.xyzw vf21, vf10       |  add.xyzw vf12, vf12, vf18
  lq.xyz vf29, 4(vi08)       |  add.xyzw vf15, vf15, vf19
  lq.xyz vf30, 5(vi08)       |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L11            |  madday.xyzw ACC, vf05, vf13
  lq.xyzw vf31, 6(vi08)      |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L11: (L24 in og)
  lq.xyzw vf25, 0(vi08)      |  maddw.xyzw vf13, vf07, vf00
  lq.xyzw vf26, 1(vi08)      |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf27, 2(vi08)      |  mul.xyzw vf13, vf13, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
  ibne vi00, vi09, L12       |  mulaz.xyzw ACC, vf29, vf09
  sq.xyzw vf21, 2(vi12)      |  maddaz.xyzw ACC, vf30, vf12
  nop                        |  ftoi4.xyzw vf21, vf10
L12: (L25 in og)
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  miniw.w vf08, vf08, vf03
  sq.xyzw vf21, 2(vi15)      |  mulaw.xyzw ACC, vf25, vf09
  lq.xyzw vf28, 3(vi08)      |  mulw.xyzw vf11, vf11, vf20
  1024.0                     |  ftoi0.xyzw vf13, vf13 :i
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  ibne vi04, vi03, L3        |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  ibne vi06, vi03, L16       |  nop
  ilw.y vi09, -6(vi01)       |  nop
  ibne vi07, vi03, L51       |  nop
  nop                        |  nop
  b L77                      |  nop
  nop                        |  nop

L13: (L26 in og merc)
  ;; pipeline startup for mat 2's (assuming you have no mat1's)
  ibeq vi00, vi06, L47       |  nop
  iadd vi02, vi02, vi12      |  nop
  lqi.xyzw vf08, vi01        |  nop
  lqi.xyzw vf24, vi02        |  nop
  lqi.xyzw vf11, vi01        |  nop
  lqi.xyzw vf14, vi01        |  nop
  mtir vi10, vf08.x          |  nop
  mtir vi13, vf08.y          |  itof0.xyzw vf24, vf24
  iaddi vi06, vi06, -0x1     |  add.zw vf08, vf08, vf17
  nop                        |  add.xyzw vf11, vf11, vf18
  iand vi10, vi10, vi05      |  add.xyzw vf14, vf14, vf19
  nop                        |  mulw.xyzw vf24, vf24, vf29
  iand vi13, vi13, vi05      |  nop
  lq.xyzw vf20, 0(vi10)      |  nop
  lq.xyzw vf25, 0(vi13)      |  nop
  lq.xyzw vf23, 1(vi10)      |  nop
  lq.xyzw vf26, 1(vi13)      |  nop
  lq.xyzw vf20, 2(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi13)      |  maddy.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi10)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi13)      |  maddy.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi13)       |  maddy.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi10)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi13)       |  maddy.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi13)      |  maddy.xyz vf29, vf29, vf24
  mtir vi10, vf11.x          |  mulax.xyzw ACC, vf23, vf24
  mtir vi13, vf11.y          |  maddy.xyz vf30, vf30, vf24
  nop                        |  mulax.xyzw ACC, vf20, vf24
  nop                        |  maddy.xyzw vf31, vf31, vf24
  nop                        |  mulaz.xyzw ACC, vf29, vf08
  nop                        |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  maddz.xyz vf11, vf31, vf14
  nop                        |  nop
  nop                        |  nop
  nop                        |  mulaw.xyzw ACC, vf25, vf08
  nop                        |  nop
  erleng.xyz P, vf11         |  nop
  nop                        |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  lqi.xyzw vf09, vi01        |  nop
  ilwr.y vi03, vi12          |  nop
  ilw.z vi07, 1(vi12)        |  nop
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  nop
  mtir vi11, vf09.x          |  nop
  mtir vi14, vf09.y          |  nop
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  iadd vi03, vi03, vi12      |  add.xyzw vf12, vf12, vf18
  iand vi11, vi11, vi05      |  add.xyzw vf15, vf15, vf19
  iadd vi06, vi06, vi03      |  nop
  iadd vi07, vi07, vi06      |  nop
  iand vi14, vi14, vi05      |  nop
  ibne vi05, vi11, L14       |  nop
  iaddiu vi08, vi00, 0x23a   |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  nop
  b L15                      |  nop
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
L14: (L28 in og)
  lq.xyzw vf20, 0(vi11)      |  mul.xyzw vf14, vf14, Q
  lq.xyzw vf25, 0(vi14)      |  nop
  lq.xyzw vf23, 1(vi11)      |  nop
  lq.xyzw vf26, 1(vi14)      |  add.xyzw vf08, vf08, vf22
  lq.xyzw vf20, 2(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi14)      |  maddw.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi11)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi14)      |  maddw.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi14)       |  maddw.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi11)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi14)       |  maddw.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi14)      |  maddw.xyz vf29, vf29, vf24
  lqi.xyzw vf23, vi02        |  mulaz.xyzw ACC, vf23, vf24
  mtir vi11, vf12.x          |  maddw.xyz vf30, vf30, vf24
  mtir vi14, vf12.y          |  mulaz.xyzw ACC, vf20, vf24
  iaddiu vi08, vi00, 0x18c   |  maddw.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L15: (L29 in og)
  nop                        |  mulaz.xyzw ACC, vf29, vf09
  nop                        |  maddaz.xyzw ACC, vf30, vf12
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  nop                        |  nop
  1024.0                     |  miniw.w vf08, vf08, vf03 :i
  nop                        |  mulaw.xyzw ACC, vf25, vf09
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  3072.0                     |  nop :i
  sq.xyzw vf11, -1(vi03)     |  minii.xy vf08, vf08, I
  ibeq vi06, vi03, L50       |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  lqi.xyzw vf10, vi01        |  mulax.xyzw ACC, vf01, vf11
  jr vi08                    |  madday.xyzw ACC, vf02, vf11
  nop                        |  maddz.xyzw vf11, vf03, vf11
L16: (L30 in og)
  sq.xyzw vf11, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i
  lqi.xyzw vf10, vi01        |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
L17: (L31 in og)
  lqi.xyzw vf13, vi01        |  add.xyzw vf09, vf09, vf28
  lqi.xyzw vf16, vi01        |  maxw.w vf08, vf08, vf02
  mtir vi12, vf10.x          |  itof0.xyzw vf23, vf23
  mtir vi15, vf10.y          |  maxx.xyzw vf11, vf11, vf00
  sq.xyzw vf09, -4(vi01)     |  miniw.w vf09, vf09, vf01
  div Q, vf01.w, vf09.w      |  add.zw vf10, vf10, vf17
  move.xyzw vf21, vf08       |  add.xyzw vf13, vf13, vf18
  iand vi12, vi12, vi05      |  add.xyzw vf16, vf16, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L18            |  madday.xyzw ACC, vf05, vf11
  iand vi15, vi15, vi05      |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L18: (L32 in og)
  ibne vi05, vi12, L19       |  maddw.xyzw vf11, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf09, vf09, Q
  mtir vi12, vf13.x          |  mul.xyzw vf15, vf15, Q
  mtir vi15, vf13.y          |  ftoi4.xyzw vf21, vf21
  b L20                      |  mul.xyzw vf11, vf11, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf09, vf09, vf22
L19: (L33 in og)
  lq.xyzw vf20, 0(vi12)      |  mul.xyzw vf15, vf15, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf25, 0(vi15)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi12)      |  mul.xyzw vf11, vf11, vf23
  lq.xyzw vf26, 1(vi15)      |  add.xyzw vf09, vf09, vf22
  lq.xyzw vf20, 2(vi12)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi15)      |  maddy.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi12)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi15)      |  maddy.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi12)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi15)       |  maddy.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi12)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi15)       |  maddy.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi12)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi15)      |  maddy.xyz vf29, vf29, vf24
  mtir vi12, vf13.x          |  mulax.xyzw ACC, vf23, vf24
  mtir vi15, vf13.y          |  maddy.xyz vf30, vf30, vf24
  b L35                      |  mulax.xyzw ACC, vf20, vf24
  lqi.xyzw vf23, vi03        |  maddy.xyzw vf31, vf31, vf24
L20: (L34 in og)
  ibgez vi09, L21            |  mulaz.xyzw ACC, vf29, vf10
  sq.xyzw vf21, 2(vi10)      |  maddaz.xyzw ACC, vf30, vf13
  nop                        |  ftoi4.xyzw vf21, vf08
L21: (L35 in og)
  mfp.w vf20, P              |  maddz.xyz vf13, vf31, vf16
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  miniw.w vf09, vf09, vf03
  sq.xyzw vf21, 2(vi13)      |  mulaw.xyzw ACC, vf25, vf10
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf12, vf12, vf20
  1024.0                     |  ftoi0.xyzw vf11, vf11 :i
  erleng.xyz P, vf13         |  maxi.xy vf09, vf09, I
  ibne vi06, vi03, L22       |  maddaw.xyzw ACC, vf26, vf13
  mr32.z vf16, vf00          |  maddw.xyzw vf10, vf27, vf16
  ibne vi07, vi03, L57       |  nop
  nop                        |  nop
  b L67                      |  nop
  nop                        |  nop
L22: (L36 in og)
  sq.xyzw vf12, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  lqi.xyzw vf08, vi01        |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  lqi.xyzw vf11, vi01        |  add.xyzw vf10, vf10, vf28
  lqi.xyzw vf14, vi01        |  maxw.w vf09, vf09, vf02
  mtir vi10, vf08.x          |  itof0.xyzw vf23, vf23
  mtir vi13, vf08.y          |  maxx.xyzw vf12, vf12, vf00
  sq.xyzw vf10, -4(vi01)     |  miniw.w vf10, vf10, vf01
  div Q, vf01.w, vf10.w      |  add.zw vf08, vf08, vf17
  move.xyzw vf21, vf09       |  add.xyzw vf11, vf11, vf18
  iand vi10, vi10, vi05      |  add.xyzw vf14, vf14, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L23            |  madday.xyzw ACC, vf05, vf12
  iand vi13, vi13, vi05      |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L23: (L37 in og)
  ibne vi05, vi10, L24       |  maddw.xyzw vf12, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf10, vf10, Q
  mtir vi10, vf11.x          |  mul.xyzw vf16, vf16, Q
  mtir vi13, vf11.y          |  ftoi4.xyzw vf21, vf21
  b L25                      |  mul.xyzw vf12, vf12, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf10, vf10, vf22
L24: (L38 in og)
  lq.xyzw vf20, 0(vi10)      |  mul.xyzw vf16, vf16, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf25, 0(vi13)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi10)      |  mul.xyzw vf12, vf12, vf23
  lq.xyzw vf26, 1(vi13)      |  add.xyzw vf10, vf10, vf22
  lq.xyzw vf20, 2(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi13)      |  maddy.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi10)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi13)      |  maddy.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi13)       |  maddy.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi10)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi13)       |  maddy.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi10)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi13)      |  maddy.xyz vf29, vf29, vf24
  mtir vi10, vf11.x          |  mulax.xyzw ACC, vf23, vf24
  mtir vi13, vf11.y          |  maddy.xyz vf30, vf30, vf24
  b L40                      |  mulax.xyzw ACC, vf20, vf24
  lqi.xyzw vf23, vi03        |  maddy.xyzw vf31, vf31, vf24
L25: (L39 in og)
  ibgez vi09, L26            |  mulaz.xyzw ACC, vf29, vf08
  sq.xyzw vf21, 2(vi11)      |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  ftoi4.xyzw vf21, vf09
L26: (L40 in og)
  mfp.w vf20, P              |  maddz.xyz vf11, vf31, vf14
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  miniw.w vf10, vf10, vf03
  sq.xyzw vf21, 2(vi14)      |  mulaw.xyzw ACC, vf25, vf08
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf13, vf13, vf20
  1024.0                     |  ftoi0.xyzw vf12, vf12 :i
  erleng.xyz P, vf11         |  maxi.xy vf10, vf10, I
  ibne vi06, vi03, L27       |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  ibne vi07, vi03, L62       |  nop
  nop                        |  nop
  b L72                      |  nop
  nop                        |  nop
L27: (L41 in og)
  sq.xyzw vf13, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  lqi.xyzw vf09, vi01        |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  maxw.w vf10, vf10, vf02
  mtir vi11, vf09.x          |  itof0.xyzw vf23, vf23
  mtir vi14, vf09.y          |  maxx.xyzw vf13, vf13, vf00
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  move.xyzw vf21, vf10       |  add.xyzw vf12, vf12, vf18
  iand vi11, vi11, vi05      |  add.xyzw vf15, vf15, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L28            |  madday.xyzw ACC, vf05, vf13
  iand vi14, vi14, vi05      |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L28: (L42 in og)
  ibne vi05, vi11, L29       |  maddw.xyzw vf13, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  ftoi4.xyzw vf21, vf21
  b L30                      |  mul.xyzw vf13, vf13, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
L29: (L43 in og)
  lq.xyzw vf20, 0(vi11)      |  mul.xyzw vf14, vf14, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf25, 0(vi14)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi11)      |  mul.xyzw vf13, vf13, vf23
  lq.xyzw vf26, 1(vi14)      |  add.xyzw vf08, vf08, vf22
  lq.xyzw vf20, 2(vi11)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi14)      |  maddy.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi11)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi14)      |  maddy.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi11)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi14)       |  maddy.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi11)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi14)       |  maddy.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi11)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi14)      |  maddy.xyz vf29, vf29, vf24
  mtir vi11, vf12.x          |  mulax.xyzw ACC, vf23, vf24
  mtir vi14, vf12.y          |  maddy.xyz vf30, vf30, vf24
  b L45                      |  mulax.xyzw ACC, vf20, vf24
  lqi.xyzw vf23, vi03        |  maddy.xyzw vf31, vf31, vf24
L30: (L44 in og)
  ibgez vi09, L31            |  mulaz.xyzw ACC, vf29, vf09
  sq.xyzw vf21, 2(vi12)      |  maddaz.xyzw ACC, vf30, vf12
  nop                        |  ftoi4.xyzw vf21, vf10
L31: (L45 in og)
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  miniw.w vf08, vf08, vf03
  sq.xyzw vf21, 2(vi15)      |  mulaw.xyzw ACC, vf25, vf09
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20
  1024.0                     |  ftoi0.xyzw vf13, vf13 :i
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  ibne vi06, vi03, L16       |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  ibne vi07, vi03, L51       |  nop
  nop                        |  nop
  b L77                      |  nop
  nop                        |  nop
L32: (L46 in og)
  sq.xyzw vf11, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i
  lqi.xyzw vf10, vi01        |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
  lqi.xyzw vf13, vi01        |  add.xyzw vf09, vf09, vf28
  lqi.xyzw vf16, vi01        |  maxw.w vf08, vf08, vf02
  mtir vi12, vf10.x          |  itof0.xyzw vf23, vf23
  mtir vi15, vf10.y          |  maxx.xyzw vf11, vf11, vf00
  sq.xyzw vf09, -4(vi01)     |  miniw.w vf09, vf09, vf01
  div Q, vf01.w, vf09.w      |  add.zw vf10, vf10, vf17
  move.xyzw vf21, vf08       |  add.xyzw vf13, vf13, vf18
  iand vi12, vi12, vi05      |  add.xyzw vf16, vf16, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L33            |  madday.xyzw ACC, vf05, vf11
  iand vi15, vi15, vi05      |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L33: (L47 in og)
  ibne vi05, vi12, L34       |  maddw.xyzw vf11, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf09, vf09, Q
  mtir vi12, vf13.x          |  mul.xyzw vf15, vf15, Q
  mtir vi15, vf13.y          |  ftoi4.xyzw vf21, vf21
  b L35                      |  mul.xyzw vf11, vf11, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf09, vf09, vf22
L34: (L48 in og)
  lq.xyzw vf20, 0(vi12)      |  mul.xyzw vf15, vf15, Q
  lq.xyzw vf25, 0(vi15)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi12)      |  mul.xyzw vf11, vf11, vf23
  lq.xyzw vf26, 1(vi15)      |  add.xyzw vf09, vf09, vf22
  lq.xyzw vf20, 2(vi12)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi15)      |  maddw.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi12)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi15)      |  maddw.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi12)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi15)       |  maddw.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi12)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi15)       |  maddw.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi12)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi15)      |  maddw.xyz vf29, vf29, vf24
  lqi.xyzw vf23, vi02        |  mulaz.xyzw ACC, vf23, vf24
  mtir vi12, vf13.x          |  maddw.xyz vf30, vf30, vf24
  mtir vi15, vf13.y          |  mulaz.xyzw ACC, vf20, vf24
  b L20                      |  maddw.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L35: (L49 in og)
  ibgez vi09, L36            |  mulaz.xyzw ACC, vf29, vf10
  sq.xyzw vf21, 2(vi10)      |  maddaz.xyzw ACC, vf30, vf13
  nop                        |  ftoi4.xyzw vf21, vf08
L36: (L50 in og)
  mfp.w vf20, P              |  maddz.xyz vf13, vf31, vf16
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  miniw.w vf09, vf09, vf03
  sq.xyzw vf21, 2(vi13)      |  mulaw.xyzw ACC, vf25, vf10
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf12, vf12, vf20
  1024.0                     |  ftoi0.xyzw vf11, vf11 :i
  erleng.xyz P, vf13         |  maxi.xy vf09, vf09, I
  ibne vi06, vi03, L37       |  maddaw.xyzw ACC, vf26, vf13
  mr32.z vf16, vf00          |  maddw.xyzw vf10, vf27, vf16
  ibne vi07, vi03, L57       |  nop
  nop                        |  nop
  b L67                      |  nop
  nop                        |  nop
L37: (L51 in og)
  sq.xyzw vf12, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  lqi.xyzw vf08, vi01        |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  lqi.xyzw vf11, vi01        |  add.xyzw vf10, vf10, vf28
  lqi.xyzw vf14, vi01        |  maxw.w vf09, vf09, vf02
  mtir vi10, vf08.x          |  itof0.xyzw vf23, vf23
  mtir vi13, vf08.y          |  maxx.xyzw vf12, vf12, vf00
  sq.xyzw vf10, -4(vi01)     |  miniw.w vf10, vf10, vf01
  div Q, vf01.w, vf10.w      |  add.zw vf08, vf08, vf17
  move.xyzw vf21, vf09       |  add.xyzw vf11, vf11, vf18
  iand vi10, vi10, vi05      |  add.xyzw vf14, vf14, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L38            |  madday.xyzw ACC, vf05, vf12
  iand vi13, vi13, vi05      |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L38: (L52 in og)
  ibne vi05, vi10, L39       |  maddw.xyzw vf12, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf10, vf10, Q
  mtir vi10, vf11.x          |  mul.xyzw vf16, vf16, Q
  mtir vi13, vf11.y          |  ftoi4.xyzw vf21, vf21
  b L40                      |  mul.xyzw vf12, vf12, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf10, vf10, vf22
L39: (L53 in og)
  lq.xyzw vf20, 0(vi10)      |  mul.xyzw vf16, vf16, Q
  lq.xyzw vf25, 0(vi13)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi10)      |  mul.xyzw vf12, vf12, vf23
  lq.xyzw vf26, 1(vi13)      |  add.xyzw vf10, vf10, vf22
  lq.xyzw vf20, 2(vi10)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi13)      |  maddw.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi10)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi13)      |  maddw.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi10)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi13)       |  maddw.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi10)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi13)       |  maddw.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi10)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi13)      |  maddw.xyz vf29, vf29, vf24
  lqi.xyzw vf23, vi02        |  mulaz.xyzw ACC, vf23, vf24
  mtir vi10, vf11.x          |  maddw.xyz vf30, vf30, vf24
  mtir vi13, vf11.y          |  mulaz.xyzw ACC, vf20, vf24
  b L25                      |  maddw.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L40: (L54 in og)
  ibgez vi09, L41            |  mulaz.xyzw ACC, vf29, vf08
  sq.xyzw vf21, 2(vi11)      |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  ftoi4.xyzw vf21, vf09
L41: (L55 in og)
  mfp.w vf20, P              |  maddz.xyz vf11, vf31, vf14
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  miniw.w vf10, vf10, vf03
  sq.xyzw vf21, 2(vi14)      |  mulaw.xyzw ACC, vf25, vf08
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf13, vf13, vf20
  1024.0                     |  ftoi0.xyzw vf12, vf12 :i
  erleng.xyz P, vf11         |  maxi.xy vf10, vf10, I
  ibne vi06, vi03, L42       |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  ibne vi07, vi03, L62       |  nop
  nop                        |  nop
  b L72                      |  nop
  nop                        |  nop
L42: (L56 in og)
  sq.xyzw vf13, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  lqi.xyzw vf09, vi01        |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  maxw.w vf10, vf10, vf02
  mtir vi11, vf09.x          |  itof0.xyzw vf23, vf23
  mtir vi14, vf09.y          |  maxx.xyzw vf13, vf13, vf00
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  move.xyzw vf21, vf10       |  add.xyzw vf12, vf12, vf18
  iand vi11, vi11, vi05      |  add.xyzw vf15, vf15, vf19
  nop                        |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L43            |  madday.xyzw ACC, vf05, vf13
  iand vi14, vi14, vi05      |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L43: (L57 in og)
  ibne vi05, vi11, L44       |  maddw.xyzw vf13, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  ftoi4.xyzw vf21, vf21
  b L45                      |  mul.xyzw vf13, vf13, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
L44: (L58 in og)
  lq.xyzw vf20, 0(vi11)      |  mul.xyzw vf14, vf14, Q
  lq.xyzw vf25, 0(vi14)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf23, 1(vi11)      |  mul.xyzw vf13, vf13, vf23
  lq.xyzw vf26, 1(vi14)      |  add.xyzw vf08, vf08, vf22
  lq.xyzw vf20, 2(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf27, 2(vi14)      |  maddw.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 3(vi11)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyzw vf28, 3(vi14)      |  maddw.xyzw vf26, vf26, vf24
  lq.xyzw vf20, 4(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi14)       |  maddw.xyzw vf27, vf27, vf24
  lq.xyzw vf23, 5(vi11)      |  mulaz.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi14)       |  maddw.xyzw vf28, vf28, vf24
  lq.xyzw vf20, 6(vi11)      |  mulaz.xyzw ACC, vf20, vf24
  lq.xyzw vf31, 6(vi14)      |  maddw.xyz vf29, vf29, vf24
  lqi.xyzw vf23, vi02        |  mulaz.xyzw ACC, vf23, vf24
  mtir vi11, vf12.x          |  maddw.xyz vf30, vf30, vf24
  mtir vi14, vf12.y          |  mulaz.xyzw ACC, vf20, vf24
  b L30                      |  maddw.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L45: (L59 in og)
  ibgez vi09, L46            |  mulaz.xyzw ACC, vf29, vf09
  sq.xyzw vf21, 2(vi12)      |  maddaz.xyzw ACC, vf30, vf12
  nop                        |  ftoi4.xyzw vf21, vf10
L46: (L60 in og)
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  miniw.w vf08, vf08, vf03
  sq.xyzw vf21, 2(vi15)      |  mulaw.xyzw ACC, vf25, vf09
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20
  1024.0                     |  ftoi0.xyzw vf13, vf13 :i
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  ibne vi06, vi03, L32       |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  ibne vi07, vi03, L51       |  nop
  nop                        |  nop
  b L77                      |  nop
  nop                        |  nop

;; mat 3 (three bone matrices blended per vertex, weights in vf24.xyz)
L47:
  lqi.xyzw vf08, vi01        |  nop
  lqi.xyzw vf24, vi02        |  nop
  lqi.xyzw vf11, vi01        |  nop
  lqi.xyzw vf14, vi01        |  nop
  mtir vi10, vf08.x          |  nop
  mtir vi13, vf08.y          |  itof0.xyzw vf24, vf24
  nop                        |  add.zw vf08, vf08, vf17
  nop                        |  add.xyzw vf11, vf11, vf18
  iand vi10, vi10, vi05      |  add.xyzw vf14, vf14, vf19
  ilw.w vi08, -1(vi02)       |  mulw.xyzw vf24, vf24, vf29
  iand vi13, vi13, vi05      |  nop
  lq.xyzw vf20, 0(vi10)      |  nop
  lq.xyzw vf31, 0(vi13)      |  nop
  lq.xyzw vf25, 0(vi08)      |  nop
  lq.xyzw vf23, 1(vi10)      |  nop
  lq.xyzw vf20, 1(vi13)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf26, 1(vi08)      |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 2(vi10)      |  maddz.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 2(vi13)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf27, 2(vi08)      |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 3(vi10)      |  maddz.xyzw vf26, vf26, vf24
  lq.xyzw vf31, 3(vi13)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf28, 3(vi08)      |  madday.xyzw ACC, vf23, vf24
  lq.xyzw vf23, 4(vi10)      |  maddz.xyzw vf27, vf27, vf24
  lq.xyzw vf20, 4(vi13)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi08)       |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 5(vi10)      |  maddz.xyzw vf28, vf28, vf24
  lq.xyzw vf23, 5(vi13)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi08)       |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 6(vi10)      |  maddz.xyz vf29, vf29, vf24
  lq.xyzw vf22, 6(vi13)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 6(vi08)      |  madday.xyzw ACC, vf23, vf24
  lqi.xyzw vf23, vi02        |  maddz.xyz vf30, vf30, vf24
  mtir vi10, vf11.x          |  mulax.xyzw ACC, vf20, vf24
  mtir vi13, vf11.y          |  madday.xyzw ACC, vf22, vf24
  lq.xyzw vf22, 2(vi00)      |  maddz.xyzw vf31, vf31, vf24
  nop                        |  itof0.xyzw vf24, vf23
  nop                        |  mulaz.xyzw ACC, vf29, vf08
  nop                        |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  maddz.xyz vf11, vf31, vf14
  nop                        |  nop
  nop                        |  nop
  nop                        |  mulaw.xyzw ACC, vf25, vf08
  nop                        |  nop
  erleng.xyz P, vf11         |  nop
  nop                        |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
  lqi.xyzw vf09, vi01        |  nop
  ilwr.y vi03, vi12          |  nop
  ilw.z vi07, 1(vi12)        |  nop
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  nop
  mtir vi11, vf09.x          |  nop
  mtir vi14, vf09.y          |  nop
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  iadd vi03, vi03, vi12      |  add.xyzw vf12, vf12, vf18
  iand vi11, vi11, vi05      |  add.xyzw vf15, vf15, vf19
  ilw.w vi08, -1(vi02)       |  nop
  iadd vi07, vi07, vi03      |  nop
  iand vi14, vi14, vi05      |  nop
  ibne vi05, vi11, L48       |  nop
  iaddi vi07, vi07, -0x1     |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  nop
  b L49                      |  nop
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
L48:
  lq.xyzw vf20, 0(vi11)      |  mul.xyzw vf14, vf14, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf31, 0(vi14)      |  nop
  lq.xyzw vf25, 0(vi08)      |  nop
  lq.xyzw vf23, 1(vi11)      |  add.xyzw vf08, vf08, vf22
  lq.xyzw vf20, 1(vi14)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf26, 1(vi08)      |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 2(vi11)      |  maddz.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 2(vi14)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf27, 2(vi08)      |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 3(vi11)      |  maddz.xyzw vf26, vf26, vf24
  lq.xyzw vf31, 3(vi14)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf28, 3(vi08)      |  madday.xyzw ACC, vf23, vf24
  lq.xyzw vf23, 4(vi11)      |  maddz.xyzw vf27, vf27, vf24
  lq.xyzw vf20, 4(vi14)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi08)       |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 5(vi11)      |  maddz.xyzw vf28, vf28, vf24
  lq.xyzw vf23, 5(vi14)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi08)       |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 6(vi11)      |  maddz.xyz vf29, vf29, vf24
  lq.xyzw vf22, 6(vi14)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 6(vi08)      |  madday.xyzw ACC, vf23, vf24
  lqi.xyzw vf23, vi02        |  maddz.xyz vf30, vf30, vf24
  mtir vi11, vf12.x          |  mulax.xyzw ACC, vf20, vf24
  mtir vi14, vf12.y          |  madday.xyzw ACC, vf22, vf24
  lq.xyzw vf22, 2(vi00)      |  maddz.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L49:
  nop                        |  mulaz.xyzw ACC, vf29, vf09
  nop                        |  maddaz.xyzw ACC, vf30, vf12
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  nop                        |  nop
  1024.0                     |  miniw.w vf08, vf08, vf03 :i
  nop                        |  mulaw.xyzw ACC, vf25, vf09
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  3072.0                     |  nop :i
  sq.xyzw vf11, -1(vi03)     |  minii.xy vf08, vf08, I
  nop                        |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
L50:
  lqi.xyzw vf10, vi01        |  mulax.xyzw ACC, vf01, vf11
  b L52                      |  madday.xyzw ACC, vf02, vf11
  nop                        |  maddz.xyzw vf11, vf03, vf11
L51:
  sq.xyzw vf11, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i
  lqi.xyzw vf10, vi01        |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
L52:
  lqi.xyzw vf13, vi01        |  add.xyzw vf09, vf09, vf28
  lqi.xyzw vf16, vi01        |  maxw.w vf08, vf08, vf02
  mtir vi12, vf10.x          |  itof0.xyzw vf23, vf23
  mtir vi15, vf10.y          |  maxx.xyzw vf11, vf11, vf00
  sq.xyzw vf09, -4(vi01)     |  miniw.w vf09, vf09, vf01
  div Q, vf01.w, vf09.w      |  add.zw vf10, vf10, vf17
  move.xyzw vf21, vf08       |  add.xyzw vf13, vf13, vf18
  iand vi12, vi12, vi05      |  add.xyzw vf16, vf16, vf19
  ilw.w vi08, -1(vi02)       |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L53            |  madday.xyzw ACC, vf05, vf11
  iand vi15, vi15, vi05      |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L53:
  ibne vi05, vi12, L54       |  maddw.xyzw vf11, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf09, vf09, Q
  mtir vi12, vf13.x          |  mul.xyzw vf15, vf15, Q
  mtir vi15, vf13.y          |  ftoi4.xyzw vf21, vf21
  b L55                      |  mul.xyzw vf11, vf11, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf09, vf09, vf22
L54:
  lq.xyzw vf20, 0(vi12)      |  mul.xyzw vf15, vf15, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf31, 0(vi15)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf25, 0(vi08)      |  mul.xyzw vf11, vf11, vf23
  lq.xyzw vf23, 1(vi12)      |  add.xyzw vf09, vf09, vf22
  lq.xyzw vf20, 1(vi15)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf26, 1(vi08)      |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 2(vi12)      |  maddz.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 2(vi15)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf27, 2(vi08)      |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 3(vi12)      |  maddz.xyzw vf26, vf26, vf24
  lq.xyzw vf31, 3(vi15)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf28, 3(vi08)      |  madday.xyzw ACC, vf23, vf24
  lq.xyzw vf23, 4(vi12)      |  maddz.xyzw vf27, vf27, vf24
  lq.xyzw vf20, 4(vi15)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi08)       |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 5(vi12)      |  maddz.xyzw vf28, vf28, vf24
  lq.xyzw vf23, 5(vi15)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi08)       |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 6(vi12)      |  maddz.xyz vf29, vf29, vf24
  lq.xyzw vf22, 6(vi15)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 6(vi08)      |  madday.xyzw ACC, vf23, vf24
  lqi.xyzw vf23, vi02        |  maddz.xyz vf30, vf30, vf24
  mtir vi12, vf13.x          |  mulax.xyzw ACC, vf20, vf24
  mtir vi15, vf13.y          |  madday.xyzw ACC, vf22, vf24
  lq.xyzw vf22, 2(vi00)      |  maddz.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L55: (L70 in og)
  ibgez vi09, L56            |  mulaz.xyzw ACC, vf29, vf10
  sq.xyzw vf21, 2(vi10)      |  maddaz.xyzw ACC, vf30, vf13
  nop                        |  ftoi4.xyzw vf21, vf08
L56:
  mfp.w vf20, P              |  maddz.xyz vf13, vf31, vf16
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  miniw.w vf09, vf09, vf03
  sq.xyzw vf21, 2(vi13)      |  mulaw.xyzw ACC, vf25, vf10
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf12, vf12, vf20
  1024.0                     |  ftoi0.xyzw vf11, vf11 :i
  erleng.xyz P, vf13         |  maxi.xy vf09, vf09, I
  ibeq vi07, vi03, L67       |  maddaw.xyzw ACC, vf26, vf13
  mr32.z vf16, vf00          |  maddw.xyzw vf10, vf27, vf16
L57:
  sq.xyzw vf12, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  lqi.xyzw vf08, vi01        |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  lqi.xyzw vf11, vi01        |  add.xyzw vf10, vf10, vf28
  lqi.xyzw vf14, vi01        |  maxw.w vf09, vf09, vf02
  mtir vi10, vf08.x          |  itof0.xyzw vf23, vf23
  mtir vi13, vf08.y          |  maxx.xyzw vf12, vf12, vf00
  sq.xyzw vf10, -4(vi01)     |  miniw.w vf10, vf10, vf01
  div Q, vf01.w, vf10.w      |  add.zw vf08, vf08, vf17
  move.xyzw vf21, vf09       |  add.xyzw vf11, vf11, vf18
  iand vi10, vi10, vi05      |  add.xyzw vf14, vf14, vf19
  ilw.w vi08, -1(vi02)       |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L58            |  madday.xyzw ACC, vf05, vf12
  iand vi13, vi13, vi05      |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L58:
  ibne vi05, vi10, L59       |  maddw.xyzw vf12, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf10, vf10, Q
  mtir vi10, vf11.x          |  mul.xyzw vf16, vf16, Q
  mtir vi13, vf11.y          |  ftoi4.xyzw vf21, vf21
  b L60                      |  mul.xyzw vf12, vf12, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf10, vf10, vf22
L59:
  lq.xyzw vf20, 0(vi10)      |  mul.xyzw vf16, vf16, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf31, 0(vi13)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf25, 0(vi08)      |  mul.xyzw vf12, vf12, vf23
  lq.xyzw vf23, 1(vi10)      |  add.xyzw vf10, vf10, vf22
  lq.xyzw vf20, 1(vi13)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf26, 1(vi08)      |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 2(vi10)      |  maddz.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 2(vi13)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf27, 2(vi08)      |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 3(vi10)      |  maddz.xyzw vf26, vf26, vf24
  lq.xyzw vf31, 3(vi13)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf28, 3(vi08)      |  madday.xyzw ACC, vf23, vf24
  lq.xyzw vf23, 4(vi10)      |  maddz.xyzw vf27, vf27, vf24
  lq.xyzw vf20, 4(vi13)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi08)       |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 5(vi10)      |  maddz.xyzw vf28, vf28, vf24
  lq.xyzw vf23, 5(vi13)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi08)       |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 6(vi10)      |  maddz.xyz vf29, vf29, vf24
  lq.xyzw vf22, 6(vi13)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 6(vi08)      |  madday.xyzw ACC, vf23, vf24
  lqi.xyzw vf23, vi02        |  maddz.xyz vf30, vf30, vf24
  mtir vi10, vf11.x          |  mulax.xyzw ACC, vf20, vf24
  mtir vi13, vf11.y          |  madday.xyzw ACC, vf22, vf24
  lq.xyzw vf22, 2(vi00)      |  maddz.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L60:
  ibgez vi09, L61            |  mulaz.xyzw ACC, vf29, vf08
  sq.xyzw vf21, 2(vi11)      |  maddaz.xyzw ACC, vf30, vf11
  nop                        |  ftoi4.xyzw vf21, vf09
L61:
  mfp.w vf20, P              |  maddz.xyz vf11, vf31, vf14
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  miniw.w vf10, vf10, vf03
  sq.xyzw vf21, 2(vi14)      |  mulaw.xyzw ACC, vf25, vf08
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf13, vf13, vf20
  1024.0                     |  ftoi0.xyzw vf12, vf12 :i
  erleng.xyz P, vf11         |  maxi.xy vf10, vf10, I
  ibeq vi07, vi03, L72       |  maddaw.xyzw ACC, vf26, vf11
  mr32.z vf14, vf00          |  maddw.xyzw vf08, vf27, vf14
L62:
  sq.xyzw vf13, -1(vi03)     |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  lqi.xyzw vf09, vi01        |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  lqi.xyzw vf12, vi01        |  add.xyzw vf08, vf08, vf28
  lqi.xyzw vf15, vi01        |  maxw.w vf10, vf10, vf02
  mtir vi11, vf09.x          |  itof0.xyzw vf23, vf23
  mtir vi14, vf09.y          |  maxx.xyzw vf13, vf13, vf00
  sq.xyzw vf08, -4(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  add.zw vf09, vf09, vf17
  move.xyzw vf21, vf10       |  add.xyzw vf12, vf12, vf18
  iand vi11, vi11, vi05      |  add.xyzw vf15, vf15, vf19
  ilw.w vi08, -1(vi02)       |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L63            |  madday.xyzw ACC, vf05, vf13
  iand vi14, vi14, vi05      |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L63:
  ibne vi05, vi11, L64       |  maddw.xyzw vf13, vf07, vf00
  ilw.x vi09, -9(vi01)       |  mul.xyz vf08, vf08, Q
  mtir vi11, vf12.x          |  mul.xyzw vf14, vf14, Q
  mtir vi14, vf12.y          |  ftoi4.xyzw vf21, vf21
  b L65                      |  mul.xyzw vf13, vf13, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
L64:
  lq.xyzw vf20, 0(vi11)      |  mul.xyzw vf14, vf14, Q
  nop                        |  mulw.xyzw vf24, vf24, vf29
  lq.xyzw vf31, 0(vi14)      |  ftoi4.xyzw vf21, vf21
  lq.xyzw vf25, 0(vi08)      |  mul.xyzw vf13, vf13, vf23
  lq.xyzw vf23, 1(vi11)      |  add.xyzw vf08, vf08, vf22
  lq.xyzw vf20, 1(vi14)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyzw vf26, 1(vi08)      |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 2(vi11)      |  maddz.xyzw vf25, vf25, vf24
  lq.xyzw vf23, 2(vi14)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyzw vf27, 2(vi08)      |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 3(vi11)      |  maddz.xyzw vf26, vf26, vf24
  lq.xyzw vf31, 3(vi14)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf28, 3(vi08)      |  madday.xyzw ACC, vf23, vf24
  lq.xyzw vf23, 4(vi11)      |  maddz.xyzw vf27, vf27, vf24
  lq.xyzw vf20, 4(vi14)      |  mulax.xyzw ACC, vf20, vf24
  lq.xyz vf29, 4(vi08)       |  madday.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 5(vi11)      |  maddz.xyzw vf28, vf28, vf24
  lq.xyzw vf23, 5(vi14)      |  mulax.xyzw ACC, vf23, vf24
  lq.xyz vf30, 5(vi08)       |  madday.xyzw ACC, vf20, vf24
  lq.xyzw vf20, 6(vi11)      |  maddz.xyz vf29, vf29, vf24
  lq.xyzw vf22, 6(vi14)      |  mulax.xyzw ACC, vf31, vf24
  lq.xyzw vf31, 6(vi08)      |  madday.xyzw ACC, vf23, vf24
  lqi.xyzw vf23, vi02        |  maddz.xyz vf30, vf30, vf24
  mtir vi11, vf12.x          |  mulax.xyzw ACC, vf20, vf24
  mtir vi14, vf12.y          |  madday.xyzw ACC, vf22, vf24
  lq.xyzw vf22, 2(vi00)      |  maddz.xyzw vf31, vf31, vf24
  lqi.xyzw vf23, vi03        |  itof0.xyzw vf24, vf23
L65:
  ibgez vi09, L66            |  mulaz.xyzw ACC, vf29, vf09
  sq.xyzw vf21, 2(vi12)      |  maddaz.xyzw ACC, vf30, vf12
  nop                        |  ftoi4.xyzw vf21, vf10
L66: (L80 in og)
  mfp.w vf20, P              |  maddz.xyz vf12, vf31, vf15
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  miniw.w vf08, vf08, vf03
  sq.xyzw vf21, 2(vi15)      |  mulaw.xyzw ACC, vf25, vf09
  ilw.y vi09, -6(vi01)       |  mulw.xyzw vf11, vf11, vf20
  1024.0                     |  ftoi0.xyzw vf13, vf13 :i
  erleng.xyz P, vf12         |  maxi.xy vf08, vf08, I
  ibne vi07, vi03, L51       |  maddaw.xyzw ACC, vf26, vf12
  mr32.z vf15, vf00          |  maddw.xyzw vf09, vf27, vf15
  b L77                      |  nop
  nop                        |  nop

;;;;;;;;;;; OG merc has a bunch of merc prime alternate paths here.

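;;;; Next are the three pipeline exits (L67, L72, L77), one per pipeline rotation.
;;;; Each one drains the in-flight vertices and then branches to L82.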

L67:
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  sq.xyzw vf12, -1(vi03)     |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  iaddiu vi05, vi00, 0x173   |  add.xyzw vf10, vf10, vf28
  lq.xyzw vf26, 1(vi00)      |  maxw.w vf09, vf09, vf02
  iaddi vi08, vi00, 0x1      |  itof0.xyzw vf23, vf23
  isw.x vi08, -2(vi05)       |  maxx.xyzw vf12, vf12, vf00
  sq.xyzw vf10, -1(vi01)     |  miniw.w vf10, vf10, vf01
  div Q, vf01.w, vf10.w      |  nop
  move.xyzw vf21, vf09       |  nop
  iaddiu vi08, vi00, 0x42    |  nop
  isw.z vi08, -1(vi05)       |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L68            |  madday.xyzw ACC, vf05, vf12
  isw.x vi00, -1(vi05)       |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L68:
  sq.yzw vf26, -2(vi05)      |  maddw.xyzw vf12, vf07, vf00
  ilw.x vi09, -6(vi01)       |  mul.xyz vf10, vf10, Q
  iaddiu vi08, vi00, 0x171   |  mul.xyzw vf16, vf16, Q
  nop                        |  ftoi4.xyzw vf21, vf21
  nop                        |  mul.xyzw vf12, vf12, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf10, vf10, vf22
  ibgez vi09, L69            |  nop
  sq.xyzw vf21, 2(vi11)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf09
L69:
  mfp.w vf20, P              |  nop
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  miniw.w vf10, vf10, vf03
  sq.xyzw vf21, 2(vi14)      |  nop
  ilw.y vi09, -3(vi01)       |  mulw.xyzw vf13, vf13, vf20
  1024.0                     |  ftoi0.xyzw vf12, vf12 :i
  nop                        |  maxi.xy vf10, vf10, I
  nop                        |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  sq.xyzw vf13, -1(vi03)     |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  nop                        |  nop
  nop                        |  maxw.w vf10, vf10, vf02
  nop                        |  itof0.xyzw vf23, vf23
  nop                        |  maxx.xyzw vf13, vf13, vf00
  nop                        |  nop
  move.xyzw vf21, vf10       |  nop
  nop                        |  nop
  nop                        |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L70            |  madday.xyzw ACC, vf05, vf13
  nop                        |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L70:
  nop                        |  maddw.xyzw vf13, vf07, vf00
  ilw.x vi09, -3(vi01)       |  nop
  xtop vi05                  |  nop
  iaddiu vi05, vi05, 0x8c    |  ftoi4.xyzw vf21, vf21
  ilwr.z vi01, vi05          |  mul.xyzw vf13, vf13, vf23
  ilwr.y vi03, vi05          |  nop
  ibgez vi09, L71            |  nop
  sq.xyzw vf21, 2(vi12)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf10
L71:
  nop                        |  nop
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  nop
  sq.xyzw vf21, 2(vi15)      |  nop
  nop                        |  nop
  nop                        |  ftoi0.xyzw vf13, vf13
  lq.xyzw vf23, 124(vi00)    |  nop
  iadd vi01, vi01, vi05      |  nop
  iadd vi03, vi03, vi05      |  nop
  sq.xyzw vf13, 1(vi12)      |  nop
  b L82                      |  nop
  sq.xyzw vf13, 1(vi15)      |  nop


L72:
  3072.0                     |  mulax.xyzw ACC, vf01, vf13 :i
  sq.xyzw vf13, -1(vi03)     |  minii.xy vf10, vf10, I
  sq.xyzw vf12, 1(vi11)      |  madday.xyzw ACC, vf02, vf13
  sq.xyzw vf12, 1(vi14)      |  maddz.xyzw vf13, vf03, vf13
  iaddiu vi05, vi00, 0x173   |  add.xyzw vf08, vf08, vf28
  lq.xyzw vf26, 1(vi00)      |  maxw.w vf10, vf10, vf02
  iaddi vi08, vi00, 0x1      |  itof0.xyzw vf23, vf23
  isw.x vi08, -2(vi05)       |  maxx.xyzw vf13, vf13, vf00
  sq.xyzw vf08, -1(vi01)     |  miniw.w vf08, vf08, vf01
  div Q, vf01.w, vf08.w      |  nop
  move.xyzw vf21, vf10       |  nop
  iaddiu vi08, vi00, 0x42    |  nop
  isw.z vi08, -1(vi05)       |  mulax.xyzw ACC, vf04, vf13
  ibgtz vi09, L73            |  madday.xyzw ACC, vf05, vf13
  isw.x vi00, -1(vi05)       |  maddaz.xyzw ACC, vf06, vf13
  nop                        |  addx.w vf21, vf21, vf17
L73:
  sq.yzw vf26, -2(vi05)      |  maddw.xyzw vf13, vf07, vf00
  ilw.x vi09, -6(vi01)       |  mul.xyz vf08, vf08, Q
  iaddiu vi08, vi00, 0x171   |  mul.xyzw vf14, vf14, Q
  nop                        |  ftoi4.xyzw vf21, vf21
  nop                        |  mul.xyzw vf13, vf13, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf08, vf08, vf22
  ibgez vi09, L74            |  nop
  sq.xyzw vf21, 2(vi12)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf10
L74:
  mfp.w vf20, P              |  nop
  sq.xyzw vf16, 0(vi12)      |  miniy.xyzw vf13, vf13, vf17
  sq.xyzw vf16, 0(vi15)      |  miniw.w vf08, vf08, vf03
  sq.xyzw vf21, 2(vi15)      |  nop
  ilw.y vi09, -3(vi01)       |  mulw.xyzw vf11, vf11, vf20
  1024.0                     |  ftoi0.xyzw vf13, vf13 :i
  nop                        |  maxi.xy vf08, vf08, I
  nop                        |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i
  sq.xyzw vf11, -1(vi03)     |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
  nop                        |  nop
  nop                        |  maxw.w vf08, vf08, vf02
  nop                        |  itof0.xyzw vf23, vf23
  nop                        |  maxx.xyzw vf11, vf11, vf00
  nop                        |  nop
  move.xyzw vf21, vf08       |  nop
  nop                        |  nop
  nop                        |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L75            |  madday.xyzw ACC, vf05, vf11
  nop                        |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L75:
  nop                        |  maddw.xyzw vf11, vf07, vf00
  ilw.x vi09, -3(vi01)       |  nop
  xtop vi05                  |  nop
  iaddiu vi05, vi05, 0x8c    |  ftoi4.xyzw vf21, vf21
  ilwr.z vi01, vi05          |  mul.xyzw vf11, vf11, vf23
  ilwr.y vi03, vi05          |  nop
  ibgez vi09, L76            |  nop
  sq.xyzw vf21, 2(vi10)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf08
L76:
  nop                        |  nop
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  nop
  sq.xyzw vf21, 2(vi13)      |  nop
  nop                        |  nop
  nop                        |  ftoi0.xyzw vf11, vf11
  lq.xyzw vf23, 124(vi00)    |  nop                          ;; unperspect
  iadd vi01, vi01, vi05      |  nop                          ;; lump
  iadd vi03, vi03, vi05      |  nop                          ;; rgba
  sq.xyzw vf11, 1(vi10)      |  nop
  b L82                      |  nop
  sq.xyzw vf11, 1(vi13)      |  nop



L77:
  3072.0                     |  mulax.xyzw ACC, vf01, vf11 :i
  sq.xyzw vf11, -1(vi03)     |  minii.xy vf08, vf08, I
  sq.xyzw vf13, 1(vi12)      |  madday.xyzw ACC, vf02, vf11
  sq.xyzw vf13, 1(vi15)      |  maddz.xyzw vf11, vf03, vf11
  iaddiu vi05, vi00, 0x173   |  add.xyzw vf09, vf09, vf28
  lq.xyzw vf26, 1(vi00)      |  maxw.w vf08, vf08, vf02
  iaddi vi08, vi00, 0x1      |  itof0.xyzw vf23, vf23
  isw.x vi08, -2(vi05)       |  maxx.xyzw vf11, vf11, vf00
  sq.xyzw vf09, -1(vi01)     |  miniw.w vf09, vf09, vf01
  div Q, vf01.w, vf09.w      |  nop
  move.xyzw vf21, vf08       |  nop
  iaddiu vi08, vi00, 0x42    |  nop
  isw.z vi08, -1(vi05)       |  mulax.xyzw ACC, vf04, vf11
  ibgtz vi09, L78            |  madday.xyzw ACC, vf05, vf11
  isw.x vi00, -1(vi05)       |  maddaz.xyzw ACC, vf06, vf11
  nop                        |  addx.w vf21, vf21, vf17
L78:
  sq.yzw vf26, -2(vi05)      |  maddw.xyzw vf11, vf07, vf00
  ilw.x vi09, -6(vi01)       |  mul.xyz vf09, vf09, Q
  iaddiu vi08, vi00, 0x171   |  mul.xyzw vf15, vf15, Q ;; vi08 = 0x171: output location (fixed?)
  nop                        |  ftoi4.xyzw vf21, vf21
  nop                        |  mul.xyzw vf11, vf11, vf23
  lqi.xyzw vf23, vi03        |  add.xyzw vf09, vf09, vf22
  ibgez vi09, L79            |  nop
  sq.xyzw vf21, 2(vi10)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf08
L79:
  mfp.w vf20, P              |  nop
  sq.xyzw vf14, 0(vi10)      |  miniy.xyzw vf11, vf11, vf17
  sq.xyzw vf14, 0(vi13)      |  miniw.w vf09, vf09, vf03
  sq.xyzw vf21, 2(vi13)      |  nop
  ilw.y vi09, -3(vi01)       |  mulw.xyzw vf12, vf12, vf20
  1024.0                     |  ftoi0.xyzw vf11, vf11 :i
  nop                        |  maxi.xy vf09, vf09, I
  nop                        |  nop
  3072.0                     |  mulax.xyzw ACC, vf01, vf12 :i
  sq.xyzw vf12, -1(vi03)     |  minii.xy vf09, vf09, I
  sq.xyzw vf11, 1(vi10)      |  madday.xyzw ACC, vf02, vf12
  sq.xyzw vf11, 1(vi13)      |  maddz.xyzw vf12, vf03, vf12
  nop                        |  nop
  nop                        |  maxw.w vf09, vf09, vf02
  nop                        |  itof0.xyzw vf23, vf23
  nop                        |  maxx.xyzw vf12, vf12, vf00
  nop                        |  nop
  move.xyzw vf21, vf09       |  nop
  nop                        |  nop
  nop                        |  mulax.xyzw ACC, vf04, vf12
  ibgtz vi09, L80            |  madday.xyzw ACC, vf05, vf12
  nop                        |  maddaz.xyzw ACC, vf06, vf12
  nop                        |  addx.w vf21, vf21, vf17
L80:
  nop                        |  maddw.xyzw vf12, vf07, vf00
  ilw.x vi09, -3(vi01)       |  nop
  xtop vi05                  |  nop
  iaddiu vi05, vi05, 0x8c    |  ftoi4.xyzw vf21, vf21          ;; vi05 = byte-header
  ilwr.z vi01, vi05          |  mul.xyzw vf12, vf12, vf23      ;; vi01 = lump
  ilwr.y vi03, vi05          |  nop                            ;; vi03 = rgba
  ibgez vi09, L81            |  nop
  sq.xyzw vf21, 2(vi11)      |  nop
  nop                        |  ftoi4.xyzw vf21, vf09
L81:
  nop                        |  nop
  sq.xyzw vf15, 0(vi11)      |  miniy.xyzw vf12, vf12, vf17
  sq.xyzw vf15, 0(vi14)      |  nop
  sq.xyzw vf21, 2(vi14)      |  nop
  nop                        |  nop
  nop                        |  ftoi0.xyzw vf12, vf12
  lq.xyzw vf23, 124(vi00)    |  nop                          ;; unperspect
  iadd vi01, vi01, vi05      |  nop                          ;; lump
  iadd vi03, vi03, vi05      |  nop                          ;; rgba
  sq.xyzw vf12, 1(vi11)      |  nop
  sq.xyzw vf12, 1(vi14)      |  nop

;; COMMON finish part

L82:
  xgkick vi08                |  nop                         ;; normal draw?

;; pipeline startup for envmap math
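;; per-vertex math, as far as I can tell (de-pipelined; p = vf09, n = the normal in vf10..vf13):
;;   p     = vert * vf23                         (unperspect)
;;   n.z   = n.z - 1.0                           (subw.z with vf00.w = 1.0)
;;   p.z   = p.w                                 (addw.z with vf00.z = 0.0)
;;   d     = p.x*n.x + p.y*n.y + p.z*n.z         (mul.xyz, adday, maddz.x with vf21 = 1)
;;   r     = p + n * (d / n.z)                   (div, mulaw, madd)
;;   len   = length(r.xyz)                       (eleng, mfp.w)
;;   r.xyz = vf23.z + r.xyz * (vf23.z / len)     (div, addaz, madd)
;;   st.xy = r.xy * st.z                         (mulz.xy, st = vf24 loaded from the normal output)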
  lq.xyzw vf08, 2(vi01)      |  nop                         ;; vf08 = transformed vert
  lqi.xyzw vf10, vi03        |  nop                         ;; vf10 = transformed normal
  ilw.x vi04, 1(vi05)        |  nop                         ;; vi04 = mat1-cnt
  ilw.y vi06, 1(vi05)        |  nop                         ;; vi06 = mat2-cnt
  ilw.z vi07, 1(vi05)        |  mul.xyzw vf09, vf08, vf23   ;; vi07 = mat3-cnt, unperspect the vert
  iadd vi04, vi04, vi06      |  subw.z vf10, vf10, vf00     ;; vi04 = mat1-cnt + mat2-cnt, refl1
  iaddi vi01, vi01, 0x3      |  nop                         ;; step lump
  iadd vi04, vi04, vi07      |  nop                         ;; vi04 = mat1 + mat2 + mat3 counts
  iadd vi02, vi03, vi04      |  addw.z vf09, vf00, vf09     ;; vi02 = end rgba, vert1
  iaddi vi02, vi02, 0x2      |  nop                         ;; end rgba more
  lq.xyzw vf14, 118(vi00)    |  maxw.xyzw vf21, vf00, vf00  ;; vf14 = rgba-fade, vf21 = [1, 1, 1, 1]
  lq.xyzw vf26, 371(vi00)    |  nop                         ;; vf26 = the giftag
  nop                        |  mul.xyz vf15, vf09, vf10    ;; multiply
  lq.xyzw vf27, 119(vi00)    |  nop                         ;; vf27 = e-adgif0
  nop                        |  nop
  lq.xyzw vf28, 120(vi00)    |  nop                         ;; vf28 = e-adgif1
  nop                        |  adday.xyzw vf15, vf15
  lq.xyzw vf31, 121(vi00)    |  maddz.x vf15, vf21, vf15    ;; vf31 = e-adgif2
  nop                        |  nop
  sq.xyzw vf26, 813(vi00)    |  nop                         ;; store giftag
  lq.xyzw vf08, 2(vi01)      |  nop ;; pipe
  lqi.xyzw vf11, vi03        |  nop ;; pipe
  div Q, vf15.x, vf10.z      |  nop                         ;; div
  sq.xyzw vf27, 814(vi00)    |  mulaw.xyzw ACC, vf09, vf00  ;; store e-ad0, mul
  nop                        |  mul.xyzw vf09, vf08, vf23   ;; pipe
  sq.xyzw vf28, 815(vi00)    |  subw.z vf11, vf11, vf00     ;; store e-ad1, pipe
  iaddi vi01, vi01, 0x3      |  nop ;; pipe
  sq.xyzw vf31, 816(vi00)    |  nop                         ;; store e-ad2
  nop                        |  addw.z vf09, vf00, vf09     ;; pipe
  lq.xyzw vf26, 0(vi00)      |  madd.xyzw vf10, vf10, Q     ;; vf26 = tristrip giftag, madd
  nop                        |  nop
  lq.xyzw vf27, 122(vi00)    |  nop                         ;; vf27 = e-ad3
  nop                        |  mul.xyz vf15, vf09, vf11    ;; pipe
  eleng.xyz P, vf10          |  nop                         ;; len
  lq.xyzw vf28, 123(vi00)    |  nop                         ;; vf28 = e-ad4
  nop                        |  nop
  lq.xyzw vf31, 377(vi00)    |  adday.xyzw vf15, vf15       ;; vf31 = old tristrip???
  nop                        |  maddz.x vf15, vf21, vf15    ;; pipe
  mr32.xyzw vf26, vf26       |  nop                         ;; rotate tristrip template
  nop                        |  nop
  lq.xyzw vf08, 2(vi01)      |  nop ;; pipe
  lqi.xyzw vf12, vi03        |  nop ;; pipe
  div Q, vf15.x, vf11.z      |  nop ;; pipe
  mr32.xyzw vf26, vf26       |  mulaw.xyzw ACC, vf09, vf00 ;; rotate | pipe
  sq.xyzw vf27, 817(vi00)    |  mul.xyzw vf09, vf08, vf23  ;; store adgif3 | pipe
  lq.xyzw vf25, -5(vi01)     |  subw.z vf12, vf12, vf00    ;; vf25 = lump[1] | pipe
  iaddi vi01, vi01, 0x3      |  nop                        ;; pipe
  sq.xyzw vf28, 818(vi00)    |  nop                        ;; e-ad4 store
  nop                        |  addw.z vf09, vf00, vf09    ;; pipe
  sq.xyzw vf31, 819(vi00)    |  madd.xyzw vf11, vf11, Q    ;; tristrip store | pipe
  nop                        |  nop
  mfp.w vf10, P              |  nop
  sq.y vf26, 819(vi00)       |  mul.xyz vf15, vf09, vf12   ;; set abe | pipe
  eleng.xyz P, vf11          |  nop
  nop                        |  nop
  div Q, vf23.z, vf10.w      |  nop ;; NOT PIPE (!)
  nop                        |  adday.xyzw vf15, vf15 ;; pipe
  nop                        |  maddz.x vf15, vf21, vf15 ;; pipe
  nop                        |  nop
  nop                        |  add.xyzw vf25, vf25, vf18 ;; lump dest stuff
L83:
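;; loop is unrolled 4x, the normal rotates through vf10 -> vf11 -> vf12 -> vf13.
;; each vertex has two destinations (vi10, vi13, from vf25.x / vf25.y). for each one, I think:
;;   442(dst) = envmap st (vf24, built from the st at 0(vi10))
;;   443(dst) = rgba-fade (vf14)
;;   444(dst) = copy of the position at 2(dst) (vf16 / vf20)
;; so the envmap output looks like the normal output shifted by 442 qw (371 + 442 = 813, 377 + 442 = 819).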
  lq.xyzw vf08, 2(vi01)      |  nop                        ;; pipe
  lqi.xyzw vf13, vi03        |  addaz.xyzw vf00, vf23
  div Q, vf15.x, vf12.z      |  madd.xyzw vf10, vf10, Q
  mtir vi10, vf25.x          |  mulaw.xyzw ACC, vf09, vf00
  mtir vi13, vf25.y          |  mul.xyzw vf09, vf08, vf23
  lq.xyzw vf25, -5(vi01)     |  subw.z vf13, vf13, vf00
  ;;
  iaddi vi01, vi01, 0x3      |  nop
  lq.xyzw vf24, 0(vi10)      |  nop
  lq.xyzw vf16, 2(vi10)      |  addw.z vf09, vf00, vf09
  lq.xyzw vf20, 2(vi13)      |  madd.xyzw vf12, vf12, Q
  sq.xyzw vf14, 443(vi10)    |  nop
  mfp.w vf11, P              |  nop
  sq.xyzw vf14, 443(vi13)    |  mul.xyz vf15, vf09, vf13
  eleng.xyz P, vf12          |  mulz.xy vf24, vf10, vf24
  sq.xyzw vf16, 444(vi10)    |  nop
  div Q, vf23.z, vf11.w      |  nop
  sq.xyzw vf20, 444(vi13)    |  adday.xyzw vf15, vf15
  sq.xyzw vf24, 442(vi10)    |  maddz.x vf15, vf21, vf15
  ibeq vi02, vi03, L84       |  nop
  sq.xyzw vf24, 442(vi13)    |  add.xyzw vf25, vf25, vf18
  lq.xyzw vf08, 2(vi01)      |  nop
  lqi.xyzw vf10, vi03        |  addaz.xyzw vf00, vf23
  div Q, vf15.x, vf13.z      |  madd.xyzw vf11, vf11, Q
  mtir vi10, vf25.x          |  mulaw.xyzw ACC, vf09, vf00
  mtir vi13, vf25.y          |  mul.xyzw vf09, vf08, vf23
  lq.xyzw vf25, -5(vi01)     |  subw.z vf10, vf10, vf00
  iaddi vi01, vi01, 0x3      |  nop
  lq.xyzw vf24, 0(vi10)      |  nop
  lq.xyzw vf16, 2(vi10)      |  addw.z vf09, vf00, vf09
  lq.xyzw vf20, 2(vi13)      |  madd.xyzw vf13, vf13, Q
  sq.xyzw vf14, 443(vi10)    |  nop
  mfp.w vf12, P              |  nop
  sq.xyzw vf14, 443(vi13)    |  mul.xyz vf15, vf09, vf10
  eleng.xyz P, vf13          |  mulz.xy vf24, vf11, vf24
  sq.xyzw vf16, 444(vi10)    |  nop
  div Q, vf23.z, vf12.w      |  nop
  sq.xyzw vf20, 444(vi13)    |  adday.xyzw vf15, vf15
  sq.xyzw vf24, 442(vi10)    |  maddz.x vf15, vf21, vf15
  ibeq vi02, vi03, L84       |  nop
  sq.xyzw vf24, 442(vi13)    |  add.xyzw vf25, vf25, vf18
  lq.xyzw vf08, 2(vi01)      |  nop
  lqi.xyzw vf11, vi03        |  addaz.xyzw vf00, vf23
  div Q, vf15.x, vf10.z      |  madd.xyzw vf12, vf12, Q
  mtir vi10, vf25.x          |  mulaw.xyzw ACC, vf09, vf00
  mtir vi13, vf25.y          |  mul.xyzw vf09, vf08, vf23
  lq.xyzw vf25, -5(vi01)     |  subw.z vf11, vf11, vf00
  iaddi vi01, vi01, 0x3      |  nop
  lq.xyzw vf24, 0(vi10)      |  nop
  lq.xyzw vf16, 2(vi10)      |  addw.z vf09, vf00, vf09
  lq.xyzw vf20, 2(vi13)      |  madd.xyzw vf10, vf10, Q
  sq.xyzw vf14, 443(vi10)    |  nop
  mfp.w vf13, P              |  nop
  sq.xyzw vf14, 443(vi13)    |  mul.xyz vf15, vf09, vf11
  eleng.xyz P, vf10          |  mulz.xy vf24, vf12, vf24
  sq.xyzw vf16, 444(vi10)    |  nop
  div Q, vf23.z, vf13.w      |  nop
  sq.xyzw vf20, 444(vi13)    |  adday.xyzw vf15, vf15
  sq.xyzw vf24, 442(vi10)    |  maddz.x vf15, vf21, vf15
  ibeq vi02, vi03, L84       |  nop
  sq.xyzw vf24, 442(vi13)    |  add.xyzw vf25, vf25, vf18
  lq.xyzw vf08, 2(vi01)      |  nop
  lqi.xyzw vf12, vi03        |  addaz.xyzw vf00, vf23
  div Q, vf15.x, vf11.z      |  madd.xyzw vf13, vf13, Q
  mtir vi10, vf25.x          |  mulaw.xyzw ACC, vf09, vf00
  mtir vi13, vf25.y          |  mul.xyzw vf09, vf08, vf23
  lq.xyzw vf25, -5(vi01)     |  subw.z vf12, vf12, vf00
  iaddi vi01, vi01, 0x3      |  nop
  lq.xyzw vf24, 0(vi10)      |  nop
  lq.xyzw vf16, 2(vi10)      |  addw.z vf09, vf00, vf09
  lq.xyzw vf20, 2(vi13)      |  madd.xyzw vf11, vf11, Q
  sq.xyzw vf14, 443(vi10)    |  nop
  mfp.w vf10, P              |  nop
  sq.xyzw vf14, 443(vi13)    |  mul.xyz vf15, vf09, vf12
  eleng.xyz P, vf11          |  mulz.xy vf24, vf13, vf24
  sq.xyzw vf16, 444(vi10)    |  nop
  div Q, vf23.z, vf10.w      |  nop
  sq.xyzw vf20, 444(vi13)    |  adday.xyzw vf15, vf15
  sq.xyzw vf24, 442(vi10)    |  maddz.x vf15, vf21, vf15
  ibne vi02, vi03, L83       |  nop
  sq.xyzw vf24, 442(vi13)    |  add.xyzw vf25, vf25, vf18
L84:
  iaddiu vi08, vi00, 0x32d   |  nop                         ;; vi08 = 0x32d (813): where the envmap giftag was stored above
  xgkick vi08                |  nop                         ;; envmap draw?
  nop                        |  nop :e
  nop                        |  nop