jak-project/decompiler/README.md

How to use
-----------
Compile (Linux):
```
mkdir build
cd build
cmake ..
make -j
cd ..
```

After compiling:
First create a folder for the output and create a folder for the input.  Add all of the CGO/DGO files into the input folder.
```
build/jak_disassembler config/jak1_ntsc_black_label.jsonc in_folder/ out_folder/
```


Notes
--------
The `config` folder has settings for the disassembly. Currently Jak 2 and Jak 3 are not as well supported as Jak 1.


# Procedure

## ObjectFileDB
The `ObjectFileDB` tracks unique object files. The games have a lot of duplicated objected files, and object files with the same names but different contents, so `ObjectFileDB` is used to create a unique name for each unique object file. It generates a file named `dgo.txt` which maps its names to the original name and which DGO files it appears in.  The `ObjectFileDB` extracts all object files from a DGO file, decompressing the DGO first if needed. (note: Jak 2 demo DGOs do not decompress properly). Each object file has a number of segments, which the game can load to separate places.  Sometimes there is just a single "data" segment, and other times there are three segments:

- `top-level` is executed at the end of the linking process, then discarded and goes in a special temporary heap
- `main` is loaded and linked onto the specified heap
- `debug` is loaded and linked onto the debug heap


## `ObjectFileDB::process_link_data`
This function interprets breaks the object file's data into segments, and processes the link data. The data is stored as a sequence of `LinkedObjectWord`s, which contain extra data from the link.  The `LinkedObjectWord`s are stored by segment in a `LinkedObjectFile`, which also contains a list of `Label`s that allow `LinkedObjectWord`s to refer to other `LinkedObjectWord`s.  Note that a `Label` can have a byte-offset into a word, which GOAL uses to load non-4-byte-aligned bytes and halfwords, and also to represent a `pair` object.

## `ObjectFileDB::find_code`
This function looks through the `LinkedObjectFile`s and splits each segment into data and code zones.  

The only files with code zones are from object files with three segments, and the code always comes first.  The end of the code zone is found by looking for the last GOAL `function` object, then finding the end of this object by looking one word past the last `jr ra` instruction.  This assumes that the last function in each segment doesn't have an extra inline assembly `jr ra` somewhere in the middle, but functions with multiple `jr ra`'s are extremely rare (and not generated by the GOAL compiler without the use of inline assembly), so this seems like a safe assumption for now.

The code zones are scanned for GOAL `function` types, which are in front every GOAL function, and used to create `Functions`.  Each `Function` is disassembled into EE Instructions, which also adds `Label`s for branch instructions, and can also contain linking data when appropriate.  The final step is to look for instructions which use the `fp` register to reference static data, and insert the apprioriate `Label`s. GOAL uses the following `fp` relative addressing modes:

- `lw`, `lwc1`, `ld` relative to the `fp` register to load static data.
- `daddiu` to create a pointer to fp-relative data within +/- `2^15` bytes
- Sequence of `ori`, `daddu` to generate a pointer that reaches within `+2^16` bytes
- Sequence of `lui`, `ori`, `daddu` to generate any 32-bit offset from `fp`.

The last two are only found in very large object files, and GOALDIS doesn't handle these.

The `fp` register is set with this sequence.  The function prologue only sets `fp` if it is needed in the function.

```
;; goal function call, t9 contains the function address
jalr ra, t9
sll v0, ra, 0 

;; example goal function prologue:
daddiu sp, sp, -16 
sd ra, 0(sp)
sd fp, 8(sp)
or fp, t9, r0
```

Note: there are a few hacks to avoid generating labels when `fp` is used as a temporary register in inline assembly. Like ignoring stores/loads of `fp` from the stack (kernel does this to suspend resume a thread), or ignoring `fp` when used with the `PEXTLW` function, or totally skipping this step for a single object file in Jak 2 (`effect-control`).

## `ObjectFileDB::process_labels`
This step simply renames labels with `L1`, `L2`, ....  It should happen before any custom label naming as it will overwrite all label names.

## `ObjectFileDB::find_and_write_scripts`
Looks for static linked lists and attempts to print them.  Doesn't support printing everything, but can print nested lists, strings, numbers, and symbols.

## `ObjectFileDB::write_object_file_words`
Dumps words in each segment like `hexdump`. There's an option to only run this on `v3` object files, which contain data, as opposed to `v2` which are typically large data.

## `ObjectFileDB::write_disassembly`
Like `write_object_file_words`, but code is replaced with disassembly.  There's a config option to avoid running this on object files with no functions, as these are usually large data files which are uninteresting to view as a binary dump and slow to dump.

## Basic Block Finding 
Look at branch intstructions and their destinations to find all basic blocks.  Implemented in `find_blocks_in_function` as part of `analyze_functions`.  This works for Jak 1, 2 and 3.

## Analyze Functions Prologues and Epilogues
This will help us find stack variables and make sure that the prologue/epilogue are ignored by the statement generation.

A "full" prologue looks like this:
```
    daddiu sp, sp, -208
    sd ra, 0(sp)       
    sd fp, 8(sp)       
    or fp, t9, r0        ;; set fp to the address of this function
    sq s3, 128(sp)     
    sq s4, 144(sp)     
    sq s5, 160(sp)     
    sq gp, 176(sp)     
    swc1 f26, 192(sp)  
    swc1 f28, 196(sp)  
    swc1 f30, 200(sp)  
```
GOAL will leave out instructions that aren't needed.  This prologue is "decoded" into:

```
Total stack usage: 0xd0 bytes
$fp set? : yes
$ra set? : yes
Stack variables : yes, 112 bytes at sp + 16
Saved gprs: gp s5 s4 s3
Saved fprs: f30 f28 f26
```
A similar process is done for the epilogue, and it is checked against the prologue.

The prologue is removed from the first basic block and the epilogue + alignment padding is removed from the last one.

# Documentation of Planned Steps that are not implemented
Currently the focus is to get these working for Jak 1. But it shouldn't be much extra work to support Jak 2/3.


## Guess Function Names (to be implemented)

When possible, we should guess function names. It's not always possible because GOAL supports anonymous lambda functions, like for example:

```
(lambda ((x int) (y int)) (+ x y))
```

which will generate a GOAL `function` object without a name.

But these are pretty uncommon, and the majority of GOAL functions are

- Normal functions, which are stored into a `symbol` with the same name as the function
- Methods, which are stored into the method table of their `type` with the `method-set!` function. Sadly we can't get the name of methods, but we can get their ID (to figure out the inheritance hierarchy) and what type they are defined for.
- State handlers / behaviors (not yet fully understood)
- Virtual state handlers / behaviors (not yet fully understood)

Currently the state/behavior stuff isn't well understood, or used in the early initialization of the game, so name guessing won't worry about this for now.

## Guess Types (to be implemented)

The majority of GOAL types have a compiler-generated `inpsect` method which prints their fields. We should detect these methods in the previous function name guessing step, and then read through them to determine the data layout of the type.


## Control Flow Analysis

The basic blocks should be built into a graph and annotated with control flow patterns, like `if`, `cond`, `and`, and various loops.  To do this, register liveliness will be determined for each instruction.

## Conversion to Statements

Instructions (or sequences of instructions that should not be separated) should be converted into `Statement`s, which represent something like `(add! r1 r2 r3)`.  The registers should be mapped to variables, using as many variables as possible, as we don't know at this point if a register will be holding the same GOAL variable at different instructions.

## Type propagation
`Variable`s should get types determined by arguments of the function, which should then be propagated to other `Statement`s in the function, and can then refine the argument types of other functions. This process should be repeated until things stop changing.

## Variable declaration
Variables which are actually the same variable will be merged.  The point at which variables are first defined/declared will be determined based on liveliness and then expanded to come up with a scope nesting that doesn't cross control flow boundaries.


## Statement -> S-Expression map tree
Due to the the simple single pass GOAL compiler design, we build a tree which represents how Statements can be combined to eliminate variables.  As an extremely simple example:

```
(set! r1 thing1)
(set! r2 thing2)
(add-int! r4 r2 r3)
(mult-int! r1 r4)
```
 can be collapsed to
```
(* thing1 (+ thing2 r3))
```

But 
```
(set! r2 thing2)
(add-int! r4 r2 r3)
(set! r1 thing1)
(mult-int! r1 r4)
```
 can be collapsed to
```
(let ((temp0 (+ thing2 r3)))
  (+ thing1 temp0)
)
  
```

and this difference will actually reflect the difference in how the code was originally written! This is a huge advantage over existing decompilers, which will be unable to tell the subtle difference between the two.


## Macro pattern matching
Lots of GOAL language features are implemented with macros, so once the s-expression nesting is recovered, we can pattern match to undo macros very precisely.
add decompiler 2020-08-22 23:30:17 -04:00			`How to use`
			`-----------`
			`Compile (Linux):`
			```
			`mkdir build`
			`cd build`
			`cmake ..`
			`make -j`
			`cd ..`
			```

			`After compiling:`
			`First create a folder for the output and create a folder for the input. Add all of the CGO/DGO files into the input folder.`
			```
			`build/jak_disassembler config/jak1_ntsc_black_label.jsonc in_folder/ out_folder/`
			```


			`Notes`
			`--------`
			The `config` folder has settings for the disassembly. Currently Jak 2 and Jak 3 are not as well supported as Jak 1.


			`# Procedure`

			`## ObjectFileDB`
			The `ObjectFileDB` tracks unique object files. The games have a lot of duplicated objected files, and object files with the same names but different contents, so `ObjectFileDB` is used to create a unique name for each unique object file. It generates a file named `dgo.txt` which maps its names to the original name and which DGO files it appears in. The `ObjectFileDB` extracts all object files from a DGO file, decompressing the DGO first if needed. (note: Jak 2 demo DGOs do not decompress properly). Each object file has a number of segments, which the game can load to separate places. Sometimes there is just a single "data" segment, and other times there are three segments:

			- `top-level` is executed at the end of the linking process, then discarded and goes in a special temporary heap
			- `main` is loaded and linked onto the specified heap
			- `debug` is loaded and linked onto the debug heap


			## `ObjectFileDB::process_link_data`
			This function interprets breaks the object file's data into segments, and processes the link data. The data is stored as a sequence of `LinkedObjectWord`s, which contain extra data from the link. The `LinkedObjectWord`s are stored by segment in a `LinkedObjectFile`, which also contains a list of `Label`s that allow `LinkedObjectWord`s to refer to other `LinkedObjectWord`s. Note that a `Label` can have a byte-offset into a word, which GOAL uses to load non-4-byte-aligned bytes and halfwords, and also to represent a `pair` object.

			## `ObjectFileDB::find_code`
			This function looks through the `LinkedObjectFile`s and splits each segment into data and code zones.

			The only files with code zones are from object files with three segments, and the code always comes first. The end of the code zone is found by looking for the last GOAL `function` object, then finding the end of this object by looking one word past the last `jr ra` instruction. This assumes that the last function in each segment doesn't have an extra inline assembly `jr ra` somewhere in the middle, but functions with multiple `jr ra`'s are extremely rare (and not generated by the GOAL compiler without the use of inline assembly), so this seems like a safe assumption for now.

			The code zones are scanned for GOAL `function` types, which are in front every GOAL function, and used to create `Functions`. Each `Function` is disassembled into EE Instructions, which also adds `Label`s for branch instructions, and can also contain linking data when appropriate. The final step is to look for instructions which use the `fp` register to reference static data, and insert the apprioriate `Label`s. GOAL uses the following `fp` relative addressing modes:

			- `lw`, `lwc1`, `ld` relative to the `fp` register to load static data.
			- `daddiu` to create a pointer to fp-relative data within +/- `2^15` bytes
			- Sequence of `ori`, `daddu` to generate a pointer that reaches within `+2^16` bytes
			- Sequence of `lui`, `ori`, `daddu` to generate any 32-bit offset from `fp`.

			`The last two are only found in very large object files, and GOALDIS doesn't handle these.`

			The `fp` register is set with this sequence. The function prologue only sets `fp` if it is needed in the function.

			```
			`;; goal function call, t9 contains the function address`
			`jalr ra, t9`
			`sll v0, ra, 0`

			`;; example goal function prologue:`
			`daddiu sp, sp, -16`
			`sd ra, 0(sp)`
			`sd fp, 8(sp)`
			`or fp, t9, r0`
			```

			Note: there are a few hacks to avoid generating labels when `fp` is used as a temporary register in inline assembly. Like ignoring stores/loads of `fp` from the stack (kernel does this to suspend resume a thread), or ignoring `fp` when used with the `PEXTLW` function, or totally skipping this step for a single object file in Jak 2 (`effect-control`).

			## `ObjectFileDB::process_labels`
			This step simply renames labels with `L1`, `L2`, .... It should happen before any custom label naming as it will overwrite all label names.

			## `ObjectFileDB::find_and_write_scripts`
			`Looks for static linked lists and attempts to print them. Doesn't support printing everything, but can print nested lists, strings, numbers, and symbols.`

			## `ObjectFileDB::write_object_file_words`
			Dumps words in each segment like `hexdump`. There's an option to only run this on `v3` object files, which contain data, as opposed to `v2` which are typically large data.

			## `ObjectFileDB::write_disassembly`
			Like `write_object_file_words`, but code is replaced with disassembly. There's a config option to avoid running this on object files with no functions, as these are usually large data files which are uninteresting to view as a binary dump and slow to dump.

			`## Basic Block Finding`
			Look at branch intstructions and their destinations to find all basic blocks. Implemented in `find_blocks_in_function` as part of `analyze_functions`. This works for Jak 1, 2 and 3.

			`## Analyze Functions Prologues and Epilogues`
			`This will help us find stack variables and make sure that the prologue/epilogue are ignored by the statement generation.`

			`A "full" prologue looks like this:`
			```
			`daddiu sp, sp, -208`
			`sd ra, 0(sp)`
			`sd fp, 8(sp)`
			`or fp, t9, r0 ;; set fp to the address of this function`
			`sq s3, 128(sp)`
			`sq s4, 144(sp)`
			`sq s5, 160(sp)`
			`sq gp, 176(sp)`
			`swc1 f26, 192(sp)`
			`swc1 f28, 196(sp)`
			`swc1 f30, 200(sp)`
			```
			`GOAL will leave out instructions that aren't needed. This prologue is "decoded" into:`

			```
			`Total stack usage: 0xd0 bytes`
			`$fp set? : yes`
			`$ra set? : yes`
			`Stack variables : yes, 112 bytes at sp + 16`
			`Saved gprs: gp s5 s4 s3`
			`Saved fprs: f30 f28 f26`
			```
			`A similar process is done for the epilogue, and it is checked against the prologue.`

			`The prologue is removed from the first basic block and the epilogue + alignment padding is removed from the last one.`

			`# Documentation of Planned Steps that are not implemented`
			`Currently the focus is to get these working for Jak 1. But it shouldn't be much extra work to support Jak 2/3.`


			`## Guess Function Names (to be implemented)`

			`When possible, we should guess function names. It's not always possible because GOAL supports anonymous lambda functions, like for example:`

			```
			`(lambda ((x int) (y int)) (+ x y))`
			```

			which will generate a GOAL `function` object without a name.

			`But these are pretty uncommon, and the majority of GOAL functions are`

			- Normal functions, which are stored into a `symbol` with the same name as the function
			- Methods, which are stored into the method table of their `type` with the `method-set!` function. Sadly we can't get the name of methods, but we can get their ID (to figure out the inheritance hierarchy) and what type they are defined for.
			`- State handlers / behaviors (not yet fully understood)`
			`- Virtual state handlers / behaviors (not yet fully understood)`

			`Currently the state/behavior stuff isn't well understood, or used in the early initialization of the game, so name guessing won't worry about this for now.`

			`## Guess Types (to be implemented)`

			The majority of GOAL types have a compiler-generated `inpsect` method which prints their fields. We should detect these methods in the previous function name guessing step, and then read through them to determine the data layout of the type.


			`## Control Flow Analysis`

			The basic blocks should be built into a graph and annotated with control flow patterns, like `if`, `cond`, `and`, and various loops. To do this, register liveliness will be determined for each instruction.

			`## Conversion to Statements`

			Instructions (or sequences of instructions that should not be separated) should be converted into `Statement`s, which represent something like `(add! r1 r2 r3)`. The registers should be mapped to variables, using as many variables as possible, as we don't know at this point if a register will be holding the same GOAL variable at different instructions.

			`## Type propagation`
			`Variable`s should get types determined by arguments of the function, which should then be propagated to other `Statement`s in the function, and can then refine the argument types of other functions. This process should be repeated until things stop changing.

			`## Variable declaration`
			`Variables which are actually the same variable will be merged. The point at which variables are first defined/declared will be determined based on liveliness and then expanded to come up with a scope nesting that doesn't cross control flow boundaries.`


			`## Statement -> S-Expression map tree`
			`Due to the the simple single pass GOAL compiler design, we build a tree which represents how Statements can be combined to eliminate variables. As an extremely simple example:`

			```
			`(set! r1 thing1)`
			`(set! r2 thing2)`
			`(add-int! r4 r2 r3)`
			`(mult-int! r1 r4)`
			```
			`can be collapsed to`
			```
			`(* thing1 (+ thing2 r3))`
			```

			`But`
			```
			`(set! r2 thing2)`
			`(add-int! r4 r2 r3)`
			`(set! r1 thing1)`
			`(mult-int! r1 r4)`
			```
			`can be collapsed to`
			```
			`(let ((temp0 (+ thing2 r3)))`
			`(+ thing1 temp0)`
			`)`

			```

			`and this difference will actually reflect the difference in how the code was originally written! This is a huge advantage over existing decompilers, which will be unable to tell the subtle difference between the two.`


			`## Macro pattern matching`
			`Lots of GOAL language features are implemented with macros, so once the s-expression nesting is recovered, we can pattern match to undo macros very precisely.`