formatter: rewrite and refactor, address more edge-cases, begin documenting my work (#3096)

Tyler Wilding 2023-10-20 21:24:31 -04:00 committed by GitHub
parent 94603bce49
commit dccc3da1b3
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
17 changed files with 670 additions and 476 deletions


@ -48,13 +48,15 @@
## Please read first
Our repositories on GitHub are primarily for development of the project and tracking active issues. Most of the information you will find here pertains to setting up the project for development purposes and is not relevant to the end-user.
> [!IMPORTANT]
> Our repositories on GitHub are primarily for development of the project and tracking active issues. Most of the information you will find here pertains to setting up the project for development purposes and is not relevant to the end-user.
For questions or additional information pertaining to the project, we have a Discord for discussion here: https://discord.gg/VZbXMHXzWv
Additionally, you can find further documentation and answers to **frequently asked questions** on the project's main website: https://opengoal.dev
**Do not use this decompilation project without providing your own legally purchased copy of the game.**
> [!WARNING]
> **Do not use this decompilation project without providing your own legally purchased copy of the game.**
### Quick Start

186
common/formatter/README.md Normal file

@ -0,0 +1,186 @@
# OpenGOAL Formatter Documentation
> [!IMPORTANT]
> This is still in a highly experimental state, adjust your expectations accordingly.
Despite LISP being very simple, and despite taking the stance of a highly opinionated formatter that doesn't concern itself with line-length in most situations, writing a code formatter is still an incredibly annoying and difficult thing to do. This documentation serves as a place for me to collect my thoughts and explain how it works (or more likely, how it does not).
Also, in general I found it _really_ hard to find modern and easy-to-understand articles on how to get started (with the exception of zprint's usage docs, but those don't really cover implementation ideas). This area is dominated either by 40-year-old journal articles or by discussions around really difficult formatting projects. There isn't a whole lot in the vein of _here's how I made a decent, simple formatter_, so maybe it'll help someone else (even if it's potentially teaching them what not to do!).
## Architecture / Overview
At least for me, it helps to understand the path code takes as it flows through the formatting process. Originally I tried to do as much as possible at once for the sake of efficiency, but this just makes things incredibly hard to reason about and makes bugs hard to fix. So, like any complex problem, break it down: the formatter goes through several distinct phases, each chipping away at the problem:
```mermaid
%%{init: {'theme': 'dark', "flowchart" : { "curve" : "basis" } } }%%
flowchart TB
subgraph top1 [Build a Formatting Tree]
direction TB
a1(AST Parse) --> a2(Consolidate AST Nodes)
a1 --> a3(Collect Metadata)
end
subgraph top2 [Apply Formatting Configuration]
direction TB
b1(Recursively Iterate each Node) --> b2{Form head has\na predefined config?}
b2 --> |yes|b3(Fetch it and set the config on the node)
b2 --> |no|b4(Apply a default configuration)
end
subgraph top3 [Apply Formatting]
direction TB
c1(For each node, recursively build up a\nlist of source code lines, no indentation) --> c2(Consolidate the first two lines if hanging\nor if the config demands it)
c2 --> c3(Append indentation to each line as appropriate\nand surround each form with relevant parens)
end
subgraph top4 [Finalize Lines]
direction TB
d1(Concatenate all lines with whatever\nfile ending the file started with) --> d2(If we're formatting a file append a trailing new-line\n otherwise we are done)
end
top1-->|FormatterTree|top2
top2-->|FormatterTree with Configurations|top3
top3-->|String List|top4
```
### Build a Formatting Tree
At the end of the day, formatting is about changing the location and amounts of white-space in a file. Meaning it would be ideal if we could start with a representation that is whitespace-agnostic. Some formatters (like cljfmt) choose to use the original whitespace as an input into the formatting process, essentially meaning that your code should already be somewhat-formatted. This is somewhat advantageous as it can simplify the formatter's implementation or be a bit more flexible as to what the original programmer intended (assuming they knew what they were doing). But of course it has one fatal flaw -- the code has to already be well formatted.
For our formatter, I wanted it to be able to take in horrible, ugly, minified code and output an optimal result -- especially since so much of our code gets spat out by another tool. To do that, we need to be able to format the code without any existing whitespace helping us, adding in the whitespace from the ground up.
The easiest way to do this is to construct an AST of the language. An AST has a lot of other benefits for building language features, but here all we care about is getting rid of the whitespace and having some basic interpretation of the source code structure. The easiest way nowadays to make an AST for a language is to create a [tree-sitter grammar](https://github.com/open-goal/tree-sitter-opengoal).
#### Consolidate AST Nodes and Collect Metadata
One downside of using an AST, at least with the way the grammar is currently implemented, is that it creates a somewhat difficult tree to work with from a formatting perspective. For example, in a LISP everything is a list, but in our AST a list does not look how you may intuitively expect:
`(+ 1 2)`
```mermaid
flowchart TB
a("source\n(+ 1 2)") --> b("list_lit\n")
b --> c("(") --> ca("(")
b --> d("sym_lit\n+") --> da("+")
b --> e("sym_lit\n1") --> ea("1")
b --> f("sym_lit\n2") --> fa("2")
b --> g(")") --> ga(")")
```
vs what you would probably prefer:
```mermaid
flowchart TB
a("top-level") --> b("list\n- surrounded in ()")
b --> da("+")
b --> ea("1")
b --> fa("2")
```
As you can see from even this simple example, the AST representation can be very verbose and in a sense has duplication. Each node is aware of its original location in the parsed source-code, but it's the leaf-node tokens that we care about for formatting, and perhaps the form types they were a part of (i.e. were they comments, in a list, etc.).
So to keep things brief, the AST representation is a bit too cumbersome and superfluous to use for formatting directly. The first step is to recurse through it and optimize it, removing unneeded height and capturing the metadata that is actually relevant for formatting.
We build a `FormatterTree` which consists of `FormatterTreeNode`s. Each `FormatterTreeNode` has either a single token, or a list of `refs`. A `ref` is just another `FormatterTreeNode`. Each node also has a bunch of metadata and configuration on how it should be formatted (just initialized for now).
Having a dead-simple representation of the code like this is really only possible because it's a LISP, but it's also future-proofed for us. If OpenGOAL ever decides to support different brackets -- for example, in Clojure a vector uses `[]` and a map uses `{}` -- it would be easy to update the formatter to wrap each list node in the right bracket type.
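To make the shape of this tree concrete, here is a minimal sketch of a token-or-list node. It is not the real `FormatterTreeNode`: `SketchNode`, `make_leaf`, `make_list` and `render` are hypothetical names used only for illustration, and metadata and configuration are left out entirely.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the formatter's node type.
// A node is either a leaf (token is set) or a list (refs holds the children).
struct SketchNode {
  std::optional<std::string> token;  // set only for leaf tokens
  std::vector<SketchNode> refs;      // children, used only for lists
  bool is_list() const { return !token.has_value(); }
};

SketchNode make_leaf(const std::string& text) {
  SketchNode node;
  node.token = text;
  return node;
}

SketchNode make_list(std::vector<SketchNode> children) {
  SketchNode node;
  node.refs = std::move(children);
  return node;
}

// Print the tree back out on one line. The bracket characters are parameters,
// which is the "wrap each list node in the right bracket type" idea.
std::string render(const SketchNode& node, char open = '(', char close = ')') {
  if (node.token) {
    return *node.token;
  }
  std::string out(1, open);
  for (std::size_t i = 0; i < node.refs.size(); i++) {
    if (i > 0) {
      out += ' ';
    }
    out += render(node.refs[i], open, close);
  }
  out += close;
  return out;
}
```

Building `(+ 1 2)` is then `make_list({make_leaf("+"), make_leaf("1"), make_leaf("2")})`, with none of the paren or wrapper nodes from the raw AST.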
### Applying Formatting Configuration
Now that we have our simplified tree, we can get to actually formatting. The first step is to figure out _how_ we intend to format things. By separating this from the actual formatting it's a lot easier to reason about and find bugs. You can also do potentially interesting things, where you do a full traversal of your source-code tree to figure out an optimal decision without trying to build up the final formatted string at the same time.
It is lists, or `form`s, that we care about figuring out how to format. Each `form` has a head element (unless it's empty) that we can use to specify how it should be formatted. For example, we cannot specify the formatting for every single function name, but we can format a `defun` or a `let` in a special way. With this mechanism, the formatter is easily extensible and configurable. There is even the ability to specify formatting configuration for a list's children by index -- for example, you may want to format the first element in a `let` (the binding list) differently than the rest of the code.
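The head-based lookup can be sketched roughly as below. This is an illustration under assumptions, not the actual rule table: `SketchConfig`, `example_table` and `config_for_head` are hypothetical names, the field set is reduced to two values, and the binding list is assumed to sit at index `1` (with the head at index `0`).

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical, reduced formatting configuration for a single form.
struct SketchConfig {
  bool hang_forms = true;     // default to hanging
  int indentation_width = 2;  // used when the form is flowed
  // Optional per-child overrides, keyed by the child's index in the list.
  std::map<int, std::shared_ptr<SketchConfig>> index_configs;
};

// A tiny example table keyed by form head (assumed contents, for illustration).
std::map<std::string, SketchConfig> example_table() {
  SketchConfig let_config;
  let_config.hang_forms = false;  // flow the body of a `let`
  auto bindings = std::make_shared<SketchConfig>();
  bindings->indentation_width = 1;  // treat the binding list differently
  let_config.index_configs[1] = bindings;
  return {{"let", let_config}};
}

// Use the predefined config if the head is known, otherwise the fallback.
SketchConfig config_for_head(const std::map<std::string, SketchConfig>& table,
                             const std::string& head,
                             const SketchConfig& fallback) {
  const auto it = table.find(head);
  return it != table.end() ? it->second : fallback;
}
```

An unknown head such as `println` simply gets the fallback, which is why ordinary function calls need no configuration of their own.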
There are also some forms that don't have a head-item, which we still identify and give some default formatting rules:
- The top-level node - indentation is disabled
- Lists of only literals, called `constant lists` - These are indented with `1` space instead of the default `2` for a `flow`'d form.
- If these list literals are paired with keywords we call these `constant pairs` - We strive to put the pairs on the same line, if they are too long we spill to the next with `2` space indentation
- TODO - cond/case pairs (similar to constant pairs)
This brings up two important terms that you will see in the formatter implementation a lot -- `flow` and `hang`. The difference is as such:
```clj
(println "this"
         "form"
         "is"
         "hung / hanged")
```
```clj
(println
  "this"
  "form"
  "is"
  "flowed")
```
Other than visually, the difference between the two is that hanging always results in 1 less line spent _because we don't care about line-length_. This is great because it means we can always default to hanging unless overridden. Flowing trades off that 1 vertical line for horizontal space and looks great for block-style forms, like the top-level of a `defun` or the content of a `let`.
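The trade-off can be put into numbers with a small sketch. The helper names here are hypothetical, and the sketch assumes a flat form whose elements each fit on their own line:

```cpp
#include <cstddef>
#include <string>

// Indentation of the continuation lines of a hung form:
// the opening paren, the head token, and the separating space.
std::size_t hang_indent_width(const std::string& head) {
  return head.size() + 2;
}

// A flowed form is indented by a fixed amount, whatever the head is.
constexpr std::size_t flow_indent_width = 2;

// Lines used by a form with `num_elements` elements (head included)
// when every element gets its own line.
std::size_t line_count(std::size_t num_elements, bool hang) {
  if (num_elements <= 1) {
    return 1;  // "()" or "(head)" always fits on one line
  }
  // Hanging puts the first argument on the head's line, saving one line.
  return hang ? num_elements - 1 : num_elements;
}
```

For the `println` examples above (five elements), hanging uses 4 lines with a 9 column indent, while flowing uses 5 lines with a 2 column indent.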
### Applying Formatting
The final real step in the process is actually using the configuration and generating the formatted code. This phase is broken up into 3 sub-phases as we recurse through the `FormatterTree`.
#### Build Lines
First we build the source code lines without indentation. It's a lot easier to figure out the indentation after all tokens are on the correct lines instead of doing it in tandem. Most tokens are given their own line, but in the case of a hung form we combine the first two elements on the first line. Other forms are formatted in even more complicated ways, and comments that were originally inline should remain next to the token they followed.
Secondly, once we have all our consolidated lines we can apply the indentation. This is most often based on the configuration of the list node itself. For example, a `hung` form must be indented by the width of a `(` and the length of the head-form, plus the whitespace separator -- conversely a `flowed` form only needs to be indented by `2` spaces, regardless of token length. Because all of these algorithms are recursive we can hyper-focus on getting the indentation and formatting correct from the bottom-up. In other words, we don't care if a form is 200 spaces indented into a function: once the lines are returned recursively, they will get another layer of indentation added on until we hit the top-level node.
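The sub-phases can be sketched for the simplest possible case, a flat form made only of tokens. This is a simplification under assumptions, not the real `apply_formatting`: `format_flat_form` is a hypothetical helper, and nested forms, comments and per-form configuration are ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Format a flat form (tokens only) into a list of lines.
std::vector<std::string> format_flat_form(const std::vector<std::string>& tokens, bool hang) {
  if (tokens.empty()) {
    return {"()"};
  }
  // 1. Build lines: every token starts out on its own line, no indentation.
  std::vector<std::string> lines(tokens.begin(), tokens.end());

  // 2. Consolidate: a hung form keeps its first argument on the head's line.
  //    The indent is measured from the head before the lines are merged.
  std::size_t indent = 2;  // flowed forms always use 2 spaces
  if (hang && lines.size() > 1) {
    indent = lines[0].size() + 2;  // "(" + head + " "
    lines[0] += " " + lines[1];
    lines.erase(lines.begin() + 1);
  }

  // 3. Indent every line but the first, then surround the form with parens.
  for (std::size_t i = 1; i < lines.size(); i++) {
    lines[i] = std::string(indent, ' ') + lines[i];
  }
  lines.front() = "(" + lines.front();
  lines.back() += ")";
  return lines;
}
```

In the recursive version, a parent would prepend its own indentation to every line returned by a child, which is how a deeply nested form ends up far to the right without any single step knowing its absolute column.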
### Finalize Lines
Finally, we concatenate our list of strings, respecting the original file's line-endings.
As it is best practice, if we are formatting an entire file we should terminate with a trailing new-line.
Both of these items are technically a TODO.
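Since both items are still a TODO, the following is only one possible way to do it, not what the formatter currently does. `detect_line_ending` and `join_lines` are hypothetical helpers, and the detection rule (any `\r\n` in the input means CRLF) is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Guess the line ending from the original source; fall back to "\n".
std::string detect_line_ending(const std::string& source) {
  return source.find("\r\n") != std::string::npos ? "\r\n" : "\n";
}

// Join the formatted lines with the given line ending. When formatting a
// whole file, terminate the result with one trailing line ending.
std::string join_lines(const std::vector<std::string>& lines,
                       const std::string& line_ending,
                       bool formatting_whole_file) {
  std::string out;
  for (std::size_t i = 0; i < lines.size(); i++) {
    if (i > 0) {
      out += line_ending;
    }
    out += lines[i];
  }
  if (formatting_whole_file && !out.empty()) {
    out += line_ending;
  }
  return out;
}
```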
## Configuration
TODO
## Reference
This formatter is highly influenced by a lot of other existing formatters and writings. The most significant of these is Clojure's [zprint](https://github.com/kkinnear/zprint/tree/main/doc/options), which has incredible documentation around its capabilities and why it chooses what to do.
The formatter I've built basically copies zprint in terms of basic functionality, but simplifies things by being even more opinionated (for example, doing far less in terms of trying to adhere to a line-length limit).
### Other references
- [The Hardest Program I've Ever Written - Dart Formatter](https://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/)
- [cljfmt](https://github.com/weavejester/cljfmt)
### Why not respect a line-length limit
The OpenGOAL formatter will respect line-length in only very specific situations:
- When determining if a form can be fit entirely on a single line
- Probably in the future for multi-line comments
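The first check can be sketched as a recursive width computation. This is an illustration, not the formatter's own routine: `WidthNode` and the helpers are hypothetical names, the separating space between elements is counted here as an assumption, and the target width is passed in rather than read from configuration.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token-or-list node, just enough to measure widths.
struct WidthNode {
  std::optional<std::string> token;
  std::vector<WidthNode> refs;
};

WidthNode width_leaf(const std::string& text) {
  WidthNode node;
  node.token = text;
  return node;
}

WidthNode width_list(std::vector<WidthNode> children) {
  WidthNode node;
  node.refs = std::move(children);
  return node;
}

// Width of the node if it were printed entirely on one line.
int inlined_width(const WidthNode& node) {
  if (node.token) {
    return static_cast<int>(node.token->size());
  }
  int width = 2;  // opening and closing paren
  for (std::size_t i = 0; i < node.refs.size(); i++) {
    if (i > 0) {
      width += 1;  // single space between elements
    }
    width += inlined_width(node.refs[i]);
  }
  return width;
}

// A form fits if it still ends within the target when started at cursor_pos.
bool fits_on_line(const WidthNode& node, int cursor_pos, int line_width_target) {
  return cursor_pos + inlined_width(node) <= line_width_target;
}
```

Note that the answer depends on where the form starts, so a form that fits at the top level may no longer fit once it is nested a few levels deep.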
Besides it adding a massive amount of complexity, even zprint explains that a line-length limit won't always be followed:
> zprint will work very hard to fit the formatted output into this width, though there are limits to its effort. For instance, it will not reduce the minimum indents in order to satisfy a particular width requirement.
Let's illustrate with an example:
```clj
(defun some-func
  (+ 123
     456
     (+ 123
        456
        ;; imagine this form exceeds our width, how can you "fix" this without
        ;; messing up the indentation levels of the entire expression
        ;; - you can't inline it (you're already past the width)
        ;; - you can't insert a line break because it'll still be at the same horizontal level
        (+ 123
           456
           789))))
```
In LISP code, we don't have many ideal options to deal with this. All options are syntactically correct, but in terms of conventional readability they are bad, because when reading LISP code you want to use the indentation as a guide (which is why new-line parens are considered ugly by basically all LISP style guides).
The other reason, which is equally relevant, is that it is hard: the formatter would have to search for an optimal solution by exploring different layouts until it finds the one that minimizes lines.
But with an opinionated formatting style you can avoid this complexity. If you want your code to look nice, don't write code like the decompiler sometimes outputs when it encounters something truly cursed:
![](./docs/triangle-of-death.png)
> `game-save.gc` about 200+ nested `let`s!

Binary file not shown.


@ -16,124 +16,303 @@ extern "C" {
extern const TSLanguage* tree_sitter_opengoal();
}
std::string apply_formatting(
const FormatterTreeNode& curr_node,
std::string output,
std::optional<formatter_rules::config::FormFormattingConfig> form_element_config) {
int hang_indentation_width(const FormatterTreeNode& curr_node) {
if (curr_node.token || curr_node.refs.empty()) {
return 0;
}
// Get the first element of the form
const auto& first_elt = curr_node.refs.at(0);
if (first_elt.token) {
return first_elt.token->length() +
2; // +2 because the opening paren and then the following space
}
// Otherwise, continue nesting
return 1 + hang_indentation_width(first_elt);
}
// TODO - compute length of each node and store it
void apply_formatting_config(
FormatterTreeNode& curr_node,
std::optional<std::shared_ptr<formatter_rules::config::FormFormattingConfig>>
config_from_parent = {}) {
using namespace formatter_rules;
// node is empty, base-case
if (curr_node.token || curr_node.refs.empty()) {
return;
}
// first, check to see if this form already has a predefined formatting configuration
// if it does, that simplifies things because there is only 1 way of formatting the form
std::optional<formatter_rules::config::FormFormattingConfig> predefined_config;
if (!config_from_parent && !curr_node.refs.empty() && curr_node.refs.at(0).token) {
const auto& form_head = curr_node.refs.at(0).token;
if (form_head && config::opengoal_form_config.find(form_head.value()) !=
config::opengoal_form_config.end()) {
predefined_config = config::opengoal_form_config.at(form_head.value());
curr_node.formatting_config = predefined_config.value();
}
} else if (config_from_parent) {
predefined_config = *config_from_parent.value();
curr_node.formatting_config = predefined_config.value();
}
// In order to keep things simple, as well as because it's ineffectual in lisp code (you can only
// enforce it so much without making things unreadable), line width will not matter for deciding
// whether or not to hang or flow the form
//
// This means that a hang would ALWAYS win, because it's 1 less line break. This simplifies our
// approach: there is no need to explore both branches to see which one would be
// preferred.
//
// Instead, we either use the predefined configuration (obviously) or we do some checks for some
// outlier conditions to see if things should be formatted differently
//
// Otherwise, we always default to a hang.
//
// NOTE - any modifications here to child elements could be superseded later in the recursion.
// In order to maintain your sanity, only modify things here that _aren't_ touched by default
// configurations. These are explicitly prefixed with `parent_mutable_`
if (!predefined_config) {
if (curr_node.metadata.is_top_level) {
curr_node.formatting_config.indentation_width = 0;
curr_node.formatting_config.hang_forms = false;
} else if (constant_list::is_constant_list(curr_node)) {
// - Check if the form is a constant list (ie. a list of numbers)
curr_node.formatting_config.indentation_width = 1;
curr_node.formatting_config.hang_forms = false;
curr_node.formatting_config.has_constant_pairs =
constant_pairs::form_should_be_constant_paired(curr_node);
// If applicable, iterate through the constant pairs, since we can potentially pair up
// non-constant second elements in a pair (like a function call), there is the potential that
// they need to spill to the next line and get indented in extra. This is an exceptional
// circumstance, we do NOT do this sort of thing when formatting normal forms (cond/case pairs
// are another similar situation)
if (curr_node.formatting_config.has_constant_pairs) {
for (int i = 0; i < curr_node.refs.size(); i++) {
auto& child_ref = curr_node.refs.at(i);
const auto type = child_ref.metadata.node_type;
if (constant_types.find(type) == constant_types.end() &&
constant_pairs::is_element_second_in_constant_pair(curr_node, child_ref, i)) {
child_ref.formatting_config.parent_mutable_extra_indent = 2;
}
}
}
} else if (curr_node.formatting_config.hang_forms && curr_node.refs.size() > 1 &&
curr_node.refs.at(1).metadata.is_comment) {
// - Check if the second argument is a comment, it looks better if we flow instead
curr_node.formatting_config.hang_forms = false;
}
}
// If we are hanging, let's determine the indentation width since it is based on the form itself
if (curr_node.formatting_config.hang_forms) {
curr_node.formatting_config.indentation_width = hang_indentation_width(curr_node);
}
// iterate through the refs
for (int i = 0; i < curr_node.refs.size(); i++) {
auto& ref = curr_node.refs.at(i);
if (!ref.token) {
// If the child has a pre-defined configuration at that index, we pass it along
if (predefined_config &&
predefined_config->index_configs.find(i) != predefined_config->index_configs.end()) {
apply_formatting_config(ref, predefined_config->index_configs.at(i));
} else {
apply_formatting_config(ref);
}
}
}
}
int get_total_form_inlined_width(const FormatterTreeNode& curr_node) {
if (curr_node.token) {
return curr_node.token->length();
}
int width = 1;
for (const auto& ref : curr_node.refs) {
width += get_total_form_inlined_width(ref);
}
return width + 1;
}
bool form_contains_comment(const FormatterTreeNode& curr_node) {
if (curr_node.metadata.is_comment) {
return true;
}
for (const auto& ref : curr_node.refs) {
const auto contains_comment = form_contains_comment(ref);
if (contains_comment) {
return true;
}
}
return false;
}
bool form_contains_node_that_prevents_inlining(const FormatterTreeNode& curr_node) {
if (curr_node.formatting_config.should_prevent_inlining(curr_node.formatting_config,
curr_node.refs.size())) {
return true;
}
for (const auto& ref : curr_node.refs) {
const auto prevents_inlining = form_contains_node_that_prevents_inlining(ref);
if (prevents_inlining) {
return true;
}
}
return false;
}
bool can_node_be_inlined(const FormatterTreeNode& curr_node, int cursor_pos) {
using namespace formatter_rules;
// First off, we cannot inline the top level
if (curr_node.metadata.is_top_level) {
return false;
}
// If the config explicitly prevents inlining, or it contains a sub-node that prevents inlining
if (curr_node.formatting_config.prevent_inlining ||
form_contains_node_that_prevents_inlining(curr_node)) {
return false;
}
// nor can we inline something that contains a comment in the middle
if (form_contains_comment(curr_node)) {
return false;
}
// constant pairs are not inlined!
if (curr_node.formatting_config.has_constant_pairs) {
return false;
}
// If this is set in the config, then the form is intended to be partially inlined
if (curr_node.formatting_config.inline_until_index != -1) {
return false;
}
// let's see if we can inline the form all on one line; to do that, we recursively explore
// the form to find the total width
int line_width = cursor_pos + get_total_form_inlined_width(curr_node);
return line_width <= indent::line_width_target; // TODO - comments
}
std::vector<std::string> apply_formatting(const FormatterTreeNode& curr_node,
std::vector<std::string> output = {},
int cursor_pos = 0) {
using namespace formatter_rules;
if (!curr_node.token && curr_node.refs.empty()) {
return output;
}
std::string curr_form = "";
// Print the token
// If its a token, just print the token and move on
if (curr_node.token) {
curr_form += curr_node.token.value();
return curr_form;
return {curr_node.token.value()};
}
if (!curr_node.metadata.is_top_level) {
curr_form += "(";
}
// Iterate the form
bool inline_form = false;
// Also check if the form should be constant-paired
const bool constant_pair_form = constant_pairs::form_should_be_constant_paired(curr_node);
if (!constant_pair_form) {
// Determine if the form should be inlined or hung/flowed
// TODO - this isn't entirely accurate, needs current cursor positioning (which is tricky
// because recursion!)
inline_form = indent::form_can_be_inlined(curr_form, curr_node);
}
const bool flowing = indent::should_form_flow(curr_node, inline_form);
std::optional<formatter_rules::config::FormFormattingConfig> form_config;
if (!curr_node.refs.empty() && curr_node.refs.at(0).token) {
const auto& form_head = curr_node.refs.at(0).token;
if (form_head && config::opengoal_form_config.find(form_head.value()) !=
config::opengoal_form_config.end()) {
form_config = config::opengoal_form_config.at(form_head.value());
}
}
// TODO - might want to make some kind of per-form config struct, simplify the passing around of
// info below
for (int i = 0; i < (int)curr_node.refs.size(); i++) {
bool inline_form = can_node_be_inlined(curr_node, cursor_pos);
// TODO - also if the form is inlinable, we can skip all the complication below and just...inline
// it!
// TODO - should figure out the inlining here as well, instead of the bool above
// Iterate the form, building up a list of the final lines but don't worry about indentation
// at this stage. Once the lines are finalized, it's easy to add the indentation later
//
// This means we may combine elements onto the same line in this step.
std::vector<std::string> form_lines = {};
for (int i = 0; i < curr_node.refs.size(); i++) {
const auto& ref = curr_node.refs.at(i);
// Figure out if the element should be inlined or not
bool inline_element = inline_form;
if (indent::inline_form_element(curr_node, i)) {
inline_element = indent::inline_form_element(curr_node, i).value();
}
// Append a newline if needed
// TODO - cleanup / move
bool is_binding_list = false;
bool force_newline = false;
bool override_force_flow = false;
if (form_config) {
force_newline = std::find(form_config->force_newline_at_indices.begin(),
form_config->force_newline_at_indices.end(),
i) != form_config->force_newline_at_indices.end();
// Check if it's a small enough binding list, if so we don't force a newline if the element
// can be inlined
if (inline_element && i > 0 && form_config->bindings_at_index == i - 1 &&
curr_node.refs.at(i - 1).refs.size() < form_config->allow_inlining_if_size_less_than) {
force_newline = false;
override_force_flow = true;
}
is_binding_list = form_config->bindings_at_index == i;
}
if (!curr_node.metadata.is_top_level &&
(!inline_element || is_binding_list || force_newline ||
(form_element_config && form_element_config->force_flow))) {
indent::append_newline(curr_form, ref, curr_node, i, flowing, constant_pair_form,
(form_element_config && form_element_config->force_flow));
}
// TODO - indent the line (or don't)
// Either print the element's token, or recursively format it as well
// Add new line entry
if (ref.token) {
// TODO depth hard-coded to 1, i think this can be removed, since
// forms are always done bottom-top recursively, they always act
// independently as if it was the shallowest depth
if (!inline_element || force_newline) {
indent::indent_line(curr_form, ref, curr_node, 1, i, flowing);
}
if (ref.metadata.node_type == "comment" && ref.metadata.is_inline) {
curr_form += " " + ref.token.value();
} else if (ref.metadata.node_type == "block_comment") {
curr_form += comments::format_block_comment(ref.token.value());
} else {
curr_form += ref.token.value();
}
if (!curr_node.metadata.is_top_level) {
curr_form += " ";
// Cleanup block-comments
std::string val = ref.token.value();
if (ref.metadata.node_type == "block_comment") {
// TODO - change this sanitization to return a list of lines instead of a single new-lined
// line
val = comments::format_block_comment(ref.token.value());
}
form_lines.push_back(val);
} else {
// See if the item at this position has specific formatting
std::optional<formatter_rules::config::FormFormattingConfig> config = {};
std::string formatted_form;
if (form_config && form_config->index_configs.find(i) != form_config->index_configs.end()) {
formatted_form = apply_formatting(ref, "", *form_config->index_configs.at(i));
} else {
formatted_form = apply_formatting(ref, "", {});
}
// TODO - align inner lines only
if (!curr_node.metadata.is_top_level) {
indent::align_lines(
formatted_form, ref, curr_node, constant_pair_form, flowing,
(!override_force_flow && form_config && i >= form_config->start_flow_at_index),
inline_element);
}
curr_form += formatted_form;
if (!curr_node.metadata.is_top_level) {
curr_form += " ";
// If it's not a token, we have to recursively build up the form
// TODO - add the cursor_pos here
const auto& lines = apply_formatting(ref, {}, cursor_pos);
for (int i = 0; i < lines.size(); i++) {
const auto& line = lines.at(i);
form_lines.push_back(fmt::format(
"{}{}", str_util::repeat(ref.formatting_config.parent_mutable_extra_indent, " "),
line));
}
}
// Handle blank lines at the top level, skip if it's the final element
blank_lines::separate_by_newline(curr_form, curr_node, ref, i);
// If we are hanging forms, combine the first two forms onto the same line
if (i == curr_node.refs.size() - 1 && form_lines.size() > 1 &&
(curr_node.formatting_config.hang_forms ||
curr_node.formatting_config.combine_first_two_lines)) {
form_lines.at(0) += fmt::format(" {}", form_lines.at(1));
form_lines.erase(form_lines.begin() + 1);
} else if ((i + 1) < curr_node.refs.size()) {
const auto& next_ref = curr_node.refs.at(i + 1);
// combine the next inline comment or constant pair
if ((next_ref.metadata.node_type == "comment" && next_ref.metadata.is_inline) ||
(curr_node.formatting_config.has_constant_pairs &&
constant_pairs::is_element_second_in_constant_pair(curr_node, next_ref, i + 1))) {
if (next_ref.token) {
form_lines.at(form_lines.size() - 1) += fmt::format(" {}", next_ref.token.value());
i++;
} else if (can_node_be_inlined(next_ref, cursor_pos)) {
const auto& lines = apply_formatting(next_ref, {}, cursor_pos); // TODO - cursor pos
for (const auto& line : lines) {
form_lines.at(form_lines.size() - 1) += fmt::format(" {}", line);
}
i++;
}
}
}
// If we are at the top level, potentially separate with a new line
if (blank_lines::should_insert_blank_line(curr_node, ref, i)) {
form_lines.at(form_lines.size() - 1) += "\n";
}
}
// Consolidate any lines if the configuration requires it
if (curr_node.formatting_config.inline_until_index != -1) {
std::vector<std::string> new_form_lines = {};
for (int i = 0; i < form_lines.size(); i++) {
if (i < curr_node.formatting_config.inline_until_index) {
if (new_form_lines.empty()) {
new_form_lines.push_back(form_lines.at(i));
} else {
new_form_lines.at(0) += fmt::format(" {}", form_lines.at(i));
}
} else {
new_form_lines.push_back(form_lines.at(i));
}
}
form_lines = new_form_lines;
}
// Apply necessary indentation to each line and add parens
if (!curr_node.metadata.is_top_level) {
curr_form = str_util::rtrim(curr_form) + ")";
std::string form_surround_start = "(";
std::string form_surround_end = ")";
form_lines[0] = fmt::format("{}{}", form_surround_start, form_lines[0]);
form_lines[form_lines.size() - 1] =
fmt::format("{}{}", form_lines[form_lines.size() - 1], form_surround_end);
}
return curr_form;
std::string curr_form = "";
if (curr_node.formatting_config.parent_mutable_extra_indent > 0) {
curr_form += str_util::repeat(curr_node.formatting_config.parent_mutable_extra_indent, " ");
}
if (inline_form) {
form_lines = {fmt::format("{}", fmt::join(form_lines, " "))};
} else {
for (int i = 0; i < form_lines.size(); i++) {
if (i > 0) {
auto& line = form_lines.at(i);
line = fmt::format("{}{}",
str_util::repeat(curr_node.formatting_config.indentation_width_for_index(
curr_node.formatting_config, i),
" "),
line);
}
}
}
return form_lines;
}
std::string join_formatted_lines(const std::vector<std::string> lines) {
// TODO - respect original file line endings?
return fmt::format("{}", fmt::join(lines, "\n"));
}
std::optional<std::string> formatter::format_code(const std::string& source) {
@ -155,9 +334,22 @@ std::optional<std::string> formatter::format_code(const std::string& source) {
}
try {
const auto formatting_tree = FormatterTree(source, root_node);
std::string formatted_code = apply_formatting(formatting_tree.root, "", {});
return formatted_code;
// There are three phases of formatting
// 1. Simplify the AST down to something that is easier to work on from a formatting perspective
// this also gathers basic metadata that can be done at this stage, like if the token is a
// comment or if the form is on the top-level
auto formatting_tree = FormatterTree(source, root_node);
// 2. Recursively iterate through this simplified FormatterTree and figure out what rules
// need to be applied to produce an optimal result
apply_formatting_config(formatting_tree.root);
// 3. Use this updated FormatterTree to print out the final source-code, while doing so
// we may deviate from the optimal result to produce something even more optimal by inlining
// forms that can fit within the line width.
const auto formatted_lines = apply_formatting(formatting_tree.root);
// 4. Now we join the lines together, it's easier when formatting to leave all lines independent
// so adding indentation is easier
const auto formatted_source = join_formatted_lines(formatted_lines);
return formatted_source;
} catch (std::exception& e) {
lg::error("Unable to format code - {}", e.what());
}


@ -96,6 +96,11 @@ void FormatterTree::construct_formatter_tree_recursive(const std::string& source
// formatting So for strings, we treat them as if they should be a single token
tree_node.refs.push_back(FormatterTreeNode(source, curr_node));
return;
} else if (curr_node_type == "quoting_lit") {
// same story for quoted symbols
// TODO - expect to have to add more here
tree_node.refs.push_back(FormatterTreeNode(source, curr_node));
return;
}
for (size_t i = 0; i < ts_node_child_count(curr_node); i++) {
const auto child_node = ts_node_child(curr_node, i);


@ -5,6 +5,7 @@
#include <string>
#include <vector>
#include "rules/rule_config.h"
#include "tree_sitter/api.h"
// Treesitter is fantastic for validating and parsing our code into a structured tree format without
@ -39,11 +40,13 @@ class FormatterTreeNode {
// eventually token node refs
std::optional<std::string> token;
formatter_rules::config::FormFormattingConfig formatting_config;
FormatterTreeNode() = default;
FormatterTreeNode(const std::string& source, const TSNode& node);
FormatterTreeNode(const Metadata& _metadata) : metadata(_metadata){};
bool is_list() const { return token.has_value(); }
bool is_list() const { return !token.has_value(); }
};
// A FormatterTree has a very simple and crude tree structure where:
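
The `is_list` change in this hunk inverts the check: a node is a list when it does not hold a token. A minimal sketch of that invariant, using a hypothetical `TreeNodeSketch` that keeps only the two relevant fields of `FormatterTreeNode`:

```cpp
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for FormatterTreeNode (assumption: only the fields
// needed to illustrate is_list are kept).
struct TreeNodeSketch {
  std::optional<std::string> token;  // set for leaf tokens, empty for lists
  std::vector<TreeNodeSketch> refs;  // child nodes, populated for lists

  // A node is a list precisely when it carries no token
  bool is_list() const { return !token.has_value(); }
};
```

With this definition, a leaf such as `println` reports `false` and a node that only holds children reports `true`, which is what `is_constant_list` relies on.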


@ -14,27 +14,41 @@ namespace formatter_rules {
// differentiate between a quoted symbol and a quoted form
const std::set<std::string> constant_types = {"kwd_lit", "num_lit", "str_lit",
"char_lit", "null_lit", "bool_lit"};
namespace blank_lines {
void separate_by_newline(std::string& curr_text,
const FormatterTreeNode& containing_node,
const FormatterTreeNode& node,
const int index) {
// We only are concerned with top level forms or elements
// Skip the last element, no trailing new-lines (let the editors handle this!)
// Also peek ahead to see if there was a comment on this line, if so don't separate things!
if (!containing_node.metadata.is_top_level || index >= (int)containing_node.refs.size() - 1 ||
(containing_node.refs.at(index + 1).metadata.is_comment &&
containing_node.refs.at(index + 1).metadata.is_inline)) {
return;
namespace constant_list {
bool is_constant_list(const FormatterTreeNode& node) {
if (!node.is_list() || node.refs.empty()) {
return false;
}
const auto& type = node.refs.at(0).metadata.node_type;
return constant_types.find(type) != constant_types.end();
}
} // namespace constant_list
namespace blank_lines {
bool should_insert_blank_line(const FormatterTreeNode& containing_node,
const FormatterTreeNode& node,
const int index) {
// We only do this at the top level and don't leave a trailing new-line
if (!containing_node.metadata.is_top_level || index >= (int)containing_node.refs.size() - 1) {
return false;
}
curr_text += "\n";
// If it's a comment, but has no following blank lines, don't insert a blank line
if (node.metadata.is_comment && node.metadata.num_blank_lines_following == 0) {
return;
return false;
}
// Otherwise, add only 1 blank line
curr_text += "\n";
// If the next form is a comment and is inline, don't insert a blank line
if ((index + 1) < containing_node.refs.size() &&
containing_node.refs.at(index + 1).metadata.is_comment &&
containing_node.refs.at(index + 1).metadata.is_inline) {
return false;
}
// TODO - only if the form doesn't fit on a single line
return true;
}
} // namespace blank_lines
namespace comments {
@ -70,6 +84,7 @@ std::string format_block_comment(const std::string& comment) {
namespace constant_pairs {
// TODO - remove index, not needed, could just pass in the previous node
bool is_element_second_in_constant_pair(const FormatterTreeNode& containing_node,
const FormatterTreeNode& node,
const int index) {
@ -85,11 +100,7 @@ bool is_element_second_in_constant_pair(const FormatterTreeNode& containing_node
// not be paired
return false;
}
// Check the type of the element
if (constant_types.find(node.metadata.node_type) != constant_types.end()) {
return true;
}
return false;
return true;
}
bool form_should_be_constant_paired(const FormatterTreeNode& node) {
@ -118,276 +129,4 @@ bool form_should_be_constant_paired(const FormatterTreeNode& node) {
} // namespace constant_pairs
namespace indent {
int cursor_pos(const std::string& curr_text) {
if (curr_text.empty()) {
return 0;
}
// Get the last line of the text (which is also the line we are on!)
int pos = 0;
for (int i = curr_text.size() - 1; i >= 0; i--) {
const auto& c = curr_text.at(i);
if (c == '\n') {
break;
}
pos++;
}
return pos;
}
int compute_form_width_after_index(const FormatterTreeNode& node,
const int index,
const int depth = 0) {
if (node.refs.empty()) {
if (node.token) {
return node.token->size();
} else {
return 0;
}
}
int form_width = 0;
for (int i = 0; i < (int)node.refs.size(); i++) {
const auto& ref = node.refs.at(i);
if (depth == 0 && i < index) {
continue;
}
if (ref.token) {
form_width += ref.token->size() + 1;
} else {
form_width += compute_form_width_after_index(ref, index, depth + 1) + 1;
}
}
return form_width;
}
bool form_exceed_line_width(const std::string& curr_text,
const FormatterTreeNode& containing_node,
const int index) {
// Compute length from the current cursor position on the line as this check is done for every
// element of the form and not in advance
//
// This is for a good reason, intermediate nodes may override this styling and force to be
// formatted inline
//
// We early out as soon as we exceed the width
int curr_line_pos = cursor_pos(curr_text);
if (curr_line_pos >= line_width_target) {
return true;
}
int remaining_width_required = compute_form_width_after_index(containing_node, index);
if (curr_line_pos + remaining_width_required >= line_width_target) {
return true;
}
return false;
}
bool form_contains_comment(const FormatterTreeNode& node) {
if (node.metadata.is_comment) {
return true;
}
for (const auto& ref : node.refs) {
if (ref.metadata.is_comment) {
return true;
} else if (!node.refs.empty()) {
if (form_contains_comment(ref)) {
return true;
}
}
}
return false;
}
bool form_can_be_inlined(const std::string& curr_text, const FormatterTreeNode& list_node) {
// is the form too long to fit on a line TODO - increase accuracy here
if (form_exceed_line_width(curr_text, list_node, 0)) {
return false;
}
// are there any comments? (inlined or not, doesn't matter)
if (form_contains_comment(list_node)) {
return false;
}
return true;
}
bool should_form_flow(const FormatterTreeNode& list_node, const bool inlining_form) {
if (form_contains_comment(list_node)) {
return true;
}
// does the form begin with a constant (a list of constant elements)
if (!inlining_form && !list_node.refs.empty() &&
constant_types.find(list_node.refs.at(0).metadata.node_type) != constant_types.end()) {
return true;
}
// TODO - make a function to make grabbing this metadata easier...
// TODO - honestly should just have an is_list metadata
if (!list_node.refs.empty() && !list_node.refs.at(0).token) {
// if the first element is a comment, force a flow
if (list_node.refs.size() > 1 && list_node.refs.at(1).metadata.is_comment) {
return true;
}
const auto& form_head = list_node.refs.at(0).token;
// See if we have any configuration for this form
if (form_head && config::opengoal_form_config.find(form_head.value()) !=
config::opengoal_form_config.end()) {
const auto& form_config = config::opengoal_form_config.at(form_head.value());
return form_config.force_flow;
}
}
// TODO - cleanup, might be inside a let
/*if (!containing_form.refs.empty() && containing_form.refs.at(0).token) {
const auto& form_head = containing_form.refs.at(0).token;
if (form_head && config::opengoal_form_config.find(form_head.value()) !=
config::opengoal_form_config.end()) {
const auto& form_config = config::opengoal_form_config.at(form_head.value());
if (form_config.force_flow) {
return true;
}
}
}*/
return false;
}
std::optional<bool> inline_form_element(const FormatterTreeNode& list_node, const int index) {
// TODO - honestly should just have an is_list metadata
if (list_node.refs.empty() || !list_node.refs.at(0).token) {
return std::nullopt;
}
const auto& form_head = list_node.refs.at(0).token;
// See if we have any configuration for this form
if (form_head &&
config::opengoal_form_config.find(form_head.value()) != config::opengoal_form_config.end()) {
const auto& form_config = config::opengoal_form_config.at(form_head.value());
if (form_config.inline_until_index != -1) {
return index < form_config.inline_until_index;
}
}
return std::nullopt;
}
void append_newline(std::string& curr_text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const int index,
const bool flowing,
const bool constant_pair_form,
const bool force_newline) {
if ((force_newline && index >= 1) || (node.metadata.is_comment && !node.metadata.is_inline)) {
curr_text = str_util::rtrim(curr_text) + "\n";
return;
}
if (index <= 0 || containing_node.metadata.is_top_level ||
(node.metadata.is_comment && node.metadata.is_inline) || (!flowing && index <= 1)) {
return;
}
// Check if it's a constant pair
if (constant_pair_form &&
constant_pairs::is_element_second_in_constant_pair(containing_node, node, index)) {
return;
}
curr_text = str_util::rtrim(curr_text) + "\n";
}
void indent_line(std::string& curr_text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const int depth,
const int index,
const bool flowing) {
if (node.metadata.is_top_level || (node.metadata.is_inline && node.metadata.is_comment)) {
return;
}
// If the element is the second element in a constant pair, that means we did not append a
// new-line before hand so we require no indentation (it's inline with the previous element)
if (constant_pairs::is_element_second_in_constant_pair(containing_node, node, index)) {
return;
}
// If the first element in the list is a constant, we only indent with 1 space instead
if (index > 0 &&
constant_types.find(containing_node.refs.at(0).metadata.node_type) != constant_types.end()) {
curr_text += str_util::repeat(depth, " ");
} else if (index > 0 && flowing) {
curr_text += str_util::repeat(depth, " ");
} else if (index > 1 && !flowing) {
curr_text += str_util::repeat(containing_node.refs.at(0).token.value().length() + 2, " ");
}
}
// Recursively iterate through the node until we hit a token
int length_to_hang(const FormatterTreeNode& node, int length) {
if (node.token || node.refs.at(0).token) {
return length;
}
return length_to_hang(node.refs.at(0), length + 1);
}
void align_lines(std::string& text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const bool constant_pair_form,
const bool flowing,
const bool force_flow,
const bool inline_element) {
const auto lines = str_util::split(text);
int start_index = 0;
if (inline_element) {
start_index = 1;
}
int alignment_width = 2;
if (force_flow) {
start_index = 0;
} else if (constant_pair_form &&
constant_types.find(containing_node.refs.at(0).metadata.node_type) !=
constant_types.end()) {
start_index = 0;
alignment_width = 3;
} else if (!flowing) {
// If the form has a token (it's a normal list)
if (containing_node.refs.at(0).token) {
alignment_width = length_to_hang(containing_node.refs.at(1),
containing_node.refs.at(0).token.value().length()) +
1;
if (!node.token) {
alignment_width++;
}
} else {
// otherwise, it's a list of lists
alignment_width = 1;
}
} else if (!node.token) {
// If it's a list of lists
alignment_width = 1;
}
std::string aligned_form = "";
for (size_t i = 0; i < lines.size(); i++) {
if ((int)i >= start_index) {
aligned_form += str_util::repeat(alignment_width, " ");
}
aligned_form += lines.at(i);
if (i != lines.size() - 1) {
aligned_form += "\n";
}
}
if (!aligned_form.empty()) {
text = aligned_form;
}
}
} // namespace indent
namespace let {
bool can_be_inlined(const FormatterTreeNode& form) {
// Check a variety of things specific to `let` style forms (ones with bindings)
// - does the binding list have more than one binding?
const auto& bindings = form.refs.at(1); // TODO - assuming
if (bindings.refs.size() > 1) {
return false;
}
return true;
}
} // namespace let
} // namespace formatter_rules
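
The blank-line rule introduced in this file can be exercised in isolation. The sketch below mirrors the decision order of `should_insert_blank_line`; the `BlankLineMeta` and `BlankLineNode` types and the `_sketch` suffix are hypothetical, and only the metadata fields the rule reads are modelled.

```cpp
#include <vector>

// Simplified stand-ins for the formatter's metadata and tree node
// (assumption: only the fields consulted by the rule are kept).
struct BlankLineMeta {
  bool is_top_level = false;
  bool is_comment = false;
  bool is_inline = false;
  int num_blank_lines_following = 0;
};

struct BlankLineNode {
  BlankLineMeta metadata;
  std::vector<BlankLineNode> refs;
};

bool should_insert_blank_line_sketch(const BlankLineNode& containing_node,
                                     const BlankLineNode& node,
                                     const int index) {
  const int count = static_cast<int>(containing_node.refs.size());
  // Only separate top-level elements, and never after the last one
  if (!containing_node.metadata.is_top_level || index >= count - 1) {
    return false;
  }
  // A comment that had no blank line after it in the source keeps that spacing
  if (node.metadata.is_comment && node.metadata.num_blank_lines_following == 0) {
    return false;
  }
  // An inline comment on the next element belongs to the current line
  const auto& next = containing_node.refs.at(index + 1).metadata;
  if (next.is_comment && next.is_inline) {
    return false;
  }
  return true;
}
```

Returning a `bool` instead of appending to the output string, as the removed `separate_by_newline` did, leaves the caller free to decide how the blank line is emitted.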


@ -1,12 +1,19 @@
#pragma once
#include <set>
#include <string>
#include "common/formatter/formatter_tree.h"
namespace formatter_rules {
extern const std::set<std::string> constant_types;
namespace constant_list {
bool is_constant_list(const FormatterTreeNode& node);
}
// The formatter will try to collapse as much space as possible in the top-level, this means
// separating forms by a single empty blank line
// separating forms by a single empty blank line
//
// The exception is comments: top-level comments retain their following blank lines from the
// original source
@ -18,11 +25,10 @@ namespace formatter_rules {
//
// Reference - https://github.com/kkinnear/zprint/blob/main/doc/options/blank.md
namespace blank_lines {
void separate_by_newline(std::string& curr_text,
const FormatterTreeNode& containing_node,
const FormatterTreeNode& node,
const int index);
}
bool should_insert_blank_line(const FormatterTreeNode& containing_node,
const FormatterTreeNode& node,
const int index);
} // namespace blank_lines
// TODO:
// - align consecutive comment lines
@ -93,35 +99,6 @@ bool form_should_be_constant_paired(const FormatterTreeNode& node);
namespace indent {
const static int line_width_target = 120;
bool form_can_be_inlined(const std::string& curr_text, const FormatterTreeNode& curr_node);
// TODO - right now this is very primitive in that it only checks against our hard-coded config
// eventually make this explore both routes and determine which is best
// Also factor in distance from the gutter (there's some zprint rationale somewhere on this)
bool should_form_flow(const FormatterTreeNode& list_node, const bool inlining_form);
std::optional<bool> inline_form_element(const FormatterTreeNode& list_node, const int index);
void append_newline(std::string& curr_text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const int index,
const bool flowing,
const bool constant_pair_form,
const bool force_newline);
void indent_line(std::string& curr_text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const int depth,
const int index,
const bool flowing);
void align_lines(std::string& text,
const FormatterTreeNode& node,
const FormatterTreeNode& containing_node,
const bool constant_pair_form,
const bool flowing,
const bool force_flow,
const bool inline_element);
} // namespace indent
// Let forms fall into two main categories
@ -138,8 +115,5 @@ void align_lines(std::string& text,
// - forms inside the let binding are flowed
//
// Reference - https://github.com/kkinnear/zprint/blob/main/doc/options/let.md
namespace let {
// TODO - like above, factor in current cursor position
bool can_be_inlined(const FormatterTreeNode& form);
} // namespace let
namespace let {} // namespace let
} // namespace formatter_rules


@ -1,5 +1,7 @@
#include "rule_config.h"
#include "common/formatter/formatter_tree.h"
namespace formatter_rules {
namespace config {
@ -8,27 +10,40 @@ namespace config {
// TODO - this could be greatly simplified with C++20's designated initialization
FormFormattingConfig new_flow_rule(int start_index) {
FormFormattingConfig cfg;
cfg.force_flow = true;
cfg.start_flow_at_index = start_index;
cfg.hang_forms = false;
cfg.inline_until_index = start_index;
return cfg;
}
FormFormattingConfig new_binding_rule() {
FormFormattingConfig cfg;
cfg.start_flow_at_index = 2;
cfg.bindings_at_index = 1;
cfg.force_flow = true;
cfg.force_newline_at_indices = {2};
cfg.allow_inlining_if_size_less_than = 2;
cfg.hang_forms = false;
cfg.combine_first_two_lines = true;
auto binding_list_config = std::make_shared<FormFormattingConfig>();
binding_list_config->force_flow = true;
binding_list_config->hang_forms = false;
binding_list_config->indentation_width = 1;
binding_list_config->indentation_width_for_index = [](FormFormattingConfig cfg, int index) {
if (index == 0) {
return 0;
}
return 4;
};
binding_list_config->should_prevent_inlining = [](FormFormattingConfig config, int num_refs) {
// Only prevent inlining a binding list if there is more than 1 binding
if (num_refs > 1) {
return true;
}
return false;
};
binding_list_config->prevent_inlining =
true; // TODO - we only want to prevent inlining if there are more than 2 elements
cfg.index_configs.emplace(1, binding_list_config);
return cfg;
}
const std::unordered_map<std::string, FormFormattingConfig> opengoal_form_config = {
{"defun", new_flow_rule(3)},
{"defmethod", new_flow_rule(4)},
{"let", new_binding_rule()}};
} // namespace config
} // namespace formatter_rules
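
The callback-style fields used in `new_binding_rule` follow a simple pattern: a `std::function` member defaults to reading a static flag, and a specific rule may replace it with logic that depends on the form being formatted. A minimal sketch of that pattern follows; `InlineConfigSketch` and `binding_list_sketch` are hypothetical names, and the callback takes the config by const reference here rather than by value as in the diff.

```cpp
#include <functional>

// Simplified stand-in for FormFormattingConfig (assumption: only the
// inlining-related fields are kept).
struct InlineConfigSketch {
  bool prevent_inlining = false;
  // By default, defer to the static flag. A rule may replace this callback
  // to decide based on how many elements the form actually has.
  std::function<bool(const InlineConfigSketch&, int)> should_prevent_inlining =
      [](const InlineConfigSketch& config, int /*num_refs*/) {
        return config.prevent_inlining;
      };
};

// Hypothetical binding-list rule: only block inlining when there is more
// than one binding, mirroring the callback set in new_binding_rule.
InlineConfigSketch binding_list_sketch() {
  InlineConfigSketch cfg;
  cfg.should_prevent_inlining = [](const InlineConfigSketch&, int num_refs) {
    return num_refs > 1;
  };
  return cfg;
}
```

A caller would evaluate `cfg.should_prevent_inlining(cfg, node.refs.size())` at formatting time, so the same config object can serve forms of any size.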


@ -1,25 +1,31 @@
#pragma once
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>
#include "common/formatter/rules/formatting_rules.h"
namespace formatter_rules {
namespace config {
struct FormFormattingConfig {
bool force_hang = false;
bool force_flow = false;
std::optional<int> allow_inlining_if_size_less_than = {};
int start_hang_at_index = 0;
int start_flow_at_index = 0;
// new
bool hang_forms = true; // TODO - remove this eventually, it's only involved in setting the
// indentation width, which we can do via the indentation_width function
int indentation_width =
2; // 2 for a flow // TODO - also remove this, prefer storing the first node's width in the
// metadata on the first pass, that's basically all this does
std::function<int(FormFormattingConfig, int)> indentation_width_for_index =
[](FormFormattingConfig config, int index) { return config.indentation_width; };
bool combine_first_two_lines =
false; // NOTE - basically hang, but will probably stick around after hang is gone
int inline_until_index = -1;
std::optional<int> bindings_at_index = {};
std::optional<int> skip_newlines_until_index = {};
std::vector<int> force_newline_at_indices = {};
bool bindings_force_newlines = false;
bool has_constant_pairs = false;
bool prevent_inlining = false;
std::function<bool(FormFormattingConfig, int num_refs)> should_prevent_inlining =
[](FormFormattingConfig config, int num_refs) { return config.prevent_inlining; };
int parent_mutable_extra_indent = 0;
std::unordered_map<int, std::shared_ptr<FormFormattingConfig>> index_configs = {};
};


@ -1,5 +1,5 @@
===
Separate Forms
Separate Top Level
===
(println "test")


@ -41,7 +41,7 @@ Four Pairs
:doit 789)
===
Not a Valid Constant
Function Call Pair - Inlinable
===
(:hello
@ -53,11 +53,33 @@ Not a Valid Constant
---
(:hello "world"
:world 123
:test 456
:not (println "hello world")
:doit 789)
===
Function Call Pair - Too Long and Multiline
===
(:hello
"world" :world 123
:test 456
:not (println "hello world" "hello world" "hello world" (println "hello world hello world hello world hello world hello world hello world hello world hello world"))
:doit 789)
---
(:hello "world"
:world 123
:test 456
:not
(println "hello world")
(println "hello world"
"hello world"
"hello world"
(println "hello world hello world hello world hello world hello world hello world hello world hello world"))
:doit 789)
===


@ -0,0 +1,36 @@
===
Decent size and nesting
===
(defun can-display-query? ((arg0 process) (arg1 string) (arg2 float))
(let ((a1-3 (gui-control-method-12
*gui-control*
arg0
(gui-channel query)
(gui-action play)
(if arg1
arg1
(symbol->string (-> arg0 type symbol))
)
0
arg2
(new 'static 'sound-id)
)
)
)
(= (get-status *gui-control* a1-3) (gui-status active))
)
)
---
(defun can-display-query? ((arg0 process) (arg1 string) (arg2 float))
(let ((a1-3 (gui-control-method-12 *gui-control*
arg0
(gui-channel query)
(gui-action play)
(if arg1 arg1 (symbol->string (-> arg0 type symbol)))
0
arg2
(new 'static 'sound-id))))
(= (get-status *gui-control* a1-3) (gui-status active))))


@ -11,6 +11,19 @@ Basic Nested Form
"world2"
"very-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-form"))
===
Basic Nested Form Reversed
===
(println (println "world" "world2" "very-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-form") "hello")
---
(println (println "world"
"world2"
"very-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-formvery-long-form")
"hello")
===
Multiple Top Level Forms
===


@ -0,0 +1,9 @@
===
Quoted Symbols
===
(new 'static 'sound-id)
---
(new 'static 'sound-id)


@ -1,14 +0,0 @@
===
Top Level Elements
===
(println "test")
(println "test") (println "test")
---
(println "test")
(println "test")
(println "test")


@ -90,6 +90,8 @@ bool has_important_tests(const fs::path& file_path) {
return false;
}
// TODO - consider adding a test that auto-formats all of goal_src (there should be no errors)
bool run_tests(const fs::path& file_path, const bool only_important_tests) {
const auto& tests = get_test_definitions(file_path);
// Run the tests, report successes and failures
@ -101,6 +103,9 @@ bool run_tests(const fs::path& file_path, const bool only_important_tests) {
continue;
}
const auto formatted_result = formatter::format_code(test.input);
if (formatted_result && str_util::starts_with(test.name, "!?")) {
fmt::print("FORMATTED RESULT:\n\n{}\n\n", formatted_result.value());
}
if (!formatted_result) {
// Unable to parse, was that expected?
if (test.output == "__THROWS__") {
@ -122,6 +127,7 @@ bool run_tests(const fs::path& file_path, const bool only_important_tests) {
}
bool find_and_run_tests() {
// TODO - fails when it finds no tests
try {
// Enumerate test files
const auto test_files = file_util::find_files_recursively(