nixpkgs/lib/fileset
Silvan Mosberger 3429594114 lib.fileset: Fix tests on Darwin, more POSIX
This was found when trying to run the fileset tests on Darwin
(https://github.com/NixOS/nix/pull/9546#issuecomment-1967409445), which mysteriously fail on Darwin:

  test case at lib/fileset/tests.sh:342 failed: toSource { root = "/nix/store/foobar"; fileset = ./.; } should have errored with this regex pattern:

  lib.fileset.toSource: `root` \(/nix/store/foobar\) is a string-like value, but it should be a path instead.
  \s*Paths in strings are not supported by `lib.fileset`, use `lib.sources` or derivations instead.

  but this was the actual error:

  error: lib.fileset.toSource: `root` (/nix/store/foobar) is a string-like value, but it should be a path instead.
      Paths in strings are not supported by `lib.fileset`, use `lib.sources` or derivations instead.

After dissecting this, I find out that apparently \s works on Linux, but not on Darwin for some reason!

From the bash source code, it looks like <regex.h> with `REG_EXTENDED` is used for all platforms the same,
so there's nothing odd there.

It's almost impossible to know where <regex.h> comes from,
but it looks to be a POSIX thing.

So after digging through the almost impossible to find POSIX specifications
(https://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html#tag_007_003_005),
I can indeed confirm that there's no mention of \s or the like!

_However_, there is a mention of `[[:blank:]]`, so we'll use that instead.
2024-02-27 23:26:40 +01:00
..
benchmark.sh
default.nix
internal.nix
mock-splitRoot.nix
README.md
tests.sh

File set library

This is the internal contributor documentation. The user documentation is in the Nixpkgs manual.

Goals

The main goal of the file set library is to be able to select local files that should be added to the Nix store. It should have the following properties:

  • Easy: The functions should have obvious semantics, be low in number and be composable.
  • Safe: Throw early and helpful errors when mistakes are detected.
  • Lazy: Only compute values when necessary.

Non-goals are:

  • Efficient: If the abstraction proves itself worthwhile but too slow, it can be still be optimized further.

Tests

Tests are declared in tests.sh and can be run using

./tests.sh

Benchmark

A simple benchmark against the HEAD commit can be run using

./benchmark.sh HEAD

This is intended to be run manually and is not checked by CI.

Internal representation

The internal representation is versioned in order to allow file sets from different Nixpkgs versions to be composed with each other, see internal.nix for the versions and conversions between them. This section describes only the current representation, but past versions will have to be supported by the code.

fileset

An attribute set with these values:

  • _type (constant string "fileset"): Tag to indicate this value is a file set.

  • _internalVersion (constant 3, the current version): Version of the representation.

  • _internalIsEmptyWithoutBase (bool): Whether this file set is the empty file set without a base path. If true, _internalBase* and _internalTree are not set. This is the only way to represent an empty file set without needing a base path.

    Such a value can be used as the identity element for union and the return value of unions [] and co.

  • _internalBase (path): Any files outside of this path cannot influence the set of files. This is always a directory and should be as long as possible. This is used by lib.fileset.toSource to check that all files are under the root argument

  • _internalBaseRoot (path): The filesystem root of _internalBase, same as (lib.path.splitRoot _internalBase).root. This is here because this needs to be computed anyway, and this computation shouldn't be duplicated.

  • _internalBaseComponents (list of strings): The path components of _internalBase, same as lib.path.subpath.components (lib.path.splitRoot _internalBase).subpath. This is here because this needs to be computed anyway, and this computation shouldn't be duplicated.

  • _internalTree (filesetTree): A tree representation of all included files under _internalBase.

  • __noEval (error): An error indicating that directly evaluating file sets is not supported.

filesetTree

One of the following:

  • { <name> = filesetTree; }: A directory with a nested filesetTree value for directory entries. Entries not included may either be omitted or set to null, as necessary to improve efficiency or laziness.

  • "directory": A directory with all its files included recursively, allowing early cutoff for some operations. This specific string is chosen to be compatible with builtins.readDir for a simpler implementation.

  • "regular", "symlink", "unknown" or any other non-"directory" string: A nested file with its file type. These specific strings are chosen to be compatible with builtins.readDir for a simpler implementation. Distinguishing between different file types is not strictly necessary for the functionality this library, but it does allow nicer printing of file sets.

  • null: A file or directory that is excluded from the tree. It may still exist on the file system.

API design decisions

This section justifies API design decisions.

Internal structure

The representation of the file set data type is internal and can be changed over time.

Arguments:

  • (+) The point of this library is to provide high-level functions, users don't need to be concerned with how it's implemented
  • (+) It allows adjustments to the representation, which is especially useful in the early days of the library.
  • (+) It still allows the representation to be stabilized later if necessary and if it has proven itself

Influence tracking

File set operations internally track the top-most directory that could influence the exact contents of a file set. Specifically, toSource requires that the given fileset is completely determined by files within the directory specified by the root argument. For example, even with dir/file.txt being the only file in ./., toSource { root = ./dir; fileset = ./.; } gives an error. This is because fileset may as well be the result of filtering ./. in a way that excludes dir.

Arguments:

  • (+) This gives us the guarantee that adding new files to a project never breaks a file set expression. This is also true in a lesser form for removed files: only removing files explicitly referenced by paths can break a file set expression.
  • (+) This can be removed later, if we discover it's too restrictive
  • (-) It leads to errors when a sensible result could sometimes be returned, such as in the above example.

Empty file set without a base

There is a special representation for an empty file set without a base path. This is used for return values that should be empty but when there's no base path that would makes sense.

Arguments:

  • Alternative: This could also be represented using _internalBase = /. and _internalTree = null.
    • (+) Removes the need for a special representation.
    • (-) Due to influence tracking, union empty ./. would have /. as the base path, which would then prevent toSource { root = ./.; fileset = union empty ./.; } from working, which is not as one would expect.
    • (-) With the assumption that there can be multiple filesystem roots (as established with the path library), this would have to cause an error with union empty pathWithAnotherFilesystemRoot, which is not as one would expect.
  • Alternative: Do not have such a value and error when it would be needed as a return value
    • (+) Removes the need for a special representation.
    • (-) Leaves us with no identity element for union and no reasonable return value for unions []. From a set theory perspective, which has a well-known notion of empty sets, this is unintuitive.

No intersection for lists

While there is intersection a b, there is no function intersections [ a b c ].

Arguments:

  • (+) There is no known use case for such a function, it can be added later if a use case arises
  • (+) There is no suitable return value for intersections [ ], see also "Nullary intersections" here
    • (-) Could throw an error for that case
    • (-) Create a special value to represent "all the files" and return that
      • (+) Such a value could then not be used with fileFilter unless the internal representation is changed considerably
    • (-) Could return the empty file set
      • (+) This would be wrong in set theory
  • (-) Inconsistent with union and unions

Intersection base path

The base path of the result of an intersection is the longest base path of the arguments. E.g. the base path of intersection ./foo ./foo/bar is ./foo/bar. Meanwhile intersection ./foo ./bar returns the empty file set without a base path.

Arguments:

  • Alternative: Use the common prefix of all base paths as the resulting base path
    • (-) This is unnecessarily strict, because the purpose of the base path is to track the directory under which files could be in the file set. It should be as long as possible. All files contained in intersection ./foo ./foo/bar will be under ./foo/bar (never just under ./foo), and intersection ./foo ./bar will never contain any files (never under ./.). This would lead to toSource having to unexpectedly throw errors for cases such as toSource { root = ./foo; fileset = intersect ./foo base; }, where base may be ./bar or ./..
    • (-) There is no benefit to the user, since base path is not directly exposed in the interface

Empty directories

File sets can only represent a set of local files. Directories on their own are not representable.

Arguments:

  • (+) There does not seem to be a sensible set of combinators when directories can be represented on their own. Here's some possibilities:
    • ./. represents the files in ./. and the directory itself including its subdirectories, meaning that even if there's no files, the entire structure of ./. is preserved

      In that case, what should fileFilter (file: false) ./. return? It could return the entire directory structure unchanged, but with all files removed, which would not be what one would expect.

      Trying to have a filter function that also supports directories will lead to the question of: What should the behavior be if ./foo itself is excluded but all of its contents are included? It leads to having to define when directories are recursed into, but then we're effectively back at how the builtins.path-based filters work.

    • ./. represents all files in ./. and the directory itself, but not its subdirectories, meaning that at least ./. will be preserved even if it's empty.

      In that case, intersection ./. ./foo should only include files and no directories themselves, since ./. includes only ./. as a directory, and same for ./foo, so there's no overlap in directories. But intuitively this operation should result in the same as ./foo everything else is just confusing.

  • (+) This matches how Git only supports files, so developers should already be used to it.
  • (-) Empty directories (even if they contain nested directories) are neither representable nor preserved when coercing from paths.
    • (+) It is very rare that empty directories are necessary.
    • (+) We can implement a workaround, allowing toSource to take an extra argument for ensuring certain extra directories exist in the result.
  • (-) It slows down store imports, since the evaluator needs to traverse the entire tree to remove any empty directories
    • (+) This can still be optimized by introducing more Nix builtins if necessary

String paths

File sets do not support Nix store paths in strings such as "/nix/store/...-source".

Arguments:

  • (+) Such paths are usually produced by derivations, which means toSource would either:

    • Require Import From Derivation (IFD) if builtins.path is used as the underlying primitive
    • Require importing the entire root into the store such that derivations can be used to do the filtering
  • (+) The convenient path coercion like union ./foo ./bar wouldn't work for absolute paths, requiring more verbose alternate interfaces:

    • let root = "/nix/store/...-source"; in union "${root}/foo" "${root}/bar"

      Verbose and dangerous because if root was a path, the entire path would get imported into the store.

    • toSource { root = "/nix/store/...-source"; fileset = union "./foo" "./bar"; }

      Does not allow debug printing intermediate file set contents, since we don't know the paths contents before having a root.

    • let fs = lib.fileset.withRoot "/nix/store/...-source"; in fs.union "./foo" "./bar"

      Makes library functions impure since they depend on the contextual root path, questionable composability.

  • (+) The point of the file set abstraction is to specify which files should get imported into the store.

    This use case makes little sense for files that are already in the store. This should be a separate abstraction as e.g. pkgs.drvLayout instead, which could have a similar interface but be specific to derivations. Additional capabilities could be supported that can't be done at evaluation time, such as renaming files, creating new directories, setting executable bits, etc.

  • (+) An API for filtering/transforming Nix store paths could be much more powerful, because it's not limited to just what is possible at evaluation time with builtins.path. Operations such as moving and adding files would be supported.

Single files

File sets cannot add single files to the store, they can only import files under directories.

Arguments:

  • (+) There's no point in using this library for a single file, since you can't do anything other than add it to the store or not. And it would be unclear how the library should behave if the one file wouldn't be added to the store: toSource { root = ./file.nix; fileset = <empty>; } has no reasonable result because returing an empty store path wouldn't match the file type, and there's no way to have an empty file store path, whatever that would mean.

fileFilter takes a path

The fileFilter function takes a path, and not a file set, as its second argument.

  • (-) Makes it harder to compose functions, since the file set type, the return value, can't be passed to the function itself like fileFilter predicate fileset
    • (+) It's still possible to use intersection to filter on file sets: intersection fileset (fileFilter predicate ./.)
      • (-) This does need an extra ./. argument that's not obvious
        • (+) This could always be /. or the project directory, intersection will make it lazy
  • (+) In the future this will allow fileFilter to support a predicate property like subpath and/or components in a reproducible way. This wouldn't be possible if it took a file set, because file sets don't have a predictable absolute path.
    • (-) What about the base path?
      • (+) That can change depending on which files are included, so if it's used for fileFilter it would change the subpath/components value depending on which files are included.
  • (+) If necessary, this restriction can be relaxed later, the opposite wouldn't be possible

Strict path existence checking

Coercing paths that don't exist to file sets always gives an error.

  • (-) Sometimes you want to remove a file that may not always exist using difference ./. ./does-not-exist, but this does not work because coercion of ./does-not-exist fails, even though its existence would have no influence on the result.
    • (+) This is dangerous, because you wouldn't be protected against typos anymore. E.g. when trying to prevent ./secret from being imported, a typo like difference ./. ./sercet would import it regardless.
    • (+) difference ./. (maybeMissing ./does-not-exist) can be used to do this more explicitly.
    • (+) difference ./. (difference ./foo ./foo/bar) should report an error when ./foo/bar does not exist ("double negation"). Unfortunately, the current internal representation does not lend itself to a behavior where both difference x ./does-not-exists and double negation are handled and checked correctly. This could be fixed, but would require significant changes to the internal representation that are not worth the effort and the risk of introducing implicit behavior.