diff options
Diffstat (limited to 'readme.md')
-rw-r--r-- | readme.md | 68 |
1 files changed, 43 insertions, 25 deletions
@@ -10,13 +10,16 @@ BLoC and heavily reduces its size. Before explaining the format or its optimization techniques, let me first show you some results: -1. The bruijn expression `fac (+30)`, where `fac` is the factorial - implementation from `std/Math`: +1. The [bruijn](https://github.com/marvinborner/bruijn) expression + `fac (+30)`, where `fac` is the factorial implementation from + `std/Math`: - the original expression takes 2000 bytes in bit-encoded BLC - the same expression in BLoC needs only 423 bytes 2. [My solution](https://github.com/marvinborner/bruijn/blob/main/samples/aoc/2022/01/solve.bruijn) - for the “Advent of Code” challenge 2022/01 in bruijn: + for the “Advent of Code” challenge + [2022/01](https://adventofcode.com/2022/day/1) in + [bruijn](https://github.com/marvinborner/bruijn): - the original expression takes 6258 bytes in bit-encoded BLC - the same expression in BLoC needs only 1169 bytes @@ -43,15 +46,40 @@ following derivation of normal bit-encoded BLC: | 00M | abstraction of expression `M` | | 010MN | application of `M` and `N` | | 1<sup>i+1</sup>0 | bruijn index `i` | -| 011I | 2 byte index to an entry | +| 011I | index\* to an entry | + +(\*): The encoding of indices is quite special: $I=XA$, where +$X\in\\{00,01,10,11\\}$ and length of binary index $A$ is +$L(A)\in\\{1,2,4,8\\}$ byte respectively. The final program will be in the last entry. The indices start counting from the number of entries down to 0. +## Example + +Let `E` be some kind of expression like `E=\x.(((M (\y.N)) M) N)`, where +`M` and `N` are both arbitrary expressions of length 16. + +The raw BLC expression of `E` would then be `E=00010101M00NMN`. This +obviously has the drawback of redundant repetition of `M` and `N`. + +A possible encoding in BLoC: + +| from | to | content | +|:-----|:-----|:----------------------------------------------------------------------------------------------------------------| +| 0x00 | 0x04 | “BLoC” | +| 0x04 | 0x06 | number of entries: 3 | +| 0x06 | 0x17 | encoded `M`: gets a bit longer due to different encoding | +| 0x17 | 0x28 | encoded `N`: gets a bit longer due to different encoding | +| 0x28 | 0x34 | `00010010010011<M>00011<N>011<M>011<N>`, where `<M>=00{1}` and `<N>=00{1}` are indices with length of 1.25 byte | + +Even in this small example BLoC uses less space than BLC (0x34 vs. 0x42 +bytes). Depending on the content of `M` and `N`, this could have +potentially been compressed even more. + ## Optimizer -The optimizer converts a normal BLC expression to the BLoC format. Its -strategies are similar to compression using finite state entropy. +The optimizer converts a normal BLC expression to the BLoC format. In order to find the largest repeating sub-expressions, the expression first gets converted to a hashed tree (similar to a Merkle tree). @@ -65,24 +93,14 @@ in any other way. As an idea for the future, long expressions could get reduced using different techniques/depths and then get replaced with the shortest one (as fully reduced expressions aren’t necessarily shorter). -## Example +## Improvements -Let `E` be some kind of expression like `E=\x.(((M (\y.N)) M) N)`, where -`M` and `N` are both arbitrary expressions of length 16. +The current optimizer does not always make the best deduplication +choices. It seems like finding the optimal deduplications requires quite +complex algorithms which would probably be rather inefficient. -The raw BLC expression of `E` would then be `E=00010101M00NMN`. This -obviously has the drawback of redundant repetition of `M` and `N`. - -A possible encoding in BLoC: - -| from | to | content | -|:-----|:-----|:--------------------------------------------------------------------------------------| -| 0x00 | 0x04 | “BLoC” | -| 0x04 | 0x06 | number of entries: 3 | -| 0x06 | 0x17 | encoded `M`: gets a bit longer due to different encoding | -| 0x17 | 0x28 | encoded `N`: gets a bit longer due to different encoding | -| 0x28 | 0x35 | `00010010010011<M>00011<N>011<M>011<N>`, where `<M>=1` and `<N>=0` are 2 byte indices | - -Even in this small example BLoC uses less space than BLC (0x35 vs. 0x42 -bytes). Depending on the content of `M` and `N`, this could have -potentially been compressed even more. +For example, as of right now the length of an expression as seen by the +deduplicator doesn’t consider the change of length when sub-expressions +get replaced by a reference (of unknown bit length!) to another +expression. This results in entries like `(0 <1234>)` that would not +have needed to be deduplicated. |