aboutsummaryrefslogtreecommitdiff
path: root/readme.md
diff options
context:
space:
mode:
Diffstat (limited to 'readme.md')
-rw-r--r--readme.md68
1 files changed, 43 insertions, 25 deletions
diff --git a/readme.md b/readme.md
index 0818a16..d42b2ab 100644
--- a/readme.md
+++ b/readme.md
@@ -10,13 +10,16 @@ BLoC and heavily reduces its size.
Before explaining the format or its optimization techniques, let me
first show you some results:
-1. The bruijn expression `fac (+30)`, where `fac` is the factorial
- implementation from `std/Math`:
+1. The [bruijn](https://github.com/marvinborner/bruijn) expression
+ `fac (+30)`, where `fac` is the factorial implementation from
+ `std/Math`:
- the original expression takes 2000 bytes in bit-encoded BLC
- the same expression in BLoC needs only 423 bytes
2. [My
solution](https://github.com/marvinborner/bruijn/blob/main/samples/aoc/2022/01/solve.bruijn)
- for the “Advent of Code” challenge 2022/01 in bruijn:
+ for the “Advent of Code” challenge
+ [2022/01](https://adventofcode.com/2022/day/1) in
+ [bruijn](https://github.com/marvinborner/bruijn):
- the original expression takes 6258 bytes in bit-encoded BLC
- the same expression in BLoC needs only 1169 bytes
@@ -43,15 +46,40 @@ following derivation of normal bit-encoded BLC:
| 00M | abstraction of expression `M` |
| 010MN | application of `M` and `N` |
| 1<sup>i+1</sup>0 | bruijn index `i` |
-| 011I | 2 byte index to an entry |
+| 011I | index\* to an entry |
+
+(\*): The encoding of indices is quite special: $I=XA$, where
+$X\in\\{00,01,10,11\\}$ and length of binary index $A$ is
+$L(A)\in\\{1,2,4,8\\}$ byte respectively.
The final program will be in the last entry. The indices start counting
from the number of entries down to 0.
+## Example
+
+Let `E` be some kind of expression like `E=\x.(((M (\y.N)) M) N)`, where
+`M` and `N` are both arbitrary expressions of length 16.
+
+The raw BLC expression of `E` would then be `E=00010101M00NMN`. This
+obviously has the drawback of redundant repetition of `M` and `N`.
+
+A possible encoding in BLoC:
+
+| from | to | content |
+|:-----|:-----|:----------------------------------------------------------------------------------------------------------------|
+| 0x00 | 0x04 | “BLoC” |
+| 0x04 | 0x06 | number of entries: 3 |
+| 0x06 | 0x17 | encoded `M`: gets a bit longer due to different encoding |
+| 0x17 | 0x28 | encoded `N`: gets a bit longer due to different encoding |
+| 0x28 | 0x34 | `00010010010011<M>00011<N>011<M>011<N>`, where `<M>=00{1}` and `<N>=00{1}` are indices with length of 1.25 byte |
+
+Even in this small example BLoC uses less space than BLC (0x34 vs. 0x42
+bytes). Depending on the content of `M` and `N`, this could have
+potentially been compressed even more.
+
## Optimizer
-The optimizer converts a normal BLC expression to the BLoC format. Its
-strategies are similar to compression using finite state entropy.
+The optimizer converts a normal BLC expression to the BLoC format.
In order to find the largest repeating sub-expressions, the expression
first gets converted to a hashed tree (similar to a Merkle tree).
@@ -65,24 +93,14 @@ in any other way. As an idea for the future, long expressions could get
reduced using different techniques/depths and then get replaced with the
shortest one (as fully reduced expressions aren’t necessarily shorter).
-## Example
+## Improvements
-Let `E` be some kind of expression like `E=\x.(((M (\y.N)) M) N)`, where
-`M` and `N` are both arbitrary expressions of length 16.
+The current optimizer does not always make the best deduplication
+choices. It seems like finding the optimal deduplications requires quite
+complex algorithms which would probably be rather inefficient.
-The raw BLC expression of `E` would then be `E=00010101M00NMN`. This
-obviously has the drawback of redundant repetition of `M` and `N`.
-
-A possible encoding in BLoC:
-
-| from | to | content |
-|:-----|:-----|:--------------------------------------------------------------------------------------|
-| 0x00 | 0x04 | “BLoC” |
-| 0x04 | 0x06 | number of entries: 3 |
-| 0x06 | 0x17 | encoded `M`: gets a bit longer due to different encoding |
-| 0x17 | 0x28 | encoded `N`: gets a bit longer due to different encoding |
-| 0x28 | 0x35 | `00010010010011<M>00011<N>011<M>011<N>`, where `<M>=1` and `<N>=0` are 2 byte indices |
-
-Even in this small example BLoC uses less space than BLC (0x35 vs. 0x42
-bytes). Depending on the content of `M` and `N`, this could have
-potentially been compressed even more.
+For example, as of right now the length of an expression as seen by the
+deduplicator doesn’t consider the change of length when sub-expressions
+get replaced by a reference (of unknown bit length!) to another
+expression. This results in entries like `(0 <1234>)` that would not
+have needed to be deduplicated.