1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
|
# BLoC
This project proposes a file format for programs encoded in binary
lambda calculus (BLC). Its main goal is to minimize redundancy by
incorporating reference-based term sharing, while remaining easy to work
with. The included `bloc` tool converts BLC to BLoC and greatly reduces
its size.
The additional tool `blocade` (found in
[BLoCade](https://github.com/marvinborner/blocade)) can compile `.bloc`
files to various targets, including a shared pure BLC representation.
## Results
Before explaining the format or its optimization techniques, let me
first show you some results:
1. x86 C compiler [8cc](https://github.com/rui314/8cc) translated [to
lambda calculus](https://github.com/woodrush/lambda-8cc):
- the original expression takes ~5M bytes in bit-encoded BLC
- the same expression in BLoC needs only ~640K bytes (which is
around 2x the original 8cc!)
2. The [bruijn](https://github.com/marvinborner/bruijn) expression
`fac (+30)`, where `fac` is the factorial implementation from
`std/Math`:
- the original expression takes 1200 bytes in bit-encoded BLC
- the same expression in BLoC needs only 349 bytes
3. [My
solution](https://github.com/marvinborner/bruijn/blob/main/samples/aoc/2022/01/solve.bruijn)
for the “Advent of Code” challenge
[2022/01](https://adventofcode.com/2022/day/1) in
[bruijn](https://github.com/marvinborner/bruijn):
- the original expression takes 6258 bytes in bit-encoded BLC
- the same expression in BLoC needs only 935 bytes
You can find these examples in `test/`.
## Format
It’s assumed that all bit-encoded strings are padded with zeroes at the
end.
### Header
| from | to | content |
|:-----|:-----|:-------------------|
| 0x00 | 0x04 | identifier: “BLoC” |
| 0x04 | 0x06 | number of entries |
| 0x06 | 0x?? | entries |
### Entry
This reflects the basic structure of an expression. It uses the
following derivation of normal bit-encoded BLC:
| prefix | content |
|:-----------------|:------------------------------|
| 010M | abstraction of expression `M` |
| 00MN | application of `M` and `N` |
| 1<sup>i+1</sup>0 | bruijn index `i` |
| 011I | index\* to an entry |
(\*): The encoding of indices is quite special: $I=XA$, where
$X\in\\{00,01,10,11\\}$ and length of binary index $A$ is
$L(A)\in\\{1,2,4,8\\}$ byte respectively (see `src/{build,parse}.c`).
The final program will be in the last entry. The indices start counting
from the number of entries down to 0.
## Example
Let `E` be some kind of expression like `E=\x.(((M (\y.N)) M) N)`, where
`M` and `N` are both arbitrary expressions of length 16.
The raw BLC expression of `E` would then be `E=00010101M00NMN`. This
obviously has the drawback of redundant repetition of `M` and `N`.
A possible encoding in BLoC:
| from | to | content |
|:-----|:-----|:----------------------------------------------------------------------------------------------------------------|
| 0x00 | 0x04 | “BLoC” |
| 0x04 | 0x06 | number of entries: 3 |
| 0x06 | 0x17 | encoded `M`: gets a bit longer due to different encoding |
| 0x17 | 0x28 | encoded `N`: gets a bit longer due to different encoding |
| 0x28 | 0x34 | `00010010010011<M>00011<N>011<M>011<N>`, where `<M>=00{1}` and `<N>=00{0}` are indices with length of 1.25 byte |
Even in this small example BLoC uses less space than BLC (0x34 vs. 0x42
bytes). Depending on the content of `M` and `N`, this could have
potentially been compressed even more.
You can dump the bloc table using the `-d/--dump` flag of `bloc`. Some
additional dumps for testing can also be found in `test/`.
## Optimizer
The optimizer converts a normal BLC expression to the BLoC format.
In order to find the largest repeating sub-expressions, the expression
first gets converted to a hashed tree (similar to a Merkle tree).
The repeating sub-trees then get sorted by length and inserted into the
final table, while replacing them with references in the original
expression (see `src/tree.c` for more).
As of right now, expressions **don’t** get beta-reduced or manipulated
in any other way. As an idea for the future, long expressions could get
reduced using different techniques/depths and then get replaced with the
shortest one (as fully reduced expressions aren’t necessarily shorter).
## Libraries
- [pqueue](https://github.com/vy/libpqueue/) \[BSD 2-Clause\]: Simple
priority queue implementation
- [xxHash](https://github.com/Cyan4973/xxHash/) \[BSD 2-Clause\]:
Extremely fast hash algorithm
|