Shoelacer

Documentation

Download

Design

Shoelacer

Introduction
License
Compiling
Usage
Using the Results
Further Considerations
Contact the Author

Introduction

Shoelacer attempts to generate a pair of small C functions to compress or decompress a set of data presented to it, and possibly other data of that kind. The data it is given may consist of entries of just a few bytes or longer. Shoelacer's goal is to be reasonably fast with a tiny memory overhead. Compression performance is not so much the goal, although it certainly isn't ignored.

License

Shoelacer is licensed under General Public License version 3. See the accompanying file "COPYING" for the full license. The output of this program is in no way restricted, guaranteed, or licensed.

Compiling

Compiling should be as simple as typing 'make'. You can validate Shoelacer by typing 'make check'. You can remove binaries and test output by typing 'make clean'.

Usage

Shoelacer is NOT an end-user utility. It requires an input file that is not human readable, and for which there is no standard tool to generate. Its output is not generally useful except in the context of another program. The expected overall use is: an input file is generated in a format that Shoelacer understands from data that the program understands, Shoelacer then generates C code from this input file, and the generated C code is used within the original program.

Basic usage: shoelacer < infile > outfile.c
The input file is expected to be formatted as follows:
<4 bytes representing length of the entry data> <entry data> <4 bytes representing length of the entry data> <entry data> ... (with no extra characters, newlines or otherwise) ... (see slformat.c for a simple example)
Options are:
shoelacer [-s<size>] [-d<distribution bits>] [<model name>] [<struct name>]
All arguments and flags are optional.
<size> dictates how many bytes to shorten the representation distribution tables down to. By default this parameter is 1024.
<distribution bits> dictates how many distributions there are in an exponential fashion with base 2. Consequently, it also dictates how many bits are used as a checksum for hashed 2-grams. By default this parameter is 6.
<model name> is the name of the structure instance printed within the output. By default this parameter is "model".
<struct name> is the name of the structure printed within the output. By default this parameter is "struct".

A utility to create an input file from data with entries delimited by newlines ('\n') is provided:
slformat < infile > entriesfile

Using the Results

Shoelacer prints a (hopefully) valid C file to standard out. It starts with a structure followed by data for the actual model and ends with the encoding (compression) and decoding (decompression) functions.

The function prototypes are:

unsigned int shoelacer_encode(SHOELACER_MODEL* model,unsigned char* text,unsigned char* code,unsigned int textLen,unsigned int codeMax)

unsigned int shoelacer_decode(SHOELACER_MODEL* model,unsigned char* text,unsigned char* code,unsigned int textMax,unsigned int codeLen);

where SHOELACER_MODEL is #define'd to "struct name_of_model_at_top" (not literally).

The encode/decode functions are dependent only on DIST_BITS (as #define'd in the output as DIST_BITS, and as specified to Shoelacer with the '-d' flag). Different outputs over different data sets may use the same compression/decompression routines, as long as DIST_BITS is set dynamically, or DIST_BITS is identical for both of the generated code files.

The semantics of encoding:
Data from text is encoded and put into code. Only the first textLen-bytes of text are encoded, and encoding will stop if more than codeMax-bytes are needed for the compressed version. The return value is the number of bytes written to the code buffer. If the encoder runs out of space in the code buffer, it will return codeMax. However, it may return codeMax if it compressed to exactly codeMax-bytes as a fluke without running out of code space.

The semantics of decoding:
The compressed data code is decoded and put into text. Only the first codeLen-bytes of code are decoded, and decoding will stop if more than textMax-bytes are needed for the compressed version. The return value is the number of bytes written to the text buffer. If the decoder runs out of space in the text buffer, it will return textMax. However, it may return textMax if it decompressed to exactly textMax-bytes as a fluke without running out of text space.

Further Considerations

Due to the design of the compression scheme, encoding/decoding is "reasonably fast" but not incredibly speedy. With this in mind, profiling for compilation of the the compression/decompression functions is recommended. (For gcc, -fprofile-generate and -fprofile-use.) On the Pentium 4's used for development, decompression requires 75 to 80 cycles per byte, where compression requires about twice as many. I would expect a large portion of these cycles to be due to pipeline stalls caused by difficult-to-predict branching.

Shoelacer has not been thoroughly tested in a dynamic environment. The utility of its output is heavily dependent on the quality and statistical properties of input provided. Careful measures may be necessary to ensure its model is not over-fitted to inappropriate data, especially in a dynamic environment where old data may or may not be representative of new data.

Contact the Author

If you've any questions, wish to contribute, or would like to report any bugs or suggestions, please contact:

Ian Elliot <kaganar(a)gmail*com>

Hosting for this project is provided by: