Compression Support
The major problem in supporting compressed files is random access. FSO needs to perform all the usual operations on a compressed file, fopen(), fseek(), fread() and fclose(), just as it would on any other file, without having to decompress the entire file into RAM first. .pof parsing in particular uses a lot of fseek() calls and small fread() calls. Compression algorithms are generally designed to decompress a file from start to finish in one pass, so random access is not a standard part of most compression formats.
The first compression support was merged in FSO 21.2.0. This first implementation uses LZ4/LZ4-HC and is referred to internally as "LZ41". LZ41 is mainly intended for loose files, but compressed files inside VPs are also supported.
LZ41
The LZ41 design is similar to the LZ4 random access example. It uses LZ4 stream compression together with a table of ints storing the offsets that indicate which compressed block a given original file position falls in, an int holding the number of these offsets, the original file size, and the block size. The major difference is that, instead of using a dictionary, the stream is reset at each block. This makes the blocks independent of each other, so any block can be picked and decompressed on its own. Resetting the stream rather than using a dictionary may be less efficient (this is untested at the time of writing), but it was the only way to do it for individual files without having to embed the dictionary too. Like the official random access example, LZ41 does not follow the standard LZ4 format, as the LZ4 data format does not seem to account for what random access needs. FSO checks the file header on fopen(), and if the file is compressed, the logic for compressed files is used.
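The lookup from an original file position to a compressed block can be sketched as follows. This is a minimal illustration in Python, not FSO's actual code: the helper name `blocks_for_read` is hypothetical, and it only shows the arithmetic that decides which independent blocks a given fseek()/fread() pair would need to decompress.

```python
def blocks_for_read(position, length, block_size, original_size):
    """Return (block_index, start_in_block, end_in_block) for each block a read touches.

    Hypothetical helper: `position` and `length` refer to the *uncompressed* file.
    """
    if position < 0 or length < 0:
        raise ValueError("position and length must be non-negative")
    # Clamp the read to the end of the original file, as fread() would.
    end = min(position + length, original_size)
    spans = []
    while position < end:
        index = position // block_size   # which block holds this byte
        start = position % block_size    # where the byte sits inside that block
        block_end = min((index + 1) * block_size, end)
        spans.append((index, start, start + (block_end - position)))
        position = block_end
    return spans
```

Under these assumptions, a small read costs one block decompression, or two when it straddles a block boundary, instead of decompressing the whole file.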
LZ41 Data Format
4 byte header ("LZ41")
N Blocks
N offsets
int num_offsets
int original_filesize
int blocksize
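As an illustration of the layout above, the sketch below reads the header and the trailing fields from a file held in memory. Several details are assumptions rather than facts from this page: that each int is a little-endian 32-bit value, that the fields are stored in exactly the order listed, and that the function name `read_lz41_index` and the returned dictionary keys are free to choose. Check the FSO source before relying on any of them.

```python
import struct

MAGIC = b"LZ41"
# num_offsets, original_filesize, blocksize (assumed: little-endian 32-bit ints)
TRAILER = struct.Struct("<3i")


def read_lz41_index(data):
    """Parse the header and trailing index of an LZ41 file held in memory.

    Returns None when the data does not start with the "LZ41" header.
    """
    if len(data) < len(MAGIC) + TRAILER.size or data[:4] != MAGIC:
        return None
    num_offsets, original_size, block_size = TRAILER.unpack_from(
        data, len(data) - TRAILER.size)
    # The offset table sits directly before the three trailing ints.
    table_start = len(data) - TRAILER.size - 4 * num_offsets
    if num_offsets < 0 or table_start < len(MAGIC):
        raise ValueError("corrupt LZ41 trailer")
    offsets = list(struct.unpack_from("<%di" % num_offsets, data, table_start))
    return {
        "offsets": offsets,
        "original_size": original_size,
        "block_size": block_size,
        "blocks_end": table_start,  # the compressed blocks end where the table begins
    }
```

Because the index is at the end of the file, a reader only needs the first 4 bytes and the tail of the file to decide how to serve later reads.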
LZ41 Tests
Uncompressed MVP 4.4.1
12.9GB - Space used on Disk
Compressed MVP 4.4.1 Block Size: 65536 LZ4-HC L6
7.01GB - Space used on disk
Compression time: 5 minutes (1 thread)
MVP 4.4.1 - All FS2 Models mission load time
0:50 - NVME Uncompressed
0:50 - NVME Compressed LZ4-HC L6, 7.01GB, BS: 65536
1:01 - SSD Compressed LZ4-HC L6, 7.01GB, BS: 65536
1:02 - SSD Uncompressed
1:49 - HDD Compressed LZ4-HC L6, 7.01GB, BS: 65536
2:30 - HDD Uncompressed
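To put the figures above in relative terms, the short script below computes the percentage saved in each case. It uses only the numbers listed in this section; the helper names are made up for the example.

```python
def to_seconds(mmss):
    """Convert an "m:ss" string such as "2:30" into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_saved(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


# Disk usage in GB: uncompressed vs. LZ4-HC L6 with a 65536 block size.
size_saved = percent_saved(12.9, 7.01)

# Mission load times: uncompressed vs. compressed, per drive type.
load_saved = {
    "NVME": percent_saved(to_seconds("0:50"), to_seconds("0:50")),
    "SSD": percent_saved(to_seconds("1:02"), to_seconds("1:01")),
    "HDD": percent_saved(to_seconds("2:30"), to_seconds("1:49")),
}

print("disk space saved: %.1f%%" % size_saved)
for drive, saved in load_saved.items():
    print("%s load time saved: %.1f%%" % (drive, saved))
```

This works out to roughly 45.7% less disk space, no measurable change on NVME, about 1.6% faster loading on SSD and about 27.3% faster loading on HDD.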
LZ41 Compressor Guidelines
Block Size
Block size is set by the compressor. Block sizes of 8192, 16384 and 65536 were all tested; higher values are untested and may or may not work. A larger block size makes large compressed files smaller, but also makes FSO use slightly more RAM.
Ignore text-based file formats
.fc2, .fs2, .tbl, .tbm, .eff: Text files should be skipped. Compressing them works fine, but there is little to gain, and the only real effect is to make the file unreadable in text editors.
Movies are already compressed
.mp4, .ogg, .mve, .webm: The ratio will be negative most of the time, and when it is not, you are only saving a few KB, maybe a MB or two.
Audio files
.ogg, .wav: .wav may give some marginal gains, but it is not worth trying. .ogg has the same problem as movie files: these files are already compressed.
PNGs
.png and APNG: These are already compressed and 99% of the time end up with a negative ratio. PNGs that are not compressed will shrink considerably, but they are uncommon. It might be possible to check the PNG header for the compression value already in use before trying to compress them.
Do not keep files that end up with negative ratios
Whether the files are inside a VP or loose, if a compressed file ends up bigger than the original, it must be discarded.
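A compressor tool could encode these guidelines roughly as shown below. This is only a sketch: the function names are hypothetical, matching extensions case-insensitively is an assumption, and so is the `.apng` entry, since animated PNGs may equally be stored with a plain `.png` extension.

```python
import os

# Extensions the guidelines above say to skip.
SKIP_EXTENSIONS = {
    ".fc2", ".fs2", ".tbl", ".tbm", ".eff",   # text-based formats
    ".mp4", ".ogg", ".mve", ".webm",          # movies (and ogg audio)
    ".wav",                                   # marginal gains, not worth trying
    ".png", ".apng",                          # already compressed images
}

# Block sizes reported as tested; larger values are untested.
TESTED_BLOCK_SIZES = (8192, 16384, 65536)


def should_try_compression(filename):
    """Return False for file types the guidelines say to leave alone."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in SKIP_EXTENSIONS


def keep_compressed(original_size, compressed_size):
    """Keep the compressed file only if it is actually smaller.

    A file that comes out the same size gains nothing, so this sketch
    discards it as well.
    """
    return compressed_size < original_size
```

A tool would call `should_try_compression()` before compressing and `keep_compressed()` afterwards, falling back to the original file whenever the second check fails.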
Compression Function Example
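The following is a minimal sketch of the layout step only, not the compressor shipped with any FSO tool. To stay runnable with just the Python standard library it uses zlib as a stand-in for LZ4/LZ4-HC, so its output is not a valid LZ41 file and FSO cannot read it. It also assumes little-endian 32-bit ints, one offset per block, and offsets counted from the start of the file; the function names are hypothetical.

```python
import struct
import zlib


def pack_lz41_layout(data, block_size=65536, compress_block=zlib.compress):
    """Lay out `data` as header, independent blocks, offset table and trailer.

    zlib is only a stand-in here; a real compressor would pass an LZ4 or
    LZ4-HC block compressor as `compress_block`.
    """
    out = bytearray(b"LZ41")
    offsets = []
    for start in range(0, len(data), block_size):
        offsets.append(len(out))  # assumed: offsets count from the start of the file
        # Each block is compressed on its own, mirroring the per-block stream reset.
        out += compress_block(data[start:start + block_size])
    out += struct.pack("<%di" % len(offsets), *offsets)
    out += struct.pack("<3i", len(offsets), len(data), block_size)
    return bytes(out)


def read_block(packed, index, decompress_block=zlib.decompress):
    """Decompress a single block without touching the others."""
    num_offsets, _original_size, _block_size = struct.unpack_from(
        "<3i", packed, len(packed) - 12)
    table_start = len(packed) - 12 - 4 * num_offsets
    offsets = struct.unpack_from("<%di" % num_offsets, packed, table_start)
    # A block ends where the next one starts, or at the offset table for the last one.
    end = offsets[index + 1] if index + 1 < num_offsets else table_start
    return decompress_block(packed[offsets[index]:end])
```

After packing, a tool would compare `len(packed)` with `len(data)` and discard the result if it is not smaller, as the guidelines require.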
LZ41 Design Choices
A non-standard format was used, since the official random access example was already breaking the LZ4 frame format.
Additional data is appended to the end of each file: the offset table, the number of offsets, the block size and the original file size. Normally this data would live in a container and be cached on container load, but that is not possible for loose files.
Because of the 31-character filename limit, the idea of adding an extra extension to the files was dropped, as changing the filename length was not an option. Since loose files got no extra extension, an uncompressed .pof had the same name as a compressed .pof. As there was already no visible difference for loose files, there seemed to be no need to store compressed files in a container other than a plain .VP. This later proved to be a mistake, because it could cause user confusion.
Future Improvements
At some point in the 23.X.X versions a new guideline for compressing files will be required.
- The extra extension ".lz41" will be added to loose files, since recent improvements in the FSO code allow loose files to use more than 31 characters. FSO will ignore this extra extension after load, so "ship.pof.lz41" will be treated as "ship.pof".
- New VP container: .VPC
The VPC container format is identical to a .VP container in structure and limits, with one key difference: It may contain uncompressed files just like any normal VP, but it has to contain at least 1 compressed file.
Compressed files stored inside a .vpc container must not have the .lz41 extension, and their names must not be changed in any way.
While .VP can still technically be used to store compressed files, this is not recommended.
None of these changes break compatibility with files compressed before them, but as with any new feature or major change, older versions of FSO cannot use compressed files in this new way.
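The naming rules above can be illustrated with two small helpers. Both function names are hypothetical, and matching the extension case-insensitively is an assumption of this sketch, not something this page specifies.

```python
LZ41_SUFFIX = ".lz41"


def logical_name(loose_filename):
    """Name a loose file would be treated as once the extra extension is ignored."""
    if loose_filename.lower().endswith(LZ41_SUFFIX):
        return loose_filename[:-len(LZ41_SUFFIX)]
    return loose_filename


def valid_vpc_entry_name(name):
    """Entries inside a .vpc must keep their original name, without ".lz41"."""
    return not name.lower().endswith(LZ41_SUFFIX)
```

For example, `logical_name("ship.pof.lz41")` returns `"ship.pof"`, while `valid_vpc_entry_name("ship.pof.lz41")` returns `False`.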
Versioning System
To avoid compatibility issues with future changes or additions, the file header is consulted in all fopen(), fseek() and fread() operations; the header tells FSO what to do in each of them. If a better way to support compression, or a better algorithm, is found in the future, a different header must be used. Assuming it is "LZ42", FSO can handle "LZ41" files one way and "LZ42" files in a totally different way without breaking any compatibility.
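The header-based dispatch can be pictured as a lookup table keyed on the first 4 bytes of the file. The sketch below is illustrative only: the handler names are hypothetical, the LZ41 handler is a placeholder, and FSO's real code is not organized this way as far as this page states.

```python
def read_plain(data, position, length):
    """Uncompressed files are read directly."""
    return data[position:position + length]


def read_lz41(data, position, length):
    """Placeholder for the LZ41 path (block lookup plus block decompression)."""
    raise NotImplementedError("LZ41 block decoding goes here")


# One entry per known header. A future format only needs a new entry.
HANDLERS = {b"LZ41": read_lz41}


def select_reader(data):
    """Pick the read routine from the 4-byte header, defaulting to plain reads."""
    return HANDLERS.get(bytes(data[:4]), read_plain)
```

Supporting a hypothetical "LZ42" format would then mean registering another handler, for instance `HANDLERS[b"LZ42"] = read_lz42`, without touching the "LZ41" path.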