Difference between revisions of "Compression Support"

From FreeSpace Wiki
Jump to: navigation, search
(LZ41 Tests)
(LZ41)
 
Line 11: Line 11:
 
The LZ41 implementation is non-LZ4 format standard, just like the official example for random access is, as the LZ4 data format does not seems to consider what is needed for random access.
 
The LZ41 implementation is non-LZ4 format standard, just like the official example for random access is, as the LZ4 data format does not seems to consider what is needed for random access.
 
FSO checks for the file header on fopen() and if the file is compressed then the logic for compressed files is used.
 
FSO checks for the file header on fopen() and if the file is compressed then the logic for compressed files is used.
 +
 +
== LZ4 Random Access Concept ==
 +
LZ4 (and any other compression format) is composed out of a chain of blocks. There are two type of blocks, dependant blocks and independient blocks, dependant blocks are far more efficient than indenpendient blocks, but they are intended to be read in order from start to finish, you cant just seek the stream and decompress a block in the middle whiout decompressing all other blocks before it. Independient blocks can always be extracted alone, no matter what. The cons is that the compression ratio is considerably worse.
 +
 +
So each block stores a fixed BLOCK_SIZE of uncompressed data, except maybe the last block. The actual block size will be always dynamic and change depending of how much the data can be compressed, so in order to keep track were each block is a list of offset is needed, this list of offsets contains the seek position of each block.
 +
Example: Block #1 is 21000 bytes long, Block #2 is 23000 bytes long, Block #3 is 25000 bytes long, the offset list will be like this: 21004, 44004, 69004. (the 4 is because of the starting file header that is 4 bytes long)
 +
 +
So in order to decompress the random data we need, we only need to do a POSITION % BLOCK_SIZE, this gives the block number were that data is. With the block number we go to the offset table and seek to that position. In this way we can read the block and pass it to the decompression function.
 +
 +
Block Number * BLOCK_SIZE gives the starting position of the data on the block, so if we are looking for position 5000 and the starting data in the block starts at 4000 we know that we have to ignore the fist 1000 bytes of the extracted data, this also tells if the block contains all the data we need or if we should move to the next block.
  
 
== LZ41 Data Format ==
 
== LZ41 Data Format ==

Latest revision as of 00:42, 12 May 2023

Compression Support

The major problem in supporting compressed files is random access, FSO needs to be able to do all operations on a compressed file as it would normally do with any file; fopen(), fseek(), fread(), fclose(). Whiout having to decompress the entire file and store it in ram to do those operations. .pof parsing specially uses a lot of fseeks() and small freads(). Generally, file compression algos are intended to decompress the entire file from start to finish in one go and thus random access its non standart for the compression format.

In FSO 21.2.0 the first compression support was merged into the code, the first implementation uses LZ4/LZ4-HC and it is called internally as "LZ41". The LZ41 implementation is mainly intended for loose files but adding those inside VPs is also supported.


LZ41

The LZ41 design is similar to the lz4 random access example. It uses the lz4 stream compression, with an int table which stores the offsets that indicate at what compressed block the original file position is, an int to indicate the number of these offsets, and the original filesize and blocksize. The major difference being that instead of using a dictionary, the stream is reset at each block, making them independent from each other, and thus you can pick any block you want and decompress that block at will. Resetting the stream rather than use a dictionary may be less efficient (this is untested as the time of writing this article), but it was the only way to do it for individual files without having to embed the dictionary too. The LZ41 implementation is non-LZ4 format standard, just like the official example for random access is, as the LZ4 data format does not seems to consider what is needed for random access. FSO checks for the file header on fopen() and if the file is compressed then the logic for compressed files is used.

LZ4 Random Access Concept

LZ4 (and any other compression format) is composed out of a chain of blocks. There are two type of blocks, dependant blocks and independient blocks, dependant blocks are far more efficient than indenpendient blocks, but they are intended to be read in order from start to finish, you cant just seek the stream and decompress a block in the middle whiout decompressing all other blocks before it. Independient blocks can always be extracted alone, no matter what. The cons is that the compression ratio is considerably worse.

So each block stores a fixed BLOCK_SIZE of uncompressed data, except maybe the last block. The actual block size will be always dynamic and change depending of how much the data can be compressed, so in order to keep track were each block is a list of offset is needed, this list of offsets contains the seek position of each block. Example: Block #1 is 21000 bytes long, Block #2 is 23000 bytes long, Block #3 is 25000 bytes long, the offset list will be like this: 21004, 44004, 69004. (the 4 is because of the starting file header that is 4 bytes long)

So in order to decompress the random data we need, we only need to do a POSITION % BLOCK_SIZE, this gives the block number were that data is. With the block number we go to the offset table and seek to that position. In this way we can read the block and pass it to the decompression function.

Block Number * BLOCK_SIZE gives the starting position of the data on the block, so if we are looking for position 5000 and the starting data in the block starts at 4000 we know that we have to ignore the fist 1000 bytes of the extracted data, this also tells if the block contains all the data we need or if we should move to the next block.

LZ41 Data Format

4 byte header ("LZ41")
N Blocks
N offsets
int num_offsets
int original_filesize
int blocksize

Lz41 structure.png

LZ41 Tests

Uncompressed MVP 4.4.1
12.9GB - Space used on Disk

Compressed MVP 4.4.1 Block Size: 65536 LZ4-HC L6
7.01GB - Space used on disk
Compression time: 5 minutes (1 thread)

MVP 4.4.1 - All FS2 Models mission load time
0:50 - NVME Compressed LZ4-HC L6, 7.01GB, BS: 65536
0:50 - NVME Uncompressed
1:01 - SSD Compressed LZ4-HC L6, 7.01GB, BS: 65536
1:02 - SSD Uncompressed
1:49 - HDD Compressed LZ4-HC L6, 7.01GB, BS: 65536
2:30 - HDD Uncompressed

LZ41 Compressor Guidelines

Block Size
Block size is set by the compressor 8192, 16384, 65536 block sizes were all tested, higher values are untested and may work or not. A higher block size will make the large compressed files smaller, but will also make FSO use slightly more RAM.

Ignore text based file formats
.fc2, .fs2, .tbl, .tbm, .eff: Text files should be ignored. Compressing them works fine, but there is little to be gained and the only thing it does is to make the file unreadeable by text editors.

Movies are already compressed
.mp4,.ogg,.mve,.webm: the ratio will be negative most of the time, and when is not, you are just saving a few kb, maybe a mb or two.

Audio files
.ogg, .wav: .wav may give you some marginal gains, but it is just not worth trying. With ogg its the same problem as with movie files, these files are already compressed.

pngs
.png and apng: These are already compressed and 99% of the time end up with a negative ratio. pngs that aren't compressed are going to go down in size, considerably, but this is uncommon. It might be possible to check the png header for the compression value they already have before trying to compress them.

Do not keep files that end up with negative ratios
No matter if those files are inside a vp or they are loose files, if a compressed file end up being bigger than the original one, it must be discarded.

Compression Function Example

Lz41 example.png
Source: VPC Compressor tool

LZ41 Design Choices

A non standart format was used since the official random access example was already breaking the lz4 frame format.
Additional data were added trailing to each file: the offset table, the num offsets, the block size, and the original filesize. Normally, those would be on a container and cached on container load, but it is not possible for loose files.

Due to the 31 chars filename limit, the idea of adding an extra extension to the files was dropped, as changing the filename length was just not an option, and since loose files were not getting an extra extension, an uncompressed .pof would be named the same as a compressed .pof. As a result, no need to store compressed files on a different container rather than plain .VP was seen as needed as there were already no difference for loose files. This was later proven to be a mistake because it could cause user confusion.

Future Improvements

At some point in 23.X.X versions a new guideline for compressing files will be requiered.
- The extra extension ".lz41" will be added to loose files. Since recent improvements in FSO code allows for loose files to use more than 31 characters. FSO will ignore this extra extension after load and thus "ship.pof.lz41" will be considered as "ship.pof".
- New VP container: .VPC
The .VPC container format is identical to a .VP container in structure and limits, with one key difference: It may contain uncompressed files just like any normal VP, but it has to contain at least 1 compressed file. Compressed files stored inside a .vpc container must not have the .lz41 extension or its name changed in any way.
While .VP can still tecnically be used to store compressed files, it is recommended not to.
None of these changes break compatibility with files compressed prior to these changes, but just as with any new feature or major changes, older versions of FSO can not use compressed files in this new way.

Versioning System

In order to avoid compatibility issues with future changes or additions, the file header is used in all fopen(), fseek(), fread() operations, the header tells FSO what to do in those operations, so in the future a new and better way to support compression is found, or a better algo, a different header must be used, lets assume "LZ42", this means FSO can handle "LZ41" files in a way and "LZ42" files in a totally different way whiout breaking any compatibility.