The model used by BBB looks like this:

            SSE1
           /    \
  ICM ----+      +---- SSE3 --- SSE4 --- SSE5 --- SSE6 --- Encoder
           \    /
            SSE2

The ICM takes a bytewise order 0 context, i.e. just the previous bits of the current byte. Recall that an ICM maps a context to a bit history, which is mapped to a slowly adapting probability. Each SSE stage maps a context and a probability to a new probability interpolated between the two nearest quantized table entries. SSE1 and SSE2 are both order 0, but SSE1 is fast adapting and SSE2 is slow adapting. The two predictions are averaged in the linear domain. The SSE stages in BBB update both quantized table entries above and below the input probability, unlike ZPAQ, which updates only the nearest one. SSE3 takes a bytewise order 1 context. SSE4 takes the previous byte, but not the current byte, as context, plus the run length quantized to 4 levels. SSE5 takes a sparse order 1 context of just the low 5 bits with a gap of 1 byte, i.e. 5 of the last 16 bits (...xxxxx........), plus the current byte. The output is averaged linearly with the input, with weight 3 given to the output. SSE6 takes a 14 bit hash of the order 3 context. Its output is averaged linearly with its input with weight 1 each.

The table below compares bzip2 and BBB on the Calgary corpus with each file compressed separately. In both cases the block size is 900 KB, which is large enough to hold each file in a single block. BBB is run in both fast and slow modes. About half of the compression time in both cases is due to PIC, which has long runs of 0 bytes. Unlike bzip2, BBB has no protection against slow string comparisons while sorting. Note also that compressing all of the data together as a tar file makes compression worse. As mentioned, BWT is poorly suited for mixed data types.

  Program       Calgary   Compr  Decomp  calgary.tar
  bzip2         828,...     0      0       860,...
  bbb cfk900    ...,672    33     46       762,...
  bbb ck900     ...,672    74     54

MSufSort v2. Most modern BWT based compressors use some type of suffix array sorting algorithm for the forward BWT transform.
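Returning to the SSE stages described above, the interpolation and two-sided update can be sketched as follows. This is a minimal illustration, not BBB's actual code: the table size (33 buckets), learning rate, and stretch-domain quantization range are assumptions made for the example.

```python
import math

def stretch(p):          # logit: maps (0,1) to the real line
    return math.log(p / (1.0 - p))

def squash(x):           # inverse of stretch
    return 1.0 / (1.0 + math.exp(-x))

class SSE:
    """Maps (context, input probability) to a refined probability,
    interpolated between the two nearest quantized table entries.
    Both neighboring entries are updated, as in BBB (ZPAQ updates
    only the nearest one).  Buckets are evenly spaced in the
    stretch domain over [-8, 8]."""
    def __init__(self, n_ctx, n_quant=33, rate=0.05):
        self.n_quant = n_quant
        self.rate = rate
        # Initialize so an untrained stage passes its input through.
        self.t = [[squash(-8.0 + 16.0 * j / (n_quant - 1))
                   for j in range(n_quant)] for _ in range(n_ctx)]
        self.last = None

    def predict(self, ctx, p):
        x = max(-8.0, min(8.0, stretch(p)))
        pos = (x + 8.0) * (self.n_quant - 1) / 16.0
        i = min(int(pos), self.n_quant - 2)   # lower bucket index
        w = pos - i                           # weight of upper bucket
        self.last = (ctx, i, w)
        row = self.t[ctx]
        return (1.0 - w) * row[i] + w * row[i + 1]

    def update(self, bit):
        # Move BOTH neighboring entries toward the observed bit,
        # in proportion to their interpolation weights.
        ctx, i, w = self.last
        row = self.t[ctx]
        row[i]     += (bit - row[i])     * self.rate * (1.0 - w)
        row[i + 1] += (bit - row[i + 1]) * self.rate * w
```

An untrained stage returns approximately its input probability; after repeatedly observing 1 bits in a context, its output for the same input rises toward 1.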
MSufSort version 2 is a fast and memory efficient algorithm developed by Maniscalco. A later improved version, MSufSort v3, used a different algorithm. Recall that a suffix array is a list of pointers to the suffixes of a string in lexicographical order. Thus, the SA for BANANA$ is (6, 5, 3, 1, 0, 4, 2), where $ precedes all other characters. The BWT is calculated by outputting the characters to the left of each of the pointers, e.g. ANNB$AA. MSufSort v2 actually calculates the inverse suffix array (ISA). The ISA lists the lexicographical rank of each suffix of the string. For example, the ISA of BANANA$ is (4, 3, 6, 2, 5, 1, 0). There is a simple algorithm for converting ISA to SA:

  for each i in 0..n-1 do SA[ISA[i]] := i

However, because SA and ISA each require 4n bytes of memory, the inversion is typically done in place. This can be achieved by using one bit of each element to mark those that have already been moved:

  for i from 0 to n-1 do
    if ISA[i] >= 0 then
      start := i
      suff := i
      rank := ISA[i]
      repeat
        tmp := ISA[rank]
        ISA[rank] := -suff
        suff := rank
        rank := tmp
      until rank = start
      ISA[start] := -suff

SA sorting algorithms based on comparison sorts have worst case run times of O(n^2 log n) or worse because of the O(n) time required to compare two strings when the average LCP is large, as in the case of highly redundant data. A common feature of efficient algorithms, such as the 15 described by Puglisi, Smyth, and Turpin, is that they exploit the fact that if two suffixes have a common prefix of m characters and their order is known, then we also know the order of each of their next m suffixes. For example, once we know that ANA$ comes before ANANA$, then we also know the order of their next m = 3 suffixes, specifically that NA$ comes before NANA$, A$ comes before ANA$, and $ comes before NA$. Suffix trees implicitly exploit this feature by merging the m common characters into a single edge, thus achieving O(n) sort times, although at a high cost in memory. MSufSort v2 is faster in practice than a suffix tree, and uses less memory, which allows for better compression because blocks can be larger. However, its run time is difficult to analyze because of its complexity.
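The definitions above can be checked with a short sketch. The construction here is naive (good enough for a toy string, not for real blocks), and the function names are illustrative, not MSufSort's. One detail: since -0 = 0, this sketch marks finished entries as -(x+1) rather than relying on a sign bit the way the pseudocode does.

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_from_sa(s, sa):
    # Output the character to the left of each suffix pointer (cyclically).
    return ''.join(s[i - 1] for i in sa)

def inverse_sa(sa):
    # ISA[i] = lexicographical rank of the suffix starting at i.
    isa = [0] * len(sa)
    for rank, i in enumerate(sa):
        isa[i] = rank
    return isa

def invert_isa_in_place(isa):
    """Convert ISA to SA in place by following cycles, marking
    finished entries with negative values (stored as -(x+1) so
    that a stored value of 0 can also be marked)."""
    n = len(isa)
    for i in range(n):
        if isa[i] >= 0:            # not yet moved
            start = i
            suff = i
            rank = isa[i]
            while rank != start:   # follow the cycle back to the start
                tmp = isa[rank]
                isa[rank] = -suff - 1
                suff = rank
                rank = tmp
            isa[start] = -suff - 1
    for i in range(n):             # strip the "done" marks
        isa[i] = -isa[i] - 1
    return isa
```

For BANANA$ this reproduces the values in the text: SA = (6, 5, 3, 1, 0, 4, 2), BWT = ANNB$AA, ISA = (4, 3, 6, 2, 5, 1, 0), and inverting the ISA in place recovers the SA.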
MSufSort v2 begins by partially sorting the suffixes by their first two characters using a counting sort, storing the result as a stack of 64K linked lists while reading right to left through the input string. To save space, the list pointers are overlaid on the ISA, and the head of each list is stored in a separate stack which is popped in lexicographical order. If a list contains only one element, then it is replaced by its rank in the ISA. The sign bit is used to distinguish ranks from pointers. If a list has more than one element, then that list is split into up to 64K new lists by a counting sort keyed on the next two characters, and the head of each new list is pushed on the stack, maintaining lexicographical order. Each stack entry has a head pointer and a suffix length. denotes the last