Lossless data compression may greatly increase communications speed and memory capacity without compromising data integrity. Some compression techniques even offer built-in encryption capabilities.
Yet, according to classical Shannon information theory, lossless data compression is theoretically impossible. That theory no longer explains but rather obscures practice. Autosophy information theory provides a much better understanding of lossless data compression techniques. Even the older Huffman codes are shown to be examples of Autosophy communication.
This tutorial deals with the three most basic compression techniques: the Huffman codes, the Ziv Lempel 77 code, and Autosophy tree networks. The V.42bis modem compression standard, based on dynamic Autosophy tree networks, is dealt with in a separate tutorial.
There are some general restrictions which should be considered. Data compression will not work for random bit sequences such as previously compressed or encrypted data. Inappropriate application can even lead to data expansion such that a "compressed" file contains more bits than the original. Some compression methods are also subject to error propagation in which even a single transmission error can scramble all the following data. Another practical concern is that many data compression algorithms are covered by patents. Patent licenses are required even for international standards such as V.42bis and choosing to ignore that is very unwise.
The basic method for lossless data compression

Autosophy data communication is conceptually quite simple. Consider an English dictionary containing up to 64k words in which each word is numbered in order. Each word-number or ADDRESS TOKEN would require 16 bits for identification. Things start to get interesting when an identical dictionary is held by both a transmitter and a receiver. To send a text file the transmitter would search the dictionary and substitute each word in the text file with its corresponding word-number or ADDRESS TOKEN. The receiver uses the ADDRESS TOKENS to find and retrieve the same words from its identical dictionary. A whole text word containing any number of characters can thus be transmitted with 16 bits, i.e., equivalent to only 2 ASCII characters. The compression is realized without any changes to the data. The compression ratio depends on the average number of characters in the words of the text file.

According to the Autosophy information theory, communication depends on the data content and the amount of “knowledge” shared by the transmitter and receiver. The more “knowledge” they share (and the closer it is to “true mathematical “knowledge”) the more efficient the communication.
An Autosophy communication system may be implemented with either a fixed or a learning library.
A fixed library is grown in advance and distributed on floppy disc or via an Internet download. A library can also be kept secret with distribution limited to authorized receivers. The transmitted address tokens would then represent a thoroughly encrypted code for everyone else, enabling secure communications on the Internet or via satellite. Foreign languages require separate libraries which can easily be identified during the communication setup. A fixed library system is most efficient where the vocabulary is known and relatively predictable. It then provides the highest compression ratios with the simplest implementation. A library made with a Content Addressable Memory (CAM) can attain extremely fast encoding speeds.
In a learning library system each communication begins with an empty library. The libraries in the transmitter and receiver are assembled in lockstep during the transmission using the transmitted address tokens. This is more flexible in that each transmission creates its own library specific to language and data type. However, the method does not work well for short messages of less than a few thousand characters, which may even be subject to data expansion. Any error in a learning library transmission may also cause error propagation. A data encryption option is not generally available.
The conventional Huffman codes
A first lossless data compression scheme was invented by David Huffman and described in "A Method for the Construction of Minimum Redundancy Codes" (1952). A patent (#3,694,813) has expired. Despite being a favorite of academics, the Huffman codes have resulted in few commercial applications. Its chief importance lies in the generally overlooked fact that the data input and transmission codes are treated as “addresses” rather than binary digits. That already takes it beyond the Shannon theory.

At the core of the Huffman codes is a library of statistics regarding the relative frequency of different characters. Variable length codes are assembled with the most often used characters being assigned the codes with the fewest bits. This is accomplished by adding the relative frequencies of the least used characters together into a new node, which is then combined again in pairs until each character has an assigned code. The frequency statistics tables are usually assembled by a software package using old text files as input.

Each ASCII input character is used as an “address” in the library to fetch the variable length bit code for transmission. The receiver has a much larger library containing the ASCII output codes. Most of its library locations are empty. The first bit arriving at the receiver is used as the most significant address bit, with the remaining bits of the address at all zeroes. If an output character is stored in that library location then the character is used as output data and the routine is restarted with the next transmitted bit. If that library location is empty then the next transmitted bit is used as the next lower significant address bit and the library is examined again. The library is thus searched in a tree pattern until an output character is found in the library.
The library must contain precise frequency statistics for each data character. If the statistics turn out to be wrong because of changes in the input data then data expansion may occur. Dynamic libraries where the frequency statistics are constantly and simultaneously recalculated in both the transmitter and the receiver can solve that problem but only for large file transmissions. Any error in such a transmission will cause error propagation.
For many decades the Huffman code remained the only lossless data compression method. That changed only in 1974 when the new Autosophy information theory led to more efficient algorithms. Huffman codes are now quite outdated and not recommended for commercial applications.
Ziv Lempel (LZ-77) sliding library data compression
The data compression method published in May 1977 by Jacob Ziv and Abraham Lempel (“A Universal Algorithm for Sequential Data Compression”) is now known as the LZ-1 or LZ-77 code. Ziv and Lempel did not apply for patents on this invention, but it should be noted that Douglas Whiting later received important improvement patents (5,003,307 and 5,016,009). Although generally inferior to the tree algorithms, LZ-77 remains the most widely used data compression method. It can be found everywhere from slow software-only applications to very fast integrated chipsets.

Tree network data compression and encryption
Self-learning tree networks were discovered in June 1974 by Klaus Holtz and first disclosed in 1975 in the application for patent 4,366,551. In September 1978, J. Ziv and A. Lempel published their second major compression algorithm ("Compression of Individual Sequences via Variable-Rate Coding") now known as the LZ-2 or LZ-78 code. Together with W. Eastman, Ziv and Lempel applied for patent 4,464,650 in August 1981. In June 1984 Terry Welch improved the method ("A Technique for High Performance Data Compression," Patent 4,558,303) and the resulting algorithm is now known as the Ziv-Lempel-Welch (LZW) code. LZW is a dynamic version of the Holtz trees invented ten years earlier. The serial tree network, illustrated below, is one of eight self-learning networks described by Holtz.
According to the Autosophy information theory a tree network accumulates true mathematical “knowledge.” It consists of identical nodes, each of which contains an “engram” of knowledge. An engram is created by a new data input (a GATE) related to already established knowledge (via a POINTER) which then creates a new engram of “knowledge” (an ADDRESS in omni dimensional hyperspace).
TREE NETWORK GENERATION ALGORITHM
TREE NETWORK RETRIEVAL ALGORITHM
Fixed library data compression and encryption
Fixed library text compression can triple communications speed or storage space. Encoding speed using a Content Addressable Memory can reach 50 Million characters per second and thus match any network speed. The library can be included in standard communications software packages with different libraries for different languages. Alternatively, the users may create their own private libraries for encrypted communications. Copies of the libraries must then be distributed to all authorized users, either on floppy disc or via Internet download.
FIXED LIBRARY GENERATION ALGORITHM
A hyperspace library may be grown from old text files using a software package. The input files must be at least 150 kbytes long and in the proper language. A hyperspace library with 8k nodes, each 21 bits wide, has proven near-optimal. The library will only contain alphabetic text words. Non-text characters can not be compressed and will later be represented by a 9 bit code. For more efficient storage all characters are converted into lower case ASCII code. The first 256 library ADDRESSES are pre-loaded with the ASCII character set so that each character will grow a separate tree network from a different SEED node. The character set may be scrambled for encryption.
In a first pass the input text files are used to grow a serial tree network in which each word is stored only once. The TIP code, which represents each text word, is sent to a special bubble sorting list. If a TIP code is not found in the bubble list then it is added to the bottom of the list. If an identical TIP code is already stored in the list then it is swapped with the TIP code one location higher up in the list. In this way each word code is stored only once and the most often used words in a language tend to migrate (bubble) up towards the top of the list. Eventually, starting from the top, all TIP codes in the list are converted back into clear text words using the Tree Network Retrieval Algorithm explained earlier. The text words are stored in a temporary file where each word is separated from the next by an ASCII space.
The tree library is then cleared and the text word file used in a second pass to create the actual output tree library. That fixed tree library is then used in the communication software or hardware.
DATA ENCODING AND TRANSMISSION ROUTINE
RECEIVER DATA RETRIEVAL
Prior test results show that an 8k by 21 bit node library is equivalent to about 130,000 characters of random text (as used in the LZ-77 code) and contains about 6000 of the most common English words. About 97% of the words used in a normal Internet communication are contained in the list. Words not found in the library are automatically chopped into known fragments for transmission. Several transmission codes may be required for an unknown word, but the fragmentation will not significantly reduce compression ratios because of the relative rarity of such words. For normal text communications the average compression ratio is about 3:1. The system can be modified to learn new words during a file transmission but the increase in compression ratios is not very significant.
The above example is designed to compress text files. Each text word is normally terminated by an ASCII space character as the “end of sequence” indicator. Other data types, including sounds and images, can be compressed using similar algorithms.
Compression of still images or live television involves spiral scanning sequences in which each pixel brightness value is treated like a text character in the above example. The “end of sequence” is a fixed string length limit. Lossless image and video compression is further explained in a separate tutorial.

Sound or voice can be compressed using a sampling clock. Each digital sample of the analog wave is treated like a text character in the above example. A sampling sequence is terminated when the analog wave crosses the zero volt level and signals the “end of sequence.” The last sample may be a time sample to prevent any drifting of the sampling clock. Analog wave forms, which are later not found in the library, are cut into wave fragments for transmission (with the S bit = 0).
A data compression and encryption system can be implemented on a single chip to speed up modem communications and maximize storage in PC disc files. A universal chip set can be used for open communications, and a secret library distributed for encrypted communications. A Content Addressable Memory (CAM) can increase encoding and retrieval speed to more than 50 Million characters per second, fast enough even for real time data applications. A fixed library system does not suffer from error propagation. Transmission errors will only cause local data errors. Data compression is active immediately, providing maximum data compression for file transmissions of any length.
Dynamic library data compression
The alternative to a fixed library is one which builds itself dynamically from transmitted address tokens. Basically, each tree node in the library contains a pointer to a previously learned string and also the first character of the next transmitted string code. An example of such a system is found in the V.42bis modem compression standard, described in a separate tutorial.
A dynamic library is a seductive idea but has many drawbacks along with the obvious benefit that it adapts to any language or data type. It will not compress short files very well because a relatively empty library contains few matching strings. Several thousand characters are needed even to avoid “data expansion.” The library can be “poisoned” by different data types, such as graphics embedded in a text file. It will then require a few thousand characters just to recover normal compression efficiency. Also, because each string is cut off at the first “not found node” a very disorganized library is formed. Instead of the text words found in a fixed library, a dynamic library contains random string fragments. Any transmission error will lead to differences between the transmitter and receiver libraries and lead to “error propagation.” The library can also grow to unmanageable dimensions during very large file transmissions. To avoid this requires a recycling library in which nodes are constantly cleared and recycled in further transmissions. The algorithms are very complex and require an embedded microprocessor for real time operations. Encoding and retrieval speed using even very fast processors can only just keep up with telephone or ISDN line speeds. Very fast data streams in very high speed channels can not be compressed using this method.
Performance profiles for text data
Data
compression ratios are determined by data content, i.e., the mix of data
types. Normal text data can be compressed significantly. Totally random
data can not be compressed using any algorithm.

The diagram, above, shows typically achievable compression ratios using the most advanced methods available.
The Huffman codes' compression ratios are quite low even for very predictable data. Significant compression ratios can only be achieved for highly specialized data types.
Though LZ-77 and V.42bis have similar characteristics, V.42bis compression ratios are almost always higher. With a sufficiently large library, compression ratios may even approach those achievable using fixed tree libraries. Searching such large libraries, though, requires more time and encoding speed slows down.
Inherent in a dynamic library system is a risk of data expansion for short files. The V.42bis standard avoids data expansion by switching the compression engine on or off depending on achieved compression ratios. An embedded microprocessor constantly monitors the compression ratios. If the compression ratios become too low then it switches to uncompressed data. If the data becomes compressible again then the communication switches back into compressed mode. With its small library V.42bis normally achieves only about 2:1 compression. And, though the algorithm is fast enough for telephone modems, it is too slow for newer high speed channels.
Fixed tree library systems achieve the highest compression on transmissions of all sizes. Ratios average about 3:1. Adding special learning features can further improve the compression, though only marginally. Fixed library systems are ideal for compressing Internet communications including e-mail and provide by far the most effective encryption option.