Kumar, Harsh (2016-12). A Plain-text Compression Technique with Fast Lookup Ability. Master's Thesis.
Data compression has always been an essential aspect of computing, and with the growing popularity of remote and cloud-based computation it is becoming more important still. Reducing the size of a data object in this context reduces both the amount of data transferred and the transfer time. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression and lookup speeds. Traditional compression techniques achieve high compression ratios, but require decompression before a lookup can be performed, which increases the lookup time. In this thesis, we propose a compression technique for plain-text data objects that uses variable-length encoding to compress data. The dictionary of possible words is sorted by the statistical frequency of word use, and each word is encoded as a variable-length code-word; words that are not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without decompressing the compressed data object. Our approach also facilitates string operations (such as concatenation, insertion, deletion and search-and-replace) on compressed text without decompression. We implement our technique in C++ and compare it with industry-standard tools such as gzip and bzip2 in terms of compression ratio, lookup speed, search-and-replace time and peak memory usage. When the data is searched and restored to a compressed format, our compression scheme is about 81x faster than gzip and about 165x faster than bzip2. In conclusion, our approach enables string operations such as concatenation, insertion, deletion and search-and-replace to be performed on the compressed file itself, without the need for decompression.