A Plain-text Incremental Compression (PIC) Technique with Fast Lookup Ability
Additional Document Info
2018 IEEE. Data compression is a key aspect of computing applications such as online search engines, cloud computing and big data computing. In recent times, with the increasing popularity of remote and cloud-based computation, compression is becoming more important. Reducing the size of a data object in this context would not only reduce the transfer time, but also the amount of data transferred. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression and search (lookup) speeds. Traditional compression techniques achieve high compression ratios, but require decompression before a lookup can be performed, thus increasing the lookup time. In this paper, we propose an incremental compression technique for plain-Text data objects (PIC), that uses variable length encoding to compress data. The dictionary of possible words is sorted based on the statistical frequency of the words, and the words are encoded using the variable length code-words. Words that are not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without the need to decompress the compressed data object. PIC also facilitates string operations (such as concatenation, insertion, deletion and lookup-And-replacement) on compressed text without the need of decompression. In this manner, these string operations are incrementally compressed, resulting in greatly improved efficiency. We implement our technique in C++, and compare PIC with industry standard tools like lz4, gzip and bzip2 in terms of compression ratio, lookup speed, and lookup-And-replace time. PIC is about 8.84, 100.17 and 174.25 faster as compared to lz4, gzip and bzip2, respectively, when the data is looked-up, and restored into a compressed format. A lookup-And-replace operation on a PIC-compressed file is shown to be 2.83 faster than a plain-Text file. We formally prove that our method does not produce any false positive or false negative results during a lookup, and also prove that PIC is amenable to file concatenation. We have also implemented a parallel version of PIC. Compression, decompression and search times are 2.24, 1.51 and 5 faster when eight threads are used instead of one. We also demonstrate that the parallel version of PIC scales better compared to commonly available tools for parallel compression. PIC opens the possibility of performing all string manipulations on a PIC-compressed file, yielding about 2 compression over plain text, and 1-2 orders of magnitude faster operations than traditional compression schemes.
name of conference
2018 IEEE 36th International Conference on Computer Design (ICCD)