Note: For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation). Native libraries are available for all of the compression formats (DEFLATE, gzip, bzip2, LZO, LZ4, and Snappy), whereas a Java implementation is available only for DEFLATE, gzip, and bzip2. The native libraries are picked up using the Java system property java.library.path.

Note: If you are using a native library and you are doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the cost of creating these objects. Obtain a compressor with compressor = CodecPool.getCompressor(codec), and return it to the pool with CodecPool.returnCompressor(compressor) when you are done with it.

Effect of splittable compression on MapReduce

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 128 MB, the file will be stored as eight blocks, and a MapReduce job using this file as input will create eight input splits, each processed independently as input to a separate map task.

Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as eight blocks. However, creating a split for each block won't work, because it is impossible to start reading at an arbitrary point in the gzip stream and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The start of each block is not marked in a way that lets a reader positioned at an arbitrary point in the stream find the next block boundary, which is why gzip does not support splitting.
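The split arithmetic and the gzip limitation described above can be tried out with a small sketch. It uses only the JDK's java.util.zip classes rather than Hadoop, and the class and method names (GzipSplitDemo, splitCount, sampleData, readableFrom) are made up for this illustration. It computes how many 128 MB splits cover a 1 GB file, then checks whether a gzip stream can be opened from its first byte and from an offset in the middle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSplitDemo {

    // Number of block-sized splits needed to cover a file (ceiling division).
    static long splitCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Illustrative input: ten thousand short text records.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("record ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress a byte array into a single gzip stream.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True if the bytes from 'offset' onward can be opened and fully read as gzip.
    static boolean readableFrom(byte[] gz, int offset) {
        InputStream tail = new ByteArrayInputStream(gz, offset, gz.length - offset);
        try (GZIPInputStream in = new GZIPInputStream(tail)) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // Discard the data; we only care whether decoding succeeds.
            }
            return true;
        } catch (IOException e) {
            // A missing gzip header or a corrupt stream ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;
        System.out.println("splits for 1 GB at 128 MB: " + splitCount(oneGb, blockSize));

        byte[] gz = gzip(sampleData());
        System.out.println("readable from start:  " + readableFrom(gz, 0));
        System.out.println("readable from middle: " + readableFrom(gz, gz.length / 2));
    }
}
```

A reader that starts in the middle has no gzip header to parse and no way to locate a DEFLATE block boundary, which is the same situation a map task would face if it were handed the second HDFS block of a gzip file.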
Note: A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. So, for example, GzipCodec encapsulates the compression and decompression algorithm for gzip.
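To make the idea of a codec as "one object that wraps both directions of an algorithm" concrete, here is a minimal sketch built on the JDK only. SimpleCodec, SimpleGzipCodec, and CodecSketch are hypothetical names invented for this example. They loosely mirror the shape of a codec interface (wrap an output stream to compress, wrap an input stream to decompress) and are not Hadoop's CompressionCodec or GzipCodec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CodecSketch {

    // Compress a byte array by writing it through the codec's output stream.
    static byte[] compress(SimpleCodec codec, byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompress a byte array by reading it through the codec's input stream.
    static byte[] decompress(SimpleCodec codec, byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                sink.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        SimpleCodec codec = new SimpleGzipCodec();
        byte[] original = "hello, codec".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(codec, original);
        byte[] restored = decompress(codec, packed);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}

// Hypothetical, simplified codec abstraction (an assumption for illustration).
interface SimpleCodec {
    OutputStream createOutputStream(OutputStream raw) throws IOException;
    InputStream createInputStream(InputStream compressed) throws IOException;
    String defaultExtension();
}

// gzip implementation of the hypothetical interface, backed by java.util.zip.
final class SimpleGzipCodec implements SimpleCodec {
    @Override
    public OutputStream createOutputStream(OutputStream raw) throws IOException {
        return new GZIPOutputStream(raw);
    }

    @Override
    public InputStream createInputStream(InputStream compressed) throws IOException {
        return new GZIPInputStream(compressed);
    }

    @Override
    public String defaultExtension() {
        return ".gz";
    }
}
```

Code that calls compress and decompress never mentions gzip directly, so swapping in another algorithm would only mean supplying a different SimpleCodec implementation.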