
I took a backup of the folder /var/www with the tar command, using the -z (gzip) option:

tar -cvzf file.gz /var/www/*

I checked the size of /var/www: it is around 100 kB, but the file produced by tar is around 185 MB. What could cause this?

TRiG

3 Answers


ls -sh does not take the contents of subdirectories into account; it shows only the size of each directory entry itself.

I would use du -csh -- * to calculate this instead (the -- prevents problems with file names starting with a dash), where, from man du:

   -s, --summarize
          display only a total for each argument
   -h, --human-readable
          print sizes in human readable format (e.g., 1K 234M 2G)
   -c, --total
          produce a grand total


If you have hard links, though, they will throw off the totals, since du counts each hard-linked file only once.
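For example, here is a hypothetical run contrasting the two commands (the directory names and sizes are made up for illustration):

$ ls -sh
total 8.0K
4.0K site-a
4.0K site-b
$ du -csh -- *
92M     site-a
93M     site-b
185M    total

ls shows only the 4 kB directory entries themselves, while du descends into each directory and sums its contents.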

Rinzwind

You are probably mistaken about how big the contents of your original directory are. In the case of directories, ls -l lists the size of the directory itself, not of the files it contains. For example,

drwxr-xr-x 8 www-data www-data 4096 Sep  2 03:12 some-dir

shows that the directory itself takes 4096 bytes. But that's only the size of some-dir's entry in your filesystem structure. To sum up the sizes of the directory's contents, you can use du ("disk usage"), for example:

du -s some-dir

As with ls and a bunch of other commands, you can use the -h switch for "human-readable" units:

du -s some-dir
1804    some-dir

du -sh some-dir
1,8M    some-dir
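
Applied to the question, a hypothetical check could look like this (the 185M figure is invented here to match the tarball size, purely for illustration):

$ ls -ld /var/www
drwxr-xr-x 8 www-data www-data 4096 Sep  2 03:12 /var/www
$ du -sh /var/www
185M    /var/www

If du reports something close to the tarball's size, the mystery is solved: ls was only ever showing the 4096-byte directory entry.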

It's not the case this time (see the accepted answer), but sometimes the extra overhead of archiving and compression can result in a larger archive than the original content.

This is true when the data has extremely high entropy, such as a directory filled with random data and/or already-compressed media files.

Example 1: Random data

$ dd if=/dev/urandom of=test bs=1M count=100
$ tar -zcf test.tar.gz test
$ tar -cf test.tar test
$ gzip -ck --best test.tar > test-best.tar.gz
$ gzip -ck --fast test.tar > test-fast.tar.gz
$ xz -ck --fast test.tar > test.tar.xz
$ xz --fast -ck test > test.xz
$ gzip --best -ck test > test.gz
$ bzip2 --best -ck test > test.bz2
$ ls -lS test*
-rw-r--r-- 1 adamhotep adamhotep 105326395 Oct  7 16:52 test.bz2
-rw-r--r-- 1 adamhotep adamhotep 104875661 Oct  7 16:49 test-fast.tar.gz
-rw-r--r-- 1 adamhotep adamhotep 104875661 Oct  7 16:48 test.tar.gz
-rw-r--r-- 1 adamhotep adamhotep 104874474 Oct  7 16:49 test-best.tar.gz
-rw-r--r-- 1 adamhotep adamhotep 104874206 Oct  7 16:51 test.gz
-rw-r--r-- 1 adamhotep adamhotep 104867840 Oct  7 16:48 test.tar
-rw-r--r-- 1 adamhotep adamhotep 104864052 Oct  7 16:50 test.tar.xz
-rw-r--r-- 1 adamhotep adamhotep 104862868 Oct  7 16:50 test.xz
-rw-r--r-- 1 adamhotep adamhotep 104857600 Oct  7 16:47 test

This created a random 100M file and then archived and compressed it in several different ways. The results are sorted by size (biggest first). As you can see, the tar container and the compression headers add overhead, and random data has a distinct lack of patterns to compress, so every variant comes out larger than the original.

The original random file is unsurprisingly the smallest here.

(I used -c and redirected the output of the compression commands so you can see more clearly which output file each one creates. Since -c leaves the input file intact anyway, the -k flag was superfluous.)
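
For contrast, a minimal sketch of the opposite extreme: 100M of zeros has almost no entropy, so the same style of gzip call collapses it to roughly a thousandth of its size (the exact byte count will vary with your gzip version):

$ dd if=/dev/zero of=zeros bs=1M count=100
$ gzip --best -ck zeros | wc -c
101791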

Example 2: Video+Audio data

$ youtube-dl -o test.mp4 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading video info webpage
[youtube] dQw4w9WgXcQ: Extracting video information
[youtube] dQw4w9WgXcQ: Downloading js player en_US-vflOj6Vz8
[download] Destination: test.mp4
[download] 100% of 56.64MiB in 00:07
$ gzip --best -ck test.mp4 >test.mp4.gz
$ xz --fast -ck test.mp4 >test.mp4.xz
$ ls -lS test.mp4*
-rw-r--r-- 1 adamhotep adamhotep  59388616 Oct  7 16:52 test.mp4
-rw-r--r-- 1 adamhotep adamhotep  59332683 Oct  7 16:52 test.mp4.gz
-rw-r--r-- 1 adamhotep adamhotep  59320572 Oct  7 16:52 test.mp4.xz

I repeated the gzip and xz tests for this test video. There was enough metadata to just barely shrink it with compression (xz can save 68k, a whopping 0.1%!). I suspect this has to do with the cues .mp4 leaves to ensure proper streaming and audio-visual sync. This particular video lacks subtitles.


In short, don't compress random or already-compressed data.
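
If in doubt, you can measure before committing: compress to stdout and count the bytes, without writing any file. Reusing the video from Example 2:

$ wc -c < test.mp4
59388616
$ gzip --best -c test.mp4 | wc -c
59332683

If the second number isn't meaningfully smaller than the first, skip the -z flag and just tar, or store the file as-is.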

Adam Katz