28

I have a 2958616 byte text file. When I run sort < file.txt | uniq > sorted-file.txt, I get a 3213965 byte text file. Why is my sorted text file bigger?

You can download the text files here.

wb9688
  • 1,437
  • 1
    Interesting. These are old, tried-and-tested tools. They should not make files bigger, only smaller. – Jos Jul 10 '16 at 10:37
  • Are you operating on any kind of compressed file system, or with sparse files that do not occupy space for large empty areas inside them? – Byte Commander Jul 10 '16 at 12:04
  • @ByteCommander No. – wb9688 Jul 10 '16 at 12:05
  • When I apply the command you gave above to your downloaded file.txt, the output is only 2942389 bytes large, about 16 kB smaller than the original. Are you sure you ran exactly the command you gave us? Please try running it again. – Byte Commander Jul 10 '16 at 12:09
  • I'm running exactly that command. I still get a bigger output. When I run it on another computer, I get a smaller output. – wb9688 Jul 10 '16 at 12:10
  • Interestingly, examining your output file and mine with wc shows that both have the same line and word count, but yours is bigger in bytes: 271576 271576 2942389 my-sorted.txt and 271576 271576 3213965 sorted-file.txt. – Byte Commander Jul 10 '16 at 12:12
  • 5
    Your output file has \r\n line endings, whereas the input file has \n line endings. Perhaps you should set your locale differently. Try LC_ALL=C in front of each command. – meuh Jul 10 '16 at 12:19
  • 2
    @meuh That was it! Could you add that as an answer? – wb9688 Jul 10 '16 at 12:20
  • 5
    Hang on, the locale affects this? What locale are you using? What's the output of locale? Are you sure you didn't create the file on some other system? – terdon Jul 10 '16 at 12:22
  • Looks like the original file has been mangled for the character sequence o-o as in co-ordinator which becomes coM-CM-6rdinator (in cat -vet output). So it gets recognised as non-unix by sort?? – meuh Jul 10 '16 at 12:32
  • @meuh that wouldn't make it add \r. Also, it doesn't do so on my Arch. It might be a bug in Ubuntu's sort or uniq I guess. – terdon Jul 10 '16 at 12:35
  • The utf8 char sequence in the original file is for coördinator and similar words. – meuh Jul 10 '16 at 12:39
  • But it only contains the following characters: abcdefghijklmnopqrstuvwxyz – wb9688 Jul 10 '16 at 12:43
  • 6
    sed '/^[a-z]*$/d' < file.txt | wc -l gave me 305 lines. – meuh Jul 10 '16 at 12:47
  • 5
    Your file also contains â ê î ñ ô ö öö û those aren't in the ASCII set. – terdon Jul 10 '16 at 16:42
  • 1
    You might be interested in the -u parameter of most sort versions. – PlasmaHH Jul 10 '16 at 20:48
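
The two suggestions from the comments above (prefixing each command with LC_ALL=C, and using sort -u) can be sketched as follows. This is only an illustration on a made-up five-line file, words.txt and the other file names are hypothetical, and it does not reproduce the asker's environment:

```shell
#!/bin/sh
# Hypothetical sample data; none of these file names come from the thread.
tmp=$(mktemp -d)
printf 'b\na\nb\nc\na\n' > "$tmp/words.txt"

# The original pipeline, with the C locale forced for each command.
LC_ALL=C sort < "$tmp/words.txt" | LC_ALL=C uniq > "$tmp/piped.txt"

# sort -u sorts and drops duplicate lines in one step.
LC_ALL=C sort -u "$tmp/words.txt" > "$tmp/unique.txt"

# Compare the two results byte for byte.
if cmp -s "$tmp/piped.txt" "$tmp/unique.txt"; then
    same=yes
else
    same=no
fi
echo "identical output: $same"

rm -rf "$tmp"
```

With the C locale, both commands compare lines as plain bytes, so the two outputs should match.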

2 Answers

42

While your original file has lines that end with \n, your sorted file has \r\n. The addition of the \r is what changes the size.

To illustrate, here's what happens when I run your command on my Linux system:

$ sort < file.txt | uniq > sorted-file.linux.txt
$ ls -l file.txt sorted-file.linux.txt 
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
$ wc -l file.txt sorted-file.linux.txt 
273882 file.txt
271576 sorted-file.linux.txt

As you can see, the sorted de-duped file is 2306 lines shorter (273882 vs. 271576) and, consequently, 16227 bytes smaller. Your file, however, is different:

$ wc -l sorted-file.linux.txt sorted-file.txt 
271576 sorted-file.linux.txt
271576 sorted-file.txt

The two files have exactly the same number of lines, but:

$ ls -l file.txt sorted-file.linux.txt sorted-file.txt 
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 3213965 Jul 10 12:11 sorted-file.txt

The sorted-file.txt, the one I downloaded from your link, is larger. The difference is 3213965 - 2942389 = 271576 bytes, which is exactly one extra byte per line. If we now examine the first line, we can see the extra \r:

$ head -n1 sorted-file.txt | od -c
0000000   a  \r  \n
0000003

That \r isn't present in the one I created on Linux:

$ head -n1 sorted-file.linux.txt | od -c
0000000   a  \n
0000002

If we now remove the \r from your file:

$ tr -d '\r' < sorted-file.txt > new-sorted-file.txt

We get the expected result, a file that is smaller than the original, just like the one I created on my system:

$ ls -l sorted-file.linux.txt new-sorted-file.txt file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:19 new-sorted-file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
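
To make the size effect concrete, here is a minimal sketch on a tiny made-up file. The file names are hypothetical, and the \r is added artificially with awk because I don't know what actually inserted it on your system. It only shows that switching from \n to \r\n costs one byte per line:

```shell
#!/bin/sh
# Hypothetical demo data; file names are invented for illustration.
tmp=$(mktemp -d)
printf 'pear\napple\npear\nfig\n' > "$tmp/in.txt"

# Same pipeline as in the question; the result has \n line endings.
LC_ALL=C sort < "$tmp/in.txt" | uniq > "$tmp/lf.txt"

# Simulate the unexpected output by ending every line with \r\n instead.
awk '{ printf "%s\r\n", $0 }' "$tmp/lf.txt" > "$tmp/crlf.txt"

# tr strips the padding some wc implementations print.
lines=$(wc -l < "$tmp/lf.txt" | tr -d ' ')
lf_bytes=$(wc -c < "$tmp/lf.txt" | tr -d ' ')
crlf_bytes=$(wc -c < "$tmp/crlf.txt" | tr -d ' ')
extra=$((crlf_bytes - lf_bytes))

echo "lines=$lines lf_bytes=$lf_bytes crlf_bytes=$crlf_bytes extra=$extra"

rm -rf "$tmp"
```

The extra byte count should equal the line count, which is the same pattern as in your two sorted files.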
terdon
  • 100,812
  • 3
    How come the sort command added \r to the resulting file? Isn't the combination of \r plus \n a Windows thing? – Tulains Córdova Jul 11 '16 at 13:06
  • 3
    @TulainsCórdova that's a very good question. I have no idea. I am assuming the OP did this in a non-native environment but I don't know. And yes, \r\n line endings are a Windows thing. – terdon Jul 11 '16 at 13:39
25

hexdump reveals it!

$ hexdump -cn 32 file.txt 
0000000   a   d   h   d  \n   a   d   s   l  \n   a   m   v   b  \n   a
0000010   o   v  \n   a   o   w  \n   a   r   o   b  \n   a   s   f   a
0000020

$ hexdump -cn 32 my-sorted.txt 
0000000   a  \n   a   a  \n   a   a   a  \n   a   a   d  \n   a   a   d
0000010   s  \n   a   a   f   j   e  \n   a   a   f   j   e   s  \n   a
0000020 

$ hexdump -cn 32 sorted-file.txt 
0000000   a  \r  \n   a   a  \r  \n   a   a   a  \r  \n   a   a   d  \r
0000010  \n   a   a   d   s  \r  \n   a   a   f   j   e  \r  \n   a   a
0000020   

Your sorted file is bigger because it uses Windows line endings \r\n (two bytes) instead of Linux line endings \n (one byte).
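
If you would rather count the affected lines than read a hex dump, something like the following sketch might help. The helper count_crlf and the two sample files are made up for illustration; the idea is just to let grep count lines that end in a carriage return:

```shell
#!/bin/sh
# Hypothetical sample files standing in for my-sorted.txt and sorted-file.txt.
tmp=$(mktemp -d)
printf 'a\naa\naaa\n' > "$tmp/unix.txt"
printf 'a\r\naa\r\naaa\r\n' > "$tmp/dos.txt"

# A literal carriage return; command substitution only strips newlines.
cr=$(printf '\r')

# Count lines that end in \r. grep -c prints 0 but exits non-zero when
# nothing matches, hence the "|| true".
count_crlf() {
    grep -c "${cr}\$" "$1" || true
}

unix_crlf=$(count_crlf "$tmp/unix.txt")
dos_crlf=$(count_crlf "$tmp/dos.txt")
echo "unix.txt: $unix_crlf CRLF lines, dos.txt: $dos_crlf CRLF lines"

rm -rf "$tmp"
```

A non-zero count for a file means it has Windows line endings on that many lines.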

Could it be that you were running that command above under Windows using either tools like cygwin or this new Linux subsystem for Windows 10? Or did you maybe run something in Wine?

Byte Commander
  • 107,489
  • this new Windows Subsystem for Linux? bash is only one Linux program that runs in it; sort is not bash. – user253751 Jul 11 '16 at 02:36
  • @immibis You mean Linux subsystem for Windows? I meant that, but haven't been too interested in it myself yet, so did not try or research it further so far. – Byte Commander Jul 11 '16 at 05:32
  • It is actually called the Windows Subsystem for Linux, but either one makes sense. (See how this would look with another subsystem: either "Windows Subsystem for Console [Applications]" or "Console [Application] Subsystem for Windows" makes sense) – user253751 Jul 11 '16 at 05:46
  • @immibis Aha, okay. You see I wasn't too interested in that specific topic yet. Forgive me, please :) – Byte Commander Jul 11 '16 at 05:49