I have a 2958616 byte text file. When I run sort < file.txt | uniq > sorted-file.txt, I get a 3213965 byte text file. Why is my sorted text file bigger?
You can download the text files here.
While your original file has lines that end with \n, your sorted file has \r\n. The addition of the \r is what changes the size.
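The effect is easy to reproduce on a small scale. The following is a minimal sketch, not taken from the question: the file names lf.txt and crlf.txt and the three sample lines are made up for illustration. It writes the same lines twice, once with \n and once with \r\n, and compares the byte counts.

```shell
# Build two tiny files with the same three lines, differing only in line endings.
# (lf.txt and crlf.txt are hypothetical names used only for this demo.)
dir=$(mktemp -d)
printf 'a\naa\naaa\n'       > "$dir/lf.txt"     # Unix endings
printf 'a\r\naa\r\naaa\r\n' > "$dir/crlf.txt"   # Windows endings

# tr -d ' ' strips the padding some wc implementations add.
lf_size=$(wc -c < "$dir/lf.txt" | tr -d ' ')
crlf_size=$(wc -c < "$dir/crlf.txt" | tr -d ' ')
line_count=$(wc -l < "$dir/lf.txt" | tr -d ' ')

# The CRLF file is larger by exactly one byte per line.
echo "LF: $lf_size bytes, CRLF: $crlf_size bytes, lines: $line_count"
echo "difference: $((crlf_size - lf_size))"
```

Both files have three lines, but the second one is three bytes larger: one \r per line.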
To illustrate, here's what happens when I run your command on my Linux system:
$ sort < file.txt | uniq > sorted-file.linux.txt
$ ls -l file.txt sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
$ wc -l file.txt sorted-file.linux.txt
273882 file.txt
271576 sorted-file.linux.txt
As you can see, the sorted, de-duplicated file has 2306 fewer lines and is, consequently, 16227 bytes smaller. Your file, however, is different:
$ wc -l sorted-file.linux.txt sorted-file.txt
271576 sorted-file.linux.txt
271576 sorted-file.txt
The two files have exactly the same number of lines, but:
$ ls -l file.txt sorted-file.linux.txt sorted-file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
-rw-r--r-- 1 terdon terdon 3213965 Jul 10 12:11 sorted-file.txt
The sorted-file.txt, the one I downloaded from your link, is larger. If we now examine the first line, we can see the extra \r:
$ head -n1 sorted-file.txt | od -c
0000000 a \r \n
0000003
That \r isn't present in the file I created on Linux:
$ head -n1 sorted-file.linux.txt | od -c
0000000 a \n
0000002
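Looking at the first line only shows one \r. To check the whole file, one option is to count the carriage-return bytes. This is a small sketch; count_cr is a helper name I made up, not a standard command, and the sample file is generated just for the demonstration.

```shell
# count_cr FILE: print the number of carriage-return (\r) bytes in FILE.
# (count_cr is a hypothetical helper, not an existing tool.)
count_cr() {
    # tr -cd '\r' deletes every byte except \r; wc -c counts what is left.
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Demo on a throwaway file: two lines end in \r\n, one ends in plain \n.
sample=$(mktemp)
printf 'a\r\naa\r\naaa\n' > "$sample"
count_cr "$sample"
rm -f "$sample"
```

Running count_cr sorted-file.txt and comparing the result with wc -l sorted-file.txt would show whether every line carries a \r or only some of them.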
If we now remove the \r from your file:
$ tr -d '\r' < sorted-file.txt > new-sorted-file.txt
We get the expected result, a file that is smaller than the original, just like the one I created on my system:
$ ls -l sorted-file.linux.txt new-sorted-file.txt file.txt
-rw-r--r-- 1 terdon terdon 2958616 Jul 10 12:11 file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:19 new-sorted-file.txt
-rw-r--r-- 1 terdon terdon 2942389 Jul 10 15:15 sorted-file.linux.txt
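The numbers above also add up exactly. As a quick sanity check (the variable names are mine, the values are copied from the ls and wc output above), the size difference between the two sorted files equals the number of lines, which is what one extra \r per line would produce:

```shell
linux_size=2942389    # bytes in sorted-file.linux.txt (from ls -l above)
yours_size=3213965    # bytes in sorted-file.txt (from ls -l above)
line_count=271576     # lines in both sorted files (from wc -l above)

extra=$((yours_size - linux_size))
echo "extra bytes: $extra"

# One extra byte per line means the difference equals the line count.
if [ "$extra" -eq "$line_count" ]; then
    echo "exactly one extra byte per line"
fi
```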
\r\n line endings are a Windows thing. – terdon Jul 11 '16 at 13:39
hexdump reveals it!
$ hexdump -cn 32 file.txt
0000000 a d h d \n a d s l \n a m v b \n a
0000010 o v \n a o w \n a r o b \n a s f a
0000020
$ hexdump -cn 32 my-sorted.txt
0000000 a \n a a \n a a a \n a a d \n a a d
0000010 s \n a a f j e \n a a f j e s \n a
0000020
$ hexdump -cn 32 sorted-file.txt
0000000 a \r \n a a \r \n a a a \r \n a a d \r
0000010 \n a a d s \r \n a a f j e \r \n a a
0000020
Your sorted file is bigger because it uses Windows line endings \r\n (two bytes) instead of Linux line endings \n (one byte).
Could it be that you were running the command above under Windows, using tools like Cygwin or the new Linux subsystem for Windows 10? Or did you maybe run something in Wine?
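If you are not sure which environment a shell is running in, uname can give a hint. The following is a rough sketch under stated assumptions: detect_env and its output labels are names I made up, and the patterns rely on Cygwin/MSYS reporting a kernel name starting with CYGWIN, MINGW or MSYS, and on the Windows Subsystem for Linux putting "Microsoft" in its kernel release string. It cannot detect Wine, and it is only a heuristic.

```shell
# detect_env: print a rough guess at the kind of environment the shell runs in.
# (detect_env and its labels are hypothetical, for illustration only.)
detect_env() {
    kernel=$(uname -s)
    release=$(uname -r)
    case "$kernel" in
        CYGWIN*|MINGW*|MSYS*)
            echo "windows-posix-layer" ;;      # Cygwin, MSYS or Git Bash
        Linux)
            case "$release" in
                *[Mm]icrosoft*) echo "wsl" ;;  # Windows Subsystem for Linux
                *)              echo "linux" ;;
            esac ;;
        *)
            echo "other" ;;
    esac
}

detect_env
```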
file.txt, the output is only 2942389 bytes large, about 16 kB smaller than the original. Are you sure you ran exactly the command you gave us? Please try running it again. – Byte Commander Jul 10 '16 at 12:09
wc, I learn that they both have the same line and word count, but yours is bigger in bytes: 271576 271576 2942389 my-sorted.txt and 271576 271576 3213965 sorted-file.txt. – Byte Commander Jul 10 '16 at 12:12
\r\n line endings, whereas the input file has \n line endings. Perhaps you should set your locale differently. Try LC_ALL=C in front of each command. – meuh Jul 10 '16 at 12:19
locale? Are you sure you didn't create the file on some other system? – terdon Jul 10 '16 at 12:22
o-o as in co-ordinator which becomes coM-CM-6rdinator (in cat -vet output). So it gets recognised as non-unix by sort?? – meuh Jul 10 '16 at 12:32
\r. Also, it doesn't do so on my Arch. It might be a bug in Ubuntu's sort or uniq I guess. – terdon Jul 10 '16 at 12:35
coördinator and similar words. – meuh Jul 10 '16 at 12:39
abcdefghijklmnopqrstuvwxyz – wb9688 Jul 10 '16 at 12:43
sed '/^[a-z]*$/d' < file.txt | wc -l gave me 305 lines. – meuh Jul 10 '16 at 12:47
â ê î ñ ô ö öö û those aren't in the ASCII set. – terdon Jul 10 '16 at 16:42