Chinese encoding in names of compressed files in zip

Question

Sorry for asking a question similar to my previous one. The difference this time is that the problem is in a zip archive: the Chinese encoding of the compressed files' names is not recognized, both when extracting the archive and when listing its contents:

$ unzip -l "严蔚敏数据结构(c语言版)教材及答案.zip"
Archive:  严蔚敏数据结构(c语言版)教材及答案.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    25600  2000-01-04 23:27   ?+?+i- ??-?.doc
    80896  2000-01-04 23:27   ?+??i- -+.doc
    41984  2000-01-04 23:27   ?++?i- i+????-?.doc
    52224  2000-01-04 23:27   ?+?+i- ??i?.doc
    50688  2000-01-04 23:27   ?+??i- ??????.doc
    54272  2000-01-04 23:27   ?++?i- -????-??????.doc
    26112  2000-01-04 23:27   ?+?-i- ?????????_+?.doc
    76288  2000-01-04 23:27   ?+-?i- -??-????-?.doc
    53760  2000-01-04 23:27   ?+-?i- -+?+++?=.doc
    53760  2000-01-04 23:27   ?+--i- ??.doc
  7929077  2009-02-26 22:49   -???????+C????+??+?+?+pdf.pdf
---------                     -------
  8444661                     11 files

I was wondering how to deal with this problem?

Thanks and regards!

update:

I have uploaded this zip archive, and it can be downloaded from http://www.mediafire.com/?dw87ee72m56evy9

I tried to use chardet to determine the encoding of the names of the compressed files by:

$ unzip -l "严蔚敏数据结构(c语言版)教材及答案.zip" | chardet
<stdin>: utf-8 (confidence: 0.99)

But are the file names really encoded in UTF-8? Aren't they supposed to be in a foreign encoding? I guess the output of unzip -l contains too much besides the names; how can I single out only the filenames as input to chardet?
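One way to isolate the names is to strip the header, the trailer, and the first three columns from the listing. The sketch below assumes the exact layout shown above (three header lines, two trailer lines); `list_names` is a hypothetical helper name, not part of unzip. Note that the "Archive:" header line contains the UTF-8 name of the zip itself, which probably explains the utf-8 verdict, and that if unzip has already replaced undecodable bytes with `?`, chardet will only see ASCII in the names. Some builds of unzip may also offer a zipinfo mode (`unzip -Z1 archive.zip`) that prints names only.

```shell
# Hypothetical helper: keep only the Name column of `unzip -l` output.
# Assumes 3 header lines (Archive, column titles, dashes) and
# 2 trailer lines (dashes, totals), as in the listing above.
list_names() {
  sed '1,3d' |        # drop the header lines
    sed '$d' |        # drop the totals line
    sed '$d' |        # drop the dashed separator above it
    sed -E 's/^ *[0-9]+ +[0-9-]+ +[0-9:]+ +//'   # drop Length, Date, Time
}

# Usage (not run here):
#   unzip -l archive.zip | list_names | chardet
```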

I don't know. It can be GB 2312, GBK, GB 18030. I will guess it is GB 2312, because it is the most popular? — Tim, Jun 10 '11 at 22:06

score 8 · Answer 1 · edited Dec 18 '15 at 04:51

Try:

unzip -O cp936 "严蔚敏数据结构(c语言版)教材及答案.zip"

Answered Dec 17 '15 at 21:30 by ChandlerQ; edited Dec 18 '15 at 04:51 by snoop.

Worked for me with Cyrillic characters in an archive from Mac OS X this way:
```
unzip -O utf-16 "archive.zip"
```
– Andrew Sklyarevsky Jul 14 '16 at 13:51
Works with SHIFT-JIS (basically all ZIP files made on Windows in Japan) too: unzip -O shift-jis the.zip – Nicolas Raoul Jan 23 '17 at 08:43
Why does my unzip have no -O option? It is UnZip 6.00 of 20 April 2009, and the option is unavailable on both Ubuntu 16.04 and Manjaro 18.04 here. – CodyChan Apr 29 '19 at 02:57

enzotib · Answer 2 · 2011-06-11T09:09:16.330

I would extract the files, then do a

ls | chardet

to see what it says.

Also, you could try different encodings with

ls | iconv -f GB2312

for example. You can list the encodings known to iconv with iconv -l.

Once you have determined the encoding (let's suppose it is GB2312), you should rename the files to convert their names to UTF-8:

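# Re-encode each filename from GB2312 to UTF-8 (here-strings need bash)
for f in *; do
  g="$(iconv -f GB2312 -t UTF-8 <<<"$f")" || continue  # skip names that fail to convert
  [ "$f" = "$g" ] || mv -- "$f" "$g"                   # nothing to do if unchanged
done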
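Before renaming anything, it may be worth previewing the result. The following is a minimal sketch of a dry run under the same GB2312 assumption; `preview_renames` is a hypothetical helper, and it only prints the old and new names without touching any file.

```shell
# Hypothetical helper: preview_renames ENCODING NAME...
# Prints "old -> new" for each name; nothing is renamed.
preview_renames() {
  enc="$1"; shift
  for f in "$@"; do
    # iconv exits non-zero when the bytes are not valid in $enc
    if g="$(printf '%s' "$f" | iconv -f "$enc" -t UTF-8 2>/dev/null)"; then
      printf '%s -> %s\n' "$f" "$g"
    else
      printf '%s -> (not valid %s)\n' "$f" "$enc"
    fi
  done
}

# Usage (not run here):
#   preview_renames GB2312 *
```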

EDIT

I tried a brute-force attack on your zip file, converting the names from every encoding iconv knows, but none of the results looks plausible to me:

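#!/bin/bash
# Decode the directory listing under every encoding iconv knows about.

iconv -l |
  sed 's|//$||' |                        # strip the trailing "//" from each name
  while read -r enc; do
    printf '\n --- %s ---\n\n' "$enc"    # pass $enc as data, not as a format string
    ls | iconv -cf "$enc" 2>/dev/null    # -c drops characters that cannot be converted
  done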
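Since `-c` silently drops invalid bytes, every encoding produces some output in the loop above. A stricter variant, sketched below, omits `-c` and keeps only the encodings under which a name decodes without error, which tends to shorten the list of candidates. `try_encodings` is a hypothetical helper name, and the list of encodings passed to it is up to the caller.

```shell
# Hypothetical helper: try_encodings NAME ENCODING...
# Prints "ENCODING: decoded name" for each encoding under which NAME
# is a valid byte sequence; encodings that fail are skipped.
try_encodings() {
  name="$1"; shift
  for enc in "$@"; do
    if out="$(printf '%s' "$name" | iconv -f "$enc" -t UTF-8 2>/dev/null)"; then
      printf '%s: %s\n' "$enc" "$out"
    fi
  done
}

# Usage (not run here):
#   for f in *; do try_encodings "$f" GB18030 GBK BIG5 SHIFT_JIS EUC-KR; done
```

A clean decode is only a necessary condition: several multi-byte encodings can accept the same bytes, so the surviving candidates still need to be judged by eye.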

Did not work for me, chardet was wrongly thinking that the extracted files were UTF-8. — Nicolas Raoul, Jan 23 '17 at 08:52

Daniel Kenzelmann · Answer 3 · 2015-01-15T16:07:28.090

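Usually the filenames get misinterpreted as a Western charset (for example ISO-8859-1) and are then stored as UTF-8. So you first have to convert the filenames from UTF-8 back to ISO-8859-1 to recover the original bytes, then interpret that byte stream as GB2312 and convert it to UTF-8, i.e.: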

ls | iconv -f UTF-8 -t ISO8859-1 | iconv -f GB2312 -t UTF-8

This does not work for your specific file, so you might want to find out how the file was created (what system, what program, what language etc.)
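The two-step conversion can be wrapped in a small function for experimenting with single names. This is only a sketch: `fix_mojibake` is a hypothetical name, and the second argument (the guessed original encoding, GB2312 by default) is an assumption to vary. If the names were wrongly decoded as Windows-1252 rather than ISO-8859-1, the first iconv step would need that charset instead.

```shell
# Hypothetical helper: fix_mojibake NAME [ORIGINAL_ENCODING]
# Step 1: UTF-8 -> ISO-8859-1 recovers the original bytes.
# Step 2: reinterpret those bytes as the guessed encoding and emit UTF-8.
fix_mojibake() {
  printf '%s' "$1" |
    iconv -f UTF-8 -t ISO-8859-1 |
    iconv -f "${2:-GB2312}" -t UTF-8
}

# Usage (not run here):
#   fix_mojibake 'ÖÐÎÄ'
```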

See also http://en.wikipedia.org/wiki/Mojibake

jemin · Answer 4 · 2015-07-23T09:10:45.747

You will need iconv; convmv and cconv are optional.

Step 1: find the correct chain of character-encoding conversions.
Step 2: rename the files with a shell script.

Sometimes one link in the conversion chain is a wrong character encoding. You have to track it down, for example with the brute-force approach in enzotib's post.

For example, take a file named "冼极.otf" on a UTF-8 file system.

touch 冼极.otf

I have to run the following to get its correct name, "宋体.otf".

convmv --notest -f utf8 -t cp950 *.otf
convmv --notest -f cp936 -t utf8 *.otf

Once the file has its correct name, you may want to convert between simplified and traditional Chinese with cconv, as in the shell script below. In my case the final name is "宋體.otf".

#!/bin/sh
# Preview simplified-to-traditional renames; nothing is moved until the
# two commented lines near the end are enabled.
mkdir -p TW
for filename in *; do [ -d "$filename" ] || printf '%s\n' "$filename"; done |
    while IFS= read -r filename; do
    filename_TW=$(printf '%s\n' "$filename" | cconv -f UTF8-CN -t UTF8-TW)
    printf '\n --- %s %s ---\n\n' "$filename" "$filename_TW"
    # uncomment the lines below once you have confirmed the names
    #mv "$filename" "TW/$filename_TW"
    #touch "$filename"
    done

Here is another example, related to Daniel's post: a file named "Èý¹úÖ¾.txt" on a UTF-8 file system.

touch Èý¹úÖ¾.txt

After some trials, I find that its correct simplified Chinese name is "三国志.txt" by running

ls | iconv -f utf-8 -t iso-8859-1 | iconv -f cp936 -t utf-8

Then I rename it to the traditional Chinese name "三國志.txt" with

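#!/bin/sh
# Rename each file to its repaired traditional-Chinese name and leave an
# empty placeholder carrying the old name in BACKUP/.
mkdir -p BACKUP
for filename in *; do [ -d "$filename" ] || printf '%s\n' "$filename"; done |
    while IFS= read -r filename; do
    filename_TW=$(printf '%s\n' "$filename" | iconv -f utf-8 -t iso-8859-1 | iconv -f cp936 -t utf-8 | cconv -f UTF8-CN -t UTF8-TW)
    [ -n "$filename_TW" ] || continue   # skip names the chain could not convert
    mv "$filename" "$filename_TW"
    touch "BACKUP/$filename"
    done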

Fin
