8

Sorry for asking a question similar to my previous one. The difference this time is that the files are in a zip archive, and the Chinese characters in the names of the compressed files are not recognized, both when listing the contents of the archive and after extraction:

$ unzip -l "严蔚敏数据结构(c语言版)教材及答案.zip"
Archive:  严蔚敏数据结构(c语言版)教材及答案.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    25600  2000-01-04 23:27   ?+?+i- ??-?.doc
    80896  2000-01-04 23:27   ?+??i- -+.doc
    41984  2000-01-04 23:27   ?++?i- i+????-?.doc
    52224  2000-01-04 23:27   ?+?+i- ??i?.doc
    50688  2000-01-04 23:27   ?+??i- ??????.doc
    54272  2000-01-04 23:27   ?++?i- -????-??????.doc
    26112  2000-01-04 23:27   ?+?-i- ?????????_+?.doc
    76288  2000-01-04 23:27   ?+-?i- -??-????-?.doc
    53760  2000-01-04 23:27   ?+-?i- -+?+++?=.doc
    53760  2000-01-04 23:27   ?+--i- ??.doc
  7929077  2009-02-26 22:49   -???????+C????+??+?+?+pdf.pdf
---------                     -------
  8444661                     11 files

How can I deal with this problem?

Thanks and regards!


update:

I have uploaded this zip archive, and it can be downloaded from http://www.mediafire.com/?dw87ee72m56evy9


I tried to use chardet to determine the encoding of the names of the compressed files by:

$ unzip -l "严蔚敏数据结构(c语言版)教材及答案.zip" | chardet
<stdin>: utf-8 (confidence: 0.99)

But are the file names really encoded in UTF-8? Aren't they supposed to be in some other encoding? I guess the output of unzip -l contains too much besides the names. How can I single out only the file names in its output as input to chardet?

Tim

4 Answers

8

Try:

unzip -O cp936 "严蔚敏数据结构(c语言版)教材及答案.zip"
snoop
ChandlerQ
2

I would extract the files, then do a

ls | chardet

to see what it says.

Also, you could try different encodings with

ls | iconv -f GB2312

for example. You can list the encodings known to iconv with iconv -l.

Once you have determined the encoding (suppose it is GB2312), rename the files so that their names are converted to UTF-8:

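for f in *; do
  g="$(iconv -f GB2312 -t UTF-8 <<<"$f")"
  # Skip names that fail to convert or that do not change
  [ -n "$g" ] && [ "$g" != "$f" ] && mv -- "$f" "$g"
done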

EDIT

I tried a brute-force approach on your zip file, converting the names from every encoding known to iconv, but none of the results looks plausible to me:

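#!/bin/bash

# Reinterpret the extracted file names in every encoding iconv knows
iconv -l |
  sed 's|//$||' |
  while read -r enc; do
    # Pass the name as an argument, not as part of the format string
    printf '\n --- %s ---\n\n' "$enc"
    # -c drops characters that cannot be converted
    ls | iconv -c -f "$enc" 2>/dev/null
  done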
enzotib
0

Usually the file names have been misinterpreted as a Western charset. So you first have to convert them from UTF-8 back to ISO 8859-1, then interpret the resulting byte stream as GB2312 and convert that to UTF-8, i.e.:

ls | iconv -f UTF-8 -t ISO8859-1 | iconv -f GB2312 -t UTF-8

This does not work for your specific file, so you might want to find out how the file was created (which system, which program, which language, etc.).

See also http://en.wikipedia.org/wiki/Mojibake
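To see why this two-step conversion works in the ordinary case, it can help to reproduce the damage on a known name and then undo it. The sketch below is only an illustration: the function names garble and ungarble and the sample name 中文.txt are made up for this example, and it assumes your iconv knows both GB2312 and ISO-8859-1.

```shell
# Simulate the damage: the GB2312 bytes of a name are misread as
# ISO-8859-1 and then stored as UTF-8 (hypothetical helper name).
garble() {
  iconv -f UTF-8 -t GB2312 | iconv -f ISO-8859-1 -t UTF-8
}

# Undo it: recover the original bytes, then decode them as GB2312.
ungarble() {
  iconv -f UTF-8 -t ISO-8859-1 | iconv -f GB2312 -t UTF-8
}

printf '%s\n' '中文.txt' | garble             # prints the mojibake form
printf '%s\n' '中文.txt' | garble | ungarble  # prints the original name again
```

If ungarble fails with an illegal input sequence on a real name, the name was probably damaged by some other chain of conversions.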

0

You will need iconv, but convmv and cconv are optional.

Step 1: find the correct chain of character-encoding conversions.
Step 2: rename the files with a shell script.

Sometimes a wrong character encoding was applied somewhere in the conversion chain. You have to find out which one by trial, in the way shown in enzotib's post.

For example, take a file named "冼极.otf" on a UTF-8 file system.

touch 冼极.otf

I have to do the following to get its correct name, "宋体.otf".

convmv --notest -f utf8 -t cp950 *.otf
convmv --notest -f cp936 -t utf8 *.otf
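Step 1 can be partly automated. The following is a rough sketch, not a definitive tool: try_chains is a hypothetical helper, and the two candidate lists are only guesses at likely encodings, to be adjusted for your case. It converts a garbled UTF-8 name to each "wrong" charset, rereads the resulting bytes in each "real" charset, and prints only the combinations that iconv accepts. The sample name is an illustration, not one of the files above.

```shell
# try_chains NAME: print each two-step reinterpretation of NAME that iconv accepts.
# The body runs in a subshell, so its variables do not leak into the caller.
try_chains() (
  name=$1
  for wrong in ISO-8859-1 CP1252 CP437 CP950; do   # charset the bytes were misread as
    for real in CP936 CP950 GB18030; do            # charset the bytes were really in
      [ "$wrong" = "$real" ] && continue
      # Step back to the raw bytes; skip if NAME has characters outside $wrong.
      bytes=$(printf '%s' "$name" | iconv -f UTF-8 -t "$wrong" 2>/dev/null) || continue
      # Reread those bytes; skip if they are not valid in $real.
      fixed=$(printf '%s' "$bytes" | iconv -f "$real" -t UTF-8 2>/dev/null) || continue
      printf '%s -> %s: %s\n' "$wrong" "$real" "$fixed"
    done
  done
)

try_chains 'ÖÐÎÄ.txt'
```

Several chains may print the same plausible result; any of them can then be used for the rename in Step 2. You still have to judge by eye which output is a sensible name.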

Once the file has its correct name, you may want to convert it from simplified to traditional Chinese with cconv, as in the shell script below. In my case the final name is "宋體.otf".

#!/bin/sh
# Preview simplified -> traditional renames (works in sh and bash)
mkdir -p TW
for filename in *; do
    # Skip directories, including TW itself
    [ -d "$filename" ] && continue
    filename_TW=$(printf '%s\n' "$filename" | cconv -f UTF8-CN -t UTF8-TW)
    printf '\n --- %s %s ---\n\n' "$filename" "$filename_TW"
    # Uncomment the lines below once you have confirmed the names
    #mv -- "$filename" "TW/$filename_TW"
    #touch "$filename"
done

Here is another example, related to Daniel's post: a file named "Èý¹úÖ¾.txt" on a UTF-8 file system.

touch Èý¹úÖ¾.txt

After some trials, I found that its correct simplified Chinese name is "三国志.txt", using

ls | iconv -f utf-8 -t iso-8859-1 | iconv -f cp936 -t utf-8

Then I rename it to the traditional Chinese name "三國志.txt" with

#!/bin/sh
mkdir -p BACKUP
for filename in *; do
    # Skip directories, including BACKUP itself
    [ -d "$filename" ] && continue
    # Undo the mojibake, then convert simplified -> traditional
    filename_TW=$(printf '%s\n' "$filename" | iconv -f utf-8 -t iso-8859-1 | iconv -f cp936 -t utf-8 | cconv -f UTF8-CN -t UTF8-TW)
    mv -- "$filename" "$filename_TW"
    # Keep an empty file under the old name as a record
    touch "BACKUP/$filename"
done

Fin

jemin