Monday, January 14, 2013

Changing the File Encoding in Unix

From time to time, Unix users have to deal with files from colleagues that arrive in an unexpected encoding, and this can cause a bit of trouble.

You may find yourself trying to write a shell script around a file, only to have cat spit out gibberish. Or, if you try something a bit more elaborate like grep or sed, it will not find any match. Huh? Chances are that the file is not encoded in ASCII (or an ASCII-compatible encoding such as UTF-8), so the standard Unix tools cannot make sense of its content.


$ less myfile.txt
"myfile.txt" may be a binary file.  See it anyway?
$ file myfile.txt
myfile.txt: Little-endian UTF-16 Unicode text, with CR, LF line terminators
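
To see why grep finds nothing in such a file, it helps to look at the raw bytes. The Python sketch below encodes a short, made-up sample string as little-endian UTF-16 (with a byte order mark, as many Windows tools write it) and compares it with the plain ASCII bytes. The sample text is only an illustration, not the content of any real file.

```python
import codecs

# Invented sample text with CR LF line terminators.
text = "hello world\r\n"

ascii_bytes = text.encode("ascii")
# Byte order mark followed by the little-endian UTF-16 encoding of the text.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(ascii_bytes)
print(utf16_bytes)

# A byte-level search for the ASCII word works on the ASCII data...
print(b"hello" in ascii_bytes)   # True
# ...but fails on the UTF-16 data, because every character takes two bytes
# and the second byte of each ASCII character is NUL.
print(b"hello" in utf16_bytes)   # False
print(utf16_bytes[2:12])         # b'h\x00e\x00l\x00l\x00o\x00'
```

Those interleaved NUL bytes are also the likely reason less suspects a binary file.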

At this point, there are two options. You can write a small script in your favourite language, or use the standard tools on your system instead. The result should be the same, because the options listed below all rely on the same iconv conversion functions.

If you want to program a bit, the following languages offer iconv bindings for character set conversion.

C
http://www.kernel.org/doc/man-pages/online/pages/man3/iconv.3.html

PHP
http://php.net/manual/en/book.iconv.php

Perl
http://search.cpan.org/dist/Text-Iconv/Iconv.pm

Python
http://pypi.python.org/pypi/iconv/1.0

Ruby
http://ruby-doc.org/stdlib-1.9.2/libdoc/iconv/rdoc/Iconv.html
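
As a rough idea of what such a small script can look like, here is a minimal Python sketch. Note that it uses Python's built-in codecs rather than the iconv binding linked above, so it is an alternative route to the same kind of conversion, not a demonstration of that package. The function name `convert_encoding` and the default encodings are assumptions chosen for illustration.

```python
def convert_encoding(src_path, dst_path, src_enc="utf-16", dst_enc="utf-8"):
    """Re-encode a text file (hypothetical helper, not part of any library)."""
    # newline="" disables newline translation, so CR LF terminators
    # are copied through unchanged.
    with open(src_path, "r", encoding=src_enc, newline="") as src, \
         open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        # Copy in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: src.read(64 * 1024), ""):
            dst.write(chunk)

# Example call (file names are placeholders):
# convert_encoding("oldfile", "newfile")
```

With the "utf-16" codec, Python reads the byte order mark at the start of the file to pick the byte order and does not copy it to the output.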


If you are writing a shell script, you can always use the iconv program. It takes the source encoding (-f), the target encoding (-t) and the input file, and writes the converted text to standard output.

$ iconv -f utf16 -t ascii oldfile > newfile
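
One caveat with an ASCII target: ASCII only covers 128 characters, so a full conversion can only succeed if the file contains nothing else. The Python sketch below illustrates the same limitation with Python's built-in codecs. It is an analogy, not a call into iconv, and the sample string is invented.

```python
# Invented sample containing one non-ASCII character (é).
sample = "café\r\n"

# A strict conversion to ASCII fails on the first unrepresentable character.
try:
    sample.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy conversion substitutes a placeholder for what ASCII cannot express.
lossy = sample.encode("ascii", errors="replace")

# UTF-8 can represent the text without loss.
utf8 = sample.encode("utf-8")

print(strict_ok)  # False
print(lossy)      # b'caf?\r\n'
print(utf8)       # b'caf\xc3\xa9\r\n'
```

If the file may contain such characters, a target such as UTF-8 (-t utf-8) is usually the safer choice.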