Convert any char-set to UTF-8

The forum does not accept the word “charset” in the title, that's why I wrote “char-set”…

I wrote my first shell script (.sh) :smile: and it worked great (not so much written as copied and pasted). I had HTML pages in Chinese (charset gb18030) that I needed to convert to UTF-8.
Changing the meta tag to UTF-8 did nothing, so I needed a more effective solution.
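
Changing the meta tag only changes what the page claims to be; the bytes on disk stay in the old encoding. Here is a rough sketch of one way to check what a file really contains, using the same iconv tool as the script further down: iconv fails on bytes that are not valid in the source encoding you give it, so a UTF-8 to UTF-8 pass works as a validity test. The helper name is_utf8 and the two demo files are made up for the example.

```shell
#!/bin/bash
# is_utf8 is a hypothetical helper name, not part of iconv.
# iconv exits with an error on bytes that are invalid in the source
# encoding, so a UTF-8 to UTF-8 pass acts as a validity check.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Demo in a throwaway directory: the characters 中文 written twice,
# once as GB18030 bytes and once as UTF-8 bytes (octal escapes).
demo_dir=$(mktemp -d)
printf '\326\320\316\304' > "$demo_dir/gb.html"
printf '\344\270\255\346\226\207' > "$demo_dir/utf8.html"

for f in "$demo_dir"/*.html; do
    if is_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: not UTF-8, needs converting"
    fi
done
```

A file that passes is not guaranteed to display correctly, but a file that fails is definitely not UTF-8, whatever its meta tag says.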

After a few searches on the internet, I found exactly what I needed: How to Convert Files to UTF-8 Encoding in Linux

No extra software needed!
You need to create a .sh file and use the Terminal (no worries, like I said it’s mostly a matter of copy and paste).

Open the directory containing the files you want to convert in bulk (HTML or txt or…).
Then right click / Create Document / Empty File, name it converter.sh (converter dot sh), open it with pluma, then copy and paste the code below into it:

#!/bin/bash
# Input encoding: GB18030 is the Chinese charset I converted; enter the charset YOU need to convert instead
FROM_ENCODING="GB18030"
# Output encoding: UTF-8
TO_ENCODING="UTF-8"
# Loop over every HTML file in the current directory
for file in *.html; do
    # The new extension is .utf8.converted so I can check that the job was done correctly before renaming the files to .html
    iconv -f "$FROM_ENCODING" -t "$TO_ENCODING" "$file" -o "${file%.html}.utf8.converted"
done
exit 0

Then save converter.sh after making your own modifications (for example, if your files are .txt, change every .html to .txt in the code above).
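
If you would rather not edit the script each time, here is a minimal sketch of the same loop with the encoding and the extension passed as arguments. The function name convert_dir is hypothetical, and it writes the output with a plain redirection instead of -o; otherwise it follows the script above.

```shell
#!/bin/bash
# convert_dir is a hypothetical helper:
#   $1 = source encoding (for example GB18030)
#   $2 = file extension without the dot (for example html or txt)
convert_dir() {
    from_encoding=$1
    ext=$2
    for file in *."$ext"; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$file" ] || continue
        out="${file%.*}.utf8.converted"
        if iconv -f "$from_encoding" -t UTF-8 "$file" > "$out"; then
            echo "converted: $file"
        else
            # Remove the partial output so a failed file is easy to spot.
            echo "FAILED: $file" >&2
            rm -f "$out"
        fi
    done
}
```

After defining the function, running `convert_dir GB18030 txt` inside the directory would convert all the .txt files there and leave the originals untouched.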
Now open a terminal (CTRL+ALT+T) and go to the folder where you need to run converter.sh with

cd path/to/directory/to/convert

(This is where this new extension will be VERY useful :laughing: )

Once you are in the right directory, you need to make converter.sh executable. In the terminal, enter:

chmod +x converter.sh

Then run it (still in the terminal: copy the line below and press CTRL+SHIFT+V to paste it):

./converter.sh

It’s very fast, and you now have your files with the same names but with the extension “.utf8.converted”. Open the files to check that everything was converted correctly, then move all your old files to a back-up directory.
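
For many files, part of that check and the back-up can be scripted. The sketch below is only a safety net, not a replacement for opening a few files yourself: it moves an original .html into a directory only when its converted copy is non-empty and valid UTF-8. The function name check_and_backup and the directory name backup are my own assumptions.

```shell
#!/bin/bash
# check_and_backup is a hypothetical helper; "backup" is an assumed
# directory name. An original .html is moved only when its converted
# copy is non-empty and passes a UTF-8 validity check.
check_and_backup() {
    mkdir -p backup
    for new in *.utf8.converted; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$new" ] || continue
        old="${new%.utf8.converted}.html"
        if [ -s "$new" ] && iconv -f UTF-8 -t UTF-8 "$new" > /dev/null 2>&1; then
            if [ -e "$old" ]; then
                mv -- "$old" backup/
            fi
        else
            echo "check by hand: $new" >&2
        fi
    done
}
```

This cannot tell whether the right source encoding was used, since a wrong guess can still produce valid UTF-8 full of wrong characters, so a quick look at the text is still needed.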

Now we have to rename all the extensions at once with the terminal (still very easy).
Basically it’s:

rename 's/oldExtension/newExtension/' *.oldExtension

Which is in my case:

rename 's/utf8.converted/html/' *.utf8.converted

To get them all renamed as html files.
WARNING:
If part of a file name matches the extension, better take a look >> here (a solution with more slashes and backslashes), as you do not want to change the name itself (in my case there was no chance that names and extensions could match, so it was OK, and I am lazy).
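
One way around that problem without any regular expression is a plain loop with mv, sketched below. The shell's ${f%suffix} expansion removes the suffix only from the end of the name, so a match elsewhere in the name is left alone. The function name rename_converted is made up for the example.

```shell
#!/bin/bash
# rename_converted is a hypothetical helper name.
# ${f%.utf8.converted} strips the suffix only from the END of the name,
# so "utf8.converted-notes.utf8.converted" keeps its base name.
rename_converted() {
    for f in *.utf8.converted; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        target="${f%.utf8.converted}.html"
        if [ -e "$target" ]; then
            # Do not overwrite an original that was not moved away yet.
            echo "skipping $f: $target already exists" >&2
        else
            mv -- "$f" "$target"
        fi
    done
}
```

With the Perl-based rename shipped on Ubuntu, anchoring the pattern at the end of the name should give the same protection: rename 's/\.utf8\.converted$/.html/' *.utf8.converted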

I hope my explanations were not too messy.


Hello

Some may find this video explaining the history/development of UTF-8 interesting (it is done in easy-to-understand terms, in a cafe!):

https://youtu.be/MijmeoH9LT4?t=16

Very well explained. :slight_smile:


@alpinejohn Now I have a better understanding of how UTF-8 works, nice video!
Thanks a lot for sharing. :smiley:

For the rename part you can use the caja rename extension.

Hi @brokoli

Can you tell me more about that?
Where is it?
Do we need to install it? If so, is it compatible with U-M 16.04?

I'm very curious about it :grinning:

I’m not sure, but I'm afraid it appeared in 17.10, sorry. I would advise you to update or reinstall (that’s what I did) to 18.04.
You just need to select multiple files, open the context menu (right mouse button) and click on Rename.
UPD: I just googled and found it in this repository https://launchpad.net/~robert-tari/+archive/ubuntu/main

Thank you @brokoli, that’s a nice UI addition! And it works on U-M 16.04.
For those who are interested, here are the commands to run in the Terminal:

sudo add-apt-repository ppa:robert-tari/main 
sudo apt-get update 
sudo apt-get install cajarename 

Then, to reload caja so that a second > Rename… < entry appears in the context menu, type in the Terminal:

caja -q

Then have fun renaming multiple files with interesting options.