The Encoding saga for English and non-English languages
There are several ways to understand how encoding works. The approach we will take today is to look at data conversion and storage in terms of bytes. Since most of our communication happens in text (strings), we will look at string -> byte[] conversion. This byte array can later be transferred over a network or stored in a data store.
Converting a string to a byte (binary) array has always been challenging. There is no easy solution if your application uses a locale other than English, and it becomes even more complex when you save the byte array to a database with a different character set.
So this article deals with handling these encodings.
Do you need encoding for conversion of English text?
Let’s answer this question with the help of an example:

    using System;

    class Program
    {
        private static void Main(string[] args)
        {
            // Verbatim string: \t stays as two literal characters, not a tab.
            string sample = @"this is a \t string in unicode format";
            byte[] bytes = GetBytes(sample);
            string convertedBack = GetString(bytes);
            Console.WriteLine(bytes.Length + " >> " + convertedBack);
        }

        // Copies the string's characters (2 bytes each) straight into a byte array.
        static byte[] GetBytes(string str)
        {
            var bytes = new byte[str.Length * sizeof(char)];
            Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
            return bytes;
        }

        // Reverses GetBytes: reinterprets every 2 bytes as one character.
        static string GetString(byte[] bytes)
        {
            var chars = new char[bytes.Length / sizeof(char)];
            Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
            return new string(chars);
        }
    }

When you execute this code, it produces the following output:
74 >> this is a \t string in unicode format
So this code converts a 37-character string into a 74-byte array, because it copies the string's in-memory representation, which uses two bytes per character. For small hobby applications this method works fine, but it is not the optimal way of doing it. This is where the right encoding comes into the picture.
Now let’s write some more code that uses Encoding to convert a string:

    public static byte[] GetBytesWithEncoding(Encoding encoding, string str)
    {
        return encoding.GetBytes(str);
    }

    public static string GetStringWithEncoding(Encoding encoding, byte[] bytes)
    {
        return encoding.GetString(bytes);
    }

When we pass different encoding objects to these methods, we get the following results:
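To see where the factor of two comes from, here is a minimal sketch that compares the raw copy with .NET's UTF-16 encoder. The class and method names (`RawCopyDemo`, `GetBytesRaw`, `MatchesUtf16`) are made up for illustration, and the comparison assumes a little-endian machine, which is the usual case on desktop and server hardware.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RawCopyDemo
{
    // Copies the in-memory UTF-16 code units of the string, two bytes per char.
    public static byte[] GetBytesRaw(string str)
    {
        var bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // True when the raw copy is byte-for-byte what Encoding.Unicode (UTF-16LE) produces.
    // Assumption: little-endian hardware; on big-endian hardware the byte order would differ.
    public static bool MatchesUtf16(string str)
    {
        return GetBytesRaw(str).SequenceEqual(Encoding.Unicode.GetBytes(str));
    }
}
```

In other words, the hand-written copy is effectively a UTF-16 encoding without saying so, which is why every English character costs two bytes.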
Unicode (UTF-7) >> 41 >> this is a \t string in unicode format
Unicode (UTF-8) >> 37 >> this is a \t string in unicode format
US-ASCII >> 37 >> this is a \t string in unicode format
Unicode >> 74 >> this is a \t string in unicode format
Unicode (UTF-32) >> 148 >> this is a \t string in unicode format
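So the size of the byte array depends on the encoding you select. For this plain English text, UTF-8 and US-ASCII need one byte per character, Unicode (UTF-16) needs two, and UTF-32 needs four. When building enterprise applications it is important to keep the memory footprint small, and choosing a compact encoding helps.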
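A comparison like the one above could be produced with a small loop such as the following sketch. The class `EncodingSizes` and its methods are hypothetical names, not part of the original code. UTF-7 is left out here because `Encoding.UTF7` is marked obsolete in recent .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingSizes
{
    // Builds one "name >> byte count >> round-tripped text" line per encoding.
    public static List<string> Measure(string text, IEnumerable<Encoding> encodings)
    {
        var lines = new List<string>();
        foreach (Encoding encoding in encodings)
        {
            byte[] bytes = encoding.GetBytes(text);
            string convertedBack = encoding.GetString(bytes);
            lines.Add(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);
        }
        return lines;
    }

    // Number of bytes the text would occupy in the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

Passing the sample string together with `Encoding.UTF8`, `Encoding.ASCII`, `Encoding.Unicode` and `Encoding.UTF32` to `Measure` should reproduce the 37, 37, 74 and 148 byte counts listed above.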
Encoding for non-English text
If you are using a non-English locale on your application machine or server, the default encodings will not be helpful. You may face several issues converting surrogate characters or language-specific characters when applying a default encoding.
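The kind of problem described here can be made visible with a short sketch. The class `LossyEncodingDemo` and its methods are illustrative names only. `RoundTrip` uses the encoding's default behaviour, which silently substitutes characters it cannot represent, while `CanEncode` asks .NET for a strict variant of the encoding that throws instead.

```csharp
using System;
using System.Text;

public static class LossyEncodingDemo
{
    // Encodes and decodes with the same encoding, using its default fallback
    // (unsupported characters are silently replaced, typically with '?').
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    // Returns false when the named encoding cannot represent the text.
    // The exception fallbacks make the encoder throw instead of substituting.
    public static bool CanEncode(string text, string encodingName)
    {
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, `RoundTrip("café", Encoding.ASCII)` comes back as `caf?`, and `CanEncode` reports `false` for Hindi text with `"us-ascii"` but `true` with `"utf-8"`. A check like this before writing to a data store is one way to catch data loss early.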
The best solution is to find out which encoding your data store supports. If your data store is:
- Database – find out the character set supported.
- In-memory – you do not need to worry.
- Flat-file – find out your system locale.
Once you have found the locale, you need to map it to an encoding. .NET supports about 140 encodings, and you can get the list with the small piece of code below or from MSDN.

    var encodings = Encoding.GetEncodings();
    foreach (var encodingInfo in encodings)
    {
        Debug.WriteLine(encodingInfo.DisplayName + " > " + encodingInfo.Name
            + "(" + encodingInfo.CodePage + ")");
    }

The next step is to create an encoding object with the right code page. Below is code that uses the Devanagari (Hindi) and Japanese encodings. I used Google Translate to convert the text ‘Welcome to Encoding saga’, so please pardon me if the translations are not correct.

    // Encoding - Hindi
    encoding = Encoding.GetEncoding(57002);
    bytes = GetBytesWithEncoding(encoding, @"एनकोडिंग सागा में आपका स्वागत है");
    convertedBack = GetStringWithEncoding(encoding, bytes);
    Console.WriteLine(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);

    // Encoding - Japanese
    encoding = Encoding.GetEncoding(932);
    bytes = GetBytesWithEncoding(encoding, @"エンコード佐賀へようこそ");
    convertedBack = GetStringWithEncoding(encoding, bytes);
    Console.WriteLine(encoding.EncodingName + " >> " + bytes.Length + " >> " + convertedBack);

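One caveat, offered as a sketch rather than as part of the original code: on .NET Core and .NET 5 or later, legacy code pages such as 932 and 57002 are not available until the code pages provider is registered, so `Encoding.GetEncoding(932)` may throw without it. The helper names below (`CodePageDemo`, `RegisterCodePages`, `RoundTrip`) are hypothetical.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Makes legacy code pages (for example 932 and 57002) available.
    // Assumption: running on .NET Core / .NET 5+, where they are not registered by default.
    public static void RegisterCodePages()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes the text with the given code page and decodes it again.
    public static string RoundTrip(int codePage, string text)
    {
        Encoding encoding = Encoding.GetEncoding(codePage);
        return encoding.GetString(encoding.GetBytes(text));
    }
}
```

Calling `CodePageDemo.RegisterCodePages()` once at application start-up is enough; after that, lookups by code page number behave as in the snippet above.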
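If you are running this on a system with an English locale, the convertedBack values shown on the screen will appear as ‘???’. This is most likely because the console's code page cannot display these characters; the byte arrays themselves are not affected.
I hope this helps you understand the importance of the right encoding in data conversion and storage.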
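If the question marks come from the console rather than from the conversion, switching the console's output encoding to UTF-8 before printing may help. This is a minimal sketch with a hypothetical helper name (`ConsoleSetup.UseUtf8Output`); whether the characters actually render also depends on the console font, which code cannot control.

```csharp
using System;
using System.Text;

public static class ConsoleSetup
{
    // Asks the console to emit UTF-8 so non-English text is not replaced with '?'.
    // The console font must still contain glyphs for the characters being printed.
    public static void UseUtf8Output()
    {
        Console.OutputEncoding = Encoding.UTF8;
    }
}
```

Call `ConsoleSetup.UseUtf8Output()` before the first `Console.WriteLine` that prints the converted text.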