Extracting the base character from a character marked with a diacritic

Ever wondered how you can translate European national characters with diacritics such as å, ä, ö, ó etc into their base characters (a and o)? This can be useful for example when constructing filenames from user-given strings.

This is one way to do it:

string s = "ö";
string normalizedString = s.Normalize(NormalizationForm.FormD);

Console.WriteLine("Composite character: " + s[0]);
if (normalizedString.Length > 1)
{
   Console.WriteLine("Base character: " + normalizedString[0]);
   Console.WriteLine("Diacritic character: " + normalizedString[1]);
}

The result will be:

Composite character: ö
Base character: o
Diacritic character: ¨

Obviously, the key is the String.Normalize function. When we pass it NormalizationForm.FormD as the requested Unicode normalization form, it will separate all composite characters into their constituents. If the char is not a composite, then nothing will happen to it.

Note that the resulting string will be longer that the original if characters were separated. If needed it’s easy to iterate over the characters to filter our non-letters using conditions such as

if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.LowercaseLetter ||
    CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.UppercaseLetter)
{
   ...
}

Happy character-mangling!

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.

This site uses Akismet to reduce spam. Learn how your comment data is processed.