Why Is The Euro Symbol Visible In My application But Not In Oracle SQLdeveloper When Using An UTF8/AL32UTF8 NLS_CHARACTERSET Database ?

APPLIES TO:

Oracle Database - Standard Edition - Version 11.1.0.6 and later

Information in this document applies to any platform.

GOAL

You are using an user application and there are problems with some symbols like € symbol when selecting the data in other tools like sqlplus or Oracle Sqldeveloper but the symbol is seen in the application.

When selecting € symbol inserted trough sqlplus or Oracle Sqldeveloper this is not visible in the application.

The database is an AL32UTF8 NLS_CHARACTERSET Database so this fully supports the € symbol.

When checking the actual codes inserted trough the application you notice the code stored is not correct.

(Using query: select dump(<column>,16) from table where......)

Output: Typ=1 Len=2: c2,80

Inserting the € Symbol with SQLDeveloper shows the following:

Output:Typ=1 Len=3: e2,82,ac

SOLUTION

The € is in UTF8 /AL32UTF8 ( = Unicode UTF-8 -> the - in between is the "official" Unicode name) code e2,82,ac ( U+20AC )

The c2,80 code is basically a "save guard" this is a mapping that is for example is defined in almost any vendor ISO-8859-1 encoding ( WE8ISO8859P1 in oracle)

It's not actually defined as the € , all codes in ISO-8859-1 between 80 and 9F are mapped to the - formally not defined as a printable character - C2,80 ( U+0080) to c2,9F ( U+009F) in Unicode.

If you see c2,80 coming from an application and it's displayed in that application as € then this means there is somewhere WE8MSWIN1252 input (= € code 80) that is "defined" as ISO-8859-1 (= for example HTML without characterset definition - default is ISO-8859-1 but the user enters € in a tekst box) and then converted to Unicode.

You then end up with U+0080 ( c2,80 in UTF-8) in the database.

Simply said, somewhere on the application side some input is not properly defined.

This is most likely seen with web applications but can also be seen in other setups.

Normally the expected end result of this ISO-8859-1 to Unicode conversion for code 80 to 9F is the Unicode replacement character � U+FFFD seen the code 80 to 9F are not defined in ISO-8859-1.

Howoever seen most clients are actually windows clients - who are MSWIN1252 clients (Note 341676.1 Difference between WE8MSWIN1252 and WE8ISO8859P1 characterset ) - and the default encoding of html (= encoding not defined) is ISO-8859-1 this means data loss for all the characters not defined in ISO-8859-1 but in 1252 ( €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ -> Note 341676.1 Difference between WE8MSWIN1252 and WE8ISO8859P1 characterset )

and no way to "reverse" this after the conversion seen they all map to the same Unicode U+FFFD character.

With these "extra" mapping you actually can "reverse" the incorrect ISO-8859-1 to Unicode conversion, seen each has a unique Unicode codepoint.

Note that the actual Unicode codepoint for this "rescue mapping" is NOT defined as that character.

The solution here is to correct the application side to use proper characterset defintions during input.

Existing data in the database can also be corrected, for example all "not correctly stored € symbols" can then be "rescued" by doing an update using REPLACE(column,UNISTR('\0080'),UNISTR('\20AC')) .

Full translate example of "correcting" all W8MSWIN1252 characters who where *declared* as WE8ISO8859P1 on a client side and then stored in an UTF8 or AL32UTF8 database:

translate(<column>,

unistr('\0080')||unistr('\0082')||unistr('\0083')||unistr('\0084')||unistr('\0085')||unistr('\0086')||unistr('\0087')||unistr('\0088')||unistr('\0089')||unistr('\008A')||unistr('\008B')||unistr('\008C')||unistr('\008E')||unistr('\0091')||unistr('\0092')||unistr('\0093')||unistr('\0094')||unistr('\0095')||unistr('\0096')||unistr('\0097')||unistr('\0098')||unistr('\0099')||unistr('\009A')||unistr('\009B')||unistr('\009C')||unistr('\009E')||unistr('\009F'),

unistr('\20AC')||unistr('\201A')||unistr('\0192')||unistr('\201E')||unistr('\2026')||unistr('\2020')||unistr('\2021')||unistr('\02C6')||unistr('\2030')||unistr('\0160')||unistr('\2039')||unistr('\0152')||unistr('\017D')||unistr('\2018')||unistr('\2019')||unistr('\201C')||unistr('\201D')||unistr('\2022')||unistr('\2013')||unistr('\2014')||unistr('\02DC')||unistr('\2122')||unistr('\0161')||unistr('\203A')||unistr('\0153')||unistr('\017E')||unistr('\0178'))

This document has been taken from Oracle Doc ID 1928949.1