Oracle Database - Standard
Edition - Version 11.1.0.6 and later
Information in this document applies to any
platform.
You are using an
user application and there are problems with some symbols like € symbol when
selecting the data in other tools like sqlplus or Oracle Sqldeveloper but the
symbol is seen in the application.
When selecting € symbol inserted trough sqlplus or Oracle Sqldeveloper
this is not visible in the application.
The database is an AL32UTF8 NLS_CHARACTERSET Database so this fully supports
the € symbol.
When checking the actual codes inserted trough the application you notice the
code stored is not correct.
(Using query: select dump(<column>,16) from table where......)
Output: Typ=1 Len=2: c2,80
Inserting the € Symbol with SQLDeveloper shows the following:
Output:Typ=1 Len=3: e2,82,ac
The € is in UTF8 /AL32UTF8 ( = Unicode UTF-8 -> the - in between is the
"official" Unicode name) code e2,82,ac ( U+20AC )
The c2,80 code is basically a "save guard" this is a mapping that is
for example is defined in almost any vendor ISO-8859-1 encoding ( WE8ISO8859P1
in oracle)
It's not actually defined as the € , all codes in ISO-8859-1 between 80 and 9F
are mapped to the - formally not defined as a printable character - C2,80 (
U+0080) to c2,9F ( U+009F) in Unicode.
If you see c2,80 coming from an application and it's displayed in that
application as € then this means there is somewhere WE8MSWIN1252 input (= €
code 80) that is "defined" as ISO-8859-1 (= for example HTML without
characterset definition - default is ISO-8859-1 but the user enters € in a
tekst box) and then converted to Unicode.
You then end up with U+0080 ( c2,80 in UTF-8) in the database.
Simply said, somewhere on the application side some input is not properly
defined.
This is most likely seen with web applications but can also be
seen in other setups.
Normally the expected end result of this ISO-8859-1 to Unicode conversion for
code 80 to 9F is the Unicode replacement character � U+FFFD seen the code 80 to 9F are not defined in ISO-8859-1.
Howoever seen most clients are actually windows clients - who are MSWIN1252
clients (Note 341676.1 Difference between WE8MSWIN1252 and WE8ISO8859P1 characterset ) - and
the default encoding of html (= encoding not defined) is ISO-8859-1 this means
data loss for all the characters not defined in ISO-8859-1 but in 1252 (
€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ -> Note 341676.1 Difference between WE8MSWIN1252 and WE8ISO8859P1 characterset )
and no way to "reverse" this after the conversion
seen they all map to the same Unicode U+FFFD character.
With these "extra" mapping you actually can "reverse" the
incorrect ISO-8859-1 to Unicode conversion, seen each has a unique Unicode
codepoint.
Note that the
actual Unicode codepoint for this "rescue mapping" is NOT defined as
that character.
The solution here
is to correct the application side to use proper characterset
defintions during input.
Existing data in the database can also be corrected, for example all "not
correctly stored € symbols" can then be "rescued" by doing an
update using REPLACE(column,UNISTR('\0080'),UNISTR('\20AC')) .
Full translate
example of "correcting" all W8MSWIN1252 characters who where
*declared* as WE8ISO8859P1 on a client side and then stored in an UTF8 or
AL32UTF8 database:
translate(<column>,
unistr('\0080')||unistr('\0082')||unistr('\0083')||unistr('\0084')||unistr('\0085')||unistr('\0086')||unistr('\0087')||unistr('\0088')||unistr('\0089')||unistr('\008A')||unistr('\008B')||unistr('\008C')||unistr('\008E')||unistr('\0091')||unistr('\0092')||unistr('\0093')||unistr('\0094')||unistr('\0095')||unistr('\0096')||unistr('\0097')||unistr('\0098')||unistr('\0099')||unistr('\009A')||unistr('\009B')||unistr('\009C')||unistr('\009E')||unistr('\009F'),
unistr('\20AC')||unistr('\201A')||unistr('\0192')||unistr('\201E')||unistr('\2026')||unistr('\2020')||unistr('\2021')||unistr('\02C6')||unistr('\2030')||unistr('\0160')||unistr('\2039')||unistr('\0152')||unistr('\017D')||unistr('\2018')||unistr('\2019')||unistr('\201C')||unistr('\201D')||unistr('\2022')||unistr('\2013')||unistr('\2014')||unistr('\02DC')||unistr('\2122')||unistr('\0161')||unistr('\203A')||unistr('\0153')||unistr('\017E')||unistr('\0178'))
This document has been taken from Oracle Doc ID 1928949.1