Charsets are a nightmare – Console, Java and Oracle Database

This week, an issue came up about utPLSQL-cli printing Russian characters used in tests as ?, no matter how the NLS_LANG or LC_ALL environment variables were set.

I’ll be honest: there are things I really enjoy about programming and open source work – charset issues are definitely not on that list.

Because I know I’ll need this again some day, I thought I’d write it down while it’s still fresh – and maybe it will be helpful for someone else, too. I’m by no means an expert on this topic and some of my conclusions might even be wrong – if so, don’t hesitate to reach out and give me a chance to learn something!

Let’s assume we have a database set up with the AL32UTF8 charset and some Russian Star Wars quotes stored in a table.
The database charset does not really matter here, but Oracle recommends AL32UTF8, and using it makes it much less likely that you hit charset-related bugs.

create table star_wars_quotes (
  id integer not null primary key,
  quote varchar2(4000 char)
);

insert into star_wars_quotes ( id, quote )
  values ( 1, 'да пребудет с вами Сила');

We also have a very simple Java program to select this quote:

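import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import oracle.jdbc.pool.OracleDataSource;

public class CharsetExample {

  public static void main( String[] args ) {

    try {
      // Show which charset the JVM uses by default, e.g. for System.out
      System.out.println("Current Charset: " + Charset.defaultCharset());

      OracleDataSource ds = new OracleDataSource();
      ds.setURL("jdbc:oracle:thin:user/pw@localhost:1521/ORCLPDB1");

      // try-with-resources closes connection, statement and result set
      try (Connection con = ds.getConnection()) {
        try (PreparedStatement stmt = con.prepareStatement(
                "select * from star_wars_quotes")) {

          try (ResultSet rs = stmt.executeQuery()) {
            while ( rs.next() ) {
              // Print every quote to the console
              System.out.println(rs.getString("QUOTE"));
            }
          }
        }
      }

    } catch ( Exception e ) {
      e.printStackTrace(System.out);
    }
  }
}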

What happens now if I run my Java application from a new PowerShell console on Windows? (Yes folks, I’m using Windows and I like it)

> java -jar charset-example.jar
Current Charset: windows-1252
?? ???????? ? ???? ????

We actually have a problem, so let’s see what we can do.

Controlling Client -> Database Charset

Long story short: You don’t do this with Java. Java stores strings internally as Unicode (UTF-16), and the JDBC driver takes care of converting between the database charset and these Unicode strings. Neither the database charset nor the default charset of the environment matters for this step.

You can easily check this with a debugger in your preferred IDE: the strings are shown in readable form. If not, they are either already corrupted in the database (which is more likely if you are not on the AL32UTF8 charset) or you have hit a bug (very rare, and even rarer on AL32UTF8).
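If no debugger is at hand, a small helper can make the same check from the console. The following is only a sketch and not part of the original program: the class name CodePointDump and its dump method are made up for illustration. It prints each character as its Unicode code point, so the output is plain ASCII and stays readable whatever charset the console uses.

```java
import java.util.StringJoiner;

class CodePointDump {

  // Formats every Unicode code point of the given string as U+XXXX
  static String dump( String s ) {
    StringJoiner joiner = new StringJoiner(" ");
    s.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
    return joiner.toString();
  }

  public static void main( String[] args ) {
    // The escapes spell "да", so the source file encoding cannot interfere
    System.out.println(dump("\u0434\u0430")); // prints: U+0434 U+0430
  }
}
```

Calling something like dump(rs.getString("QUOTE")) in the example program should then show values in the Cyrillic range (U+0400 to U+04FF) if the string arrived intact.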

Controlling Java -> Console Charset

This is where things get interesting. As we’ve seen, I’m currently on the Windows-1252 charset (codepage 1252). We might want the console to use UTF-8 instead, so we change its codepage:

> chcp 65001
Active code page: 65001.
> java -jar charset-example.jar
Current Charset: windows-1252
?? ???????? ? ???? ????

Too bad, no change at all.

Java doesn’t care which codepage your console runs on; it doesn’t even know. What it cares about is your system’s default codepage, which is Windows-1252 in our case.
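The question marks themselves can be reproduced without any console or database. The sketch below is my own illustration (the class LossyEncodingDemo and its roundTrip method are hypothetical names): it encodes a Cyrillic string with windows-1252, which has no Cyrillic letters, and decodes it again. It assumes the JDK ships the windows-1252 charset, which the usual JDK distributions do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {

  // Encodes the text with the given charset and decodes the bytes with the same charset
  static String roundTrip( String text, Charset charset ) {
    return new String(text.getBytes(charset), charset);
  }

  public static void main( String[] args ) {
    String quote = "\u0434\u0430"; // "да"

    // windows-1252 cannot represent Cyrillic, so each letter is replaced by '?'
    System.out.println(roundTrip(quote, Charset.forName("windows-1252"))); // prints: ??

    // UTF-8 can represent every Unicode character, so nothing is lost
    System.out.println(roundTrip(quote, StandardCharsets.UTF_8).equals(quote)); // prints: true
  }
}
```

So the string is replaced character by character at the moment Java encodes it for output, before the console ever sees it. That would explain why changing the console codepage alone made no difference.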

We can control the charset Java uses to output strings with the -Dfile.encoding parameter:

> java "-Dfile.encoding=UTF-8" -jar charset-example.jar
Current Charset: UTF-8
да пребудет с вами Сила

There we go!

So the solution is to bring the codepage of your console and the charset Java uses for output in line with each other.

Controlling the Java Charset without using a parameter

However, sometimes we’re not able to add a parameter to the program call. We might wonder whether we can control the default charset from inside our Java program.

The answer is: We can’t. (Well, we actually can in a very hacky way, but I wouldn’t recommend it, and it might no longer work in future Java versions.)

The charset is controlled and set by the JVM, which is started long before our main method is even hit.
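The default charset is fixed at startup, but an individual output stream can still be given an explicit charset. The sketch below shows that alternative; it is not what this post or utPLSQL-cli does, and the names Utf8Output and utf8Stream are made up for illustration. It only helps for output written through this stream, so code that prints via the plain System.out keeps using the default charset.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8Output {

  // Wraps any byte stream in a PrintStream that always encodes text as UTF-8
  static PrintStream utf8Stream( OutputStream target ) {
    try {
      return new PrintStream(target, true, StandardCharsets.UTF_8.name());
    } catch ( UnsupportedEncodingException e ) {
      // Should not happen: every Java platform is required to support UTF-8
      throw new IllegalStateException(e);
    }
  }

  public static void main( String[] args ) {
    // The wrapper encodes to UTF-8 bytes and hands them to System.out unchanged
    PrintStream out = utf8Stream(System.out);
    out.println("\u0434\u0430"); // "да", readable if the console is on codepage 65001
  }
}
```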

Fortunately, Java offers a solution for this kind of problem: an environment variable called JAVA_TOOL_OPTIONS.
We can set this environment variable to the JVM parameters we want, and the JVM will pick them up:

> $env:JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8"
> java -jar charset-example.jar
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Current Charset: UTF-8
да пребудет с вами Сила
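When it is unclear which settings a JVM actually picked up, a few lines of diagnostics can help. This is a minimal sketch with hypothetical names (EncodingDiagnostics, describe) that prints the file.encoding property, the resulting default charset and the JAVA_TOOL_OPTIONS variable as the running JVM sees them. The values depend entirely on the environment.

```java
import java.nio.charset.Charset;

class EncodingDiagnostics {

  // Collects the encoding-related settings visible to the running JVM
  static String describe() {
    return "file.encoding=" + System.getProperty("file.encoding")
        + ", defaultCharset=" + Charset.defaultCharset()
        + ", JAVA_TOOL_OPTIONS=" + System.getenv("JAVA_TOOL_OPTIONS");
  }

  public static void main( String[] args ) {
    // Prints "null" for JAVA_TOOL_OPTIONS if the variable is not set
    System.out.println(describe());
  }
}
```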

I hope this helps, it will definitely help me in the future.

One final note: Oracle’s NLS settings have nothing to do with the charset used here. They affect a lot, but not the way Java communicates with the database (at least if we use the OJDBC driver).
Maybe this is a topic for a separate post about things I don’t really like dealing with 😁
