Surrogates and Combining Characters in Java

Target

Build a new API for the String, StringBuffer and Character classes of the Java SDK that ensures that surrogates and combining characters are preserved (e.g. by never cutting a string in the middle of such a character).

Design ideas

Example

old:

String componentId = id;
int i = id.indexOf('_');
if (i >= 0) {
    componentId = id.substring(0, i);
}

new:

String componentId = Utf16Str.SplitBefore(id, '_');
if (componentId == null) {
    componentId = id;
}

Critical classes and methods

class Character: methods dealing with character properties

Example: 
boolean Character.isLetter( char c )

Requirement: 
In order to handle surrogate pairs properly, an interface for 32-bit characters (UTF-32 encoding) is required.

Solution Approach:
The ICU4J class UCharacter offers such a 32-bit interface and should be used instead of the JDK class Character.
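
The difference between the 16-bit and the 32-bit interface can be illustrated with a small sketch. To keep it runnable without an extra library, it uses the int (code point) overload of the JDK's Character.isLetter as a stand-in for the ICU4J call; the class name CodePointProps and the method countLetters are hypothetical and only serve the illustration.

```java
// Sketch only: CodePointProps is a hypothetical helper, not part of the JDK or ICU4J.
final class CodePointProps {

    // Counts letters by code point, so a surrogate pair is classified as one character.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // reads one or two 16-bit chars
            if (Character.isLetter(cp)) {    // int overload, stand-in for the ICU4J 32-bit interface
                letters++;
            }
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return letters;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D400 (MATHEMATICAL BOLD CAPITAL A), stored as a surrogate pair
        String s = "a\uD835\uDC00";
        // The char overload only sees a lone high surrogate, which is not a letter.
        System.out.println(Character.isLetter(s.charAt(1)));
        // The code point loop sees two letters.
        System.out.println(countLetters(s));
    }
}
```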

class String/StringBuffer: extract single characters from string

Example: 
char String.charAt( int index )

Requirement:
A 16-bit return value is problematic if the 16-bit value is only one half of a surrogate pair or one part of a combining character sequence.

Solution Approach (depending on the programming context):

These alternatives can be offered via static methods and/or via a character iterator class.
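
As a rough sketch of what such a static method and an iterator-style access could look like, the following uses only the code point methods of the JDK. The class CodePoints and both method names are hypothetical. The sketch covers surrogate pairs only; combining character sequences would additionally need grapheme boundary analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: CodePoints is a hypothetical helper class.
final class CodePoints {

    // Returns the complete code point that contains the char at 'index',
    // even if 'index' points at the low (second) half of a surrogate pair.
    static int codePointContaining(String s, int index) {
        if (index > 0
                && Character.isLowSurrogate(s.charAt(index))
                && Character.isHighSurrogate(s.charAt(index - 1))) {
            index--;                         // step back to the start of the pair
        }
        return s.codePointAt(index);
    }

    // Iterator-style access: every element of the result is a complete code point.
    static List<Integer> toCodePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);    // 1 for BMP characters, 2 for surrogate pairs
        }
        return result;
    }
}
```

A caller that previously used charAt(index) could call CodePoints.codePointContaining(s, index) and would then receive a 32-bit value regardless of which half of a pair the index points at.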

class String: searching

Example: 
int String.indexOf( char c )

Requirement:
When a matching character or string is found, the character that immediately follows the match has to be checked as well. If the matching sequence is immediately followed by a combining character, then it is not a valid match, because the combining character modifies the last character of the matching sequence.
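
The check described above might be sketched as follows. The class SafeSearch is hypothetical, and treating the Unicode general categories Mn, Mc and Me as "combining characters" is an assumption of this sketch, not something the requirement prescribes.

```java
// Sketch only: SafeSearch is a hypothetical helper class.
final class SafeSearch {

    // True if the code point is a combining mark (general categories Mn, Mc, Me).
    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Like String.indexOf(char), but skips matches that are modified
    // by a directly following combining mark. Returns -1 if no valid match exists.
    static int indexOf(String s, char c) {
        int from = 0;
        while (true) {
            int i = s.indexOf(c, from);
            if (i < 0) {
                return -1;                   // no further candidates
            }
            int next = i + 1;
            if (next >= s.length() || !isCombiningMark(s.codePointAt(next))) {
                return i;                    // match is not modified by a combining mark
            }
            from = next;                     // rejected, continue after this candidate
        }
    }
}
```

With this sketch, searching for 'e' in a string that starts with 'e' plus U+0301 (combining acute accent) skips the accented letter and only reports a later, unmodified 'e'.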

class String/StringBuffer: extracting parts of a string

Example: 
String String.substring( int beginIndex, int endIndex)

Requirement:
When extracting parts of a string, avoid splitting surrogate pairs and graphemes.
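
One way to meet this requirement, shown here only as a sketch, is to move both indices to the nearest character boundary reported by java.text.BreakIterator before calling substring. The class SafeSubstring and the choice to round indices down are assumptions of this example; depending on the programming context, rounding up or rejecting the call may be more appropriate.

```java
import java.text.BreakIterator;

// Sketch only: SafeSubstring is a hypothetical helper class.
final class SafeSubstring {

    // Moves 'index' back to the nearest character boundary at or before it.
    static int floorBoundary(String s, int index) {
        if (index <= 0) {
            return 0;
        }
        if (index >= s.length()) {
            return s.length();
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        // isBoundary is false inside a surrogate pair or a combining sequence
        return it.isBoundary(index) ? index : it.preceding(index);
    }

    // Like String.substring(beginIndex, endIndex), but never cuts through a
    // surrogate pair or a base character plus its combining marks.
    static String substring(String s, int beginIndex, int endIndex) {
        return s.substring(floorBoundary(s, beginIndex), floorBoundary(s, endIndex));
    }
}
```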

Solution Approach (depending on the programming context):

class StringBuffer: modifying parts of a string

Example:
StringBuffer StringBuffer.replace( int beginIndex, int endIndex, String s )

The indices that mark the borders of the operation must not cut surrogate pairs or graphemes. In principle, the same approaches can be applied as for extracting parts of a string.
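
For modifying operations, a stricter variant may be preferable: instead of silently adjusting the indices, the call is rejected. The following sketch assumes a hypothetical class SafeReplace and again relies on java.text.BreakIterator for the boundary test.

```java
import java.text.BreakIterator;

// Sketch only: SafeReplace is a hypothetical helper class.
final class SafeReplace {

    // True if 'index' does not fall inside a surrogate pair or a combining sequence.
    static boolean isCharacterBoundary(CharSequence text, int index) {
        if (index == 0 || index == text.length()) {
            return true;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text.toString());
        return it.isBoundary(index);         // throws IllegalArgumentException if out of range
    }

    // Like StringBuffer.replace, but refuses indices that would cut a character.
    static StringBuffer replace(StringBuffer buf, int beginIndex, int endIndex, String s) {
        if (!isCharacterBoundary(buf, beginIndex) || !isCharacterBoundary(buf, endIndex)) {
            throw new IllegalArgumentException("index splits a surrogate pair or grapheme");
        }
        return buf.replace(beginIndex, endIndex, s);
    }
}
```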

Rules and restrictions on strings that can be processed

Possible support by check tools

A check tool can detect the critical operations listed above and issue a warning wherever one of them is used.
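
A very simple form of such a check could be a textual scan of the source code, sketched below. The class CriticalCallScanner, the list of method names and the warning format are assumptions made for this illustration. A purely textual scan cannot know the type of the receiver, so it will also flag calls on unrelated classes; a real tool would more likely work on the parsed source or on bytecode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: CriticalCallScanner is a hypothetical, purely textual check.
final class CriticalCallScanner {

    // Method names taken from the critical operations named in this document.
    private static final Pattern CRITICAL =
            Pattern.compile("\\.(charAt|indexOf|substring|replace)\\s*\\(");

    // Returns one warning per critical call found in the given source lines.
    static List<String> scan(List<String> sourceLines) {
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            Matcher m = CRITICAL.matcher(sourceLines.get(i));
            while (m.find()) {
                warnings.add("line " + (i + 1) + ": critical call " + m.group(1) + "()");
            }
        }
        return warnings;
    }
}
```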

It may be possible to avoid the warning if the critical operation is used in a safe context. This can be: