5. String Classes and Collation

String handling is an important part of most applications. While Strings are a type of Collection, they have a number of unique features and behavior.

Characters and Unicode
Describes Characters.

CharacterCollection and String classes
Introduces the GemStone Smalltalk objects that store collections of Characters.

String Sorting and Collation
Describes collation, including traditional string collation and collation using the ICU libraries and Unicode strings.

Encrypting Strings
Explains how to encrypt strings.

5.1 Characters and Unicode

A Character is a special object—an object whose value is encoded in the OOP. Literal Characters are formed with a leading $.

Code point

Each Character has a code, or codePoint, which for Characters in the ASCII range (0..127) is the same as the ASCII value. The term ASCII value is sometimes applied to any codePoint, although it is incorrect for codePoints above 127. GemStone supports Characters with values from 0 to 16r10FFFF, the full Unicode range, except for the Unicode reserved range.

The Unicode range of codePoints from 16rD800-16rDFFF is reserved for encoding leading/trailing surrogate pairs for UTF-16 encoding. These can never be legal Unicode characters, and as such, it is an error to attempt to create a Character in this range.

To get the Character for a given codePoint, use the Character class methods withValue: or codePoint:.
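
As a minimal sketch of the two class methods named above, the following expressions create Characters from code points. The instance-side accessor codePoint in the last line is an assumption here; check the image for the exact protocol.

```smalltalk
"Both class methods answer the Character for the given code point."
Character withValue: 65.          "$A"
Character codePoint: 16r41.       "also $A, since 16r41 = 65"

"Assumed accessor: ask a Character for its code point."
$A codePoint.
```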

Attributes

Characters have “type”, and know if they are a digit, letter, separator, or other similar kind. This information is defined in the Unicode database as the Unicode general category, and a variety of testing methods are available. The Unicode database also defines the upper and lower case equivalents, and case conversion methods are available. See the image for a full list of available protocol.

For example,

$Z isUppercase

true

$u isDigit

false

Collation

Characters are ordered (collated) using internal character tables, which provide Unicode collation order for Characters up to code point 255. Characters above that are collated by code point. Character collation is used in collating instances of basic String classes. For more on collation, see String Sorting and Collation.

Character collation can be modified by installing character data tables, although this use is deprecated. This may be used to provide Unicode collation for Characters with codePoints above 255 or to provide legacy GemStone collate order (the collation order that was used in versions before 2.4). Character-based string collation has limitations outside the ASCII range; the ICU-library based string collation should be used if the default collation is not sufficient.

Unicode and the Unicode Database

The Unicode Consortium is an international standards organization that produces the Unicode Database. Unicode is a commonly used standard which provides unique codes for all Characters in all Character sets, in the range 0 to 0x10FFFF. It also describes the category of each Character and relationship between it and other Characters, and provides a default collation order with the Default Unicode Collation Element Table (DUCET).

For more information on this database, see http://www.unicode.org/Public/UNIDATA/UCD.html

The Unicode Consortium provides code charts by script as well as a single master list of all characters, presented in an ASCII-only, comma-delimited version. The current version of this database can be found at http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

Character Data Tables

Customized Character data tables are deprecated. This information is provided for convenience when performing tests as part of transitioning to alternate character and string collation and other handling.

Character data tables are an internal structure that supports Character collation and Character-based string collation. For performance, the base installed tables include only Characters with codes 0..255. Installing character data tables allows Unicode collation of Characters over 255, or collation in GemStone legacy collate order, which avoids the need for existing applications to rebuild indexes.

The character data tables are used repository-wide, and changing this table may have consequences; if the character tables change collation, then GemStone indexes and collections that depend on ordering may not return the correct results, and must be re-indexed or rebuilt.

Installing Character Data Tables

The following methods can only be executed by SystemUser. After commit, the installed tables apply to all logins for the repository. Any indexes or collections that depend on string collation must be rebuilt after installing new tables. Installed tables are not affected by upgrade or conversion. The tables distributed with GemStone are based on Unicode version 5.1; note that this is an older version of the Unicode standard.

To install a 0..255 character table in GemStone legacy collate order:

Character installCharTables: (PassiveObject fromServerTextFile:

		'$GEMSTONE/examples/CharTableDefault.tab') activate.

To install the full Unicode character table in GemStone legacy collate order:

Character activateCharTablesFromFile:

	'$GEMSTONE/examples/CharTableUnicode510.dat'.

To install the full Unicode character data table in Unicode DUCET collate order:

Character activateCharTablesFromFile:

	'$GEMSTONE/examples/CharTableUnicode410.dat'

To reset to the default internal character table:

Globals removeKey: #CharacterDataTables.

To disable loading of installed character tables on session login, set the following environment variable:

export GS_DISABLE_CHARACTER_TABLE_LOAD=TRUE

5.2 CharacterCollection and String classes

CharacterCollection and String classes

CharacterCollection is a subclass of SequenceableCollection that is specialized to hold Characters, and expands the protocol inherited from SequenceableCollection to include messages specialized for comparing, searching, concatenating, and changing the case of character sequences. CharacterCollection is the abstract superclass for strings, including String class and other specialized strings. In this discussion, we will generally use the term String to include all the subclasses of CharacterCollection, not just the String class itself.

Each element of a CharacterCollection is a Character. A Character has an associated value, which may require more than one byte of physical storage. This is handled for you by GemStone; if more storage is required, the String is transparently converted to the appropriate type. For String, this is DoubleByteString or QuadByteString; for Unicode7, this is Unicode16 or Unicode32. The specific class does not change the interaction with the object; access by index will return the Character at the given index, regardless of how many bytes the Character actually requires. However, if you need to write the String to a file or other non-GemStone sequential format, this may require converting to an appropriate byte-oriented encoding, generally UTF-8.
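
The transparent conversion described above can be sketched as follows. This is an illustration only, to be evaluated one expression at a time; the class names in the comments are what the description above leads one to expect, not captured output.

```smalltalk
| s |
s := String new.
s add: $a.
s class.                              "String: every code point fits in one byte"
s add: (Character codePoint: 16r3B1). "Greek small alpha, code point 945"
s class.                              "expected: DoubleByteString"
s at: 2.                              "index access still answers a whole Character"
```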

The CharacterCollection hierarchy includes the following concrete classes:

Strings

These classes are traditional strings.

String
Strings hold Characters with codepoints in the range 0..255.

DoubleByteString
DoubleByteStrings are required when one or more Characters in a string needs more than one byte of storage. DoubleByteStrings hold Characters with codepoints in the range 0..16rFFFF (64K).

QuadByteString
QuadByteStrings are required when one or more Characters in a string needs more than two bytes of storage. QuadByteStrings hold Characters with codepoints in the range 0..16r10FFFF.

While traditional Strings normally hold human-readable text characters, this is not a requirement. Generally, raw byte data would be held in an instance of ByteArray, but it may be more convenient to use a String. In particular, there are cases when an instance of String will hold raw UTF-8 encoded bytes.

Unicode Strings

For strings that require locale-specific collation, specialized subclasses of the String classes are provided. These classes rely on the open-source ICU libraries to provide comparison and sorting behavior.

Unicode7
A subclass of String, limited to holding Characters with codepoints in the range 0..127 that are represented in 7 bits.

Unicode16
A subclass of DoubleByteString, holding Characters with codepoints in the range 0..16rFFFF (64K), excluding the range 16rD800-16rDFFF. This range is reserved for surrogates that allow encoding into UTF-16.

Unicode32
A subclass of QuadByteString, holding Characters with codepoints in the range 0..16r10FFFF. Again, this excludes the range 16rD800-16rDFFF.

Unicode strings should not hold raw byte data.

Symbol

Class Symbol is a subclass of String. Each Symbol with a unique set of Characters is guaranteed to have only one canonical instance in GemStone. Symbols are created by a special process, the SymbolGem, to ensure this uniqueness.

Like Strings, symbols may also contain Characters with values that require more than a byte of storage, and will convert into DoubleByteSymbols or QuadByteSymbols as needed. GemStone Smalltalk uses symbols internally to represent variable names and selectors. All symbols may be viewed by all users. Private information should be maintained in Strings, not in Symbols.

Symbols, DoubleByteSymbols, and QuadByteSymbols are restricted to 1024 or fewer characters.

You can “create” symbols using the asSymbol or withAll: methods. If the Symbol was created previously as part of the GemStone kernel, by another user, or by yourself, you will get the existing Symbol. A new symbol is only created if it has not been previously defined. Existing Symbols cannot be modified.
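
A short sketch of the canonical behavior described above, using only asSymbol and withAll:. The results noted in the comments follow from the uniqueness guarantee rather than from captured output.

```smalltalk
| a b |
a := 'azure' asSymbol.
b := Symbol withAll: 'azure'.
a == b.        "true: both refer to the single canonical Symbol"
a == #azure.   "true: the literal is the same canonical instance"
```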

Since Symbols are canonical, the class of a Symbol always depends on the contents. While you can create a DoubleByteString with only characters in the range of String, you cannot create a DoubleByteSymbol that does not contain at least one character in the DoubleByte range, and the same is true for QuadByteSymbol.

Symbols that have no references from anywhere in the system will eventually be garbage collected, if the system is configured to do so. See the System Administration Guide for more information on symbol garbage collection.

ByteArray

ByteArray is a specialized collection that is restricted to holding Integers between 0 and 255 (inclusive). While ByteArray is not a kind of String, the contents may be interpreted as a String.

Instances of ByteArray can be created using the literal syntax #[]. For example:

#[ 1 2 3 4 ]

Utf8

Utf8 is a subclass of ByteArray. It is not a kind of String, but may easily be converted back and forth from a traditional or Unicode String. A Utf8 holds the UTF-8 encoded bytes created by sending encodeAsUTF8 to a string, or by reading encoded data from a GsFile using contentsAsUTF8. Utf8 instances should not be directly created or edited.

'šamas' encodeAsUTF8

anUtf8( 197, 161, 97, 109, 97, 115)

Instances of Utf8 can be read from and written to instances of GsFile, which cannot directly handle Characters with codePoints over 255.

String equality, ordering, and interoperation

By default, traditional strings and symbols are compared for equality and ordered using legacy rules. Strings and symbols are ordered using a character-based comparison, and equality includes non-printing characters as well as printing characters. The collation is described in more detail under Traditional String Legacy Collation.

Unicode strings always use an IcuCollator. Comparison is based on the entire string and is highly configurable; the character data tables are not used. Unicode string equality does not consider non-printing characters. Unicode collation is described under Unicode String Collation using ICU libraries.

Since legacy and Unicode equality and ordering rules are different, traditional strings and symbols, using legacy comparison, cannot be ordered with Unicode strings (other than using protocol that explicitly provides a collator). To avoid inconsistent results, attempting to do so results in an error.

When Unicode Comparison Mode is enabled (see Unicode Comparison Mode), traditional strings and symbols are also collated using Unicode rules, and can be ordered and compared with Unicode strings in collections.

String protocol

Creating Strings

You have already seen strings created as literals. Strings created as literals are invariant; they cannot be modified after they are created.

In addition to creating strings literally, you can use the inherited instance creation methods, such as new: and withAll:. For example:

| myString |

myString := String withAll: #($a $z $u $r $e).

myString

azure

To create a string that can be modified later, you can use withAll: with a literal String:

| myString |

myString := String withAll: 'azure'.

myString at: 1 put: $A.

myString

Azure

Concatenating Strings

A string responds to the comma operator by returning a new string in which the argument to the comma has been appended to the string’s original contents. For example:

'String ' , 'con' , 'catenation'

String concatenation

Although this technique is handy when you need to build a small string, it’s not very efficient. In the last example, GemStone Smalltalk creates a String object for the first literal, 'String'. The #, message returns a new instance of String containing 'String con', and the second #, message creates a third string.

When you need to build a longer string, you’ll find it more efficient to use addAll: or add: (they’re the same for class String), which modify the original string. Note that you cannot start with a literal string, since a literal string is invariant.

For example:

| resultString |

resultString := String new.

resultString add: 'String ';

             add: 'con';

             add: 'catenation'.

resultString

String concatenation

Converting between String classes and encodings

To convert between UTF-8 encoded bytes and the various kinds of string classes, there are a number of methods:

Instances of Symbols and traditional Strings can be converted to the lowest-storage type of Unicode string using asUnicodeString.
Instances of Symbols and Unicode strings can be converted to the lowest-storage type of traditional strings using asString.
A traditional String that is composed of raw UTF-8 encoded bytes can be decoded to a Unicode string using decodeFromUTF8ToUnicode, or to another traditional String with decoded bytes, using decodeFromUTF8ToString.
To convert from a ByteArray or Utf8 to a Unicode string, use decodeToUnicode, or to convert to a traditional String, use decodeToString.
Instances of ByteArray and Utf8 may be converted to a traditional String without decoding by using bytesIntoString.
All kinds of Strings can be encoded to an instance of Utf8 by using encodeAsUTF8.
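
The conversions listed above can be combined as in the following sketch. It uses only the selectors named in the list; results are not shown, since the printed form depends on the environment.

```smalltalk
| text utf8 |
text := 'šamas'.

"Encode any kind of string to an instance of Utf8."
utf8 := text encodeAsUTF8.

"Decode the bytes back to a Unicode string or a traditional string."
utf8 decodeToUnicode.
utf8 decodeToString.

"Copy the raw bytes into a traditional String without decoding, then decode that String."
utf8 bytesIntoString decodeFromUTF8ToString.

"Convert between traditional and Unicode strings."
text asUnicodeString.
text asUnicodeString asString.
```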

String Transformations

CharacterCollection and its subclasses define messages that let you perform various conversions.

Strings can be converted in case:

asUppercase creates a new instance with all uppercase letters
asLowercase creates a new instance with all lowercase letters
asTitlecase creates a new instance with the first letter of each word capitalized, the remaining letters lowercase.
asFoldcase returns a new instance in “fold case”, which is case-free for comparison, and usually is similar to the lowercase.

For example:

'abcde' asUppercase

ABCDE

You can remove leading and/or trailing whitespace separators using methods such as trimSeparators. There are a number of variants; see the image for details.

For example:

'  abcde  ' trimSeparators

'abcde'

Strings can be split using the subStrings: method, which allows you to specify one or more characters to use as markers.

For example, to split a string at each / character:

'owa/tagu/siam' subStrings: '/'

anArray( 'owa', 'tagu', 'siam')

Strings can be converted to numbers and other types of objects as well. Note that not all Strings can be converted to all kinds of other objects. If the String does not contain the representation of a number, for example, it is meaningless to convert it to an Integer, and attempting to do so returns an error.

For example:

'15' asFloat

15.0

Equality and Identity

Traditional strings are equal to each other if they contain the exact same Characters in the same case; equality is case-sensitive.

Unicode strings compared using = follow the ICU library comparison rules for equality, which are similar, although any non-whitespace control characters (such as null) are ignored for the comparison.

Traditional strings and Unicode strings cannot be compared to each other for equality using =. Any comparison involving a Unicode string requires a collator. To compare traditional and Unicode strings in any combination, use compareTo:collator:. Specifying nil for the collator uses the default collator.

Strings can be compared for case-insensitive equality using the methods isEquivalent: or equalsNoCase:.
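
The equality rules above can be sketched as follows, assuming the default legacy comparison mode. The result of compareTo:collator: is left unstated, since its exact return values are not described here.

```smalltalk
'Azure' = 'azure'.                  "false: = is case-sensitive for traditional strings"
'Azure' isEquivalent: 'azure'.      "true: case-insensitive equality"
'Azure' equalsNoCase: 'azure'.      "true"

"Comparing a traditional string with a Unicode string requires a collator; nil selects the default collator."
'azure' compareTo: 'azure' asUnicodeString collator: nil.
```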

Identity in Literal vs. nonliteral

Literal and nonliteral Strings behave differently in identity comparisons. Each nonliteral String (created, for example, with new, withAll:, or asString) has a unique identity. That is, two Strings that are equal are not necessarily identical.

| nonlitString1 nonlitString2 |

nonlitString1 := String withAll: #($a $b $c).

nonlitString2 := String withAll: #($a $b $c).

(nonlitString1 == nonlitString2)

false

However, literal strings that contain the same character sequences and are compiled at the same time are both equal and identical:

| litString1 litString2 |

litString1 := 'abc'.

litString2 := 'abc'.

(litString1 == litString2)

true

This distinction can become significant in building sets. If you add both litString1 and litString2 to the same IdentitySet, the set will contain only one instance of 'abc'; however, an IdentitySet would include both nonlitString1 and nonlitString2.

Searching and Pattern matching

CharacterCollection and its subclasses define methods that can tell you whether a string contains a particular sequence of characters and, if so, where the sequence begins. This search can be case sensitive, case insensitive, and may include wild cards.

Below are some common methods; see the image for further methods.

Table 5.1 Search and Pattern Match Protocol
Case-sensitive Search	Case-insensitive Search	Description
	includesString: subString	Return true if the receiver includes subString.
findString: subString startingAt: anIndex	findStringNoCase: subString startingAt: anIndex	Return the index of subString if it exists within the receiver at anIndex or above, otherwise zero (0).
matchPattern: patternArray		Return true if the receiver matches the specifications in patternArray
findPattern: patternArray startingAt: anIndex	findPatternNoCase: patternArray startingAt: anIndex	Return the index of a substring in the receiver that matches the specifications in patternArray at anIndex or above, otherwise zero (0).
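
As a brief illustration of the search methods in Table 5.1, the following sketch searches the same receiver in several ways. The indexes in the comments follow from counting characters in 'weimaraner' and from the descriptions in the table.

```smalltalk
'weimaraner' findString: 'ran' startingAt: 1.         "6: 'ran' begins at the sixth character"
'weimaraner' findString: 'ran' startingAt: 7.         "0: no occurrence at index 7 or above"
'weimaraner' findStringNoCase: 'RAN' startingAt: 1.   "6: case is ignored"
'weimaraner' includesString: 'mar'.                   "true"
```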

Pattern Matching Wild Cards

Pattern matching arguments (patternArray) consist of an Array containing combinations of Strings and the wildcard characters $* and $?. The character $? matches any single character in the receiver, and $* matches any sequence of characters in the receiver.

This is an example of the use of wildcard characters in pattern matching.

'weimaraner' matchPattern: #('w' $* 'r')

true

Since $* is interpreted as “any sequence of characters”, this returns true.

Similarly, the following example returns the index at which a sequence of characters beginning and ending with $r occurs in the receiver.

'weimaraner' findPattern: #('r' $* 'r') startingAt: 1

If a wildcard character $* or $? occurs in the receiver or within a string in the argument array, it is interpreted literally.

The following expressions illustrate what happens when the * is within the string and interpreted literally:

'w*r' matchPattern: #('weimaraner')

false

'weimaraner' findPattern: #('w*r') startingAt: 1

5.3 String Sorting and Collation

While strings clearly have a natural sort order, the details of that order are complex. Different languages may sort the same set of strings differently, according to the particular rules in that language. Even within one language, different applications may want to order string data differently. To complicate matters, some languages may treat certain sequences of characters as a unit when sorting strings.

The sorting of strings into a standard order is called collation. Collation depends on the results of a comparison between two strings, which in turn depends on how the Characters within the string are collated. While this simple view breaks down with some sorting requirements and linguistic rules, basic string comparison is adequate for many uses and is faster than the more complete external collation.

Traditional String Legacy Collation

Traditional strings (String, DoubleByteString, and QuadByteString) and symbols (Symbol, DoubleByteSymbol, and QuadByteSymbol) are collated by individual character. The comparison of Characters with values up to 255 is done according to the Default Unicode Collation Element Table (DUCET), and Characters with values of 256 and above are sorted by codePoint, the Unicode numeric value.

This is the default behavior for traditional strings and symbols. Installing non-default Character Data Tables (see Character Data Tables) will affect the Character collation, according to the specific table installed. Enabling Unicode Comparison Mode (see Unicode Comparison Mode) causes traditional strings and symbols to collate following the same rules as Unicode strings. This section does not apply when using Unicode Comparison Mode.

String ordering using <= (as well as <, >, and >=) is not case-sensitive. When instances of String, DoubleByteString, and QuadByteString are compared using <= or related operations, the comparison first is done case-insensitive. If they are found to be equal other than with respect to case—if the only difference is case—then they are collated according to the Character Data Table, which specifies uppercase comes before lowercase.

For example:

#( 'c' 'MM' 'Mm' 'mb' 'mM' 'mm' 'x' ) sortAscending

anArray( 'c', 'mb', 'MM', 'Mm', 'mM', 'mm', 'x')

Since ordering is by character, with only case being excluded, the default ordering is sensitive to accents and other diacritical marks on characters. Characters with diacritical marks are not related to the base character.

For example, all words beginning with 'Co' and 'co' would sort before all words beginning with 'Có' and 'có':

#('Cór' 'COz' 'Coa' 'cóa') sortAscending

anArray( 'Coa', 'COz', 'cóa', 'Cór')

Unicode String Collation using ICU libraries

The classes IcuLocale and IcuCollator provide an interface to the ICU (International Components for Unicode) libraries. The ICU libraries are a widely-used, open-source implementation of language-specific sorting and collation.

For a complete explanation of the features and subtleties of language-specific collation, you should refer to documentation on the ICU website, http://icu-project.org/.

Unicode strings (instances of Unicode7, Unicode16, and Unicode32) and instances of Utf8 use IcuCollator and IcuLocale to perform sorting operations using the ICU libraries. The collation is performed by considering the entire string, not on a character-by-character basis, and requires a specific language and locale to determine the rules for the comparison. In addition to specific language rules, ICU sorting is highly configurable for other application-specific sorting requirements.

While collation will vary according to specific language and locale, in general ICU collation orders characters with diacritical marks with the base character, and sorts lowercase before uppercase.

For example, using the sorting examples in the previous section and the default collator for the US, a different sort ordering is produced from that of legacy collation:

( 'c', 'mb', 'mm', 'mM', 'Mm', 'MM', 'x')

( 'Coa', 'cóa', 'Cór', 'COz')

By configuring the IcuCollator, however, other orderings, including ordering similar to the traditional string collation, may be produced.

By default, only Unicode strings and Utf8 instances use ICU sorting. You may explicitly order traditional strings and symbols by specifying an IcuCollator, and enabling Unicode Comparison Mode causes all of these to use ICU comparison and sorting.

IcuLocale

Instances of IcuLocale represent a specific language, country, and language variant. The available IcuLocales are in the shared library and can be listed using IcuLocale class >> availableLocales.

A default instance of IcuLocale is instantiated on first reference, and stored in session state. The default IcuLocale is based on the operating system locale setting for the gem.

To set a specific default IcuLocale, use the method IcuLocale class >> default:. This sets the default locale for the session executing this code. While the instance of IcuLocale can be made persistent, the default IcuLocale does not persist from session to session.

IcuLocale class >> getUS is an example of instantiating an IcuLocale for the language English in the country US.

To determine what IcuLocale is currently in use, use the method IcuLocale class >> default.

Example 5.1 Default IcuLocale

topaz 1> printit

IcuLocale default

a IcuLocale

  name                en_US
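
Setting and then reading the session default, using the class methods described above, might look like the following sketch. The setting applies only to the session that executes it.

```smalltalk
"Make US English the default locale for this session only."
IcuLocale default: IcuLocale getUS.

"Answers the locale now in effect for the session."
IcuLocale default.
```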

IcuCollator

An IcuCollator encapsulates the rules involved in collation for a specific IcuLocale. A default instance of IcuCollator is instantiated on first reference, based on the default IcuLocale, and stored in session state.

When comparing instances of Unicode String classes, the comparison always uses an IcuCollator, using the method compareTo:collator:. If an IcuCollator is not specified, such as when Unicode String classes are compared using >, the IcuCollator default is used, which in turn uses IcuLocale default.

You can also create an instance of IcuCollator for a specific locale, if you need to use specific collation rules other than the default. You can do this using IcuCollator class methods forLocale: anIcuLocale or forLocaleNamed: aString. For example, to create an IcuCollator for the German language as spoken in Germany:

IcuCollator forLocaleNamed: 'de_DE'

The actual string comparison is done by the ICU libraries, and follows the ICU comparison rules for that locale. Collation rules are similar in most western languages, but there are differences in specific languages.

For example, in the Hungarian language, ’cs’ is considered a single letter, so words that start with ’cs’ are sorted together and follow other words beginning with ’c’. The following example sets up a collection that is sorted according to Hungarian rules:

Example 5.2 Sorting in Hungarian IcuLocale

| hungarianWords collator |

collator := IcuCollator forLocaleNamed: 'hu_HU'.

hungarianWords := IcuSortedCollection newUsingCollator: collator.

hungarianWords

	add: 'csak' asUnicodeString;

	add: 'cukor' asUnicodeString;

	add: 'comb' asUnicodeString.

hungarianWords

a IcuSortedCollection

  sortBlock           a ExecBlock2

  collator            a IcuCollator

  #1 comb

  #2 cukor

  #3 csak

Customizing Sort

IcuCollator includes a number of attributes that can be used to customize the sort. These attributes work within the specific language rules of the associated IcuLocale.

Keep in mind that the default values and descriptions listed in Table 5.2 apply to most locales, but not all. In some locales, particularly those with non-Western scripts, the defaults may be different and an attribute may behave differently.

See the ICU site, particularly the pages under http://userguide.icu-project.org/collation, for more precise descriptions and more detailed documentation.

Table 5.2 IcuCollator Attributes
Attribute name	Allowed values	Default	Description
alternateHandling	true | false	false	When true, allows space and punctuation characters within the string to be ignored.
caseFirst	'off', 'upperFirst', or 'lowerFirst'	'off'	When comparing case, determines if upper or lowercase is sorted first. Most locales sort lowercase first when caseFirst is 'off' as well as when 'lowerFirst'.
caseLevel	true | false	false	When true, considers case in the comparison, even if the strength would normally not consider case. For some locales, adds another strength level between SECONDARY and TERTIARY strengths.
frenchCollation	true | false	false	When true, sorts secondary differences (the same base character with different diacritical marks) in reverse order (starting from the end of the string). This is the correct collation in French.
normalization	true | false	false	Determines whether to normalize input strings, useful if input data may be un-normalized. Has performance impact.
numericCollation	true | false	false	When true, sorts numeric sequences within the string by numerical rather than string comparison; e.g. sort '100' after '2'.
strength	PRIMARY - 0, SECONDARY - 1, TERTIARY - 2, QUATERNARY - 3, or IDENTICAL - 15	TERTIARY	Determines the level of collation factors to consider, such as diacritical marks and case. See discussion below for more details.

Strength allows degrees of sort, to consider or not consider things like accent characters and case when performing the sort. The default strength is TERTIARY for most locales (the main exception being Japanese). The following are the sort strengths:

PRIMARY sorts by primary differences, ignoring secondary and later differences. A difference in base letter is a primary difference, for example 'a' and 'b'.
SECONDARY sorts by primary and secondary differences, ignoring tertiary and later differences. An example of a secondary difference is diacritical differences on the same base letter, for example 'o' and 'ó'.
TERTIARY sorts by primary, then secondary, then tertiary differences. Uppercase vs. lowercase is a tertiary difference. TERTIARY is the default sort order for most locales.
QUATERNARY is used in Japanese, where it distinguishes between Japanese Katakana and Hiragana, and can be used to break ties among separator characters when alternateHandling is true.
IDENTICAL sorts by the specific character, by codepoints in the NFD (Normalization Form Canonical Decomposition) form. There is a performance impact with this strength.

The default sort strength is TERTIARY. As an example, when two strings are compared using TERTIARY strength, characters in the strings are compared first by the base character, ignoring any case or diacritical marks. If the base characters are the same, they are compared by diacritical mark, ignoring case. If both base characters and diacritical marks are the same, then case is considered. Note that unlike GemStone’s Strings or ASCII ordering, the default sort places lowercase before uppercase.

Keep in mind that with lower sort strengths, when a factor such as case is not used, the relative position in the results of similar strings is not deterministic; the strings compare as the same, and so their position will depend on the order of the input.

By using the IcuCollator sort attributes, you have a great deal of control over your specific sorting.

For example, using the alternateHandling attribute, you can sort strings that include spaces, dashes and other punctuation without considering the punctuation characters when doing the comparison:

Example 5.3 Sort ignoring punctuation

| blues collator|

collator := IcuCollator forLocale: IcuLocale default.

collator alternateHandling: true.

blues := IcuSortedCollection newUsingCollator: collator.

blues add: (Unicode7 withAll: 'blue berry').

blues add: (Unicode7 withAll: 'blue moon').

blues add: (Unicode7 withAll: 'bluebird').

blues add: (Unicode7 withAll: 'blue bird').

blues add: (Unicode7 withAll: 'blue-bird').

blues add: (Unicode7 withAll: 'bluetooth').

blues

a IcuSortedCollection

  sortBlock           a ExecBlock2

  collator            a IcuCollator

  #1 blue berry

  #2 bluebird

  #3 blue bird

  #4 blue-bird

  #5 blue moon

  #6 bluetooth

IcuSortedCollection

An IcuSortedCollection is a specialized subclass of SortedCollection for which you do not set the sortBlock. An IcuSortedCollection may only hold instances of subclasses of CharacterCollection. It is associated with an IcuCollator, which in turn is associated with an IcuLocale, and the sorting behavior is specific to the configuration of these instances. IcuSortedCollections rely on the open-source ICU libraries to perform the comparisons and produce correctly collated results.

Using IcuCollator is recommended if you will have sorted collections containing Unicode strings. This avoids lookup failures that can occur when a different collator is used to look up elements than was used to sort them.

Unicode Comparison Mode

Configuring your repository to use Unicode Comparison Mode allows traditional strings to automatically use Unicode comparison rules. This permits traditional and Unicode strings to be compared interchangeably.

Unicode Comparison Mode affects not only collation using >, >=, <, <=, but also equality using = and ~=. The legacy and Unicode rules for equality are not identical.

Since Unicode Comparison Mode affects equality comparisons of traditional strings and symbols, as well as ordering, it may break lookup in existing hashed collections in addition to SortedCollections and indexes. Use caution when determining if your existing application can use Unicode Comparison Mode; GemStone does not include tools or code to support application validation after changing comparison mode. Working with GemTalk Technical Support or Professional Services is recommended.

Unicode Comparison Mode is controlled by the Global #StringConfiguration. By default, StringConfiguration is set to String, which provides legacy string comparison mode.

To turn on Unicode mode, as SystemUser, execute

StringConfiguration enableUnicodeComparisonMode

This returns the previous setting for Unicode Comparison Mode. Note that the current session is not affected; the new mode will take effect for all subsequent logins.

To disable Unicode Comparison Mode, as SystemUser, execute

StringConfiguration disableUnicodeComparisonMode

Again, note that the current session is not affected; the new mode will take effect for all subsequent logins.

5.4 Encrypting Strings

There are times when you may wish to encrypt strings in your repository or for transmittal to other systems. GemStone provides an interface to Advanced Encryption Standard (AES) encryption/decryption, provided by the OpenSSL open source libraries included with GemStone.

The AES specification is available at http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.

All encryptions/decryptions are in cipher block chaining (CBC) mode; see the AES specification document for further details.

Encryption and decryption API methods are provided for 128-bit/16-byte keys, 192-bit/24-byte keys, and 256-bit/32-byte keys.

Encryption can be done on instances of ByteArray or Utf8, or subclasses of CharacterCollection. For encryption, you must provide a key that is a ByteArray of the appropriate size (16, 24, or 32 bytes) containing key bytes, and a salt that is a 16-byte ByteArray containing salt values.

The following methods encrypt or decrypt using the specified key and salt, and return the encrypted or decrypted result:

aesEncryptWith128BitKey: aKey salt: aSalt

aesDecryptWith128BitKey: aKey salt: aSalt

aesEncryptWith192BitKey: aKey salt: aSalt

aesDecryptWith192BitKey: aKey salt: aSalt

aesEncryptWith256BitKey: aKey salt: aSalt

aesDecryptWith256BitKey: aKey salt: aSalt

These methods place the encrypted or decrypted result into aByteObjOrNil, starting at offset 1, and resizing if necessary. If aByteObjOrNil is nil, a new instance of the same class as the receiver will be created containing the results.

aesEncryptWith128BitKey: aKey salt: aSalt into: aByteObjOrNil

aesDecryptWith128BitKey: aKey salt: aSalt into: aByteObjOrNil

aesEncryptWith192BitKey: aKey salt: aSalt into: aByteObjOrNil

aesDecryptWith192BitKey: aKey salt: aSalt into: aByteObjOrNil

aesEncryptWith256BitKey: aKey salt: aSalt into: aByteObjOrNil

aesDecryptWith256BitKey: aKey salt: aSalt into: aByteObjOrNil
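As an illustration of the into: variants, the following sketch encrypts a string into an existing ByteArray and then decrypts it into an existing String. It uses only the selectors listed above; the variable names and sample text are illustrative, and it relies on the target objects being resized as described rather than on any particular return value.

```smalltalk
| key salt encrypted decrypted |
"16-byte key for the 128-bit methods, plus the required 16-byte salt."
key := ByteArray withRandomBytes: 16.
salt := ByteArray withRandomBytes: 16.
"Empty target objects; each is filled starting at offset 1 and resized as needed."
encrypted := ByteArray new.
decrypted := String new.
'My secret string' aesEncryptWith128BitKey: key salt: salt into: encrypted.
encrypted aesDecryptWith128BitKey: key salt: salt into: decrypted.
decrypted
```

Passing nil instead of a target object would, per the description above, produce a new instance of the receiver's class to hold the result.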

You may use ByteArray withRandomBytes: N to produce pseudo-random key and salt values for encryption.

For example:

topaz 1> run

| key salt encrypted |

key  := ByteArray withRandomBytes: 32.

salt := ByteArray withRandomBytes: 16.

encrypted := 'My secret string' aesEncryptWith256BitKey: key

	salt: salt.

encrypted aesDecryptWith256BitKey: key salt: salt.

%

My secret string
