5. String Classes and Collation

String handling is an important part of most applications. While Strings are a type of Collection, they have a number of unique features and behaviors.

Characters and Unicode
Describes Characters.

String classes
Introduces the GemStone Smalltalk objects that store collections of Characters.

String Sorting and Collation
Describes collation, including Traditional string collation and collation using the ICU libraries and Unicode strings.

Encrypting Strings
Explains how to encrypt strings.

5.1 Characters and Unicode

A Character is a special object: an object whose value is encoded in the OOP. Literal Characters are formed with a leading $.

Code point

Each Character has a code or codePoint, which for lower order Characters is the ASCII value. Either of these terms may be used, though ASCII is an incorrect term for the higher code points. GemStone supports Characters with values from 0 to 16r10FFFF, the full Unicode range, except for the Unicode reserved range.

The Unicode range of codePoints from 16rD800-16rDFFF is reserved for encoding leading/trailing surrogate pairs for UTF-16 encoding. These can never be legal Unicode characters, and as such, it is an error to attempt to create a Character in this range.

To get the Character for a given codePoint, use the Character class methods withValue: or codePoint:.
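
As a minimal sketch of the rules above, the following block answers the Character for a code point, or nil when the code point is outside the supported range or inside the reserved surrogate range. The block name makeCharacter is hypothetical and exists only for this illustration; withValue: is the class method named above.

```smalltalk
| makeCharacter |
"Hypothetical helper: answer the Character for a code point, or nil if the
 code point is outside 0..16r10FFFF or inside the reserved surrogate range."
makeCharacter := [:cp |
    ((cp between: 0 and: 16r10FFFF)
        and: [(cp between: 16rD800 and: 16rDFFF) not])
            ifTrue: [Character withValue: cp]
            ifFalse: [nil]].
makeCharacter value: 16r41.      "$A"
makeCharacter value: 16rD800.    "nil: reserved surrogate range"
makeCharacter value: 16r1F600    "a Character above 16rFFFF"
```

Checking the range first avoids the error that would otherwise be raised for a surrogate code point.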

Attributes

Characters have “type”, and know if they are a digit, letter, separator, or other similar kind. This information is defined in the Unicode database as the Unicode general category, and a variety of testing methods are available. The Unicode database also defines the upper and lower case equivalents, and case conversion methods are available. See the image for a full list of available protocol.

For example,

$Z isUppercase
true
 
$u isDigit
false

Collation

Characters are ordered (collated) using internal character tables, which provide a Unicode-like collation order for Characters up to code point 255. Characters above that are collated by code point. Character collation can be modified by installing character data tables, although this use is deprecated.

Character collation is used in collating instances of Traditional string classes, in Legacy String Comparison Mode. This character-based string collation has limitations outside the ASCII range; the ICU-library based string collation should be used if the default collation is not sufficient. For more on collation, see String Sorting and Collation.

Unicode and the Unicode Database

The Unicode Consortium is an international standards organization that produces the Unicode Database. Unicode is a commonly used standard which provides unique codes for all Characters in all Character sets, in the range 0 to 0x10FFFF. It also describes the category of each Character and relationship between it and other Characters, and provides a default collation order with the Default Unicode Collation Element Table (DUCET).

For more information on this database, see http://www.unicode.org/Public/UNIDATA/UCD.html

The Unicode Consortium provides code charts by script as well as a single master list of all characters, presented in an ASCII-only, comma-delimited version. The current version of this database can be found at http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

5.2 String classes

A string is a sequence of Characters, implemented as an instance of a subclass of CharacterCollection.

Each element in a CharacterCollection is a Character. Since Characters may require more than one byte of storage, a string may be transparently converted to an instance of the class with the appropriate capacity for its Characters. The semantics of the CharacterCollection remain the same; access by index will return the Character at the given index, regardless of how many bytes the Character actually requires.

A fundamental quality of strings is collation. Since the scope of collation includes equality, the collation of strings affects a repository in many ways, such as dictionary lookups. Collation in GemStone has historically been handled using character-based tables. Unicode string-based collation using ICU open source libraries is included in recent releases and provides a much richer set of collation features. To ensure that legacy applications function correctly, GemStone supports both of these encoding/collation schemes.

Traditional Strings

In Legacy String Comparison Mode, Traditional strings collate using internal character-based collation tables. When the repository is in Unicode Comparison mode, however, Traditional strings use ICU-based Unicode collation.

Traditional strings are implemented in three classes:

String
Strings hold Characters with codepoints in the range 0..255 (8 bits).

DoubleByteString
DoubleByteStrings are required when one or more Characters in a string needs more than one byte of storage. DoubleByteStrings hold Characters with codepoints in the range 0..16rFFFF (64K).

QuadByteString
QuadByteStrings are required when one or more Characters in a string needs more than two bytes of storage. QuadByteStrings hold Characters with codepoints in the range 0..16r10FFFF.
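
The transparent class conversion between these three classes can be observed with a short sketch. The class names in the comments are what the description above leads one to expect, not verified output.

```smalltalk
| s |
s := String new.
s add: $a.
s class.                                  "String: every code point fits in 8 bits"
s add: (Character withValue: 16r3B1).     "Greek small alpha, code point above 255"
s class.                                  "expected: DoubleByteString"
s add: (Character withValue: 16r1F600).   "code point above 16rFFFF"
s class.                                  "expected: QuadByteString"
s at: 2                                   "still the second Character, whatever the storage width"
```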

While Traditional strings normally hold human-readable text characters, this is not a requirement. Generally, raw byte data would be held in an instance of ByteArray, but it may be more convenient to use a String. In particular, there are cases when an instance of String will be used to hold raw UTF-8 encoded bytes.

Unicode Strings

Unicode strings always use ICU string-based collation. As with Traditional strings, there are three classes based on range, but note that the codePoint ranges differ from those of Traditional strings.

Unicode7
A subclass of String, limited to holding Characters with codepoints in the range 0..127 that are represented in 7 bits.

Unicode16
A subclass of DoubleByteString, holding Characters with codepoints in the range 0..16rFFFF (64K), excluding the range 16rD800-16rDFFF. This range is reserved for surrogates that allow encoding into UTF-16.

Unicode32
A subclass of QuadByteString, holding Characters with codepoints in the range 0..16r10FFFF. Again, this excludes the range 16rD800-16rDFFF.

Unicode strings should not hold raw byte data.

String equality, ordering, and interoperation

In Legacy String Comparison Mode, Traditional strings and symbols are compared for equality and ordered using character-based comparison, and equality includes non-printing characters as well as printing characters.

Unicode strings use the ICU string-based string collation, in which equality does not consider non-printing characters.

Since Traditional and Unicode string equality rules are different, comparing Traditional strings and symbols with Unicode strings (when the repository is in Legacy String Comparison Mode) may produce inconsistent results. In this mode it is an error to mix Unicode strings with Traditional strings or symbols, either for ordering or for equality.

Other String-like classes

Symbol

A symbol is similar to a string, but each symbol with a unique set of Characters is guaranteed to have only one canonical instance in GemStone. Symbols are created by a special process, the SymbolGem, to ensure this uniqueness. Creating a new symbol will return an existing symbol, if one exists; a new symbol is only created if it has not been previously defined. Existing symbols cannot be modified.

Like strings, symbols may also contain Characters with values that require more than a byte of storage, and will convert from class Symbol into DoubleByteSymbol or QuadByteSymbol as needed. Since symbols are canonical, the class of a symbol always depends on the contents. While you can create a DoubleByteString with only characters in the range of String, you cannot create a DoubleByteSymbol that does not contain at least one character in the DoubleByte range, and the same is true for QuadByteSymbol.

All symbols may be viewed by all users. Private information should be maintained in strings, not in symbols.

Symbols, DoubleByteSymbols, and QuadByteSymbols are restricted to 1024 or fewer characters.

Symbols that have no references from anywhere in the system may eventually be garbage collected, if the system is configured to do so. See the System Administration Guide for more information on symbol garbage collection.

Symbols, like strings, collate using character-based tables in Legacy String Comparison Mode and using ICU string-based collation in Unicode Comparison Mode. As a result, they cannot be compared to Unicode strings in Legacy String Comparison Mode.

The literal form of a Symbol is specified using a leading #. The body of the symbol may additionally include single quotes. This is optional for symbols that are legal identifiers and keywords, but required for symbols that start with a number, include punctuation/spaces, etc. For example:

#'22 skidoo'
#fooBar
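
The canonical nature of symbols can be illustrated with identity comparisons. This sketch assumes the standard conversion message asSymbol, which is not described in this section.

```smalltalk
| s1 s2 |
s1 := 'fooBar' asSymbol.      "asSymbol is assumed; it answers the canonical symbol"
s2 := #fooBar.
s1 == s2.                     "true: both refer to the single canonical instance"
'22 skidoo' asSymbol == #'22 skidoo'   "true: quoted literal form, same canonical symbol"
```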

ByteArray

ByteArray is a specialized collection that is restricted to holding Integers between 0 and 255 (inclusive). While ByteArray is not a kind of String, the contents may be interpreted as a String.

Instances of ByteArray can be created using the literal syntax #[]. For example:

#[ 1 2 3 4 ]

Utf8

Utf8 is a subclass of ByteArray. It is not a kind of String, but may easily be converted back and forth from a traditional or Unicode string. A Utf8 holds the UTF-8 encoded bytes created by sending encodeAsUTF8 to a string, or by reading encoded data from a GsFile using contentsAsUTF8. Utf8 instances should not be directly created or edited.

'šamas' encodeAsUTF8
anUtf8( 197, 161, 97, 109, 97, 115)

Instances of Utf8 can be read from and written to instances of GsFile, which cannot directly handle characters with codePoints over 255.

String protocol

Creating Strings

Strings created as literals, that is, in text encased in single quotes, are invariant; they cannot be modified after they are created.

In addition to creating strings as literals, you can use the inherited instance creation methods, such as new: and withAll:. For example:

String withAll: #($a $z $u $r $e).
azure

Concatenating Strings

A string responds to the comma operator by returning a new string in which the argument to the comma has been appended to the string’s original contents. For example:

'String ' , 'con' , 'catenation'
String concatenation

Although this technique is handy, it’s not very efficient; each #, message send creates a new instance of String, so this example creates two new Strings in addition to the three literals, returning the final one.

To build a string efficiently, by appending onto the original object, you can use add:, which modifies the original string. Note that you cannot start with a literal string, since a literal string is invariant.

For example:

| resultString |
resultString := String new. 
resultString add: 'String ';
             add: 'con';
             add: 'catenation'. 
resultString
%
String concatenation

Converting between String classes and encodings

To convert between UTF-8 encoded bytes and the various kinds of string classes, there are a number of methods:

  • Instances of Symbols and Traditional strings can be converted to the lowest-storage type of Unicode string using asUnicodeString.
  • Instances of Symbols and Unicode strings can be converted to the lowest-storage type of Traditional strings using asString.
  • A Traditional string that is composed of raw UTF-8 encoded bytes can be decoded to a Unicode string using decodeFromUTF8ToUnicode, or to another Traditional string with decoded bytes, using decodeFromUTF8ToString.
  • A Traditional string can be encoded into a String containing the raw UTF-8 encoded bytes, using encodeAsUTF8IntoString.
  • To convert from a ByteArray containing UTF-8 or from a Utf8 to a Unicode string, use decodeFromUTF8ToUnicode, or to convert to a Traditional string, use decodeFromUTF8ToString.
  • Instances of ByteArray and Utf8 may be converted to a Traditional string without decoding by using bytesIntoString.
  • All kinds of strings can be encoded to an instance of Utf8 by using encodeAsUTF8.
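
A round trip through several of the conversions listed above might look like the following sketch. It uses only methods named in the list; the comments describe the results expected from those descriptions.

```smalltalk
| original utf8 decoded |
original := 'šamas' asUnicodeString.       "lowest-storage Unicode string class"
utf8 := original encodeAsUTF8.             "a Utf8 holding 197 161 97 109 97 115"
decoded := utf8 decodeFromUTF8ToUnicode.   "decode back to a Unicode string"
decoded = original.                        "true: the round trip preserves the Characters"
utf8 size.                                 "6: the first Character encodes as two bytes"
original size                              "5: size counts Characters, not bytes"
```

Both sides of the equality test are Unicode strings, so the comparison is valid in either comparison mode.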

String Transformations

CharacterCollection and its subclasses define messages that let you perform various conversions.

Strings can be converted in case:

  • asUppercase creates a new instance with all uppercase letters
  • asLowercase creates a new instance with all lowercase letters
  • asTitlecase creates a new instance with the first letter of each word capitalized, the remaining letters lowercase.
  • asFoldcase returns a new instance in “fold case”, which is case-free for comparison, and usually is similar to the lowercase.

For example:

'abcde' asUppercase
ABCDE

You can remove leading and/or trailing whitespace separators using methods such as trimSeparators. There are a number of variants; see the image for details.

For example:

'  abcde  ' trimSeparators 
'abcde'

Strings can be split using the subStrings: method, which allows you to specify one or more characters to use as markers.

For example, to split a string at each / character:

'owa/tagu/siam' subStrings: '/'
anArray( 'owa', 'tagu', 'siam')

Strings can be converted to numbers and other types of objects as well. For example:

'15' asFloat
15.0

Note that not all Strings can be converted to all kinds of other objects; if the String does not contain the representation of a number, for example, it’s meaningless to convert it to an Integer, so this will return an error.

Equality and Identity

Traditional strings are equal to each other if they contain the exact same Characters in the same case; equality is case-sensitive.

Unicode strings compared using = follow the ICU library comparison rules for equality, which are similar, although any non-whitespace control characters (such as null) are ignored for the comparison.

As mentioned above, Traditional strings and Unicode strings cannot be compared to each other for equality using =, when the repository is in Legacy String Comparison Mode. To compare traditional and Unicode strings in any combination, use compareTo:collator:, specifying nil for the collator to indicate the default collator.

Strings can be compared for case-insensitive equality using the methods isEquivalent: or equalsNoCase:.
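
The comparison methods above can be contrasted in a short sketch. The last line assumes that compareTo:collator: answers 0 when the two strings collate as equal; that return convention is an assumption, not something stated in this section.

```smalltalk
'abc' = 'ABC'.                 "false: = is case-sensitive"
'abc' isEquivalent: 'ABC'.     "true: case-insensitive equality"
'abc' equalsNoCase: 'ABC'.     "true"
"Mixed Traditional/Unicode comparison; nil selects the default collator.
 Assumption: 0 means the strings collate as equal."
('abc' compareTo: 'abc' asUnicodeString collator: nil) = 0
```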

Identity in Literal vs. nonliteral

Literal and nonliteral Strings behave differently in identity comparisons. Each nonliteral String (created, for example, with new, withAll:, or asString) has a unique identity. That is, two Strings that are equal are not necessarily identical.

| nonlitString1 nonlitString2 |
nonlitString1 := String withAll: #($a $b $c).
nonlitString2 := String withAll: #($a $b $c).
(nonlitString1 == nonlitString2)
false

However, literal strings that contain the same character sequences and are compiled at the same time are both equal and identical:

| litString1 litString2 |
litString1 := 'abc'.
litString2 := 'abc'.
(litString1 == litString2)
true

This distinction can become significant in building sets. If you add both litString1 and litString2 to the same IdentitySet, the set will contain only one instance of 'abc'; however, an IdentitySet would include both nonlitString1 and nonlitString2.
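
The difference can be seen by adding two equal but non-identical strings to an IdentitySet and to a Set. This is a minimal sketch; the variable names are illustrative.

```smalltalk
| s1 s2 identitySet equalitySet |
s1 := String withAll: #($a $b $c).
s2 := String withAll: #($a $b $c).
identitySet := IdentitySet new.
identitySet add: s1; add: s2.
identitySet size.       "2: the strings are equal but not identical"
equalitySet := Set new.
equalitySet add: s1; add: s2.
equalitySet size        "1: Set membership is based on equality"
```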

Searching and Pattern matching

CharacterCollection and its subclasses define methods that can tell you whether a string contains a particular sequence of characters and, if so, where the sequence begins. This search can be case sensitive, case insensitive, and may include wild cards.

Below are some common methods; see the image for further methods.

Table 5.1 Search and Pattern Match Protocol

Case-sensitive search: includesString: subString
Case-insensitive search: (none)
Description: Return true if the receiver includes subString.

Case-sensitive search: findString: subString startingAt: anIndex
Case-insensitive search: findStringNoCase: subString startingAt: anIndex
Description: Return the index of subString if it exists within the receiver at anIndex or above, otherwise zero (0).

Case-sensitive search: matchPattern: patternArray
Case-insensitive search: (none)
Description: Return true if the receiver matches the specifications in patternArray.

Case-sensitive search: findPattern: patternArray startingAt: anIndex
Case-insensitive search: findPatternNoCase: patternArray startingAt: anIndex
Description: Return the index of a substring in the receiver that matches the specifications in patternArray at anIndex or above, otherwise zero (0).
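
The substring search methods in Table 5.1 can be tried with a few expressions. The results in the comments follow from the descriptions in the table.

```smalltalk
'weimaraner' includesString: 'mara'.                 "true"
'weimaraner' includesString: 'Mara'.                 "false: the search is case-sensitive"
'weimaraner' findString: 'ra' startingAt: 1.         "6"
'weimaraner' findString: 'ra' startingAt: 7.         "0: no match at index 7 or above"
'weimaraner' findStringNoCase: 'RA' startingAt: 1    "6"
```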

Pattern Matching Wild Cards

Pattern matching arguments (patternArray) consist of an Array containing combinations of Strings and the wildcard characters $* and $?. The character $? matches any single character in the receiver, and $* matches any sequence of characters in the receiver.

This is an example of the use of wildcard characters in pattern matching.

'weimaraner' matchPattern: #('w' $* 'r')
true

Since $* is interpreted as “any sequence of characters”, this returns true.

Similarly, the following example returns the index at which a sequence of characters beginning and ending with $r occurs in the receiver.

'weimaraner' findPattern: #('r' $* 'r') startingAt: 1
6

If a wildcard character $* or $? occurs in the receiver or within a string in the argument array, it is interpreted literally.

The following expressions illustrate what happens when the * is within the string and interpreted literally:

'w*r' matchPattern: #('weimaraner')
false
 
'weimaraner' findPattern: #('w*r') startingAt: 1
0
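
The examples in this section use only $*. A comparable sketch for the single-character wildcard $?, with results that follow from the rule that $? matches exactly one character:

```smalltalk
'weimaraner' matchPattern: #('weim' $? 'raner').        "true: $? matches the single a"
'weimaraner' matchPattern: #('weim' $? $? 'aner').      "true: each $? matches one character"
'weimaraner' findPattern: #('m' $? 'r') startingAt: 1   "4: matches mar beginning at index 4"
```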

5.3 String Sorting and Collation

While strings clearly have a natural sort order (collation), the details of that order are complex. Different languages may sort the same set of strings differently, according to the particular rules in that language. Even within one language, different applications may want to order string data differently. To complicate matters, some languages may treat certain sequences of characters as a unit when sorting strings.

Collation depends on the results of a comparison between two strings, which in turn depends on how the Characters within the string are collated. While this simple view breaks down with some sorting requirements and linguistic rules, basic string comparison is adequate for many uses and is faster than the more complete external collation.

Comparison Mode

The Comparison Mode of a repository controls the way comparisons are done between instances of Traditional strings. The modes are:

  • Legacy String Comparison Mode, the default for new applications.
  • Unicode Comparison Mode, enabled in all GsDevKit-based applications.

In Legacy String Comparison Mode, Traditional strings and symbols cannot be compared to Unicode strings without using special protocol. Traditional strings and symbols use character-based collation.

In Unicode Comparison Mode, Traditional strings and Symbols use ICU string-based collation, and can interoperate easily with Unicode strings.

A new repository can be easily switched to Unicode Comparison Mode. Since the collation rules may be subtly different, and affect system operations such as looking up class names in SymbolDictionaries, changing the mode for existing applications should be done with great care and thorough testing. To be safe, all indexes and sorted collections should be rebuilt, and all hashed collections re-hashed. The mode of a repository must be managed as part of System Administration, not by individual developers on a shared repository.

StringConfiguration

The Comparison Mode is controlled by the Global #StringConfiguration. By default, StringConfiguration is set to String, and the repository is therefore in Legacy String Comparison Mode.

To enable Unicode Comparison Mode, as SystemUser, execute:

StringConfiguration enableUnicodeComparisonMode

This returns the previous setting for Unicode Comparison Mode. Note that this operation commits, but the current session is not affected; the new mode will take effect for all subsequent logins.

To enable Legacy String Comparison Mode, as SystemUser, execute:

StringConfiguration disableUnicodeComparisonMode

Again, note that this operation commits, but the change does not affect the current session; the new mode will take effect for all subsequent logins.

To verify the mode in this repository, execute:

StringConfiguration isInUnicodeComparisonMode

Legacy String Comparison Mode for Traditional Strings

Traditional strings (String, DoubleByteString, and QuadByteString) and symbols (Symbol, DoubleByteSymbol, and QuadByteSymbol) are collated, in Legacy String Comparison Mode, by individual character. Characters with code points up to 255 are compared according to the Default Unicode Collation Element Table (DUCET); Characters with code points 256 and above are sorted by codePoint, the Unicode numeric value.

Legacy applications may have installed non-default internal character tables, which modified the character-based collation. This is no longer recommended; if the default character-based collation is not sufficient for your application, you should integrate the ICU string-based collation.

Enabling Unicode Comparison Mode (see Comparison Mode) causes Traditional strings and symbols to collate following the same rules as Unicode strings. This section only applies when in Legacy String Comparison Mode, not in Unicode Comparison Mode.

String ordering using <= (as well as <, >, and >=) is not case-sensitive. When instances of String, DoubleByteString, and QuadByteString are compared using <= or related operations, the comparison first is done case-insensitive. If they are found to be equal other than with respect to case—if the only difference is case—then they are collated according to the Character Data Table, which specifies uppercase comes before lowercase.

For example:

#( 'MM' 'c' 'Mm' 'mb' 'mM' 'x' 'mm' ) 	
	sortAscending
anArray( 'c', 'mb', 'MM', 'Mm', 'mM', 'mm', 'x')

Since ordering is by character, with only case being excluded, the default ordering is sensitive to accents and other diacritical marks on characters. Characters with diacritical marks are not related to the base character.

For example, all words beginning with 'Co' and 'co' would sort before all words beginning with 'Có' and 'có':

#('Cór' 'COz' 'Coa' 'cóa') 
	sortAscending
anArray( 'Coa', 'COz', 'cóa', 'Cór')

Unicode Comparison Mode and ICU Collation

Unicode strings, and all strings when in Unicode Comparison Mode, use the ICU (International Components for Unicode) libraries to provide string-based collation. The ICU libraries are a widely-used, open-source implementation of language-specific sorting and collation.

For a complete explanation of the features and subtleties of language-specific collation, you should refer to documentation on the ICU website, http://icu-project.org/.

The classes IcuLocale and IcuCollator provide an interface to the ICU libraries. Unicode strings (instances of Unicode7, Unicode16, and Unicode32) and instances of Utf8 use IcuCollator and IcuLocale to perform sorting operations using the ICU libraries. The collation is performed by considering the entire string, not on a character-by-character basis, and requires a specific language and locale to determine the rules for the comparison.

In addition to specific language rules, ICU sorting is highly configurable for other application-specific sorting requirements.

While collation will vary according to specific language and locale, in general ICU collation orders characters with diacritical marks with the base character, and sorts lowercase before uppercase.

For example, using the sorting examples in the previous section and the default collator for the US, a different sort ordering is produced from that of legacy collation:

#( 'MM' 'c' 'Mm' 'mb' 'mM' 'x' 'mm' ) 	 
	sortAscending
anArray( 'c', 'mb', 'mm', 'mM', 'Mm', 'MM', 'x')
 
#('Cór' 'COz' 'Coa' 'cóa')
	sortAscending
anArray( 'Coa', 'cóa', 'Cór', 'COz')

This is the default US collation; by configuring the IcuCollator, however, many other orderings may be produced.

IcuLocale

Instances of IcuLocale represent a specific language, country, and language variant. The available IcuLocales are in the shared library and can be listed using IcuLocale class >> availableLocales.

A default instance of IcuLocale is instantiated on first reference, and stored in session state. The default IcuLocale is based on the operating system locale setting for the gem. The default IcuLocale affects collation, so some care should be taken in configuring the operating system locale for the gem processes. In applications with distributed locales, it may be safer to set a default IcuLocale on login, using UserProfile >> loginHook: (see the System Administration Guide).

To set a specific default IcuLocale, use the method IcuLocale class >> default:. This sets the default locale for the session executing this code. While the instance of IcuLocale can be made persistent, the default IcuLocale does not persist from session to session.

To determine what IcuLocale is currently in use, use the method IcuLocale class >> default.

IcuLocale default
IcuLocale en_US

IcuCollator

An IcuCollator encapsulates the rules involved in collation for a specific IcuLocale. A default instance of IcuCollator is instantiated on first reference, based on the default IcuLocale, and stored in session state.

When comparing instances of Unicode string classes, the comparison always uses an IcuCollator, using the method compareTo:collator:. If an IcuCollator is not specified, such as when Unicode strings are compared using >, IcuCollator default is used, which in turn uses IcuLocale default.

You can also create an instance of IcuCollator for a specific locale, if you need to use specific collation rules other than the default. You can do this using IcuCollator class methods forLocale: anIcuLocale or forLocaleNamed: aString. For example, to create an IcuCollator for the German language as used in Germany:

IcuCollator forLocaleNamed: 'de_DE' 

The actual string comparison is done by the ICU libraries, and follows the ICU comparison rules for that locale. Collation rules are similar in most western languages, but there are differences in specific languages.
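
A locale-specific collator can also be passed directly to compareTo:collator:. The sketch below assumes that the method answers a negative value when the receiver collates before the argument; that return convention is an assumption, not something stated in this section.

```smalltalk
| german a b |
german := IcuCollator forLocaleNamed: 'de_DE'.
a := 'Apfel' asUnicodeString.
b := 'Zebra' asUnicodeString.
"Assumption: a negative result means the receiver sorts first."
(a compareTo: b collator: german) < 0
```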

For example, in the Hungarian language, 'cs' is considered a single letter, so words that start with 'cs' are sorted together and follow other words beginning with 'c'. The following example sets up a collection that is sorted according to Hungarian rules:

Example 5.1 Sorting in Hungarian IcuLocale

| hungarianWords collator |
collator := IcuCollator forLocaleNamed: 'hu_HU'.
hungarianWords := IcuSortedCollection newUsingCollator: collator.
hungarianWords 
	add: 'csak' asUnicodeString; 
	add: 'cukor' asUnicodeString; 
	add: 'comb' asUnicodeString.
hungarianWords 
a IcuSortedCollection
  sortBlock           a ExecBlock2
  collator            a IcuCollator
  #1 comb
  #2 cukor
  #3 csak
 

Customizing Sort

IcuCollator includes a number of attributes that can be used to customize the sort. These attributes work within the specific language rules of the associated IcuLocale.

Keep in mind that the default values and descriptions listed in Table 5.2 apply to most locales, but not all. In some locales, particularly those with non-Western scripts, the defaults may differ and an attribute may behave differently.

See the ICU site, particularly the pages under http://userguide.icu-project.org/collation, for more precise descriptions and more detailed documentation.

Table 5.2 IcuCollator Attributes

Attribute name: alternateHandling
Allowed values: true | false
Default: false
Description: When true, allows space and punctuation characters within the string to be ignored.

Attribute name: caseFirst
Allowed values: 'off', 'upperFirst', or 'lowerFirst'
Default: 'off'
Description: When comparing case, determines if upper or lowercase is sorted first. Most locales sort lowercase first when caseFirst is 'off' as well as when it is 'lowerFirst'.

Attribute name: caseLevel
Allowed values: true | false
Default: false
Description: When true, considers case in the comparison, even if the strength would normally not consider case.

Attribute name: frenchCollation
Allowed values: true | false
Default: false
Description: When true, sorts secondary differences (e.g. differences in diacritical marks) in reverse order. This is the collation rule for French.

Attribute name: normalization
Allowed values: true | false
Default: false
Description: Determines whether to normalize input strings. Useful if input data may not be normalized, but impacts performance.

Attribute name: numericCollation
Allowed values: true | false
Default: false
Description: When true, sorts numeric sequences within the string by numerical rather than string comparison; e.g. sorts '100' after '2'.

Attribute name: strength
Allowed values: PRIMARY - 0, SECONDARY - 1, TERTIARY - 2, QUATERNARY - 4, or IDENTICAL - 15
Default: TERTIARY
Description: Determines the level of collation factors to consider, such as diacritical marks and case. See discussion below for more details.

Strength allows degrees of sort, to consider or not consider things like accent characters and case when performing the sort. The default strength is TERTIARY for most locales (the main exception being Japanese). The following are the sort strengths:

  • PRIMARY sorts by primary differences, ignoring secondary and later differences. A difference in base letter is a primary difference, for example 'a' and 'b'.
  • SECONDARY sorts by primary and secondary differences, ignoring tertiary and later differences. An example of a secondary difference is diacritical differences on the same base letter, for example 'o' and 'ó'.
  • TERTIARY sorts by primary, then secondary, then tertiary differences. Uppercase vs. lowercase is a tertiary difference. TERTIARY is the default sort order for most locales.
  • QUATERNARY is used in Japanese, where it distinguishes between Japanese Katakana and Hiragana, and can be used to break ties among separator characters when alternateHandling is true.
  • IDENTICAL sorts by the specific character, by codepoints in the NFD (Normalization Form Canonical Decomposition) form. There is a performance impact with this strength.

The default sort strength is TERTIARY. As an example, when two strings are compared using TERTIARY strength, characters in the strings are compared first by the base character, ignoring any case or diacritical marks. If the base characters are the same, they are compared by diacritical mark, ignoring case. If both base characters and diacritical marks are the same, then case is considered. Note that unlike GemStone’s Strings or ASCII ordering, the default sort places lowercase before uppercase.

Keep in mind that with lower sort strengths, when a factor such as case is not used, the relative position in the results of similar strings is not deterministic; the strings compare as the same, and so their position will depend on the order of the input.

By using the IcuCollator sort attributes, you have a great deal of control over your specific sorting.

For example, using the alternateHandling attribute, you can sort strings that include spaces, dashes and other punctuation without considering the punctuation characters when doing the comparison:

Example 5.2 Sort ignoring punctuation

| blues collator|
collator := IcuCollator forLocale: IcuLocale default.
collator alternateHandling: true.
blues := IcuSortedCollection newUsingCollator: collator.
blues add: (Unicode7 withAll: 'blue berry').
blues add: (Unicode7 withAll: 'blue moon').
blues add: (Unicode7 withAll: 'bluebird').
blues add: (Unicode7 withAll: 'blue bird').
blues add: (Unicode7 withAll: 'blue-bird').
blues add: (Unicode7 withAll: 'bluetooth').
blues
%
a IcuSortedCollection
  sortBlock           a ExecBlock2
  collator            a IcuCollator
  #1 blue berry
  #2 bluebird
  #3 blue bird
  #4 blue-bird
  #5 blue moon
  #6 bluetooth
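
Other attributes from Table 5.2 can be applied in the same way. The following sketch enables numeric collation; it assumes that the setter numericCollation: follows the same pattern as alternateHandling: in Example 5.2, which this section does not confirm.

```smalltalk
| files collator |
collator := IcuCollator forLocale: IcuLocale default.
"Assumed setter, by analogy with alternateHandling:"
collator numericCollation: true.
files := IcuSortedCollection newUsingCollator: collator.
files add: (Unicode7 withAll: 'file100');
      add: (Unicode7 withAll: 'file2');
      add: (Unicode7 withAll: 'file10').
"With numeric collation the digit runs compare as numbers,
 so the expected order is file2, file10, file100."
files
```

Without numeric collation, plain string comparison would be expected to place 'file10' and 'file100' before 'file2'.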

IcuSortedCollection

An IcuSortedCollection is a specialized subclass of SortedCollection for which you do not set the sortBlock. An IcuSortedCollection may only hold instances of subclasses of CharacterCollection. It is associated with an IcuCollator, which in turn is associated with an IcuLocale, and the sorting behavior is specific to the configuration of these instances. IcuSortedCollections rely on the open-source ICU libraries to perform the comparisons and produce correctly collated results.

Using IcuSortedCollection is recommended if you will have sorted collections containing Unicode strings. This avoids lookup failures that can occur when the collator used for a lookup differs from the one used to sort the elements in the collection.

ICU libraries and versioning

ICU and Unicode versioning

The Unicode Consortium periodically releases new versions of the Unicode Standard, with (usually minor) changes in collation and the addition of new characters. The ICU organization then periodically releases new versions of their libraries reflecting these changes in the standard. Major GemStone releases include the latest version of the ICU libraries.

The indexing structures depend on collation encodings from ICU that may change between versions, even if the collation changes would not otherwise affect the application. So even in cases where the Unicode differences are minor, the ICU library version loaded in an application must match the ICU version used to build indexes.

Because upgrading to a new ICU library generally has low value, while rebuilding the structures in your application that depend on collation can be costly, GemStone preserves the existing ICU library version across upgrades.

IcuLibraryVersion

The version of the ICU library that is used in a repository is stored under (Globals at: #IcuLibraryVersion). This is a string, which must correspond to one of the versions of the ICU libraries in the product distribution. When a session logs in, it will select the ICU shared libraries to load based on the IcuLibraryVersion value.
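As a minimal sketch, the version currently recorded for a repository can be inspected by reading that global from any logged-in session. This uses only the Globals at: access described above; the value answered depends on your repository, so none is shown here.

```smalltalk
"Answer the ICU library version string recorded for this repository."
Globals at: #IcuLibraryVersion
```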

As with StringConfiguration, IcuLibraryVersion is a global, repository-wide setting that can only be changed by SystemUser, to avoid the risk of lookup failures and incorrect query results. It should be managed as part of System Administration, not by individual developers on a shared repository.

Updating IcuLibraryVersion

To update the version of the ICU libraries in your repository, follow this procedure:

1. Ensure that no other users are logged in to the system.

2. Log in as SystemUser and execute:

Globals at: #IcuLibraryVersion put: newVersionString

Commit and log out.

3. Shut down and restart the Stone.

4. Log in as DataCurator, or as a user with the appropriate object access rights. If you are using a linked session, you may need to restart the application to allow the new version of the ICU shared library to be loaded.

5. Update any persistent data structures that may be affected. This involves dropping and rebuilding indexes that involve Unicode strings, resorting SortedCollections, and resorting any application data structures that depend on Unicode string collation.

6. When this is complete and all changes have been committed, other users may be allowed to log in.
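For the resorting part of step 5, one possible approach is to build a fresh IcuSortedCollection under the newly loaded ICU library and add the existing elements to it. The following is a sketch only: oldNames is a hypothetical stand-in for a collection in your application, resorted is an illustrative name, and in a real repository the new collection would replace the persistent reference to the old one before committing. Dropping and rebuilding indexes is not shown.

```smalltalk
| collator oldNames resorted |
collator := IcuCollator forLocale: IcuLocale default.

"Hypothetical stand-in for an application collection that was
 sorted under the previous ICU library version."
oldNames := IcuSortedCollection newUsingCollator: collator.
oldNames add: (Unicode7 withAll: 'blue moon').
oldNames add: (Unicode7 withAll: 'bluebird').

"Build a new collection so every element is re-compared using
 the ICU library that is now loaded."
resorted := IcuSortedCollection newUsingCollator: collator.
resorted addAll: oldNames.

"Answer the rebuilt collection; it holds the same elements."
resorted
```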

5.4 Encrypting Strings

There are times when you may wish to encrypt strings in your repository or for transmittal to other systems. GemStone provides an interface to Advanced Encryption Standard (AES) encryption/decryption, provided by the OpenSSL open source libraries included with GemStone.

The AES specification is available at: http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.197.pdf.

All encryptions/decryptions are in cipher block chaining (CBC) mode; see the AES specification document for further details.

Encryption and decryption methods are provided for 128-bit (16-byte), 192-bit (24-byte), and 256-bit (32-byte) keys.

Encryption can be done on instances of ByteArray or Utf8, or subclasses of CharacterCollection. For encryption, you must provide a key that is a ByteArray of the appropriate size (16, 24, or 32 bytes) containing key bytes, and a salt that is a 16-byte ByteArray containing salt values.

The following methods encrypt or decrypt using the specified key and salt, and return the encrypted or decrypted result:

aesEncryptWith128BitKey: aKey salt: aSalt
aesDecryptWith128BitKey: aKey salt: aSalt
 
aesEncryptWith192BitKey: aKey salt: aSalt
aesDecryptWith192BitKey: aKey salt: aSalt
 
aesEncryptWith256BitKey: aKey salt: aSalt
aesDecryptWith256BitKey: aKey salt: aSalt

These methods place the encrypted or decrypted result into aByteObjOrNil, starting at offset 1, and resizing if necessary. If aByteObjOrNil is nil, a new instance of the same class as the receiver will be created containing the results.

aesEncryptWith128BitKey: aKey salt: aSalt into: aByteObjOrNil
aesDecryptWith128BitKey: aKey salt: aSalt into: aByteObjOrNil
 
aesEncryptWith192BitKey: aKey salt: aSalt into: aByteObjOrNil
aesDecryptWith192BitKey: aKey salt: aSalt into: aByteObjOrNil
 
aesEncryptWith256BitKey: aKey salt: aSalt into: aByteObjOrNil
aesDecryptWith256BitKey: aKey salt: aSalt into: aByteObjOrNil

You may use ByteArray withRandomBytes: N to produce pseudo-random key and salt values for encryption. For example:

Example 5.3 String Encryption

topaz 1> run
| key salt encrypted |
key  := ByteArray withRandomBytes: 32.
salt := ByteArray withRandomBytes: 16.
encrypted := 'My secret string' aesEncryptWith256BitKey: key
	salt: salt.
encrypted aesDecryptWith256BitKey: key salt: salt.
%
My secret string
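The into: variants listed earlier can be exercised in the same way. The following is a minimal sketch, assuming a 128-bit key and caller-supplied result objects. The temporary names cipher and plain are illustrative, and the sketch relies only on the behavior described above, in which the result is placed into the argument starting at offset 1 and the argument is resized if necessary.

```smalltalk
| key salt cipher plain |
key  := ByteArray withRandomBytes: 16.   "128-bit key"
salt := ByteArray withRandomBytes: 16.   "salt is always 16 bytes"

"Encrypt into a caller-supplied ByteArray, which is resized as needed."
cipher := ByteArray new.
'My secret string' aesEncryptWith128BitKey: key salt: salt into: cipher.

"Decrypt into a caller-supplied String."
plain := String new.
cipher aesDecryptWith128BitKey: key salt: salt into: plain.

"Compare the round-tripped text with the original."
plain = 'My secret string'
```

The same key and salt must be used for decryption as for encryption, so both need to be stored or transmitted securely alongside the encrypted data.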
 
 
