7. Indexes and Querying

This chapter describes GemStone Smalltalk’s indexing and querying mechanism, a system for efficiently retrieving elements of large collections.

Overview
Reviews the concept of relations.

Defining Queries
Describes the structure of query predicates, the types of queries, and how to construct a query.

Creating Indexes
Discusses GemStone Smalltalk’s facilities for creating indexes on collections.

Results of Executing a GsQuery
How to execute a query and the options for working with the results.

Enumerated and Set-valued Indexes
Describes how to create enumerated and collection-valued indexes and queries.

Managing Indexes
How to perform index management: find out about indexes in your system, remove existing indexes, handle errors, and audit indexes.

Indexing and Performance
Additional factors that can impact the performance of your queries.

Historic Indexing API differences
The older indexing API, using UnorderedCollection methods and select blocks.

7.1 Overview

Most applications use one or more databases containing business data, which may be very large. Individual records in these databases may be added, removed, and/or updated, and need to be queried in multiple ways for different purposes. All these operations must be performed quickly and efficiently.

Business Objects

In GemStone, a database is represented as an instance of a collection that holds instances of business objects. You may have thousands or millions of objects in a collection, and these objects may be complex composite objects holding many individual strings, dates, numbers, and other basic data types.

The following example shows simple employee data in table form:

Table 7.1 Employees

First Name   Job          Age   Address
Fred         clerk        40    22313 Main, Dexter, OR
Sophie       bus driver   24    540 E. Sixth, Renton, WA
Conan        librarian    40    999 Walnut, Hilt, CA
Moppet       intern       18    17 SW Oak #6, Portland, OR

In Smalltalk, this can be represented as an Employee class, with instance variables firstName, job, age, and address; and an Address class, with street, city, and state instance variables.
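As a minimal sketch, the two classes might be defined as follows in a Topaz session. Placing the classes in UserGlobals and generating accessors with compileAccessingMethodsFor: are assumptions made for illustration; your application may organize its classes and accessors differently.

```smalltalk
run
"Define Address, with street, city, and state instance variables."
Object subclass: 'Address'
	instVarNames: #('street' 'city' 'state')
	classVars: #()
	classInstVars: #()
	poolDictionaries: #()
	inDictionary: UserGlobals
	options: #().

"Define Employee, with one instance variable per column of Table 7.1."
Object subclass: 'Employee'
	instVarNames: #('firstName' 'job' 'age' 'address')
	classVars: #()
	classInstVars: #()
	poolDictionaries: #()
	inDictionary: UserGlobals
	options: #().
%
run
"Generate simple getter and setter methods for every instance variable."
Address compileAccessingMethodsFor: Address instVarNames.
Employee compileAccessingMethodsFor: Employee instVarNames.
%
```

The accessors are only needed by iterative code such as select: blocks; indexes and queries work from the instance variable names themselves.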

Database Collection

The collection itself may be an instance of a number of different types of Collection subclasses. For scaling, and to support indexes, a subclass of UnorderedCollection is recommended. Hashed collections such as dictionaries may become unbalanced if too many elements hash to the same value, and as a collection grows, may require the entire collection to be rebuilt. Indexed collections such as Array have limitations on adding and removing elements without affecting the entire collection. UnorderedCollections, particularly IdentityBag, IdentitySet, RcIdentityBag, and RcIdentitySet, use an optimized internal tree structure to hold the elements and are the recommended Collection classes for large databases. Collection classes are described in Chapter 4.

To make it easy to associate behavior with your set of Employees, it is often useful to define a class SetOfEmployees that is a subclass of IdentitySet. An instance of SetOfEmployees then can contain instances of Employee, with a reference from UserGlobals or from a class variable.
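The following sketch shows one way this might look in Topaz. It assumes that Employee and Address classes with the instance variables described above already exist and have setter methods named after those variables; the UserGlobals key #myEmployees is likewise only an illustrative choice.

```smalltalk
run
"Define the collection class; behavior for the whole set can be added to it later."
IdentitySet subclass: 'SetOfEmployees'
	instVarNames: #()
	classVars: #()
	classInstVars: #()
	poolDictionaries: #()
	inDictionary: UserGlobals
	options: #().
%
run
"Create the collection, make it reachable from UserGlobals, and add one employee."
| fred |
UserGlobals at: #myEmployees put: SetOfEmployees new.
fred := Employee new
	firstName: 'Fred';
	job: 'clerk';
	age: 40;
	address: (Address new
		street: '22313 Main';
		city: 'Dexter';
		state: 'OR';
		yourself);
	yourself.
(UserGlobals at: #myEmployees) add: fred.
System commitTransaction.
%
```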

Queries

Since UnorderedCollections aren’t ordered, lookup is by value. To find a particular Employee, you use select:, detect:, or similar messages. For example:

MyEmployees select: [:ea | ea address state = 'OR']
MyEmployees detect: [:ea | ea firstName = 'Sophie'] 

These iterative messages may not scale well. For example, for the above select: expression, for each employee in the collection, the employee object and the address object must be faulted into memory, and the messages address, state, and = are sent. While this doesn’t matter for small collections, it can become unreasonably slow for very large collections; particularly if objects in the collection are not in the shared page cache, and need to be read from disk.

GemStone Indexes and Queries

Indexes

Indexes and indexed queries provide a way to locate specific objects in a collection by value. Indexes are created on specific named instance variables, either by identity or by equality. Creating an index on a collection (for example, on the instance variable firstName) creates parallel internal structures that provide a mapping from the indexed value (such as the firstName 'Sophie') to the root object in the collection (the employee). Using this index, only a few message sends are needed to look up the collection elements that are the same as, or less than or greater than, a particular value.

Identity indexes support queries that are looking for identical values, while equality indexes support queries that compare using equality, or greater or less than, a particular value.

Indexes are created on objects based on instance variables, not on message sends. Since the instance variable relationships are known by the system, indexes can be updated automatically as elements are added to and removed from the collection, and when references on the path are changed. There are some exceptions to this, which require manually updating the indexes.

Indexes may only be created for instances of subclasses of UnorderedCollection.

GsQueries

To take advantage of an index you have built on your collection, you must perform the query using GsQuery syntax, rather than select: or similar iteration methods. A query performed using GsQuery will use indexes, as long as an index exists for the particular instance variable involved in the query. If an index does not exist, then the GsQuery will be performed iteratively, with performance similar to the comparable select: or detect: operation.

When the collection is properly indexed, GsQueries can return results without having to iterate the collection, fault the intermediate objects into memory, or send messages to each object.

GsQueries can be used on most kinds of Collection, not only UnorderedCollection. However, the performance benefit only appears on instances of subclasses of UnorderedCollection for which the appropriate index or indexes exist.
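To make the contrast concrete, the sketch below runs the same selection both ways. It assumes myEmployees holds Employee objects that respond to age and that every age is a non-nil number; the fromString:on: form is the one introduced under Creating a GsQuery.

```smalltalk
| iterative queried |
"Iterative form: sends age and < to every element of the collection."
iterative := myEmployees select: [:ea | ea age < 21].

"GsQuery form: uses an equality index on each.age if one exists,
 and otherwise iterates the collection much as select: does."
queried := (GsQuery fromString: 'each.age < 21' on: myEmployees) queryResult.

"Both forms select the same elements, so the sizes match."
iterative size = queried size
```

Only the second form can benefit from an index; the first always visits every element.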

Deciding what to optimize

As with any kind of optimization, it’s important to consider the application’s performance profile, performance requirements, and the entire context, rather than automatically creating indexes on all possible paths.

The process of creating indexes creates overhead. The additional internal objects created use some space, and building an index may take some time. As the data in the repository changes, including objects added to and removed from the collection itself as well as changes in actual values, the mappings in the index structures need to be updated. Periodically, indexes should be audited to ensure integrity, and rebuilt if necessary; rebuilds are required for some system upgrades. Indexes must be specifically removed when the collection is removed, to ensure the internal infrastructure is cleaned up.

While most collections with more than a few thousand objects will see better performance using indexed queries, it is wise to consider indexes with this overhead in mind. Before going through the trouble of creating an index, you should determine that the index provides value. There are a number of factors that strongly influence queries, both iterative queries and indexed queries. These factors interact with each other and there are other factors, such as caching, that also influence performance.

  • The size of the collection. With smaller collections, iterative performance is fast enough that indexing provides little benefit. Iterative query time grows linearly with collection size, while indexed query time increases slowly.
  • The length of the path. Longer paths require more lookups and more infrastructure, and take longer to complete. For longer paths, it is more efficient to cache the value higher within the object structure.
  • The size of the result set. If you have a query that returns a very large number of results, creating the result set reduces performance; this is particularly so for indexed queries.

Overview of the steps in creating and using indexed queries

In order to take advantage of efficient indexed queries on your collection, the following steps need to be done:

1. Determine the queries that can benefit from optimization, and describe them using query syntax. Query syntax is described in Defining Queries.

For example, to query for employees under 21 who live in Oregon, the query string might be:

(each.age < 21) & (each.address.state = 'OR')

2. Create one or more indexes on the collection that specify the particular instance variable paths on which you will perform the query. Creating indexes is described in Creating Indexes.

To support the above query, you may want to create two indexes, for example:

GsIndexSpec new
   equalityIndex: 'each.age' lastElementClass: SmallInteger;
   equalityIndex: 'each.address.state' lastElementClass: String;
   createIndexesOn: myEmployees.

3. Execute the query on that indexed collection, using query protocol. How to define and execute queries is described starting here.

For example:

(GsQuery fromString: '(each.age < 21) & (each.address.state = ''OR'')')
   on: myEmployees;
   queryResult

Managing Indexes

In addition to creating indexes and queries, you will also need to do some management on your indexes and queries. For example, you should evaluate your indexes for performance, remove indexes that are no longer needed, and audit indexes to ensure the structures are correct. Many of these indexing tasks are handled by IndexManager.

Special Syntax for Indexing

GemStone indexing uses several syntactical elements that are either specific to, or primarily used for, index creation and indexed queries.

Path-dot syntax

Indexes are created, and queries formed, using a special syntactic structure called a path, which designates variables for indexing and describes certain features of the index. Path syntax uses a period to represent the object/instance variable name relationship.

For example, given a collection of Employees, in which each employee has an address instance variable, which refers to an Address that has a state instance variable, the path is:

address.state

A longer path is

account.order.address.state

In the simplest case, where the indexed instance variable belongs to the collection elements themselves, the path is just the instance variable name. For example:

firstName

You may also specify an empty path, meaning the elements of the root collection itself.

Each instance variable name on the path is a pathTerm. In the path address.state, address and state are each pathTerms. Paths can contain a long string of pathTerms, if the elements of the collection represent a deeply nested tree of objects.

Path-dot syntax can be used anywhere in GemStone code; it is required in index creation and queries, for which message sends are not allowed.

Initial each

An initial 'each.', where each represents the elements of the collection, is recommended but optional for GsIndexSpec index creation, and required for GsQueries. For example:

each.address.state

Enumerated pathTerms

A vertical bar | in the path indicates the presence of two alternate instance variables that will be indexed together, as if they were a single variable.

For example, you might want to search on both name and nickname in a single operation. This might look like this:

account.name|nickname

Set-value path terms

An asterisk * in the path indicates a collection, which must be an instance of an indexable class (an instance of a subclass of UnorderedCollection). A set-valued path term may not be the first term in the path.

For example, if the instance variable children contains an IdentityBag of instances of Child, and a child has the instance variable age:

children.*.age
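As a hedged illustration, paths of these kinds are passed as strings to the index creation messages in the same way as simple paths. The children and nickname instance variables used below are hypothetical (the Employee class described in this chapter does not define them), and the exact rules for such indexes are covered in Enumerated and Set-valued Indexes.

```smalltalk
"Hypothetical instance variables: children (an IdentityBag of Child
 objects, each with an age) and nickname are assumed for this sketch only."
GsIndexSpec new
	"Set-valued path: index the age of every child of each employee."
	equalityIndex: 'each.children.*.age' lastElementClass: SmallInteger;
	"Enumerated path: index firstName and nickname together."
	equalityIndex: 'each.firstName|nickname' lastElementClass: String;
	createIndexesOn: myEmployees.
```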

Historic indexing syntax

The GsIndexSpec/GsQuery classes provide the general purpose indexing interface. An older syntax using UnorderedCollection methods to create indexes, and selection blocks with curly braces to define queries, is an alternate way to use indexes. This older syntax remains fully supported in order to ensure upgraded applications do not require changes. However, new features are not available using this historic API.

See Historic Indexing API differences for information specific to the historic API.

Last Element Class

Creating an equality index creates an internal btree that contains the ordered values of the instance variable that is indexed. For example, an index on firstName creates a btree containing 'Conan', 'Fred', and so on. This allows fast lookup of a position in this btree when performing the query, and values that are equal or greater or less than can be returned in order as needed.

Building this btree and providing predictable lookup requires that the values be comparable in well-known and efficient ways. When building indexes, there are choices to make in balancing how restrictive the index is about the indexed values against the impact of comparison on query performance.

Performing an identity query creates no such restrictions on the index, since the comparison is by identity (OOP), and any two objects can be compared this way.

To provide the definition of comparison, equality indexes require specifying the lastElementClass. This generally restricts the indexed values to instances of this class or of its subclasses, although string classes have some special handling.

Optimized classes

The following classes, and subclasses of these classes, are optimized for indexes. In most cases, the final element you will create an index on will be one of the following. For legacy indexes, the index structures encode the value; for btreePlusIndexes, they can perform optimized comparisons. These classes are subclasses of Magnitude or CharacterCollection.

Character, SmallInteger, SmallDouble, SmallFraction,
String, DoubleByteString, QuadByteString,
Unicode7, Unicode16, Unicode32,
Symbol, DoubleByteSymbol, QuadByteSymbol,
Time, Date, DateTime, DateAndTime,
LargeInteger, Float, DecimalFloat, ScaledDecimal, FixedPoint, Fraction

Boolean is a special case; it is a Special, and so does not require a lookup in legacy indexes. However, it does not support optimizedComparison.

Using other classes

You can create indexes where the indexed values are instances of classes other than the above, including classes you have defined yourself.

Identity indexes on instances of your own classes require no extra work, since they compare on the identity of the objects.

If you wish to create an equality index where the values are instances of application classes that are not subclasses of the optimized classes listed above, you must ensure these classes implement comparison operators, as described in Redefining Comparison Messages.

Comparing data types

Some cases of data type comparison have special handling in indexes.

  • It may be useful to mix strings and symbols, but there is additional cost. While a string and a symbol can be ordered using <=, a string and a symbol that contain the same characters are not equal. There are two solutions: using alternate comparison methods which reduce performance; or optimizing the comparison operators and not mixing symbols and strings.
  • NaNs (not a number) are specialized kinds of Float that are not equal to themselves. As with strings, special handling is required to accommodate NaNs, at the cost of performance; or NaNs may be disallowed in Float indexes.
  • The indexed comparison mechanism considers only the first 900 characters of each string operand, so two strings that differ only beginning at the 901st character are considered equal.
  • nil is a special case of object that can be compared to any other object. It also requires special handling in indexes. Since the appearance of nil signifies a value that is not there, less than and greater than comparison results will not include nil values. Since accommodating nil requires special protocol, nil may also be disallowed.

A nil along the path to an indexed slot is a different issue; such missing sections of a reference tree are allowed without special handling.

Strings in indexes

Indexing on strings has complications, due to the different collation orders it is possible to configure. For more on collation, see Chapter 5.

To summarize, strings come in two "flavors":

  • Traditional strings (String, DoubleByteString and QuadByteString, which are interchangeable based on the maximum Character codePoint size). Traditional strings, in Legacy String Comparison Mode, use character-based collation.

Symbols (Symbol, DoubleByteSymbol and QuadByteSymbol) follow the same collation rules as Traditional strings.

  • Unicode strings (Unicode7, Unicode16, and Unicode32) always use ICU string-based collation.

A repository in Legacy String Comparison Mode disallows comparison between Unicode strings and Traditional strings or symbols, to avoid unpredictable results. In this mode, you cannot mix Traditional and Unicode strings; it is difficult to avoid errors when using Unicode strings in Legacy String Comparison Mode.

A repository in Unicode Comparison Mode uses Unicode collation for all flavors of strings and symbols. In this mode, you can use Traditional strings and Unicode strings interchangeably.

Constraining the indexed variables using lastElementClass is not effective for strings, since Traditional string, symbol and Unicode string classes inherit by codePoint range rather than by collation or other behavior. It is allowed, but not recommended, to specify CharacterCollection (the superclass of all kinds of Strings and Symbols), since (depending on the mode and index type) it may create ambiguous indexes.

In both Comparison Modes, specifying a lastElementClass of any of the following will create an index that includes a cached collator:

Unicode7, Unicode16, Unicode32

In Legacy String Comparison Mode, the lastElementClass of any of the following will permit instances of any of these classes:

String, DoubleByteString, QuadByteString,
Symbol, DoubleByteSymbol, QuadByteSymbol

In Unicode Comparison Mode, the lastElementClass of any of the following will permit instances of any of these classes:

String, DoubleByteString, QuadByteString,
Symbol, DoubleByteSymbol, QuadByteSymbol,
Unicode7, Unicode16, Unicode32

Note that some optimized indexes disallow mixing Symbols with any kinds of Strings.

Redefining Comparison Messages

If you create an index on values that are instances of your application classes, these classes must implement the basic comparison operators, at least =, >, <, and <=. You can redefine one or more of these in terms of another.

The operators must be defined to conform to the following rules:

  • If a < b and b < c, then a < c.
  • Exactly one of these is true: a < b, or b < a, or a = b.
  • a <= b if a < b or a = b.
  • If a = b, then b = a.
  • If a < b, then b > a.
  • If a >= b, then b <= a.

While the indexing subsystem does not use hashing itself, note that redefining = requires the hash method to be made consistent with the new definition of equality. Objects that are equal must return the same hash value to ensure they behave in a consistent and logical manner in all use cases.
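The rules above can be illustrated with a small application class whose ordering is delegated to a single SmallInteger. The class PayGrade and its level instance variable are hypothetical names invented for this sketch, written as Topaz input; it is one possible way to satisfy the rules, not a required pattern.

```smalltalk
run
Object subclass: 'PayGrade'
	instVarNames: #('level')
	classVars: #()
	classInstVars: #()
	poolDictionaries: #()
	inDictionary: UserGlobals
	options: #().
%
method: PayGrade
level
	^ level
%
method: PayGrade
level: anInteger
	level := anInteger
%
method: PayGrade
= other
	"Equal only to another PayGrade with the same level."
	^ (other isKindOf: PayGrade) and: [level = other level]
%
method: PayGrade
hash
	"Objects that are equal must answer the same hash value."
	^ level hash
%
method: PayGrade
< other
	^ level < other level
%
method: PayGrade
<= other
	"Defined in terms of < and =, as the rules require."
	^ (self < other) or: [self = other]
%
method: PayGrade
> other
	"a > b exactly when b < a."
	^ other < self
%
method: PayGrade
>= other
	"a >= b exactly when b <= a."
	^ other <= self
%
```

Because every operator reduces to integer comparison of level, transitivity and the exactly-one-of rule follow from the behavior of SmallInteger. An equality index on a path ending in such objects would then name PayGrade as its lastElementClass.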

7.2 Defining Queries

Before you can define indexes on your collection, you need to determine the ways in which you will need to search your collection to retrieve elements. The queries you need determine the details of the indexes to create.

At its simplest, a query consists of the specification of an instance variable common to all the objects in the collection, a comparison operator, and a literal to which the value is compared. For example, if you wish to be able to find all employees 21 and older, your query formula could be something like this:

each.age >= 21

In this example, every object in the collection (each) has an instance variable age, which is specified using dot-path notation. The value of that instance variable is compared, greater than or equal, to the literal SmallInteger 21.

While this formula is simple, you can formulate queries based on multiple instance variable values, operators, and constants, and combine them using boolean logic. However, using this query syntax, you cannot include message sends; the indexes are based on structural relationships using instance variable names.

For performance and clarity, it is an advantage to use short and simple queries. However, it may be valuable to compose your queries based on the statement of business logic. This may mean creating a complicated query that is not in its most efficient form. The final query will be automatically optimized to a logically equivalent form that is more efficient for GemStone to execute. See Formulating queries and performance.

Query Predicate Syntax

A query contains a predicate expression, which is a Boolean expression that, when evaluated with the elements of the collection, returns true or false. In a query, the expression usually compares an instance variable on the collection objects with another instance variable or with a constant.

A predicate contains one or more predicate terms—the expressions that specify comparisons.

Predicate Terms

A term is a Boolean expression containing an operand and usually a comparison operator followed by another operand. For example, in

each.age >= 18

each.age and 18 are operands, while >= is a comparison operator. The only time you would not have a comparison operator is if the operand is itself a Boolean (true or false).

Predicate Operands

An operand can be a path (each.age, in this case), a variable name, or a literal (18, in this example). All GemStone Smalltalk literals except arrays are acceptable as operands.

Predicate Operators

Predicate operators are ==, ~~, =, ~=, <, <=, > and >=. No other operators are permitted in a GsQuery or selection block query.

Combining Predicates using Boolean Logic

If you want retrieval of an element to be contingent on the values of two or more of its instance variables, you can join several terms using a conjunction operator & (logical AND) or disjunction operator | (logical OR).

The conjunction operator, &, makes the predicate true if and only if the terms it connects are true. The disjunction operator, |, makes the predicate true if either one, or both, of the terms it connects are true.

You may also negate individual predicate terms using not.

Each predicate term must be parenthesized.

For example, the following are legal queries.

(each.name = 'Conan') & (each.job = 'librarian')
(each.age <= 40) | (each.job = 'librarian') not
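As a sketch of how such compound predicates are executed, the following passes them to GsQuery as strings, with embedded string literals written using doubled single quotes. It assumes myEmployees holds the Employee objects described in the Overview, so the path each.firstName is used in place of each.name.

```smalltalk
| both either |
"Conjunction: both terms must be true for an element to be returned."
both := (GsQuery
	fromString: '(each.firstName = ''Conan'') & (each.job = ''librarian'')'
	on: myEmployees) queryResult.

"Disjunction in which the second term is negated by not."
either := (GsQuery
	fromString: '(each.age <= 40) | (each.job = ''librarian'') not'
	on: myEmployees) queryResult.
```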

Combining Range Predicates

Queries that use less than or greater than, such as each.age >= 18, define a starting (or ending) point in a range query. Specifying both a starting point and ending point creates a range query. For example,

(18 <= each.age) & (each.age <= 65)

These two terms can be combined into a single range predicate.

18 <= each.age <= 65

Range specifications such as this can only be defined with this syntax if the operands and comparison operators truly define a range.
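A brief sketch, assuming myEmployees holds objects with an age instance variable, shows both spellings of the same range executed as queries:

```smalltalk
| twoTerms oneRange |
"The range written as two parenthesized terms joined by &."
twoTerms := (GsQuery
	fromString: '(18 <= each.age) & (each.age <= 65)'
	on: myEmployees) queryResult.

"The same range written as a single range predicate."
oneRange := (GsQuery
	fromString: '18 <= each.age <= 65'
	on: myEmployees) queryResult.

"The two forms are logically equivalent, so the result sizes match."
twoTerms size = oneRange size
```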

Creating a GsQuery

GsQuery is a programmatic way to define a query, allowing you to easily abstract, store and reuse various aspects of the query.

To create a GsQuery, you create an instance of GsQuery using query predicate syntax. The simplest way to create a GsQuery is by passing in a string. For example:

GsQuery fromString: 'each.age >= 18' 

Since the fromString: protocol requires a string, if the query includes literal strings, you must include two single quotes within the string. For example:

GsQuery fromString: 'each.firstName = ''Fred'''.

This message will return an instance of GsQuery. Before it can be executed, it must be bound to a collection:

  • Creating the GsQuery using fromString:on: produces a GsQuery that is already bound to a particular collection.
  • Alternatively, bind the query before executing it, using the on: method.
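The two ways of binding can be sketched side by side, assuming myEmployees is a collection whose elements have an age instance variable:

```smalltalk
| boundAtCreation boundLater |
"Bound to the collection when the query is created."
boundAtCreation := GsQuery fromString: 'each.age >= 18' on: myEmployees.

"Created unbound, then bound with on: before execution."
boundLater := GsQuery fromString: 'each.age >= 18'.
boundLater on: myEmployees.

"Either way, the query is executed with queryResult."
boundAtCreation queryResult size = boundLater queryResult size
```

Creating the query unbound is convenient when the same predicate will be applied to more than one collection.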

Query Variables

The strings used to define GsQuery instances may contain variables: any element of a predicate that is not a literal or a path-dot expression. This allows your query to be stored and executed later using different values.

For example, consider a query such as:

GsQuery fromString: '18 <= each.age <= 65'

This can be generalized to a query with variables:

GsQuery fromString: 'min <= each.age <= max'.

The resulting formula in the GsQuery includes 'min' and 'max' as variables. These must be bound to specific values before the query can be executed. Binding is done by sending the bind:to: message to the query. For the above example, to execute the query:

aQuery := GsQuery fromString: 'min <= each.age <= max'. 
aQuery 
	bind: 'min' to: 18; 
	bind: 'max' to: 65; 
	on: myEmployees; 
	queryResult

Note that the “max” and “min” in the query formula are string elements, and are not affected by any temporary or instance variables named max or min in the scope of the code being executed. The only way to resolve max and min is by binding variables.
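One way to reuse a query formula with different values, sketched here under the assumption that myEmployees holds objects with an age instance variable, is to keep the formula string and build a fresh query for each pair of bindings. The block name employeesAgedBetween is invented for this illustration.

```smalltalk
| template employeesAgedBetween workingAge teens |
template := 'min <= each.age <= max'.

"Build, bind, and execute a new query for each pair of bounds.
 The cascade sends every message to the new GsQuery and answers
 the value of the last one, queryResult."
employeesAgedBetween := [:low :high |
	(GsQuery fromString: template)
		bind: 'min' to: low;
		bind: 'max' to: high;
		on: myEmployees;
		queryResult].

workingAge := employeesAgedBetween value: 18 value: 65.
teens := employeesAgedBetween value: 13 value: 19.
```

Building a new query for each call avoids any question of whether an already-bound variable can be bound again.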

7.3 Creating Indexes

Queries can be executed without an associated index, but there is no performance benefit. To execute a query efficiently, you need to also create an index on the instance variables for the query. These indexes provide a mapping from the specific key values that you are interested in to the results (the objects in the collection).

The path you provide when creating an index provides the key that is needed to look up the value during a query. These keys are the values of specific instance variables within the elements of a collection, or the elements of the collection itself. For example, given a collection of Employees, and the path each.address.state, the objects at the state instance variable (perhaps two-character Strings) would be the keys.

The values for these keys are the objects in the collection itself, which are the results of the query using that index. For our example, the values are the instances of Employee in myEmployees. When you make an indexed query for Employees with addresses in a given state, that state key is used to look up the matching elements (instances of Employee).

Equality and Identity Indexes

Indexes fall into two main types: Equality Indexes and Identity Indexes. Equality indexes support equality-based queries, including >, >=, <, <=, =, and ~=. Identity indexes support queries containing identity comparisons, == and ~~.

When creating an index, you specify whether an equality or identity index is created. Since identity comparisons are done by OOP, not by the object’s contents, they are faster, and the lastElementClass does not matter; any two objects can be compared for identity.

If you only have an identity index on a variable, but form your query using an equality operator, the query will not have an index to use (and thus, will iterate the collection).

You may create both equality and identity indexes on the same path.
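The pairing of operator and index type can be sketched as follows. The example assumes myEmployees holds the Employee objects from the Overview, has no indexes yet, and contains an employee named Sophie; the query variable name wanted is arbitrary.

```smalltalk
| target sameAddress adults |
GsIndexSpec new
	identityIndex: 'each.address';
	equalityIndex: 'each.age' lastElementClass: SmallInteger;
	createIndexesOn: myEmployees.

"Identity query (==): can use the identity index on each.address."
target := (myEmployees detect: [:ea | ea firstName = 'Sophie']) address.
sameAddress := (GsQuery fromString: 'each.address == wanted')
	bind: 'wanted' to: target;
	on: myEmployees;
	queryResult.

"Equality query (>=): can use the equality index on each.age."
adults := (GsQuery fromString: 'each.age >= 21' on: myEmployees) queryResult.
```

A query such as each.address = wanted would still execute, but with only the identity index available on that path it would iterate the collection.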

Btree and Legacy Indexes

GemStone supports two different internal structures: the legacy structures, which include a btree and an index dictionary, and the btreePlus structures, which use a btree+ and do not require the dictionary. The query results are the same for each, of course, but the performance profile is different.

The decision of which to use impacts your indexing work.

  • The best query performance is with btreePlusIndexes with optimizedComparison. However, optimizedComparison places restrictions on lastElementClass data types, such that, for example, Strings and Symbols cannot be mixed, and nils and NaN floats may not be present.
  • If your data does not conform to the data type restrictions, using legacy indexes is recommended.

With a legacy identity index, the index dictionary provides an identity-based lookup for the key. In a btreePlus identity index, the keys are in a btree. As a result, you can stream over the results of an identity query only when using a btreePlus index.

The index structure you use can be specified for each index; otherwise, the system or configured default is used. Since structures are shared between indexes on a collection, all indexes on a specific collection must use the same internal structure.

Note this is entirely distinct from the historic indexing API (using UnorderedCollection methods to create indexes); creating indexes using the historic API may create either kind of internal structure, depending on the current default.

See here for details on how to configure each index type.

Creating the Index

Creating an index involves creating an instance of GsIndexSpec, sending messages to define the index and the parameters and options for that index, and then using this spec to create indexes on a specific collection.

Before creating an index, you must know:

  • The paths for the instance variables that you will query on.
  • The classes of the values of these instance variables, and whether these instances are homogeneous.
  • Whether your queries will be by equality or identity.

To create an index using GsIndexSpec, do the following:

Create the instance of GsIndexSpec

This is done by executing GsIndexSpec new

Define one or more indexes on the spec

To define an index, send an index creation message to the GsIndexSpec, including the path you want indexed, the class of the last element (for equality indexes), and options (if used).

The most general index creation methods include:

equalityIndex:lastElementClass: 
identityIndex:

While these methods can be used to create indexes on strings, there are additional index creation methods that are specific to various kinds of string indexes. These methods have variants that allow you to specify the index options.

Create the index on a specific collection

To actually create the index, send the message createIndexesOn:, providing the specific collection on which you want to create the indexes.

To put this all together, for example:

GsIndexSpec new
	identityIndex: 'each.userId';
	equalityIndex: 'each.age' lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state' lastElementClass: String;
	createIndexesOn: myEmployees.

This creates an identity index on userId, an equality index on age, and another equality index on address.state, all on the collection myEmployees.

You can view the indexes by recreating the specification from the indexed collection, using indexSpec. For example:

run
myEmployees indexSpec printString
%
GsIndexSpec new
	identityIndex: 'each.userId';
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state'
		lastElementClass: String;
	yourself.

Equality Indexes on strings

Equality indexes on strings present a variety of options and restrictions, depending on:

  • If the indexed elements will be Traditional strings, Unicode strings, Symbols, or a mix.
  • If you are using the GsIndexOptions optimizedComparison feature, which is strongly recommended with btreePlus indexes and disallowed with legacy indexes.
  • If the application is in Unicode or Legacy String Comparison Mode.

The following methods can be used to create equality indexes on strings and/or symbols. Note that each has variants that allow you to specify the index options.

equalityIndex:lastElementClass: 
unicodeIndex:  
unicodeIndex:collator: 
stringOptimizedIndex:  
symbolOptimizedIndex:  
symbolOptimizedIndex:collator:  
unicodeStringOptimizedIndex:  
unicodeStringOptimizedIndex:collator:  

Which one you should use, and the rules allowing comparisons between different kinds of data, are different for repositories in Legacy String Comparison Mode or in Unicode Comparison Mode.

Comparison Modes are described here.

Repositories in Legacy String Comparison mode

In Legacy String Comparison mode, it is disallowed to compare Traditional and Unicode strings, so it’s not possible for the indexed variables to contain a mix of Unicode strings and Traditional strings or Symbols.

Legacy indexes

To create a legacy index on Traditional strings, symbols, or a mix of the two, use an equalityIndex:* method specifying a lastElementClass of String.

If you are using Unicode strings in Legacy String Comparison Mode, use a unicodeIndex:* method.

optimizedComparison (btreePlus) index

You cannot create an optimizedComparison index on a mix of types.

If your indexed elements are all Traditional strings,
use a stringOptimizedIndex:* method.

If your indexed elements are all Unicode strings,
use a unicodeStringOptimizedIndex:* method.

If your indexed elements are all Symbols,
use a symbolOptimizedIndex:* method.

Repositories in Unicode Comparison Mode

In Unicode Comparison Mode, Traditional strings are collated exactly like Unicode strings, and indexes make no distinction between them.

Symbols are also collated like Unicode strings, but due to the definition of equality, optimizedComparison indexes do make a distinction between strings and symbols.

Legacy indexes

To create a legacy index in Unicode Comparison Mode on Traditional strings, Unicode strings, symbols, or any mix, use a unicodeIndex:* method, to ensure the collator is persisted with the index.

optimizedComparison (btreePlus) index

optimizedComparison indexes may mix Traditional and Unicode strings, but may not mix strings and symbols.

If your indexed elements are all Traditional or Unicode strings,
use the method unicodeStringOptimizedIndex:*.

If your indexed elements are all Symbols,
use the method symbolOptimizedIndex:*.
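
As a sketch of how these rules translate into a specification for a repository in Unicode Comparison Mode using optimizedComparison indexes, the following defines one index per kind of element. It assumes, purely for illustration, that address.state always holds Traditional or Unicode strings and that job holds only Symbols; your data may differ.

GsIndexSpec new
	"strings only: Traditional and Unicode strings may be mixed in this mode"
	unicodeStringOptimizedIndex: 'each.address.state';
	"symbols only: must not be mixed with strings"
	symbolOptimizedIndex: 'each.job';
	createIndexesOn: myEmployees.

If either assumption does not hold for your data, an optimizedComparison index is not appropriate for that path, and a unicodeIndex:* method is the safer choice.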

Implicit Indexes

With legacy indexes, the indexing internal structures include a dictionary. This dictionary, as a side effect, provides de facto identity indexes with some equality indexes: specifically, for non-terminal pathTerms, and where the lastElementClass is a Special (SmallInteger, SmallDouble, SmallFraction, Character, or Boolean, in which equality and identity are the same). Such indexes are referred to as implicit indexes.

Since with btreePlusIndexes there is no dictionary, there are also no implicit indexes defined.

For clarity, and to avoid dependency on side-effects of the internal structures, it is recommended to explicitly define any identity indexes that you require. There is no risk in explicitly creating an identity index that would exist as an implicit index.

GsIndexOptions

An instance of GsIndexOptions specifies features that will be used when creating a particular index on a collection. GsIndexSpec index definition methods all have variants that accept an instance of GsIndexOptions, although some override certain settings. If no GsIndexOptions is explicitly provided, the session or repository default is used.

The GsIndexOptions defines if the index is a legacy index or a btreePlus index, as well as other important indexing features. The options available for GsIndexOptions are:

GsIndexOptions class >> legacyIndex
defines a legacy index structure, and disables btreePlusIndex and optimizedComparison.

GsIndexOptions class >> btreePlusIndex
defines a btreePlus index structure, and disables legacyIndex.

GsIndexOptions class >> optimizedComparison
adding optimizedComparison is only allowed with btreePlusIndex.

GsIndexOptions class >> reducedConflict
Instructs the index to create the internal structures as reduced-conflict, recommended when indexing on a reduced-conflict collection.

GsIndexOptions class >> optionalPathTerms
Instructs the index to allow objects that do not include an indexed instance variable to be present in the indexed collection.

These options are described in more detail starting here.

Combining options

GsIndexOptions can be combined using the plus operator and removed using the minus or not operators, with the caveat that not all options are compatible with each other. For example:

GsIndexOptions legacyIndex + GsIndexOptions reducedConflict
GsIndexOptions btreePlusIndex + GsIndexOptions optimizedComparison not

If you combine two options that conflict, the later one has precedence.
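
A combined GsIndexOptions is typically passed to one of the index definition variants that accept options. The following is a minimal sketch, assuming the options-accepting variant of equalityIndex:lastElementClass: is spelled equalityIndex:lastElementClass:options:, and that myEmployees is a reduced-conflict collection whose age values are all SmallIntegers.

| opts |
"btreePlus structure, optimized comparison, reduced-conflict internals"
opts := GsIndexOptions btreePlusIndex
	+ GsIndexOptions optimizedComparison
	+ GsIndexOptions reducedConflict.
GsIndexSpec new
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger
		options: opts;
	createIndexesOn: myEmployees.

Building the options once and holding them in a temporary makes it easier to apply the same combination to several indexes in one specification.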

Default options

Creating an instance of GsIndexOptions, using class methods such as GsIndexOptions class >> legacyIndex, begins with the default, repository-wide GsIndexOptions.

The specific value requested by the class method (such as legacyIndex) overwrites the default only for that setting and its dependents.

For example, using GsIndexOptions legacyIndex will return a GsIndexOptions instance with legacyIndex on and both btreePlusIndex and optimizedComparison disabled, regardless of the default. However, the default GsIndexOptions setting for other values, such as reducedConflict, will be retained.

The initial default GsIndexOptions is:

GsIndexOptions btreePlusIndex + GsIndexOptions optimizedComparison.

In an upgraded application, the system default is set instead to:

GsIndexOptions legacyIndex

to ensure that the behavior does not change from previous releases.

You can manually set the repository-wide default, as SystemUser, by executing GsIndexOptions class >> default:. Do this with care, since it may affect all indexes that are created in the future that do not explicitly set all the GsIndexOptions values.

For example, if you have an upgraded application and want to default to btreePlusIndexes and optimizedComparison, execute:

GsIndexOptions default: (GsIndexOptions btreePlusIndex +
	GsIndexOptions optimizedComparison)

You may also set a session-wide default that applies only to your session and only until you log out, using GsIndexOptions class >> sessionDefault:.

The Options in GsIndexOptions

The options btreePlusIndex, optimizedComparison, and legacyIndex are used to specify the index type.

  • GsIndexOptions legacyIndex enables the classic legacy btree and disables btreePlusIndex. legacyIndex is not compatible with optimizedComparison.
  • GsIndexOptions btreePlusIndex enables the btreePlus structures and disables legacyIndex. For performance, this is normally used with the optimizedComparison option. btreePlusIndexes without optimizedComparison are somewhat less performant than legacy indexes in most cases.

The following table describes the three combinations:

GsIndexOptions btreePlusIndex + GsIndexOptions optimizedComparison
Provides the best query performance, with somewhat slower update performance. There are restrictions on the contents of indexed instance variables: nil is not allowed, strings and symbols cannot be mixed, and floats and NaNs cannot be mixed.

GsIndexOptions legacyIndex
Provides good performance. Data type restrictions are less strict.

GsIndexOptions btreePlusIndex
Data type restrictions are less strict, but the performance is not as good as legacyIndex.

With optimizedComparison, certain kinds of objects cannot be mixed in the collection. The following rules apply when using optimizedComparison:

  • values must be a kind of the last element class.
  • nil is not allowed as a value.
  • For Float last element class, NaN floats are not allowed as a value.
  • For String last element class, Symbols are not allowed as a value.
  • For Symbol last element class, Strings are not allowed as a value.

Defining an index with one of the "Optimized" index specification methods overrides the settings for these three options in the default or argument GsIndexOptions.

Reduced-Conflict

In a multi-user system, reduced-conflict collection classes may help avoid transaction conflicts if multiple users simultaneously add or remove objects from the collection; for more on this problem, see Classes That Reduce the Chance of Conflict. For example, using an RcIdentityBag rather than an IdentityBag allows concurrent updates to the collection itself.

If there are concurrent updates of the same indexed instance variable for different objects in the collection (for example, the addresses associated with two different customer objects are both changed), there is not an application object conflict, since the objects are independent. However, there may be a transaction conflict due to the indexes, since both addresses are keys in the same indexing structure.

This doesn’t apply to legacy identity indexes, which are always reduced-conflict.

To avoid transaction conflicts from the indexing internal structures, specify that the indexes are reducedConflict, using GsIndexOptions reducedConflict.

For example:

GsIndexSpec new
	equalityIndex: 'each.address' 
	options: (GsIndexOptions reducedConflict)

Optional pathTerms

A homogeneous collection is one in which each element in the indexed collection defines the instance variable described by the index, for each pathTerm in the indexed path. By default, indexes require that the collection be homogeneous. If any element does not have the given instance variable, an error is raised when the element is added to the collection.

If you want to create an index on a non-homogeneous collection, you can define the indexes with optional pathTerms. For example:

GsIndexSpec new
	equalityIndex: 'each.nickName' 
	options: (GsIndexOptions optionalPathTerms)

When creating an optional pathTerm index, it is not an error when the objects in the collection do not implement an instance variable specified by the index. For a multi-pathTerm index, that includes each pathTerm; objects with missing instance variable definitions for any of the pathTerms in the indexed path are not considered when creating query results.

Note that this option bypasses some error detection. If you create an index using an instance variable that does not exist at all (perhaps due to a typing error), then the index is created correctly and does not report an error, even if it does not create the index you might have intended to create.

7.4 Results of Executing a GsQuery

Once you have defined your query, created the GsQuery, and bound it to a collection, there are further options in how to access the results of the query.

To simply get the results, you can send queryResult to the instance of GsQuery.

Like selection block queries, GsQuery >> queryResult returns a new collection of the same class as the base collection, unless protocol such as asArray is used to specify the class of the results.

Also like selection block queries, queries on instances of reduced-conflict (Rc) collections return the equivalent non-Rc collection.

The collection returned from a query has no index structures. Indexes belong to specific instances of collections, rather than the classes. If you want to perform indexed selections on the new collection, you must build the necessary indexes on the new collection.

GsQuery’s Collection protocol

GsQuery accepts other Collection protocol, and, provided the query has been bound to a collection and to query variables, the GsQuery instance responds as if it were a collection of the results of the query. This means that rather than having to put the results of a query into a temporary variable for further processing, GsQuery can respond directly to the kinds of messages you are likely to send to the query results.

You can convert the type of collection, for example, using asArray or asIdentityBag:

(GsQuery fromString: 'each.address.state = ''OR''' 
	on: Employees) asArray

Or fetch a single instance from the results:

(GsQuery fromString: 'each.firstName = ''Sophie''' 
	on: Employees) any

Performing one of the collection operations that are provided for GsQuery simplifies your code, since you may not have to put results in temporary variables. It may or may not allow you to avoid creating query result objects.

Enumeration methods also allow you to execute code while the query is executing, rather than waiting for the results.

Caching Query Results

While GsQuery responds to messages as if it was a collection, the results of a query are not a static collection. By default, each time you execute any GsQuery collection protocol, the query is performed again. So, for example, sending isEmpty to a GsQuery before sending asArray will execute the underlying query twice.

You can cache the results of your GsQuery using GsQueryOptions cacheQueryResult. By default, it is false. Using this option allows the resultSet of the GsQuery to be cached. Note that this cache will not reflect changes in the root collection that occurred after the query was executed; you are responsible for re-running the query if current results are required.

To create an instance of GsQueryOptions with cacheQueryResult true, use this expression:

GsQueryOptions cacheQueryResult

Use this instance with GsQuery methods that include the options: keyword.

For example:

query := (GsQuery fromString: 'each.address.state = ''OR''' 
	options: (GsQueryOptions cacheQueryResult)
	on: Employees).
query isEmpty ifTrue: [^'no results'].
report := self createReportingStructure. 
query do: [:ea | report updateDataWith: ea].
...

GsQuery enumeration methods accepting blocks

Among the collection protocol that GsQuery understands are the methods do:, select:, reject:, collect:, detect: and detect:ifNone:. These may look similar to iterative queries on the root collection, but since the actual query is already provided by the GsQuery, the action is quite different.

With GsQuery, these operate on the result set of the initial query. In essence, you are adding an additional, non-indexed search criterion to the indexed query. This additional code is executed for each element in the collection for which the indexed query matches, at the time that the index query is examining that result element.

For example, if you have an index on Employee age, and a query such as:

(GsQuery fromString: 'each.age <= 18' on: Employees)

Using this query, you can add an additional search criterion using select:, so that only Employees who live in Oregon are returned.

(GsQuery fromString: 'each.age <= 18' on: Employees) select: 
	[:each | each address state = 'OR']

This will return a result set that includes Employees aged 18 or younger who live in Oregon.

The address message is only sent to the elements (Employees) who are 18 or younger; it is not executed for every element in the collection. Also note that the state comparison does not use an index; these are message sends.

Order of results

Provided there is an index on the query path, the enumeration block operates on each object in the result set in the order specified by the index. However, if you wish to use the result of the select: or other enumeration method, the result will necessarily be a kind of UnorderedCollection, and the objects in the returned collection will not be ordered.

You can still use the enumeration protocol to produce results that are ordered according to the index, by adding each element to a temporary Array. However, for ordered results, you may want to stream over the results instead.
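
As a minimal sketch of the first approach, the following collects results in index order by adding each element to a sequenceable collection inside do:. It assumes there is an equality index on each.age for Employees; an OrderedCollection is used here in place of an Array only because it makes the growth of the collection explicit.

| ordered |
ordered := OrderedCollection new.
"do: visits matching elements in the order defined by the index on each.age"
(GsQuery fromString: 'each.age <= 18' on: Employees)
	do: [:each | ordered add: each].

After this runs, ordered holds the matching Employees in the order in which the index presented them. Without an index on the query path, no particular order should be expected.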

Efficiency of query vs. enumeration

It is more efficient to perform an indexed query with multiple predicates using GsQuery, than to add additional criteria using enumeration methods.

For example, the following code returns a collection of all employees who are 26 or younger, and who respond false to hasOtherHealthInsurance.

(GsQuery fromString: 'each.age <= 26' on: myEmployees)
	reject: [:each | each hasOtherHealthInsurance]

This may be useful if you have predicates that require message sends. However, if you can formulate the second statement as an indexable predicate, it would be more efficient as a query. If hasOtherHealthInsurance was actually an instance variable, you could write this as:

(GsQuery fromString: '(each.age <= 26) &
	(each.hasOtherHealthInsurance) not' on: myEmployees)
	queryResult

Early exit from execution

Since the code in the block provided to select: (and similar methods) is executed for each element that the indexed query itself would return, this provides a way to exit the indexed query early. In this block, you can execute any code (as long as it does not modify the collection or the objects in the collection, in ways that would change the result set). If it’s no longer useful to continue the search, you can exit the block and potentially save a lot of time.

For example, say you have a collection of purchase orders, and you are generating a report of all open purchase orders. If a new order arrives while you are executing this operation, you might not want to bother producing the already-obsolete report.

(GsQuery fromString: 'each.isOpen' on: MyOrders) do: 
	[:anOrder |
	report add: anOrder description.
	self checkForNewOrders ifTrue: [^'report canceled']
	]

Query results as Streams

It may be more useful to return the result of an equality query as a stream, instead of a collection, especially if the result set is large. Returning the result as a stream is not only faster, it also avoids the need to have all the result objects in memory simultaneously.

You can stream on an identity query only when using a btreePlusIndex. You cannot stream on the results of an identity legacyIndex.

Streaming on index results returns the results in the order defined by the index, so you can iterate over the elements in that order with no extra effort.

To get the results as a stream, use the message GsQuery >> readStream or GsQuery >> reversedReadStream.

These methods return an instance of a specialized subclass of Stream that understands a limited subset of ReadStream protocol. Legal messages to an index stream are:

atEnd
do:
next
reversed
size
skip:

Streams do not automatically save the resulting objects. If you do not save them as you read them, the results of the query are lost. You should not modify the objects in the base collection while streaming, nor add or remove objects; doing so can cause an error or corrupt the stream.

For example, suppose your company wishes to send a congratulatory letter to anyone who has worked there for thirty years or more. Once you have sent the letter, you have no further use for the data. Assuming that each employee has an instance variable called lengthOfService, and there is an index on this, you can use a stream to formulate the query as follows:

oldTimers := (GsQuery fromString: 'each.lengthOfService >= 30'
	on: myEmployees) readStream. 
[ oldTimers atEnd ] whileFalse: [  
	| anEmployee | 
	anEmployee := oldTimers next. 
	anEmployee sendCongratulations. ].

Limitations on streamable queries

Streams on query results have certain limitations; for example, the predicate in the query must be logically streamable. The following restrictions apply:

  • It takes a single predicate only; no conjunction of predicate terms is allowed. The exception is range predicates, which can be combined into a single predicate. For example (each.age > 18) & (each.age <= 65) is legal, since it can be reformulated as a single range predicate, (18 < each.age <= 65).
  • The predicate can contain only one path.
  • The collection you are streaming over must have an equality index on the path specified in an equality predicate; or have an identity btreePlusIndex on the path specified by an identity predicate.
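
To illustrate a query that satisfies these restrictions, the following sketch streams over the range predicate mentioned above. It assumes myEmployees has an equality index on each.age; the temporary names are hypothetical.

| stream workingAge |
workingAge := OrderedCollection new.
"Two range terms on the same single path, so the query is streamable"
stream := (GsQuery fromString: '(each.age > 18) & (each.age <= 65)'
	on: myEmployees) readStream.
"Save each element as it is read; the stream itself does not retain results"
[stream atEnd] whileFalse: [workingAge add: stream next].

A query that joined predicates on two different paths, such as age and address.state, would not meet these restrictions and would need to be executed with queryResult or an enumeration method instead.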

7.5 Enumerated and Set-valued Indexes

Enumerated path terms in indexes and queries

Enumerated path terms allow you to query over more than one instance variable value in a single query. This is specified using the vertical bar | in the path term, between the instance variable names.

The instance variables are treated as alternate choices; if any one of the specified instance variables matches the search criteria, the predicate evaluates to true.

For example, you might want to search on both first name and nickname in a single operation. The query might look like this:

(GsQuery fromString: 'each.firstName|nickName = ''Freddie''' 
	on: MyEmployees) queryResult

When this is executed, the results will include all instances that have either the firstName equal to ‘Freddie’, or the nickName ‘Freddie’, or both.

In order to optimize this query with an index, you need to create an index on the specific enumeration, e.g. 'each.firstName|nickName'. An enumerated path term query will not use an index on the individual instance variables that are enumerated.
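
A minimal sketch of such an index follows. It assumes that firstName and nickName both hold strings, so String is given as the lastElementClass, and that the repository's string comparison mode permits an equalityIndex:* method for these values.

GsIndexSpec new
	"one index covering both alternatives of the enumerated path term"
	equalityIndex: 'each.firstName|nickName'
		lastElementClass: String;
	createIndexesOn: MyEmployees.

Separate indexes on 'each.firstName' and 'each.nickName' could exist alongside this one to serve queries on the individual instance variables.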

Restrictions on predicates with enumerated pathTerms

The semantics of enumerated pathTerms do not allow multiple conjoined predicates using the same enumerated pathTerm, since each predicate is evaluated separately. Conjoined predicates are those connected using &.

Indexes and Queries with collections on the path

Your business objects may themselves contain collections; for example, an employee may contain a collection of children; and you may want to search based on some criteria of the objects in that collection. As long as this collection is itself indexable, indexes and queries can include all elements within these contained collections.

Index paths that include collections, and the queries that use these indexes, are generally referred to as Set-valued indexes and queries for historical reasons, although any kind of indexable collection, not just Sets, may be used.

When you wish to specify a path containing an instance of a subclass of UnorderedCollection, the collection is represented by an asterisk *. This syntax may be used to create indexes and perform queries. Only GsQuery may be used to perform set-valued queries.

For example, suppose you want to know which of your employees has children of age 18 or younger. To facilitate such queries, each of your employees has an instance variable named children, which is implemented as a set. This set contains instances of a class that has an instance variable named age.

To create the index:

GsIndexSpec new
	equalityIndex: 'each.children.*.age' 
		lastElementClass: SmallInteger;
	createIndexesOn: myEmployees. 

Set-valued query results

When you execute a set-valued query, the results you get will follow the particular semantics of Set-valued queries. Since there are potentially multiple “true” query results for a given element in the base collection, the result of a set-valued query such as this can be larger than the original collection.

For example, consider the following query, using the index created above:

(GsQuery fromString: 'each.children.*.age <= 18'
	on: myEmployees) queryResult

In this example, if the root collection myEmployees is a Bag or IdentityBag (rather than a Set or IdentitySet), and an employee has two children who are 18 or younger, then that employee will appear in the results (a Bag or IdentityBag) twice. Employees with three such children appear in the results three times, and so on. The resulting collection may be several times as large as the original collection, depending on the details of the query and data.

If the root collection myEmployees is a Set, which does not allow multiple instances of the same object, this potential source of confusion does not occur.
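
If the root collection is a Bag or IdentityBag and you need each matching employee only once, one possible approach is to convert the result to a set. This is a sketch only; it assumes the collection returned by queryResult responds to the conversion message asIdentitySet.

| withMinors distinct |
"May contain the same employee once per matching child"
withMinors := (GsQuery fromString: 'each.children.*.age <= 18'
	on: myEmployees) queryResult.
"Collapse repeated occurrences to a single entry per employee"
distinct := withMinors asIdentitySet.

The size of withMinors then counts matching children, while the size of distinct counts employees who have at least one matching child.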

Restrictions on predicates in set-valued queries

The semantics of set-valued indexes do not allow multiple conjoined predicates that use the same set-valued pathTerm, since each predicate is evaluated separately. Conjoined predicates are those connected using &.

In general, it is recommended to avoid multiple-predicate set-valued queries, although some of these queries can be optimized, or avoid the problem cases, and are safe and therefore allowed.

7.6 Managing Indexes

You may need to find out about all the indexes in your system, and to remove selected indexes or clean up indexes that were not successfully created. This functionality is provided by the class IndexManager.

IndexManager has a single instance which provides much of the functionality, accessible via IndexManager current.

This instance is lazily initialized and stored in an IndexManager class instance variable after it is created. Any configuration you apply to IndexManager current will therefore be used by all affected operations, provided you commit after making the change.

While Indexes are Being Created

Indexing a large collection will take some amount of time to create the infrastructure and tracking for each indexed object.

The message progressOfIndexCreation returns a description of the current status for an index as it is created.

Queries during index creation

While the index is being created, the index is write-locked. Any query that would normally use the index is performed directly on the collection, by brute force. If a concurrent user modifies an object that is actively participating in the index at the same time, index creation is terminated with an error.

Auto-commit

Creating or removing an index creates and/or modifies many objects related to the internal structures that support indexes. These modifications are uncommitted changes that must be kept in the session’s memory until they are committed. Many uncommitted changes place a large demand on memory and create a risk of out of memory conditions. Chapter 8, “Transactions and Concurrency Control”, explains uncommitted objects and transactions in more detail, while Chapter 15, “Performance and Optimization” includes information on object memory use.

To avoid problems during index creation, it is often necessary to set the IndexManager to autoCommit. When IndexManager is set to autoCommit, it will commit the partially created index, rather than risk running out of resources and failing the index operation.

By default, autoCommit is false. When you send the following message:

IndexManager autoCommit: true

it configures your IndexManager such that the current transaction is committed during an indexing operation, whenever any of the following occur:

  • The current session receives a signal indicating temporary object memory is almost full.
  • The percentage of temporary object memory in use reaches the IndexManager’s setting for percentTempObjSpaceCommitThreshold.

The default is 60. This threshold can be changed using IndexManager >> percentTempObjSpaceCommitThreshold: anInt

  • The current session receives a signal to FinishTransaction. This occurs when the commit record backlog is larger than STN_SIGNAL_ABORT_CR_BACKLOG, and this session is holding the commit record.
  • The number of modified objects in the current transaction reaches the IndexManager’s setting for dirtyObjectCommitThreshold.

The default is SmallInteger maximum value, which means this limit is effectively disabled. This limit can be changed using IndexManager >> dirtyObjectCommitThreshold: anInt

When autoCommit is true, a transaction will be started (if necessary) before the indexing operation begins, and the IndexManager will commit at the completion of the indexing operation. Note that this means that, even if you are in manual transaction mode and not in a transaction, index operations will cause changes to be committed to the repository without you explicitly beginning a transaction.

If you want to enable autoCommit only for the current session, not for all index creation, you can use

IndexManager sessionAutoCommit: true
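
As a sketch of how the session-level setting might be used around a single large index build, the following enables autoCommit for the session, creates an index, and restores the setting afterwards. The index definition is only an example, and the use of ensure: to restore the setting is a suggestion rather than a requirement.

"Commit any work in progress first, since autoCommit may commit at any point"
IndexManager sessionAutoCommit: true.
[GsIndexSpec new
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger;
	createIndexesOn: myEmployees]
		ensure: [IndexManager sessionAutoCommit: false].

Because autoCommit commits the current transaction during the operation, any other uncommitted changes in the session are committed along with the index structures.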

Indexes on temporary collections

You may create indexes on temporary collections containing temporary and persistent objects. However, on abort, any indexes on temporary collections are removed.

Inquiring About Indexes

For a full description of the indexes on a particular collection, send indexSpec to the collection. Printing the result produces a string containing the GsIndexSpec code that would recreate the same indexes, and provides useful documentation on those indexes.

For example,

run
myEmployees indexSpec printString
%
GsIndexSpec new
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state'
		lastElementClass: String
		options: GsIndexOptions reducedConflict;
	identityIndex: 'each.userId';
	yourself.

The following IndexManager messages allow you to inquire about all indexes in the repository.

  • getAllNSCRoots

Returns a collection of all UnorderedCollections in the repository that have indexes.

  • usageReport

Returns a report on all indexes on all UnorderedCollections in the repository.

Removing Indexes

There are a number of ways to remove indexes.

Since indexing internal structures create references to the indexed collection and to objects in the collection, before dereferencing a collection, you should be sure to remove all indexes on the collection. This allows the collection to be garbage collected.

To remove indexes based on a GsIndexSpec

As you can create indexes based on an instance of GsIndexSpec, you can also use that specification to remove these indexes.

GsIndexSpec >> removeIndexesFrom: aCollection

This method removes the indexes described by the GsIndexSpec from the collection aCollection. If any of the indexes do not exist, they are not removed and no error is returned.

This is most useful in combination with the method that creates the spec from the existing collection. For example:

(MyEmployees indexSpec) 
	removeIndexesFrom: MyEmployees.

To remove a single index, you may edit the specification code printed by indexSpec, or create a simple GsIndexSpec with information to remove a single index:

(GsIndexSpec new 
	equalityIndex: 'each.age' lastElementClass: Object)
		removeIndexesFrom: MyEmployees.

To remove indexes using IndexManager

IndexManager, which provides a system-wide view of all the indexes in the repository, provides a number of methods to remove indexes individually, by collection, or globally.

IndexManager >> removeEqualityIndexFor: aCollection on: aPathString 

Removes an equality index from the collection aCollection with the indexed path described by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

IndexManager >> removeIdentityIndexFor: aCollection on: aPathString 

Removes the identity index from the collection aCollection with the indexed path described by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

IndexManager >> removeAllIndexesOn: aCollection

Removes all explicitly created indexes from the collection aCollection. Implicit indexes that were created by these elements participating in other indexed collections are not removed.

IndexManager >> removeAllIndexes

Removes all indexes on all UnorderedCollections, including all implicit and partial indexes.

IndexManager >> removeAllTracking

Removes all indexes on all UnorderedCollections, and all object tracking. While this is the fastest and most complete way to remove indexing infrastructure, if you are using modification tracking for any other purpose, that tracking will be removed as well.

Rebuilding Indexes

When objects that participate in an index are modified, the related indexing infrastructure must be updated. This causes some overhead. If you are performing an operation that will modify a large number of objects that participate in multiple indexes, such as a large migration, it may be more efficient to remove some or all of the indexes on the collection before performing the migration, and rebuild those indexes after the migration is complete.

It is also sometimes required to remove and rebuild indexes as part of a GemStone upgrade; certain changes in GemStone kernel classes require you to rebuild either specific kinds of indexes or all indexes. Any requirement to do this will be included in upgrade instructions in the Installation Guide for the version of GemStone to which you are upgrading.

To remove and rebuild indexes, you can extract and save the GsIndexSpec, and reuse that after the operation is complete.

For example:

| mySpec |
mySpec := myCollection indexSpec.
mySpec removeIndexesFrom: myCollection.
"perform migration or other operation"
mySpec createIndexesOn: myCollection

Using IndexManager >> getAllNSCRoots, you may extend this example to retrieve the GsIndexSpec for each collection in the repository, which will allow you to remove and rebuild the indexes.
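
One way this extension might look is sketched below. It records each indexed collection together with its specification before removing anything, since a collection no longer appears in getAllNSCRoots once its indexes are gone. The temporary name saved is hypothetical, and in a large repository you would probably also want to commit between steps or enable autoCommit.

| saved |
saved := OrderedCollection new.
"Pair every indexed collection with the spec that recreates its indexes"
IndexManager current getAllNSCRoots do: [:nsc |
	saved add: nsc -> nsc indexSpec].
"Remove the indexes described by each saved spec"
saved do: [:assoc | assoc value removeIndexesFrom: assoc key].
"perform migration or other operation"
"Rebuild the same indexes on the same collections"
saved do: [:assoc | assoc value createIndexesOn: assoc key].

Holding the associations in saved also keeps a reference to every collection for the duration of the operation.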

Indexing Errors

To ensure that indexing structures are consistent, some kinds of errors that may occur during index creation will disable commits. Before creating an index, it is advisable to commit any work in progress, to avoid losing any work if an indexing error does occur.

For example, if you create an index on a collection and one or more of the objects that participate in the index do not implement the instance variable on the path, index creation raises an error (unless you are using optionalPathTerms, as described here).

If an error occurs partway through index creation, and the autoCommit status (see Auto-commit) means that some portion of the index creation was committed, a collection may have unusable partial indexes. These indexes must be manually removed.

The following IndexManager instance methods allow you to remove incomplete indexes, while not affecting any complete, usable indexes:

IndexManager current removeAllIncompleteIndexes

Removes all incomplete indexes on all UnorderedCollections.

IndexManager current removeAllIncompleteIndexesOn: anNSC

Removes all incomplete indexes on the specified UnorderedCollection.
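One possible defensive sequence is sketched below: commit first, attempt the index creation, and clean up incomplete indexes if an error is signaled. The collection myEmployees and the path 'each.age' are placeholders, and the choice to catch Error, abort, and then remove incomplete indexes is an assumption about a reasonable recovery order rather than a prescribed procedure.

```smalltalk
"Commit pending work first, so an indexing error cannot cost unrelated changes."
System commitTransaction.

[GsIndexSpec new
    equalityIndex: 'each.age' lastElementClass: SmallInteger;
    createIndexesOn: myEmployees.
 System commitTransaction]
    on: Error
    do: [:ex |
        "Discard the failed work, then clear any partial index
         that auto-commit may already have committed."
        System abortTransaction.
        IndexManager current removeAllIncompleteIndexesOn: myEmployees.
        System commitTransaction.
        ex return: nil]
```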

If you modify objects that participate in an index, try to commit your transaction, and your commit operation fails, query results can become inconsistent. If this occurs, abort the transaction and try again.

Auditing Indexes

Indexes should be audited as part of your regular application maintenance, to ensure there are no problems.

You can audit the internal indexing structures for a particular collection by executing:

aCollection auditIndexes

This audits all the indexes, explicit and implicit, on the given collection. If indexes are correct, this method returns 'Indexes are OK' or 'Indexes are OK and the receiver participates in one or more indexes.'. If there are no indexes on the collection, a message such as 'No indexes are present.' is returned.

In the case of failure, a list of specific problems is returned.

You can audit all indexes in the entire repository at once using:

IndexManager current nscsWithBadIndexes

which will return an IdentitySet containing all collections that fail auditIndexes. Depending on the number of indexed collections in your system, this may take a considerable time to run.

In the rare case of a problem reported, the usual way to resolve the problem is to remove and rebuild the affected indexes. In some cases, removing all indexes on the collection may succeed even if the internal problems prevent a single index being removed.
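The remove-and-rebuild approach could be sketched as follows for every collection that fails the audit. This assumes that indexSpec can still be retrieved from a collection whose indexes fail the audit, which may not hold for every kind of corruption.

```smalltalk
IndexManager current nscsWithBadIndexes do: [:nsc | | spec |
    "Capture the index definitions before removing anything."
    spec := nsc indexSpec.
    "Remove all indexes on the collection, then rebuild them from the spec."
    nsc removeAllIndexes.
    spec createIndexesOn: nsc].
System commitTransaction.
```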

7.7 Indexing and Performance

The purpose of indexes is to improve performance. It is always recommended that you run tests to verify that an index actually delivers the expected improvement.

Indexing improves query performance dramatically (in most cases), but does have a negative impact on updating the indexed data, since the indexes must be kept up to date.

Type of index

The performance characteristics of btreePlus and legacy indexes are quite different.

btreePlus indexes without optimized comparison are usually slower than other kinds of indexes. If your desired index cannot support optimizedComparison, you should use a legacyIndex.

btreePlus optimizedComparison indexes are usually considerably faster than a legacy index, but they create a somewhat larger negative impact on data updates.

Data updates

As your application is in use and the data in the indexed collection changes, the index must be updated. While normally indexing a large collection speeds up queries performed on that collection and has little effect on other operations, there are cases in which maintaining the index can cause a performance bottleneck.

For example, you may notice slower than acceptable performance if you are making a great many modifications to the instance variables of objects that participate in an index, and more than one of the following is true:

  • the path of the index is long;
  • the object occurs many times within the indexed IdentityBag or Bag;
  • the object participates in many indexes.

Even so, indexing a large collection is still likely to improve performance unless more than one of these circumstances holds true. If you do experience a performance problem, you can work around it in one of two ways:

If you have created relatively few indexes but are modifying many indexed objects, it may be worthwhile to remove the indexes, modify the objects, and then re-create the indexes.

If you are making many modifications to only a few objects, or if you have created a great many indexes, it is more efficient to commit frequently during the course of your work. That is, modify a few objects, commit the transaction, modify a few more objects, and commit again.
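The commit-frequently approach might look like the sketch below. The collection objectsToUpdate, the age and age: accessors, and the batch size of 100 are all hypothetical; a suitable batch size depends on your application.

```smalltalk
| count |
count := 0.
objectsToUpdate do: [:emp |
    "Modify an instance variable that participates in an index."
    emp age: emp age + 1.
    count := count + 1.
    "Commit after every 100 modifications."
    count \\ 100 = 0 ifTrue: [System commitTransaction]].
"Commit whatever remains from the final partial batch."
System commitTransaction.
```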

Formulating queries and performance

The most efficient queries are the ones in which the first predicate will return the smallest result set. This is sometimes easy for a human to determine, but the query cannot predict this without actually running the query. Queries should be manually reviewed for these kinds of domain-specific optimizations.

For example, you might want to query for current orders for a particular customer.

(each.status = #current) & (each.customer.name = 'Smith')

If your application is likely to have only a few current orders, then this ordering is more efficient. However, if you are likely to have many current orders, but only a few customers named Smith, it would be more efficient to write the predicates in the reverse order.
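For illustration, the reversed form of the predicate above, which places the customer-name test first, would read:

```smalltalk
(each.customer.name = 'Smith') & (each.status = #current)
```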

Auto-optimize

Queries, by default, are optimized before execution; for example, the not operator is transformed into the logical equivalent by changing the comparison operator.

In addition, the predicates are reordered as follows, from left to right:

1. predicates involving indexed paths.

2. predicates with identity comparisons on paths without indexes.

3. predicates with equality comparisons on paths without indexes.

Auto-optimize can be disabled using the instance of GsQueryOptions that is associated with each query. The GsQueryOptions instance controls optimization and other query features, including the specific optimizations performed and whether automatic query optimization is done at all; the default is to auto-optimize.

7.8 Historic Indexing API differences

In older versions of GemStone/S and GemStone/S 64 Bit, indexes and queries used a more limited API based on UnorderedCollection methods and a block-like query syntax. This API remains fully supported and interoperates with the GsIndexSpec/GsQuery API, with some limitations. A number of features are not supported by the older API.

Index creation using UnorderedCollection protocol

UnorderedCollection provides protocol to create indexes. This creates the same index structures as GsIndexSpec, but does not provide access to some index features.

The following index creation methods are defined on UnorderedCollection:

createIdentityIndexOn:
createEqualityIndexOn:withLastElementClass:

The path argument is the same as the path used to create a GsIndexSpec index; however, you may not include the initial "each".

For example, the following three statements create the same indexes that were created here.

myEmployees createIdentityIndexOn: 'userId'.
myEmployees 
	createEqualityIndexOn: 'age' 
	withLastElementClass: SmallInteger.
myEmployees 
	createEqualityIndexOn: 'address.state' 
	withLastElementClass: String.

Enumerated and set-valued indexes and queries are not supported using the historic API.

Internal legacy vs. btreePlus indexing structures

The use of legacyIndex or btreePlusIndex/optimizedComparison is based on the default GsIndexOptions. The session or system default determines the type of index that is created.

String and Unicode Equality Indexes

Indexes on various kinds of strings follow the same rules as GsIndexSpec string indexes, with the exception that the optimized indexes cannot be created this way.

To create Unicode indexes, specify a lastElementClass of any Unicode string class (Unicode7, Unicode16, or Unicode32). Since no collator can be specified, the index will be created using the current default IcuCollator.

Reduced-conflict Equality Indexes

An Rc Equality Index is a type of Equality Index in which internal indexing structures are reduced-conflict. This avoids some transaction conflicts when creating an index on a reduced-conflict (RC) collection, such as RcIdentityBag. Reduced-conflict classes are described in Indexes and Concurrency Control. Rc Equality indexes are described under Reduced-Conflict.

To create an Rc Equality Index using the UnorderedCollection index creation protocol, send the message:

createRcEqualityIndexOn:withLastElementClass: 

Queries using Selection Blocks

Selection blocks are a kind of block specialized for queries, using curly braces instead of brackets. The compiler understands this syntax and creates the selection block instance when the code or method is compiled.

A selection block query might be written like this:

{:each | each.address.state = 'OR'}

Selection blocks are quite restrictive:

  • A selection block has exactly one argument.
  • Message sends are not allowed in a selection block; you can only use the dot syntax to specify instance variables of the argument.
  • The code inside the block is limited to predicates as described under Query Predicate Syntax, with additional limitations below.
  • Set-valued and enumerated syntax are not allowed in a selection block.
  • Range predicate syntax is not allowed in a selection block, although you may specify the same operation by conjoining two separate predicates.
  • Selection block queries do not allow the | (disjunction operator), nor the not operator.
  • Selection blocks can only be used as arguments to the methods select:, reject:, detect:, detect:ifNone:, or selectAsStream:.
  • Selection block queries are not optimized.

In selection block queries, you can reference temporary, instance or other variables within the block, and these are resolved at runtime as in ordinary blocks.
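As a small sketch of two of these points, the following expresses a range as two conjoined predicates and refers to temporary variables from inside the selection block. The collection myEmployees and the age bounds are placeholders.

```smalltalk
| minAge maxAge |
minAge := 18.
maxAge := 40.
"A range written as two predicates joined with &, since range
 predicate syntax is not available in a selection block."
myEmployees select: {:each | (each.age >= minAge) & (each.age <= maxAge)}
```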

Executing Selection Block Queries

A selection block is used with select:, reject:, detect:, detect:ifNone:, or selectAsStream: to perform the query over a collection.

For example:

Employees select: {:each | each.address.state = 'OR'}

These have the same semantics as with standard blocks executed on a collection. For example, reject: will return a result set that includes all elements for which the block evaluation would return false. The results are in a collection of the same class as the base collection (unless species or speciesForSelect specifies a different class, as with the RC classes).

The collection returned from a query has no index structures. If you want to perform indexed selections on the new collection, you must build the necessary indexes on the new collection.

Results as a stream

To get the results as a stream, use UnorderedCollection >> selectAsStream:. This returns an instance of RangeIndexReadStream, which understands the following messages:

next
Returns the next value on a stream of range index values.

atEnd
Returns true if there are no more elements to return through the logical iteration of the stream.

reversed
Creates a ReversedRangeIndexReadStream based on the receiver, allowing you to stream over the results from last to first.
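A typical way to consume such a stream is sketched below. It assumes that myEmployees has an equality index on the age path, and it collects the results into an OrderedCollection purely for illustration.

```smalltalk
| stream results |
stream := myEmployees selectAsStream: {:each | each.age > 30}.
results := OrderedCollection new.
"Pull elements one at a time until the stream is exhausted."
[stream atEnd] whileFalse: [results add: stream next].
results
```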

Creating a GsQuery from a selection block

If you have existing code that includes selection block queries, you can use those selection blocks to create the instances of GsQuery.

For example:

GsQuery fromSelectBlock: {:each | each.address.state = 'OR'}

This can be bound using on:, or created using fromSelectBlock:on:, similar to how you create and bind a GsQuery from a string.
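A minimal sketch of the bound form follows. The collection myEmployees is a placeholder, and queryResult is assumed here to be the message that executes the query and answers the matching elements.

```smalltalk
| query |
"Create the query from an existing selection block and bind it
 to a collection in one step."
query := GsQuery
    fromSelectBlock: {:each | each.address.state = 'OR'}
    on: myEmployees.
query queryResult
```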

Managing indexes

Information about indexes

Sending indexSpec to the collection provides a complete description of the indexes on a collection, and can be used for information without using the GsIndexSpec API; the extra details provided by indexSpec can be ignored.

You can also send messages to the collection that will return quick information on indexed paths.

equalityIndexedPaths and identityIndexedPaths
Returns, respectively, the equality indexes and the identity indexes on the receiver’s contents. Each message returns an array of strings representing the paths in question.

For example, the following expression returns the paths into myEmployees that bear equality indexes:

myEmployees equalityIndexedPaths
%
anArray( 'age', 'address.state') 

kindsOfIndexOn: aPathNameString
Returns information about the kind of index present on an instance variable within the elements of the receiver. The information is returned as one of these symbols: #none, #identity, #equality, #identityAndEquality.

equalityIndexedPathsAndConstraints
Returns an array in which the odd-numbered elements are the elements of the path, and the even-numbered elements are the constraints specified when creating an index using the keyword withLastElementClass:.
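These informational messages can be combined with the index creation protocol. As a sketch, the following creates an equality index on age only if no index of any kind is already present on that path; myEmployees and the path are placeholders.

```smalltalk
"Create the index only when kindsOfIndexOn: reports that none exists."
(myEmployees kindsOfIndexOn: 'age') = #none
    ifTrue: [
        myEmployees
            createEqualityIndexOn: 'age'
            withLastElementClass: SmallInteger]
```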

Removing Indexes

Indexes can be removed using the GsIndexSpec API.

You may also send messages directly to the indexed collection to remove one or all indexes.

UnorderedCollection >> removeEqualityIndexOn: aPathString
Removes an equality index from the path indicated by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

UnorderedCollection >> removeIdentityIndexOn: aPathString
Removes the identity index on the specified path. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

UnorderedCollection >> removeAllIndexes
Removes all explicitly created indexes from the receiver. Implicit indexes that were created by these elements participating in other indexed collections are not removed.
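For illustration, assuming the indexes created on myEmployees in the earlier index creation example, individual indexes or all indexes could be removed as follows:

```smalltalk
"Remove individual indexes by path."
myEmployees removeEqualityIndexOn: 'age'.
myEmployees removeIdentityIndexOn: 'userId'.

"Or remove every explicitly created index on the collection at once."
myEmployees removeAllIndexes.
```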
