7. Indexes and Querying

Previous chapter

Next chapter

This chapter describes GemStone Smalltalk’s indexing and querying mechanism, a system for efficiently retrieving elements of large collections.

Overview
reviews the concept of relations.

Defining and Executing Queries
describes the structure of query predicates, the types of queries, and how to construct and execute a query.

Creating Indexes
discusses GemStone Smalltalk’s facilities for creating indexes on collections.

Special Kinds of Queries and Indexes
Describes how to create indexes and query on Unicode Strings, enumerated and collection-valued path terms, and other special cases.

Managing Indexes
How to perform index management: find out about indexes in your system, remove existing indexes, handle errors, and audit indexes.

Query Formulas and Optimization
How to use Query formulas, and how these formulas are optimized.

7.1 Overview

Most applications require one or more databases : a set of data that is critical for business function, and needs to be accessed efficiently. In GemStone, this is a represented as a instance of collection that holds instances of business objects. For large applications, collection may be very large, containing many thousands or even millions of objects; and you will need to be able to find specific objects within that collection quickly and easily.

The following example shows employee data in table form:

Table 7.1 Employees

Name

Job

Age

Address

Fred

clerk

40

22313 Main, Dexter, OR

Sophie

busdriver

24

540 E. Sixth, Renton, WA

Conan

librarian

40

999 Walnut, Hilt, CA

In GemStone Smalltalk, you would naturally define Employee and Address classes that are subclasses of Object, with name, job, age, and so on as instance variables; and create instances of these classes to store your employee information. You’d put these instances in a collection. While various kinds of collections could be used, an IdentitySet (or IdentityBag) would be the logical choice, since the Employees are not inherently ordered.

To make it easy to associate behavior with your set of Employees, you might define a class SetOfEmployees that is a subclass of IdentitySet. Then, you might make the collection of Employee globally accessible, by referencing it by the key #myEmployees.

Recall that in UnorderedCollections, lookup is by value, rather than by position in the collection or by a key. Standard Collection protocol allows you to locate an object in an UnorderedCollection by select:, or similar messages.

For example:

myEmployees select: [:anEmployee | anEmployee age = 40]

Searching this way sends one or more messages to each element of the receiver. In this example, the messages age and = are each sent once for each element of myEmployees. This is fine for small collections, but becomes unreasonably slow for collections containing many thousands of complex objects. This is particularly true if objects are not in memory and need to be read from disk in order to respond to messages.

GemStone Indexes

Indexes provide a way to locate specific objects in a collection by value. Indexing a Collection creates internal structures such as balanced trees (Btrees). Only a few message sends are needed to lookup a value or range of values in the indexing structures. When collections are indexed, they can return results without having to iterate the collection or send messages to each object.

Indexes may only be created and indexed queries performed over collection classes that are subclasses of UnorderedCollection: this includes IdentityBag, IdentitySet, Bag, Set, RcIdentityBag, and a few other specialized kinds of Set.

Indexes are created on objects based on instance variables, not on message sends; since the instance variable relationships are known by the system, indexes can be usually be updated automatically as elements are added and removed from the collection, and when object instance variables change value. There are some exceptions to this, but these require manual updating.

What you need to do

In order to take advantage of efficient indexed queries on your collection, you need to do the following:

  • Formulate the query that you wish to be indexed, using query syntax
  • Create an index on that collection, that specifies that particular instance variable path on which you will perform the query.
  • execute the query on that indexed collection, using specific query protocol

For example, if you want searching for Employee by name to be fast, you can create an index on name, and then use query syntax to execute a query on name. If you need to search based on different instance variables, such as searching by ID as well as by name, you need to create multiple indexes.

You may use query syntax to create and execute a query, even when there is no index on that variable, but it will have to iterate the collection to find the results.

The details of how to define and create indexes, and how to formulate and execute queries, are described in following sections.

APIs for creating indexes and queries

There are two distinct APIs for performing these tasks.

  • The GsIndexSpec/GsQuery interface is introduced in version 3.2, and provides ways to define and manipulate indexes and queries using objects.
  • The traditional indexing API, which creates indexes using UnorderedCollection methods and performs queries using selection blocks, is an alternate way to use indexes. Some features are not available using this traditional indexing API. Differences are noted in the following sections.

Internal index structures and behaviors are the same for both APIs, and if indexing and querying is limited to the common features, they can be used interchangeably.

Managing Indexes

In addition to creating indexes and queries, you will also need to do some management on your indexes and queries. For example, you should evaluate your indexes for performance, remove indexes that are no longer needed, and audit indexes to ensure the structures are correct. Many of these indexing tasks are handled by IndexManager.

Indexing trade-offs

Creating an index creates a number of internal objects that implement the indexing infrastructure; this includes the btree structure and the RcIndexDictionary. If your collection is small, the extra overhead of this infrastructure reduces the value of the indexed search. The size at which the trade-off makes an index worthwhile will depend on many factors, and testing is always preferred. As a rule of thumb, if your collection contains fewer than about 2000 objects, it may not be worthwhile to create an index.

Building and removing indexes requires extra management. You must be sure to remove indexes before dereferencing instances of UnorderedCollection, to avoid leaving indexing structures in place that cannot be automatically garbage collected.

Special Syntax for Indexing

GemStone indexing uses several syntactical elements that are either specific to, or primarily used for, index creation and indexed queries.

Path-dot syntax

Indexes are created, and queries formed, using special syntactic structure called a path, which designates variables for indexing and describes certain features of the index. Path syntax uses a period . to represent the object/instance variable name relationship.

For example, for a collection of Employees, in which each employee has an address instance variable, which refers to an Address that has a “state” instance variable, an example of a path is:

address.state

In the simplest case, a path on an instance variable on the collection elements, this is just the instance variable name. For example:

name

Each instance variable name on the path is a pathTerm. In the above example, address and state are each pathTerms. Paths can contain many pathTerms, if the elements of the collection represent a deeply nested tree of objects. If each object has the appropriate instance variable, this is an example of a longer path:

account.order.address.state

You may specify the values of variables nested up to 16 levels deep within the elements of a collection.

Path-dot syntax can be used anywhere in GemStone code; it is required in index creation and queries, for which message sends are not allowed.

An initial 'each.', where each represents the elements of the collection, is recommended but optional for GsIndexSpec index creation. For example:

each.address.state

This 'each.' is not permitted in paths supplied to UnorderedCollection index creation methods.

Enumerated pathTerms

A vertical bar | in the path indicates the presence of two alternate instance variables that will be indexed together, as if they were a single variable.

For example, you might want to search on both name and nickname in a single operation. This might look like this:

account.name|nickname

This syntax can be used to create indexes and queries using the GsQuery/GsIndexSpec API, but not with the traditional API.

Set-value path terms

An asterisk * in the path indicates a collection, which must be an instance of an indexable class (an instance of a subclass of UnorderedCollection).

For example, if the instance variable children contains an IdentityBag of instances of Child, and a child has the instance variable age:

children.*.age

This syntax can be used to create indexes and queries using the GsQuery/GsIndexSpec API, but not with the traditional API.

7.2 Defining and Executing Queries

Before you can define indexes on your collection, you need to determine the ways in which you will need to search your collection to retrieve elements. Once you have defined the queries you need, this will determine the details of the indexes you need to create.

At its simplest, a query consists of an instance variable of the objects in the collection, a comparison operator, and a literal to which the value is compared. For example, if you wish to be able to find all employees 18 and older, your query formula would be something like this:

each.age >= 18

In this example, every object in the collection (each) has an instance variable age, which is specified using dot-path notation. The value of that instance variable is compared, greater than or equal, to the literal 18.

This formula is simple; you can formulate queries based on multiple instance variable values, operators, and constants, and combine them using boolean logic. However, note that in the query syntax, you cannot include message sends. The details of predicate syntax is described in the next section, Query Predicate Syntax.

Equality vs. Identity queries

Queries, and the indexes that support them, can be based on equality comparisons or on identity comparison. Equality comparisons include greater than and less than comparisons, as well as equal and not equal; equality comparison-based indexes depend on being able to order the indexed elements respective to one another, so an element or range of elements can be found using search on the internal btree.

Identity indexes and queries compare elements by identity and have more limited flexibility, but fewer restrictions—since any objects can be compared using identity. You cannot query for a range of results using Identity indexes.

Query Predicate Syntax

A query contains a predicate expression, which is a Boolean expression that, when evaluated with the elements of the collection, returns true or false. In a query, the expression usually compares an instance variable on the collection objects with another instance variable or with a constant.

A predicate contains one or more predicate terms—the expressions that specify comparisons.

Predicate Terms

A term is a Boolean expression containing an operand and usually a comparison operator followed by another operand. For example, in

each.age >= 18

each.age and 18 are operands, while >= is a comparison operator. The only time you would not have a comparison operator is if the operand is itself a Boolean (true or false).

Predicate Operands

An operand can be a path (each.age, in this case), a variable name, or a literal (18, in this example). All GemStone Smalltalk literals except arrays are acceptable as operands.

Predicate Operators

The following tables list the comparison operators used in a query predicates:

==

Identity comparison operator

~~

Non-identity comparison operator

=

Equality comparison operator

~=

Not-equal comparison operator

<

Less than equality operator

<=

Less than or equal to equality operator

>

Greater than equality operator,

>=

Greater than or equal to equality operator

No other operators are permitted in a GsQuery or selection block query.

These operators behave according to the rules governing the objects being compared. Equality comparisons that use an index, however, are more restricted in an indexed query; you cannot compare classes that are not in the same hierarchy as the class specified during index creation. The exception to this are strings; all kinds of traditional strings can be compared to each other and to Symbols, and all kinds of Unicode strings can be compared to each other, but not to traditional strings or Symbols. (But see the note on Unicode Comparison Mode here).

Nil is a special case for equality index comparisons. Because of its special significance as a placeholder for unknown or inapplicable values, nil is comparable to every kind of object, and every kind of object is comparable to nil. Since the appearance of nil signifies a value that is not there, less than and greater than comparison results will not include nil values.

Basic Classes optimized for indexes

The most efficient indexes are on certain GemStone Smalltalk kernel classes, in which all or a large part of the object’s value can be encoded within the index internal structures, avoiding the need to read the object itself. These classes are referred to as “Basic classes” for indexing and querying.

This includes the following classes, and subclasses of these classes:

Character, String, DoubleByteString, QuadByteString,
Unicode7, Unicode16, Unicode32,
Symbol, DoubleByteSymbol, QuadByteSymbol,
Boolean, Time, Date, DateTime, DateAndTime
SmallInteger, LargeInteger,
ScaledDecimal, Fraction,
SmallDouble, Float, DecimalFloat

Instances of basic classes, and subclasses of basic classes, use specialized protocol to perform the comparisons.

In comparisons involving instances of String or its subclasses, the indexed comparison mechanism considers only the first 900 characters of each operand. Two strings that differ only beginning at the 901st character are considered equal.

Creating indexes on Unicode strings (Unicode7, Unicode16, and Unicode32) require using a Unicode index.

Using non-basic Classes

You can create indexes on instances of classes that are not basic classes, including classes you defined yourself.

Identity indexes on instances of your own classes, since they compare on the identity of the objects, require no addition consideration, nor do indexes on instances of classes that inherit a kernel implementation of the equality operators.

If you need to, you can redefine the equality operators =, ~=, <=, <, >=, and > in classes that you have created that are not subclasses of Basic classes. There are some caveats; this is discussed under Redefined Comparison Messages.

Combining Predicates using Boolean Logic

If you want retrieval of an element to be contingent on the values of two or more of its instance variables, you can join several terms using a conjunction (logical AND) or disjunction (logical OR) operator.

The conjunction operator, &, makes the predicate true if and only if the terms it connects are true. The disjunction operator, |, makes the predicate true if either one, or both, of the terms it connects are true.

Selection block queries only permit the conjunction operator; GsQuery queries allow both conjunction and disjunction operators.

You may also negate individual predicate terms using not, only in GsQuery queries.

Each predicate term must be parenthesized.

For example, the following are legal queries. This example returns a collection of employees who are named Conan and are librarians:

(each.name = 'Conan') & (each.job = 'librarian')

While this returns a collection of employees who are 40 or younger in age, or who are not librarians.

(each.age <= 40) | (each.job = 'librarian') not

Combining Range Predicates

Queries that use less than or greater than, such as each.age >= 18, define a starting (or ending) point in a range query. Specifying both a starting point and ending point creates a range query. For example,

(18 <= each.age) & (each.age <= 65)

Using the GsQuery API, but not in the traditional API, these two terms can be combined into single range predicate. For example:

18 <= each.age <= 65

Range specifications such this can only be defined with this syntax if the operands and comparison operators truly define a range.

Selection Block Queries

To create a query based on your search criteria, you can use either GsQuery or traditional selection block syntax. Both types of query syntax produce the same results. Your existing GemStone indexes will use the selection block syntax; GsQuery has the advantage of allowing programmatic management of the query, and there are a number of specialized query features that are only available when using GsQuery.

Selection Blocks

Selection blocks are a kind of block specialized for queries, using curly braces instead of brackets. The compiler understands this syntax and creates the selection block instance when the code or method is compiled. Selection blocks require exactly one argument and do not allow message sends within them; there are other restrictions on allowed operations within a selection block, as described under Query Predicate Syntax.

A selection block query might be written like this:

{:each | each.address.state = 'OR'}

As with the equivalent iteration methods that use ordinary blocks with square brackets, each represents the elements of the collection, and in this case, the dot-path syntax resolves the object at the instance variable #address, which object has an instance variable #state.

In selection block queries, you can reference temporary, instance or other variables within the block, and these are resolved at runtime as in ordinary blocks.

Executing Selection Block Queries

A selection block is used with reject:, collect:, detect: and detect:ifNone:, to perform the query over a collection.

For example:

Employees select: {:each | each.address.state = 'OR'}

Selection blocks can be also used with reject:, collect:, detect: and detect:ifNone:. These have the same semantics as with standard blocks executed on a collection. For example, reject: will return a result set that includes all elements for which the block evaluation would return false.

Return values

A selection block query returns a new instance of collection of the same class as the base collection. So if you query on a collection that is an instance of SetOfEmployees, for example, the results will be returned in an instance of SetOfEmployee.

The exception is if you root collection is an instances of a reduced-conflict (Rc) collection, such as RcIdentityBag. The result of a query in this case is a non-Rc collection. The results of a query on a RcIdentityBag are returned as an instance of IdentityBag. This avoids the overhead that supports the reduced-conflict classes.

The collection returned from a query has no index structures. If you want to perform indexed selections on the new collection, you must build the necessary indexes on the new collection.

Queries using GsQuery

GsQuery is a programmatic way to define a query, allowing you to easily abstract, store and reuse various aspects of the query. While GsQuery provides more query features, the query is internally similar to the processes used by selection block queries, and the query results will be the same.

Creating and Executing a GsQuery

To create a query using GsQuery, you create an instance of GsQuery, specifying the details of the search. There are a number of ways to specify the search; the most simple is by passing in a string. For example:

GsQuery fromString: 'each.age >= 18' 

This message will return an instance of GsQuery. Before it can be executed, it must be bound to a collection. You can create the GsQuery using fromString:on: in order to create a GsQuery that is bound to a particular collection, or you can bind the collection later using the on: method. Sending queryResult to the GsQuery will return the results of the query.

The following two examples illustrate creating and executing a bound query, and creating and executing a query that is not associated with a collection, and binding it before execution.

(GsQuery fromString: 'each.age >= 18' on: Employees)
	queryResult.
 
(GsQuery fromString: 'each.age >= 18')
	on: Employees;
	queryResult.

Since the fromString: protocol requires a string, if the query includes literal strings, they must be double quoted. For example:

GsQuery fromString: 'each.firstName = ''Fred'''.
Creating a GsQuery from a selection block

If you have existing code that includes selection block queries, you can use those selection blocks to create the instances of GsQuery.

For example,

GsQuery fromSelectBlock: {:each | each.address.state = 'OR'}

This can be bound using on:, or created using fromSelectBlock:on:, similarly to how you create and bind a GsQuery from a string.

You may also create the GsQuery from a saved query formula, previously extracted from another GsQuery; this is described here.

Query Variables

The strings used to define GsQuery instances may contain variables—any element of a predicate that is are not a literal or path-dot expressions. This allows your query to be stored and executed later using different values.

For example, for a query such as

GsQuery fromString: '18 <= each.age <= 65'

This can be generalized to a query with variables:

GsQuery fromString: 'min <= each.age <= max'.

The resulting formula in the GsQuery includes 'min' and 'max' as variables. These must be bound to specific values before the query can be executed. Binding is done by sending the bind:to: message to the query. For the above example, to execute the query:

aQuery := GsQuery fromString: 'min <= each.age < max'. 
aQuery 
	bind: 'min' to: 18; 
	bind: 'max' to: 65; 
	on: myEmployees; 
	queryResult

Note that the “max” and “min” in the query formula are string elements, and are not affected by any temporary or instance variables named max or min in the scope of the code being executed. The only way to resolve max and min are by binding variables.

GsQuery’s Collection protocol

To get the query results, you can send queryResult to the instance of GsQuery. GsQuery accepts other Collection protocol, which it responds to as if the GsQuery were the query result Collection.

You can send asArray or asIdentityBag to the GsQuery directly, for example:

(GsQuery fromString: 'each.address.state = ''OR''' 
	on: Employees) asArray

Performing one of the collection operations that are provided for GsQuery simplifies your code, since you may not have to put results in temporary variables. It may or may not allow you to avoid creating query result objects. Enumeration methods also allows you to perform code while the query is executing, rather than waiting for the results.

GsQuery and cacheQueryResults

By default, each time you execute any GsQuery collection protocol, the query is performed again. So sending includes: or isEmpty to a GsQuery, by default, does not avoid executing the query during a subsequent queryResult.

You can cache the results of your GsQuery by specifying an instance of GsQueryOptions with cacheQueryResult: set to true. This will cache the result set of the GsQuery. Note that this cache will not reflect changes in the root collection that occurred after the query was executed; you are responsible for re-running the query if current results are required.

You may pass in an instance of GsQueryOptions using the fromString:options: and related protocol for GsQuery.

For example:

query := (GsQuery fromString: 'each.address.state = ''OR''' 
	options: (GsQueryOptions cacheQueryResult)
	on: Employees).
query isEmpty ifTrue: [^'no results'].
report := self createReportingStructure. 
query do: [:ea | report updateDataWith: ea].
...
GsQuery enumeration methods accepting blocks

Among the collection protocol that GsQuery understands are the methods do:, select:, reject:, collect:, detect: and detect:ifNone:. While these look similar to fetching query results using selection blocks, since the actual query is already provided by the GsQuery, there are key differences.

With GsQuery, it’s important to remember that these will operate on the result set of the initial query. In essence, you are adding an additional, non-indexed search criteria to the indexed query. This additional code will be executed for each element in the collection for which the indexed query matches, at the time that the index query is examining that result element.

For example, if you have an index on Employee age, and a query such as:

(GsQuery fromString: 'each.age <= 18' on: Employees)

Using this query, you can add an additional search criteria using select:, so that only Employees who live in Oregon are returned.

(GsQuery fromString: 'each.age <= 18' on: Employees) select: 
	[:each | each state = 'OR']

This will return a result set that includes Employees under 18 who live in Oregon. The state message is only sent to the elements (Employees) who are under 18, it is not executed for every element in the collection.

Order of results

Provided there is an index on the query path, the enumeration block operates on each object in the result set in the order specified by the index. However, since the select: or other method will necessarily return a kind of UnorderedCollection (see Return values), the objects in the collection returned by the enumeration method will be not be ordered.

You can use the enumeration protocol to produce results that are ordered according to the index. For example:

resultArray := Array new.
(GsQuery fromString: 'each.age <= 18' on: Employees) do: 
	[:each | resultArray add: each].

However, for ordered results, you may want to stream over the results instead.

Efficiency of query vs. enumeration

It is more efficient to perform an indexed query using GsQuery than to add additional criteria using enumeration methods.

For example, the following code returns a collection of all employees who are 26 or younger, and who respond false to hasOtherHealthInsurance.

GsQuery fromString: 'each.age <= 26' on: myEmployees)
	reject: [:each | each hasOtherHealthInsurance]

This is useful if you have predicates that require message sends. However, if you can formulate the second statement as an indexable predicate, it would be more efficient. If hasOtherHealthInsurance was actually an instance variable, you could write this as:

(GsQuery fromString: '(each.age <= 26) &
	(each.hasOtherHealthInsurance) not' on: myEmployees)
	queryResults
Early exit from execution

Since the code in the block provided to select: (and similar methods) is executed for each element that the indexed query itself would return, this provides a way to exit the indexed query early. In this block, you can execute any code (as long as it does not modify the collection or the objects in the collection in ways that would change the result set). Based on this code, if it’s no longer useful to continue the search, you can exit the block.

For example, say you have a collection of purchase orders, and you are generating a report of all open purchase orders. If a new order arrives during the period you are executing this operation, you might want not want to bother producing the already-obsolete report.

(GsQuery fromString: 'each.isOpen' on: MyOrders) do: 
	[:anOrder |
	report add: anOrder description.
	self checkForNewOrders ifTrue: [^'report cenceled']
	]

Return values

GsQuery >> queryResult will, like selection block queries, return a new instance of collection of the same class as the base collection, unless protocol such as asArray are used to specify the class of the results.

Also similarly to selection block queries, queries on instances of reduced-conflict (Rc) collections, return the equivalent non-Rc collection.

The collection returned from a query has no index structures. Indexes belong to specific instances of collections, rather than the classes. If you want to perform indexed selections on the new collection, you must build the necessary indexes on the new collection.

Query results as Streams

It may be more useful to return the result of an equality query as a stream, instead of a collection, especially if the result set is large. Returning the result as a stream not only is faster, is also avoids the need to have all the result objects in memory simultaneously.

Streaming on index results return the results in order that is defined by the index, so you can iterate over the elements that are returned in the order defined by the index, with no extra effort.

To get the results as a stream, use the message GsQuery >> readStream. or UnorderedCollection >> selectAsStream:. These methods return an instance of RangeIndexReadStream, which is similar to a ReadStream but specialized for index results.

You can then iterate the results using standard stream protocol. Instances of RangeIndexReadStream understand the messages next, atEnd, and similar ReadStream protocol.

Streams do not automatically save the resulting objects. If you do not save them as you read them, the results of the query are lost. You should not modify the objects in the base collection while streaming, nor add or remove objects; doing so can cause an error or corrupt the stream.

For example, suppose your company wishes to send a congratulatory letter to anyone who has worked there for thirty years or more. Once you have sent the letter, you have no further use for the data. Assuming that each employee has an instance variable called lengthOfService, and there is an index on this, you can use a stream to formulate the query as follows:

oldTimers := (GsQuery fromString: 'each.lengthOfService >= 30'
	on: myEmployees) readStream. 
[ oldTimers atEnd ] whileFalse: [  
	| anEmployee | 
	anEmployee := oldTimers next. 
	anEmployee sendCongratuations. ].

The selection block query interface uses the message selectAsStream: to create a stream on the query results. This is handled the same as a GsQuery readStream.

oldTimers := myEmployees selectAsStream: 
	{:anEmp | anEmp.lengthOfService >= 30}.
[ oldTimers atEnd ] whileFalse: [
	| anEmployee | 
	anEmployee := oldTimers next. 
	anEmployee sendCongratuations. ].

Limitations on streamable queries

Streams on query results have certain limitations; for example, the predicate in the query must be logically streamable. The following restrictions apply:

  • It takes a single predicate only; no conjunction of predicate terms is allowed. The exception is with if two predicates can be automatically combined to form a single range predicate. So, for example, (each.age > 18) & (each.age <= 65) is legal, since it can be reformulated as a single range predicate,
    (18 < each.age <= 65).
  • The predicate can contain only one path.
  • The collection you are searching must have an equality index on the path specified in the predicate.

 

7.3 Creating Indexes

To execute a query efficiently, you need to create an index on the instance variables that you want to query on. These indexes provide a mapping from the specific key values that you are interested in to the results (the objects in the collection).

The path you provide when creating an index provides the key. These keys are objects in the collection, or the values of a specific instance variable within the elements of a collection.

For example, given a collection of Employees, and the path each.address.state, the objects at the state instance variable (perhaps a two-character String) would be the key. Then when you make an indexed query for Employees with addresses in a given state, that state key is used to lookup the matching elements (instance of Employee).

Equality and Identity Indexes

Indexes fall into two main types: Equality Indexes and Identity Indexes. Equality indexes keep keys in a btree structure, which provides tree-based lookup. This allows greater-than and less-than, and sorted range results, to be produced. IdentityIndexes store keys in a hashed identity dictionary, which only allows lookup by identity.

When creating an index, you specify whether an equality or identity index is created.

Then, in the indexed query, the comparison operator controls the type of index that is used. Queries containing >, >=, <, <=, =, and ~= use an equality index. Queries containing == and ~~ will look for an identity index.

If you only have an identity index on a variable, but form your query using an equality operator, the query will not have an index to use (and thus, will iterate the collection).

The exception to this is if your equality index is on a “special” object, such as a SmallInteger, SmallDouble, Character, or Boolean, in which equality and identity are the same. This results in an implicit index (see Implicit Indexes), which can be used to make identity based queries.

You may create both equality and identity indexes on the same path.

Specialized subtypes of Indexes

Within the general types of indexes, there are some variations with special features.

Unicode Indexes

The Unicode Index is a type of Equality Index that allows you to index instances of Unicode strings— Unicode7, Unicode16, and Unicode32—which require a IcuCollator to compare. See Unicode String Indexes and Queries for details.

Reduced-conflict Equality Indexes

An Rc Equality Index is a type of Equality Index in which internal indexing structures are reduced-conflict. This avoids some transaction conflicts when creating an index on a reduced-conflict (RC) collection, such as RcIdentityBag. Reduced-conflict classes are described in Indexes and Concurrency Control. Rc Equality indexes are described under Reduced-Conflict Indexes.

Implicit Indexes

Most of the indexes you will use for your queries are created explicitly, by executing code specifying a particular indexed path.

However, under some cases, implicit indexes are also created as a side effect. These indexes can be used to perform indexed queries. You cannot manage or remove them, however; they can only be removed by removing the primarily explicit index for which the implicit index is a side-effect.

Implicit indexes include:

  • In the process of creating an index on a nested instance variable, GemStone Smalltalk also creates identity indexes on the values that lie on the path to that variable.

For example, if you create an equality index on 'name.lastName', it also creates an identity index on name. By creating this index, you can make indexed identity queries on the objects specified by name, without explicitly creating an index on name.

  • An implicit identity index is also present if you create an equality index on a Special, such as a SmallInteger, SmallDouble, Character, or Boolean, in which equality and identity are the same.

Creating indexes using GsIndexSpec

To create an index using GsIndexSpec, do the following:

Create an instance of GsIndexSpec

This is done by executing GsIndexSpec new

Define one or more indexes on the Spec

To define an index, send an index creation message to the GsIndexSpec, including the path you want indexed, the class of the last element (for equality indexes), and options (if used).

Index creation messages include, for example, equalityIndex:lastElementClass: and identityIndex: (see the list below).

Create the index on a specific collection

To actually create the index, send the message createIndexesOn:, providing the specific collection on which you want to create the indexes.

To put this all together, for example:

GsIndexSpec new
	identityIndex: 'each.userId';
	equalityIndex: 'each.age' lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state' lastElementClass: String;
	createIndexesOn: myEmployees.

This creates an identity index on userId, an equality index on age, and another equality index on address.state, all on the collection myEmployees.

You can view the indexes by recreating the specification code, using indexSpec. For example:

myEmployees indexSpec
GsIndexSpec new
	identityIndex: 'each.userId';
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state'
		lastElementClass: String;
	yourself.

The expressions that create a GsIndexSpec can be stored as instances or as code, and can be used along or in conjunction with other GsIndexSpec instances, to create the same set of indexes or a customized set of indexes on any collections that contain objects that implement the given paths.

The following index creation methods are defined on GsIndexSpec :

equalityIndex:lastElementClass: 
equalityIndex:lastElementClass:options: 
identityIndex: 
identityIndex:options:unicodeIndex:
unicodeIndex:collator: 
unicodeIndex:collator:options:

GsIndexOptions

A instance of GsIndexOptions specifies optional additional refinements that will be used when creating a particular index on a collection. A GsIndexOptions instance is provided by using the variants of the index creation methods that include the options: keyword. GsIndexOptions are not used by traditional index creation.

The options available for GsIndexOptions are:

GsIndexOptions class >> reducedConflict
Returns an instance of GsIndexOptions that specifies that the index is reduced-conflict. This applies for equality indexes, making these into Rc Equality Indexes.

GsIndexOptions class >> optionalPathTerms
Returns an instance of GsIndexOptions that specifies that the collection is allowed to be non-homogenous, that each element of the collection is not required to include all indexed instance variables on the path. May applies to any index.

GsIndexOptions can be combined using the plus operator and removed using the minus operator. For example:

GsIndexSpec new
	equalityIndex: 'each.name' 
	lastElementClass: String
	options: (GsIndexOptions optionalPathTerms +
		GsIndexOptions reducedConflict)
 

Creating indexes using UnorderedCollection protocol

UnorderedCollection provides protocol to create indexes. This creates the same index structures as GsIndexSpec, but does not provide access to some index features.

The following index creation methods are defined on UnorderedCollection:

createIdentityIndexOn:
createEqualityIndexOn:withLastElementClass:

For example, this message requests that myEmployees creates an identity index on the instance variable userId within each of its elements:

myEmployees createIdentityIndexOn: 'userId'.

And these messages request to create equality indexes on the instance variables age and name:

myEmployees 
	createEqualityIndexOn: 'age' 
	withLastElementClass: SmallInteger.
myEmployees 
	createEqualityIndexOn: 'address.state' 
	withLastElementClass: String.

Together, these three statements create the same indexes that were provided in the example in the previous section.

Reduced-Conflict Indexes

When creating an equality index on a collection that is reduced-conflict, such as RcIdentityBag, some multi-user commit conflicts may be avoided by creating the indexing structures themselves as reduced-conflict. This option is not particularly useful if your collection is not reduced conflict (such as IdentitySet, etc.), since this collection will encounter any commit conflicts as well.

This doesn’t apply to identity indexes, which are always reduced-conflict.

If you are creating an index on an RcIdentityBag, to make the index reduced-conflict, use an index creation method with the options: keyword, and pass in GsIndexOptions reducedConflict. For example:

GsIndexSpec new
	equalityIndex: 'each.name' 
	options: (GsIndexOptions reducedConflict)

Using UnorderedCollection index creation protocol to create an index, the message is:

UnorderedCollection >> createRcEqualityIndexOn:withLastElementClass:

Optional pathTerms

A homogenous collection is one in which each element in the indexed collection defines the instance variable described by the index, for each pathTerm in the indexed path. By default, indexes require that the collection be homogeneous. If any element does not have the given instance variable, it will raise an error when the element is added to the collection.

If you want to create an index on a non-homogenous collection, you can define the indexes with optional pathTerms using GsIndexSpec protocol. Use an index creation method with the options: keyword, and pass in GsIndexOptions optionalPathTerms. For example:

GsIndexSpec new
	equalityIndex: 'each.nickName' 
	options: (GsIndexOptions optionalPathTerms)

When creating an optional pathTerm index, it is not an error when the objects in the collection do not implement an instance variable specified by the index. For a multi-pathTerm index, that includes each pathTerm; objects with missing instance variable definitions for any of the pathTerms in the indexed path are not considered when creating query results.

If you create an index with a pathTerm for an instance variable that does not exist at all (perhaps due to a typing error), then the index is created correctly and does not report an error, even if it does not create the index you might have intended to create.

While Indexes are Being Created

Indexing a large collection will take some amount of time to create the infrastructure and tracking for each indexed object.

The message progressOfIndexCreation returns a description of the current status for an index as it is created.

Queries during index creation

While the index is being created, the index is write-locked. Any query that would normally use the index is performed directly on the collection, by brute force. If a concurrent user modifies an object that is actively participating in the index at the same time, index creation is terminated with an error.

Auto-commit

Creating or removing an index creates and/or modifies many objects related to the internal structures that support indexes. These modifications are uncommitted changes that must be kept in the session’s memory until these changes are committed. Many uncommitted changes place a large demand on memory and creates a risk of out of memory conditions. Chapter 8, “Transactions and Concurrency Control”, explains uncommitted objects and transactions in more detail, while Chapter 14, “Performance and Optimization” includes information on object memory use.

To avoid problems during index creation, it is often necessary to set the IndexManager to autoCommit. The IndexManager controls overall index behavior, and is described in more details under Managing Indexes. When IndexManager is set to autoCommit, it will commit the partially created index, rather than risk running out of resources and failing the index operation.

By default, autoCommit is false. When you send the following message:

IndexManager autoCommit: true

it configures your IndexManager such that the current transaction is committed during an indexing operation, whenever any of the following occur:

  • The current session receives a signal indicating temporary object memory is almost full.
  • The percentage of temporary object memory in use reaches the IndexManager’s setting for percentTempObjSpaceCommitThreshold.

The default is 60. This threshold can be changed using IndexManager >> percentTempObjSpaceCommitThreshold: anInt

  • The current session receives a signal to FinishTransaction. This occurs when the commit record backlog is larger than STN_SIGNAL_ABORT_CR_BACKLOG, and this session is holding the commit record.
  • The number of modified objects in the current transaction reaches the IndexManager’s setting for dirtyObjectCommitThreshold.

The default is SmallInteger maximum value, which means this limit is effectively disabled.This limit can be changed using IndexManager >> dirtyObjectCommitThreshold: anInt

When autoCommit is true, a transaction will be started (if necessary) before the indexing operation begins, and the IndexManager will commit at the completion of the indexing operation. Note that this means that, even if you are in manual transaction mode and not in a transaction, index operations will cause changes to be committed to the repository without you explicitly beginning a transaction.

If you want to enable autoCommit only for the current session, not for all index creation, you can use

IndexManager sessionAutoCommit: true

7.4 Special Kinds of Queries and Indexes

Previously the basic kinds of indexes, on the instance variables of your objects, have been described. The following are some special kinds of indexes, providing some specialized behavior.

Unicode String Indexes and Queries

Equality indexes are inherently ordered. So, while you may search for an object using equality, the ability to find that object using the internal Btree requires that the keys be ordered (collated) in a predicable way with respect to all other keys in that index. The ordering of keys in an equality index is more clear for queries that involve less than and greater than comparisons.

While objects such as numbers and dates have an obvious and fixed ordering, with strings this is more complicated, since different languages may order strings differently. This is described in more detail in Chapter 4.

You may safely create indexes on traditional strings, which have a fixed collation order (you must not change collation using customized Character Data Tables without removing all indexes in the repository).

Unicode strings, which are instance of Unicode7, Unicode16, and Unicode32, allow locale, language, and usage specific collation by relying on instance of IcuCollator. Unicode strings provide a more powerful way to order strings according to the specific requirements for your application. Any ordering that involves a Unicode string needs an IcuCollator, and since the default IcuCollator can change, Unicode strings cannot be used in an ordinary index.

To create an index on a final String element that will permit Unicode strings to be used, you must create a unicode index. A unicode index persists an instance of IcuCollator, which will be used for all comparisons0 within that index. This assures that you can locate elements correctly for a given key, whether it be a unicode string or a traditional string.

NOTE
With Unicode Comparison Mode, these rules change; with this setting, traditional strings behave like Unicode strings in comparison operations. See Unicode Comparison Mode for more information.

Creating Unicode Indexes

All unicode indexes require an instance of IcuCollator. An immutable copy of this IcuCollator is persisted as part of the index, and is used for all queries on that index, regardless of the current locale.

If you don’t explicitly specify an instance of IcuCollator, than the current default IcuCollator is used, and will be used for all comparisons on this index.

You cannot change the collator of an index after it has been created.

GsIndexSpec

GsIndexSpec provides special protocol to create unicode indexes. The following methods are available:

unicodeIndex:
unicodeIndex:collator: 
unicodeIndex:collator:options:

If you do not specify a collator, a copy of the current default IcuCollator will be made invariant and persisted with the index. Otherwise, you may specify a collator using standard IcuCollator methods, such as IcuCollator class >> forLocaleNamed:. See Chapter 4 for more information on IcuCollator.

UnorderedCollection protocol

You may also create unicode indexes using the traditional UnorderedCollection protocol, by specifying a lastElementClass of any Unicode string class (Unicode7, Unicode 16, or Unicode32).

Since no collator can be specified, the index will be created using the current default IcuCollator.

Due to the way Unicode strings fit into the CharacterCollection hierarchy, the semantics of specifying the lastElementClass are different for CharacterCollection classes, and do not follow the usual hierarchy rules.

  • When Unicode7, Unicode 16, or Unicode32 is specified as a lastElementClass, a Unicode index is created, using the current default collator. In addition to allowing instances of Unicode7, Unicode 16, and Unicode32 (regardless of which of these classes is specified), instances of any subclasses of CharacterCollection are allowed.
  • When a lastElementClass of String or CharacterCollection is specified, this specifically disallows instances of Unicode7, Unicode16, and Unicode32, although otherwise the hierarchical meaning of the lastElementClass applies.

Example

The following example demonstrates creating a unicode string index, which will collate according to the rules for the German language as used in Germany:

GsIndexSpec new 
		unicodeIndex: 'each.lastName'  
		collator: (IcuCollator forLocaleNamed: 'de_DE'); 
	createIndexesOn: myEmployees.

Queries are created and executed as for equality indexes on ordinary strings. When performing a query, the results are located and ordered according to the collation rules of the IcuCollator that was used to create the index.

For example, since the index was created above with German-language collation, the following query will return results that are correct for German collation:

GsQuery  
	fromString: '''Weiß''<= each.lastName <= ''Weiz''' 
	on: myEmployees

Enumerated path terms in indexes and queries

Enumerated path terms allow you query over more than one instance variable value in a single query. This is specified using the vertical bar | in the path term, between the instance variable names.

The instance variables are treated as alternate choices; if any one of the specified instance variables matches the search criteria, the predicate evaluates to true.

For example, you might want to search on both first name and nickname in a single operation. The query might look like this:

(GsQuery fromString: 'each.firstName|nickName = ''Freddie''' 
	on: MyEmployees) queryResult

When this is executed, the results will include all instance that have either the firstName equal to ‘Freddie’, or the nickName ‘Freddie’, or both.

In order to optimize this query with an index, you need to create an index on the specific enumeration, e.g. 'each.firstName|nickName'. An enumerated path term query will not use an index on the individual instance variables that are enumerated.

Restrictions on predicates with enumerated pathTerms

The semantics of enumerated pathTerms do not allow multiple conjoined predicates using the same enumerated pathTerm, since each predicate is evaluated separately. (conjoined predicates are those connected using &).

Collection path Indexes and Queries

Your business objects may themselves contain collections; for example, an employee may contain a collection of children; and you may want to search based on some criteria of the objects in that collection. As long as this collection is itself indexable, indexes and queries can include all elements within these contained collections.

Index paths that include collections, and the queries that use these indexes, are sometimes called Set-valued indexes and queries, although any kind of indexable collection, not just Sets, may be used.

When you wish to specify a path containing an instance of a subclass of UnorderedCollection, the collection is represented by an asterisk *. This syntax may be used to create indexes and perform queries. However, only GsQuery may be used to perform set-valued queries.

For example, suppose you want to know which of your employees has children of age 18 or younger. To facilitate such queries, each of your employees has an instance variable named children, which is implemented as a set. This set contains instances of a class that has an instance variable named age.

To create the index:

GsIndexSpec new
	equalityIndex: 'each.children.*.age' 
		lastElementClass: SmallInteger;
	createIndexesOn: myEmployees. 

Set-valued query results

When you execute a set-valued query, the results you get will follow the particular semantics of Set-valued queries. Since there are potentially multiple “true” query results for a given element in the base collection, the result of a set-valued query such as this can be larger than the original collection.

For example, consider the following query, using the index created above:

(GsQuery fromString: 'each.children.*.age <= 18'
	on: myEmployees) queryResults

In this example, if the root collection myEmployees is a Bag or IdentityBag (rather than a Set or IdentitySet), and an employee has two children that are under 18, then that employee will appear in the results (a Bag or IdentityBag) twice. Employees with three minor children appear in the results three times, and so on. The resulting collection may be several times as large as the original collection, depending on the details of the query and data.

If the root collection myEmployees is a Set, which does not allow multiple instances of the same object, this potential source of confusion does not occur.

Restrictions on predicates in set-valued queries

The semantics of set-valued indexes do not allow multiple conjoined predicates that use the same set-valued pathTerm, since each predicate is evaluated separately. (conjoined predicates are those connected using &).

In general, it is recommended to avoid using multiple- set-valued predicate queries, although some multiple-predicate set-valued queries can be optimized, or avoid the problem cases, and are safe and therefor allowed.

Redefined Comparison Messages

When indexed queries are executed for instances of basic classes or subclasses of basic classes (see Basic Classes optimized for indexes), the comparison operators are not performed as message sends, and you cannot change the operation of a query by redefining the comparison messages in a GemStone kernel class. In other words, for predefined GemStone classes, the comparison operators really are operators in the traditional programming language sense; they are not messages.

For example, if you recompiled or subclassed the class Time, redefining < to count backwards from the end of the century, GemStone Smalltalk would ignore that redefinition when < appeared next to an instance of Time inside a selection block. GemStone Smalltalk would simply apply an operator that behaved like Time’s standard definition of <.

For subclasses that you have created that are not subclasses of basic classes, however, equality operators can be redefined. If you do so, the selection block in which they are used performs the comparison on the basis of your redefined operators—as long as one of the operands is the class you created and in which you redefined the equality operator.

If you redefine any, you must redefine at least the operators =, >, <, and <=. You can redefine one or more of these in terms of another.

The operators must be defined to conform to the following rules:

  • If a < b and b < c, then a < c.
  • Exactly one of these is true: a < b, or b < a, or a = b.
  • a <= b if a < b or a = b.
  • If a = b, then b = a.
  • If a < b, then b > a.
  • If a >= b, then b <= a.

You must obey one other rule as well: objects that are equal to each other must have equal hash values. Therefore, if you redefine =, you must also redefine the method hash so that dictionaries will behave in a consistent and logical manner. This rule is not limited to indexes, but applies to any object that will be stored in any dictionary.

WARNING: Modifying an existing object in such a way that its hash value changes will corrupt hashed collections containing that object. Use care if you need to redefine the hash method; do not refer to instance variables that are likely to change.

 

7.5 Managing Indexes

You may need to find out about all the indexes in your system, and to remove selected indexes or clean up indexes that were not successfully created. This functionality is provided by the class IndexManager.

IndexManager has a single instance which provides much of the functionality, accessible via:

IndexManager current

This instance is lazy initialized, and stored in the IndexManager class instance variable after it is created. Any configuration you do on IndexManager current, therefore, will be used by all affected operations, if you commit after making the change.

Indexes on temporary collections

You may create indexes on temporary collections containing temporary and persistent objects. However, on abort, any indexes on temporary collections are removed.

Inquiring About Indexes

For a full description of the indexes on a particular collection, send indexSpec to the collection. This produces a string containing the GsIndexSpec code that would recreate the same indexes, and provide useful documentation on those indexes.

For example,

myEmployees indexSpec 
%
GsIndexSpec new
	equalityIndex: 'each.age'
		lastElementClass: SmallInteger;
	equalityIndex: 'each.address.state'
		lastElementClass: String;
		options: GsIndexOptions reducedConflict;
	identityIndex: 'each.userId';
	yourself.

You can also send messages to the collection that will return quick information on indexed paths.

  • equalityIndexedPaths and identityIndexedPaths

Returns, respectively, the equality indexes and the identity indexes on the receiver’s contents. Each message returns an array of strings representing the paths in question.

For example, the following expression returns the paths into myEmployees that bear equality indexes:

myEmployees equalityIndexedPaths
%
anArray( 'age', 'address.state') 
  • kindsOfIndexOn: aPathNameString

Returns information about the kind of index present on an instance variable within the elements of the receiver. The information is returned as one of these symbols: #none, #identity, #equality, #identityAndEquality.

  • equalityIndexedPathsAndConstraints

Returns an array in which the odd-numbered elements are the elements of the path, and the even-numbered elements are the constraints specified when creating an index using the keyword withLastElementClass:.

The following IndexManager messages allow you to inquire about all indexes in the repository.

  • getAllNSCRoots

Returns a collection of all UnorderedCollections in the repository that have indexes.

  • usageReport

Returns a report on all indexes on all UnorderedCollections in the repository.

Removing Indexes

There are a number of ways to remove indexes, using GsIndexSpec, IndexManager, and UnorderedCollection protocol.

Since indexing internal structures create references to the indexed collection and to objects in the collection, before dereferencing a collection, you should be sure to remove all indexes on the collection. This allows the collection to be garbage collected.

To remove indexes based on a GsIndexSpec

As you can create indexes based on an instance of GsIndexSpec, you can also use that specification to remove these indexes.

GsIndexSpec >> removeIndexesFrom: aCollection

Removes the indexes described by the receiver from the collection indicated by aCollection. If any specified indexes do not exist, they are not removed and no error is returned.

This is most useful in combination with the method that creates the spec from the existing collection. For example:

(MyEmploees indexSpec) 
	removeIndexesFrom: MyEmployees.

To remove a single index, you may edit the specification code printed by indexSpec, or create a simple GsIndexSpec with information to remove a single index:

(GsIndexSpec new 
	equalityIndex: 'each.age' lastElementClass: Object)
		removeIndexesFrom: MyEmployees.

To remove indexes using IndexManager

IndexManager, which provides a system-wide view of all the indexes in the repository, provides a number of methods to remove indexes both individually, by collection, and globally.

IndexManager >> removeEqualityIndexFor: aCollection on: aPathString 

Removes an equality index from the collection aCollection with the indexed path described by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

IndexManager >> removeIdentityIndexFor: aCollection on: aPathString 

Removes the identity index from the collection aCollection with the indexed path described by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

IndexManager >> removeAllIndexesOn: aCollection

Removes all explicitly created indexes from the collection aCollection. Implicit indexes that were created by these elements participating in other indexed collections are not removed.

IndexManager >> removeAllIndexes

Removes all indexes on all UnorderedCollections, including all implicit and partial indexes.

IndexManager >> removeAllTracking

Removes all indexes on all UnorderedCollections, and all object tracking. While this is the fastest way and most complete way to remove indexing infrastructure, if you are using modification tracking for any other purpose, that tracking will be removed as well.

To remove indexes using UnorderedCollection protocol

You may also send methods to the indexed collection directly to remove one or all indexes.

UnorderedCollection >> removeEqualityIndexOn: aPathString 

Removes an equality index from the path indicated by aPathString. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

UnorderedCollection >> removeIdentityIndexOn: aPathString 

Removes the identity index on the specified path. If the path specified does not exist, this method returns an error. Implicit indexes are not removed.

UnorderedCollection >> removeAllIndexes 

Removes all explicitly created indexes from the receiver. Implicit indexes that were created by these elements participating in other indexed collections are not removed.

Rebuilding Indexes

When objects that participate in an index are modified, the related indexing infrastructure must be updated. This causes some overhead. If you are performing an operation that will modify a large number of objects that participate in multiple indexes, such as a large migration, it may be more efficient to remove some or all of the indexes on the collection before performing the migrate, and rebuild those indexes after the migration is complete.

It is also sometimes required to remove and rebuild indexes as part of a GemStone upgrade; certain changes in GemStone kernel classes require you to either rebuild specific kinds of, or all, indexes. Any requirement to do this will be included in upgrade instructions in the Installation Guide for the version of GemStone to which you are upgrading.

To remove and rebuild indexes, you can extract and save the GsIndexSpec, and reuse that after the operation is complete.

For example:

| mySpec |
mySpec := myCollection indexSpec.
mySpec removeAllIndexesFrom: myCollection.
<perform migration or other operation>
mySpec createIndexesOn:myCollection

Using IndexManager >> getAllNSCRoots, you may extend this example to retrieve the GsIndexSpec for each collection in the repository, which will allow you to remove and rebuild the indexes.

Indexing and Performance

Under ordinary circumstances, indexing a large collection speeds up queries performed on that collection and has little effect on other operations. Under certain uncommon circumstances, however, indexing can cause a performance bottleneck.

For example, you may notice slower than acceptable performance if you are making a great many modifications to the instance variables of objects that participate in an index, and:

  • the path of the index is long; or
  • the object occurs many times within the indexed IdentityBag or Bag (recall that neither IdentitySet nor Set may have multiple occurrences of the same object); or
  • the object participates in many indexes.

Even so, indexing a large collection is still likely to improve performance unless more than one of these circumstances holds true. If you do experience a performance problem, you can work around it in one of two ways:

  • If you have created relatively few indexes but are modifying many indexed objects, it may be worthwhile to remove the indexes, modify the objects, and then re-create the indexes.
  • If you are making many modifications to only a few objects, or if you have created a great many indexes, it is more efficient to commit frequently during the course of your work. That is, modify a few objects, commit the transaction, modify a few more objects, and commit again. Frequent commits improve performance noticeably.

Formulating queries and performance

The most efficient queries are the ones in which the first predicate will return the smallest result set. This is sometimes easy for a human to determine, but the query cannot predict this without actually running the query. Queries should be manually reviewed for these kinds of domain-specific optimizations.

For example, you might want to query for current orders for a particular customer.

(each.status = #current) & (each.customer.name = 'Smith')

If your application is likely to have only a few current orders, then this is more efficient. However, if you are likely to have many current orders, but only a few customers named Smith, it would be more efficient for you to write the formula in reverse order.

This assumes that both predicates have an associated index. The optimization step will reorders predicates so the indexed predicates will be evaluated before any non-indexed predicates. See Query Formula Optimizations for the automatic optimizations that are done.

Indexing Errors

To ensure that indexing structures are consistent, some kinds of errors that may occur during index creation will disable commits. Before creating an index, it is advisable to commit any work in progress, to avoid losing any work if an indexing error does occur.

For example, if you create an index on a collection and one or more of the objects that participate in the index do not implement the instance variable on the path, it will raise an error (unless using optionalPathTerms, as described here).

If an error occurs partly through index creation, and the autoCommit status (see Auto-commit) means that some portion of the index creation was committed, a collection may have unusable partial indexes. These indexes must be manually removed.

The following IndexManager instance methods allow you to remove incomplete indexes, while not affecting any complete, usable indexes:

IndexManager current removeAllIncompleteIndexes

Removes all incomplete indexes on all UnorderedCollections.

IndexManager current removeAllIncompleteIndexesOn: anNSC

Removes all incomplete indexes on the specified UnorderedCollection.

If you modify objects that participate in an index, try to commit your transaction, and your commit operation fails, query results can become inconsistent. If this occurs, abort the transaction and try again.

Auditing Indexes

Indexes should be audited regularly, as part of your regular application maintenance, to ensure there are no problems.

You can audit the internal indexing structures for a particular collection by executing:

aCollection auditIndexes

This audits all the indexes, explicit and implicit, on the given collection. If indexes are correct, this method returns 'Indexes are OK' or 'Indexes are OK and the receiver participates in one or more indexes.'. If there are no indexes on the collection, a message such as 'No indexes are present.' is returned.

In the case of failure, a list of specific problems is returned.

You can audit all indexes in the entirely repository at once using:

IndexManager current nscsWithBadIndexes

which will return an IdentitySet containing all collections that fail auditIndexes. Depending on the number of indexed collections in your system, this may take a considerable time to run.

In the rare case of a problem reported, the usual way to resolve the problem is to remove and rebuild the affected indexes. In some cases, removing all indexes on the collection may succeed even if the internal problems prevent a single index being removed. If removing indexes is impractical, contact Gemstone Technical Support for further assistance.

7.6 Query Formulas and Optimization

When you define a query, you may be able to most easily write this as a statement of business logic. For more complicated queries, this may not be in its most efficient form.

Automatic query optimization performs some optimizations that can change a query into the logically equivalent form that is more efficient for GemStone to execute.

You can also perform these optimizations manually, by writing your query in a the most efficient form, rather than in the human/business logic form.

Query Formulas

An instance of GsQuery is created on a string, and is internally represented as an instance of GsQueryFormula. The formula provided by the string may not be in its most efficient form for execution. While you can hand-optimize the formula when you create the string, it may be desirable to write the query so it makes sense from a business logic point of view, and is more human-readable.

While a query formula such as

each.numberOfChildren < 3

does not change from optimization, a more complicated query such as this:

((1 <= each.numberOfChildren) not & 
	(each.numberOfChildren <= 3)) not

benefits from optimization; the optimized version of this is

each.numberOfChildren < 1

When a query is being executed or displayed, by default it is auto-optimized. It uses the current formula to create an optimized version according to the optimization rules, and executes or displays the optimized formula. The optimized version is not saved.

You can access the current formula using the formula message. By accessing the instance of GsQueryFormula directly, you bypass the auto-optimization.

You can update the query’s formula by sending the optimize message to the query. This saves the new, optimized formula in place of the current formula in the query.

Queries also retain the original formula with which they are created, so you can still view the human-readable form. These are accessible via the originalFormula message.

For example, if you create a query based on the above formula:

query := GsQuery fromString: '((1 <= each.numberOfChildren) not
	& (each.numberOfChildren <= 3)) not'.
 

You can view the formulas, before and after optimization:

query printString
each.numberOfChildren < 1
 
query formula printString
((1 <= each.numberOfChildren) not & 
	(each.numberOfChildren <= 3)) not
 
query optimize.
query formula printString
each.numberOfChildren >= 1
 
query originalFormula printString
((1 <= each.numberOfChildren) not & 
	(each.numberOfChildren <= 3)) not

Invariance and Formula reuse

Instances of GsQuery are not invariant. However, instances of GsQueryFormula and its subclasses are invariant when created using the public API. This allows formulas to be safely persisted and shared, since side effects of message sends will not change the semantics of the query.

When a GsQuery’s formula changes, such as when variables are bound, or when optimize is sent, a new formula instance replaces the previous one. The particular formula in a GsQuery, therefore, will depend on the stage at which the formula is accessed.

For example, to save and reuse a formula from a GsQuery:

aQuery := GsQuery fromString: 'each.age <= min'.
UserGlobals at: #savedFomula put: aQuery formula.

The formula can be later reused:

(GsQuery fromFormula: savedFomula on: aCollection)
	bind: 'min' to: 18;
	queryResults.

To make a GsQuery invariant so it can safely be reused, send immediateInvariant. The query can later be copied, to bind, optimize and execute. In this example, note that the query is saved with the reference to the collection:

query := GsQuery fromString: 'each.age <= min' on: aCollection.
UserGlobals at: #savedQuery put: query immediateInvariant. 

In this case, the reused query does not need to specify the collection when reused:

savedQuery copy
	bind: 'min' to: 18;
	queryResults.
 

Disabling auto-optimize

Queries, by default, are optimized before execution. Each query has an associated instance of GsQueryOptions. This controls optimization and other query features. In addition to the various specific optimizations performed, GsQueryOptions controls if automatic query optimization is done. The default is to do auto-optimization.

Queries are created with a default GsQueryOptions, or the options can be set on creation using the query creation methods with the options: keyword.

To disable auto-optimization, you can create a query that specifies an instance of GsQueryOptions that has autoOptimize removed. For example:

query := GsQuery 
	fromString: '((1 <= each.numberOfChildren) not &
	(each.numberOfChildren <= 3)) not' 
	on: myEmployees
	options: (GsQueryOptions default - GsQueryOptions
			autoOptimize).

Query Formula Optimizations

The following are the specific optimizations that are performed when a query is optimized.

Remove "not" using boolean logic

Clauses that include a not are transformed using De Morgan's Laws into the logical equivalent form without the not. For example:

(each.firstName = 'Dale') not

becomes:

(each.firstName ~= 'Dale')

Convert predicates with equal operands into boolean constants

Predicates with common operands are transformed to the equivalent constant predicate. For example, either of the following becomes true:

(each.firstName = each.firstName) 
(4 = 4)

Convert constant-path reversed to path-constant

Constant-path predicates are replaced by equivalent path-constant predicates. For example:

(19 > each.age) 

becomes:

(each.age < 19)

Eliminate redundant predicates

Predicates that fall within range of other predicates are removed. For example:

(each.age < 19) & (each.age < 4)

becomes:

(each.age < 4)

This optimization requires that the variables in the query be bound to values.

Combine path-constants into range predicate

Two path-constant predicates on the same path will be converted into a single range predicate, if the predicates represent a range. For example:

(each.age > 4) & (each.age < 19)

becomes:

(4 < each.age < 19)

This optimization requires that the variables in the query are bound to values.

Combine path-constants to enumerated predicate

If an index exists that has an enumerated path term, and there are path-constant predicates using that enumerated path term, the path-constant predicates can be combined into a single enumerated predicate. For example,

(each.firstName = 'Martin') | (each.lastName = 'Martin')

becomes:

(each.firstName|lastName = 'Martin')

This optimization requires that the collection and any variables be bound to the query.

Simplify (true) and (false) predicates

Other optimizations may result in predicates that are unary constants true or false. These are removed, or the entire expression is simplified, depending on the logic.

(true) & <other predicates> becomes: <other predicates>
(true) | <other predicates> becomes: (true)
(false) & <other predicates> becomes: (false)
(false) | <other predicates> becomes: <other predicates>

Reorder predicates

The predicates are reordered as follows, from left to right.

1. predicates involving indexed paths.

2. predicates with identity comparisons on paths without indexes.

3. predicates with equality comparisons on paths without indexes.

 

Previous chapter

Next chapter