4. Collection and Stream Classes

Collections of objects are key features in an application. GemStone provides a variety of Collection classes, including both subclasses of Collection, and other implementations of structures with collection semantics. This chapter describes the main types of collections that are available.

Strings and ByteArrays are kinds of collections that are specialized to hold characters or bytes; these are described separately in Chapter 5.

Introduction to Collections
introduces the GemStone Smalltalk objects that store groups of other objects, and describes the different kinds of collections that are available.

Reduced-Conflict Collection Classes
describes specialized kinds of classes that avoid conflicts in a multi-user system.

GsBitmap
describes GsBitmap, a specialized kind of collection.

Sorting the objects in a collection
describes the ways to sort elements in collections.

4.1 Introduction to Collections

Instances of the Collection subclasses are specialized to manage an indeterminate number of objects as a group using unnamed instance variables.

Collections can be classified by whether or not they maintain a specified order for their elements, whether or not key-based lookup is supported, and the kinds of objects they can reference.

Collections can be broadly classified into basic categories:

Access by Key — the Dictionary Classes

Instances of AbstractDictionary subclasses do not support a specific order for their elements; elements are stored and retrieved via the at:put: and at: messages, using arbitrary objects for an element's key. Subclasses of AbstractDictionary are specialized based on whether key-based lookup uses equality comparison or identity comparison, the type of key, and the type of value.

Dictionaries can also have named instance variables, if you choose to define them.

Access by Position — the SequenceableCollection Classes

Instances of SequenceableCollection classes maintain a specific order for their elements and support storage and retrieval via the at:put: and at: messages using an integer key (the one-based offset into the elements).

Byte-format classes such as ByteArray and String cannot have named instance variables. You may define named instance variables for pointer-format subclasses, such as Array and OrderedCollection.

Access by Value — the UnorderedCollection Classes

Instances of UnorderedCollection classes—also referred to as Non-Sequenceable Collections or NSCs—do not have a specific order for their elements, and do not support storage or retrieval via the at:put: and at: messages. Objects in these collections are accessed by iterating the collection. UnorderedCollections support indexes, which allow ordered iteration and fast key-based lookup.

You may define named instance variables for subclasses of one of the UnorderedCollections subclasses.

Other kinds of Collection

GemStone includes some collection-like classes that do not inherit from Collection, such as GsBitmap. The generalizations about collections made in this chapter may or not apply to such classes.

Efficient Implementations of Large Collections

When you create a collection of more than about 2K objects, or a byte collection larger than 16K, GemStone internally uses a sparse tree implementation to make more efficient use of resources. These are referred to as "Large Objects", and use internal classes such as LargeObjectNode. This behavior occurs in a manner that is transparent to you, and you interact with Large Collections the same as smaller collections; however, these internal objects may be visible when performing object-based audit and analysis.

Modifying objects in collections

Different kinds of collections use different criteria in which to store and locate objects. Most dictionaries use a hashed value, the result of sending the #hash message to each key object. If the result of sending #hash changes, the key may not be found in the collection using methods such as at:.

The ordering of a SortedCollection depends on the results of sending comparison methods to the objects in the collection.

If a hash method or comparison method is defined that depends on values in the instance variables of the objects, and these instance variables are modified, the object may not be found using lookup methods in the dictionary or sorted collection, though iteration will still find them. If you change the value of an instance variable of an object in such a collections, you should remove and re-insert the object in the collection so the lookup methods will work.

Protocol Common to All Collections

Collection classes understand common protocol, inherited from the abstract superclass Collection. Collection defines methods that enable you to perform the general collection operations described in the following sections.

Creating Instances

Collections can be creating using new, new:, with:, and similar protocol. The most basic way to create a new collection is using the message new. When sent to a Collection class, this message causes a new instance of the class with no elements (size zero) to be created. Most kinds of collections can expand as you add additional objects.

new: anInteger, causes many Collection subclasses to create an instance that is pre-sized to hold anInteger elements. This avoids the need to expand the collection when elements are added. Pre-defining the size during creation is particularly important when creating a hashed collection that will hold a large number of objects. Hashed collections store elements in buckets, and the number of buckets must be increased when the number of objects in the collection reaches a threshold for the number of buckets. These expansions are expensive, since it requires that each element be re-added to the expanded collection at the recomputed hashed location.

Several kinds of Collections can be created as literals, using Smalltalk syntax. Arrays, ByteArrays, Strings and Symbols have literal syntax, and Arrays can also be created at runtime using Array constructors. ByteArrays, Strings and Symbols are discussed in Chapter 5.

Enumerating

Collection defines several methods that enable you to loop through a collection’s elements, evaluating a block for each element in the collection.

The message do: aBlock is the most general; it evaluates aBlock for each element.
Methods that iterate through the elements and return collections are collect:, select:, and reject:.

The class of the result collection is often, but not always, the same kind of collection as the receiver. The class methods species, speciesForSelect, and speciesForCollect determine the class of the result.

When sent to SequenceableCollections, these messages preserve the ordering of the receiver in the result. That is, if element a comes before element b in the receiver, then element a will come before b in the result.

The messages detect:, detect:ifNone:, and any iterate to return a single element, based on the order the collection is enumerated. Enumeration stops after the an element is found.
The message anySatisfy: enumerates each element, stopping if any is found that meet the block criteria; allSatisfy: enumerates, stopping if any is found that does not meet the block criteria.

To avoid unpredictable consequences, do not add elements to or remove them from a collection while you are enumerating it.

Collections in multi session environment

In many cases, you will have collections that need to be accessed by multiple sessions in an application. Different sessions may need to read the contents or to add, remove, or modify elements.

Conflicting updates

Due to the transactional nature of GemStone (see Chapter 8, “Transactions and Concurrency Control”), overlapping updates by two sessions may conflict, in which case the second update has failed and the work needs to be repeated.

There are some kinds of conflicts that, while they modify the same object, are not really conflicts in a logical sense. GemStone provides a number of different collection classes that avoid specific kinds of conflicts; these are described under Reduced-Conflict Collection Classes.

Visibility and ordering

When you add an element to a collection, this change does not become visible to other users until you have successfully committed the transaction. While it is important in a multiuser system to avoid long periods in a transaction prior to commit, the requirements are application specific and there may be minutes or hours between the time an object is created and when it is finally committed and visible to other users. Other users, in turn, must abort or commit before they see the changes.

For ordinary (non-reduced conflict) collections, this means that object changes may become visible to other users some time after the change is actually made.

For reduced-conflict sequenceable and queue classes, the order of objects in the collection may depend on the order of the commit, rather than the order of the time in which the object was created.

Collection classes

Collection classes can be grouped by the kinds of access methods they provide and the kinds of objects their instances can store.

Dictionary classes, including Dictionary, KeyValueDictionaries, and KeySoftValueDictionary
SequenceableCollection classes, including Array, OrderedCollection, and SortedCollection,
UnorderedCollection classes, including Bag, IdentityBag, Set, and IdentitySet
Stream Classes, including ReadStream, WriteStream, and ReadWriteStream.
Reduced-Conflict Collection Classes, including RcArray, RcIdentityBag, RcIdentitySet, RcKeyValueDictionary, RcPipe, and RcQueue.
Classes that aren’t subclasses of Collection, including GsBitmap.

String classes, including Traditional and Unicode string classes and Symbols, are also kinds of Collections and understand many kinds of Collection messages. Concerns that are specific to Strings, including String collation, are described in Chapter 5.

This chapter does not attempt to describe all collection classes or all methods that are available; it highlights the most commonly used protocol and describes special features. Review the methods in the image for more details.

Dictionary classes

Dictionaries provide their special facilities by storing key-value pairs instead of simple, linear lists of objects. The elements in a Dictionary collection are stored and accessed via a key; each key must be unique within that Dictionary.

While some types of dictionaries are implemented as “a collection of Associations”, the interface methods return results based on the logical contents, which are the values. Other, specialized protocol allows you to refer to the key or the value portions of the logical associations.

Internal Dictionary Structure

For performance reasons, the internal implementation of Dictionary classes varies. Instances of Dictionary itself consist of a collection of Association objects. KeyValueDictionary subclasses are implemented differently, as a sequence of keys and values, which may use CollisionBuckets to hold the actual values. IdentityDictionary is a sequence of keys and Associations. All these dictionaries understand common protocol, regardless of implementation.

Dictionary and KeyValueDictionary

Dictionary class uses Associations to store the key/value pair, while subclasses of KeyValueDictionary are slot-based. KeyValueDictionary has several subclasses, divided according to the type of key used to access the information:

IdentityKeyValueDictionary
IntegerKeyValueDictionary
StringKeyValueDictionary
SymbolKeyValueDictionary
IdentityDictionary
SymbolDictionary

KeySoftValueDictionary

A KeySoftValueDictionary is a subclass of KeyValueDictionary that allows the virtual machine to remove entries as needed to free up memory.

Typically, you might use a KeySoftValueDictionary to manage non-persistent objects that are large and take time to create, but that can be recreated whenever needed from small, readily available objects (tokens). For example, you might create a KeySoftValueDictionary to serve as a cache to hold large, expensive objects that are needed repeatedly. Within that dictionary, the values would be the large calculated objects, and the keys would be the corresponding tokens. If your application needs a large, expensive object but does not find it in the KeySoftValueDictionary, you can create the object and add it to the cache so that it might be available the next time it is needed.

As memory fills up, the virtual machine might remove some objects from the cache. (Remember, the contents of the cache are non-persistent and can be recreated.) The virtual machine may remove keys and values from the KeySoftValueDictionary until adequate memory is available. For details about how to manage the number of KeySoftValueDictionary entries, see Getting Rid of Non-Persistent Objects.

Keep in mind the following:

Entries are removed from a KeySoftValueDictionary only if there are no strong references to the entry’s value.
If an entry in a KeySoftValueDictionary is cleared, all other entries that reference this value directly or indirectly will also have been cleared.
Before generating an OutOfMemory error, the virtual machine removes all KeySoftValueDictionary entries that are eligible for removal.
KeySoftValueDictionary entries are cleared during a mark/sweep operation, but are not cleared during a scavenge. For more about mark/sweep and scavenge operations, see the “Managing Growth” chapter of the System Administration Guide.
A corresponding subclass, IdentityKeySoftValueDictionary, uses identity (rather than equality) comparison on keys. For details, see the image.
A KeySoftValueDictionary frequently contains instances of SoftReference. Do not be tempted to confuse this with the notion of WeakReference found in many Smalltalk dialects; the two mechanisms are quite different.

SequenceableCollection classes

SequenceableCollections, such as Array and OrderedCollection, let you refer to their elements with integer indexes, and they understand messages such as first and last that refer to the order of those indexed elements. Adding by default adds to the end of the collection.

Copying

When copying a very large instance of a subclass of SequenceableCollection, it can be more efficient to use the method replaceFrom:to:with:startingAt:, which does not fault the contents into memory. This can improve performance significantly for very large collections.

This example copies two elements of an array into a different array, overwriting the target array’s original contents:

| numericArray |

numericArray := Array with: 55 with: 66 with: 77 with: 88.

numericArray replaceFrom: 2 to: 3

	with: #( 1 2 3 4 5 ) startingAt: 4.

numericArray

an Array

 #1 55

 #2 4

 #3 5

 #4 88

Note that, while the replace method does not itself fault the contents into memory, displaying the results as in the example also faults the objects into memory.

Array

One of the most important differences between client Smalltalk arrays and a GemStone Smalltalk array is that GemStone arrays are extensible; you can increase the size of an array at any time. Sending at:put: will increase the size of the array, as long as the index is only one greater than the current array size. Other protocol such as addAll: also increase the size while adding elements.

It’s also possible to change the size without explicitly storing or removing elements, using the message size: inherited from class Object. When you lengthen an array with size:, the new elements are set to nil.

Literal Array and Array Constructors

Arrays can also be created in code without sending instance creation messages, by using literal array or array constructor syntax.

Since Array constructors perform code at runtime, it is more efficient to use Array literals if the contents are literals.

Array Literals are created at compile time, and hold other literal objects. These start with the pound sign, are enclosed in parenthesis and separated by white space. Array literals are defined by ANSI; syntax is described here. They are invariant.

#( 'carrot' 'tomato' 'celery')

Array constructors are created at runtime. These are enclosed in curly braces and separated by a period. Array constructors are GemStone-specific, not defined by ANSI; syntax is described here.

{ Date today . Time now }

SortedCollection

SortedCollection is a type of SequenceableCollection in which the elements are ordered by a specific sort order, not by the order in which they were added or by the method used to add the element. You may not send at:put:, addLast:, or similar methods to a SortedCollection.

Each instance of SortedCollection is associated with a sortBlock. The default block will sort elements that can be compared using <=, which includes strings and numbers. You can also define your own sortBlock, if you want elements ordered by some other criteria, such as the value of an instance variable.

For more on comparison, sorting, and sort blocks, see Sorting the objects in a collection.

Example 4.1

| scrabbleWords |

scrabbleWords := SortedCollection sortBlock:

	[:a :b | a size < b size].

scrabbleWords add: 'able'; add: 'zebra'; add: 'jumper';

	add: 'yet'.

scrabbleWords

aSortedCollection( 'yet', 'able', 'zebra', 'jumper')

There is overhead in always keeping the collection sorted, so it usually more efficient to sort the elements only when you need them to be sorted for presentation. Especially for large collections or collections in which objects are frequently added and removed, consider using another kind of class to store the elements, then using methods such as sortWithBlock: to create a new Array with the elements in sorted order.

SortedCollection sortBlocks are compiled code, and as such, may need to be recompiled on GemStone upgrade. Provided the sortBlock is simple—that it, it does not contain references to variables outside the scope of the block, nor iterative methods—the recompile can be done automatically. Since the sortBlock executes for many element pairs during sort, keeping the sortBlock simple and fast is important for performance in any case.

Stream Classes

A Stream acts like a SequenceableCollection that keeps track of the index most recently accessed. Streams are often used for reading characters from strings or files, but any kind of collection can be used with a Stream, and any type of object can be in that collection.

Commonly used Stream classes are ReadStream, WriteStream, and ReadWriteStream, which come in two variants; the traditional Smalltalk 1-based positional offset, and the ANSI-compliant portable streams with an 0-based offset.

PositionableStream and Position

PositionableStream, with its subclasses ReadStream and WriteStream, was traditionally implemented in GemStone with the position indicating an offset from 1; that is, the first position in the stream was 1.

ANSI specifies, and other Smalltalk dialects use, an offset of 0, so the first position in the stream is 0.

To allow both sets of classes to be available for use, while either one or the other uses the actual class name, GemStone includes the multiple sets of classes, implementing both interfaces. There are four sets of classes, which all exist in the image (and therefore, may have instances), with only three sets being visible at any time. The following two sets are always visible:

Legacy-style PositionableStream classes, compatible with previous GemStone version’s PositionableStream classes:

PositionableStreamLegacy

	ReadStreamLegacy

	WriteStreamLegacy

ANSI-compliant and portable PositionableStream classes:

PositionableStreamPortable

	ReadStreamPortable

	WriteStreamPortable

	ReadWriteStreamPortable

In addition, only one of the following sets is visible, depending on how your system is configured. These are two distinct sets of instances of Class, with the same name, but different implementations.

PositionableStream (with legacy definition and methods)

	ReadStream

	WriteStream

PositionableStream (with portable definition and methods)

	ReadStream

	WriteStream

The legacy versions are stored in Globals at: #GemStone_Legacy_Streams. The portable, ANSI-compatible versions are stored in Globals at: #GemStone_Portable_Streams.

To check what is currently installed, use the following methods:

PositionableStream class >> isLegacyStreamImplementation

PositionableStream class >> isPortableStreamImplementation

To install the portable version, use the method:

Stream class >> installPortableStreamImplementation

To install the legacy version, use the method:

Stream class >> installLegacyStreamImplementation

AppendStream

AppendStream is a kind of Stream that does not maintain a position. It is designed to optimize a common use-case for streams: composing long, complex blocks of text and returning the resulting string.

Like WriteStream, you can add strings and characters to an AppendStream, and like any stream, you can get the entire contents. Many other methods commonly associated with Stream classes are not available, however.

UnorderedCollection classes

Instances of UnorderedCollection store their elements as an internal, tree-based structure referred to as an Non-Sequenceable Collection (NSC). The elements have no defined order within the collection, so methods such as at: and at:put: are disallowed.

UnorderedCollection implements protocol for indexing, which allows for large collections to be queried and sorted efficiently. Chapter 7, “Indexes and Querying”, describes the querying/sorting functions in detail. The most efficient way to handle very large collections is using UnorderedCollection, using GemStone indexes to access the contents.

UnorderedCollections cannot contain nil as an element; adding nil has no effect.

Commonly used UnorderedCollection concrete classes are Bag, Set, IdentityBag and IdentitySet. Since Bag and Set use equality for comparisons, for large collections it is much more efficient to use IdentityBag or IdentitySet, which perform comparisons based on identity (OOP).

Union, Intersection, and Difference

Subclasses of UnorderedCollection provide messages that perform set arithmetic: union, set intersection, and set difference.

+ union, returning elements that are in either one, the other, or both.

- difference, returning elements that are in the receiver but not the argument.

* intersection, returning elements that are in both

Example 4.2

| pets rodents |

pets := IdentityBag with: 'dog' with: 'cat' with: 'gerbil'.

rodents := IdentityBag with: 'rat' with: 'gerbil' with: 'beaver'.

pets * rodents

 anIdentityBag( 'gerbil')

pets + rodents

 anIdentityBag( 'beaver', 'rat', 'gerbil', 'gerbil', 'cat', 'dog')

pets - rodents

 anIdentityBag( 'cat', 'dog')

Avoiding faulting contents into memory

If the argument to addAll: is an Array or OrderedCollection, the elements in the collection are not faulted into memory. For very large collections, or if the objects in the collection are not in the shared page cache and must be read from disk, this can be a significant advantage. Using an IdentityBag as an argument to replaceFrom:to:with:startingAt: allows you to get a copy of the elements without faulting the contents into memory.

Example 4.3

| bagOfRodents |

bagOfRodents := IdentityBag withAll: #('beaver' 'rat' 'agouti'

'chipmunk' 'guinea pig').

(Array new: 5) replaceFrom: 3 to: 5

with: bagOfRodents startingAt: 1.

 anArray( nil, nil, 'guinea pig', 'chipmunk', 'agouti')

4.2 Reduced-Conflict Collection Classes

GemStone provides a variety of reduced-conflict collection classes. These classes are similar to the standard collection classes already described, but include additional processing to avoid transaction conflicts in a multi-user environment.

Each reduced-conflict class has specific types of conflicts they are designed to avoid, and the amount of internal infrastructure or the cost of resolving a conflict varies. Selection of an RC class should consider the demands of the application, and also the costs of the automatic conflict resolution.

For more on transactions and transaction conflicts, see Chapter 8. Further information on the transactional behavior of these RC classes is under the section Classes That Reduce the Chance of Conflict.

RcArray

The class RcArray is similar to Array, but no conflict occurs when multiple users add objects to an RcArray. If a conflict with another update operation on the RcArray occurs, the add is replayed so that the commit can succeed.

Only the following methods support concurrent updates:

add:

addAll:

at:put: (where no other session affects the element at the at: index)

size: (when size is increased)

NSC/UnorderedCollection classes

RcIdentityBag

The class RcIdentityBag provides much of the same functionality as IdentityBag, but with no conflict for multiple sessions that add objects to the bag, and a single session that removes objects.

Internal implementation

RcIdentityBag is internally implemented using an Array of IdentityBags. Each session number corresponds to two IdentityBags, one for additions to the RcIdentityBag, and one for removed elements. Each logged-in session only modifies the IdentityBags corresponding to its own session number. Computing the current contents of an RcIdentityBag means combining the add bags, and removing all the remove bags.

Maintenance

The implementation of RcIdentityBag means that reclaiming the storage of objects that have been removed from the bag actually occurs when a session performs later adds or removes, or after that session logs out, another session logs in as that session number and performs adds or removes.

If a session adds a great many objects to the RcIdentityBag, and then does not do any further adds or removes; or if it logs out and the following sessions to use that session number do not perform adds or removes on this bag, then performance can become degraded and otherwise dereferenced objects in the RcIdentityBag cannot be garbage collected.

The message cleanupBag may be sent to the RcIdentityBag to process removals for inactive sessions. This may cause conflicts if a session logs in and adds or removes an object.

RcLowMaintenanceIdentityBag

RcLowMaintenanceIdentityBag is similar to RcIdentityBag in behavior, but does not require regular cleanup. Rather than using a per-session subcollection of add and remove elements, RcLowMaintenanceIdentityBag relies on replay to resolve conflicts. Like RcIdentityBag, it has no conflict for multiple sessions that add objects to the bag, and a single session that removes objects.

The cumbersome name is intended to be temporary, with this implementation replacing RcIdentityBag’s subcollection-based implementation in some future release.

RcIdentitySet

The class RcIdentitySet is similar to IdentitySet, but no conflict occurs when multiple users add objects to an RcIdentitySet. If a conflict with other update operations on the RcIdentitySet occur, the add is replayed so that the commit can succeed.

RcKeyValueDictionary

The class RcKeyValueDictionary provides the same functionality as KeyValueDictionary, but with no conflict for operations that involve different keys in the dictionary. As long as the keys are different, multiple sessions can add new keys to the dictionary, remove keys, or update values.

RcKeyValueDictionary avoids conflict by performing a selective abort and replay of the modifications to the dictionary.

Queue classes

The Queue classes implement a first-in-first-out (FIFO) queue. These are a kind of collection that is ordered by the sequence in which objects are added to the collection. The add: message puts an element at the logical end of the queue, and the remove method returns the element at the logical head of the queue.

The following example has the same semantics for GsPipe, RcPipe, and RcQueue; the choice of classes depends on the transactional requirements of your application.

Example 4.4 FIFO Queue

| pipe |

pipe := RcPipe new.

pipe add: 'orange'.

pipe add: 'apple'.

pipe add: 'banana'.

pipe remove.

pipe.

 aRcPipe( 'apple', 'banana')

GsPipe

The class GsPipe implements a first-in-first-out queue, with no conflict when a single session adds objects to the RcPipe, and only one session removes objects.

Internally, the GsPipe is implemented as a linked list of GsPipeElements. Since adds and removes only affect the respective ends of the linked list, there is no conflict between add and remove.

RcPipe

The class RcPipe implements a first-in-first-out queue, with no conflict when multiple sessions add objects to the RcPipe, and only one session removes objects.

Internally, the RcPipe is implemented as a linked list of GsPipeElements. Unlike with GsPipe, if a conflict with an add by another session occurs, the add operation is replayed so that the commit can succeed. Only add: and operations that invoke add: are reduced conflict.

RcQueue

The class RcQueue implements a first-in-first-out queue, with no conflict when multiple sessions add objects to the RcQueue, and only one session removes objects.

RcQueue has a more complex internal implementation, which allows it to handle high rates of concurrent updates without affecting performance. However, some usage conditions make is necessary to perform manual cleanup

Internal implementation

Internally, RcQueues are implemented using an Array of RcQueueSessionComponents, each corresponding to a session number. The RcQueueSessionComponents contain RcQueueEntry instances, one for each object that the session with the corresponding session number has added to the queue. The RcQueueEntry includes timestamp and sequence number; the timestamp is used to determine the next object within the entire queue is next to be returned, and the sequence number is used to track the next element within the queue for a specific session.

When a next message causes an object to be removed, the removing session updates the RcQueue’s removal sequence number array corresponding to the RcQueueSessionComponents in which the removed object was found.

Maintenance

Reclaiming the storage of objects that have been removed from the queue is deferred until new objects are added by a session with the same session number; this is the way the risk of conflict is avoided.

If a session adds a great many objects to the queue all at once and then does not add any more, while another session consumes the objects, performance can become degraded, particularly from the consumer’s point of view. In order to avoid this, the producer can send the message cleanupMySession occasionally to the instance of the queue from which the objects are being removed. This causes storage to be reclaimed from obsolete objects.

To remove obsolete entries belonging to all inactive sessions, the producer can send the message cleanupQueue.

4.3 GsBitmap

A GsBitmap is quite different than the other collections that have been described. Instances of GsBitmap are objects that encapsulate an in-memory bitmap, with the presence of an object in the collection only indicated by the way a bit is set at the index for the oopNumber of the object.

GsBitmaps cannot be committed, and are designed to optimize performing tasks on very large numbers of persistent objects. In particular, repository analysis using allInstances and similar methods can be more easily done using GsBitmaps. The objects in a GsBitmap are not in temporary object memory, allowing arbitrary large collections. A number of repository analysis methods return GsBitmap instances, and instances of GsBitmap can be created from hidden sets (see Hidden Sets).

While GsBitmap can be considered as a collection and implements some Collection protocol, it does not inherit from Collection. Methods such as add:, remove:, includes: and do: are implemented specifically for GsBitmap; see the image for specific methods. You may send asGsBitmap to create a GsBitmap from a collection, provided the collection only contains objects that are allowed in a GsBitmap; use asArray to collect the objects corresponding to the OOPs in the GsBitmap.

Since GsBitmap is intended to work with very large collections of objects, it implements set arithmetic methods, +/union:, -/difference: and */intersect:.

While there are restrictions and caveats to using GsBitmap, there are significant benefits in memory use. Instances of GsBitmap use C Heap memory, not temporary object memory, to store the bit array.

The following restrictions apply to GsBitmap:.

Only committed objects can be added to a GsBitmap.
Specials, such as Characters, Integers, and SmallDoubles, cannot be added, since they do not have POM OOPs.
Objects can appear only once in the bitmap; duplicates are ignored.
GsBitmaps are ordered in OOP order, regardless of the order they are added.
GsBitmaps cannot be committed, since the underlying structure is not an object.
Being in a GsBitmap does not count as a reference to an object, so there is a risk that objects in a GsBitmap could be garbage collected.

The following example finds all instances of Customer that are not in the AllCustomers collection:

Example 4.5 GsBitmap

| bmAllInstances bmCustomerColl|

bmAllInstances := SystemRepository allInstances: Customer.

bmCustomerColl := AllCustomers asGsBitmap.

(bmAllInstances difference: bmCustomerColl) asArray.

GsBitmaps and C Heap memory

While GsBitmaps do not used temporary object memory, they do still use some memory, and it is possible to run out of C Heap memory if there is extensive use of large GsBitmaps.

An instance of GsBitmap requires a minimum of 16KB (one page) of C Heap memory, which can hold up to 2K objects. A GsBitmap’s memory use always grows in 16KB increments. For GsBitmap instances that contain more than 2K elements, the amount of memory used will vary, depending on how dense the OOP values are within the leaves of the internal tree structure. The best case, for very large, dense bitmaps, is about 1 bit per object. A GsBitmap that contains all the OOP in the repository (GsBitmap allValidOops) will take about (System _oopHighWaterMark // 8) bytes of C heap memory.

GsBitmaps and their objects

There is an important point to note about GsBitmaps; an object in a GsBitmap is not "referenced" by the GsBitmap in the usual way.

An object in GemStone that is not referenced by other persistent objects or by references from a session, is subject to garbage collection. In busy systems, the OOP of that object may be recycled and no longer be in use; and the OOP may be reused by this or another session for an entirely new object of any class. GemStone collections (other than GsBitmap) have references to the objects contained within them, which keeps the objects in temporary collections safe for the life of a session.

Since the references in a GsBitmap are just to the OOPS, not to the objects, objects in a GsBitmap are not safe; the reference from the bitmap is not sufficient to preserve content objects from garbage collection, if they are not referenced somewhere else in the application or session.

If your session performs commits or aborts (including automatic commits or aborts), and the objects that you are working with may become dereferenced (for example, removed from the root collection by another session), then your code should be prepared for objects to no longer exist, or to be a different object than expected.

If an object was garbage collected, and the OOP reused, it may have been used for a critical internal object, or an important object in your application. Use caution when modifying the objects returned from a GsBitmap.

GsBitmaps methods for repository analysis

GsBitmap includes methods that enable repository-wide analysis of objects in the repository. The following methods are available; see the image for other methods.

GsBitmap >> referencedObjects
Returns a new GsBitmap containing the objects directly referenced by the objects in the receiver.

GsBitmap class >> allValidOops
Returns a GsBitmap containing the oops of all valid committed objects in the repository.

GsBitmap class >> transitiveReferences: aCollection
Returns a GsBitmap that contains all the objects which are transitively referenced from the objects in aCollection.

GsBitmap class >> allObjectsExcept: aCollection
Returns a GsBitmap that contains all objects that exist in the repository, which are not contained in the objects transitively referenced from aCollection.

Bitmap files

In addition to standard collection protocol, GsBitmaps can be written to and read from disk, using the following methods:

GsBitmap >> writeToFile:

GsBitmap >> writeToFileInPageOrder:

GsBitmap >> readFromFile:

GsBitmap >> readFromFile:withLimit:startingAt:

You may also query for information on a given bitmap file, using GsBitmap >> fileInfo:. This method returns an array containing:

number of oops in the file
whether the file was written in page order
number of valid oops
number of oops that are not allocated, or in the process of being garbage collected

A GsBitmap file contains references to numeric OOPs. The caution about the risk of unreferenced OOPs being garbage collected and possibly reused, applies even more strongly when using GsBitmap files. And of course, the likelihood of incorrect results relates to the amount of garbage collection that has been done during the period between the time the file was written and when it is read.

4.4 Sorting the objects in a collection

You are likely at some point to want to present the contents of your Collection in a sorted order. You will have to determine how the objects in your collection should be compared to each other for the ordering you need.

Default Sort

Many objects, such as strings, numbers, and dates, have an inherent sort ordering; they respond to <= in a common way, although they cannot always be compared with each other. If your collection contains only homogenous objects that share an understanding of <=, you can use messages such as sortAscending, sortDescending, and asSortedCollection to the collection.

Example 4.6 SmallInteger and String sorting

(Array with: 123 with: 3 with: 99 with: 10) sortDescending

 anArray( 123, 99, 10, 3)

(Array with: '123' with: '3' with: '99' with: '10') sortAscending

anArray( '10', '123', '3', '99')

The default sort of Strings is case-insensitive, unless the only difference is case in which uppercase is first. However, in many cases you may need a different ordering, particularly when languages other than English and character outside the ASCII range are involved. GemStone provides specialized tools for this, which are described in Chapter 5.

The options depend on the type of data in your collection.

Sort based on predefined order of the objects. Some objects, such as Strings, Integers, and DateTimes, have an inherent sort ordering, and GemStone provides default sorts for Collections that contain only objects that can be compared using <=.

While strings have intuitive sort order, string sorting can be complex. Traditional and Unicode strings handle some cases differently. String sorting is described in String Sorting and Collation.

Sort based on one or more of the predefined order of objects’s instance variable values. The sort you intend is based on the values in application objects instance variables, and these values have inherent sort order, such as sorting customers by zip code.
Arbitrary Sort. sortBlocks allow you to specify expressions that can order any type of object according to your specific requirements.

These issues are the same when using a SortedCollection, which always maintains sort order as elements are added and removed, or when sorting another kind of collection for presentation.

Sorting Application objects

Most likely, you will need to sort complex objects in your collection, such as Customers by name or Addresses by zip code. If the instance variables in your complex objects are objects that have a defined sort order, you can take advantage of sortAscending:, sortDescending:, and sortWith:, to provide a specification for the desired sort order.

You may wish to implement <= on your application objects, in which case you can just use as sortAscending, sortDescending, and asSortedCollection. However, this provides a single definition of the sort order of your objects that will always be applied.

For example, say we have a class for Employee, and a Globals AllEmployees is a collection that contains instances of Employee:

Example 4.7

Object subclass: 'Employee'

	instVarNames: #( 'firstName' 'lastName' 'job' 'age')

	classVars: #()

	classInstVars: #()

	poolDictionaries: #()

	inDictionary: UserGlobals

Employee compileMissingAccessingMethods

UserGlobals at: #AllEmployees put: (	IdentityBag

	with: (Employee new firstName: 'Lee'; lastName: 'Smith';

				job: #librarian; age: 40)

	with: (Employee	 new firstName: 'Kay'; lastName: 'Adams';

			job: #clerk; age: 24)

	with:		(Employee new firstName: 'Al'; lastName: 'Jones';

			job: #busdriver; age: 40)

To sort Employees by age and lastName, we can use the sortAscending: method, passing in the instance variables against which the ascending sort should be done:

Example 4.8

| sorted str |

str:= String new.

sorted := AllEmployees sortAscending: #('age' 'lastName').

sorted do: [:anEmp |

	str add: (anEmp age asString); space; add: anEmp lastName; lf].

str

24 Adams

40 Jones

40 Smith

Sorting in multiple orders

For finer control, you can use the sortWith: method, which allows you to define direction for each instance variable.

Example 4.9

| sorted str |

str := String new.

sorted := AllEmployees sortWith: #('age' 'Ascending'

                                 'lastName' 'Descending').

sorted do: [:anEmp | str add: (i age asString);

				add: ' '; add: anEmp lastName; lf].

str

24 Adams

40 Smith

40 Jones

SortBlocks

You can also specify sort ordering by defining a sortBlock. A sortBlock is a two-argument block that should return true if the first argument should precede the second argument, and false if not. The expressions within the block are expected to by symmetrical - i.e., for two specific arguments for which the block returns true, then the block should return false when the arguments are reversed. If values compare equal, and the block returns the same results for both argument orders, then the final ordering of the equal elements is arbitrary.

SortedCollection is a type of Collection that includes a sortBlock; SortedCollection class is discussed under SortedCollection.

You can sort the elements of a collection by creating a SortedCollection using asSortedCollection: aBlock, or by using methods such as sortWithBlock:. which return an Array with the sorted contents .

For example, to sort customers by last name:

AllEmployees sortWithBlock: [:a :b |

	a lastName <= b lastName]

You can create sort blocks that are as elaborate as you need; however, you should observe the symmetry of the expression.

For example, this block sorts by lastName, with further sorting by firstName if the lastNames are the same:

AllEmployees sortWithBlock: [:a :b |

	a lastName = b lastName

		ifTrue: [a firstName <= b firstName]

		ifFalse: [a lastName <= b lastName]

].

Sorting Large Collections

When sorting using the above methods, the entire collection must fit into memory. This may not be practical for very large collections.

To avoid out of memory errors when sorting large collections, you can allow the sort to issue periodic commits, which will make the sort results persistent. Persistent objects don’t need to stay in memory the way temporary objects do, which reduces the demand on memory.

These intermediate commits are enabled by specifying a persistentRoot for the sort, and by taking advantage of the IndexManager’s ability to set up autoCommit. IndexManager is a class that manages Indexes, which you’ll read more about in Chapter 7. You do not need to have an index on the collection in order to use this feature. However, you do need to set IndexManager’s autoCommit setting to true. For more information on autoCommit, see Auto-commit.

For example, the following code sorts AllEmployees collection using sortWithBlock:persistentRoot:

Example 4.10 Sorting large collections, committed incremental results

UserGlobals at: #SortedEmployees put: Array new.

System commitTransaction.

AllEmployees

	sortWithBlock: [:a :b | a lastName <= b lastName]

	persistentRoot: SortedEmployees