The Collection classes are a key group of classes in GemStone Smalltalk. This chapter describes the main types of Collections that are available and their common functionality.
Introduction to Collections
introduces the GemStone Smalltalk objects that store groups of other objects.
Collection Subclasses
describes several kinds of ready-made data structures that are central to GemStone Smalltalk data description and manipulation.
Stream Classes
describes classes that add functionality to access or modify data stored as a Collection.
Sorting
describes the ways to sort elements in collections.
Instances of the Collection classes are specialized to manage an indeterminate number of objects as a group using unnamed instance variables. All instances of Collection subclasses support protocols for adding and removing elements (as long as the collection is not invariant), for iterating over the elements, and for testing the presence of an object. Collections can be classified by whether or not they maintain a specified order for their elements, whether or not key-based lookup is supported, and the kinds of objects they can reference.
Collections can be broadly classified into three categories:
Instances of AbstractDictionary subclasses do not support a specific order for their elements but do support storage and retrieval via the at:put: and at: messages, using arbitrary objects for an element's key. Subclasses of AbstractDictionary are specialized based on whether key-based lookup uses equality comparison or identity comparison, the type of key, and the type of value.
Dictionaries can also have named instance variables, if you choose to define them.
Instances of SequenceableCollection classes maintain a specific order for their elements and support storage and retrieval via the at:put: and at: messages using an integer key (the one-based offset into the elements), analogous to an array with a numeric subscript in other programming languages.
Byte-format classes such as ByteArray and String cannot have named instance variables. The other sequenceable collections can have named instance variables if you choose to define them.
Instances of UnorderedCollection classes—also referred to as Non-Sequenceable Collections or NSCs—do not have a specific order for their elements, and do not support storage or retrieval via the at:put: and at: messages. Objects in these collections are accessed by iterating the collection. UnorderedCollections support indexes, which allow ordered iteration and fast key-based lookup.
UnorderedCollections may have named instance variables, if you choose to define them.
When you create a collection of more than about 2K pointer object or more than 16K byte objects, GemStone internally uses a sparse tree implementation to make more efficient use of resources. These are referred to as "Large Objects", and use internal classes such as LargeObjectNode. This behavior occurs in a manner that is transparent to you, and you interact with Large Collections the same as smaller collections; however, these internal objects may be visible when performing object-based audit and analysis.
Different kinds of collections use different criteria in which to store and locate objects. Most dictionaries use a hashed value, the result of sending the #hash message to each key object. If the result of sending #hash changes, the key may not be found in the collection using methods such as at:.
The ordering of a SortedCollection depends on the results of sending comparison methods to the objects in the collection.
If a hash method or comparison method is defined that depends on values in the instance variables of the objects, and these instance variables are modified, the object may not be found using lookup methods in the dictionary or sorted collection, though iteration will still find them. If you change the value of an instance variable of an object in such a collections, you should remove and re-insert the object in the collection so the lookup methods will work.
Collection classes understand common protocol, inherited from the abstract superclass Collection. Collection defines methods that enable you to:
The examples that follow provide a starting point for using Collections; review the methods and method comments in the image for more details.
Collection classes respond to the instance creation message new. When sent to a Collection class, this message causes a new instance of the class with no elements (size zero) to be created. Most kinds of collections can expand as you add additional objects.
Another instance creation message, new: anInteger, causes many Collection subclasses to create an instance that is pre-sized to hold anInteger elements. It’s often more efficient to use new: than new, because a Collection created with new: need not expand repeatedly as you add new elements. This is particularly significant for large key-based Collections where the hash must be computed for each element in the collection when the Collection base size changes. For very large collections, growing may require a large amount of temporary object memory to complete, with the risk of running out of memory.
Collections also define the instance creation message, withAll: aCollection, that creates a new instance of the receiver containing all of the objects stored in aCollection, and with:, with:with:, with:with:with:, and with:with:with:with:, which create a new instance of the receiver with 1, 2, 3 or 4 (respectively) specific elements.
Several kinds of Collections can be created as literals, using Smalltalk syntax. Arrays, ByteArrays, Strings and Symbols have literal syntax, and Arrays can also be created at runtime using Array constructors. ByteArrays, Strings and Symbols are discussed in Chapter 5.
Literal Arrays are created at compile time, and hold other literal objects. These start with the pound sign, are enclosed in parenthesis and separated by white space. Array literals are defined by ANSI; syntax is described here.
#( 'carrot' 'tomato' 'celery')
Arrays constructor are created at runtime. These are enclosed in curly braces and separated by a period. Array constructors are GemStone-specific, not defined by ANSI; syntax is described here.
{ Date today . Time now }
Collection defines for its subclasses two basic methods for adding elements: add:, which adds one element to the Collection, and addAll:, which adds several elements to the Collection at once.
Collection subclasses override these methods in order to control access to elements or to enforce an ordering scheme. Certain subclasses of Collection provide additional methods that add an element at a numbered positions or based on an arbitrary key.
Collection defines for its subclasses several methods for removing elements: remove:, which removes one element to the Collection, and removeAll:, which removes a collection of elements from the Collection at once.
While it is an error to remove objects that are not in the collection, the methods removeIfPresent: and removeAllPresent: perform the same removals, but do not error if the specified objects or objects are not in the collection.
Collection defines several methods that enable you to loop through a collection’s elements. The most general enumeration message is do: aBlock. When you send a Collection this message, the receiver evaluates the block repeatedly, using each of its elements in turn as the block’s argument.
Suppose that you made an instance of Array in this way:
UserGlobals
at: #Virtues
put: { 'humility' . 'generosity' . 'patience' }.
To create a single String to which each virtue has been appended, you could use the message do: aBlock like this:
| aString |
aString := String new. "Make a new, empty String."
"Append a virtue, followed by a space, to the new String"
(Virtues sortAscending) do: [:aVirtue |
aString := aString , ' ' , aVirtue].
^ aString
%
' generosity humility patience'
In addition to do:, Collection provides several specialized enumeration methods; the most common ones are collect:, select:, detect:, and reject:.
When sent to SequenceableCollections, those messages that return collections (such as select:) always preserve the ordering of the receiver in the result. That is, if element a comes before element b in the receiver, then element a is guaranteed to come before b in the result.
NOTE
To avoid unpredictable consequences, do not add elements to or remove them from a collection while you are enumerating it.
This section describes the properties of Collection’s concrete subclasses, and gives some guidance about choosing places for new classes that you might want to add to the Collection hierarchy.
Subclasses of Collection can be grouped by the kinds of access methods they provide and the kinds of objects their instances can store. Let’s first consider those collection classes that don’t provide access to elements through external numeric indexes.
Dictionary classes are subclasses of AbstractDictionary. The elements in a Dictionary collection are stored and accessed via a key; each key must be unique within that Dictionary. Depending on the specific subclass, the keys may be compared using equality or identity.
Dictionaries provide their special facilities by storing key-value pairs instead of simple, linear lists of objects. While some types of dictionaries are implemented as “a collection of Associations”, the interface methods return results based on the logical contents, which are the values. Other, specialized protocol allows you to refer to the key or the value portions of the logical associations.
For performance reasons, the internal implementation of Dictionary classes varies. Instances of Dictionary itself consist of a collection of Association objects. KeyValueDictionary subclasses are implemented differently, as a sequence of keys and values, which may use CollisionBuckets to hold the actual values. IdentityDictionary is a sequence of keys and Associations. All these dictionaries understand common protocol, regardless of implementation.
Dictionary class uses Associations to store the key/value pair. Dictionaries compare keys by equality, not by identity. If you need a dictionary that compares keys by identity, use IdentityDictionary, which is a subclass of KeyValueDictionary.
KeyValueDictionary has several subclasses, divided according to the type of key used to access the information:
A KeySoftValueDictionary is a subclass of KeyValueDictionary that allows the virtual machine to remove entries as needed to free up memory.
Typically, you might use a KeySoftValueDictionary to manage non-persistent objects that are large and take time to create, but that can be recreated whenever needed from small, readily available objects (tokens). For example, you might create a KeySoftValueDictionary to serve as a cache to hold large, expensive objects that are needed repeatedly. Within that dictionary, the values would be the large calculated objects, and the keys would be the corresponding tokens. If your application needs a large, expensive object but does not find it in the KeySoftValueDictionary, you can create the object and add it to the cache so that it might be available the next time it is needed.
As memory fills up, the virtual machine might remove some objects from the cache. (Remember, the contents of the cache are non-persistent and can be recreated.) The virtual machine may remove keys and values from the KeySoftValueDictionary until adequate memory is available. For details about how to manage the number of KeySoftValueDictionary entries, see Getting Rid of Non-Persistent Objects.
SequenceableCollections let you refer to their elements with integer indexes, and they understand messages such as first and last that refer to the order of those indexed elements. SequenceableCollection is an abstract superclass. The methods it establishes for its concrete subclasses let you read, write, copy, and enumerate collections in ways that depend on ordering.
SequenceableCollection redefines add: so it puts the added object at the end of the receiver.There are additional methods for adding one or more objects to its instances at particular locations, and for removing one or more objects according to position, equality, or identity.
SequenceableCollection redefines the comparison methods inherited from Object so that those methods take into account the classes of the collections to be compared and the number and order of their elements. In order for two SequenceableCollections to be considered equal, the following conditions must be met:
SequenceableCollection understands several copying message, including those that:
The following example copies the first two elements of an literal Array to a new Array:
#('bear' 'tiger' 'turtle') copyFrom: 1 to: 2
%
an Array
#1 bear
#2 tiger
This example copies two elements of an array into a different array, overwriting the target array’s original contents:
| numericArray |
numericArray := Array with: 55; with: 66 with: 77 with: 88.
numericArray replaceFrom: 2 to: 3 with: #( 1 2 3 4 5 )startingAt: 4.
numericArray
%
an Array
#1 55
#2 4
#3 5
#4 88
The advantage of using the method replaceFrom:to:with: startingAt: is that it does not fault the contents into memory, which can improve performance when working with very large collections. Of course, displaying the results as in the example also faults the objects into memory.
Also keep in mind that copies of SequenceableCollection, like most GemStone Smalltalk copies, are “shallow.” In other words, the elements of the copy are not simply equal to the elements of the receiver—they are the same objects.
Class SequenceableCollection redefines the enumeration and searching messages inherited from Collection in order to guarantee that they process elements in order, starting with the element at index 1 and finishing with the element at the last index.
SequenceableCollection also defines a new enumeration message, reverseDo:, which acts like do: except that it processes the receiver’s elements in the opposite order.
SequenceableCollections understand findFirst: aBlock and findLast: aBlock. The message findFirst: returns the index of the first element that makes aBlock true, while findLast: returns the index of the last.
As you have seen in previous examples, instances of Array and of its subclasses contain elements that you can address with integer keys that describe the positions of Array elements.
One of the most important differences between client Smalltalk arrays and a GemStone Smalltalk array is that GemStone arrays are extensible; you can increase the size of an array at any time. Sending at:put: will increase the size of the array, as long as the index is only one more than the current array size. Other protocol such as addAll: also increase the size while adding elements.
It’s also possible for you to change the size without explicitly storing or removing elements, using the message size: inherited from class Object. When you lengthen an array with size:, the new elements are set to nil.
You are free to create an array with the inherited message new and let the array lengthen automatically as you add elements.
Arrays created with the new message initially allocate no space for elements. As you add objects to such an array, it must lengthen itself to accommodate the new objects. It’s usually more efficient to create your arrays with the message new: aSize (inherited from class Behavior), which makes a new instance of the specified size:
| tenElementArray |
tenElementArray := Array new: 100.
The selector new: stores nil in the indexed instance variables of the empty array. Having created an array with enough storage for the elements you intend to add, you can proceed to fill it quickly.
Arrays can also be created in code without sending instance creation messages, by using literal array or array constructor syntax.
An Array literal is created with the following syntax:
#( element1 element2 element3 )
When your code includes a statement like this, at compile time an invariant instance of Array is created.
An Array constructor is created at runtime, rather than at compile time, and it not invariant - it can be modified by later code. The syntax for Array constructors is:
{ element1 . element2 . element3 }
SortedCollection is a type of SequenceableCollection in which the elements are ordered by a specific sort order, not by the order in which they were added or by the method used to add the element. You are not permitted to send at:put:, addLast:, or similar methods to a SortedCollection.
Each instance of SortedCollection is associated with its own sortBlock. The default block will sort elements that have a known sort order, such as alphabetic or numeric; the elements must be able to be compared using <=.
You can also define your own sortBlock, if you want elements ordered by some other criteria, such as the value of an instance variable.
For more on comparison, sorting, and sort blocks, see Sorting.
| scrabbleWords |
scrabbleWords := SortedCollection sortBlock: [:a :b | a size < b size].
scrabbleWords add: 'able'; add: 'zebra'; add: 'jumper'; add: 'yet'.
scrabbleWords
%
aSortedCollection( 'yet', 'able', 'zebra', 'jumper')
There is overhead in always keeping the collection sorted, so it usually more efficient to sort the elements only when you need them to be sorted for presentation. There are advantages to using a type of collection that is more suitable for managing the elements, then using methods such as sortWithBlock: to create a new Array with the original collection’s elements in sorted order.
Instances of UnorderedCollection store their elements in a private, implementation-defined order and explicit key-based access such as at: and at:put: are disallowed.
In all subclasses of UnorderedCollection, nil elements are disallowed. An UnorderedCollection will silently ignore attempts to add a nil element.
UnorderedCollection implements protocol for indexing, which allows for large collections to be queried and sorted efficiently. Chapter 7, “Indexes and Querying”, describes the querying/sorting functions in detail. The following section describes the protocol implemented in UnorderedCollection’s subclasses.
The most efficient way to handle very large collections is using UnorderedCollections and access the contents using GemStone indexes.
UnorderedCollections may also be referred to as Non-Sequenceable Collections (NSCs).
The classes Bag and Set are some of the simplest UnorderedCollections. In these classes duplicate checking is done based on equality (rather than identity), and a Set will ignore attempts to add a equal element. A Bag will accept an equal item but will do so by increasing the count of the existing element. Thus, adding two equal but not identical objects will be treated as if the first object is present twice.
If the Bag or Set contains elements that are themselves complex objects, determining the equality is complex and therefore slower than you might have hoped. GemStone recommends using IdentityBag or IdentitySet for anything but the most simple unordered collections.
IdentityBag has faster duplicate checking than Bag. Like a Bag, an IdentityBag is elastic and can hold objects of any kind.
To compare an object that is in an IdentityBag, you rely on the identity (OOP) of the object. This is a much less time-consuming task than an equality comparison, and in most cases it should be sufficient for your design.
The inherited messages add: and addAll: work much as they do with other kinds of collections, except, of course, that they are not guaranteed to insert objects at any particular position. There’s one other significant difference: if the argument to addAll: is an Array or OrderedCollection, the elements in the collection are not faulted into memory.
IdentityBag also defines a method that allows you to copy elements into a Collection (which must be a kind of Array or OrderedCollection) without faulting the contents into memory, using the message:
replaceFrom: startIndex to: stopIndex with: aCollection startingAt: repIndex
| bagOfRodents |
bagOfRodents := IdentityBag withAll: #('beaver' 'rat' 'agouti'
'chipmunk' 'guinea pig').
(Array new: 5) replaceFrom: 3 to: 5 with: bagOfRodents startingAt: 1.
anArray( nil, nil, 'beaver', 'rat', 'agouti')
You’ll generally use Collection’s enumeration protocol to get at a particular element of a IdentityBag. The following example uses detect: to find a IdentityBag element equal to 'agouti':
Class IdentityBag provides several messages for removing objects from an identity collection. The message remove:ifAbsent: lets you execute some code of your choice if the specified object cannot be found. In this example, the message returns false if it cannot find “3” in the IdentityBag:
| myBag |
myBag := IdentityBag withAll: #(2 3 4 5).
myBag remove: 3 ifAbsent: [^false].
myBag sortAscending.
%
anArray( 2, 4, 5)
Similarly, removeAllPresent: aCollection is safer than removeAll: aCollection, because the former method does not trigger an exception if some members of aCollection are absent from the receiver.
All the removal messages act to delete specific objects from an IdentityBag by identity; they do not delete objects that are merely equal to the objects given as arguments. Example 4.5 works correctly because the SmallInteger 3 has a unique identity throughout the system. By way of contrast, consider Example 4.6.
| myBag array1 array2 |
"Create two objects that are equal-but-not-identical."
array1 := Array new add: 'stuff'; add:'nonsense' ; yourself.
array2 := Array new add: 'stuff'; add:'nonsense' ; yourself.
"Create an IdentityBag containing array1."
myBag := IdentityBag new add: array1.
UserGlobals at: #MyBag put: myBag.
"Try to remove one of the objects from the IdentityBag by
referring to its equal-but-not-identical twin."
myBag remove: array2 ifAbsent: ['Sorry, can''t find it'].
%
Sorry, can’t find it
Class IdentityBag redefines the selector = in such a way that it returns true only if the receiver and the argument:
Class IdentityBag provides three messages that perform set union, set intersection, and set difference operators.
+ union, returning elements that are in either one, the other, or both.
- difference, returning elements that are in the receiver but not the argument.
* intersection, returning elements that are in both
| pets rodents |
pets := IdentityBag with: 'dog' with: 'cat' with: 'gerbil'.
rodents := IdentityBag with: 'rat' with: 'gerbil' with: 'beaver'.
pets * rodents
%
anIdentityBag( 'gerbil')
pets + rodents
%
anIdentityBag( 'beaver', 'rat', 'gerbil', 'gerbil', 'cat', 'dog')
pets - rodents
%
anIdentityBag( 'cat', 'dog')
There is one significant difference between these messages and set operators — IdentityBag’s messages consider that either the receiver or the argument can contain duplicate elements. The method comments in the image provide more information about how these messages behave when the receiver’s class is not the same as the class of the argument.
Reading or writing a SequenceableCollection’s elements in sequence entails some extra effort — you need to maintain an index variable so that you can keep track of which element you last processed. A Stream acts like a SequenceableCollection that keeps track of the index most recently accessed.
There are several concrete Stream classes. Class ReadStream is specialized for reading SequenceableCollections and class WriteStream for writing them; ReadWriteStream is also provided, for ANSI compatibility.
A stream provides its special kind of access to a collection by keeping two instance variables, one of which refers to the collection you wish to read or write, and the other to a position (an index) that determines which element is to be read or written next. A stream automatically updates its position variable each time you use one of Stream’s accessing messages to read or write an element.
Streams are often used for reading characters from strings or files, but any kind of collection can be used with a Stream, and any type of object can be in that collection.
PositionableStream, with its subclasses ReadStream and WriteStream, was traditionally implemented in GemStone with the position indicating an offset from 1; that is, the first position in the stream was 1.
ANSI specifies, and other Smalltalk dialects use, an offset of 0, so the first position in the stream is 0.
To allow legacy code and new code to coexist, GemStone includes sets of classes implementing both interfaces. There are four sets of classes, which all exist in the image (and therefore, may have instances), with only three sets being visible at any time. The following two sets are always visible:
PositionableStreamLegacy
ReadStreamLegacy
WriteStreamLegacy
PositionableStreamPortable
ReadStreamPortable
WriteStreamPortable
ReadWriteStreamPortable
In addition, only one of the following sets is visible, depending on how your system is configured. These are two distinct sets of instances of Class, with the same name, but different implementations.
PositionableStream (with legacy definition and methods)
ReadStream
WriteStream
PositionableStream (with portable definition and methods)
ReadStream
WriteStream
The legacy versions are stored in Globals at: #GemStone_Legacy_Streams. The portable, ANSI-compatible versions are stored in Globals at: #GemStone_Portable_Streams.
To check what is currently installed, use the following methods:
PositionableStream class >> isLegacyImplementation
PositionableStream class >> isPortableImplementation
To install the portable version, use the method:
Stream class >> installPortableStreamImplementation
To install the legacy version, use the method:
Stream class >> installLegacyStreamImplementation
You are likely at some point to want to present the contents of your Collection in a sorted order. There are a number of options, depending on the type of data you need to sort and the desired ordering.
Objects representing numbers, dates and times, and strings have a natural sort order. If the collection contains objects that can be compared using <=, you can easily order the collection with the default sort.
Messages such as sortAscending and sortDescending can be sent to any collection that contain only these types of objects. For example:
(Array with: 123 with: 3 with: 99 with: 10) sortDescending
%
anArray( 123, 99, 10, 3)
(Array with: '123' with: '3' with: '99' with: '10') sortAscending
%
anArray( '10', '123', '3', '99')
It’s more likely that you will want to sort more complex objects in your collection, such as Customers by name or Addresses by zip code. If the instance variables in your complex objects are objects that have a defined sort order, you can take advantage of sortAscending:, sortDescending:, and sortWith:, to provide a specification for the desired sort order.
For example, say we have a class for Employee, and define AllEmployees as a collection that contains instances of Employee:
Object subclass: 'Employee'
instVarNames: #( 'firstName' 'lastName' 'job' 'age')
classVars: #()
classInstVars: #()
poolDictionaries: #()
inDictionary: UserGlobals
%
Employee compileAccessingMethodsFor:
#('firstName' 'lastName' 'job' 'age').
%
"Make some Employees and store them in a AllEmployees."
| Lee Kay Al myEmployees |
Lee := (Employee new) firstName: 'Lee'; lastName: 'Smith';
job: #librarian; age: 40.
Kay := (Employee new) firstName: 'Kay'; lastName: 'Adams';
job: #clerk; age: 24.
Al := (Employee new) firstName: 'Al'; lastName: 'Jones';
job: #busdriver; age: 40.
myEmployees := IdentityBag new.
myEmployees add: Lee; add: Kay; add: Al.
UserGlobals at: #AllEmployees put: myEmployees.
%
To sort Employees by age and lastName, we can use the sortAscending: method, passing in the instance variables against which the ascending sort should be done:
| returnArray tempString |
tempString := String new.
returnArray := AllEmployees sortAscending: #('age' 'lastName').
"Build a printable list of the sorted ages and lastNames"
returnArray do: [:i |
tempString add: (i age asString); add: ' '; add: i lastName; lf].
tempString
%
24 Adams
40 Jones
40 Smith
For finer control, you can use the sortWith: method, which allows you to define direction for each instance variable
| returnArray tempString |
tempString := String new.
returnArray := AllEmployees sortWith: #('age' 'Ascending'
'lastName' 'Descending').
returnArray do: [:i | tempString add: (i age asString);
add: ' '; add: i lastName; lf].
tempString
%
24 Adams
40 Smith
40 Jones
You can also specify sort ordering by defining a sortBlock. A sortBlock is a two-argument block that should return true if the first argument should precede the second argument, and false if not. The expressions within the block are expected to by symmetrical - i.e., for two specific arguments for which the block returns true, then the block should return false when the arguments are reversed. If values compare equal, and the block returns the same results for both argument orders, then the final ordering of the equal elements is arbitrary.
SortedCollection is a type of Collection that includes a sortBlock; the SortedCollection class is discussed under SortedCollection. You can use sortBlocks to sort the elements of any collection, using methods such as sortWithBlock:.
For example, to sort customers by last name:
AllEmployees sortWithBlock: [:a :b |
a lastName <= b lastName]
You can create sort blocks that are as elaborate as you need; however, you should observe the symmetry of the expression.
For example, this block sorts by lastName, with further sorting by firstName if the lastNames are the same:
AllEmployees sortWithBlock: [:a :b |
a lastName = b lastName
ifTrue: [a firstName <= b firstName]
ifFalse: [a lastName <= b lastName]
].
When sorting using the above methods, the entire collection must fit into memory. This may not be practical for very large collections. To avoid out of memory errors when sorting large collections, you can allow the sort to issue periodic commits, which will make the sort results persistent. Persistent objects don’t need to stay in memory the way temporary objects do, which reduces the demand on memory.
These intermediate commits are enabled by specifying a persistentRoot for the sort, and by taking advantage of the IndexManager’s ability to set up autoCommit. IndexManager is a class that manages Indexes, which you’ll read more about in Chapter 7. You do not need to have an index on the collection in order to use this feature. However, you do need to set IndexManager’s autoCommit setting to true. For more information on autoCommit, see Auto-commit.
For example, the following code sorts AllEmployees collection using sortWithBlock:persistentRoot:
UserGlobals at: #SortedEmployees put: Array new.
System commitTransaction.
AllEmployees
sortWithBlock: [:a :b | a lastName <= b lastName]
persistentRoot: SortedEmployees