5. Bug Fixes

The following bugs in v3.3.6 are fixed in v3.4.

Idle Gems not terminated by STN_GEM_TIMEOUT

When the configuration parameter STN_GEM_TIMEOUT was set to a non-zero value, idle Gems should have been terminated once the timeout elapsed, but they were not. (#46723)
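
This minimal sketch shows the intended behavior. It is in Python for illustration only. A zero STN_GEM_TIMEOUT disables the check, and any non-zero value terminates a Gem that has been idle for at least that long. The helper name is hypothetical. Treating the value as minutes is an assumption; check it against the System Administration Guide.

```python
def idle_gem_expired(idle_seconds, stn_gem_timeout_minutes):
    """Return True if a Gem idle for idle_seconds should be terminated.

    Hypothetical helper: a timeout of 0 disables termination, and the
    parameter is assumed (not confirmed) to be expressed in minutes.
    """
    if stn_gem_timeout_minutes == 0:
        return False  # timeout disabled: idle Gems are never terminated
    return idle_seconds >= stn_gem_timeout_minutes * 60
```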

Idle gems in transactionless mode could be terminated by sigAbort

A gem in transactionless mode automatically responds to sigAbort by aborting. However, if the gem was entirely idle, it had no chance to perform this processing, and it could be killed by lostOT handling. (#46290)

AIO page server errors on fsync may cause Stone to hang

When the AIO page servers encountered an error during fsync of the extents, such as running out of disk space, the stone could hang. Because fsync is a critical operation, the stone now shuts down in this circumstance.

Additional details on I/O errors are also now printed to the stone log. (#46734, #46735)

Out of tranlog disk space during checkpoint may cause Stone to hang and restart to fail

When an out-of-disk-space error occurred during a checkpoint, the stone could hang. In this case, the final tranlog was not completely written and could not be read on stone restart. (#46727)

Tranlog record containing only one selectiveAbort skipped by restore

When a tranlog record contained only a single selective abort, the record was small enough that restore skipped it. (#46695)

Extra ShrPcMon processes on heavily loaded system

On a heavily loaded system on which swapping is taking place, there was a race condition in acquiring an exclusive lock on the .LCK file. This could allow multiple ShrPcMon processes to be started: the Stone assumed that the earlier one(s) had timed out, and started another one. Once the Stone connected to one of these ShrPcMon processes, it ran correctly, but the extra ShrPcMon processes remained and used resources. (#46928)
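
A common way to make "only one monitor per cache" hold is to take an atomic, non-blocking exclusive lock on the lock file. If the lock cannot be taken, the caller does not start another process. The sketch below shows that general technique with POSIX flock through Python's fcntl module. It is not GemStone's implementation. The function name and the example.LCK file name are hypothetical.

```python
import fcntl
import os

def try_acquire_exclusive_lock(path):
    """Try to take an exclusive, non-blocking lock on a lock file.

    Returns the open file descriptor on success (keep it open to hold
    the lock), or None if another open file description holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes this fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

Test and lock happen in one system call here, so two callers cannot both conclude that no monitor exists.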

Race condition when committing many symbols from multiple gems

When a very large number of new symbols were being created by multiple gem sessions, the commands used for SymbolGem communication could interleave so that the Gem waited for a response while the SymbolGem was not aware there was work to do. The window is narrow and the problem rare, but it could delay a commit by up to 10-15 seconds. (#46825)
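
These notes do not describe the SymbolGem protocol. The symptom, a waiter blocked while the worker believes there is nothing to do, is the classic "lost wakeup" race. As a generic illustration with hypothetical names, the usual remedy is to keep the pending-work state under the same lock as the condition variable and to wait on a predicate.

```python
import threading

class WorkQueue:
    """Producer/consumer handoff that avoids a lost wakeup.

    Pending work is checked and modified under the condition's lock, so a
    notification cannot slip in between "check for work" and "start waiting".
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []

    def submit(self, item):
        with self._cond:
            self._pending.append(item)
            self._cond.notify()

    def take(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate before and after waiting, so
            # work submitted before the wait starts is not missed.
            if not self._cond.wait_for(lambda: self._pending, timeout=timeout):
                return None  # timed out with no work
            return self._pending.pop(0)
```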

Changing CanonStringBucket’s objectSecurityPolicy results in SymbolGem death

AllSymbols and instances of Symbol are protected against changes to their objectSecurityPolicy; however, the internal buckets that implement the AllSymbols collection were not protected. Changing the security policy of these objects caused the SymbolGem to die with error 2115. (#46410)

Suspended checkpoints resumed after an Epoch GC

When checkpoints were suspended and an Epoch GC happened to run during that window, the suspension was cancelled and checkpoints resumed, without any warning message in the stone log. Since the usual reason for suspending checkpoints is to take extent copy backups, this could result in corrupt backups. (#47133)

Symbol Garbage Collection did not collect Double and Quad ByteSymbols

Symbol garbage collection can be run manually to remove unreferenced Symbols, per the instructions in the System Administration Guide. This code did not identify and mark for removal any symbols containing characters with code points over 255 (whose class was therefore DoubleByteSymbol or QuadByteSymbol). (#46614)

waitstone may have returned error while stone in startup

When the stone was in startup, it was possible for waitstone to return an error rather than waiting. (#46518)

Cache warming error when terminated due to full cache

When cache warming did not complete because the shared page cache became full, this was logged as an error. Since this is an expected scenario, it is now logged as a warning. (#46493)

Keyfile CPU limit not correctly applied for remote gems on large hosts

If the keyfile permits a limited number of CPUs (CPU affinity), but the Stone’s machine has fewer CPUs than this limit, remote gems were restricted to the number of CPUs on the Stone’s machine. They did not use the CPUs available on the remote host up to the license limit. (#46207)
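
Here is a worked example of the rule as described, using hypothetical numbers and function names. The license allows 8 CPUs, the Stone's host has 4, and the remote host has 16.

```python
def remote_gem_cpus_v336(licensed, stone_host_cpus, remote_host_cpus):
    # v3.3.6 behavior as described: the Stone host's CPU count capped remote gems
    return min(licensed, stone_host_cpus)

def remote_gem_cpus_v34(licensed, stone_host_cpus, remote_host_cpus):
    # v3.4 behavior: the remote host's own CPUs are used, up to the license limit;
    # stone_host_cpus is kept only so both functions share a signature
    return min(licensed, remote_host_cpus)
```

With these numbers, the remote gem was held to 4 CPUs before the fix and can use 8 after it.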

Keyfile CPU limit does not work correctly with more than 32 CPUs

Keyfile limitations on the number of CPUs did not allow for machines with many CPUs, and would fail if the number of CPUs was 32 or more. The code has been adjusted and can handle up to 1024 CPUs. (#46204)
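
These notes do not state the cause. A common reason for a failure at exactly 32 CPUs is a CPU mask stored in a 32-bit word, and this sketch assumes that cause. The 1024 limit matches glibc's CPU_SETSIZE, which is also an inference rather than something the notes confirm. Function names are hypothetical.

```python
CPU_SETSIZE = 1024  # upper bound stated in the release note

def mask_32(cpus):
    """Build a CPU mask in a 32-bit word (illustrates the old limitation)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask & 0xFFFFFFFF  # bits for CPU 32 and above are silently lost

def mask_wide(cpus, size=CPU_SETSIZE):
    """Build a CPU mask wide enough for `size` CPUs, rejecting out-of-range CPUs."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < size:
            raise ValueError(f"CPU {cpu} outside 0..{size - 1}")
        mask |= 1 << cpu
    return mask
```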

Upgrade could error attempting to remove documentation category

GemStone upgrade creates a temporary category in order to install class comments. Removing this category at the end of the upgrade could raise an error.

pstack on Linux required kernel.yama.ptrace_scope=0

On Linux, for pstack to work correctly you may have been required to set the kernel parameter kernel.yama.ptrace_scope to 0; by default, it may be set to 1. Changing this setting introduced a security hole.

In v3.4, GemStone’s pstack works with kernel.yama.ptrace_scope=1 on distributions using Linux kernel 3.4 or later: Ubuntu 14.04 or later, Red Hat 7.x, and SUSE 12. (#46539)

Remote Cache Issues

Remote cache connection behavior

The first gem on a remote node triggers the creation of a remote shared cache, and the creation of a page server on the stone's node. The page server on the stone's node is multithreaded and shared between all gems on that remote node.

Previously, if a subsequent gem on that remote node failed to connect to the shared page server on the Stone's node within the 20-second timeout, it would create a private page server. This could result in an excessive number of page servers on the stone's node. (#46946)

Two new configuration parameters have been added to address this behavior:

STN_GEM_PGSVR_CONNECT_TIMEOUT provides control over the timeout specifically for the connection between the remote gem and the page server on the stone's node.

STN_GEM_PRIVATE_PGSVR_ENABLED, if false, prevents the remote gem from starting a private page server if the connection to the shared page server fails or times out. In this case, the remote gem's login fails.

Mid-level cache used random port to connect to page server on stone’s node

The connection between the gem's pgsvr on the mid cache and the gem's pgsvr on the stone cache used a random port number, rather than the well-known port number of the pgsvr on the stone cache. (#46382)

System stopZombieSession slow on overloaded remote node

In some code paths, page manager thread processing could be delayed by the cache timeout. This could cause operations such as stopZombieSession: to take an unreasonably long time for a session on an overloaded remote host. (#46956)

Multithreaded pageserver not shut down after remote cache death

When a remote cache died, the multithreaded page server for that host on the stone’s node was not entirely cleaned up. Entries for these page servers continued to occupy slots in the shared page cache monitor client table, although they had no process table entry. (#47117)

Log file name for remote cache page server used customization

When a remote gem’s log file name is defined using NRS directives, and a remote cache was started, the composition of the log file name for the remote cache pageserver incorrectly prepended the remote gem log file name. (#47031)

Risk of SEGV when accessing hidden classes

Sending a message to the result of the private primitive method Object >> _primitiveAt: risked a SEGV with instances of the internal, hidden classes LargeObjectNode or NscNode under some specific circumstances. (#47107)

The new class PrivateObject is now the superclass for internal hidden classes. Sending a PrivateObject any message other than those implemented in PrivateObject signals MessageNotUnderstood, rather than causing a SEGV or other undesirable behavior. See PrivateObject.

GsExternalSession>>lastResult may be incorrect with multiple sessions

GsExternalSession >> lastResult is used to fetch the results of execution. Previously, lastResult fetched the result from the most recently accessed session. In an environment with multiple instances of GsExternalSession performing work, this could be a different session than the receiver. (#47021)

startnetldi -D did not tolerate GEMSTONE_NRS_ALL with #dir:%D

When GEMSTONE_NRS_ALL was set to a value that included #dir:%D, and the NetLDI was started in that environment using the -D argument, startnetldi errored and the NetLDI did not start. (#47126)

allSelectors result included duplicates for inherited methods

allSelectors could return duplicate symbols if a superclass implemented the same method. (#46621)
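
Here is a Python analogy for the expected result. Method names stand in for Smalltalk selectors, and `__mro__` walks the superclass chain. Because the names are collected into a set, a method that a subclass overrides appears only once. This illustrates the semantics only and is not GemStone's Behavior >> allSelectors. The function name is hypothetical.

```python
def all_selectors(cls):
    """Return each method name defined in cls or its superclasses exactly once.

    An override in a subclass and the original in a superclass share a
    name; collecting into a set keeps a single entry for it.
    """
    selectors = set()
    for klass in cls.__mro__:
        for name, value in vars(klass).items():
            if callable(value):
                selectors.add(name)
    return sorted(selectors)
```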

Error on missing UserGlobals dictionary

If the UserGlobals dictionary was not present, errors occurred in several methods, including Behavior>>methodCategories, which is invoked by GBS browsers. The correction is in the underlying invocation of GsPackagePolicy currentOrNil. (#46478)

listReferences failed to find object in large IdentityBags/Sets

If an object is in an IdentityBag or IdentitySet with more than about 1015 or 2030 elements, respectively, a listReferences: or fastListReferences: operation did not detect the reference. (#46645)

findReferences found references from a large NSC that did not contain the object

If a large UnorderedCollection (NSC) contained an object that referenced a search object, but did not contain the search object directly, the results of a findReferences: or fastFindReferences: could still have included the NSC, in addition to the correct referencing object. (#47187)

Indexing methods failed to reset ProgressCount

Some indexing operations incremented the ProgressCount statistic, but initialized IndexProgressCount rather than ProgressCount. (#45609)

ExecBlock >> selfValue

It was possible for the method ExecBlock >> selfValue to signal an out-of-range error, rather than returning an object or nil. This method is invoked by debugger methods to get process frame contexts, and for GsDevKit continuation contexts. (#46661)

Cannot change objectSecurityPolicy of a DbTransient object

Instances of classes that are defined as DbTransient can be persisted, but their instance variable data is not written to disk. When the objectSecurityPolicy of a committed DbTransient object was set, the change was not visible outside of the session that made it. (#46655)

transactionConflicts commitResult key was used as details key

Commit transactionConflicts keys were not handled correctly for synchronized commit failures; in recent versions, the commitResult key was incorrectly used as the conflict details key. (#46768)

Now, on synchronized commit failure:

GsSecureSocket could prompt for passphrase

Invoking GsSecureSocket >> useCertificateFile:withPrivateKeyFile: privateKeyPassphrase:, with a keyfile that did not require a passphrase and a nil privateKeyPassphrase argument, resulted in a prompt to stdin for the passphrase from within OpenSSL code. (#46913)

Configuration File and Parameter Issues

Last line of configuration file without linefeed was ignored

If the last line of a configuration file did not include an end of line indicator (CR or LF), that line was ignored when reading the configuration file. (#46716)
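
A reader that emits a line only when it sees CR or LF will drop a final line that has no terminator. The fix is to flush whatever remains at end of input. This sketch, with a hypothetical function name, shows the idea. It is not GemStone's configuration parser. Blank lines are skipped here as a simplification.

```python
def read_config_lines(data):
    """Split configuration text into non-empty lines, keeping a final unterminated line."""
    lines = []
    current = []
    for ch in data:
        if ch in "\r\n":
            if current:  # CRLF and blank lines produce no empty entries
                lines.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # the fix: flush the last line even without CR or LF
        lines.append("".join(current))
    return lines
```

In Python itself, `str.splitlines()` already keeps a final unterminated line.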

Dynamically set AdminGem config parameters could be lost after reclaim

Before a reclaim operation, the configuration settings are saved, and they are restored after the reclaim is complete. The ReclaimGem was saving and restoring the complete set of GcUser parameters, not just the ones for reclaim, which could overwrite updates to AdminGem settings. (#46273)

Read-only stone configuration files are not handled correctly

The stone configuration file that is used for extent names must be writable, to allow extents to be added programmatically without creating inconsistency. To avoid risk, the stone should not start up if the configuration file is read-only. This situation was not handled correctly in all cases where configuration files were passed in using the -e and/or -z argument. The problems included unclear error messages, and the stone starting up but then erroring when an extent was added. (#47054)

String, Character and UTF Issues

String hash incorrect in Unicode mode with terminal nulls

If a String ends with characters with codePoint zero, and the repository is in Unicode Comparison Mode, hash was computed incorrectly. (#46932)

Unicode string at:put: handling of invalid index argument

The primitive failure handling code for the at: anIndex argument was incorrect, resulting in a meaningless error message. (#46537)

Character codePoints could have been truncated by withAll:

When String >> withAll: was sent with certain specially structured DoubleByteString arguments, code points in the result were truncated to values less than 256, and the result was an instance of String. (#46879)

Utf8 decodeToString could produce DoubleByteString in String range

If an instance of Utf8 includes encoded Characters with codePoints in the range 128..255, Utf8 >> decodeToString produced a DoubleByteString instead of a String. (#46877)
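
Code points 128..255 take two bytes in UTF-8 but still fit in a one-byte String. The result class should therefore follow the largest decoded code point, not the encoded byte length. This sketch, with hypothetical helper names, illustrates that rule: String up to 255, DoubleByteString up to 65535, QuadByteString beyond. It is not the Utf8 >> decodeToString implementation.

```python
def string_class_for(text):
    """Name the string class able to hold every code point in text."""
    max_cp = max((ord(ch) for ch in text), default=0)
    if max_cp <= 0xFF:
        return "String"
    if max_cp <= 0xFFFF:
        return "DoubleByteString"
    return "QuadByteString"

def decode_to_string(utf8_bytes):
    """Decode UTF-8 bytes and report which class the result should be."""
    text = utf8_bytes.decode("utf-8")
    return text, string_class_for(text)
```

For example, "café" encodes "é" (U+00E9) as two bytes, yet decodes to a String.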

findPatternNoCase:startingAt: error with Unicode string argument

Invoking String>>findPatternNoCase:startingAt: with a pattern element that was an instance of Unicode16 or another Unicode string class resulted in an error if the repository was not in Unicode Comparison Mode. (#46975)

Multi-character binary selectors involving a hyphen character required quoting

Symbols containing non-alphanumeric characters (other than underscore) normally require quoting, but this rule does not apply to legal binary selectors (which may contain only non-alphanumeric characters). Binary selectors with more than one character that included the $- character incorrectly required quoting to evaluate. (#46603)

contentsAndTypesOfDirectory:onClient: incorrect in unicode comparison mode

When the repository is in Unicode Comparison Mode (StringConfiguration is Unicode16), GsFile methods that return file names outside the ASCII range should decode the file names from UTF-8 into Unicode strings. The method GsFile >> contentsAndTypesOfDirectory:onClient: did not do this correctly when the onClient: argument was false. (#46894)

searchlogs script did not respect sessionid, required client in IPv6

The searchlogs script returned all entries when sessionid was used as a filter.

It also did not accept IPv4 addresses for the client filter. (#44458)

GsSocket read: 0 returned nil

The argument for GsSocket >> read: is now required to be greater than zero. An argument of zero, which previously returned nil (although no error string was set), will now signal an ArgumentError. (#42322)

GsExternalSession resolveResult:toLevel: broken

GsExternalSession >> resolveResult:toLevel: incorrectly used a 1-based offset into a 0-based C array. (#46919)
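
Smalltalk indexes start at 1 and C arrays start at 0, so a Smalltalk-side index must be reduced by one before indexing the C array. The internals of resolveResult:toLevel: are not described here. In this generic sketch, a Python list stands in for the C array and the function name is hypothetical.

```python
def element_at(c_array, smalltalk_index):
    """Fetch element number `smalltalk_index` (1-based) from a 0-based array."""
    if not 1 <= smalltalk_index <= len(c_array):
        raise IndexError(f"index {smalltalk_index} out of range 1..{len(c_array)}")
    return c_array[smalltalk_index - 1]  # convert 1-based to 0-based offset
```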

Reclaim may be blocked if STN_FREE_SPACE_THRESHOLD lower than #reclaimMinFreeSpaceMb

The system balances reclaim activity against free space using these two settings, to avoid using up all free space while performing reclaim. However, if the settings are inappropriate, with STN_FREE_SPACE_THRESHOLD lower than #reclaimMinFreeSpaceMb, the system can get stuck in a state where it cannot acquire free space by performing reclaim. In v3.4, the system generates an error if you attempt to set such a configuration. In cases where these checks are bypassed, it prints a warning in the reclaim gem log.
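
This sketch shows the kind of consistency check described, based only on the condition in this note's heading. The function name is hypothetical. Both values are assumed to be in megabytes. GemStone's actual check and error text may differ.

```python
def check_reclaim_free_space(stn_free_space_threshold_mb, reclaim_min_free_space_mb):
    """Reject settings in which reclaim could be blocked.

    Hypothetical check: per the heading, reclaim may be blocked when
    STN_FREE_SPACE_THRESHOLD is lower than #reclaimMinFreeSpaceMb.
    """
    if stn_free_space_threshold_mb < reclaim_min_free_space_mb:
        raise ValueError(
            f"STN_FREE_SPACE_THRESHOLD ({stn_free_space_threshold_mb} MB) is lower than "
            f"#reclaimMinFreeSpaceMb ({reclaim_min_free_space_mb} MB); reclaim may be blocked"
        )
```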

Hotstandby issues with manually gzipped tranlogs

If a transaction log was manually gzipped before being transmitted (e.g. while the logsender and logreceiver were not connected), on reconnect the transmit would error. Now, manually gzipped tranlogs are read by the logsender. (#46284)

If transaction logs written with record-level compression (using copydbf -c or after being transmitted to the slave by the logreceiver) were manually gzipped, these .gz files were not usable by restore or copydbf. (#46213)
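
These notes do not say how the logsender recognizes a gzipped tranlog. One general technique is to check the file's leading gzip magic bytes (0x1f 0x8b, RFC 1952) instead of trusting the file name. This sketch illustrates that technique only. The function name and file names are hypothetical, and it does not model the tranlog format itself.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)

def open_tranlog(path):
    """Open a file for binary reading, transparently decompressing gzip.

    Detection looks at the file contents rather than a .gz suffix, so a
    file that was gzipped by hand is still read correctly.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")
```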

Float passivate-activate resulted in SmallDouble

When a Float in the range of SmallDouble was passivated then reactivated, the result was a SmallDouble rather than the original Float. (#44082)

FileStream errored on read-only files

FileStream could not be used to read files for which the user did not have write permission. (#47155)

FileStream peekTwice failed

The peekTwice method failed when sent to a FileStream. (#47156)
