Recently, we found a memory corruption bug in our JB product. GC crashed while doing root scanning. During the execution of visitThreadStack, a virtual register (perhaps an integer or a boolean of value 0x1) was misinterpreted as an object address and was pushed to the mark stack. This is a conservative scanning so there is indeed a check (dvmIsValidObject) for each virtual register encountered.
The real issue is that the VR in question had a valid object address when dvmIsValidObject was called. However, its changed to 0x1 later on.
We believe the thread under scanning wasnt properly suspended during this GC. This has been proved by logs and historical information captured during further tests. The concerned user thread was still in THREAD_RUNNING state after when GC should have had all user threads suspended.
I believe there is a flaw in dvmLockThreadList:
void dvmLockThreadList(Thread* self)
{
ThreadStatus oldStatus;
if (self == NULL) /* try to get it from TLS */
self = dvmThreadSelf();
if (self != NULL) {
oldStatus = self->status;
self->status = THREAD_VMWAIT;
} else {
/* happens during VM shutdown */
oldStatus = THREAD_UNDEFINED; // shut up gcc
}
dvmLockMutex(
if (self != NULL)
self->status = oldStatus;
}
It may change a threads state from THREAD_VMWAIT to THREAD_RUNNING without checking if there is any suspension request. We created an example sequence to show how this may go wrong:
1 user thread calls dvmLockThreadList
2 user thread sets status to THREAD_VMWAIT
3 GC kicks in and calls dvmSuspendAllThreads
4 GC calls dvmLockThreadList and successfully locks gDvm.threadListLock
5 user thread failed to lock gDvm.threadListLock and therefore it’s in THREAD_VMWAIT status
6 GC increases suspendCount for all user threads
7 GC checks if any user thread is in THREAD_RUNNING state. If so it lets the thread execute till it reaches a checkpoint for suspendCount. Because the user thread in question is in THREAD_VMWAIT state, GC simply skips it.
8 GC releases gDvm.threadListLock
9 user thread locks gDvm.threadListLock successfully
10 user threads status is changed back to THREAD_RUNNING.
Further execution of dvmChangeStatus(,THREAD_RUNNING) wont suspend this user thread since its already in THREAD_RUNNING state. In our case, dvmCheckSuspendPending eventually suspended the user thread during interpretation but thats too late and damage has been done.
You may notice that this function has a comment "We dont have to check to see if we should be suspended once we have the lock. Nobody can suspend all threads without holding the thread list lock while they do it, so by definition there isnt a GC in progress.". BUT what if GC gets the thread list lock first? Thats what we just described.
Calling dvmChangeStatus in dvmLockThreadList is not gonna work because of potential deadlock. I wonder if you guys have any good idea how this should be fixed.