I recently migrated approximately 100 RHEL 6 machines from NFS3 to NFS4 (the server is RHEL 6.6; the clients are a mix of 6.3 through 6.6). Things went smoothly until several hours into the new configuration, when everything started running very slowly. Restarting the NFS server process clears the issue, which seems to indicate that the server is the problem.
The file system itself can sustain very high I/O throughput, on the order of 1 GB/sec, and the "th" values in /proc/net/rpc/nfsd never increase, which suggests I/Os are completing on time with no thread starvation. The server is set for 384 threads. Under the previous NFS3 configuration the thread count was much higher, and we had no problems.
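For anyone who wants to watch the same counters, here is a rough sketch of how two samples of the "th" line could be compared. It assumes the layout is `th <threads> <times-all-busy> <histogram values...>`; the function names and the sample text are made up for illustration and are not output from my server.

```python
def parse_th_line(text):
    """Extract the 'th' line from /proc/net/rpc/nfsd-style text.

    Assumed layout: th <threads> <times-all-busy> <histogram values...>
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            return {
                "threads": int(fields[1]),
                "all_busy": int(fields[2]),
                "histogram": [float(v) for v in fields[3:]],
            }
    raise ValueError("no 'th' line found")


def th_increased(before, after):
    """True if any counter after the thread count grew between two samples."""
    a, b = parse_th_line(before), parse_th_line(after)
    if b["all_busy"] > a["all_busy"]:
        return True
    return any(y > x for x, y in zip(a["histogram"], b["histogram"]))


# Hypothetical sample text in the assumed format (not real server output).
SAMPLE_BEFORE = (
    "rc 0 0 0\n"
    "th 384 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000\n"
)
SAMPLE_AFTER = SAMPLE_BEFORE

if __name__ == "__main__":
    print(parse_th_line(SAMPLE_BEFORE)["threads"])
    print(th_increased(SAMPLE_BEFORE, SAMPLE_AFTER))
```

On the real server the two strings would be the contents of /proc/net/rpc/nfsd read a few minutes apart.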
I suspect file locking, because the primary application in use (an in-house app) runs a lot of little startup scripts that call other scripts to set up the environment and so on. Under normal circumstances this startup takes about 6 seconds. Over time that grows to 30 and, in some cases, even 70 seconds.
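To put numbers on the slowdown, something like the following could be run periodically on a client. It is only a sketch assuming Python 3; the command being timed is a stand-in (the real startup script is in-house), and the 6-second baseline is the normal startup time mentioned above.

```python
import subprocess
import sys
import time

BASELINE_SECONDS = 6.0  # normal startup time under healthy conditions


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.call(argv)
    return time.monotonic() - start


def slowdown_factor(elapsed, baseline=BASELINE_SECONDS):
    """How many times slower than the baseline a run was."""
    return elapsed / baseline


if __name__ == "__main__":
    # Stand-in command; replace with the actual startup script.
    argv = [sys.executable, "-c", "pass"]
    elapsed = time_command(argv)
    print("%.2f s (%.2fx baseline)" % (elapsed, slowdown_factor(elapsed)))
```

Logging the result with a timestamp every few minutes would show whether the degradation is gradual or kicks in suddenly.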
I've scoured every reference to NFS4 performance degradation I could find, but nothing seems to describe what we are experiencing. A few retrans show up in nfsstat, but nothing that stands out. Generally, everything "looks" OK but clearly is not.
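For what it's worth, the retrans figure can be turned into a rate rather than eyeballed. The sketch below assumes Python 3 and that the client-side /proc/net/rpc/nfs has an `rpc <calls> <retrans> <authrefrsh>` line (the same counters nfsstat prints under "Client rpc stats"); the sample numbers are invented for illustration.

```python
def parse_client_rpc(text):
    """Return (calls, retrans) from the 'rpc' line of /proc/net/rpc/nfs-style text."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")


def retrans_ratio(before, after):
    """Retransmissions per RPC call between two samples (0.0 if no calls)."""
    calls0, retrans0 = parse_client_rpc(before)
    calls1, retrans1 = parse_client_rpc(after)
    calls = calls1 - calls0
    if calls == 0:
        return 0.0
    return (retrans1 - retrans0) / calls


# Hypothetical samples in the assumed format (not real client output).
SAMPLE_T0 = "net 0 0 0 0\nrpc 1000 5 1000\n"
SAMPLE_T1 = "net 0 0 0 0\nrpc 3000 9 3000\n"

if __name__ == "__main__":
    print("retrans per call: %.4f" % retrans_ratio(SAMPLE_T0, SAMPLE_T1))
```

Comparing the ratio from a healthy period against one taken during the slowdown would show whether retransmissions actually track the problem.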
Oh, and this is all running over 10G Ethernet. If memory serves, the kernel on the server is 2.6.32-504.8.1.
Any ideas about what else to check would be greatly appreciated.