Shared File Performance Improvements: LDLM Lock Ahead
Patrick Farrell (paf@cray.com)

Lock Ahead: Quick Refresher
• Let user space request LDLM extent locks with an ioctl
• Allows optimizing for various IO patterns by avoiding unnecessary LDLM lock contention
• Focused on improving shared file IO performance
• You were at my LUG talk, right?
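The contention lock ahead avoids can be sketched with a toy model (all names and numbers here are illustrative, not Lustre code): two clients write a shared file in alternating 1 MB strides, and we count writes that land on an extent locked by the *other* client, comparing server-expanded locks against pre-requested, exactly-sized locks.

```python
# Toy model: strided shared-file writes under two locking regimes.
# Purely illustrative; real lock ahead requests go through an ioctl.

def conflicts(locks_by_client, writes):
    """Count writes that hit an extent locked by a different client."""
    n = 0
    for client, (start, end) in writes:
        for other, extents in locks_by_client.items():
            if other != client and any(s <= end and start <= e for s, e in extents):
                n += 1
                break
    return n

stride = 1 << 20  # 1 MB stride per client
# Client 0 writes even-numbered 1 MB chunks, client 1 the odd ones.
writes = [(c, (2 * i * stride + c * stride, 2 * i * stride + (c + 1) * stride - 1))
          for c in (0, 1) for i in range(4)]

# Server-expanded locking: the first writer's lock grows to cover the
# whole file, so every write by the other client conflicts.
expanded = {0: [(0, 8 * stride - 1)], 1: []}
# Lock ahead: each client pre-requests locks exactly matching its stride.
lockahead = {c: [ext for cc, ext in writes if cc == c] for c in (0, 1)}

print(conflicts(expanded, writes), conflicts(lockahead, writes))  # 4 0
```

With expanded locks every one of client 1's four writes collides with client 0's whole-file lock; with exactly-sized pre-requested locks, none do.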

High Level Design
• Uses the same machinery as the existing asynchronous glimpse lock (AGL) implementation
• Glimpse locks are a special lock type which allows information to be extracted without taking a full lock
• In particular, glimpse locks on OSTs are for file size
• AGLs have a lot we need…

High Level Design
• AGLs: Used by statahead to speculatively gather size information
  • Statahead thread requests AGL locks
• Notable features:
  • LDLM lock request without a corresponding IO operation
  • Asynchronous: Requesting thread does not wait for reply from server

High Level Design
• Lock ahead request has no IO to do, so the AGL model is a good fit
• Asynchronous requests are critical to requesting a large number of locks ahead of IO
• If we had to wait for each lock request, performance gains would be lost
• Server must not expand lock ahead requests, so a new LDLM flag is added for that
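Rough arithmetic shows why waiting on each request would erase the gains. The numbers here are assumed for illustration (the 100,000-lock figure comes from the large-file example later in the talk; the round-trip time is a guess):

```python
# Back-of-envelope cost of *synchronous* lock ahead requests.
n_locks = 100_000        # locks requested ahead of IO (later slide's example)
rtt_seconds = 100e-6     # assumed 100 us client<->server round trip

# Waiting for each reply in turn serializes N round trips.
sync_cost = n_locks * rtt_seconds
print(f"{sync_cost:.0f} s")  # prints "10 s" of pure wait time
```

Asynchronous requests overlap those round trips instead of paying for them one at a time, which is why the AGL-style fire-and-forget model matters here.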

High Level Design: Wrinkles
• Problems came in three forms:
  • OFD glimpse callback/size checking problems
  • Async lock request handling
  • Race conditions
• Servers need to be able to get current file size from clients (ofd_intent_{policy,cb})
• Exploit the assumption that every write lock is being used for actual IO
• So the most distant write lock on any object will know the current size; only need to ask that lock about size

OFD Changes
• Lock ahead violates that assumption: A write extent lock (PW) can exist without a corresponding IO request, so the ‘most distant’ lock may have incorrect size
• Solution: Starting from the most distant lock, glimpse each lock until you find one which has size *inside* the extent of the lock (Thanks, Andreas)
• Not ideal, but except for lock ahead, there will almost never be a large number of write locks on one object
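The search can be sketched as follows (a minimal model, not the actual ofd_intent_policy code; the function and data shapes are illustrative). Walk the PW locks from most distant extent toward the start of the object; a client whose reported size falls inside its own lock's extent is actually writing there, so its view of the size is current:

```python
# Sketch: find object size by glimpsing write locks from most distant
# extent inward, stopping at the first lock whose reported size lies
# inside that lock's own extent.

def find_object_size(write_locks, glimpse):
    """write_locks: list of (start, end) PW extents on one object.
    glimpse((start, end)): asks that lock's client for its known size."""
    best = 0
    for start, end in sorted(write_locks, key=lambda l: l[1], reverse=True):
        size = glimpse((start, end))
        best = max(best, size)
        if start <= size <= end:
            # This client's size is inside the extent it holds, so it is
            # really doing IO there: its size is current. Stop glimpsing.
            return size
    # No lock contained the size (e.g. all remaining locks were
    # speculative lock ahead locks beyond EOF): use the largest reply.
    return best

# Example: locks at (200,299) and (100,199) plus (0,99); every client
# reports size 150, so the speculative (200,299) lock is skipped and the
# (100,199) holder's answer is taken without glimpsing (0,99).
calls = []
def glimpse(lock):
    calls.append(lock)
    return 150

print(find_object_size([(0, 99), (100, 199), (200, 299)], glimpse))  # 150
```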

OFD Changes
• Err. Oleg felt there was a race condition: In the normal case, the “most distant lock” will not change midstream, because the resource is locked & no new locks can be granted
• In this case, multiple clients can be writing, so while the glimpse callbacks are being sent, a different lock can become the “most distant” active lock
• Thoughts?

OFD Changes
• Possible performance problems for lock ahead when writing a large file
• For example, 100 GB per OST, 1 MB blocks: 100,000 locks per OST
• That’s a lot of callbacks, and also lots of contention in there (have to allocate lock lists atomically to avoid deadlocks)
• Impact TBD – Race conditions have impeded larger tests that would show this problem…
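The slide's figure is just block-count arithmetic, worth making explicit because it bounds the worst-case glimpse walk, since with one speculative lock per 1 MB block, a single size check may have to examine on the order of that many locks:

```python
# Lock count for the slide's example: one lock per 1 MB block across
# 100 GB of file data on a single OST.
GiB = 1024 ** 3
MiB = 1024 ** 2

locks_per_ost = (100 * GiB) // (1 * MiB)
print(locks_per_ost)  # 102400, i.e. the slide's "100,000 locks per OST"
```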

Race Conditions
• “NEVER sleep in PTLRPCD! NEVER!” – Oleg Drokin
• Async lock requests are made by ptlrpcd threads (instead of the requesting thread sleeping on the reply)
• ldlm_completion_ast: Can result in sleep
• ldlm_completion_ast_async: Alternate implementation, doesn’t sleep
• Long story, but the issue was the sleeping
• Required some other tweaks; will ask about them on the mailing list(?), but looks good
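The rule generalizes beyond Lustre: ptlrpcd is a shared pool of threads driving RPC completion for the whole client, so a completion handler that sleeps stalls every unrelated completion queued behind it. A generic sketch of the pattern (not Lustre code; names are made up for illustration):

```python
# A single shared worker drives completions for *all* requests, like a
# ptlrpcd thread. Handlers must record results and return immediately;
# any waiting is done later by the thread that actually cares.
import queue
import threading

events = queue.Queue()
done = []

def completion_worker():
    while True:
        handler = events.get()
        if handler is None:          # shutdown sentinel
            break
        handler()                    # must never sleep or block

def nonblocking_completion(name):
    # ldlm_completion_ast_async-style: note the grant and return.
    def handler():
        done.append(name)
    return handler

t = threading.Thread(target=completion_worker)
t.start()
events.put(nonblocking_completion("lock1"))
events.put(nonblocking_completion("lock2"))
events.put(None)
t.join()
print(done)  # ['lock1', 'lock2']
```

If either handler slept, "lock2" (and everything after it) would be delayed for the duration, which is exactly the behavior the quote forbids.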

Race Conditions
• LU-1669: Replace write mutex with range lock
• Now, multiple threads can race LDLM requests on the same object
• Lock ahead is an easy way to expose these, but most of them apply to normal IO as well
• IO completes, but unnecessary lock requests are generated
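The LU-1669 change can be sketched like this (a minimal model, not the Lustre implementation): writers take a byte-range lock instead of one per-file mutex, so non-overlapping writes proceed in parallel, and that parallelism is what lets their LDLM requests race:

```python
# Minimal byte-range lock: disjoint ranges are held concurrently,
# overlapping acquirers wait.
import threading

class RangeLock:
    def __init__(self):
        self._held = []                  # list of (start, end) held ranges
        self._cv = threading.Condition()

    def _overlaps(self, start, end):
        return any(s <= end and start <= e for s, e in self._held)

    def acquire(self, start, end):
        with self._cv:
            while self._overlaps(start, end):
                self._cv.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cv:
            self._held.remove((start, end))
            self._cv.notify_all()

rl = RangeLock()
rl.acquire(0, 99)
rl.acquire(100, 199)   # disjoint range: no waiting, unlike a single mutex
rl.release(0, 99)
rl.release(100, 199)
```

Under the old write mutex the second writer would have blocked; now both threads can reach the LDLM enqueue path at the same time, which is what exposes the races below.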

Race Conditions
• LU-6398: Two processes, P1 and P2
  • P1 starts a write, generates an LDLM lock request
  • P1 waits for a reply from the server
  • P2 starts a read to the same region of the file
  • P2 cannot match the lock requested by P1 since it’s still waiting for a reply
  • P2 waits for a reply from the server
  • P1 receives its reply; the lock is granted on the whole file
  • P2 receives its reply; its lock is blocked by the lock granted to P1
  • The lock for P1 is called back
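The heart of the race is that client-side lock matching only sees *granted* locks, so a lock still waiting for its server reply is invisible. A toy model of the sequence (illustrative only, not Lustre code):

```python
# Toy model of the LU-6398 window: P2's match cannot see P1's in-flight
# lock, so P2 enqueues redundantly and the two grants then conflict.
granted = []      # client locks that have server replies
in_flight = []    # enqueued locks still waiting for a reply

def lock_match(extent):
    # Current behavior: only granted locks are matchable.
    return extent in granted

def enqueue(extent):
    in_flight.append(extent)

def reply(extent):
    in_flight.remove(extent)
    granted.append(extent)

# P1 starts a write: nothing matches, so it enqueues and waits.
assert not lock_match("0-EOF")
enqueue("0-EOF")

# P2 reads the same region before P1's reply arrives: P1's lock is in
# flight, not granted, so the match fails and P2 also enqueues.
p2_matched = lock_match("0-EOF")
enqueue("0-EOF")

reply("0-EOF")    # P1's lock granted (expanded to the whole file)
print(p2_matched) # False: P2's request now conflicts with P1's fresh
                  # lock, which is immediately called back
```

The IO still completes, but the redundant enqueue plus the immediate callback of P1's brand-new lock is pure wasted work.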

Race Conditions
• Likely fix for LU-6398 is an enqueueing list to go with the waiting & granted lists
• Lock the resource(?) for the duration of ldlm_lock_match & add to the enqueueing list after that (if necessary)
• Not essential to fix, but would be nice
• LU-6397 is a special case related to new objects (Fixed – Thanks, Jinshan)

Questions
• Do you have any?
• Lock ahead work is in LU-6179
• If you want to help, some test cases would be especially welcome
• Cray will provide these, but community assistance would speed things up
• Happy to answer questions later or by email (paf@cray.com)

Other Information
• Thanks to everyone for comments & input