On that note, you might take a glance at the lightweight concurrency paper; it proposes a new concurrency runtime for GHC, basically moving most of the implementation into Haskell, with just a thin layer of primitives in the RTS. Interestingly, those primitives don't include locks, but instead a very minimal transactional memory.
http://research.microsoft.com/~simonpj/papers/lw-conc/index....
(The paper is also notable for giving a complete operational semantics of the system it discusses.)
(But this stuff isn't in GHC mainline, and won't be, for now, because it needs serious performance work first.)