The C startup code (for statically-linked binaries) and run-time linker (for dynamically-linked binaries) carve up the initial capabilities provided by the kernel into capabilities that cover the various global variables and function pointers needed by the program and its libraries. This is similar to how pointers are initialised for position-independent code: more complex, but the same principle of scanning through all the relocations and applying them. When you mmap(2) memory from the OS, you get back a capability with bounds covering that memory. When you malloc(3) memory from your libc, it finds space in an existing mapping, takes that mapping's capability and restricts its bounds to the allocation size. When you take a pointer to a stack-allocated variable, the compiler inserts an instruction to set the bounds of that capability to just the memory it allocated for that variable. Every pointer, whether "language-level" (what is exposed in the language) or "sub-language-level" (the pointers in the implementation, like return addresses on the stack or the stack pointer itself), is a capability, and all you need to do is insert a bounds-setting instruction at the point of allocation to restrict its bounds. So your libc's malloc needs modifying, as does your kernel, but your C program that calls them just needs to be recompiled for the pure-capability ABI.
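To make the "narrow the bounds at the point of allocation" idea concrete, here is a rough sketch in plain C. It is a software model only, not CHERI code: `cap_t`, `cap_set_bounds`, `cap_add`, `cap_can_access` and `toy_malloc` are names I made up for illustration, and it ignores things real hardware cares about (alignment, bounds representability, permissions). On a real CHERI toolchain the narrowing would be a bounds-setting instruction, which I believe CHERI C exposes as `cheri_bounds_set` in `<cheriintrin.h>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a capability: an address plus the
 * bounds it is allowed to touch. Real capabilities live in hardware
 * and carry an unforgeable validity tag; here the tag is just a bool. */
typedef struct {
    uintptr_t base;   /* lowest address covered */
    size_t    length; /* number of bytes covered */
    uintptr_t addr;   /* current address within (or past) the bounds */
    bool      tag;    /* false once a derivation was not permitted */
} cap_t;

/* Derive a capability covering [c.addr, c.addr + len). Bounds may only
 * shrink: asking for more than the parent covers yields an invalid
 * capability (modelled by a cleared tag; real hardware traps or
 * clears the tag). */
static cap_t cap_set_bounds(cap_t c, size_t len)
{
    cap_t out = { .base = c.addr, .length = len, .addr = c.addr, .tag = false };
    if (c.tag && c.addr >= c.base) {
        size_t off = (size_t)(c.addr - c.base);
        if (off <= c.length && len <= c.length - off)
            out.tag = true;
    }
    return out;
}

/* Pointer arithmetic moves the address but never changes the bounds. */
static cap_t cap_add(cap_t c, size_t n)
{
    c.addr += n;
    return c;
}

/* Would an n-byte access through c be allowed? */
static bool cap_can_access(cap_t c, size_t n)
{
    if (!c.tag || c.addr < c.base)
        return false;
    size_t off = (size_t)(c.addr - c.base);
    return off <= c.length && n <= c.length - off;
}

/* Toy bump allocator standing in for malloc(3): it holds one capability
 * for a whole mapping (as if returned by mmap(2)) and hands out
 * capabilities narrowed to each allocation. */
typedef struct {
    cap_t  heap; /* capability covering the whole mapping */
    size_t used; /* bytes already handed out */
} toy_heap_t;

static cap_t toy_malloc(toy_heap_t *h, size_t size)
{
    cap_t cursor = h->heap;
    cursor.addr = h->heap.base + h->used;
    cap_t obj = cap_set_bounds(cursor, size);
    if (obj.tag)
        h->used += size;
    return obj;
}
```

With a 4096-byte mapping at a made-up address like 0x1000, `toy_malloc(&h, 16)` hands back a capability whose bounds are exactly those 16 bytes. Stepping one past the end with `cap_add` still gives you a value, but `cap_can_access` rejects any access through it, and `cap_set_bounds(a, 17)` cannot widen it back out. The caller never sees the bounds-setting step, which is why the calling program only needs recompiling.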
Edit: To answer the first question, yes, that is the primitive which enables CHERI to be used for in-address-space compartmentalisation rather than relying on an MMU for process-based separation and all the overheads that come from context switching address spaces.