So, basically, either way the interpreter only has to execute a small number of bytecodes: the handful needed to call the builtins is comparable to the handful needed to run the recursive definition, so the interpretive overhead is comparable in both cases (and it swamps the cost of things like allocating a new string).
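To make that concrete, here is a rough Python sketch of the two approaches being weighed; the builtin-based version is an assumption (something like len(str(n))), and the names dl_builtin and dl_recursive are just illustrative:

    def dl_builtin(n):
        # let the builtins do the work: format n and count the characters,
        # which allocates a new string on every call
        return len(str(n))

    def dl_recursive(n):
        # repeated integer division; only a handful of bytecodes per level
        return 1 if n < 10 else 1 + dl_recursive(n // 10)
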
This machine is kind of slow. This took 54 CPU seconds in LuaJIT:
> function dl(n) if n < 10 then return 1 else return 1 + dl(n/10) end end
> for i = 1, 1000*1000*100 do dl(15322) end
That means this approach took 540 ns per invocation (54 CPU seconds over 10⁸ calls) rather than Python's 4640 ns, which makes Python only about 9× slower here instead of the usual 40×. Or maybe this is a case where LuaJIT isn't really coming through the way it usually does.
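For comparison, a minimal sketch of how the Python side of that measurement might be reproduced with timeit; the Python dl here is assumed to mirror the Lua definition above, not necessarily the exact code behind the 4640 ns figure:

    import timeit

    def dl(n):
        # same recursive digit count as the Lua version above
        return 1 if n < 10 else 1 + dl(n // 10)

    # time a million calls and report nanoseconds per invocation
    t = timeit.timeit(lambda: dl(15322), number=10**6)
    print("%.0f ns per call" % (t / 10**6 * 1e9))
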