The dataset is the more challenging part, but here Microsoft can help, since they have Bing and GitHub as well. So they might be able to take a few shortcuts here.
The most time-consuming part is compute, but here again Microsoft has the compute.
Will they beat GPT-4 in a year? My guess is no. But they will come very close to it, and maybe it would not matter that much if you focus on the product.
What I meant is: most likely, assuming you are using PyTorch or JAX, you could code up the model pretty fast. Just compare it to LLaMA: sure, it is far behind GPT-4, but the LLaMA model is under 1000 lines of code and pretty good.
There is tons of work for the training, the infra, preparing the data and so on. That would, I'd guess, add up to millions of lines of code. But the core ideas and the model itself are likely thin, I would argue. So that is my point.
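To make the "thin core" point concrete, here is a rough numpy sketch of a single decoder block, the unit that GPT/LLaMA-style models stack a few dozen times. This is my own toy illustration, not any real model's code: the weight shapes, the ReLU MLP, and the single attention head are simplifications (real models use multi-head attention, RMSNorm, SwiGLU, rotary embeddings, etc.), but the structure fits in well under 50 lines.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_attention(x, Wq, Wk, Wv, Wo):
    # single-head self-attention over a (tokens, dim) sequence
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)
    # causal mask: token i may only attend to tokens j <= i
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9
    return softmax(scores) @ v @ Wo

def mlp(x, W1, W2):
    # plain ReLU MLP; real models use gated variants like SwiGLU
    return np.maximum(x @ W1, 0.0) @ W2

def decoder_block(x, p):
    # pre-norm residual structure, as in GPT/LLaMA-style decoders
    x = x + causal_attention(layernorm(x), *p["attn"])
    x = x + mlp(layernorm(x), *p["mlp"])
    return x

d = 16  # toy embedding dimension
p = {
    "attn": [rng.normal(0, 0.1, (d, d)) for _ in range(4)],
    "mlp": [rng.normal(0, 0.1, (d, 4 * d)), rng.normal(0, 0.1, (4 * d, d))],
}
x = rng.normal(size=(8, d))   # 8 token embeddings
y = decoder_block(x, p)
print(y.shape)                # (8, 16): same shape, ready for the next block
```

The full model is essentially embeddings, N copies of this block, and an output projection; the millions of lines live around it, in the data and training pipelines.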