We know that LLMs require enormous amounts of data. Entities such as OpenAI scraped the Internet, grabbing whatever they could, which has fed vast quantities of data into these neural networks.
In this video I discuss how this is only one piece of the equation. If everyone is training on essentially the same public data, what is going to distinguish one model from another? Here we arrive at a paradox: proprietary data is going to be needed as well.
▶️ 3Speak