Large Language Models
Refer to the amazing 3blue1brown video explaining transformer neural networks.
The transformer network
The softmax equation
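For reference, the standard softmax function turns an arbitrary list of numbers $x_1, \dots, x_N$ into a valid probability distribution, by exponentiating each term and normalizing:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{n=1}^{N} e^{x_n}}$$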
In the same way that you might call the components of the output of this function probabilities, the components of the input are commonly called logits.
Adding randomness with the temperature parameter
In some situations, like when ChatGPT is using this distribution to generate the next word, there's room for a little bit of extra fun by adding a little extra spice into this function, with a constant $T$ thrown into the denominator of those exponents.
We call it the temperature, since it vaguely resembles the role of temperature in certain thermodynamics equations. The effect is that when $T$ is larger, more weight is given to the lower values, so the distribution is a bit more uniform; when $T$ is smaller, the biggest values dominate more aggressively, and in the extreme, setting $T = 0$ means all of the weight goes to the maximum value.
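Concretely, the temperature-modified softmax divides each exponent by $T$:

$$\text{softmax}_T(x)_i = \frac{e^{x_i / T}}{\sum_{n=1}^{N} e^{x_n / T}}$$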
For example, I'll have GPT-3 generate a story with the seed text "once upon a time there was a", but with a different temperature in each case. Temperature zero means it always goes with the most predictable word, and what you get ends up being a trite derivative of Goldilocks. A higher temperature gives it a chance to choose less likely words, but it comes with the risk of the output drifting into nonsense.
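As a minimal sketch of how temperature sampling works in practice, here is a small Python example over a toy vocabulary; the tokens and logit values are invented for illustration, not taken from GPT-3.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, with temperature scaling.

    temperature -> 0 approaches argmax (always the most likely token);
    higher temperature flattens the distribution toward uniform.
    """
    if temperature == 0:
        # Degenerate case: all weight on the maximum logit.
        probs = np.zeros_like(logits, dtype=float)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

# Toy next-word candidates after "once upon a time there was a ..."
# (tokens and logits are made up for this sketch)
tokens = ["princess", "dragon", "frog", "toaster"]
logits = [4.0, 3.0, 2.0, 0.5]

rng = np.random.default_rng(0)
for T in [0, 0.7, 1.0, 2.0]:
    probs = softmax_with_temperature(logits, T)
    sample = rng.choice(tokens, p=probs)
    print(f"T={T}: probs={np.round(probs, 3)} -> sampled '{sample}'")
```

At $T = 0$ the sampler always picks the top token; raising $T$ flattens the distribution and gives the long-tail tokens a real chance of being chosen.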