Training large language models like GPT-4, or diffusion models for image generation, requires processing billions of data points through networks with hundreds of billions of parameters. These workloads run on clusters of GPUs or TPUs, accelerators optimized for the dense matrix multiplications that dominate both the forward pass and the backpropagation step of gradient descent.
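To make the connection between matrix multiplication and gradient descent concrete, here is a minimal sketch of a single training step, written in JAX as one representative accelerator framework. The model, layer sizes, batch shapes, and learning rate are all illustrative assumptions, not details from the source; the point is only that the forward pass is built from matrix multiplications and that one optimizer step computes gradients and updates the parameters.

```python
# Minimal sketch: a matmul-heavy forward pass and one gradient descent step.
# All shapes and hyperparameters below are hypothetical, chosen for illustration.
import jax
import jax.numpy as jnp

def forward(params, x):
    # Two dense layers: each `@` is a matrix multiplication, the
    # operation GPU/TPU hardware is optimized to execute.
    h = jnp.tanh(x @ params["W1"])
    return h @ params["W2"]

def loss(params, x, y):
    # Toy mean-squared-error objective for the sketch.
    return jnp.mean((forward(params, x) - y) ** 2)

@jax.jit  # compiled to run on the accelerator (GPU/TPU)
def sgd_step(params, x, y, lr=1e-3):
    grads = jax.grad(loss)(params, x, y)  # backpropagation
    # Gradient descent: move each parameter against its gradient.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = {
    "W1": jax.random.normal(k1, (512, 2048)) * 0.02,
    "W2": jax.random.normal(k2, (2048, 512)) * 0.02,
}
x = jnp.ones((32, 512))   # a toy input batch
y = jnp.zeros((32, 512))  # toy targets
params = sgd_step(params, x, y)  # one training step
```

At production scale this same loop is sharded across thousands of accelerators, but the per-step structure (forward matmuls, gradient computation, parameter update) is unchanged.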