As models grow in their number of parameters, their computational requirements grow with them. During inference, however, you want to allocate as few compute resources as possible to a service that must run 24/7, simply to keep operational costs low. Therefore, an active field of research…