Many high-performance computing applications rely on the application of basic linear algebra operations to large groups of very small matrices. To address this need, in recent years the computational linear algebra community has developed batched BLAS (Basic Linear Algebra Subprograms) routines designed to concurrently perform basic linear algebra operations on problems that are individually too small to benefit from parallelism. While batched BLAS operations provide meaningful performance improvements for these applications, significant further speedup is possible through non-canonical data layouts that allow for cross-matrix vectorization in the inner BLAS kernels. In this talk we begin with an introduction to the BLAS and LAPACK libraries within the Intel Math Kernel Library (Intel MKL), including an overview of the Intel MKL batched BLAS routines and their applications. We then introduce the basic elements of optimizing scientific codes for Intel Architectures. Finally, we present a new batched BLAS API, called Compact Batched BLAS. The main idea behind the Compact Batched API is to perform true SIMD (Single Instruction, Multiple Data) computations in which subgroups of matrices are operated on with kernels that abstractly appear as scalar kernels, while vector registers are filled by cross-matrix vectorization. We conclude with performance results on the motivating applications.