Fast, high-quality video + audio generation with first- and last-frame conditioning and optional audio input [model] [code]