Abstract: This article proposes a unified framework, dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer), that adaptively handles a varying number of views and varying video lengths without camera ...