SwinMTL

Paper: SwinMTL Paper
Code: SwinMTL GitHub Repository
Animated GIF

swinMTL is a multi-task learning framework for depth estimation and semantic segmentation. This project presents an approach to simultaneously estimate depth and semantic segmentation from a single camera image, offering a powerful tool for understanding 3D scenes. This method leverage a shared encoder-decoder architecture, and achieves superior performance on both tasks compared to individual networks, while improving computational efficiency. This makes it particularly suitable for embedded systems and real-time applications. Key to its success is the integration of various techniques: Pre-training with Masked Image Modeling: Enhances feature representation for both tasks. Critics (discriminators): Refine predictions, ensuring alignment with ground truth labels. Logarithmic Depth Scaling: Emphasizes crucial near-range information in depth estimation. MixUp Augmentation: Increases data diversity and robustness.

Extensive evaluations on Cityscapes and NYU2 datasets demonstrate state-of-the-art performance in both depth estimation and semantic segmentation. This paves the way to explore exciting applications in areas like: Scene reconstruction for autonomous vehicles and robots 3D mapping of indoor environments Augmented reality with seamless object interaction By offering an efficient and accurate multi-task learning framework.