I am integrating Swin Transformer blocks into the YOLOv8 backbone (Ultralytics) and injecting pretrained Swin weights using timm.
The model trains and runs without any runtime errors, but the performance (mAP) is significantly worse than standard YOLOv8n on the same dataset.
What I am doing
I created a custom YOLOv8 YAML file that adds SwinTransformer blocks to the backbone:
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 6 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs
  s: [0.33, 0.50, 1024] # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs
  m: [0.67, 0.75, 768] # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients, 79.3 GFLOPs
  l: [1.00, 1.00, 512] # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512] # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 6, SwinTransformer, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 6, SwinTransformer, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 3, SwinTransformer, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)

I used ChatGPT to write this code, which is meant to load the pretrained Swin Transformer weights (.pth file):
import torch
from ultralytics import YOLO
import timm

# --- PATH AND CONFIGURATION SETTINGS ---
# Path to the YAML file with the added Swin Transformer blocks
YOLO_CONFIG_PATH = '../yolov8_three_swinTrans.yaml'
# Path to the standard YOLOv8n weights (for the other layers)
YOLOV8_WEIGHTS_PATH = '../yolov8n.pt'
# Path to the RTTS dataset YAML
DATASET_YAML_PATH = "../RTTS/data.yaml"
# Pretrained Swin-T model to use from the timm library
SWIN_MODEL_NAME = 'swinv2_large_window12_192_22k'

# Indices of the SwinTransformer layers in the YAML:
# 4: - [-1, 6, SwinTransformer, [256, True]]
# 6: - [-1, 6, SwinTransformer, [512, True]]
# 8: - [-1, 3, SwinTransformer, [1024, True]]
SWIN_LAYER_INDICES = [4, 6, 8]


# --- MANUAL LOADING FUNCTION ---
def inject_swin_weights_multi(yolo_model, swin_name, layer_indices):
    """Inject pretrained Swin Transformer weights into multiple YOLOv8 layers."""
    print(f"🔄 Downloading {swin_name} weights and injecting them into multiple layers...")
    try:
        # 1. Load the (pretrained) Swin Transformer model from timm
        swin_timm_model = timm.create_model(swin_name, pretrained=True)
        swin_timm_state_dict = swin_timm_model.state_dict()
        yolo_state_dict = yolo_model.state_dict()
        new_swin_weights = {}

        for index in layer_indices:
            yolo_swin_prefix = f'model.{index}'
            print(f" -> Matching weights for {yolo_swin_prefix}...")

            # Matching step (as in the previous example)
            for k_timm, v_timm in swin_timm_state_dict.items():
                k_yolo = None
                # Most likely matches
                if k_timm.startswith('patch_embed'):
                    k_yolo = f"{yolo_swin_prefix}.{k_timm}"
                elif k_timm.startswith('layers'):
                    k_yolo = f"{yolo_swin_prefix}.{k_timm}"
                elif k_timm.startswith('norm'):
                    k_yolo = f"{yolo_swin_prefix}.{k_timm}"

                if k_yolo and k_yolo in yolo_state_dict:
                    if yolo_state_dict[k_yolo].shape == v_timm.shape:
                        new_swin_weights[k_yolo] = v_timm
                    else:
                        print(f" ⚠️ Error: shape mismatch: {k_yolo} ({yolo_state_dict[k_yolo].shape}) vs {v_timm.shape}")

        # 2. Merge the new weights into the existing YOLO weights
        yolo_state_dict.update(new_swin_weights)
        # 3. Load the model (with strict=False)
        yolo_model.load_state_dict(yolo_state_dict, strict=False)
        print("✅ Swin Transformer weights successfully injected.")
    except Exception as e:
        print(f"❌ Error during Swin weight injection: {e}")
        print("The model will train the SwinTransformer layers from scratch.")


# --- MAIN TRAINING CODE ---
if __name__ == '__main__':
    # 1. Build the model from the Swin-T configuration
    model = YOLO(YOLO_CONFIG_PATH)
    # 2. Load the standard YOLO weights (for the other layers)
    model.load(YOLOV8_WEIGHTS_PATH)
    print("📢 Standard YOLOv8n weights loaded.")
    # 3. Inject the Swin-T weights
    inject_swin_weights_multi(model, SWIN_MODEL_NAME, SWIN_LAYER_INDICES)

    # --- IMPROVED TRAINING STRATEGY ---
    # Freeze the Swin layers: keep the model frozen for the first 10 epochs to preserve the pretrained knowledge.
    # 4. Start training (fine-tuning with a low learning rate)
    model.train(
        data=DATASET_YAML_PATH,
        epochs=100,  # high epoch count (for better fine-tuning)
        imgsz=640,
        lr0=1e-4,  # keep the initial learning rate very low
        warmup_epochs=5,
        name='yolov8_three_swin_fine_tuned'
    )

Training runs normally and I get logs like this:
Transferred 225/419 items from pretrained weights
Standard YOLOv8n weights loaded.
swinv2_large_window12_192_22k weights downloaded...
-> matching weights for model.4 ...
-> matching weights for model.6 ...
-> matching weights for model.8 ...
Swin weights successfully injected.

The Problem
With standard YOLOv8n, I get:
mAP ~ 0.74 (RTTS dataset)
With my YOLOv8 + Swin Transformer hybrid, I get:
mAP ~ 0.69 - 0.72
So adding Swin actually reduces accuracy instead of improving it.
The Question
Why does adding Swin Transformer blocks into YOLOv8 lead to lower mAP, even when pretrained Swin weights are injected?
Possible things I’m not sure about:
Are timm Swin weights incompatible with Ultralytics Swin implementation?
Do window sizes / patch sizes mismatch (192 vs 640)?
Is my prefix mapping logic incorrect?
Does YOLOv8 expect different normalization layers than timm uses?
Should the Swin blocks be placed differently in the backbone?
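One way to test the prefix-mapping question is to count how many source keys actually land on a destination key, rather than relying on the success message, which is printed even when nothing was matched. The following is a minimal sketch, assuming each state dict has been reduced to a name-to-shape mapping. The helper `match_report` and every key name in the toy data are hypothetical and are not taken from Ultralytics or timm:

```python
def match_report(src_shapes, dst_shapes, prefix):
    """Classify source keys by how `prefix + '.' + key` maps onto the destination."""
    report = {"matched": [], "shape_mismatch": [], "missing": []}
    for name, shape in src_shapes.items():
        target = f"{prefix}.{name}"
        if target not in dst_shapes:
            # No destination parameter with this name: silently skipped by a loader
            report["missing"].append(name)
        elif tuple(dst_shapes[target]) != tuple(shape):
            report["shape_mismatch"].append(name)
        else:
            report["matched"].append(name)
    return report


# Toy shapes with hypothetical key names. With real models these could be built as
# {k: tuple(v.shape) for k, v in module.state_dict().items()}.
src = {
    "layers.0.blocks.0.norm1.weight": (192,),
    "norm.weight": (1536,),
}
dst = {
    "model.model.4.m.0.norm1.weight": (64,),
    "model.model.4.cv1.conv.weight": (64, 64, 1, 1),
}

report = match_report(src, dst, prefix="model.4")
print({kind: len(names) for kind, names in report.items()})
```

If `matched` is empty for every layer index, the injection is a no-op and the Swin layers keep their initial values. Printing a handful of keys from both state dicts first is a cheap way to see whether the prefix and the sub-module names line up at all.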
What I want to know
What is the correct way to load pretrained Swin weights into a custom YOLOv8 model?
Is mixing YOLOv8 and Swin Transformer conceptually incompatible without rewriting more of the architecture?
Why would the hybrid model perform worse even with pretrained weights?
How should Swin blocks be configured (dimensions, stages, window size) to work well with YOLOv8?
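For the dimension and window-size questions, some of the arithmetic can be checked without loading any model. The sketch below rests on two assumptions that should be verified against the installed code: that the YAML channel counts of the custom SwinTransformer module are scaled by the `n` width multiple (0.25), which depends on how the module was registered in tasks.py, and that swinv2_large uses an embedding dimension of 192 that doubles at each stage, with a window size of 12. The `make_divisible` helper here is a local stand-in for the rounding step, not an import from Ultralytics:

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


WIDTH, MAX_CHANNELS = 0.25, 1024   # 'n' scale from the YAML above
YAML_CHANNELS = [256, 512, 1024]   # SwinTransformer args at layers 4, 6, 8
STRIDES = [8, 16, 32]              # P3, P4, P5
IMG_SIZE, WINDOW = 640, 12         # training imgsz and assumed Swin window size

# Channel widths the YOLO layers would have after width scaling (assumption)
yolo_channels = [make_divisible(min(c, MAX_CHANNELS) * WIDTH) for c in YAML_CHANNELS]
# Stage widths of a Swin model with embedding dimension 192 (assumption)
swin_channels = [192 * 2**stage for stage in range(4)]
# Spatial size of each feature map, and the padding needed to tile it with windows
feature_sizes = [IMG_SIZE // s for s in STRIDES]
padding = [(-size) % WINDOW for size in feature_sizes]

print("YOLOv8n stage channels:", yolo_channels)
print("Swin stage channels:   ", swin_channels)
print("Feature map sizes:     ", feature_sizes)
print("Padding to a multiple of the window size:", padding)
```

Under these assumptions none of the stage widths coincide, and none of the 640-pixel feature maps is a multiple of the window size, so any tensor whose shape depends on the channel width or the window would fail a strict shape comparison.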
Any insights, references, or examples would be extremely appreciated.
I added the Swin code to conv.py and tasks.py by following this GitHub repository: https://github.com/Marfbin/NEU-DET-with-yolov8
I then realized that I was not using any pretrained Swin Transformer weights (.pth file); at that point mAP was even lower, around 0.55. I tried to load the .pth weights as shown above, but I suspect the loading failed. I have been trying to improve mAP on my dataset for days and have searched GitHub, Stack Overflow, and ChatGPT without finding a solution. I am new to this topic, so any help is welcome. Please ask if you need to see more of the code; I can share it.
