Merge remote-tracking branch 'origin/master' into bits_and_tpu

pull/804/merge^2
Ross Wightman
commit 87bfb055c0

.github/ISSUE_TEMPLATE/config.yml

@@ -2,4 +2,4 @@ blank_issues_enabled: false
 contact_links:
   - name: Community Discussions
     url: https://github.com/rwightman/pytorch-image-models/discussions
-    about: Issues are for features and bugs. Questions can be asked in Discussions.
+    about: Hparam request in issues will be ignored! Issues are for features and bugs. Questions can be asked in Discussions.

.github/ISSUE_TEMPLATE/feature_request.md

@@ -1,8 +1,7 @@
 ---
 name: Feature request
-about: Suggest an idea for this project. Issues are for reporting bugs or requesting
-  features, the discussion forum is available for asking questions or seeking help
-  from the community.
+about: Suggest an idea for this project. Hparam requests, training help are not feature requests.
+  The discussion forum is available for asking questions or seeking help from the community.
 title: "[FEATURE] Feature title..."
 labels: enhancement
 assignees: ''

README.md

@@ -21,6 +21,30 @@ And a big thanks to all GitHub sponsors who helped with some of my costs before
 ## What's New
 
+### Aug 29, 2022
+* MaxVit window size scales with img_size by default. Add new RelPosMlp MaxViT weight that leverages this:
+  * `maxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.6 @ 320 (T)
+
+### Aug 26, 2022
+* CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) `timm` original models
+  * both found in [`maxxvit.py`](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/maxxvit.py) model def, contains numerous experiments outside scope of original papers
+  * an unfinished Tensorflow version from MaxVit authors can be found https://github.com/google-research/maxvit
+* Initial CoAtNet and MaxVit timm pretrained weights (working on more):
+  * `coatnet_nano_rw_224` - 81.7 @ 224 (T)
+  * `coatnet_rmlp_nano_rw_224` - 82.0 @ 224, 82.8 @ 320 (T)
+  * `coatnet_0_rw_224` - 82.4 (T) -- NOTE timm '0' coatnets have 2 more 3rd stage blocks
+  * `coatnet_bn_0_rw_224` - 82.4 (T)
+  * `maxvit_nano_rw_256` - 82.9 @ 256 (T)
+  * `coatnet_rmlp_1_rw_224` - 83.4 @ 224, 84 @ 320 (T)
+  * `coatnet_1_rw_224` - 83.6 @ 224 (G)
+  * (T) = TPU trained with `bits_and_tpu` branch training code, (G) = GPU trained
+* GCVit (weights adapted from https://github.com/NVlabs/GCVit, code 100% `timm` re-write for license purposes)
+* MViT-V2 (multi-scale vit, adapted from https://github.com/facebookresearch/mvit)
+* EfficientFormer (adapted from https://github.com/snap-research/EfficientFormer)
+* PyramidVisionTransformer-V2 (adapted from https://github.com/whai362/PVT)
+* 'Fast Norm' support for LayerNorm and GroupNorm that avoids float32 upcast w/ AMP (uses APEX LN if available for further boost)
+
 ### Aug 15, 2022
 * ConvNeXt atto weights added
   * `convnext_atto` - 75.7 @ 224, 77.0 @ 288
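The new CoAtNet / MaxViT variants above are regular `timm` models created through the usual factory. A minimal sketch, assuming an install that includes this commit and that the released weight files are reachable (names as listed in the notes above):

```python
import torch
import timm

# timm-original CoAtNet / MaxViT variants live in timm/models/maxxvit.py
print(timm.list_models('coatnet*') + timm.list_models('maxvit*'))

# maxvit_rmlp_nano_rw_256 is one of the weights released with this change (256x256 default res)
model = timm.create_model('maxvit_rmlp_nano_rw_256', pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 1000])
```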
@@ -229,6 +253,7 @@ A full version of the list below with source links can be found in the [document
 * Bottleneck Transformers - https://arxiv.org/abs/2101.11605
 * CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239
 * CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399
+* CoAtNet (Convolution and Attention) - https://arxiv.org/abs/2106.04803
 * ConvNeXt - https://arxiv.org/abs/2201.03545
 * ConViT (Soft Convolutional Inductive Biases Vision Transformers)- https://arxiv.org/abs/2103.10697
 * CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929

@@ -238,6 +263,7 @@ A full version of the list below with source links can be found in the [document
 * DLA - https://arxiv.org/abs/1707.06484
 * DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629
 * EdgeNeXt - https://arxiv.org/abs/2206.10589
+* EfficientFormer - https://arxiv.org/abs/2206.01191
 * EfficientNet (MBConvNet Family)
   * EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252
   * EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665

@@ -250,6 +276,7 @@ A full version of the list below with source links can be found in the [document
   * MobileNet-V2 - https://arxiv.org/abs/1801.04381
   * Single-Path NAS - https://arxiv.org/abs/1904.02877
   * TinyNet - https://arxiv.org/abs/2010.14819
+* GCViT (Global Context Vision Transformer) - https://arxiv.org/abs/2206.09959
 * GhostNet - https://arxiv.org/abs/1911.11907
 * gMLP - https://arxiv.org/abs/2105.08050
 * GPU-Efficient Networks - https://arxiv.org/abs/2006.14090

@@ -259,6 +286,7 @@ A full version of the list below with source links can be found in the [document
 * Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261
 * Lambda Networks - https://arxiv.org/abs/2102.08602
 * LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136
+* MaxViT (Multi-Axis Vision Transformer) - https://arxiv.org/abs/2204.01697
 * MLP-Mixer - https://arxiv.org/abs/2105.01601
 * MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244
   * FBNet-V3 - https://arxiv.org/abs/2006.02049

@@ -266,6 +294,7 @@ A full version of the list below with source links can be found in the [document
   * LCNet - https://arxiv.org/abs/2109.15099
 * MobileViT - https://arxiv.org/abs/2110.02178
 * MobileViT-V2 - https://arxiv.org/abs/2206.02680
+* MViT-V2 (Improved Multiscale Vision Transformer) - https://arxiv.org/abs/2112.01526
 * NASNet-A - https://arxiv.org/abs/1707.07012
 * NesT - https://arxiv.org/abs/2105.12723
 * NFNet-F - https://arxiv.org/abs/2102.06171

@@ -273,6 +302,7 @@ A full version of the list below with source links can be found in the [document
 * PNasNet - https://arxiv.org/abs/1712.00559
 * PoolFormer (MetaFormer) - https://arxiv.org/abs/2111.11418
 * Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302
+* PVT-V2 (Improved Pyramid Vision Transformer) - https://arxiv.org/abs/2106.13797
 * RegNet - https://arxiv.org/abs/2003.13678
 * RegNetZ - https://arxiv.org/abs/2103.06877
 * RepVGG - https://arxiv.org/abs/2101.03697

benchmark.py

@@ -19,7 +19,7 @@ import torch.nn as nn
 import torch.nn.parallel
 from timm.data import resolve_data_config
-from timm.models import create_model, is_model, list_models
+from timm.models import create_model, is_model, list_models, set_fast_norm
 from timm.optim import create_optimizer_v2
 from timm.utils import setup_default_logging, set_jit_fuser, decay_batch_step, check_batch_size_retry

@@ -109,7 +109,8 @@ scripting_group.add_argument('--torchscript', dest='torchscript', action='store_
                              help='convert model torchscript for inference')
 scripting_group.add_argument('--aot-autograd', default=False, action='store_true',
                              help="Enable AOT Autograd support. (It's recommended to use this option with `--fuser nvfuser` together)")
+scripting_group.add_argument('--fast-norm', default=False, action='store_true',
+                             help='enable experimental fast-norm')

 # train optimizer parameters
 parser.add_argument('--opt', default='sgd', type=str, metavar='OPTIMIZER',

@@ -598,6 +599,9 @@ def main():
     model_cfgs = []
     model_names = []

+    if args.fast_norm:
+        set_fast_norm()
+
     if args.model_list:
         args.model = ''
         with open(args.model_list) as f:
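The new `--fast-norm` flag only flips a global switch before models are built. A rough script-level equivalent, assuming a CUDA device and a timm checkout that includes this commit (`set_fast_norm` is re-exported from `timm.models` as shown in the diff above):

```python
import torch
import timm
from timm.models import set_fast_norm

set_fast_norm()  # enable experimental fast-norm globally, before model creation

model = timm.create_model('maxvit_nano_rw_256', pretrained=False).cuda().eval()

# under AMP, timm LayerNorm/GroupNorm layers now skip the float32 upcast
with torch.no_grad(), torch.cuda.amp.autocast():
    out = model(torch.randn(8, 3, 256, 256, device='cuda'))
print(out.shape)
```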

timm/models/__init__.py

@@ -69,5 +69,6 @@ from .helpers import load_checkpoint, resume_checkpoint, model_parameters
 from .layers import TestTimePoolHead, apply_test_time_pool
 from .layers import convert_splitbn_model, convert_sync_batchnorm
 from .layers import is_scriptable, is_exportable, set_scriptable, set_exportable, is_no_jit, set_no_jit
+from .layers import set_fast_norm
 from .registry import register_model, model_entrypoint, list_models, is_model, list_modules, is_model_in_modules,\
     is_model_pretrained, get_pretrained_cfg, has_pretrained_cfg_key, is_pretrained_cfg_key, get_pretrained_cfg_value

timm/models/densenet.py

@@ -115,7 +115,7 @@ class DenseBlock(nn.ModuleDict):
     _version = 2

     def __init__(
-            self, num_layers, num_input_features, bn_size, growth_rate, norm_layer=nn.ReLU,
+            self, num_layers, num_input_features, bn_size, growth_rate, norm_layer=BatchNormAct2d,
             drop_rate=0., memory_efficient=False):
         super(DenseBlock, self).__init__()
         for i in range(num_layers):

@@ -138,7 +138,7 @@ class DenseBlock(nn.ModuleDict):
 class DenseTransition(nn.Sequential):
-    def __init__(self, num_input_features, num_output_features, norm_layer=nn.BatchNorm2d, aa_layer=None):
+    def __init__(self, num_input_features, num_output_features, norm_layer=BatchNormAct2d, aa_layer=None):
         super(DenseTransition, self).__init__()
         self.add_module('norm', norm_layer(num_input_features))
         self.add_module('conv', nn.Conv2d(

timm/models/layers/fast_norm.py

@@ -1,3 +1,11 @@
+""" 'Fast' Normalization Functions
+
+For GroupNorm and LayerNorm these functions bypass typical AMP upcast to float32.
+
+Additionally, for LayerNorm, the APEX fused LN is used if available (which also does not upcast)
+
+Hacked together by / Copyright 2022 Ross Wightman
+"""
 from typing import List, Optional

 import torch

@@ -37,6 +45,7 @@ def fast_group_norm(
     if torch.is_autocast_enabled():
         # normally native AMP casts GN inputs to float32
         # here we use the low precision autocast dtype
+        # FIXME what to do re CPU autocast?
         dt = torch.get_autocast_gpu_dtype()
         x, weight, bias = x.to(dt), weight.to(dt), bias.to(dt)

@@ -62,6 +71,7 @@ def fast_layer_norm(
         # normally native AMP casts LN inputs to float32
         # apex LN does not, this is behaving like Apex
         dt = torch.get_autocast_gpu_dtype()
+        # FIXME what to do re CPU autocast?
         x, weight, bias = x.to(dt), weight.to(dt), bias.to(dt)

     with torch.cuda.amp.autocast(enabled=False):
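The practical effect of the bypass described in the docstring: under CUDA autocast, stock `F.layer_norm` is run in float32, while the fast path keeps the low-precision autocast dtype. A small check, assuming a CUDA device, no APEX installed (the fused APEX path takes over otherwise), and the `fast_layer_norm(x, normalized_shape, weight, bias)` argument order shown in the function above:

```python
import torch
import torch.nn.functional as F
from timm.models.layers.fast_norm import fast_layer_norm

x = torch.randn(2, 196, 384, device='cuda')
w = torch.ones(384, device='cuda')
b = torch.zeros(384, device='cuda')

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y_ref = F.layer_norm(x, (384,), w, b)      # native AMP upcasts LN to float32
    y_fast = fast_layer_norm(x, (384,), w, b)  # stays in the autocast dtype

print(y_ref.dtype, y_fast.dtype)  # expected without APEX: torch.float32 torch.bfloat16
```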

timm/models/layers/norm.py

@@ -1,4 +1,8 @@
 """ Normalization layers and wrappers
+
+Norm layer definitions that support fast norm and consistent channel arg order (always first arg).
+
+Hacked together by / Copyright 2022 Ross Wightman
 """
 import torch

timm/models/layers/norm_act.py

@@ -1,4 +1,16 @@
 """ Normalization + Activation Layers
+
+Provides Norm+Act fns for standard PyTorch norm layers such as
+* BatchNorm
+* GroupNorm
+* LayerNorm
+
+This allows swapping with alternative layers that are natively both norm + act such as
+* EvoNorm (evo_norm.py)
+* FilterResponseNorm (filter_response_norm.py)
+* InplaceABN (inplace_abn.py)
+
+Hacked together by / Copyright 2022 Ross Wightman
 """
 from typing import Union, List, Optional, Any
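The densenet.py change above is one instance of what this docstring describes: `norm_layer` now expects a combined norm + act module rather than a bare norm. A hedged sketch of the drop-in pattern, assuming `BatchNormAct2d` and `GroupNormAct` are exported from `timm.models.layers` as in current timm (extra kwargs such as drop layers omitted):

```python
import torch
import torch.nn as nn
from timm.models.layers import BatchNormAct2d, GroupNormAct

# norm + act fused in one module, channel count always the first positional arg
norm1 = BatchNormAct2d(64)                                  # BN + default ReLU
norm2 = GroupNormAct(64, num_groups=8, act_layer=nn.SiLU)   # GN + SiLU

x = torch.randn(2, 64, 56, 56)
print(norm1(x).shape, norm2(x).shape)
```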

timm/models/layers/weight_init.py

@@ -5,7 +5,7 @@ import warnings
 from torch.nn.init import _calculate_fan_in_and_fan_out


-def _no_grad_trunc_normal_(tensor, mean, std, a, b):
+def _trunc_normal_(tensor, mean, std, a, b):
     # Cut & paste from PyTorch official master until it's in a few official releases - RW
     # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
     def norm_cdf(x):

@@ -17,28 +17,27 @@ def _no_grad_trunc_normal_(tensor, mean, std, a, b):
                       "The distribution of values may be incorrect.",
                       stacklevel=2)

-    with torch.no_grad():
-        # Values are generated by using a truncated uniform distribution and
-        # then using the inverse CDF for the normal distribution.
-        # Get upper and lower cdf values
-        l = norm_cdf((a - mean) / std)
-        u = norm_cdf((b - mean) / std)
-
-        # Uniformly fill tensor with values from [l, u], then translate to
-        # [2l-1, 2u-1].
-        tensor.uniform_(2 * l - 1, 2 * u - 1)
-
-        # Use inverse cdf transform for normal distribution to get truncated
-        # standard normal
-        tensor.erfinv_()
-
-        # Transform to proper mean, std
-        tensor.mul_(std * math.sqrt(2.))
-        tensor.add_(mean)
-
-        # Clamp to ensure it's in the proper range
-        tensor.clamp_(min=a, max=b)
-        return tensor
+    # Values are generated by using a truncated uniform distribution and
+    # then using the inverse CDF for the normal distribution.
+    # Get upper and lower cdf values
+    l = norm_cdf((a - mean) / std)
+    u = norm_cdf((b - mean) / std)
+
+    # Uniformly fill tensor with values from [l, u], then translate to
+    # [2l-1, 2u-1].
+    tensor.uniform_(2 * l - 1, 2 * u - 1)
+
+    # Use inverse cdf transform for normal distribution to get truncated
+    # standard normal
+    tensor.erfinv_()
+
+    # Transform to proper mean, std
+    tensor.mul_(std * math.sqrt(2.))
+    tensor.add_(mean)
+
+    # Clamp to ensure it's in the proper range
+    tensor.clamp_(min=a, max=b)
+    return tensor


 def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):

@@ -64,7 +63,8 @@ def trunc_normal_(tensor, mean=0., std=1., a=-2., b=2.):
         >>> w = torch.empty(3, 5)
         >>> nn.init.trunc_normal_(w)
     """
-    return _no_grad_trunc_normal_(tensor, mean, std, a, b)
+    with torch.no_grad():
+        return _trunc_normal_(tensor, mean, std, a, b)


 def trunc_normal_tf_(tensor, mean=0., std=1., a=-2., b=2.):

@@ -90,8 +90,8 @@ def trunc_normal_tf_(tensor, mean=0., std=1., a=-2., b=2.):
         >>> w = torch.empty(3, 5)
         >>> nn.init.trunc_normal_(w)
     """
-    _no_grad_trunc_normal_(tensor, 0, 1.0, a, b)
     with torch.no_grad():
+        _trunc_normal_(tensor, 0, 1.0, a, b)
         tensor.mul_(std).add_(mean)
     return tensor

@@ -111,10 +111,12 @@ def variance_scaling_(tensor, scale=1.0, mode='fan_in', distribution='normal'):
         # constant is stddev of standard normal truncated to (-2, 2)
         trunc_normal_tf_(tensor, std=math.sqrt(variance) / .87962566103423978)
     elif distribution == "normal":
-        tensor.normal_(std=math.sqrt(variance))
+        with torch.no_grad():
+            tensor.normal_(std=math.sqrt(variance))
     elif distribution == "uniform":
         bound = math.sqrt(3 * variance)
-        tensor.uniform_(-bound, bound)
+        with torch.no_grad():
+            tensor.uniform_(-bound, bound)
     else:
         raise ValueError(f"invalid distribution {distribution}")
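The refactor above only moves `torch.no_grad()` from the inner helper out to its callers; the public behaviour of the init functions is unchanged. A quick sketch of the two entry points, assuming they remain exported from `timm.models.layers` as in current timm (a sanity check, not part of the diff):

```python
import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_, trunc_normal_tf_

linear = nn.Linear(384, 1000)

# PyTorch-style: mean/std applied before clamping to [a, b]
trunc_normal_(linear.weight, std=.02)

# TF-style: sample in [a, b] from a unit normal first, then scale/shift,
# so the effective bounds scale with std
trunc_normal_tf_(linear.weight, std=.02)

print(linear.weight.min().item(), linear.weight.max().item())
print(linear.weight.requires_grad)  # still True; no_grad only wrapped the in-place fill
```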

timm/models/maxxvit.py

@@ -39,7 +39,7 @@ Hacked together by / Copyright 2022, Ross Wightman
 import math
 from collections import OrderedDict
-from dataclasses import dataclass
+from dataclasses import dataclass, replace
 from functools import partial
 from typing import Callable, Optional, Union, Tuple, List
@@ -108,10 +108,13 @@ default_cfgs = {
     # Experimental configs
     'maxvit_pico_rw_256': _cfg(url='', input_size=(3, 256, 256), pool_size=(8, 8)),
     'maxvit_nano_rw_256': _cfg(
-        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights-maxx/maxvit_nano_rw_256_sw-3e790ce3.pth',
+        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights-maxx/maxvit_nano_rw_256_sw-fb127241.pth',
         input_size=(3, 256, 256), pool_size=(8, 8)),
     'maxvit_tiny_rw_224': _cfg(url=''),
     'maxvit_tiny_rw_256': _cfg(url='', input_size=(3, 256, 256), pool_size=(8, 8)),
+    'maxvit_rmlp_nano_rw_256': _cfg(
+        url='https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights-maxx/maxvit_rmlp_nano_rw_256_sw-c17bb0d6.pth',
+        input_size=(3, 256, 256), pool_size=(8, 8)),
     'maxvit_tiny_pm_256': _cfg(url='', input_size=(3, 256, 256), pool_size=(8, 8)),

     'maxxvit_nano_rw_256': _cfg(url='', input_size=(3, 256, 256), pool_size=(8, 8)),
@@ -129,21 +132,30 @@ class MaxxVitTransformerCfg:
     dim_head: int = 32
     expand_ratio: float = 4.0
     expand_first: bool = True
-    shortcut_bias: bool = True,
+    shortcut_bias: bool = True
     attn_bias: bool = True
     attn_drop: float = 0.
     proj_drop: float = 0.
     pool_type: str = 'avg2'
     rel_pos_type: str = 'bias'
     rel_pos_dim: int = 512  # for relative position types w/ MLP
-    window_size: Tuple[int, int] = (7, 7)
-    grid_size: Tuple[int, int] = (7, 7)
+    partition_stride: int = 32
+    window_size: Optional[Tuple[int, int]] = None
+    grid_size: Optional[Tuple[int, int]] = None
     init_values: Optional[float] = None
     act_layer: str = 'gelu'
     norm_layer: str = 'layernorm2d'
     norm_layer_cl: str = 'layernorm'
     norm_eps: float = 1e-6

+    def __post_init__(self):
+        if self.grid_size is not None:
+            self.grid_size = to_2tuple(self.grid_size)
+        if self.window_size is not None:
+            self.window_size = to_2tuple(self.window_size)
+            if self.grid_size is None:
+                self.grid_size = self.window_size
+

 @dataclass
 class MaxxVitConvCfg:
@@ -249,7 +261,7 @@ def _rw_max_cfg(
         conv_norm_layer='',
         transformer_norm_layer='layernorm2d',
         transformer_norm_layer_cl='layernorm',
-        window_size=7,
+        window_size=None,
         dim_head=32,
         rel_pos_type='bias',
         rel_pos_dim=512,

@@ -259,8 +271,6 @@ def _rw_max_cfg(
     # - mbconv expansion calculated from input instead of output chs
     # - mbconv shortcut and final 1x1 conv did not have a bias
     # - mbconv uses silu in timm, not gelu
-    # - avg pool with kernel_size=2 favoured downsampling (instead of maxpool for coat)
-    # - default to avg pool for mbconv downsample instead of 1x1 or dw conv
     # - expansion in attention block done via output proj, not input proj
     return dict(
         conv_cfg=MaxxVitConvCfg(

@@ -276,8 +286,7 @@ def _rw_max_cfg(
             expand_first=False,
             pool_type=pool_type,
             dim_head=dim_head,
-            window_size=to_2tuple(window_size),
-            grid_size=to_2tuple(window_size),
+            window_size=window_size,
             norm_layer=transformer_norm_layer,
             norm_layer_cl=transformer_norm_layer_cl,
             rel_pos_type=rel_pos_type,

@@ -293,7 +302,7 @@ def _next_cfg(
         conv_norm_layer_cl='layernorm',
         transformer_norm_layer='layernorm2d',
         transformer_norm_layer_cl='layernorm',
-        window_size=7,
+        window_size=None,
         rel_pos_type='bias',
         rel_pos_dim=512,
 ):

@@ -310,8 +319,7 @@ def _next_cfg(
         transformer_cfg=MaxxVitTransformerCfg(
             expand_first=False,
             pool_type=pool_type,
-            window_size=to_2tuple(window_size),
-            grid_size=to_2tuple(window_size),
+            window_size=window_size,
             norm_layer=transformer_norm_layer,
             norm_layer_cl=transformer_norm_layer_cl,
             rel_pos_type=rel_pos_type,
@@ -411,18 +419,19 @@ model_cfgs = dict(
             rel_pos_dim=384,  # was supposed to be 512, woops
         ),
     ),
-    coatnext_nano_rw_224=MaxxVitCfg(
+    coatnet_nano_cc_224=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),
         depths=(3, 4, 6, 3),
         stem_width=(32, 64),
-        **_next_cfg(),
+        block_type=('C', 'C', ('C', 'T'), ('C', 'T')),
+        **_rw_coat_cfg(),
     ),
-    coatnet_nano_cc_224=MaxxVitCfg(
+    coatnext_nano_rw_224=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),
         depths=(3, 4, 6, 3),
         stem_width=(32, 64),
-        block_type=('C', 'C', ('C', 'T'), ('C', 'T')),
-        **_rw_coat_cfg(),
+        weight_init='normal',
+        **_next_cfg(),
     ),

     # Trying to be like the CoAtNet paper configs
@@ -463,14 +472,14 @@ model_cfgs = dict(
         depths=(2, 2, 5, 2),
         block_type=('M',) * 4,
         stem_width=(24, 32),
-        **_rw_max_cfg(window_size=8),
+        **_rw_max_cfg(),
     ),
     maxvit_nano_rw_256=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),
         depths=(1, 2, 3, 1),
         block_type=('M',) * 4,
         stem_width=(32, 64),
-        **_rw_max_cfg(window_size=8),
+        **_rw_max_cfg(),
     ),
     maxvit_tiny_rw_224=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),

@@ -484,21 +493,29 @@ model_cfgs = dict(
         depths=(2, 2, 5, 2),
         block_type=('M',) * 4,
         stem_width=(32, 64),
-        **_rw_max_cfg(window_size=8),
+        **_rw_max_cfg(),
+    ),
+    maxvit_rmlp_nano_rw_256=MaxxVitCfg(
+        embed_dim=(64, 128, 256, 512),
+        depths=(1, 2, 3, 1),
+        block_type=('M',) * 4,
+        stem_width=(32, 64),
+        **_rw_max_cfg(rel_pos_type='mlp'),
     ),
     maxvit_tiny_pm_256=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),
         depths=(2, 2, 5, 2),
         block_type=('PM',) * 4,
         stem_width=(32, 64),
-        **_rw_max_cfg(window_size=8),
+        **_rw_max_cfg(),
     ),
     maxxvit_nano_rw_256=MaxxVitCfg(
         embed_dim=(64, 128, 256, 512),
         depths=(1, 2, 3, 1),
         block_type=('M',) * 4,
         stem_width=(32, 64),
-        **_next_cfg(window_size=8),
+        weight_init='normal',
+        **_next_cfg(),
     ),

     # Trying to be like the MaxViT paper configs
@@ -651,7 +668,11 @@ class LayerScale2d(nn.Module):

 class Downsample2d(nn.Module):
-    """ A downsample pooling module for Coat that handles 2d <-> 1d conversion
+    """ A downsample pooling module supporting several maxpool and avgpool modes
+    * 'max' - MaxPool2d w/ kernel_size 3, stride 2, padding 1
+    * 'max2' - MaxPool2d w/ kernel_size = stride = 2
+    * 'avg' - AvgPool2d w/ kernel_size 3, stride 2, padding 1
+    * 'avg2' - AvgPool2d w/ kernel_size = stride = 2
     """

     def __init__(

@@ -710,6 +731,11 @@ def _init_transformer(module, name, scheme=''):

 class TransformerBlock2d(nn.Module):
     """ Transformer block with 2D downsampling
     '2D' NCHW tensor layout
+
+    Some gains can be seen on GPU using a 1D / CL block, BUT w/ the need to switch back/forth to NCHW
+    for spatial pooling, the benefit is minimal so ended up using just this variant for CoAt configs.
+
+    This impl was faster on TPU w/ PT XLA than the 1D experiment.
     """

     def __init__(
@@ -1011,9 +1037,9 @@ def get_rel_pos_cls(cfg: MaxxVitTransformerCfg, window_size):
     return rel_pos_cls


-class PartitionAttention(nn.Module):
+class PartitionAttentionCl(nn.Module):
     """ Grid or Block partition + Attn + FFN.
-    NxC tensor layout.
+    NxC 'channels last' tensor layout.
     """

     def __init__(

@@ -1183,6 +1209,7 @@ def grid_reverse_nchw(windows, grid_size: List[int], img_size: List[int]):

 class PartitionAttention2d(nn.Module):
     """ Grid or Block partition + Attn + FFN
+
     '2D' NCHW tensor layout.
     """

@@ -1245,7 +1272,7 @@ class PartitionAttention2d(nn.Module):

 class MaxxVitBlock(nn.Module):
-    """
+    """ MaxVit conv, window partition + FFN , grid partition + FFN
     """

     def __init__(

@@ -1264,7 +1291,7 @@ class MaxxVitBlock(nn.Module):
         self.conv = conv_cls(dim, dim_out, stride=stride, cfg=conv_cfg, drop_path=drop_path)

         attn_kwargs = dict(dim=dim_out, cfg=transformer_cfg, drop_path=drop_path)
-        partition_layer = PartitionAttention2d if use_nchw_attn else PartitionAttention
+        partition_layer = PartitionAttention2d if use_nchw_attn else PartitionAttentionCl
         self.nchw_attn = use_nchw_attn
         self.attn_block = partition_layer(**attn_kwargs)
         self.attn_grid = partition_layer(partition_type='grid', **attn_kwargs)

@@ -1288,7 +1315,8 @@ class MaxxVitBlock(nn.Module):

 class ParallelMaxxVitBlock(nn.Module):
-    """
+    """ MaxVit block with parallel cat(window + grid), one FF
+    Experimental timm block.
     """

     def __init__(
@@ -1426,8 +1454,19 @@ class Stem(nn.Module):
         return x


+def cfg_window_size(cfg: MaxxVitTransformerCfg, img_size: Tuple[int, int]):
+    if cfg.window_size is not None:
+        assert cfg.grid_size
+        return cfg
+    partition_size = img_size[0] // cfg.partition_stride, img_size[1] // cfg.partition_stride
+    cfg = replace(cfg, window_size=partition_size, grid_size=partition_size)
+    return cfg
+
+
 class MaxxVit(nn.Module):
-    """
+    """ CoaTNet + MaxVit base model.
+
+    Highly configurable for different block compositions, tensor layouts, pooling types.
     """

     def __init__(
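This is the mechanism behind the Aug 29 note that MaxVit window size now scales with `img_size`: when a config leaves `window_size=None`, `cfg_window_size` derives the partition size from `partition_stride` (default 32). A worked example of that arithmetic, nothing more:

```python
# window/grid partition size = img_size // partition_stride, per dimension
partition_stride = 32
for img_size in (224, 256, 320, 384):
    print(img_size, img_size // partition_stride)
# 224 -> 7, 256 -> 8, 320 -> 10, 384 -> 12
```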
@@ -1442,6 +1481,7 @@ class MaxxVit(nn.Module):
     ):
         super().__init__()
         img_size = to_2tuple(img_size)
+        transformer_cfg = cfg_window_size(cfg.transformer_cfg, img_size)
         self.num_classes = num_classes
         self.global_pool = global_pool
         self.num_features = cfg.embed_dim[-1]

@@ -1475,7 +1515,7 @@ class MaxxVit(nn.Module):
                 depth=cfg.depths[i],
                 block_types=cfg.block_type[i],
                 conv_cfg=cfg.conv_cfg,
-                transformer_cfg=cfg.transformer_cfg,
+                transformer_cfg=transformer_cfg,
                 feat_size=feat_size,
                 drop_path=dpr[i],
             )]
@@ -1658,6 +1698,11 @@ def maxvit_tiny_rw_256(pretrained=False, **kwargs):
     return _create_maxxvit('maxvit_tiny_rw_256', pretrained=pretrained, **kwargs)


+@register_model
+def maxvit_rmlp_nano_rw_256(pretrained=False, **kwargs):
+    return _create_maxxvit('maxvit_rmlp_nano_rw_256', pretrained=pretrained, **kwargs)
+
+
 @register_model
 def maxvit_tiny_pm_256(pretrained=False, **kwargs):
     return _create_maxxvit('maxvit_tiny_pm_256', pretrained=pretrained, **kwargs)

timm/models/mvitv2.py

@@ -57,6 +57,8 @@ default_cfgs = dict(
     mvitv2_huge_in21k=_cfg(
         url='https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_H_in21k.pyth',
         num_classes=19168),
+
+    mvitv2_small_cls=_cfg(url=''),
 )
@@ -135,6 +137,11 @@ model_cfgs = dict(
         num_heads=2,
         expand_attn=False,
     ),
+
+    mvitv2_small_cls=MultiScaleVitCfg(
+        depths=(1, 2, 11, 2),
+        use_cls_token=True,
+    ),
 )
@@ -641,7 +648,7 @@ class MultiScaleBlock(nn.Module):
         if self.shortcut_pool_attn is None:
             return x
         if self.has_cls_token:
-            cls_tok, x = x[:, :, :1, :], x[:, :, 1:, :]
+            cls_tok, x = x[:, :1, :], x[:, 1:, :]
         else:
             cls_tok = None
         B, L, C = x.shape

@@ -650,7 +657,7 @@ class MultiScaleBlock(nn.Module):
         x = self.shortcut_pool_attn(x)
         x = x.reshape(B, C, -1).transpose(1, 2)
         if cls_tok is not None:
-            x = torch.cat((cls_tok, x), dim=1)
+            x = torch.cat((cls_tok, x), dim=1)
         return x

     def forward(self, x, feat_size: List[int]):
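The two one-line changes above fix class-token handling in the attention-shortcut pooling path: the tensor at that point is (B, L, C), so the token has to be split off and re-attached along dim 1, not dim 2. A shape-only sketch of the intended behaviour (standalone, not timm code):

```python
import torch

B, L, C = 2, 1 + 196, 96                 # class token + 14x14 patch tokens
x = torch.randn(B, L, C)

cls_tok, x = x[:, :1, :], x[:, 1:, :]    # split along the token dim (dim 1)
print(cls_tok.shape, x.shape)            # (2, 1, 96) (2, 196, 96)

# ... spatial pooling of the 196 patch tokens happens here ...

x = torch.cat((cls_tok, x), dim=1)       # re-attach along the token dim
print(x.shape)                           # (2, 197, 96)
```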
@@ -996,3 +1003,8 @@ def mvitv2_large(pretrained=False, **kwargs):
 # @register_model
 # def mvitv2_huge_in21k(pretrained=False, **kwargs):
 #     return _create_mvitv2('mvitv2_huge_in21k', pretrained=pretrained, **kwargs)
+
+
+@register_model
+def mvitv2_small_cls(pretrained=False, **kwargs):
+    return _create_mvitv2('mvitv2_small_cls', pretrained=pretrained, **kwargs)
