diff --git a/docs/models/adversarial-inception-v3.md b/docs/models/adversarial-inception-v3.md
index a8ba8616..1d66ea99 100644
--- a/docs/models/adversarial-inception-v3.md
+++ b/docs/models/adversarial-inception-v3.md
@@ -1,9 +1,11 @@
-# Summary
+# Adversarial Inception v3

**Inception v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), Factorized 7 x 7 convolutions, and the use of an [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) to propagate label information lower down the network (along with the use of batch normalization for layers in the sidehead). The key building block is an [Inception Module](https://paperswithcode.com/method/inception-v3-module).

This particular model was trained for the study of adversarial examples (adversarial training).

+The weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/advprop.md b/docs/models/advprop.md
index e3597b18..8abac950 100644
--- a/docs/models/advprop.md
+++ b/docs/models/advprop.md
@@ -1,4 +1,4 @@
-# Summary
+# AdvProp

**AdvProp** is an adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to the method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions from normal examples.

diff --git a/docs/models/big-transfer.md b/docs/models/big-transfer.md
index 2903c926..2c3e2b65 100644
--- a/docs/models/big-transfer.md
+++ b/docs/models/big-transfer.md
@@ -1,4 +1,4 @@
-# Summary
+# Big Transfer (BiT)

**Big Transfer (BiT)** is a type of pretraining recipe that pre-trains on a large supervised source dataset, and fine-tunes the weights on the target task. Models are trained on the JFT-300M dataset. The models in this collection are finetuned on ImageNet.

diff --git a/docs/models/csp-darknet.md b/docs/models/csp-darknet.md
index 0c919149..009c8556 100644
--- a/docs/models/csp-darknet.md
+++ b/docs/models/csp-darknet.md
@@ -1,4 +1,4 @@
-# Summary
+# CSP DarkNet

**CSPDarknet53** is a convolutional neural network and backbone for object detection that uses [DarkNet-53](https://paperswithcode.com/method/darknet-53). It employs a CSPNet strategy to partition the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.

diff --git a/docs/models/csp-resnet.md b/docs/models/csp-resnet.md
index bb31200f..c5eb78ee 100644
--- a/docs/models/csp-resnet.md
+++ b/docs/models/csp-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# CSP ResNet

**CSPResNet** is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to [ResNet](https://paperswithcode.com/method/resnet). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.
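Every page in this batch ends its summary with the same "How do I use this model on an image?" section and a `timm` loading snippet. For reference, a minimal sketch of that shared pattern, assuming the checkpoint name `cspresnet50` (each page substitutes its own model name, e.g. `cspdarknet53`):

```python
import timm

# Load ImageNet-pretrained weights and switch to inference mode.
model = timm.create_model('cspresnet50', pretrained=True)
model.eval()
```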
diff --git a/docs/models/csp-resnext.md b/docs/models/csp-resnext.md
index 5ece7f02..c22efc53 100644
--- a/docs/models/csp-resnext.md
+++ b/docs/models/csp-resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# CSP ResNeXt

**CSPResNeXt** is a convolutional neural network where we apply the Cross Stage Partial Network (CSPNet) approach to [ResNeXt](https://paperswithcode.com/method/resnext). The CSPNet partitions the feature map of the base layer into two parts and then merges them through a cross-stage hierarchy. The use of a split and merge strategy allows for more gradient flow through the network.

diff --git a/docs/models/densenet.md b/docs/models/densenet.md
index 30596e35..bdd39fd2 100644
--- a/docs/models/densenet.md
+++ b/docs/models/densenet.md
@@ -1,4 +1,4 @@
-# Summary
+# DenseNet

**DenseNet** is a type of convolutional neural network that utilises dense connections between layers, through [Dense Blocks](http://www.paperswithcode.com/method/dense-block), where we connect *all layers* (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers.

diff --git a/docs/models/dla.md b/docs/models/dla.md
index 0f02f858..26a02515 100644
--- a/docs/models/dla.md
+++ b/docs/models/dla.md
@@ -1,4 +1,4 @@
-# Summary
+# Deep Layer Aggregation

Extending “shallow” skip connections, **Deep Layer Aggregation (DLA)** incorporates more depth and sharing. The authors introduce two structures for deep layer aggregation (DLA): iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA). These structures are expressed through an architectural framework, independent of the choice of backbone, for compatibility with current and future networks.

diff --git a/docs/models/dpn.md b/docs/models/dpn.md
index 09902807..809d0c2a 100644
--- a/docs/models/dpn.md
+++ b/docs/models/dpn.md
@@ -1,4 +1,4 @@
-# Summary
+# Dual Path Network (DPN)

A **Dual Path Network (DPN)** is a convolutional neural network which presents a new topology of connection paths internally. The intuition is that [ResNets](https://paperswithcode.com/method/resnet) enable feature re-usage while DenseNet enables new feature exploration, and both are important for learning good representations. To enjoy the benefits from both path topologies, Dual Path Networks share common features while maintaining the flexibility to explore new features through dual path architectures.

diff --git a/docs/models/ecaresnet.md b/docs/models/ecaresnet.md
index 0e28c32a..88b8c466 100644
--- a/docs/models/ecaresnet.md
+++ b/docs/models/ecaresnet.md
@@ -1,4 +1,4 @@
-# Summary
+# ECA ResNet

An **ECA ResNet** is a variant on a [ResNet](https://paperswithcode.com/method/resnet) that utilises an [Efficient Channel Attention module](https://paperswithcode.com/method/efficient-channel-attention). Efficient Channel Attention is an architectural unit based on [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) that reduces model complexity without dimensionality reduction.
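Before picking a variant, it can help to enumerate what a family page actually covers. A sketch using `timm.list_models`; the wildcard pattern and the `ecaresnet50d` name are assumptions to adapt per family:

```python
import timm

# List the pretrained checkpoints matching a family, then load one.
print(timm.list_models('ecaresnet*', pretrained=True))
model = timm.create_model('ecaresnet50d', pretrained=True).eval()
```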
diff --git a/docs/models/efficientnet-pruned.md b/docs/models/efficientnet-pruned.md
index 49a8f8bc..94e76447 100644
--- a/docs/models/efficientnet-pruned.md
+++ b/docs/models/efficientnet-pruned.md
@@ -1,4 +1,4 @@
-# Summary
+# EfficientNet (Knapsack Pruned)

**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha ^ N$, width by $\beta ^ N$, and image size by $\gamma ^ N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\phi$ to uniformly scale network width, depth, and resolution in a principled way.

@@ -89,14 +89,13 @@ You can follow the [timm recipe scripts](https://rwightman.github.io/pytorch-ima
```
```
-@misc{rw2019timm,
- author = {Ross Wightman},
- title = {PyTorch Image Models},
- year = {2019},
- publisher = {GitHub},
- journal = {GitHub repository},
- doi = {10.5281/zenodo.4414861},
- howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
+@misc{aflalo2020knapsack,
+ title={Knapsack Pruning with Inner Distillation},
+ author={Yonathan Aflalo and Asaf Noy and Ming Lin and Itamar Friedman and Lihi Zelnik},
+ year={2020},
+ eprint={2002.08258},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
}
```

diff --git a/docs/models/efficientnet.md b/docs/models/efficientnet.md
index 15c084d0..50cc1db6 100644
--- a/docs/models/efficientnet.md
+++ b/docs/models/efficientnet.md
@@ -1,4 +1,4 @@
-# Summary
+# EfficientNet

**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha ^ N$, width by $\beta ^ N$, and image size by $\gamma ^ N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\phi$ to uniformly scale network width, depth, and resolution in a principled way.

diff --git a/docs/models/ensemble-adversarial.md b/docs/models/ensemble-adversarial.md
index ce59cd0d..f43d76f1 100644
--- a/docs/models/ensemble-adversarial.md
+++ b/docs/models/ensemble-adversarial.md
@@ -1,9 +1,11 @@
-# Summary
+# Ensemble Adversarial Inception ResNet v2

**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https://paperswithcode.com/method/residual-connection) (replacing the filter concatenation stage of the Inception architecture).

This particular model was trained for the study of adversarial examples (adversarial training).

+The weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).
+
## How do I use this model on an image?
To load a pretrained model:

diff --git a/docs/models/ese-vovnet.md b/docs/models/ese-vovnet.md
index 7680c194..51313445 100644
--- a/docs/models/ese-vovnet.md
+++ b/docs/models/ese-vovnet.md
@@ -1,4 +1,4 @@
-# Summary
+# ESE VoVNet

**VoVNet** is a convolutional neural network that seeks to make [DenseNet](https://paperswithcode.com/method/densenet) more efficient by concatenating all features only once in the last feature map, which keeps the input size constant and enables enlarging the output channels.

diff --git a/docs/models/fbnet.md b/docs/models/fbnet.md
index f1860ad0..01d8bebd 100644
--- a/docs/models/fbnet.md
+++ b/docs/models/fbnet.md
@@ -1,4 +1,4 @@
-# Summary
+# FBNet

**FBNet** is a type of convolutional neural architecture discovered through [DNAS](https://paperswithcode.com/method/dnas) neural architecture search. It utilises a basic type of image model block inspired by [MobileNetv2](https://paperswithcode.com/method/mobilenetv2) that utilises depthwise convolutions and an inverted residual structure (see components).

diff --git a/docs/models/gloun-inception-v3.md b/docs/models/gloun-inception-v3.md
index 81c1f845..f7365ed3 100644
--- a/docs/models/gloun-inception-v3.md
+++ b/docs/models/gloun-inception-v3.md
@@ -1,8 +1,8 @@
-# Summary
+# Gluon Inception v3

**Inception v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), Factorized 7 x 7 convolutions, and the use of an [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) to propagate label information lower down the network (along with the use of batch normalization for layers in the sidehead). The key building block is an [Inception Module](https://paperswithcode.com/method/inception-v3-module).

-The weights from this model were ported from Gluon.
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/gloun-resnet.md b/docs/models/gloun-resnet.md
index 926c4dc2..c7186295 100644
--- a/docs/models/gloun-resnet.md
+++ b/docs/models/gloun-resnet.md
@@ -1,8 +1,8 @@
-# Summary
+# Gluon ResNet

**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https://paperswithcode.com/method/residual-block) on top of each other to form networks: e.g. a ResNet-50 has fifty layers using these blocks.

-The weights from this model were ported from Gluon.
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/gloun-resnext.md b/docs/models/gloun-resnext.md
index a60ccd50..499ab273 100644
--- a/docs/models/gloun-resnext.md
+++ b/docs/models/gloun-resnext.md
@@ -1,8 +1,8 @@
-# Summary
+# Gluon ResNeXt

A **ResNeXt** repeats a [building block](https://paperswithcode.com/method/resnext-block) that aggregates a set of transformations with the same topology.
Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width.

-The weights from this model were ported from Gluon.
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/gloun-senet.md b/docs/models/gloun-senet.md
index fb022f00..ac8f4ca8 100644
--- a/docs/models/gloun-senet.md
+++ b/docs/models/gloun-senet.md
@@ -2,7 +2,7 @@

A **SENet** is a convolutional neural network architecture that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

-The weights from this model were ported from Gluon.
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/gloun-seresnext.md b/docs/models/gloun-seresnext.md
index 4b0a07a0..72dc530d 100644
--- a/docs/models/gloun-seresnext.md
+++ b/docs/models/gloun-seresnext.md
@@ -2,7 +2,7 @@

**SE ResNeXt** is a variant of a [ResNeXt](https://www.paperswithcode.com/method/resnext) that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

-The weights from this model were ported from Gluon.
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/gloun-xception.md b/docs/models/gloun-xception.md
index 44687a95..2609552a 100644
--- a/docs/models/gloun-xception.md
+++ b/docs/models/gloun-xception.md
@@ -1,6 +1,8 @@
-# Summary
+# Gluon Xception

-**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution) layers. The weights from this model were ported from Gluon.
+**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution](https://paperswithcode.com/method/depthwise-separable-convolution) layers.
+
+The weights from this model were ported from [Gluon](https://cv.gluon.ai/model_zoo/classification.html).

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/hrnet.md b/docs/models/hrnet.md
index 3047b335..4e577ca7 100644
--- a/docs/models/hrnet.md
+++ b/docs/models/hrnet.md
@@ -1,4 +1,4 @@
-# Summary
+# HRNet

**HRNet**, or **High-Resolution Net**, is a general purpose convolutional neural network for tasks like semantic segmentation, object detection and image classification. It is able to maintain high resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several ($4$ in the paper) stages and the $n$th stage contains $n$ streams corresponding to $n$ resolutions. The authors conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.
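Because HRNet keeps parallel multi-resolution streams, it is frequently used as a backbone rather than a classifier. A sketch of pulling the intermediate feature maps with timm's `features_only` flag; the `hrnet_w18` name is an assumption:

```python
import timm
import torch

# Build HRNet as a feature backbone instead of a classifier.
model = timm.create_model('hrnet_w18', pretrained=True, features_only=True)
model.eval()

with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))
for f in features:
    print(f.shape)  # one tensor per stage, at decreasing resolution
```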
diff --git a/docs/models/ig-resnext.md b/docs/models/ig-resnext.md
index 24749233..a94d53a1 100644
--- a/docs/models/ig-resnext.md
+++ b/docs/models/ig-resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# Instagram ResNeXt WSL

A **ResNeXt** repeats a [building block](https://paperswithcode.com/method/resnext-block) that aggregates a set of transformations with the same topology. Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width.

diff --git a/docs/models/inception-resnet-v2.md b/docs/models/inception-resnet-v2.md
index ff741fe2..b496e31a 100644
--- a/docs/models/inception-resnet-v2.md
+++ b/docs/models/inception-resnet-v2.md
@@ -1,4 +1,4 @@
-# Summary
+# Inception ResNet v2

**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https://paperswithcode.com/method/residual-connection) (replacing the filter concatenation stage of the Inception architecture).

diff --git a/docs/models/inception-v3.md b/docs/models/inception-v3.md
index 1666f1da..b5e96c58 100644
--- a/docs/models/inception-v3.md
+++ b/docs/models/inception-v3.md
@@ -1,4 +1,4 @@
-# Summary
+# Inception v3

**Inception v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), Factorized 7 x 7 convolutions, and the use of an [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) to propagate label information lower down the network (along with the use of batch normalization for layers in the sidehead). The key building block is an [Inception Module](https://paperswithcode.com/method/inception-v3-module).

diff --git a/docs/models/inception-v4.md b/docs/models/inception-v4.md
index 5717a037..ca2950c2 100644
--- a/docs/models/inception-v4.md
+++ b/docs/models/inception-v4.md
@@ -1,4 +1,4 @@
-# Summary
+# Inception v4

**Inception-v4** is a convolutional neural network architecture that builds on previous iterations of the Inception family by simplifying the architecture and using more inception modules than [Inception-v3](https://paperswithcode.com/method/inception-v3).

## How do I use this model on an image?

diff --git a/docs/models/legacy-se-resnet.md b/docs/models/legacy-se-resnet.md
index 9c84a4a4..44ba292a 100644
--- a/docs/models/legacy-se-resnet.md
+++ b/docs/models/legacy-se-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# (Legacy) SE ResNet

**SE ResNet** is a variant of a [ResNet](https://www.paperswithcode.com/method/resnet) that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

diff --git a/docs/models/legacy-se-resnext.md b/docs/models/legacy-se-resnext.md
index 72bea1fc..3f4c3cf3 100644
--- a/docs/models/legacy-se-resnext.md
+++ b/docs/models/legacy-se-resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# (Legacy) SE ResNeXt

**SE ResNeXt** is a variant of a [ResNeXt](https://www.paperswithcode.com/method/resnext) that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.
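These legacy SE pages are as much about transfer learning as ImageNet inference, so a sketch of keeping the pretrained backbone while replacing the classifier head may be useful; the `legacy_seresnext50_32x4d` name and the 10-class setup are assumptions:

```python
import timm

# Reuse pretrained weights but attach a fresh 10-class head for fine-tuning.
model = timm.create_model('legacy_seresnext50_32x4d', pretrained=True, num_classes=10)
```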
diff --git a/docs/models/legacy-senet.md b/docs/models/legacy-senet.md
index f2f9eb1a..a4c345fc 100644
--- a/docs/models/legacy-senet.md
+++ b/docs/models/legacy-senet.md
@@ -1,4 +1,4 @@
-# Summary
+# (Legacy) SENet

A **SENet** is a convolutional neural network architecture that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

diff --git a/docs/models/mixnet.md b/docs/models/mixnet.md
index 69dbf2bb..ff7abb2e 100644
--- a/docs/models/mixnet.md
+++ b/docs/models/mixnet.md
@@ -1,4 +1,4 @@
-# Summary
+# MixNet

**MixNet** is a type of convolutional neural network discovered via AutoML that utilises [MixConvs](https://paperswithcode.com/method/mixconv) instead of regular [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution).

diff --git a/docs/models/mnasnet.md b/docs/models/mnasnet.md
index faa3c5ce..91f9a204 100644
--- a/docs/models/mnasnet.md
+++ b/docs/models/mnasnet.md
@@ -1,4 +1,4 @@
-# Summary
+# MnasNet

**MnasNet** is a type of convolutional neural network optimized for mobile devices that is discovered through mobile neural architecture search, which explicitly incorporates model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. The main building block is an [inverted residual block](https://paperswithcode.com/method/inverted-residual-block) (from [MobileNetV2](https://paperswithcode.com/method/mobilenetv2)).

diff --git a/docs/models/mobilenet-v2.md b/docs/models/mobilenet-v2.md
index 69a59efa..d9589bde 100644
--- a/docs/models/mobilenet-v2.md
+++ b/docs/models/mobilenet-v2.md
@@ -1,4 +1,4 @@
-# Summary
+# MobileNet v2

**MobileNetV2** is a convolutional neural network architecture that seeks to perform well on mobile devices. It is based on an [inverted residual structure](https://paperswithcode.com/method/inverted-residual-block) where the residual connections are between the bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. As a whole, the architecture of MobileNetV2 contains an initial fully convolutional layer with 32 filters, followed by 19 residual bottleneck layers.

diff --git a/docs/models/mobilenet-v3.md b/docs/models/mobilenet-v3.md
index 776d8ef7..9b2a63ea 100644
--- a/docs/models/mobilenet-v3.md
+++ b/docs/models/mobilenet-v3.md
@@ -1,4 +1,4 @@
-# Summary
+# MobileNet v3

**MobileNetV3** is a convolutional neural network that is designed for mobile phone CPUs. The network design includes the use of a [hard swish activation](https://paperswithcode.com/method/hard-swish) and [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) modules in the [MBConv blocks](https://paperswithcode.com/method/inverted-residual-block).

diff --git a/docs/models/nasnet.md b/docs/models/nasnet.md
index 80b3f5ef..9ead3e72 100644
--- a/docs/models/nasnet.md
+++ b/docs/models/nasnet.md
@@ -1,4 +1,4 @@
-# Summary
+# NASNet

**NASNet** is a type of convolutional neural network discovered through neural architecture search. The building blocks consist of normal and reduction cells.
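The pages above (MnasNet, MobileNet v2/v3, NASNet) all funnel into the same single-image inference recipe. A sketch under assumed names (`mobilenetv2_100` as the checkpoint, `dog.jpg` as a placeholder image path):

```python
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

model = timm.create_model('mobilenetv2_100', pretrained=True).eval()

# Recreate the preprocessing the checkpoint was trained with.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = Image.open('dog.jpg').convert('RGB')  # placeholder path
x = transform(img).unsqueeze(0)             # add a batch dimension

with torch.no_grad():
    probs = model(x).softmax(dim=-1)
print(probs.topk(5))
```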
diff --git a/docs/models/noisy-student.md b/docs/models/noisy-student.md
index d4078679..f570b932 100644
--- a/docs/models/noisy-student.md
+++ b/docs/models/noisy-student.md
@@ -1,4 +1,4 @@
-# Summary
+# Noisy Student (EfficientNet)

**Noisy Student Training** is a semi-supervised learning approach. It extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. It has three main steps:

diff --git a/docs/models/pnasnet.md b/docs/models/pnasnet.md
index 2c752474..04c1bfa3 100644
--- a/docs/models/pnasnet.md
+++ b/docs/models/pnasnet.md
@@ -1,4 +1,4 @@
-# Summary
+# PNASNet

**Progressive Neural Architecture Search**, or **PNAS**, is a method for learning the structure of convolutional neural networks (CNNs). It uses a sequential model-based optimization (SMBO) strategy, where we search the space of cell structures, starting with simple (shallow) models and progressing to complex ones, pruning out unpromising structures as we go.

diff --git a/docs/models/regnetx.md b/docs/models/regnetx.md
index 326319c4..0612ebd8 100644
--- a/docs/models/regnetx.md
+++ b/docs/models/regnetx.md
@@ -1,4 +1,4 @@
-# Summary
+# RegNetX

**RegNetX** is a convolutional network design space with simple, regular models with parameters: depth $d$, initial width $w\_{0} > 0$, and slope $w\_{a} > 0$; these generate a different block width $u\_{j}$ for each block $j < d$. The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure):

diff --git a/docs/models/regnety.md b/docs/models/regnety.md
index 6f2d73eb..2b9100c0 100644
--- a/docs/models/regnety.md
+++ b/docs/models/regnety.md
@@ -1,4 +1,4 @@
-# Summary
+# RegNetY

**RegNetY** is a convolutional network design space with simple, regular models with parameters: depth $d$, initial width $w\_{0} > 0$, and slope $w\_{a} > 0$; these generate a different block width $u\_{j}$ for each block $j < d$. The key restriction for the RegNet types of model is that there is a linear parameterisation of block widths (the design space only contains models with this linear structure):

diff --git a/docs/models/res2net.md b/docs/models/res2net.md
index 6801f685..a300457e 100644
--- a/docs/models/res2net.md
+++ b/docs/models/res2net.md
@@ -1,4 +1,4 @@
-# Summary
+# Res2Net

**Res2Net** is an image model that employs a variation on bottleneck residual blocks, [Res2Net Blocks](https://paperswithcode.com/method/res2net-block). The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single residual block. This represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.

diff --git a/docs/models/res2next.md b/docs/models/res2next.md
index 5d6a7a23..dacb685f 100644
--- a/docs/models/res2next.md
+++ b/docs/models/res2next.md
@@ -1,6 +1,6 @@
-# Summary
+# Res2NeXt

-**Res2Net** is an image model that employs a variation on [ResNeXt](https://paperswithcode.com/method/resnext) bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single residual block. This represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
+**Res2NeXt** is an image model that employs a variation on [ResNeXt](https://paperswithcode.com/method/resnext) bottleneck residual blocks. The motivation is to be able to represent features at multiple scales. This is achieved through a novel building block for CNNs that constructs hierarchical residual-like connections within one single residual block. This represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/resnest.md b/docs/models/resnest.md
index f47ea72d..b22d4573 100644
--- a/docs/models/resnest.md
+++ b/docs/models/resnest.md
@@ -1,6 +1,6 @@
-# Summary
+# ResNeSt

-A **ResNest** is a variant on a [ResNet](https://paperswithcode.com/method/resnet), which instead stacks [Split-Attention blocks](https://paperswithcode.com/method/split-attention). The cardinal group representations are then concatenated along the channel dimension: $V = \text{Concat}${$V^{1},V^{2},\cdots{V}^{K}$}. As in standard residual blocks, the final output $Y$ of otheur Split-Attention block is produced using a shortcut connection: $Y=V+X$, if the input and output feature-map share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y=V+\mathcal{T}(X)$. For example, $\mathcal{T}$ can be strided convolution or combined convolution-with-pooling.
+A **ResNeSt** is a variant on a [ResNet](https://paperswithcode.com/method/resnet), which instead stacks [Split-Attention blocks](https://paperswithcode.com/method/split-attention). The cardinal group representations are then concatenated along the channel dimension: $V = \text{Concat}\{V^{1}, V^{2}, \cdots, V^{K}\}$. As in standard residual blocks, the final output $Y$ of our Split-Attention block is produced using a shortcut connection: $Y=V+X$, if the input and output feature-maps share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y=V+\mathcal{T}(X)$. For example, $\mathcal{T}$ can be strided convolution or combined convolution-with-pooling.

## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/resnet-d.md b/docs/models/resnet-d.md
index ab73b3ce..4f48fe96 100644
--- a/docs/models/resnet-d.md
+++ b/docs/models/resnet-d.md
@@ -1,4 +1,4 @@
-# Summary
+# ResNet-D

**ResNet-D** is a modification on the [ResNet](https://paperswithcode.com/method/resnet) architecture that utilises an [average pooling](https://paperswithcode.com/method/average-pooling) tweak for downsampling. The motivation is that in the unmodified ResNet, the [1×1 convolution](https://paperswithcode.com/method/1x1-convolution) for the downsampling block ignores 3/4 of input feature maps, so this is modified so that no information is ignored.

diff --git a/docs/models/resnet.md b/docs/models/resnet.md
index 13768bc5..dd6e361a 100644
--- a/docs/models/resnet.md
+++ b/docs/models/resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# ResNet

**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions.
Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https://paperswithcode.com/method/residual-block) on top of each other to form networks: e.g. a ResNet-50 has fifty layers using these blocks.

diff --git a/docs/models/resnext.md b/docs/models/resnext.md
index a55dc34e..ce1a7725 100644
--- a/docs/models/resnext.md
+++ b/docs/models/resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# ResNeXt

A **ResNeXt** repeats a [building block](https://paperswithcode.com/method/resnext-block) that aggregates a set of transformations with the same topology. Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width.

diff --git a/docs/models/rexnet.md b/docs/models/rexnet.md
index 14fb2476..f7ddd8b5 100644
--- a/docs/models/rexnet.md
+++ b/docs/models/rexnet.md
@@ -1,4 +1,4 @@
-# Summary
+# ReXNet

**Rank Expansion Networks** (ReXNets) follow a set of new design principles for designing bottlenecks in image classification models. The authors refine each layer by 1) expanding the input channel size of the convolution layer and 2) replacing the [ReLU6s](https://www.paperswithcode.com/method/relu6).

diff --git a/docs/models/se-resnet.md b/docs/models/se-resnet.md
index 9e01760b..4b121433 100644
--- a/docs/models/se-resnet.md
+++ b/docs/models/se-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# SE ResNet

**SE ResNet** is a variant of a [ResNet](https://www.paperswithcode.com/method/resnet) that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

diff --git a/docs/models/selecsls.md b/docs/models/selecsls.md
index 62514923..ba1325cc 100644
--- a/docs/models/selecsls.md
+++ b/docs/models/selecsls.md
@@ -1,4 +1,4 @@
-# Summary
+# SelecSLS

**SelecSLS** uses novel selective long and short range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy.

diff --git a/docs/models/seresnext.md b/docs/models/seresnext.md
index 0ee24a31..6406291b 100644
--- a/docs/models/seresnext.md
+++ b/docs/models/seresnext.md
@@ -1,4 +1,4 @@
-# Summary
+# SE ResNeXt

**SE ResNeXt** is a variant of a [ResNeXt](https://www.paperswithcode.com/method/resneXt) that employs [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block) to enable the network to perform dynamic channel-wise feature recalibration.

diff --git a/docs/models/skresnet.md b/docs/models/skresnet.md
index dae6115b..077047b6 100644
--- a/docs/models/skresnet.md
+++ b/docs/models/skresnet.md
@@ -1,4 +1,4 @@
-# Summary
+# SK ResNet

**SK ResNet** is a variant of a [ResNet](https://www.paperswithcode.com/method/resnet) that employs a [Selective Kernel](https://paperswithcode.com/method/selective-kernel) unit. In general, all the large kernel convolutions in the original bottleneck blocks in ResNet are replaced by the proposed [SK convolutions](https://paperswithcode.com/method/selective-kernel-convolution), enabling the network to choose appropriate receptive field sizes in an adaptive manner.
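Several of the pages above (SE ResNet, SE ResNeXt, SENet) hinge on the squeeze-and-excitation mechanic, so an illustrative sketch of the block may help. This is a plain-PyTorch rendering of the idea, not timm's exact implementation:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: pool globally, gate channels, rescale input."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3), keepdim=True)   # squeeze: B x C x 1 x 1
        w = torch.relu(self.fc1(w))            # bottleneck "excitation" MLP
        w = torch.sigmoid(self.fc2(w))         # per-channel gates in (0, 1)
        return x * w                           # channel-wise recalibration

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```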
diff --git a/docs/models/skresnext.md b/docs/models/skresnext.md
index 8f887a53..e1c0c51b 100644
--- a/docs/models/skresnext.md
+++ b/docs/models/skresnext.md
@@ -1,4 +1,4 @@
-# Summary
+# SK ResNeXt

**SK ResNeXt** is a variant of a [ResNeXt](https://www.paperswithcode.com/method/resnext) that employs a [Selective Kernel](https://paperswithcode.com/method/selective-kernel) unit. In general, all the large kernel convolutions in the original bottleneck blocks in ResNeXt are replaced by the proposed [SK convolutions](https://paperswithcode.com/method/selective-kernel-convolution), enabling the network to choose appropriate receptive field sizes in an adaptive manner.

diff --git a/docs/models/spnasnet.md b/docs/models/spnasnet.md
index a8607ff9..7bc181ed 100644
--- a/docs/models/spnasnet.md
+++ b/docs/models/spnasnet.md
@@ -1,4 +1,4 @@
-# Summary
+# SPNASNet

**Single-Path NAS** is a novel differentiable NAS method for designing hardware-efficient ConvNets in less than 4 hours.

diff --git a/docs/models/ssl-resnet.md b/docs/models/ssl-resnet.md
index 4b9e6795..3e0ae9b6 100644
--- a/docs/models/ssl-resnet.md
+++ b/docs/models/ssl-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# SSL ResNet

**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https://paperswithcode.com/method/residual-block) on top of each other to form networks: e.g. a ResNet-50 has fifty layers using these blocks.

diff --git a/docs/models/ssl-resnext.md b/docs/models/ssl-resnext.md
index 8d1fc115..a6768f63 100644
--- a/docs/models/ssl-resnext.md
+++ b/docs/models/ssl-resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# SSL ResNeXt

A **ResNeXt** repeats a [building block](https://paperswithcode.com/method/resnext-block) that aggregates a set of transformations with the same topology. Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width.

diff --git a/docs/models/swsl-resnet.md b/docs/models/swsl-resnet.md
index 5b63f7ae..239613a9 100644
--- a/docs/models/swsl-resnet.md
+++ b/docs/models/swsl-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# SWSL ResNet

**Residual Networks**, or **ResNets**, learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping each few stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. They stack [residual blocks](https://paperswithcode.com/method/residual-block) on top of each other to form networks: e.g. a ResNet-50 has fifty layers using these blocks.

diff --git a/docs/models/swsl-resnext.md b/docs/models/swsl-resnext.md
index 84b76dfd..c9933d44 100644
--- a/docs/models/swsl-resnext.md
+++ b/docs/models/swsl-resnext.md
@@ -1,4 +1,4 @@
-# Summary
+# SWSL ResNeXt

A **ResNeXt** repeats a [building block](https://paperswithcode.com/method/resnext-block) that aggregates a set of transformations with the same topology. Compared to a [ResNet](https://paperswithcode.com/method/resnet), it exposes a new dimension, *cardinality* (the size of the set of transformations) $C$, as an essential factor in addition to the dimensions of depth and width.
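The SSL/SWSL pages differ from the plain ResNet/ResNeXt pages only in which checkpoint they pull. A sketch, assuming the `timm` names `ssl_resnext50_32x4d` and `swsl_resnext50_32x4d` for the semi- and semi-weakly supervised weights:

```python
import timm

# Same architecture, two different pretraining regimes.
ssl_model = timm.create_model('ssl_resnext50_32x4d', pretrained=True).eval()
swsl_model = timm.create_model('swsl_resnext50_32x4d', pretrained=True).eval()
```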
diff --git a/docs/models/tf-efficientnet-condconv.md b/docs/models/tf-efficientnet-condconv.md
index 8ea96c7c..579b0f6d 100644
--- a/docs/models/tf-efficientnet-condconv.md
+++ b/docs/models/tf-efficientnet-condconv.md
@@ -1,4 +1,4 @@
-# Summary
+# (Tensorflow) EfficientNet CondConv

**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha ^ N$, width by $\beta ^ N$, and image size by $\gamma ^ N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\phi$ to uniformly scale network width, depth, and resolution in a principled way.

@@ -8,6 +8,8 @@ The base EfficientNet-B0 network is based on the inverted bottleneck residual bl

This collection of models amends EfficientNet by adding [CondConv](https://paperswithcode.com/method/condconv) convolutions.

+The weights from this model were ported from [Tensorflow/TPU](https://github.com/tensorflow/tpu).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/tf-efficientnet-lite.md b/docs/models/tf-efficientnet-lite.md
index 6cc8bd07..6c87593e 100644
--- a/docs/models/tf-efficientnet-lite.md
+++ b/docs/models/tf-efficientnet-lite.md
@@ -1,4 +1,4 @@
-# Summary
+# (Tensorflow) EfficientNet Lite

**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*. Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha ^ N$, width by $\beta ^ N$, and image size by $\gamma ^ N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\phi$ to uniformly scale network width, depth, and resolution in a principled way.

@@ -8,6 +8,8 @@ The base EfficientNet-B0 network is based on the inverted bottleneck residual bl

EfficientNet-Lite makes EfficientNet more suitable for mobile devices by introducing [ReLU6](https://paperswithcode.com/method/relu6) activation functions and removing [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation).

+The weights from this model were ported from [Tensorflow/TPU](https://github.com/tensorflow/tpu).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/tf-efficientnet.md b/docs/models/tf-efficientnet.md
index 72521300..832914b7 100644
--- a/docs/models/tf-efficientnet.md
+++ b/docs/models/tf-efficientnet.md
@@ -1,4 +1,4 @@
-# Summary
+# (Tensorflow) EfficientNet

**EfficientNet** is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth/width/resolution using a *compound coefficient*.
Unlike conventional practice that arbitrarily scales these factors, the EfficientNet scaling method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. For example, if we want to use $2^N$ times more computational resources, then we can simply increase the network depth by $\alpha ^ N$, width by $\beta ^ N$, and image size by $\gamma ^ N$, where $\alpha, \beta, \gamma$ are constant coefficients determined by a small grid search on the original small model. EfficientNet uses a compound coefficient $\phi$ to uniformly scale network width, depth, and resolution in a principled way.

@@ -6,6 +6,8 @@ The compound scaling method is justified by the intuition that if the input imag

The base EfficientNet-B0 network is based on the inverted bottleneck residual blocks of [MobileNetV2](https://paperswithcode.com/method/mobilenetv2), in addition to [squeeze-and-excitation blocks](https://paperswithcode.com/method/squeeze-and-excitation-block).

+The weights from this model were ported from [Tensorflow/TPU](https://github.com/tensorflow/tpu).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/tf-inception-v3.md b/docs/models/tf-inception-v3.md
index fc238573..c0acdda0 100644
--- a/docs/models/tf-inception-v3.md
+++ b/docs/models/tf-inception-v3.md
@@ -1,7 +1,9 @@
-# Summary
+# (Tensorflow) Inception v3

**Inception v3** is a convolutional neural network architecture from the Inception family that makes several improvements including using [Label Smoothing](https://paperswithcode.com/method/label-smoothing), Factorized 7 x 7 convolutions, and the use of an [auxiliary classifier](https://paperswithcode.com/method/auxiliary-classifier) to propagate label information lower down the network (along with the use of batch normalization for layers in the sidehead). The key building block is an [Inception Module](https://paperswithcode.com/method/inception-v3-module).

+The weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/tf-mixnet.md b/docs/models/tf-mixnet.md
index 383b8a45..8fcab550 100644
--- a/docs/models/tf-mixnet.md
+++ b/docs/models/tf-mixnet.md
@@ -1,7 +1,9 @@
-# Summary
+# (Tensorflow) MixNet

**MixNet** is a type of convolutional neural network discovered via AutoML that utilises [MixConvs](https://paperswithcode.com/method/mixconv) instead of regular [depthwise convolutions](https://paperswithcode.com/method/depthwise-convolution).

+The weights from this model were ported from [Tensorflow/TPU](https://github.com/tensorflow/tpu).
+
## How do I use this model on an image?

To load a pretrained model:

diff --git a/docs/models/tf-mobilenet-v3.md b/docs/models/tf-mobilenet-v3.md
index 016d2d79..0ed20515 100644
--- a/docs/models/tf-mobilenet-v3.md
+++ b/docs/models/tf-mobilenet-v3.md
@@ -1,7 +1,9 @@
-# Summary
+# (Tensorflow) MobileNet v3

**MobileNetV3** is a convolutional neural network that is designed for mobile phone CPUs. The network design includes the use of a [hard swish activation](https://paperswithcode.com/method/hard-swish) and [squeeze-and-excitation](https://paperswithcode.com/method/squeeze-and-excitation-block) modules in the [MBConv blocks](https://paperswithcode.com/method/inverted-residual-block).

+The weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).
+
## How do I use this model on an image?
To load a pretrained model:

diff --git a/docs/models/tresnet.md b/docs/models/tresnet.md
index 1bee64dc..ef15085c 100644
--- a/docs/models/tresnet.md
+++ b/docs/models/tresnet.md
@@ -1,4 +1,4 @@
-# Summary
+# TResNet

A **TResNet** is a variant on a [ResNet](https://paperswithcode.com/method/resnet) that aims to boost accuracy while maintaining GPU training and inference efficiency. It contains several design tricks including a SpaceToDepth stem, [Anti-Alias downsampling](https://paperswithcode.com/method/anti-alias-downsampling), In-Place Activated BatchNorm, block selection and [squeeze-and-excitation layers](https://paperswithcode.com/method/squeeze-and-excitation-block).

diff --git a/docs/models/vision-transformer.md b/docs/models/vision-transformer.md
index 4bd235f0..6b6df64f 100644
--- a/docs/models/vision-transformer.md
+++ b/docs/models/vision-transformer.md
@@ -1,4 +1,4 @@
-# Summary
+# Vision Transformer (ViT)

The **Vision Transformer** is a model for image classification that employs a Transformer-like architecture over patches of the image. This includes the use of [Multi-Head Attention](https://paperswithcode.com/method/multi-head-attention), [Scaled Dot-Product Attention](https://paperswithcode.com/method/scaled) and other architectural features seen in the [Transformer](https://paperswithcode.com/method/transformer) architecture traditionally used for NLP.

diff --git a/docs/models/wide-resnet.md b/docs/models/wide-resnet.md
index f2a34327..f18c870c 100644
--- a/docs/models/wide-resnet.md
+++ b/docs/models/wide-resnet.md
@@ -1,4 +1,4 @@
-# Summary
+# Wide ResNet

**Wide Residual Networks** are a variant on [ResNets](https://paperswithcode.com/method/resnet) where we decrease depth and increase the width of residual networks. This is achieved through the use of [wide residual blocks](https://paperswithcode.com/method/wide-residual-block).

diff --git a/docs/models/xception.md b/docs/models/xception.md
index c42bb1b1..98701dd7 100644
--- a/docs/models/xception.md
+++ b/docs/models/xception.md
@@ -1,7 +1,9 @@
-# Summary
+# Xception

**Xception** is a convolutional neural network architecture that relies solely on [depthwise separable convolution layers](https://paperswithcode.com/method/depthwise-separable-convolution).

+The weights from this model were ported from [Tensorflow/Models](https://github.com/tensorflow/models).
+
## How do I use this model on an image?

To load a pretrained model:
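A minimal sketch completing that loading step, assuming `xception` is the `timm` checkpoint name for these ported weights:

```python
import timm

model = timm.create_model('xception', pretrained=True)
model.eval()
```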