Ross Wightman
1186fc9c73
Merge remote-tracking branch 'origin/master' into bits_and_tpu
2 years ago
Ross Wightman
4f338556d8
Fixes and improvements for metrics, tfds parser, loader / transform handling
* add back ability to create transform with loader
* change 'samples' -> 'examples' for tfds wrapper to match tfds naming
* add support for specifying feature names for input and target in tfds wrapper
* add class_to_idx for image classification datasets in tfds wrapper
* add accumulate_type to avg meters and metrics to allow float32 or float64 accumulation control with lower-precision data
* minor cleanup, log output rate prev and avg
3 years ago
Ross Wightman
80ca078aed
Fix a few bugs and formatting/naming issues
* Pass optimizer resume flag through to checkpoint / updater restore. Related to #961, but it is not clear how it relates to the crash.
* Rename monitor step args, cleanup handling of step_end_idx vs num_steps for consistent log output in either case
* Resume from proper epoch (i.e. next epoch relative to checkpoint)
3 years ago
Ross Wightman
59a3409182
Update README.md
3 years ago
Ross Wightman
3581affb77
Update train.py with some flags related to scheduler tweaks, fix best checkpoint bug.
3 years ago
Ross Wightman
f2e14685a8
Add force-cpu flag for train/validate, fix CPU fallback for device init, remove old force cpu flag for EMA model weights
3 years ago
Ross Wightman
b76b48e8e9
Update optimizer creation for master optimizer changes
3 years ago
Ross Wightman
40457e5691
Transforms, augmentation work for bits, add RandomErasing support for XLA (pushing into transforms), revamp of transform/preproc config, etc ongoing...
3 years ago
Ross Wightman
847b4af144
Update README.md
3 years ago
Ross Wightman
5c5cadfe4c
Update README.md
3 years ago
Ross Wightman
ee2b8f49ee
Update README.md
3 years ago
Ross Wightman
cc870df7b8
Update README.md
3 years ago
Ross Wightman
6b2d9c2660
Another bits/README.md update
3 years ago
Ross Wightman
c3db5f5801
Worker hack for TFDS eval, add TPU env var setting.
3 years ago
Ross Wightman
f411724de4
Fix checkpoint delete issue. Add README about bits and initial PyTorch XLA usage on TPU-VM. Add some FIXMEs and fold train_cfg into train_state by default.
3 years ago
Ross Wightman
91ab0b6ce5
Add proper TrainState checkpoint save/load. Some reorg/refactoring and other cleanup. More to go...
4 years ago
Ross Wightman
5b9c69e80a
Add basic training resume based on legacy code
4 years ago
Ross Wightman
72ca831dd4
Back to using strings for the enum translation, forgot about import dep
4 years ago
Ross Wightman
cbd4ee737f
Fix model init for XLA, remove some prints.
4 years ago
Ross Wightman
aa92d7b1c5
Major timm.bits update. Updater and DeviceEnv now dataclasses, after_step closure used, metrics base impl w/ distributed reduce, many tweaks/fixes.
4 years ago
Ross Wightman
938716c753
Fix import issue, use devenv for dist info in parser_tfds
4 years ago
Ross Wightman
76de984a5f
Fix some bugs with XLA support, logger, add hacky xla dist launch script since torch.dist.launch doesn't work
4 years ago
Ross Wightman
12d9a6d4d2
First timm.bits commit, add initial abstractions, WIP updates to train, val... some of it working
4 years ago