
Releases: meta-pytorch/torchrec

v1.6.0

15 Mar 04:26

Announcement

TorchRec has completed the migration from Pyre to Pyrefly for type checking, as announced in the v1.5.0 release. All pre-existing Pyre deployments have been removed and Pyrefly + PTT type checking is now enabled across the codebase.

New Features

Memory Stashing

Memory Stashing is a framework for temporarily offloading GPU HBM data to host memory during the forward/backward pass, freeing up memory for other computation. It consists of three techniques: Embedding Memory Stashing (EMS) for embedding weights, Optimizer Stashing for optimizer states (e.g. Shampoo), and Activation Stashing for long-sequence activations. This release introduces the full stashing framework including a multi-threaded stashing manager and configurable restore injection.

  • Embedding Memory Stashing (EMS) for EBC [#3745, #3744, #3872]
  • MemoryStashingManager with multithreading support [#3800]
  • TrainPipelineSparseDistEmbStash with configurable restore injection site [#3807]
  • Optimizer stashing API for Shampoo optimizer [#3768]
  • Refactor embedding weights stashing API to add free/await callbacks [#3746]
  • Long-sequence activation stashing [#3819]
  • Memory Stashing onboarding guide [#3823, #3871]
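
To make the idea concrete, here is a minimal sketch of the stashing technique in plain PyTorch: copy a tensor to pinned host memory on a side stream after it is produced, and bring it back to HBM shortly before it is needed again. This illustrates only the mechanism; TorchRec's MemoryStashingManager and the pipeline classes listed above handle the scheduling, free/await callbacks, and restore injection.

```python
import torch

stash_stream = torch.cuda.Stream()  # side stream so copies overlap with compute

def stash_to_host(t: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy a GPU tensor to pinned host memory; the caller drops the HBM copy."""
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    stash_stream.wait_stream(torch.cuda.current_stream())  # ensure t has been produced
    with torch.cuda.stream(stash_stream):
        host.copy_(t, non_blocking=True)
    t.record_stream(stash_stream)  # keep t's memory alive until the copy finishes
    return host

def restore_to_device(host: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Bring a stashed tensor back to HBM before backward (or the next use) needs it."""
    with torch.cuda.stream(stash_stream):
        gpu = host.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(stash_stream)  # order the restore before use
    return gpu
```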

Triton TBE (TritonBatchedFusedEmbeddingBag)

A new Triton-based batched fused embedding bag implementation, providing an alternative to FBGEMM TBE with support for multi-feature tables, 2D sparse parallelism, stochastic rounding, and FP32 output dtype.

  • Add TritonBatchedFusedEmbeddingBag to TorchRec [#3710]
  • Add bound check for TritonTBE [#3727]
  • Add feature_table_map support for multi-feature tables [#3728]
  • Fix FP16 accumulation precision in forward kernels [#3730]
  • Stochastic rounding, FP32 output dtype [#3856]
  • Add 2D sparse parallel support for FULLY_SHARDED [#3859]
  • Add forward kernel sync before NCCL collectives [#3840]
  • Add state management for TritonFusedEmbeddingBag [#3757]

New EmbeddingPerfEstimator Framework

A complete redesign of the embedding performance estimation system with a modular, config-driven architecture. The new framework introduces hardware-specific estimators via a decorator/annotation system, a topology factory, and config classes, replacing the legacy hardcoded estimator approach.

  • Core type definitions and evaluator pattern for embedding performance estimation [#3737, #3739]
  • Decorator/annotation system for hardware-specific customizations [#3738]
  • Default perf estimator configuration and build config [#3740, #3741]
  • Integrate with Enumerator and legacy EmbeddingPerfEstimator for backward compatibility [#3714]
  • Simplified hardware-specific estimators with config-based approach [#3772]
  • Introducing Topology Factory [#3821]
  • Introduce config classes [#3811]
  • Hardware capabilities detection module [#3832]
  • Cleaning up kill switch and legacy estimator code [#3723]

Eval Workflow Pipeline

New eval workflow pipelines that support standalone CPU sparse evaluation and interleaved train-eval hybrid execution, including CPU sparse eval with shared memory and fused SDD eval.

  • Base train-eval-hybrid pipeline [#3695]
  • Fused SDD eval pipeline [#3699]
  • TrainEvalHybridPipelineBase: model.training-based eval detection + draining [#3833]
  • Train pipeline for sparse operations on CPU [#3822]
  • HybridEvalDMP for CPU sparse eval with shared memory [#3824]
  • Adjust eval exhausting handling in hybrid pipeline [#3750]

Gradient Accumulation

Support for gradient accumulation in train pipelines, enabling larger effective batch sizes without increasing memory usage.

  • Add gradient accumulation support to train pipelines [#3762]
  • Add gradient accumulation benchmark [#3808]
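
For reference, the pattern itself is the familiar one below (plain PyTorch); the change in #3762 wires the same idea into the pipeline stages, and the pipeline-level configuration option is not shown here.

```python
import torch

accumulation_steps = 4  # effective batch size = per-step batch size * 4
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()  # scale so the accumulated grads average out
    if (step + 1) % accumulation_steps == 0:  # step/zero only every N micro-batches
        optimizer.step()
        optimizer.zero_grad()
```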

ZCH / MPZCH Improvements

Continued enhancements to Multi-Probe Zero-Collision Hashing including mean pooling support, LRU eviction with opt-in, disable fallback in inference, and several bug fixes.

  • Implement mean pooling with MP-ZCH [#3835]
  • Support disable fallback in MPZCH EC inference path [#3747]
  • LRU + opt-in support [#3831]
  • Use reserved slots to support always-on table
  • Fix hidden bug due to lengths/offsets mismatch when disable_fallback is true [#3829]
  • Fix zch disable_fallback dim mismatch in training eval [#3842]

Pipeline Infrastructure

New pipeline variants and performance optimizations including SDD Lite, backward injection framework, and multi-threaded batch copy to GPU.

  • Introduce TrainPipelineSparseDistLite (SDD Lite) pipeline [#3707]
  • Backward injection framework for SDD & variants
  • Refactor backward injection and add TrainPipelineSparseDistBwdOpt [#3797]
  • Multi-threading for copy data to GPU [#3774, #3780]
  • Prefetch pipeline work with feature preprocessing [#3870]
  • Reuse _copy_batch_to_gpu in _connect in TrainPipelineBase [#3834]

RecMetrics Improvements

Performance and usability improvements to the RecMetrics framework, including CPU-offloaded metric updates, a no-op metric module, and a cleaner MetricModule interface.

  • CPUOffloadedRecMetricModule: DtoHs in the update thread [#3658]
  • NoOpMetricModule [#3657]
  • Cleaner MetricModule interface [#3554]
  • RecMetricModule: torch.cat all tensor lists before gloo all gathers [#3593]
  • Add waitcounter around recmetrics [#3677]
  • Pass batch_size_stages to RecMetrics via _generate_rec_metrics [#3817]

Change Log

  • Implement a write method in DMP [#3801]
  • Add SharderData dataclass and extraction functions [#3851]
  • Pre-compute module-derived fields on ShardingOption during enumerate() [internal]
  • Config-driven int32 embedding indices and offsets support [#3839]
  • Add EmainplaceRowWiseAdagrad optimizer and preserve explicit optimizer type during DMP sharding [#3865]
  • Alpha Decay Optimizer [#3676]
  • Add Muon optimizer support to MVAI trainer [#3844]
  • Auto-detect GPU HBM capacity instead of hardcoding A100 values [#3827]
  • Add intra_group_size to topology [#3696, #3697]
  • Sharing embedding table weights across intra-node ProcessGroup on host [#3810]
  • Add kjt estimated size to FixedPercentageStorageReservation [#3685]
  • Add copy_ and empty_like methods to JaggedTensor [#3721]
  • Add lookup_runtime_meta API in mc_modules [#3826]
  • Return Remapped for SQMCEC [#3680]
  • Log hardcoded compute kernel constraints [#3803]
  • Enable logging for all Planner & ShardEstimator initializations and plan() calls [#3724]
  • Add power-law distribution support for skewed index generation in ModelInput [#3708]
  • GB200 benchmarking scripts and pod size support [#3802]
  • Add RecMetrics support to train pipeline benchmark [#3874]
  • Cloud deployment examples with torchrun and Kubeflow support [#3760]
  • Add comprehensive training visualizations to TorchRec examples [#3759]
  • Fix dynamic shape constraint violation in mark_dynamic_kjt [#3815]
  • Fix TowerQPSMetric load_state_dict_hook to always pop num_batch key [#3854]
  • Fix tensor weighted average [#3701]
  • Fix linear regression prefetch estimate bug [#3784]
  • Fix IntEnum str() behavior for Python 3.11+ compatibility [#3712]
  • Fix typing in batched embedding kernel [#3864]
  • Handle empty tensors in infinity norm computation [#3809]
  • Fix the fx tracer leaf module logic [#3487]
  • Add CUDA 13.0 support [#3820, #3830]
  • Migrate from Pyre to Pyrefly type checking [#3748, #3733]
  • Replace weak test assertions with specific unittest methods [#3875, #3862, #3869, #3860, #3861, #3858]
  • full change log

Compatibility

  • fbgemm-gpu==1.6.0
  • torch==2.11.0

Test Results


v1.5.0

15 Feb 20:00

Announcement

Starting with the next release, TorchRec will migrate from Pyre to Pyrefly for type checking. Contributors should be aware that type checking workflows and configurations will change accordingly.

New Features

Fully Sharded 2D Parallelism

Fully Sharded 2D Parallelism introduces a new sharding strategy that combines fully sharded parallelism with 2D distribution, enabling more efficient utilization of GPU resources for large-scale embedding tables. This includes support for uneven shard sizes, dynamic 2D sharding, and annotated collectives.

  • Fully Sharded 2D Parallelism [#3558]
  • Uneven shard sizes support for Fully Sharded 2D collectives [#3584]
  • Dynamic 2D + Fully Sharded 2D [#3600]
  • Fix padding logic for Fully Sharded 2D [#3626]
  • Add annotations to Fully Sharded 2D collectives [#3678]
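
As a rough mental model (not TorchRec's public API), the 2D layout arranges ranks into a (replicate, shard) mesh: tables are sharded within each sharding group and replicated across groups. The sketch below only shows that arrangement using torch.distributed's DeviceMesh; TorchRec constructs the equivalent process groups internally.

```python
from torch.distributed.device_mesh import init_device_mesh

# Example: 8 ranks arranged as 2 replication groups x 4-way sharding per group.
# Assumes torch.distributed is initialized (e.g. launched via torchrun).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
replicate_pg = mesh.get_group("replicate")  # sync embedding updates across replicas
shard_pg = mesh.get_group("shard")          # all-to-all lookups within a sharding group
```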

Train Pipeline Enhancements

Several new pipeline capabilities have been added, including fused sparse dist enhancements for improved training throughput and in-place batch copy to save HBM usage.

  • Add inplace_copy_batch_to_gpu in TrainPipeline, enabling non-blocking host-to-device batch transfer to save HBM usage [#3526, #3532, #3641]
  • Add device parameter to KeyedJaggedTensor.empty_like and copy_ methods [#3510]
  • Refactor TrainPipelineBase to clean input batch after the forward pass [#3530]
  • Support enqueue_batch_after_forward in TrainPipelineFusedSparseDist [#3675]
  • Train pipeline with FP own bucket feature [#3683]
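
For orientation, the surrounding usage of TrainPipelineSparseDist is unchanged by these additions; a sketch of the standard loop follows, with the model, optimizer, and dataloader as placeholders. The in-place batch copy and enqueue_batch_after_forward behaviors are opt-in options on the pipeline variants listed above, and their exact argument names are not reproduced here.

```python
import torch
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

# `model` is assumed to be a DistributedModelParallel-wrapped model and
# `optimizer` its dense optimizer; `train_dataloader` is a placeholder.
pipeline = TrainPipelineSparseDist(
    model=model,
    optimizer=optimizer,
    device=torch.device("cuda"),
)

dataloader_iter = iter(train_dataloader)
while True:
    try:
        out = pipeline.progress(dataloader_iter)  # overlaps input dist, H2D copy, and fwd/bwd
    except StopIteration:
        break
```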

Benchmark

Continued expansion of the unified benchmarking framework with new benchmark scenarios, memory profiling, and a GitHub benchmark workflow.

  • Memory analysis and profiling: CUDA memory footprint in multi-stream scenario, memory snapshot for non-blocking copy [#3480, #3485, #3504]
  • Device-to-Host LazyAwaitable with knowledge sharing, demonstrating host-device comms [#3477, #3492]
  • Add new benchmark scenarios: base pipeline light, KV-ZCH, MP-ZCH, VBE [#3580, #3540, #3604, #3642, #3585]
  • Benchmark infrastructure improvements: ModelSelectionConfig, prettified output, log level [#3467, #3639, #3494]
  • Create GitHub benchmark workflow [#3631]

RecMetrics

New metrics have been added to expand TorchRec's metric coverage for multi-label, regression, and serving use cases.

  • Per-label precision metric for multi-label tasks [#3661]
  • Lifetime AUPRC [#3674]
  • Serving AE loss metric [#3681]
  • Label average metric for regression APS model [#3650]
  • NMSE metric for APS model [#3489 (internal)]

Delta Tracking and Publishing

The ModelDeltaTracker and DeltaStore APIs have been generalized and extended with raw ID tracking capabilities, enabling more flexible delta-based model publishing workflows.

  • Update ModelDeltaTracker and DeltaStore to be Generic [#3468, #3469, #3470]
  • Update DeltaCheckpointing and DeltaPublish with generic model tracker [#3543]
  • ModelDeltaTracker improvements: post-init initialization, optim state tracking, optim state init bug fix [#3472, #3143, #3476]
  • Raw ID tracker: add tracker, wrapper, post lookup function, DMP integration, and hash_zch_runtime_meta support [#3500, #3501, #3502, #3506, #3527, #3541, #3542, #3545, #3598, #3599]

KVZCH Enhancements

Continued improvements to Key-Value Zero-Collision Hashing, including auto feature score collection and eviction policy updates.

  • Enable feature score auto collection in EBC and EC [#3475, #3474]
  • Eviction policy improvements: no-eviction support, free mem trigger with all2all, skip feature score threshold for ttl, config rename [#3488, #3490, #3552, #3514]
  • Per-feature ZCH lookup support for memory layer [#3618]

Python Free-Threading Support

TorchRec now supports Python free-threading (PEP 703) on Python 3.14, enabling better performance in multi-threaded environments.

  • Support python free-threading [#3684, #3686]
  • Update supported Python version in setup.py, unittest workflow, and documentation [#3596, #3662, #3483]
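
A quick way to confirm that an interpreter is a free-threaded (PEP 703) build and whether the GIL is currently enabled when validating such a setup:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded CPython builds (3.13+).
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# sys._is_gil_enabled() reports whether the GIL is active at runtime on those builds.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```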

Change Log

  • Direct MX4→BF16 dequantization to reduce memory [#3620]
  • Input distribution latency estimations [#3575]
  • Enable specifying output dtype for fp8 quantized communication [#3568]
  • Add custom all2all interface [#3454]
  • Adding MC EBC quant embedding modules for inference [#3572]
  • VBE improvements: KJT validator, pre-allocated output tensor & offsets for TBE, MC-EBC support [#3645, #3624, #3617]
  • PT2 compatibility: TBE serialization support to IR, EBC short circuit kwargs support, dynamo pruning logic update, generate 1 acc graph by removing fx wrapper for KJT [#3637, #3557, #3566, #3582]
  • Rowwise for feature processors [#3606]
  • Fix grad clipping compatibility with CPU training [#3679]
  • Fix NaN handling in AUPRC metric calculation [#3523]
  • FQN/checkpointing tests for RecMetrics [#3612]
  • Add Metric compatibility test for RecMetricsModule [#3586]
  • Shard plan validation: shard to rank assignment [#3495]
  • Debug embedding modules for NaN detection in backward [#3519]
  • Enable logging for plan(), ShardEstimators, and TrainingPipeline constructors [#3576]
  • Object_id dedup for fused optimizer [#3666]
  • Cache weight/optimizer tensor mappings for efficient sync() [#3610]
  • Test fixes and stability improvements [#3592, #3672, #3644, #3621, #3590, #3589, #3528]
  • full change log

Compatibility

  • fbgemm-gpu==1.5.0
  • torch==2.10.0

Test Results

v1.4.0

07 Dec 18:25

Breaking Change

New Features

Unified Benchmark

Benchmarking is absolutely essential for TorchRec, a library designed for building and scaling massive recommender systems. Given TorchRec's focus on handling enormous embedding tables and complex model architectures across distributed hardware, a unified benchmarking framework allows developers to quantify the performance implications of various configurations. These configurations include different sharding strategies, specialized kernels, and model parallelism techniques. This systematic evaluation is crucial for identifying the most efficient training and inference setups, uncovering bottlenecks, and understanding the trade-offs between speed, memory usage, and model accuracy for specific recommendation tasks.

RecMetrics Offloading to CPU

  • Zero-Overhead RecMetric (ZORM)
    We have developed a CPU-offloaded RecMetricModule implementation that removes metric update(), compute(), and publish() operations from the GPU execution critical path, achieving up to 11.47% QPS improvement in production models with numerical parity, at the cost of roughly 10% additional average host CPU utilization. [#3123, #3424, #3428]

Resharding API

The TorchRec Resharding API adds the capability to reshard embedding tables during training. It can be used for manual tuning of sharding plans mid-training and provides the underlying mechanism for Dynamic Resharding. Given a newer sharding plan, the API reshards the existing sharded embedding tables, accepting only the shards that change relative to the current plan.

  • Enable Changing the # of shards for CW resharding: #3188, #3245
  • ReshardingAPI Host Memory Offloading and BenchmarkReshardingHandler: #3291
  • Resharding API Performance Improvement: #3323

Prototyping KVZCH (Key-Value Zero-Collision Hashing)

Extend the current TBE: considerable effort and expertise have gone into enabling a performance-optimized TBE for accessing HBM as well as host DRAM. We want to leverage those capabilities and extend on top of TBE.
Abstract out the details of the backend memory: the backing memory could be SSD, remote memory tiers accessed through the backend, or remote memory accessed through the frontend. We want to enable all of these capabilities without adding backend-specific logic to the TBE code.

  • Add configs for write dist: #3390
  • Allow the ability for uneven row wise sharding based on number of buckets for zch: #3341
  • Fix embedding table type and eviction policy in st publish: #3309
  • Add direct_write_embedding method: #3332
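
A purely illustrative sketch of the "abstract out the backend memory" idea: the lookup path talks to a narrow key-value interface, so rows can live on SSD or a remote memory tier without backend-specific logic leaking into TBE. The interface and class below are hypothetical and are not TorchRec's actual KV API.

```python
from typing import Protocol

import torch

class KVEmbeddingBackend(Protocol):
    def get(self, ids: torch.Tensor) -> torch.Tensor:
        """Fetch embedding rows for the given ids, shape [len(ids), dim]."""
        ...

    def set(self, ids: torch.Tensor, rows: torch.Tensor) -> None:
        """Write back updated rows after the optimizer step."""
        ...

class InMemoryBackend:
    """Trivial host-memory backend standing in for SSD/remote tiers."""

    def __init__(self, num_rows: int, dim: int) -> None:
        self._rows = torch.zeros(num_rows, dim)

    def get(self, ids: torch.Tensor) -> torch.Tensor:
        return self._rows[ids]

    def set(self, ids: torch.Tensor, rows: torch.Tensor) -> None:
        self._rows[ids] = rows
```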

Change Log

  • There are rare cases using VBE where one of the KJTs has the same batch size across features; this is not recognized as VBE on KJT init, which can cause issues in the forward pass. We now initialize both output dist comms to support this: #3378
  • Pipeline minor change, docstring, and refactoring: #3294, #3314, #3326, #3377, #3379, #3384, #3443 #3345
  • Add ability in SSDTBE to fetch weights from L1 and SP from outside of the module: #3166
  • Add validations for rec metrics config creation to avoid out of bounds indices: #3421
  • Add variable batch size support to tower QPS: #3438
  • Add row based sharding support for FeaturedProcessedEBC: #3281
  • Add logging when merging VBE embeddings from multiple TBEs: #3304
  • full change log

Compatibility

  • fbgemm-gpu==1.4.0
  • torch==2.9.0

Test Results


v1.4.0-rc1

18 Oct 01:21

Pre-release

  • Release cut for v1.4.0
  • In sync with fbgemm-gpu release v1.4.0
  • In sync with PyTorch 2.9

v1.3.0

13 Sep 22:45

New Features

New Flavors of Training Pipelines

  • Fused SDD: A new pipeline optimization schema that overlaps the optimizer step with the embedding lookup. Training QPS gains are observed for models with heavy optimizers (e.g., Shampoo). [#2916, #2933]
  • 2D sharding support: the common SDD train pipeline now supports the 2D sharding schema. [#2929]
  • PostProc module support in the train pipeline. [#2939, #2978, #2982, #2999]

Delta Tracker and Delta Store

ModelDeltaTracker is a utility for tracking and retrieving unique IDs and their corresponding embeddings or states from embedding modules in models built with TorchRec. [#3056, #3060, #3064, ...]
It's particularly useful for:

  • Identifying which embedding rows were accessed during model execution
  • Retrieving the latest delta or unique rows for a model
  • Computing top-k changed embeddings
  • Supporting streaming updated embeddings between systems during online training

Resharding API

The TorchRec Resharding API adds the capability to reshard embedding tables during training. It can be used for manual tuning of sharding plans mid-training and provides the underlying mechanism for Dynamic Resharding. Given a newer sharding plan, the API reshards the existing sharded embedding tables, accepting only the shards that change relative to the current plan. [#2911, #2912, #2944, #3053, ...]

  • Resharding API supports Table-Wise (TW) and Column-Wise (CW) resharding
  • Optimizer support includes SGD and Adagrad (with Row-wise Adagrad for TW)
  • Provides a highly performant API, tested on up to 128 GPUs across 16 nodes with NVIDIA A100 80GB GPUs, achieving an average resharding downtime of approximately 200 milliseconds for around 100GB of total data.
  • Achieved 0.1% average downtime per reshard relative to total training time for a ~100GB DLRM model.

Prototyping KVZCH (Key-Value Zero-Collision Hashing)

Extend the current TBE: considerable effort and expertise have gone into enabling a performance-optimized TBE for accessing HBM as well as host DRAM. We want to leverage those capabilities and extend on top of TBE.
Abstract out the details of the backend memory: the backing memory could be SSD, remote memory tiers accessed through the backend, or remote memory accessed through the frontend. We want to enable all of these capabilities without adding backend-specific logic to the TBE code.

  • KV TBE Design document [#2942]
  • KVZCH embedding lookup module [#2922]

MPZCH (Multi-Probe Zero-Collision Hashing) [#3089]

  • We are introducing a novel Multi-Probe Zero Collision Hash (MPZCH) solution based on multi-round linear probing to address the long-standing hash collision problem in sparse embedding lookup. The proposed solution is general, highly performant, scalable, and simple.
  • A fast CUDA kernel maps input sparse features to indices/slots with minimal chance of collision under a given budget. Eviction or fallback may occur when a collision happens. Mapped indices and eviction information are returned for the downstream embedding lookup and optimizer state updates. The process takes only a couple of milliseconds per batch during training, and a CPU kernel provides good performance in inference environments.
  • A row-wise sharded ManagedCollisionModule (MCH) is added to the TorchRec library, enabling seamless integration with large-scale distributed model training in production. It imposes no extra limit on model scaling, and the training throughput regression is little to none.
  • The solution has been adopted and tested by various product models with multi-billion hash sizes across retrieval and ranking. Promising results were observed in both offline and online experiments.
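
For context, the sketch below shows how a managed-collision module wraps an EmbeddingCollection in TorchRec, using the pre-existing MCHManagedCollisionModule for illustration; the MPZCH module from #3089 plugs into the same ManagedCollisionCollection interface, and its exact class name and constructor arguments are not shown. The MCH constructor arguments used here reflect the documented API and may differ slightly across versions.

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingConfig
from torchrec.modules.embedding_modules import EmbeddingCollection
from torchrec.modules.mc_embedding_modules import ManagedCollisionEmbeddingCollection
from torchrec.modules.mc_modules import (
    DistanceLFU_EvictionPolicy,
    ManagedCollisionCollection,
    MCHManagedCollisionModule,
)

table = EmbeddingConfig(
    name="t_user", embedding_dim=16, num_embeddings=1000, feature_names=["user_id"]
)
ec = EmbeddingCollection(tables=[table], device=torch.device("cpu"))

mc_modules = {
    "t_user": MCHManagedCollisionModule(
        zch_size=1000,                 # size of the zero-collision id space
        device=torch.device("cpu"),
        input_hash_size=2**63 - 1,     # range of the raw input ids
        eviction_interval=1,
        eviction_policy=DistanceLFU_EvictionPolicy(),
    )
}
mc_ec = ManagedCollisionEmbeddingCollection(
    ec, ManagedCollisionCollection(mc_modules, [table])
)
```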

Change Log

Compatibility

  • fbgemm-gpu==1.3.0
  • torch==2.8.0

v1.3.0-rc3

13 Sep 17:47

Pre-release

Wheel build test: passed
Binary validation: passed
CPU CI test: passed
GPU CI test: passed

v1.3.0-rc2

13 Sep 16:35

Pre-release

Bump the TorchRec version and pin the torch version

v1.3.0-rc1

12 Sep 17:02

Pre-release

Align with the fbgemm release cut around 6/28

v1.2.0

06 Jun 17:00

New Features

TensorDict support for EBC and EC

An EBC/EC module can now take a TensorDict as the data input as an alternative to KeyedJaggedTensor: #2581 #2596
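
For reference, baseline EBC usage with a KeyedJaggedTensor is shown below; per #2581/#2596 the same module can also accept a tensordict.TensorDict carrying the same feature keys, whose exact expected layout is not reproduced here.

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t_product",
            embedding_dim=8,
            num_embeddings=100,
            feature_names=["product"],
        )
    ],
    device=torch.device("cpu"),
)

kjt = KeyedJaggedTensor(
    keys=["product"],
    values=torch.tensor([1, 2, 3, 5]),
    lengths=torch.tensor([2, 2]),  # batch size 2, two ids per sample
)
pooled = ebc(kjt)  # KeyedTensor with pooled "product" embeddings of shape [2, 8]
```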

Customized Embedding Lookup Kernel Support

The NVIDIA dynamicemb package depends on an old TorchRec release (r0.7) plus a PR (#2533); the TorchRec embedding lookup structures have been refactored so that a customized embedding-lookup kernel can be plugged in easily: #2887 #2891

Prototype of Dynamic Sharding

Adds an initial dynamic sharding API and tests. The current version supports EBC, TW sharding, and ShardedTensor; other variants beyond those configurations (e.g. CW, RW, DTensor) are not yet covered: #2852 #2875 #2877 #2863

TorchRec 2D Parallel for EmbeddingCollection

Adds support for EmbeddingCollection modules in 2D parallel, covering all sharding types supported for EC. #2737

Change Log

  • Support MCH for semi-sync (assuming no eviction): #2753
  • Multi forward MCH eviction fix: #2836
  • Fix RW Support and checkpointing: #2890

v1.2.0-rc3

06 Jun 07:01

Pre-release

Revert #2876 and update the binary validation script