Unordered IO for Improving Performance and Efficiency of PCIe® Fabrics for AI/ML Applications

Introduction to the PCI-SIG Webinar

Overview of the Webinar

  • The webinar focuses on "unordered I/O" (UIO) for enhancing performance and efficiency in PCIe fabrics, particularly for AI and ML applications.
  • Speakers include Debendra Das Sharma from Intel and Steve Glaser from NVIDIA, both of whom are PCI-SIG board members.

Agenda Presentation

  • Debendra introduces the agenda, highlighting the significance of unordered I/O as a transformative enhancement in computing interconnects.
  • The presentation will cover topics such as a PCI-SIG introduction, the producer-consumer ordering model, the motivation behind UIO, system-level considerations, and a Q&A session.

Understanding PCI Express Technology

Introduction to PCI-SIG

  • PCI stands for Peripheral Component Interconnect; SIG refers to Special Interest Group. It consists of over 900 member companies with more than 32 years of history.
  • The group is responsible for developing the open industry standard known as PCI Express technology.

System Architecture Overview

  • Typical systems consist of CPUs connected to memory and various PCI Express devices like SSDs and GPUs.
  • Two representative platforms are discussed: integrated SoCs common in mobile devices and larger data center architectures featuring multiple CPUs.

Memory Address Mapping in Systems

Memory Address Space

  • An example system has a 52-bit address space allowing up to four petabytes total addressable space; not all areas need physical memory.
  • Coherent memory is typically interleaved across DDR devices while non-coherent memory is represented through memory-mapped I/O (MMIO).

Data Movement Mechanism

  • Devices communicate using reads/writes within this mapped memory space; DMA operations facilitate data transfer between SSD storage and coherent memory.

Evolution of PCI Express Technology

Historical Development

Understanding PCI Express Architecture

Overview of PCI Express Evolution

  • PCI Express has evolved to support increased bandwidth per pin while ensuring full backward compatibility across generations, making it a standard I/O interface for various computing platforms.
  • The technology spans seven generations over two decades, adapting through multiple computing revolutions and showcasing significant innovation in maintaining relevance.

Layered Architecture of PCI Express

  • PCI Express features a layered architecture with distinct functionalities at each layer, allowing independent evolution while ensuring backward compatibility.
  • The software layer provides a standardized interface for device discovery via BIOS or operating systems, facilitating plug-and-play capabilities.

Transaction and Data Link Layers

  • The transaction layer employs a packetized split transaction protocol to maximize link efficiency and maintain data consistency through defined producer-consumer ordering models.
  • The data link layer ensures reliable transport using mechanisms like CRC (Cyclic Redundancy Check), while the physical layer manages link training and encoding.

Interoperability Across Devices

  • A single specification governs the entire stack of PCI Express, ensuring interoperability among silicon components regardless of their application in handheld devices or servers.
  • Multiple mechanical form factors (e.g., U.2, M.2, BGA) are supported to accommodate diverse computing device requirements.

Producer-Consumer Model in Data Exchange

  • Load/store semantics, including Direct Memory Access (DMA), are the primary mode of data exchange within PCI Express systems, involving both CPUs and PCIe devices.
  • A robust data consistency model is essential for coherent memory operations between different types of devices interacting within the PCI domain.

Understanding Coherency Guarantees

  • When a PCI device writes to memory, immediate visibility in system memory is not guaranteed; however, coherency ensures that local writes by CPUs are globally visible.
  • The interaction between different load/store worlds (PCI vs. non-PCI devices) relies on root ports within CPUs to manage data consistency effectively.

Practical Example of Producer Consumer Model

  • An example illustrates how Device A (a PCI device) posts writes to memory locations without knowing when they become architecturally visible.

Producer-Consumer Model and Device Synchronization

Understanding Data Consistency in Producer-Consumer Models

  • The producer-consumer model ensures that when a flag indicating updated data is read, the subsequent data read will reflect this update, guaranteeing data consistency.
  • If the consumer reads the updated flag (f′), it is guaranteed to retrieve the updated data (D′). If it instead reads the old flag value (f), it may receive either the stale data or the updated data.
  • Acceptable outcomes are therefore (f, D), (f, D′), and (f′, D′); observing the updated flag together with stale data (f′, D) would mean a loss of data consistency. The model guarantees that seeing an updated flag means accessing up-to-date information, as the sketch below makes concrete.
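The contract above can be made concrete in software. Below is a minimal C sketch (illustrative only; it models the semantics with C11 atomics on a CPU rather than real PCIe hardware): the producer writes the data before the flag with release ordering, so a consumer that observes the updated flag is guaranteed to observe the updated data.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;            /* D: starts as the "old" value           */
static atomic_int flag = 0;     /* f: 0 = old flag, 1 = updated flag (f') */

static void *producer(void *arg) {
    data = 42;                                             /* write D'     */
    atomic_store_explicit(&flag, 1, memory_order_release); /* then flag f' */
    return arg;
}

static void *consumer(void *arg) {
    /* Seeing f' guarantees D'; seeing f allows either old or new data.
       The only outcome the contract forbids is (f', stale D). */
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        printf("flag=f', data=%d (guaranteed to be 42)\n", data);
    else
        printf("flag=f, data=%d (old or new is acceptable)\n", data);
    return arg;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```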

Device Synchronization Mechanism

  • The example illustrates two devices (A and B), which can be configured as either PCI Express devices or CPUs. This flexibility allows for various configurations in device interactions.
  • Device synchronization occurs when two devices (X and Y) collaborate on tasks and need to signal completion of subtasks using specific memory addresses.
  • When device X updates its address with a new value after completing its subtask, it checks device Y's address. If Y has not completed its task yet, X goes to sleep until woken by Y's completion.
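A minimal C model of this synchronization pattern (illustrative; real devices would signal completion via writes over the fabric and be woken by an interrupt or doorbell rather than by polling):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* One "done" flag per device at an agreed memory address. */
static atomic_int done[2] = {0, 0};

static void *worker(void *arg) {
    int self = (int)(long)arg, peer = 1 - self;

    /* ... perform this device's subtask ... */
    atomic_store_explicit(&done[self], 1, memory_order_release);

    /* Check the peer's address; if the peer has not finished,
       "sleep" until it has. */
    while (!atomic_load_explicit(&done[peer], memory_order_acquire))
        usleep(100);

    printf("device %c: both subtasks complete\n", self ? 'Y' : 'X');
    return NULL;
}

int main(void) {
    pthread_t x, y;
    pthread_create(&x, NULL, worker, (void *)0L);
    pthread_create(&y, NULL, worker, (void *)1L);
    pthread_join(x, NULL);
    pthread_join(y, NULL);
    return 0;
}
```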

Enforcing Ordering in PCI Express Hardware

  • PCI Express hardware enforces ordering through an implemented ordering table that governs how transactions are processed across different devices.
  • There are three flow control classes in PCI Express:
  • Posted (P): Memory writes
  • Non-posted (NP): Memory reads
  • Completion (C): Responses to non-posted transactions

Summary of Key Concepts

  • The producer-consumer ordering model acts as a contract between hardware and software, ensuring consistent transaction handling across systems regardless of where they occur within memory hierarchies.
  • Data produced must be written alongside a flag; if the consumer sees this updated flag, it can confidently read the latest data from any location within system memory or via interrupts from PCI devices.

Importance of Ordering Rules

  • Ordering rules are crucial for maintaining data consistency across transactions that traverse multiple links in PCI Express networks.
  • These rules help prevent issues like deadlock while ensuring that all transactions adhere to established protocols for performance optimization.

Understanding Data Consistency in System Memory

Overview of the Ordering Table

  • The system must enforce an ordering table to ensure data consistency and forward progress across all components. A simplified version of this is a 5x5 table.
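As an illustration, a further-simplified 3x3 view over the three flow control classes can be encoded as a matrix (a hypothetical encoding in C; the real table has more rows, columns, and attribute-dependent entries). The "no" column under Posted captures the push rules named below:

```c
#include <stdio.h>

/* Row = later transaction, column = earlier transaction it tries to pass. */
enum { P, NP, C };                       /* Posted, Non-posted, Completion  */
enum pass { NO, YES, Y_N };              /* Y_N: permitted but not required */

static const enum pass order[3][3] = {
    /* earlier:         P    NP    C   */
    /* later P  */    { NO,  YES,  YES },  /* A1A: P pushes all prior P     */
    /* later NP */    { NO,  Y_N,  Y_N },  /* B1A: NP pushes all prior P    */
    /* later C  */    { NO,  YES,  Y_N },  /* C1A: C pushes all prior P     */
};

int main(void) {
    const char *cls[] = {"P", "NP", "C"};
    const char *txt[] = {"no", "yes", "yes/no"};
    for (int later = 0; later < 3; later++)
        for (int earlier = 0; earlier < 3; earlier++)
            printf("may %s pass earlier %s? %s\n",
                   cls[later], cls[earlier], txt[order[later][earlier]]);
    return 0;
}
```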

Rule A1A: Posted Transactions

  • Rule A1A states that a posted transaction must push all prior posted transactions, ensuring that updated flags reflect the latest data values.
  • If this rule is violated, stale data may be read by the core, leading to inconsistencies. For example, if a flag is updated before prior writes are committed, it can result in reading outdated information.

Non-tree Topology Violations

  • In non-tree topologies, violations occur when different paths for data writes lead to inconsistent reads. For instance, if half of the writes go through one CPU and the other half through another without synchronization, the result is errors.

Rule B1A: Non-posted Transactions

  • Rule B1A indicates that a non-posted transaction (like a memory read) must push all prior posted transactions (like memory writes). Failure to follow this can also lead to reading stale values.
  • An example illustrates how device interactions behind a PCI Express switch can cause violations if updates happen out of order.

Completion Transaction Rules

  • Rule C1A specifies that completion transactions must push all prior posted transactions. Similar scenarios as previous rules apply here; failure leads to data consistency issues.

Motivation for Unordered IO (UIO)

Challenges with Traditional Ordering Enforcement

  • Traditional ordering enforcement has served well but faces challenges as bandwidth increases. It requires awareness from various system components beyond just transaction queues.

Challenges in PCI Express Bandwidth and Ordering

Internal Fabrics and Bandwidth Doubling

  • The internal fabrics of devices typically operate within the 1 to 4 GHz range, necessitating bandwidth doubling through increased data path width.
  • Small packet transactions require multiple parallel operations due to limitations in data path width upgrades, complicating ordering rules enforcement.

Challenges with Ordering Rules

  • Most internal fabrics are unordered, leading to complexities in enforcing ordering rules across devices like PCI Express switches.
  • Designing high-bandwidth connections (e.g., an 800 Gbps NIC) requires running PCI Express links at higher transfer rates (e.g., 128 GT/s).

Peer-to-Peer Bandwidth Needs

  • Accelerators often have significant peer-to-peer bandwidth requirements; relying solely on upstream PCI Express links can be limiting.
  • Independent peer-to-peer paths supported by UIO could alleviate congestion and enhance effective bandwidth management.

Enhancing Reliability and Availability

  • Implementing parallel paths with UIO can improve reliability, availability, and congestion management for evolving applications like AI and machine learning.
  • Maintaining the producer-consumer ordering paradigm is essential while ensuring backward compatibility with existing systems.

Unordered I/O Transactions

  • UIO introduces a simple transaction model that allows for unordered I/O without dependency between flow control classes across virtual channels.
  • The traditional ordering model remains available alongside new unordered I/O options, allowing flexibility in transaction handling.

Transaction Types in UIO

  • UIO defines five transaction types: posted writes (memory writes), non-posted reads, and completion types for both read and write operations.
  • The source (producer) enforces ordering itself, holding off on issuing the flag write until the completions for all prior data writes have been received, as sketched below.
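A sketch of what source-enforced ordering looks like from the device's perspective (hypothetical helper names such as uio_write and wait_all_completions; completions are simulated synchronously here for brevity):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static int outstanding = 0;              /* completions not yet received  */

/* Hypothetical: launch a UIO write; its completion arrives later. */
static void uio_write(uint64_t addr, const void *buf, size_t len) {
    (void)buf;
    outstanding++;
    printf("UIO write to %#llx (%zu bytes) issued\n",
           (unsigned long long)addr, len);
}

static void wait_all_completions(void) {
    while (outstanding > 0)
        outstanding--;                   /* stand-in for real async waits */
}

/* Producer contract under UIO: data writes may complete in any order
   over any path, but the flag write launches only after every data
   completion has returned. */
static void produce(uint64_t data_addr, uint64_t flag_addr,
                    const void *data, size_t len) {
    uint32_t flag = 1;
    uio_write(data_addr, data, len);
    wait_all_completions();              /* source-enforced ordering point */
    uio_write(flag_addr, &flag, sizeof flag);
    wait_all_completions();
}

int main(void) {
    uint8_t payload[64];
    memset(payload, 0xAB, sizeof payload);
    produce(0x1000, 0x2000, payload, sizeof payload);
    return 0;
}
```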

Simplifying Flow Control Management

  • The simplified ordering table indicates minimal dependencies among flow control classes, promoting efficient transaction processing without artificial stalls.

Understanding UIO and Bandwidth Scaling

Overview of UIO Operations

  • The process begins with issuing a request for ownership, followed by data retrieval. This is done assuming no snooping occurs during the operation.
  • After obtaining ownership, a write operation is performed, which minimizes traffic between chips, facilitating easier bandwidth scaling through unordered I/O.

Bandwidth Optimization Techniques

  • Devices can connect to two CPUs to increase bandwidth by utilizing multiple PCI Express links while reducing cross-socket or cross-die traffic.
  • A peer-to-peer bandwidth path with fabric topology allows for more available bandwidth and better reliability; alternate paths can be used if a link fails.

Transitioning to UIO Details

  • The speaker transitions to discussing UIO specifics, emphasizing that it operates in a separate virtual channel (VC), isolating it from non-UIO traffic.
  • Each read and write request returns a completion message, allowing out-of-order completions to be tracked and managed effectively.

Handling Completions and Errors

  • Completions are tagged (e.g., "part one of two") so that they can be reassembled correctly despite arriving out of order.
  • In case of partial failures during writes (e.g., 512 bytes fail out of 1 kilobyte), both success and failure messages are returned, confirming transaction completion.
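A sketch (hypothetical structures) of reassembling completions that arrive out of order: each completion carries the request's tag plus a part index and total part count, so parts can be placed at the right offset regardless of arrival order.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAX_PARTS 8
#define PART_SIZE 64

/* Hypothetical completion record: "part i of n" for a given tag. */
struct completion {
    uint16_t tag;                 /* 14-bit UIO tag in real hardware */
    uint8_t  part, total;         /* e.g. part 1 of 2 (0-indexed)    */
    uint8_t  data[PART_SIZE];
};

struct reassembly {
    uint8_t buf[MAX_PARTS * PART_SIZE];
    uint8_t received;
};

/* Place each arriving part at its offset; done when all parts are in. */
static int on_completion(struct reassembly *r, const struct completion *c) {
    memcpy(r->buf + (size_t)c->part * PART_SIZE, c->data, PART_SIZE);
    return ++r->received == c->total;    /* 1 = request fully satisfied */
}

int main(void) {
    struct reassembly r = {0};
    struct completion second = { .tag = 7, .part = 1, .total = 2 };
    struct completion first  = { .tag = 7, .part = 0, .total = 2 };
    /* "Part 2 of 2" arrives before "part 1 of 2". */
    printf("complete after part 2? %d\n", on_completion(&r, &second));
    printf("complete after part 1? %d\n", on_completion(&r, &first));
    return 0;
}
```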

Tagging System in UIO

  • UIO employs 14-bit tags distinct from non-UIO traffic tags, eliminating legacy issues associated with smaller tag sizes.
  • Separate sets of tags for reads and writes ensure clarity in operations; each VC or TC can have its own set of read/write tag spaces.

Performance Enhancements Through Tags

  • Increased tagging capacity supports larger networks with numerous switches, enabling more outstanding reads/writes to hide latency effectively.
  • Multiple outstanding writes can share the same tag; this allows efficient tracking when writing large amounts of data followed by flag writes.

Conclusion on Efficiency Gains

  • The system's design allows for improved performance as it handles multiple devices efficiently without needing specific order tracking for completions.

Understanding Data Contracts and Flags in Device Communication

Overview of Data Handling in Devices

  • The concept of data contracts and flags is crucial for device communication, where devices can perform correctly by adhering to established rules. Relaxed ordering attributes can be applied to all data writes, while flag writes must remain non-relaxed to ensure effective push behavior.

Challenges with Processor Models

  • In scenarios involving reads and writes between devices (A to B), the visibility of software operations becomes less clear. Many processors may lack appropriate fence instructions, complicating the synchronization of write completions.

Update Granularity in PCI Express

  • PCI Express operates on an update granularity of one byte, which allows a reader to potentially see a mix of old and new data during writes. This contrasts with modern memory systems that are cache line-oriented, necessitating a guarantee that either all old or all new data is visible.

Semaphore Writes and Cache Line Behavior

  • Writes to naturally aligned blocks (e.g., 64 bytes) are guaranteed not to present readers with mixed old/new data states. However, if a partial write occurs outside this alignment, the guarantee does not apply (see the sketch below).
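The guarantee can be stated as a predicate (a sketch of the rule as described above, not spec text): a given aligned 64-byte block is seen all-old-or-all-new only if the write fully covers it, and each covered block is independent of the others.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Does the write [addr, addr+len) fully cover the aligned 64-byte
   block containing block_addr?  Only fully covered blocks get the
   all-old-or-all-new visibility guarantee. */
static bool block_atomic(uint64_t addr, uint64_t len, uint64_t block_addr) {
    uint64_t blk = block_addr & ~(uint64_t)63;      /* block start */
    return addr <= blk && blk + 64 <= addr + len;
}

int main(void) {
    /* A 1 KiB aligned write: every contained block is all-old-or-all-new,
       though blocks may become visible in any order (next section). */
    printf("%d\n", block_atomic(0x1000, 1024, 0x1040)); /* 1: covered  */
    /* A misaligned 64-byte write covers no whole block: no guarantee. */
    printf("%d\n", block_atomic(0x1020, 64, 0x1040));   /* 0: partial  */
    return 0;
}
```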

Independent Write Operations

  • Each 64-byte block in UIO transactions operates independently, without enforced ordering across cache lines. This means changes within larger write operations can be observed out of order by readers.

Enhancements in Memory Systems: Interleaving and Bandwidth

Interleave Mechanisms in CXL

  • CXL supports a 256-byte interleave mechanism for Type 3 memory, allowing subsequent blocks to be handled by different completers. This design aims to optimize bandwidth usage while maintaining performance standards.

Adjusting Transaction Rules for Efficiency

  • While PCI Express requests traditionally must not cross 4 KB boundaries, UIO transactions are restricted from crossing 256-byte boundaries, matching the interleave granularity for efficiency gains (see the sketch below).
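A sketch (illustrative helper) of splitting a request so no sub-request crosses a 256-byte boundary, mirroring the rule above:

```c
#include <stdint.h>
#include <stdio.h>

#define UIO_BOUNDARY 256u  /* UIO requests must not cross 256B boundaries */

/* Emit one sub-request per boundary-bounded chunk of [addr, addr+len). */
static void issue_split(uint64_t addr, uint64_t len) {
    while (len > 0) {
        uint64_t room  = UIO_BOUNDARY - (addr % UIO_BOUNDARY);
        uint64_t chunk = len < room ? len : room;
        printf("sub-request: addr=%#llx len=%llu\n",
               (unsigned long long)addr, (unsigned long long)chunk);
        addr += chunk;
        len  -= chunk;
    }
}

int main(void) {
    /* 600 bytes starting 64 bytes before a boundary ->
       chunks of 64, 256, 256, and 24 bytes. */
    issue_split(0x10C0, 600);
    return 0;
}
```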

Future Directions: Non-tree Topologies and Fault Tolerance

Support for Multi-path Architectures

  • UIO facilitates future non-tree topologies such as multi-path configurations and link aggregation. These developments aim at increasing bandwidth while addressing challenges related to chip area utilization as technology advances.

Importance of Power Management

  • Effective power management strategies are essential as they allow individual links within a system to enter low-power states without affecting overall system performance—critical as link complexity increases.

Complexities in Switch Ordering Models

Understanding PCI Ordering Model Limitations

  • The internal fabric of switches must adhere closely to conventional PCI models despite their inherent complexities like crossbars that allow parallelism. Maintaining consistent traffic ordering through shared buses presents significant challenges.

Traffic Management Across Multi-socket Configurations

  • Managing traffic across multiple chip-to-chip links within multi-socket root complexes complicates efforts to maintain the illusion of a single shared fabric due to varying paths taken by packets across different links.

Understanding UIO and Its Implications

Key Considerations in Building Systems with UIO

  • With Stream IDE (Integrity and Data Encryption), key state mutates with each packet, necessitating careful ordering for specific paths while allowing flexibility for others.
  • The entire path from source to destination must operate in Flit mode to utilize UIO effectively; software must recognize this requirement.
  • Current architecture does not dictate when devices should use UIO versus non-UIO components, leading to complexity in device communication.
  • Future developments aim to clarify how devices determine which UIO channel to use, acknowledging that the answer varies based on context.
  • Acknowledgment of latency introduced by waiting for write completions in UIO compared to immediate launches possible in non-UIO channels.

Latency and Performance Trade-offs

  • Devices must manage the latency of required completion waits; this latency is explicit in UIO, whereas it is hidden inside the fabric in non-UIO systems.
  • While moving latency around may not worsen performance, it requires devices to be more attentive, presenting a trade-off between parallelism and complexity.
  • Current limitations include the lack of support for atomic operations within UIO; future updates are expected to address this gap.

Ordering Mechanisms and Challenges

  • Keys used for packets differ based on their order of transmission; incorrect sequencing can lead to misinterpretation of data received.
  • Maintaining consistent ordering within a single path is crucial; packets must arrive sequentially despite potential reordering across different streams or completions.

Future Directions and Software Requirements

  • The producer-consumer model serves as an essential contract between source and destination, requiring adaptation as system architectures evolve towards better performance with UIO technology.
  • Enabling software support for UIO is necessary; activation isn't automatic but requires deliberate configuration by developers.

Multi-path and Multi-link Developments

  • Discussions are ongoing regarding the implementation timeline for multi-path/multi-link capabilities, anticipated around version 7.1 or 7.2 depending on complexity considerations.

Audience Engagement

Decoupling Ordering Performance in UIO Devices

Optimizing UIO Read and Write Completions

  • Discussion on the potential to decouple ordering for performance, suggesting that interleaving long and short transactions is more efficient than managing bursts of short ones.
  • Acknowledgment that distinct TLP (Transaction Layer Packet) types for read and write completions in UIO devices present an optimization opportunity, with no apparent risks to system integrity.

Interoperability Between UIO and Non-UIO Devices

  • Explanation of how UIO devices can operate using a mix of non-UIO traffic, emphasizing that they are not limited to just UIO protocols.
  • Clarification that traditional ordering models remain intact in VC0 (Virtual Channel 0), ensuring compatibility even when integrating older systems with newer technologies.

Flow Control Models in Flit Mode

  • Introduction of shared flow control in Flit mode, which allows for better bandwidth utilization without needing excessive buffering compared to non-Flit modes.

Scaling Bandwidth at System Level

  • Overview of scaling strategies by doubling data paths; example given of internal fabrics maintaining frequency across generations while increasing throughput.
  • Detailed explanation on the necessity for wider data paths as bandwidth increases, illustrating the relationship between PCI Express link width and operational frequency.
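A worked example of that relationship (illustrative numbers, not figures from the webinar): at a fixed internal fabric clock, the required datapath width is simply bandwidth divided by frequency, so each doubling of link bandwidth doubles the width.

```c
#include <stdio.h>

int main(void) {
    double fabric_ghz = 2.0;                 /* internal fabric clock      */
    double link_gbps[] = {256, 512, 1024};   /* e.g. successive x16 links  */

    for (int i = 0; i < 3; i++) {
        double gbytes_per_s = link_gbps[i] / 8.0;        /* bits -> bytes  */
        double width_bytes  = gbytes_per_s / fabric_ghz; /* width = BW / f */
        printf("%4.0f Gb/s link -> %3.0f GB/s -> %2.0f-byte datapath @ %.1f GHz\n",
               link_gbps[i], gbytes_per_s, width_bytes, fabric_ghz);
    }
    return 0;
}
```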

Addressing Bandwidth Efficiency Challenges

  • Discussion on the inefficiencies associated with round trips in memory transactions, highlighting wasted bandwidth and power consumption due to suboptimal provisioning.

Video description

As the role of PCI Express® (PCIe®) technology expands in new areas such as AI/ML, PCI-SIG® has recognized the need to support multi-path fabrics with an ordering model to improve PCIe bandwidth and reduce latency while maintaining backwards compatibility. Unordered IO (UIO) is a new feature added to the PCIe 6.1 specification that defines a new wire semantic protocol and related capabilities for addressing the limitations of the existing PCIe fabric-enforced ordering rules, while keeping the producer-consumer ordering model intact. UIO is a key enabler for Multi-Link PCIe devices (i.e., non-tree topologies). UIO also helps improve the interaction with other protocols, including on-die/on-package protocols, such as CXL, UCIe, etc. Attendees of this webinar will understand the traditional PCIe ordering model, the system-level considerations and tradeoffs for deploying PCIe Unordered IO, and will learn which applications and market segments will benefit most from deploying UIO. Presenters: Debendra Das Sharma (Intel) and Steve Glaser (NVIDIA)