Unordered IO for Improving Performance and Efficiency of PCIe® Fabrics for AI/ML Applications

Introduction to the PCI-SIG Webinar

Overview of the Webinar

  • The webinar focuses on "unordered I/O" (UIO) for enhancing performance and efficiency in PCIe fabrics, particularly for AI and ML applications.
  • Speakers include Debendra Das Sharma from Intel and Steve Glaser from NVIDIA, both of whom are PCI-SIG board members.

Agenda Presentation

  • Debendra introduces the agenda, highlighting the significance of unordered I/O as a transformative enhancement in computing interconnects.
  • The presentation will cover topics such as a PCI-SIG introduction, the producer-consumer ordering model, the motivation behind UIO, system-level considerations, and a Q&A session.

Understanding PCI Express Technology

Introduction to PCI-SIG

  • PCI stands for Peripheral Component Interconnect; SIG refers to Special Interest Group. It consists of over 900 member companies with more than 32 years of history.
  • The group is responsible for developing the open industry standard known as PCI Express technology.

System Architecture Overview

  • Typical systems consist of CPUs connected to memory and various PCI Express devices like SSDs and GPUs.
  • Two representative platforms are discussed: integrated SoCs common in mobile devices and larger data center architectures featuring multiple CPUs.

Memory Address Mapping in Systems

Memory Address Space

  • An example system has a 52-bit address space allowing up to four petabytes total addressable space; not all areas need physical memory.
  • Coherent memory is typically interleaved across DDR devices while non-coherent memory is represented through memory-mapped I/O (MMIO).

Data Movement Mechanism

  • Devices communicate using reads/writes within this mapped memory space; DMA operations facilitate data transfer between SSD storage and coherent memory.

Evolution of PCI Express Technology

Historical Development

Understanding PCI Express Architecture

Overview of PCI Express Evolution

  • PCI Express has evolved to support increased bandwidth per pin while ensuring full backward compatibility across generations, making it a standard I/O interface for various computing platforms.
  • The technology spans seven generations over two decades, adapting through multiple computing revolutions and showcasing significant innovation in maintaining relevance.

Layered Architecture of PCI Express

  • PCI Express features a layered architecture with distinct functionalities at each layer, allowing independent evolution while ensuring backward compatibility.
  • The software layer provides a standardized interface for device discovery via BIOS or operating systems, facilitating plug-and-play capabilities.

Transaction and Data Link Layers

  • The transaction layer employs a packetized split transaction protocol to maximize link efficiency and maintain data consistency through defined producer-consumer ordering models.
  • The data link layer ensures reliable transport using mechanisms like CRC (Cyclic Redundancy Check), while the physical layer manages link training and encoding.

Interoperability Across Devices

  • A single specification governs the entire stack of PCI Express, ensuring interoperability among silicon components regardless of their application in handheld devices or servers.
  • Multiple mechanical form factors (e.g., U.2, M.2, BGA) are supported to accommodate diverse computing device requirements.

Producer-Consumer Model in Data Exchange

  • Load/store semantics, including Direct Memory Access (DMA), are the primary mode of data exchange within PCI Express systems, involving both CPUs and PCIe devices.
  • A robust data consistency model is essential for coherent memory operations between different types of devices interacting within the PCI domain.

Understanding Coherency Guarantees

  • When a PCI device writes to memory, immediate visibility in system memory is not guaranteed; however, coherency ensures that local writes by CPUs are globally visible.
  • The interaction between different load/store worlds (PCI vs. non-PCI devices) relies on root ports within CPUs to manage data consistency effectively.

Practical Example of Producer Consumer Model

  • An example illustrates how Device A (a PCI device) posts writes to memory locations without knowing when they become architecturally visible.

Producer-Consumer Model and Device Synchronization

Understanding Data Consistency in Producer-Consumer Models

  • The producer-consumer model ensures that when a flag indicating updated data is read, the subsequent data read will reflect this update, guaranteeing data consistency.
  • If the consumer reads the updated flag (f′), it is guaranteed to retrieve the updated data (D′). If it instead reads the old flag value (f), it may receive either the stale data or the updated data.
  • Acceptable outcomes are therefore (f, D), (f, D′), and (f′, D′); observing the updated flag together with stale data (f′, D) would mean a loss of data consistency. The model guarantees that seeing an updated flag means accessing up-to-date information, as the sketch below makes concrete.
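The contract above can be made concrete in software. Below is a minimal C sketch (illustrative only; it models the semantics with C11 atomics on a CPU rather than real PCIe hardware): the producer writes the data before the flag with release ordering, so a consumer that observes the updated flag is guaranteed to observe the updated data.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;            /* D: starts as the "old" value           */
static atomic_int flag = 0;     /* f: 0 = old flag, 1 = updated flag (f') */

static void *producer(void *arg) {
    data = 42;                                             /* write D'     */
    atomic_store_explicit(&flag, 1, memory_order_release); /* then flag f' */
    return arg;
}

static void *consumer(void *arg) {
    /* Seeing f' guarantees D'; seeing f allows either old or new data.
       The only outcome the contract forbids is (f', stale D). */
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        printf("flag=f', data=%d (guaranteed to be 42)\n", data);
    else
        printf("flag=f, data=%d (old or new is acceptable)\n", data);
    return arg;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```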

Device Synchronization Mechanism

  • The example illustrates two devices (A and B), which can be configured as either PCI Express devices or CPUs. This flexibility allows for various configurations in device interactions.
  • Device synchronization occurs when two devices (X and Y) collaborate on tasks and need to signal completion of subtasks using specific memory addresses.
  • When device X updates its address with a new value after completing its subtask, it checks device Y's address. If Y has not completed its task yet, X goes to sleep until woken by Y's completion.
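A minimal C model of this synchronization pattern (illustrative; real devices would signal completion via writes over the fabric and be woken by an interrupt or doorbell rather than by polling):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* One "done" flag per device at an agreed memory address. */
static atomic_int done[2] = {0, 0};

static void *worker(void *arg) {
    int self = (int)(long)arg, peer = 1 - self;

    /* ... perform this device's subtask ... */
    atomic_store_explicit(&done[self], 1, memory_order_release);

    /* Check the peer's address; if the peer has not finished,
       "sleep" until it has. */
    while (!atomic_load_explicit(&done[peer], memory_order_acquire))
        usleep(100);

    printf("device %c: both subtasks complete\n", self ? 'Y' : 'X');
    return NULL;
}

int main(void) {
    pthread_t x, y;
    pthread_create(&x, NULL, worker, (void *)0L);
    pthread_create(&y, NULL, worker, (void *)1L);
    pthread_join(x, NULL);
    pthread_join(y, NULL);
    return 0;
}
```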

Enforcing Ordering in PCI Express Hardware

  • PCI Express hardware enforces ordering through an implemented ordering table that governs how transactions are processed across different devices.
  • There are three flow control classes in PCI Express:
  • Posted (P): Memory writes
  • Non-posted (NP): Memory reads
  • Completion (C): Responses to non-posted transactions

Summary of Key Concepts

  • The producer-consumer ordering model acts as a contract between hardware and software, ensuring consistent transaction handling across systems regardless of where they occur within memory hierarchies.
  • Data produced must be written alongside a flag; if the consumer sees this updated flag, it can confidently read the latest data from any location within system memory or via interrupts from PCI devices.

Importance of Ordering Rules

  • Ordering rules are crucial for maintaining data consistency across transactions that traverse multiple links in PCI Express networks.
  • These rules help prevent issues like deadlock while ensuring that all transactions adhere to established protocols for performance optimization.

Understanding Data Consistency in System Memory

Overview of the Ordering Table

  • The system must enforce an ordering table to ensure data consistency and forward progress across all components. A simplified version of this is a 5x5 table.
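As an illustration, a further-simplified 3x3 view over the three flow control classes can be encoded as a matrix (a hypothetical encoding in C; the real table has more rows, columns, and attribute-dependent entries). The "no" column under Posted captures the push rules named below:

```c
#include <stdio.h>

/* Row = later transaction, column = earlier transaction it tries to pass. */
enum { P, NP, C };                       /* Posted, Non-posted, Completion  */
enum pass { NO, YES, Y_N };              /* Y_N: permitted but not required */

static const enum pass order[3][3] = {
    /* earlier:         P    NP    C   */
    /* later P  */    { NO,  YES,  YES },  /* A1A: P pushes all prior P     */
    /* later NP */    { NO,  Y_N,  Y_N },  /* B1A: NP pushes all prior P    */
    /* later C  */    { NO,  YES,  Y_N },  /* C1A: C pushes all prior P     */
};

int main(void) {
    const char *cls[] = {"P", "NP", "C"};
    const char *txt[] = {"no", "yes", "yes/no"};
    for (int later = 0; later < 3; later++)
        for (int earlier = 0; earlier < 3; earlier++)
            printf("may %s pass earlier %s? %s\n",
                   cls[later], cls[earlier], txt[order[later][earlier]]);
    return 0;
}
```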

Rule A1A: Posted Transactions

  • Rule A1A states that a posted transaction must push all prior posted transactions, ensuring that updated flags reflect the latest data values.
  • If this rule is violated, stale data may be read by the core, leading to inconsistencies. For example, if a flag is updated before prior writes are committed, it can result in reading outdated information.

Non-tree Topology Violations

  • In non-tree topologies, violations occur when different paths for data writes lead to inconsistent reads. For instance, if half of the writes go through one CPU and the other half through another without synchronization, the result is errors.

Rule B1A: Non-posted Transactions

  • Rule B1A indicates that a non-posted transaction (like a memory read) must push all prior posted transactions (like memory writes). Failure to follow this can also lead to reading stale values.
  • An example illustrates how device interactions behind a PCI Express switch can cause violations if updates happen out of order.

Completion Transaction Rules

  • Rule C1A specifies that completion transactions must push all prior posted transactions. Similar scenarios as previous rules apply here; failure leads to data consistency issues.

Motivation for Unordered IO (UIO)

Challenges with Traditional Ordering Enforcement

  • Traditional ordering enforcement has served well but faces challenges as bandwidth increases. It requires awareness from various system components beyond just transaction queues.

Challenges in PCI Express Bandwidth and Ordering

Internal Fabrics and Bandwidth Doubling

  • The internal fabrics of devices typically operate within the 1 to 4 GHz range, necessitating bandwidth doubling through increased data path width.
  • Small packet transactions require multiple parallel operations due to limitations in data path width upgrades, complicating ordering rules enforcement.

Challenges with Ordering Rules

  • Most internal fabrics are unordered, leading to complexities in enforcing ordering rules across devices like PCI Express switches.
  • Designing high-bandwidth connections (e.g., an 800 Gbps NIC) requires running PCI Express links at higher transfer rates (e.g., 128 GT/s).

Peer-to-Peer Bandwidth Needs

  • Accelerators often have significant peer-to-peer bandwidth requirements; relying solely on upstream PCI Express links can be limiting.
  • Independent peer-to-peer paths supported by UIO could alleviate congestion and enhance effective bandwidth management.

Enhancing Reliability and Availability

  • Implementing parallel paths with UIO can improve reliability, availability, and congestion management for evolving applications like AI and machine learning.
  • Maintaining the producer-consumer ordering paradigm is essential while ensuring backward compatibility with existing systems.

Unordered I/O Transactions

  • UIO introduces a simple transaction model that allows for unordered I/O without dependency between flow control classes across virtual channels.
  • The traditional ordering model remains available alongside new unordered I/O options, allowing flexibility in transaction handling.

Transaction Types in UIO

  • UIO defines five transaction types: posted writes (memory writes), non-posted reads, and completion types for both read and write operations.
  • The source (producer) enforces ordering itself, holding off on issuing the flag write until the completions for all prior data writes have been received, as sketched below.
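A sketch of what source-enforced ordering looks like from the device's perspective (hypothetical helper names such as uio_write and wait_all_completions; completions are simulated synchronously here for brevity):

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static int outstanding = 0;              /* completions not yet received  */

/* Hypothetical: launch a UIO write; its completion arrives later. */
static void uio_write(uint64_t addr, const void *buf, size_t len) {
    (void)buf;
    outstanding++;
    printf("UIO write to %#llx (%zu bytes) issued\n",
           (unsigned long long)addr, len);
}

static void wait_all_completions(void) {
    while (outstanding > 0)
        outstanding--;                   /* stand-in for real async waits */
}

/* Producer contract under UIO: data writes may complete in any order
   over any path, but the flag write launches only after every data
   completion has returned. */
static void produce(uint64_t data_addr, uint64_t flag_addr,
                    const void *data, size_t len) {
    uint32_t flag = 1;
    uio_write(data_addr, data, len);
    wait_all_completions();              /* source-enforced ordering point */
    uio_write(flag_addr, &flag, sizeof flag);
    wait_all_completions();
}

int main(void) {
    uint8_t payload[64];
    memset(payload, 0xAB, sizeof payload);
    produce(0x1000, 0x2000, payload, sizeof payload);
    return 0;
}
```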

Simplifying Flow Control Management

  • The simplified ordering table indicates minimal dependencies among flow control classes, promoting efficient transaction processing without artificial stalls.

Understanding UIO and Bandwidth Scaling

Overview of UIO Operations

  • The process begins with issuing a request for ownership, followed by data retrieval. This is done assuming no snooping occurs during the operation.
  • After obtaining ownership, a write operation is performed, which minimizes traffic between chips, facilitating easier bandwidth scaling through unordered I/O.

Bandwidth Optimization Techniques

  • Devices can connect to two CPUs to increase bandwidth by utilizing multiple PCI Express links while reducing cross-socket or cross-die traffic.
  • A peer-to-peer bandwidth path with fabric topology allows for more available bandwidth and better reliability; alternate paths can be used if a link fails.

Transitioning to UIO Details

  • The speaker transitions to discussing UIO specifics, emphasizing that it operates in a separate virtual channel (VC), isolating it from non-UIO traffic.
  • Each read and write request returns a completion message, allowing out-of-order completions to be tracked and managed effectively.

Handling Completions and Errors

  • Completions are tagged (e.g., "part one of two") so that they can be reassembled correctly despite arriving out of order.
  • In case of partial failures during writes (e.g., 512 bytes fail out of 1 kilobyte), both success and failure messages are returned, confirming transaction completion.
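A sketch (hypothetical structures) of reassembling completions that arrive out of order: each completion carries the request's tag plus a part index and total part count, so parts can be placed at the right offset regardless of arrival order.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAX_PARTS 8
#define PART_SIZE 64

/* Hypothetical completion record: "part i of n" for a given tag. */
struct completion {
    uint16_t tag;                 /* 14-bit UIO tag in real hardware */
    uint8_t  part, total;         /* e.g. part 1 of 2 (0-indexed)    */
    uint8_t  data[PART_SIZE];
};

struct reassembly {
    uint8_t buf[MAX_PARTS * PART_SIZE];
    uint8_t received;
};

/* Place each arriving part at its offset; done when all parts are in. */
static int on_completion(struct reassembly *r, const struct completion *c) {
    memcpy(r->buf + (size_t)c->part * PART_SIZE, c->data, PART_SIZE);
    return ++r->received == c->total;    /* 1 = request fully satisfied */
}

int main(void) {
    struct reassembly r = {0};
    struct completion second = { .tag = 7, .part = 1, .total = 2 };
    struct completion first  = { .tag = 7, .part = 0, .total = 2 };
    /* "Part 2 of 2" arrives before "part 1 of 2". */
    printf("complete after part 2? %d\n", on_completion(&r, &second));
    printf("complete after part 1? %d\n", on_completion(&r, &first));
    return 0;
}
```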

Tagging System in UIO

  • UIO employs 14-bit tags distinct from non-UIO traffic tags, eliminating legacy issues associated with smaller tag sizes.
  • Separate sets of tags for reads and writes ensure clarity in operations; each VC or TC can have its own set of read/write tag spaces.

Performance Enhancements Through Tags

  • Increased tagging capacity supports larger networks with numerous switches, enabling more outstanding reads/writes to hide latency effectively.
  • Multiple outstanding writes can share the same tag; this allows efficient tracking when writing large amounts of data followed by flag writes.

Conclusion on Efficiency Gains

  • The system's design allows for improved performance as it handles multiple devices efficiently without needing specific order tracking for completions.

Understanding Data Contracts and Flags in Device Communication

Overview of Data Handling in Devices

  • The concept of data contracts and flags is crucial for device communication, where devices can perform correctly by adhering to established rules. Relaxed ordering attributes can be applied to all data writes, while flag writes must remain non-relaxed to ensure effective push behavior.

Challenges with Processor Models

  • In scenarios involving reads and writes between devices (A to B), the visibility of software operations becomes less clear. Many processors may lack appropriate fence instructions, complicating the synchronization of write completions.

Update Granularity in PCI Express

  • PCI Express operates on an update granularity of one byte, which allows a reader to potentially see a mix of old and new data during writes. This contrasts with modern memory systems that are cache line-oriented, necessitating a guarantee that either all old or all new data is visible.

Semaphore Writes and Cache Line Behavior

  • Writes to naturally aligned blocks (e.g., 64 bytes) are guaranteed not to present readers with mixed old/new data states. However, if a partial write occurs outside this alignment, the guarantee does not apply (see the sketch below).
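The guarantee can be stated as a predicate (a sketch of the rule as described above, not spec text): a given aligned 64-byte block is seen all-old-or-all-new only if the write fully covers it, and each covered block is independent of the others.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Does the write [addr, addr+len) fully cover the aligned 64-byte
   block containing block_addr?  Only fully covered blocks get the
   all-old-or-all-new visibility guarantee. */
static bool block_atomic(uint64_t addr, uint64_t len, uint64_t block_addr) {
    uint64_t blk = block_addr & ~(uint64_t)63;      /* block start */
    return addr <= blk && blk + 64 <= addr + len;
}

int main(void) {
    /* A 1 KiB aligned write: every contained block is all-old-or-all-new,
       though blocks may become visible in any order (next section). */
    printf("%d\n", block_atomic(0x1000, 1024, 0x1040)); /* 1: covered  */
    /* A misaligned 64-byte write covers no whole block: no guarantee. */
    printf("%d\n", block_atomic(0x1020, 64, 0x1040));   /* 0: partial  */
    return 0;
}
```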

Independent Write Operations

  • Each 64-byte block in UIO transactions operates independently, without enforced ordering across cache lines. This means changes within larger write operations can be observed out of order by readers.

Enhancements in Memory Systems: Interleaving and Bandwidth

Interleave Mechanisms in CXL

  • CXL supports a 256-byte interleave mechanism for Type 3 memory, allowing subsequent blocks to be handled by different completers. This design aims to optimize bandwidth usage while maintaining performance standards.

Adjusting Transaction Rules for Efficiency

  • While PCI Express requests traditionally must not cross 4 KB boundaries, UIO transactions are restricted from crossing 256-byte boundaries, matching the interleave granularity for efficiency gains (see the sketch below).
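A sketch (illustrative helper) of splitting a request so no sub-request crosses a 256-byte boundary, mirroring the rule above:

```c
#include <stdint.h>
#include <stdio.h>

#define UIO_BOUNDARY 256u  /* UIO requests must not cross 256B boundaries */

/* Emit one sub-request per boundary-bounded chunk of [addr, addr+len). */
static void issue_split(uint64_t addr, uint64_t len) {
    while (len > 0) {
        uint64_t room  = UIO_BOUNDARY - (addr % UIO_BOUNDARY);
        uint64_t chunk = len < room ? len : room;
        printf("sub-request: addr=%#llx len=%llu\n",
               (unsigned long long)addr, (unsigned long long)chunk);
        addr += chunk;
        len  -= chunk;
    }
}

int main(void) {
    /* 600 bytes starting 64 bytes before a boundary ->
       chunks of 64, 256, 256, and 24 bytes. */
    issue_split(0x10C0, 600);
    return 0;
}
```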

Future Directions: Non-tree Topologies and Fault Tolerance

Support for Multi-path Architectures

  • UIO facilitates future non-tree topologies such as multi-path configurations and link aggregation. These developments aim at increasing bandwidth while addressing challenges related to chip area utilization as technology advances.

Importance of Power Management

  • Effective power management strategies are essential as they allow individual links within a system to enter low-power states without affecting overall system performance—critical as link complexity increases.

Complexities in Switch Ordering Models

Understanding PCI Ordering Model Limitations

  • The internal fabric of switches must adhere closely to conventional PCI models despite their inherent complexities like crossbars that allow parallelism. Maintaining consistent traffic ordering through shared buses presents significant challenges.

Traffic Management Across Multi-socket Configurations

  • Managing traffic across multiple chip-to-chip links within multi-socket root complexes complicates efforts to maintain the illusion of a single shared fabric due to varying paths taken by packets across different links.

Understanding UIO and Its Implications

Key Considerations in Building Systems with UIO

  • With Stream IDE (Integrity and Data Encryption), key state mutates with each packet, necessitating careful ordering for specific paths while allowing flexibility for others.
  • The entire path from source to destination must operate in Flit mode to utilize UIO effectively; software must recognize this requirement.
  • Current architecture does not dictate when devices should use UIO versus non-UIO components, leading to complexity in device communication.
  • Future developments aim to clarify how devices determine which UIO channel to use, acknowledging that the answer varies based on context.
  • Acknowledgment of latency introduced by waiting for write completions in UIO compared to immediate launches possible in non-UIO channels.

Latency and Performance Trade-offs

  • Devices must manage the latency of required completion waits; this latency is explicit in UIO, whereas it is hidden inside the fabric in non-UIO systems.
  • While moving latency around may not worsen performance, it requires devices to be more attentive, presenting a trade-off between parallelism and complexity.
  • Current limitations include the lack of support for atomic operations within UIO; future updates are expected to address this gap.

Ordering Mechanisms and Challenges

  • Keys used for packets differ based on their order of transmission; incorrect sequencing can lead to misinterpretation of data received.
  • Maintaining consistent ordering within a single path is crucial; packets must arrive sequentially despite potential reordering across different streams or completions.

Future Directions and Software Requirements

  • The producer-consumer model serves as an essential contract between source and destination, requiring adaptation as system architectures evolve towards better performance with UIO technology.
  • Enabling software support for UIO is necessary; activation isn't automatic but requires deliberate configuration by developers.

Multi-path and Multi-link Developments

  • Discussions are ongoing regarding the implementation timeline for multi-path/multi-link capabilities, anticipated around version 7.1 or 7.2 depending on complexity considerations.

Audience Engagement

Decoupling Ordering Performance in UIO Devices

Optimizing UIO Read and Write Completions

  • Discussion on the potential to decouple ordering for performance, suggesting that interleaving long and short transactions is more efficient than managing bursts of short ones.
  • Acknowledgment that distinct TLP (Transaction Layer Packet) types for read and write completions in UIO devices present an optimization opportunity, with no apparent risks to system integrity.

Interoperability Between UIO and Non-UIO Devices

  • Explanation of how UIO devices can operate using a mix of non-UIO traffic, emphasizing that they are not limited to just UIO protocols.
  • Clarification that traditional ordering models remain intact in VC0 (Virtual Channel 0), ensuring compatibility even when integrating older systems with newer technologies.

Flow Control Models in Flit Mode

  • Introduction of shared flow control in Flit mode, which allows for better bandwidth utilization without needing excessive buffering compared to non-Flit modes.

Scaling Bandwidth at System Level

  • Overview of scaling strategies by doubling data paths; example given of internal fabrics maintaining frequency across generations while increasing throughput.
  • Detailed explanation on the necessity for wider data paths as bandwidth increases, illustrating the relationship between PCI Express link width and operational frequency.
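A worked example of that relationship (illustrative numbers, not figures from the webinar): at a fixed internal fabric clock, the required datapath width is simply bandwidth divided by frequency, so each doubling of link bandwidth doubles the width.

```c
#include <stdio.h>

int main(void) {
    double fabric_ghz = 2.0;                 /* internal fabric clock      */
    double link_gbps[] = {256, 512, 1024};   /* e.g. successive x16 links  */

    for (int i = 0; i < 3; i++) {
        double gbytes_per_s = link_gbps[i] / 8.0;        /* bits -> bytes  */
        double width_bytes  = gbytes_per_s / fabric_ghz; /* width = BW / f */
        printf("%4.0f Gb/s link -> %3.0f GB/s -> %2.0f-byte datapath @ %.1f GHz\n",
               link_gbps[i], gbytes_per_s, width_bytes, fabric_ghz);
    }
    return 0;
}
```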

Addressing Bandwidth Efficiency Challenges

  • Discussion on the inefficiencies associated with round trips in memory transactions, highlighting wasted bandwidth and power consumption due to suboptimal provisioning.

Video description

As the role of PCI Express® (PCIe®) technology expands in new areas such as AI/ML, PCI-SIG® has recognized the need to support multi-path fabrics with an ordering model to improve PCIe bandwidth and reduce latency while maintaining backwards compatibility. Unordered IO (UIO) is a new feature added to the PCIe 6.1 specification that defines a new wire semantic protocol and related capabilities for addressing the limitations of the existing PCIe fabric-enforced ordering rules, while keeping the producer-consumer ordering model intact. UIO is a key enabler for Multi-Link PCIe devices (i.e., non-tree topologies). UIO also helps improve the interaction with other protocols, including on-die/on-package protocols, such as CXL, UCIe, etc. Attendees of this webinar will understand the traditional PCIe ordering model, the system-level considerations and tradeoffs for deploying PCIe Unordered IO, and will learn which applications and market segments will benefit most from deploying UIO. Presenters: Debendra Das Sharma (Intel) and Steve Glaser (NVIDIA)