Pushing to the Limits: Understanding Lane Margining for PCIe®
By Dr. Debendra Das Sharma, Intel Fellow, PCI-SIG® Board Member
PCI-SIG has built its reputation on delivering high-quality PCI Express® (PCIe) specifications that have doubled bandwidth on average every three years, while maintaining full backwards compatibility with prior generations. This is no easy task, and as an organization, we continue to innovate to meet the performance requirements of our members and the industry within power, cost, and high-volume manufacturing constraints.
When we set out to design the PCIe 4.0 specification – which doubled bandwidth from 8 GT/s to 16 GT/s per Lane while maintaining backwards compatibility – we realized that system designers would need to know how much signaling margin was actually available in their design in order to squeeze out full 16 GT/s performance while taking into account channel loss limits. Robust high-speed signaling simulations would still be required to ensure proper designs, but we felt that a test that could be run in the actual physical system would provide confidence in the reliability of the system.
Margining Challenges
A number of different factors contributed to the determination that we needed to address actual in-system margin in order to meet our goal of delivering a reliable 16 GT/s solution to the industry.
- PCIe is used in a wide variety of devices and platforms, and it is manufactured in high volumes and with a large variation in process, voltage, and temperature, even in the same platform. This mass manufacturing and device diversity can result in a lot of variance when it comes to actual signal performance.
- Some systems may need to implement retimers to extend PCIe connections beyond the reach possible with the base channel loss budget. These devices pose additional challenges for signal integrity, which required us to address the controllability and observability of the retimers as part of understanding the complete margin “picture”.
- Electrical “noise” and other effects of fully-running systems can have a negative impact on PCIe signal quality, so we needed to be able to determine the margin with real silicon in actual production systems running real-world traffic in order to ascertain link health at 16 GT/s. Further, we needed to be able to do that without impacting the live traffic going through the PCIe Link.
- While many vendors have developed proprietary ways of determining and reporting signal quality and/or signal margin, there was no standardized way for a designer to take an arbitrary PCIe component and measure how much margin is available in real production systems.
Implementation of Lane Margining
To overcome the challenges outlined above, PCI-SIG added a feature formally called “Lane Margining at the Receiver” (commonly referred to as simply “Lane Margining”) to the PCIe 4.0 specification. Lane Margining enables system designers to measure the available margin in a standardized manner. It is now a mandatory feature, and it is implemented at the Receiver because that is the moment of truth: what matters is how much margin is available at the Receiver. Lane Margining allows the system to determine how close to the “edge” (of functionality) each Lane is capable of operating under real conditions. Conceptually, imagine that to calculate margin, the Receiver moves its sampling point around within the signal eye to determine its width (and possibly height). See the image below for an example where the actual signal eye is asymmetric, with 10 sampling “steps” available to the right before failure and 20 sampling “steps” to the left before failure. This data allows the designer to evaluate how much margin remains before the design begins experiencing errors, thereby allowing delivery of a more robust system while better meeting time-to-market goals.
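The stepping idea can be illustrated with a small toy model. This is only a conceptual sketch, not the register-level procedure defined in the specification: the names `steps_before_failure` and `make_toy_eye` are hypothetical, and the eye is assumed to be a simple pass/fail window that matches the asymmetric example above (20 steps to the left, 10 to the right).

```python
from typing import Callable


def steps_before_failure(sample_ok: Callable[[int], bool],
                         direction: int,
                         max_steps: int) -> int:
    """Count how many steps the sampling point can move in one direction
    (+1 = right, -1 = left) before sample_ok reports a failure."""
    steps = 0
    while steps < max_steps and sample_ok(direction * (steps + 1)):
        steps += 1
    return steps


def make_toy_eye(left_limit: int, right_limit: int) -> Callable[[int], bool]:
    """Hypothetical eye model: any offset within
    [-left_limit, +right_limit] samples without error."""
    def sample_ok(offset: int) -> bool:
        return -left_limit <= offset <= right_limit
    return sample_ok


# Assumed values taken from the asymmetric-eye example in the text.
eye = make_toy_eye(left_limit=20, right_limit=10)
right_margin = steps_before_failure(eye, +1, max_steps=32)
left_margin = steps_before_failure(eye, -1, max_steps=32)
print(f"left margin: {left_margin} steps, right margin: {right_margin} steps")
```

In a real Receiver the pass/fail decision would come from observed errors at the offset sampling point rather than from a fixed window, so the measured step counts would vary with operating conditions.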
Our solution delivers spec-defined, non-destructive margining during fully-on “L0 state” operation on real platforms. With Lane Margining implemented at the Receiver, a multitude of new usage models are enabled. A system might periodically run Lane Margining and track changes over time to proactively warn of impending failure. System manufacturers can run Lane Margining as part of their production monitoring to ensure material batch-to-batch variation doesn’t negatively impact PCIe functionality.
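As a rough illustration of the periodic-monitoring usage model, the sketch below compares the latest margin reading against an early baseline and flags a significant drop. Everything here is an assumption made for illustration: the function name `margin_warning`, the use of step counts as the unit, the three-sample baseline, and the 25% threshold are not taken from the specification.

```python
from statistics import mean
from typing import Sequence


def margin_warning(history: Sequence[int],
                   baseline_samples: int = 3,
                   max_drop_fraction: float = 0.25) -> bool:
    """Return True if the latest margin reading has fallen more than
    max_drop_fraction below the baseline, where the baseline is the
    mean of the first baseline_samples readings."""
    if len(history) <= baseline_samples:
        return False  # not enough data beyond the baseline yet
    baseline = mean(history[:baseline_samples])
    return history[-1] < baseline * (1.0 - max_drop_fraction)


# Hypothetical margin readings (in steps) collected over time.
stable = [10, 10, 11, 10, 10]
degrading = [10, 10, 11, 9, 8, 7]

print(margin_warning(stable))
print(margin_warning(degrading))
```

A production monitor would likely track each Lane and each margining direction separately and choose thresholds from characterization data for the specific platform.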
To recap, some key features of Lane Margining include:
- Works in a production platform without any test equipment
- Supports architected and standardized registers to control and report margin in a standard way across all components
- Enables retimers to be accessed without special vendor-specific tools
- Delivers flexibility for a variety of different implementations and usages
As we put the finishing touches on the PCIe 5.0 specification – once again doubling bandwidth from 16 GT/s to 32 GT/s – Lane Margining remains a critical component of PCIe’s successful adoption by the industry.