The PCIe® 6.0 Specification Webinar Q&A: Error Detection and Correction with FEC
By Debendra Das Sharma, PCI-SIG® Board Member
The PCI Express® (PCIe®) 6.0 specification will feature two primary mechanisms to handle errors: Forward Error Correction (FEC) and Cyclic Redundancy Check (CRC). Each 256-byte FLIT comprises 242 bytes of payload protected by 8 bytes of CRC. The resulting 250 bytes of payload and CRC are in turn protected by 6 bytes of FEC. FEC works by sending redundant data that the Receiver can use to correct some errors, while CRC is an error detection code. A Receiver first uses the FEC to correct errors in a FLIT and then applies the CRC check to the 250 bytes that the CRC protects. If a FLIT fails the CRC check, it is eventually corrected through the Link Layer Retry mechanism of PCIe. PCIe 6.0 technology achieves low latency by combining a relatively low First Bit Error Rate (FBER) of 10^-6 with a lightweight, low-latency FEC that performs the initial correction. This blog provides detailed answers to questions about FEC that were asked during the PCIe 6.0 Specification webinar.
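The byte budget described above can be sketched in a few lines. This is only an illustration: the field sizes come from the text, but the back-to-back field order (payload, then CRC, then FEC) and the helper name `split_flit` are assumptions made for the example.

```python
# Field sizes in bytes, taken from the text above.
FLIT_BYTES = 256
PAYLOAD_BYTES = 242
CRC_BYTES = 8
FEC_BYTES = 6


def split_flit(flit: bytes) -> dict:
    """Split a 256-byte FLIT into its fields.

    Assumption for illustration: fields sit back to back as payload, CRC, FEC.
    """
    if len(flit) != FLIT_BYTES:
        raise ValueError(f"expected {FLIT_BYTES} bytes, got {len(flit)}")
    crc_end = PAYLOAD_BYTES + CRC_BYTES
    return {
        "payload": flit[:PAYLOAD_BYTES],
        "crc": flit[PAYLOAD_BYTES:crc_end],
        "fec": flit[crc_end:],
        # The CRC check covers payload + CRC (250 bytes), after FEC correction.
        "crc_protected": flit[:crc_end],
    }


fields = split_flit(bytes(range(256)))
print({name: len(value) for name, value in fields.items()})
# {'payload': 242, 'crc': 8, 'fec': 6, 'crc_protected': 250}
```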
- Will a higher Bit Error Rate (BER) such as 10^-4 provide more channel length?
We conducted extensive studies before settling on a First Bit Error Rate (FBER) of 10^-6. As mentioned in the presentation, 10^-6 is a critical number for keeping the combined latency of FEC and Cyclic Redundancy Check (CRC) below 2 ns and for keeping the bandwidth overhead to a less than 2% impact. Another point to note is that the BER would be about an order of magnitude worse than the FBER, due to burst errors within a Lane as well as Lane-to-Lane correlation. If the FBER were relaxed, we would need a networking-style FEC, even with retry, to keep the retry probability below 1E-5. Based on our analysis, we are confident of maintaining the existing channel reach at an FBER of 1E-6. For longer channels, we can deploy Retimers.
Based on our experience over the last two decades, channels always improve over time, because better materials with lower loss characteristics keep becoming available. Once we set a target FBER and deploy the FEC/CRC accordingly, however, that choice does not change over time. The FBER is set for the life of the technology, so we needed to make the right set of trade-offs. A higher FBER might provide an extra inch or two of channel reach. However, that gain was not worth the penalty in area, performance, cost, and power and, above all, the loss of the substantial segment of latency- and power-sensitive usage models. The key metrics have been met, including channel reach, even with today’s materials that are deployed in volume.
- How does a CRC error identify which byte has the error?
The Cyclic Redundancy Check (CRC) evaluation happens after FEC decode and correction. Because the FEC corrects errors, it has to determine the exact location and magnitude of each error in order to perform the correction, which limits its detection ability. The CRC, on the other hand, is deployed only to detect errors, irrespective of where in the FLIT they occurred, so its detection ability is much stronger. Once a FLIT fails the CRC check, it is replayed, and the replay is what corrects the FLIT.
- What is the code gain of the low latency FEC used in the PCIe 6.0 specification?
PCI-SIG® deploys a lightweight FEC for correction. The goal was to pay a latency penalty close to zero and then rely on a very robust CRC for detection, combined with a fast link-level replay to handle any errors that the FEC could not correct. As long as the replay probability of a FLIT is around 10^-6, there is no appreciable performance impact from either the FEC latency or the replay latency in case of an uncorrected error. An FBER of 10^-6 combined with a three-way interleaved, single-symbol-correct FEC gets us to this solution space. Unlike other standards, PCI-SIG does not rely on FEC alone for correction, nor do we view FEC as a means to obtain code gain in the channel. Instead, we leverage a combination of FEC correction and CRC detection, where a detected error results in a replay that effectively corrects it.
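To get a feel for why an error rate near 10^-6 and a single-symbol-correct FEC land in the same solution space, here is a rough back-of-the-envelope model. It is a sketch under loud assumptions, not the analysis behind the specification: bit errors are treated as independent (so burst errors and Lane-to-Lane correlation are ignored), a symbol is taken to be one byte, byte i is assumed to belong to interleave group i mod 3, and every group with two or more symbol errors is assumed to be caught by the CRC and to cause a retry. The function names are made up for the example.

```python
from math import comb

FLIT_BYTES = 256
INTERLEAVE_WAYS = 3


def group_sizes(flit_bytes: int = FLIT_BYTES, ways: int = INTERLEAVE_WAYS) -> list:
    """Bytes per interleaved FEC group, assuming byte i goes to group i % ways."""
    return [len(range(g, flit_bytes, ways)) for g in range(ways)]


def group_failure_prob(n_symbols: int, symbol_error_prob: float) -> float:
    """P(two or more symbol errors among n symbols), i.e. more than the FEC can fix."""
    p = symbol_error_prob
    return sum(
        comb(n_symbols, k) * p**k * (1 - p) ** (n_symbols - k)
        for k in range(2, n_symbols + 1)
    )


def flit_retry_prob(bit_error_rate: float) -> float:
    """Rough probability that at least one FEC group in a FLIT is uncorrectable."""
    # A byte-sized symbol is in error if any of its 8 bits is in error.
    symbol_error_prob = 1 - (1 - bit_error_rate) ** 8
    all_groups_ok = 1.0
    for n in group_sizes():
        all_groups_ok *= 1 - group_failure_prob(n, symbol_error_prob)
    return 1 - all_groups_ok


print(group_sizes())  # [86, 85, 85]
for rate in (1e-6, 1e-4):
    print(f"bit error rate {rate:g}: retry probability ~ {flit_retry_prob(rate):.2e}")
```

Under these simplifications, an error rate of 10^-6 gives a per-FLIT retry probability a little below 10^-6, while 10^-4 pushes it into the 10^-3 range, which is far beyond what a lightweight FEC plus replay can absorb. Real links see correlated errors, so actual figures will differ.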
- Why does FEC force the move to the use of FLITs?
FEC works on a fixed number of symbols. If the code size were dynamically variable, we would need some kind of framing token, with its own independent FEC protection, to indicate how many symbols the next FEC code word covers. This would result in a very inefficient interconnect. Once we decided on a fixed number of symbols protected by FEC, it was easy to move to FLITs, since they are of fixed size. The FLIT is the basic unit of transfer, within which there can be variable-sized transactions, data link payload, etc.
- What frequency was adopted to keep FEC latency within 2ns?
The Link data rate is 64 GT/s. The FEC logic can be run at any frequency. In general, we expect the logic to run at 1 GHz (or 500 MHz or 2 GHz) and easily reach a latency much better than 2 ns. We have run the logic at 1 GHz and could perform the decode and correction in one clock cycle.
- Can the FEC be bypassed if the link is running at lower data rates?
The FEC can be bypassed at lower data rates and still result in a robust, operational Link. As the PCIe 6.0 specification is finalized, PCI-SIG will decide whether it is beneficial to create the complexity associated with a different mode for the lower data rates while in FLIT Mode.
- Given that the effective BER with the FEC is still worse than 10^-12 (ten to the power -12), will that be an issue?
PCI-SIG does not expect an issue, since we have link-level retry that will correct the error. It is true that the probability of retry of a FLIT is about three orders of magnitude worse than in prior generations of PCIe specifications with NRZ signaling. However, as long as the retry probability per FLIT is in the range of 10^-6 and the retry round-trip latency is in the 100 ns range, we do not expect to see any noticeable performance impact. We are operating on the principle that it is better to keep the latency identical to prior generations and take the 100 ns latency hit with a probability of 10^-6 than to add 150+ ns of latency to each and every FLIT.
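The trade-off in that last sentence is easy to check with a line of arithmetic. The sketch below uses the figures quoted in this blog (a 2 ns FEC/CRC budget, a 100 ns retry round trip, a 10^-6 retry probability, and 150 ns for a heavier FEC). The assumption that the heavier FEC never needs a retry is deliberately generous to it, and the function name is invented for the example.

```python
def mean_added_latency_ns(retry_prob: float, retry_latency_ns: float,
                          fec_latency_ns: float) -> float:
    """Average latency added per FLIT: the FEC cost is always paid, a retry only sometimes."""
    return fec_latency_ns + retry_prob * retry_latency_ns


# Lightweight FEC plus occasional replay, using the numbers from the text.
light = mean_added_latency_ns(retry_prob=1e-6, retry_latency_ns=100.0, fec_latency_ns=2.0)

# Heavier FEC: 150 ns on every FLIT, assumed (optimistically) to never retry.
heavy = mean_added_latency_ns(retry_prob=0.0, retry_latency_ns=100.0, fec_latency_ns=150.0)

print(f"lightweight FEC + replay: {light:.4f} ns per FLIT on average")
print(f"heavier FEC, no replay:   {heavy:.4f} ns per FLIT on average")
```

The replay term contributes only 0.0001 ns on average, so nearly all of the average cost is the FEC latency itself, which is why the lightweight option wins.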
- What happens to latency when the FEC cannot successfully correct the error, and how often is this expected to happen?
When the FEC process cannot successfully correct the error, the CRC evaluation will detect it. A negative acknowledgement (NAK) will be issued to the Link Partner, which will then retry the same FLIT from its replay (or retry) buffer. We expect the probability of this event to be in the range of 10^-6 and the retry round-trip latency to be in the 100 ns range.
When the FLIT is correctly received, either the first time or after one or more retries, the Port sends an Ack to its Link Partner, which then retires the FLIT from its replay buffer.
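The Ack/NAK handling on the transmitting side can be pictured with a toy replay buffer. This is a minimal sketch rather than the specification's mechanism: the class, its method names, and the per-FLIT sequence numbers are hypothetical, and the real protocol's sequence numbering and replay rules are not covered in this blog.

```python
class ReplayBuffer:
    """Toy transmitter-side replay buffer keyed by a hypothetical sequence number."""

    def __init__(self) -> None:
        self._pending = {}  # seq -> copy of the FLIT awaiting acknowledgement

    def send(self, seq: int, flit: bytes) -> bytes:
        """Keep a copy of every transmitted FLIT until it is acknowledged."""
        self._pending[seq] = flit
        return flit

    def on_ack(self, seq: int) -> None:
        """Receiver passed the CRC check: the stored copy can be dropped."""
        self._pending.pop(seq, None)

    def on_nak(self, seq: int) -> bytes:
        """Receiver failed the CRC check: resend the stored copy and keep it pending."""
        return self._pending[seq]

    def pending(self) -> list:
        """Sequence numbers still waiting for an Ack, in transmit order."""
        return list(self._pending)


buf = ReplayBuffer()
buf.send(0, b"flit-0")
buf.send(1, b"flit-1")
buf.on_ack(0)            # FLIT 0 arrived intact and is removed
resent = buf.on_nak(1)   # FLIT 1 failed the CRC check and is sent again
still_waiting = buf.pending()
buf.on_ack(1)            # the retried FLIT 1 arrived intact
print(resent, still_waiting, buf.pending())
```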
Dive Deeper Into the PCIe 6.0 Specification
The recording of the PCIe 6.0 Specification webinar is available to watch anytime on the PCI-SIG YouTube channel. Also, this series of Q&A blogs will continue to provide answers to the questions asked by attendees during the live presentation. Follow PCI-SIG on Twitter and LinkedIn for updates about these blogs.