Reliability Analysis of Multibit Error Correcting Coding and Comparison to Hamming Product Code for On-Chip Interconnect

Error control schemes have become a necessity in network-on-chip (NoC) design to improve reliability, as on-chip interconnect errors increase with the continuous shrinking of geometry. Accordingly, many researchers have proposed multi-bit error correction coding schemes that aim to provide high error correction capability with the simplest possible design, so as to minimize area and power consumption. A recent work, Multi-bit Error Correcting Coding with Reduced Link Bandwidth (MECCRLB), showed a large reduction in area and power consumption compared to a well-known scheme, namely Hamming product code (HPC) with Type-II HARQ. Moreover, its authors showed that the proposed scheme can correct 11 random errors, which is a high number of errors to be corrected by any scheme used in NoC. This high correction capability with a moderate number of check bits, alongside the reduction in power and area, calls for further investigation of the accuracy of the underlying reliability model. In this paper, reliability analysis is performed by modeling the residual error probability, Presidual, which represents the probability of decoder error or failure. A new model to estimate Presidual of MECCRLB is derived, validated against simulation, and compared to HPC to assess the capability of MECCRLB. The results show that HPC outperforms MECCRLB from a reliability perspective. The former corrects all single and double errors and fails in 5.18% of triple-error cases, whereas the latter corrects all single errors but fails in 32.5% of double-error and 38.97% of triple-error cases.


INTRODUCTION
On-chip communication between the many components integrated on a single chip faces serious reliability issues along with stringent area and power constraints. On-chip interconnect errors caused by effects including supply voltage fluctuation, crosstalk, process variation, radiation, and electromagnetic interference become increasingly problematic in very deep submicron (VDSM) technology (Sridhara and Shanbhag, 2005). Reliability can be improved by applying error control techniques, such as automatic repeat request (ARQ), forward error correction (FEC), and hybrid ARQ (HARQ), to on-chip interconnects. Single-error correcting (SEC) codes (e.g., Hamming) have been widely used in previous works to address transient errors.
With the increasing probability of multiple random and burst errors in VDSM technology, more powerful and efficient error control schemes became necessary (Bertozzi and Benini, 2005). Accordingly, many error control schemes for multi-bit error correction appeared to raise the error correction bound. In the Combined Crosstalk Avoidance Code with Error Control Code (Joint LPC-CAC-ECC), simple parity calculation together with message triplication makes the correction of two random errors, and some patterns of three, possible (Kummary and Dananjayan, 2019). In Joint crosstalk avoidance and Triple Error Correction (JTEC), Hamming coding with message duplication is used to correct three errors; further optimization of this scheme yielded Triple Error Correction and Quadruple Error Detection (JTEC-SQED) (Ganguly and Pande, 2009). Another optimization, Joint Crosstalk Aware Multiple Error Correction (JMEC), changes the interleaving distance between adjacent bits and thereby makes the correction of nine adjacent errors possible (Gul and Chouikha, 2017). Duplication with two-dimensional parities was also proposed, providing up to seven-error detection (Flayyih and Samsudin, 2014) or six-error detection with single-error correction (Flayyih and Samsudin, 2020). In Multi Bit Random and Burst Error Correction (MBRBEC), extended Hamming coding with triplication raises the correction capability to five errors (Maheswari and Seetharaman, 2013), while in Quintuplicated Manchester Error Correction (QMEC) the correction of nine errors is possible (Narayanasmy and Muthurathinam, 2018). All these duplication-, triplication-, and quintuplication-based coding schemes provide strong error control at the cost of a large link size. Another group of coding schemes provides multi-bit error control without high link size overhead.
Both multiple random and burst errors are corrected by the Hamming product code (HPC) scheme of (Fu, 2009), where an extended Hamming product code combined with Type-II HARQ is used to correct up to five errors. Type-II HARQ was adopted to avoid high link size overhead by sending the message in two steps. A recent work (Vinodhini and Murty, 2018) proposed the multi-bit error correcting coding with reduced link bandwidth (MECCRLB) code. Its authors reduce area and power consumption by using a different product code arrangement, applying Hamming coding on rows and simple parity on columns. They report a large reduction in area and power consumption compared to (Fu, 2009), and show that their method can correct burst errors of four bits or random errors of eleven bits, which results in a lower residual error rate and hence higher reliability. All these techniques correct multi-bit errors with different correction capabilities. Besides correction capability, two important metrics are considered when evaluating coding techniques, namely circuit design complexity and the link size required to send the codeword; these metrics reflect the area and power consumption that are critical in VLSI designs. The reliability provided by any coding scheme is evaluated by the residual error probability, a measure of the probability that the scheme fails. To ease comparison between coding schemes, mathematical models are derived to estimate the residual error probability. These models are essential both in evaluating reliability and in finding the link voltage swing required to achieve a certain reliability level; thus, their accuracy is reflected in the reliability level and in the selected voltage swing.
The high error correction capability of MECCRLB, with its low number of parity bits and reduced circuit complexity, motivates further analysis of this scheme to verify its claimed reliability advantage over HPC. This is supported by the fact that correction capability is upper-bounded by the Hamming distance arising from the data redundancy. The main contribution of this paper is the derivation of an accurate residual error probability model to estimate the MECCRLB reliability and compare it to HPC (Fu, 2009), supported by simulation results. The remainder of the paper is arranged as follows: the literature review of related works is presented in Section 2. In Section 3, reliability analysis is performed for the considered schemes, while in Section 4 the results and discussion are presented. Finally, Section 5 concludes the paper.

LITERATURE REVIEW
To analyze the reliability of the two error coding schemes, the Hamming product code with Type-II HARQ (Fu, 2009) and MECCRLB (Vinodhini and Murty, 2018), the message arrangements are first introduced along with the encoding and decoding in each scheme.

Extended Hamming Product Code with Type-II HARQ
The k-bit input message is arranged into a (k1 × k2) matrix as shown in Fig. 1. The number of rows, k2, is always chosen to be four to reduce the link size, according to the study in (Fu, 2009). Row parity check bits are obtained by encoding the k1 bits in each row using the extended Hamming row encoder EH(n1, k1), where n1 is the row encoded word size. Column parity check bits are obtained by encoding the k2 bits in each column using the column encoder EH(n2, k2), where n2 is the column encoded word size. Checks-on-checks are generated by encoding the column parity check bits using the row encoder. In (Fu, 2009), the authors used the extended Hamming product code with Type-II HARQ to reduce the number of interconnection links. The encoder first encodes the k2 rows using the extended Hamming code and sends the result to the decoder, which comprises separate row decoders that correct single errors and detect double errors in each row. If the errors that occur are within the correction capability of the row decoders, there is no need to send the column parities and checks-on-checks. If the errors are detectable but uncorrectable, the decoder requests the extra column parity check bits along with the checks-on-checks bits, where each column decoder can also correct one error and detect two; the second transmission thus increases the correction capability. A 32-bit message is arranged as an (8 × 4) matrix. Each row is encoded with EH(13, 8), resulting in a 52-bit first transmission. Each column is encoded with EH(8, 4), resulting in 32 parity bits that are further encoded with the row encoder EH(13, 8) to generate the checks-on-checks, leading to a 52-bit message that is saved in a buffer and sent in case of NACK.
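As a check on the arithmetic above, the two transmission payload sizes follow directly from the code parameters. The sketch below recomputes them; the function name and interface are ours, not from (Fu, 2009).

```python
def hpc_payload_sizes(k1, k2, n1, n2):
    """Payload sizes for an extended Hamming product code with Type-II HARQ.

    First transmission: k2 rows, each encoded with EH(n1, k1).
    Second transmission: each column contributes n2 - k2 parity bits, giving
    (n2 - k2) rows of k1 bits that are row-encoded with EH(n1, k1) to form
    the checks-on-checks.
    """
    first = k2 * n1
    second = (n2 - k2) * n1
    return first, second

# 32-bit message arranged as (8 x 4): EH(13, 8) rows, EH(8, 4) columns
first, second = hpc_payload_sizes(k1=8, k2=4, n1=13, n2=8)
print(first, second)  # 52 52
```

Both transmissions carry 52 bits, matching the figures in the text.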

MECCRLB Code
In this coding scheme, the k-bit input message is arranged into k2 = 3 rows and k1 = k/3 columns. Extended Hamming coding is applied to the three rows only, while simple parity is applied to the columns. Fig. 2 shows a 32-bit message where groups G1 to G3 are fed to the extended Hamming encoders while simple parity checks are calculated for vector 1 to vector 10. For a 32-bit message, G1 and G2 are encoded using EH(16, 11) while G3 is encoded using EH(15, 10); thus, the size of the encoded message is 57 bits. MECCRLB was proposed to reduce the circuit complexity of HPC with Type-II HARQ and to send the data in one transmission by reducing the number of parities. Unlike HPC, the encoder sends the whole encoded message. The decoder corrects the errors within its correction capability according to the decoding algorithm shown in Fig. 3.
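The 57-bit encoded size can be reproduced from the three row codewords plus the ten column parities; the helper below is our sketch (the exact bit ordering and column grouping in Vinodhini and Murty, 2018 may differ).

```python
def mecc_codeword_size(row_codes=((16, 11), (16, 11), (15, 10)),
                       n_col_parities=10):
    """Encoded message size: row codeword lengths plus simple column parities."""
    return sum(n for n, _k in row_codes) + n_col_parities

def column_parities(rows, n_cols=10):
    """Even parity of each column over the three data rows (vectors 1 to 10)."""
    return [sum(r[i] for r in rows if i < len(r)) % 2 for i in range(n_cols)]

print(mecc_codeword_size())  # 57
```

The 16 + 16 + 15 row bits plus the 10 column parities give the 57-bit codeword used throughout the analysis.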

RELIABILITY ANALYSIS
Reliability can be measured by calculating the residual error probability, Presidual, which represents the probability of decoder error or failure; it is the complement of the probability of proper decoding, Pproper, which is the sum of the probabilities of correcting random errors and burst errors. The relationship between the two is given in (Fu and Ampadu, 2008) as:

Presidual = 1 - Pproper    (1)

In the following subsections, (Fu, 2009) and (Vinodhini and Murty, 2018) are analyzed for random errors only, since both coding schemes can correct an acceptable number of burst errors, namely six in (Fu, 2009) and four in (Vinodhini and Murty, 2018).
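The random-error models in the following subsections are all built from the probability of exactly i bit errors in an n-bit word at bit error rate ε; a minimal helper for that building block and for the complement relation (the function names are ours):

```python
from math import comb

def p_exact_errors(n, i, eps):
    """Probability of exactly i errors among n bits, each flipped
    independently with bit error rate eps."""
    return comb(n, i) * eps**i * (1 - eps)**(n - i)

def p_residual_from_proper(p_proper):
    """Residual error probability as the complement of proper decoding."""
    return 1.0 - p_proper
```

Summing `p_exact_errors` over all i from 0 to n yields 1, which is a useful sanity check when assembling the larger models.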

A-Extended Hamming Product Code with Type-II HARQ
Presidual depends on both the error detection capability in the first transmission and the error correction capability after retransmission. It is estimated as given in (Fu, 2009) as:

Presidual = Pud + Pr(n, ε)    (2)

where Pud is the undetectable error probability in the first transmission and Pr(n, ε) is the probability of error after retransmission and three-stage decoding is over. Pud is given in (Fu, 2009) as:

Pud = 1 - P0 - Pc - Pd - Pr(n, ε)    (3)

where P0 = (1 - ε)^n is the probability of no error, and Pc and Pd are the probability of correctable error patterns and the probability of detectable but uncorrectable error patterns in the first transmission, respectively. Pr(n, ε) can be expressed as the probability that the errors exceed the five-error bound of the three-stage decoder:

Pr(n, ε) = Σ_{i=6}^{n} C(n, i) ε^i (1 - ε)^(n-i)    (4)

By inserting (4) and (3) in (2) we get:

Presidual = 1 - P0 - Pc - Pd    (5)

Since any error pattern with at most one error in each row can be corrected in the first transmission, Pc for random errors is given in (6):

Pc = [(1 - ε)^n1 + n1 ε (1 - ε)^(n1-1)]^k2 - (1 - ε)^n    (6)

where k2 is the number of rows, n1 is the row encoded word size, and n = k2 n1. After retransmission, the scheme can correct five random errors, so Pd for random errors is defined in (7) as given in (Fu, 2009):

Pd = Σ_{i=2}^{3} C(n, i) ε^i (1 - ε)^(n-i) + C(n, 4) ε^4 (1 - ε)^(n-4) + C(n, 5) ε^5 (1 - ε)^(n-5)    (7)

The first term is the error detection probability when two or three random errors occur in the first transmission; the second and third terms are the error detection probabilities of four and five random errors.
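As a numerical sanity check, a simplified view of the HPC scheme treats it as correcting every pattern of up to five random errors in the n = 52-bit row-encoded word, so the residual is the tail probability of six or more errors. This is our simplification, not the exact model of (Fu, 2009):

```python
from math import comb

def p_exact_errors(n, i, eps):
    """Probability of exactly i errors among n bits at bit error rate eps."""
    return comb(n, i) * eps**i * (1 - eps)**(n - i)

def hpc_residual_estimate(eps, n=52, t=5):
    """Tail probability of more than t random errors in an n-bit codeword."""
    return 1.0 - sum(p_exact_errors(n, i, eps) for i in range(t + 1))
```

The estimate is zero at ε = 0 and grows monotonically with the bit error rate, matching the qualitative behavior expected of the full model.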

B-MECCRLB
The multi-bit error correction scheme MECCRLB proposed in (Vinodhini and Murty, 2018) is an FEC-based coding scheme with no retransmission; as a result, Presidual depends only on the correction capability of the code.
(Vinodhini and Murty, 2018) indicated that MECCRLB can correct up to 11 random errors in a 32-bit message, so Pc for random errors was expressed as:

Pc = Σ_{i=1}^{11} C(57, i) ε^i (1 - ε)^(57-i)

However, this equation is not accurate, as it is not possible to correct all patterns of 11 random errors with the MECCRLB decoding. The minimum Hamming distance dmin is the minimum number of bit positions in which two different valid codewords differ. Based on this distance, the maximum correction capability of any coding scheme is ⌊(dmin - 1)/2⌋. For a product code, dmin is the product of the component code distances; for MECCRLB, dmin = (dmin of extended Hamming) × (dmin of simple parity) = 4 × 2 = 8, so the maximum correction capability is ⌊(8 - 1)/2⌋ = 3 errors. This is the theoretical limit of the scheme assuming the simple parities are calculated for all columns, which is not the case in MECCRLB. Fig. 4 shows cases where MECCRLB fails to correct two, three, and four errors. Instead, a new error correction model is derived in this work, given in equation (12), which expresses the correction capability for up to four random errors:

Pc = C(57, 1) ε (1 - ε)^56 + C(47, 2) ε^2 (1 - ε)^55 + C(47, 3) ε^3 (1 - ε)^54 + N4 ε^4 (1 - ε)^53    (12)

The first term represents single-error correction anywhere in the message, shown in Fig. 4(a). The second term expresses double-error correction, excluding the patterns in which either error falls in the 10 column-parity bits, shown in Fig. 4(b). The third term expresses the correction of three errors, excluding the patterns in which one or two of the errors fall in the 10 column-parity bits, shown in Fig. 4(c). The fourth term expresses the cases in which MECCRLB succeeds in correcting four errors, where N4 counts the patterns of three cases: all four errors in one row; three errors in one row and the fourth in any other message bit; and two errors in one row and the other two elsewhere in the message, as shown in Fig. 4(d).
There are a few additional cases in which MECCRLB succeeds in correcting four errors, but their probability is negligible, and including them would make the equation more complex.
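The derived model and the original eleven-error claim can be compared numerically. The sketch below implements the four correction terms described above under our reading of Fig. 4: rows of 16, 16, and 15 codeword bits, 10 column-parity bits, and N4 counting only the three four-error cases (with the doubly counted two-plus-two pattern removed). The names and the exact counting are our assumptions, not code from (Vinodhini and Murty, 2018).

```python
from itertools import combinations
from math import comb

N, DATA = 57, 47        # codeword bits; bits outside the 10 column parities
ROWS = (16, 16, 15)     # EH(16,11), EH(16,11), EH(15,10) row codewords

def _term(count, i, eps):
    """Probability mass of `count` specific i-error patterns in the N-bit word."""
    return count * eps**i * (1 - eps)**(N - i)

def pc_claimed(eps):
    """Claimed capability: every pattern of up to 11 random errors corrected."""
    return sum(_term(comb(N, i), i, eps) for i in range(1, 12))

def pc_derived(eps):
    """Derived model: up to four random errors, restricted cases (eq. (12))."""
    n4 = sum(comb(r, 4) + comb(r, 3) * (N - r) + comb(r, 2) * comb(N - r, 2)
             for r in ROWS)
    # remove patterns with two errors in each of two rows, counted twice above
    n4 -= sum(comb(a, 2) * comb(b, 2) for a, b in combinations(ROWS, 2))
    return (_term(comb(N, 1), 1, eps) + _term(comb(DATA, 2), 2, eps)
            + _term(comb(DATA, 3), 3, eps) + _term(n4, 4, eps))
```

At any bit error rate the derived correctable mass is strictly below the claimed one, which is what drives the higher residual error probability found for MECCRLB.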

RESULTS AND DISCUSSION
A testing program was written in C++ to test the correction capability of the two schemes. Table I shows the simulation results for both schemes, where 10^5 samples of 32-bit messages are fed to the two error correction schemes and errors are injected. The failure percentage is found by:

Failure percentage = (number of failed samples / total number of samples) × 100

First, all samples are injected with one error randomly located in the message; both schemes correct all combinations of one-bit errors. Second, all samples are injected with two-bit errors randomly located in the message; HPC corrects all cases while MECCRLB fails in 32.5% of them. For three errors, HPC fails in 5.18% of the cases, which is lower than the MECCRLB failure of 38.97%. Similarly, HPC shows a lower failure percentage at four errors compared with MECCRLB. We note the large divergence between the HPC and MECCRLB correction capabilities. To validate the derived model, Fig. 5 shows the estimated and simulated Presidual for a 32-bit message size at different bit error rate values, with k2 equal to four for HPC and three for MECCRLB. The two techniques were simulated using the C++ program, with random errors injected at different positions and at different bit error rates for 10^8 codeword samples. After applying the codewords to the decoding algorithms of both schemes, the simulation program calculates the percentage of failing cases for each scheme, which represents the simulated residual flit error rate. The results show that the estimated residual flit error rate is close to the simulated one, which validates the derived model. It can also be noticed from the Y-axis that, for the same value of Presidual, HPC sustains a higher bit error rate, meaning it can achieve the same reliability level at a lower voltage swing, which translates into lower power consumption.
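The simulated 32.5% failure rate for double errors can be cross-checked analytically: under the derived model, a double error fails when at least one of the two errors lands in the 10 column-parity bits of the 57-bit codeword. The closed form below is our simplification of the decoder behavior.

```python
from math import comb

def p_double_error_failure(n=57, parity_bits=10):
    """Probability that a uniformly random 2-error pattern touches the
    parity-bit positions (the failing case of Fig. 4(b))."""
    return 1.0 - comb(n - parity_bits, 2) / comb(n, 2)

print(round(p_double_error_failure(), 4))  # 0.3227
```

The value, about 32.3%, closely matches the simulated 32.5%; the small gap suggests a few additional double-error patterns on which the decoder fails.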
The relation between the bit error rate and the voltage swing is usually modeled using the Gaussian noise model, as discussed in many works (Fu, 2009; Vinodhini and Murty, 2018).

Figure 5. Presidual for different bit error rates.

CONCLUSIONS
In this paper, an accurate mathematical model to calculate the residual error probability Presidual of the MECCRLB coding scheme was derived. The new estimate is highly correlated with the simulation results, which demonstrates the inaccuracy of the previous estimate. Compared to the extended Hamming product code with Type-II HARQ, the new results show that the latter has higher reliability due to its higher error correction capability: HPC corrects all messages with two errors and a very high percentage of messages with three errors, while MECCRLB corrects only 67.5% of messages with two errors and 61% of messages with three errors. Accordingly, MECCRLB is expected to require a higher link voltage swing, which translates into higher power consumption; the schemes should therefore be reanalyzed from a power perspective using the accurate models. HPC remains a promising coding scheme for addressing reliability issues, but it requires some optimization of the encoder and decoder to reduce its complexity.