# HIGH PERFORMANCE AND HARDWARE FRIENDLY NB\_LDPC DECODER OF EMS ALGORITHM Shulong $Sun^{1,2}$ and $Min Lin^{1,2}$ <sup>1</sup>State Key Laboratory of Functional Materials for Informatics Shanghai Institute of Microsystem and Information Technology Chinese Academy of Sciences No. 865, Changning Road, Shanghai 200050, P. R. China ssl2012@mail.ustc.edu.cn <sup>2</sup>University of Chinese Academy of Sciences 19 A Yuquan Rd., Shijingshan District, Beijing 100049, P. R. China Received November 2016; revised February 2017 ABSTRACT. Compared with binary Low-Density-Parity-Check (LDPC) code, Non-Binary LDPC (NB\_LDPC) code offers better error correcting performance and higher spectral efficiency. In this work, we introduce a Double Iterations Extended Min-Sum (EMS) Algorithm for NB\_LDPC decoder future to improve the decoder performance. And Tree-Architecture and partially parallel Elementary Check Unit (ECU) are introduced to complete Check Node Process so as to optimize the critical path delay and make a tradeoff between throughput and complexity. From the point of decoding algorithm, unlike Turbo-Iteration method exchanging message between detector and decoder, the proposed algorithm is merely conducted on the decoder. Simulation result shows that the proposed algorithm can achieve more than 0.5dB coding gain compared with the traditional method on different conditions. And the equivalent iteration numbers are also compared with traditional EMS algorithm in this paper; the result shows that the proposed method can also sustain the decoder throughput. As an illustration, an NB\_LDPC decoder is designed on GF (16), Nm = 8 in this paper. The proposed method is actually a modified EMS algorithm which just adds channel information iteration process to traditional scheme, so this method can be widely applied to plenty of iteration algorithms such as LDPC or NB\_LDPC decoding algorithm irrespective of the specified system; therefore, the proposed method can be conveniently extended to higher Galois Field such as GF (64) or GF (256) by modifying a slight part of the design. Compared with previous works, the decoder proposed in this paper is more efficient and rather hardware friendly for practical applications especially in the low Signal-to-Noise Ratio (SNR) condition without introducing any extra decoding complexity and throughput degradation. Keywords: NB\_LDPC code, Double iterations, EMS algorithm, Tree-Architecture 1. Introduction. Non-Binary Low-Density-Parity-Check (NB-LDPC) codes, defined over Galois field with order GF (q), were investigated by Davey and Mackay in 1998 [1-3]. It was shown that NB\_LDPC codes outperform the binary counterparts when the code length is small or the system is under higher-order modulation [4,5]. Compared with binary LDPC code, NB\_LDPC codes also outperform binary LDPC codes in the presence of burst errors. From a performance point of view, despite high performance of NB\_LDPC decoder its potential capacity can be future developed by several methods [6-10]. By combining symbol detection and channel decoding method together, such as MIMO-LDPC turbo iteration strategy, the performance is rather better than single iteration LDPC decoding method. In [10], a MIMO-NB\_LDPC Iterative Detector-Decoder for $4 \times 4$ 256-QAM system based on 65nm CMOS technics has been proposed; in this method, decoder and detector exchange soft information to improve performance iteratively. However, this method adds rather complexity to system design. According to above iterative structure between decoder and detector, we introduce a double iterations structure for NB\_LDPC decoder architecture future to improve the decoder performance. In this work, as the algorithm can be merely implemented on the decoder, the system design can be much simplified. In the double iterations structure, the current variable nodes information will replace initial channel LLR memory which remains consistent in traditional algorithm. Simulation result shows that the proposed algorithm can achieve more than 0.5dB coding gain compared with the traditional method on different conditions. From an implementation point of view, the complexity of NB\_LDPC codes is much higher than binary counterparts. Therefore, an important effort has been dedicated to design reduced-complexity decoding algorithms for NB\_LDPC codes. In [8] the authors introduce the EMS algorithm which is based on a generalization of the Min-Sum algorithm. Its principle is the truncation of the vector messages from Nm to q values (Nm << q), introducing a negligible performance degradation compared to the Belief Propagation (BP) algorithm and making a good trade-off between performance and complexity. In [13], Min-Max (MM) algorithm is adopted to future reduce decoder's complexity but the performance is seriously degraded. The Forward-Backward (F-B) algorithm for EMS is applied in [8], where a Check Node Unit (CNU) is divided into several identical Elementary Check Units (ECUs) to reduce the computation complexity. However, each ECU just processes two symbols per system cycle, which means 3(dc-2) ECU computations are needed for one Check Node (CN) to variable node VN (C2V) update. In [12], to future reduce hardware consumption, double clocks architecture is proposed where a higher frequency clock which is 8 times of the system clock is designed for ECU so just one ECU is needed that can work serially during a system cycle for CN update. However, the higher frequency clock is rather limited caused by critical path delay; thus, the structure cannot meet high throughput applications. In this paper, in order to optimize the critical path delay and make a tradeoff between throughput and complexity, we applied Tree-Architecture to replace traditional Forward-Backward strategy and partially parallel (ECUs) structure is also introduced to complete Check Node process. The paper is structured as follows. Section 2 introduces proposed double iterations algorithm based on EMS algorithm for NB\_LDPC decoder. Section 3 presents the new Tree-Architecture Check Node Architecture as well as the double iterations decoder system structure in this paper. Section 4 gives the comparison between the proposed decoder with other works in terms of performance, complexity and throughput. Section 5 gives conclusion of this paper. 2. The Proposed Algorithm. In traditional NB\_LDPC decoding algorithm, the number of unsatisfied checks goes down first then oscillates for a while before it reaches a constant level. This is because the positions of erroneous variable nodes change over decoding iterations. This means that decoding got trapped in the stable trapping sets, which is what we try to prevent from happening. This uncorrectable error is the main cause to slow down the throughput. In the proposed method, double iterations strategy is introduced, the inner iteration can be early terminated and the initial Log Likelihood Ratio (LLR) of channel value will be replaced by current variable messages. Then the outer ring loop iteration will start. The whole algorithm stops when all checks are satisfied or the iteration numbers of inner loop and outer loop both reach their thresholds. The proposed method is called Double Iterations EMS Algorithm (DI-EMS) in this paper. The proposed DI-EMS algorithm can be described in detail as follows: ## 1) Initialization In this step, the channel values are stored into the LLR memory and the variable node to check node (V2C) message buffers are also fed with the channel values. Then the initial V2C message of a check node $P(\alpha)$ can be given by: $$p(\alpha_i) = \{\text{sym0 [i], llr0 [i]}\} \quad 1 \le i \le n_m$$ $$P(\alpha) = \{p(\alpha_1)p(\alpha_2)...p(\alpha_{nm})\};$$ (1) $p(\alpha_i)$ contains the ith possible symbol and its corresponding LLR value. ### 2) C2V update For all the check nodes of degree dc, calculate the C2V messages based on the other dc-1 V2C messages, we denote the other dc-1 V2C messages as $P_1(\alpha) \sim P_{dc-1}(\alpha)$ . The update equation can be given by: $$V(\alpha) = \max \left( \sum_{j=1}^{dc-1} P_j(\alpha) \right)$$ $$\alpha_1 \alpha_2 \alpha_3 \dots \alpha_{dc-1} | \alpha_1 + \alpha_2 + \alpha_3 + \dots + \alpha_{dc-1} | = \alpha$$ $$(2)$$ #### 3) V2C update For all the variable nodes of dv degrees, calculate the V2C messages based on the other dv - 1 messages and the channel values, we denote the other dv - 1 C2V messages as $V(\alpha_1) \sim V(\alpha_{dv-1})$ . At the same time, the vectors for decision have to be calculated in this stage, and here we use S to represent. The update equations are shown as follows: $$P(\alpha) = \left\{ [\text{sym0, llr0}] + \sum_{j=1}^{dv-1} V_j(\alpha) \right\}_{\alpha_1 + \alpha_2 + \dots + \alpha_{dv-1} = \alpha}$$ (3) $$S(\alpha) = \left\{ [\text{sym0, llr0}] + \sum_{j=1}^{dv} V_j(\alpha) \right\}_{\alpha_1 + \alpha_2 + \dots + \alpha_{dv} = \alpha}$$ $$\tag{4}$$ # 4) Decision We use $\overset{\wedge}{C} = [C_1 C_2 C_3 \dots C_N]$ to represent the decision codeword for current iteration. Here N is code length and $$\hat{C} = \underset{\alpha}{\operatorname{arg\,max}} S(\alpha) \tag{5}$$ Repeat 2) $\sim$ 4) until $H^T \cdot \stackrel{\wedge}{C} = 0$ where $H^T$ is the transpose of the parity check matrix of NB\_LDPC code or the inner and outer loops thresholds are both reached. If it has only reached the threshold of inner loop iteration, the algorithm goes to step 1). At the same time, the LLR memory is fed with the current variable nodes message. However, in traditional EMS algorithm, the LLR memory remains always consistent during the whole algorithm which keeps the channel information. The updated values can be given by: $$\{\text{sym0, llr0}\} = S \tag{6}$$ 3. Proposed Check Node Architecture. In Figure 1, the decoder mainly consists of the following components, Control Unit, LLR Memory, Message Memory between CNs and VNs, Transposition Unit and Inverse Transposition Unit, CN Unit, VN Unit and Decision Unit. The Message Memory is initialized to be zero and the LLR Memory is initialized to the channel values. These functions are implemented by the Transposition Unit and Inverse Transposition Unit based on LookUp Tables (LUT). The new path in FIGURE 1. (a) Structure of traditional EMS scheme; (b) structure of proposed DI-EMS scheme Figure 1(b) conducts the outer long iterations which feeds variable nodes message into LLR Memory to perform DI-EMS algorithm. Each element in a vector contains LLR value and corresponding symbol, C2V/V2C denotes the check-to-variable node message and variable-to-check node message respectively. Generally speaking, a CNU is divided into 3(dc-2) ECUs and each ECU operates with two incoming messages. A CN message update is based on several ECUs. In conventional design, a CN consists of 3(dc-2) ECUs and a VN consists of 3(dv-1) EVUs when they are fully parallel implemented. The F-B architecture is proposed for EMS algorithm implementation which has been extensively applied. In this paper, in order to deeply optimize the critical path delay, we adapt the Tree-Architecture for our ECN implementation which is first presented in [11]. Without modifying the algorithm, we moved the data path computation to the last stage of the architecture. This modification relaxes the critical path and simplifies the hardware design. As the reason that ECN is rather complex in terms to hardware consumption, that is 12 ECNs are needed for a dc=6 check node implementation if it is fully parallel designed. In order to make a tradeoff between complexity and throughput, we only applied three ECNs and four cycles to completing a dc=6 check node update. The proposed partially parallel structure takes only four cycles rather than 12 cycles in [12] for CN computation. So the throughput is at last three times more than the decoder proposed in [12]. ECN structure is based on the architecture also in [12]. For our decoder, we also adapt double clocks to drive the system. The Check Nodes as well as Variable Nodes are working at high frequency and the other modules are working at low frequency. 4. Implementation Results and Performance Analysis. In this section, we conduct the comparison of proposed DI-EMS algorithm with traditional EMS algorithm and our decoder implementation results with other works. The decoders are both based on GF (16). As depicted in Figure 4(a) and Figure 4(b), we can see that the double iterations scheme provides more than 0.5dB coding gain compared with the traditional method. The presented algorithm is widely suitable for many kinds of iteration algorithms and the performance is nearer to Shannon Limit. In Figure 4(a), set Nm = 8, the simulation is FIGURE 2. (a) Architecture of F-B CN with dc = 6; (b) Tree-Architecture of CN with dc = 6 conducted in Additive White Gaussian Noise (AWGN) channel; on the rather low SNR condition $0 \sim 3 \, \mathrm{dB}$ , the novel method takes great advantage in the decoding performance. In Figure 4(b), we can see that in the double iterations structure those unnecessary iterations in traditional method are efficiently cut and the thresholds of the inner loop and outer iteration need careful design. In the simulation process, the threshold of traditional algorithm is set 30, for fair comparison we set the inner loop iteration threshold of DI-EMS algorithm 10 and outer ring iteration threshold 3, so the throughput is the same. For DI-EMS algorithm, i presents the inner loop iteration number and j presents the outer loop iteration number, Figure 3. Three-ECNs structure for CN of dc = 6 FIGURE 4. (a) Performance of proposed DI-EMS algorithm and traditional EMS algorithm; (b) iteration numbers of proposed DI-EMS algorithm and traditional EMS algorithm and here T is the maximum iteration number of inner loop. So the equivalent iteration number of DI-EMS algorithm is (j-1)\*T+i. We can see from Figure 4(b) that the iteration numbers in different SNR condition of the two algorithms are not evidently different from each other, that is, our method provides better performance (more than 0.5dB) in terms of Bit Error Rate (BER) without introducing any throughput degradation from the traditional method. This throughput is calculated as: $$Throughput = \frac{F_{clk}.L_{code}.R}{Iter.(QC*2) + QC}Mbps$$ (7) where *Iter* represents the number of iterations, QC is the size of submatrix of the check matrix in the decoder and is determined by the specified check matrix. In the proposed decoder, QC = 27 and Iter = 10. $L_{code}$ is code length and R is code rate. For fair comparison, we propose a normalized hardware efficiency p to represent decoder quality $p = (Throughput/Total\ Gates) * Nm \log Nm$ as throughput has already taken code length, code rate, iteration number and q-ary into consideration and the primitive complexity is dominated by $O(Nm \log Nm)$ . Table 1 gives the decoder performance compared with other works. In [13], the hardware efficiency is the best because Max-Min algorithm is adopted to simplify EMS algorithm to future reduce decoder's complexity but the performance is seriously degraded. In [12] double clocks are introduced as the high clock frequency is 8 times of the low clock because just one ECN should work serially that the throughput is hard to improve. Nm = 8 is set here and three ECNs are introduced, we can see that more than 1dB coding gain can be achieved at AWGN channel compared to that Nm = 4 in [12]. And double iterations algorithm can achieve extra 0.5dB coding gain compared to traditional single iteration EMS algorithm without introducing evident throughput degradation. As 3 ECNs are proposed for partially parallel structure to replace the fully parallel structure, the decoder proposed in this paper can make a good tradeoff between complexity and throughput. | Parameter | This work (Double Iterations) | This work (Single Iteration) | [12] | [13] | |-------------------------|-------------------------------|------------------------------|--------------------|---------------------| | Code length/symbols | 162 | 162 | 162 | 576 | | Rate | 0.5 | 0.5 | 0.5 | 0.89 | | Galois Field/ $Nm$ | 16/8 | 16/8 | 16/4 | | | (dv, dc) | Irregular | Irregular | Irregular | (4, 36) | | Algorithm | EMS | EMS | EMS | MM | | Quantization bit | $6 \mathrm{bits}$ | $6 \mathrm{bits}$ | $6 \mathrm{bits}$ | $5 \mathrm{bits}$ | | Frequency of system | $200 \mathrm{MHz}$ | $200\mathrm{MHz}$ | $75 \mathrm{MHz}$ | $226\mathrm{MHz}$ | | Frequency of ECN | $600 \mathrm{MHz}$ | $600 \mathrm{MHz}$ | $600 \mathrm{MHz}$ | _ | | Iterations | 10 | 10 | 10 | 10 | | Throughput | $158.57 \mathrm{Mbps}$ | $186 \mathrm{Mbps}$ | 69 Mbps | $630 \mathrm{Mbps}$ | | SNR/10-6BER | $1.7 \mathrm{dB}$ | 2.3dB | 3.3 dB | 4.2 dB | | Slice Registers | 15847 | 14257 | 4947 | 81644 | | LUT Slice | 28456 | 27668 | 12356 | 34908 | | Equivalent gates (xor) | 724k | 627k | 287k | 1765.57 k | | Device | Virtex-5 | Virtex-5 | Virtex-5 | Virtex-7 | | Hardware Efficiency $p$ | 804bps/gate | 914bps/gate | 628.7 bps/gate | 1224bps/gate | Table 1. Comparison of this decoder and other work 5. Conclusion. In this paper, we proposed a DI-EMS algorithm for NB\_LDPC decoder to improve the performance. This algorithm can provide more than 0.5dB coding gain compared to traditional single loop iteration EMS algorithm without introducing any throughput degradation and complexity increment. Tree-Architecture is presented in this paper to construct check node so as to optimize the path delay from F-B structured check node. Partially parallel ECU architecture is also proposed to make a good tradeoff between complexity and throughput. The proposed method in this paper is worthy to be extended to higher-order modulation systems combined with GF (64) or GF (256) NB\_LDPC decoder. For future researches, efficient decoding of NB\_LDPC is still an open field. The complexity of NB\_LDPC decoder is still high in terms of hardware consumption especially the computation of check node update. Some novel efficient strategies are also needed to be invented to present the message exchange between check nodes and variable nodes. For practical applications, a better tradeoff between performance, throughput and complexity of NB\_LDPC decoder according to different conditions is also a valuable research field. #### REFERENCES - [1] M. Davey and D. J. C. MacKay, Low density parity check codes over GF (q), *IEEE Commun. Lett.*, vol.2, no.6, pp.165-167, 1998. - [2] R. Yazdani and M. Ardakani, Efficient LLR calculation for non-binary modulations over fading channels, *IEEE Trans. Commun.*, vol.59, no.5, pp.1236-1241, 2011. - [3] E. Boutillon and L. Conde-Canencia, Bubble check: A simplified algorithm for elementary check node processing in extended min-sum non-binary LDPC decoders, *Electronics Letters*, vol.46, no.9, pp.633-634, 2010. - [4] E. Boutillon and L. Conde-Canencia, Design of a GF (64)-LDPC decoder based on the EMS algorithm, *IEEE Trans. Circuits and System*, vol.60, no.9, pp.2644-2655, 2013. - [5] A. Voicila, D. Declercq, F. Verdier, M. Fossorier and P. Urard, Low-complexity decoding for non-binary LDPC codes in high order fields, *IEEE Trans. Commun.*, vol.58, no.5, pp.1365-1375, 2010. - [6] G. Sarkis and W. J. Gross, Efficient stochastic decoding of non-binary LDPC codes with degree-two variable nodes, *IEEE Commun. Lett.*, vol.16, no.3, pp.389-391, 2012. - [7] D. J. MacKay and R. M. Neal, Near Shannon limit performance of low density parity check code, *Electron. Lett.*, vol.33, no.6, pp.457-458, 1997. - [8] D. Declercq and M. Fossorier, Decoding algorithms for nonbinary LDPC codes over GF (q), IEEE Trans. Commun., vol.55, no.4, pp.633-643, 2007. - [9] J. Kang, Q. Huang, S. Lin and K. A. Ghaffar, An iterative decoding algorithm with backtracking to lower the error-floors of LDPC codes, *IEEE Trans. Commun.*, vol.59, no.1, pp.64-73, 2011. - [10] C.-H. Chen, W. Tang and Z. Zhang, A 2.4mm 130mW MMSE-Nonbinary-LDPC iterative detector-decoder for 4×4 256-QAM MIMO in 65nm CMOS, SoC for Mobile Vision, Sensing, and Communications, pp.337-339, 2015. - [11] O. Abassi, L. Conde-Canencia, A. A. Ghouwayel and E. Boutillon, A novel architecture for elementary check node processing in non-binary LDPC decoders, *IEEE Trans. Circuits and Systems*, 2016. - [12] S. Sun and M. Lin, High throughput and low complexity implementation of NB-LDPC decoder based on EMS algorithm, *IEICE Electronics Express*, 2016. - [13] J. O. Lacruz, F. García-Herrero, M. J. Canet, J. Valls and A. Pérez-Pascual, A 630 Mbps non-binary LDPC decoder for FPGA, *IEEE International Symposium on Circuits and Systems (ISCAS)*, pp. 1989-1992, 2015. - [14] P. S. A. Marinoni and S. Valle, Efficient design of non-binary LDPC codes for magnetic recording channels, robust to error bursts, *Proc. of IEEE Int. Symp. Turbo Codes Related Topics*, Lausanne, Switzerland, 2008. - [15] M. Karimi and A. H. Banihashemi, Efficient algorithm for finding dominant trapping sets of LDPC codes, *IEEE Trans. Inf. Theory*, vol.58, no.11, pp.6942-6958, 2012. - [16] V. Savin, Min-Max decoding for non-binary LDPC codes, *Proc. of IEEE Int. Symp. Inf. Theory*, Toronto, ON, Canada, pp.960-964, 2008.