Parallel approach of Schrödinger-based quantum corrections for ultrascaled semiconductor devices

In the current technology node, purely classical numerical simulators lack the precision needed to obtain valid results. At the same time, the simulation of fully quantum models can be a cumbersome task in certain studies such as device variability analysis, since a single simulation can take up to weeks to compute and hundreds of device configurations need to be analyzed to obtain statistically significant results. A good compromise between speed and accuracy is to add corrections to the classical simulation that are able to reproduce the quantum nature of matter. In this context, we present a new approach to Schrödinger equation-based quantum corrections. We have implemented it using the Message Passing Interface (MPI) in our in-house semiconductor simulation framework, VENDES, which is capable of running on distributed systems and therefore of delivering more accurate results in a reasonable time frame. Using a 12 nm gate length Gate-All-Around Nanowire FET (GAA NW FET) as an application example, the new implementation shows an almost perfect agreement in the output data, with less than a 2% difference between the cases using 1 and 16 processes. Also, a reduction of up to 98% in the computational time has been found when comparing the sequential and the 16-process simulations. For a reasonably dense mesh of 150k nodes, a variability study of 300 individual simulations can now be performed with VENDES in approximately 2.5 days instead of an estimated 137 days of sequential execution.


Introduction
Silicon device architectures have developed rapidly during the last few decades, with the FinFET being the current standard for mass production. However, IRDS predictions [19] point towards a change in the leading architectures for the next technology nodes, with promising candidates such as the GAA NW FET and the GAA nanosheet. This shift in the industry standard calls for new architecture design and optimization in order to predict the performance and reliability of new devices. For the current sub-10 nm scale, the traditional trial-and-error methodology for improving devices is no longer viable. For this reason, Technology Computer Aided Design (TCAD) has become an indispensable tool both to predict future device architectures and to optimize the present ones [33] [38] [5] [6] [21].
Traditionally, when devices were at the micron scale, the solution of classical models, for instance the drift-diffusion (DD) method, was the most popular choice. Currently, device dimensions have been scaled down to the nanometer regime, which requires the use of more complex and more time-consuming simulation methods. Some of these are the particle-based semiclassical Monte Carlo (MC) [2] [36], or the purely quantum Non-Equilibrium Green's Functions (NEGF) [39] [11] [27]. A common alternative, as it offers a good compromise between simulation time and accurate results, is to include quantum corrections in classical models, such as those based on the solution of the Density Gradient (DG) [15] [23] [4] or the Schrödinger (SCH) [20] [40] [1] [13] equations.
Nevertheless, as billions of transistors coexist in a die, hundreds or thousands of devices need to be simulated to obtain realistic results in variability studies [5] [30] [29] [41], making this a very computationally demanding job regardless of the simulation method. Therefore, the optimization and parallelization of simulation frameworks play a key role in current device design, enabling valid physical results to be obtained at a fraction of the computational cost.
In this paper we propose and evaluate a resolution scheme for SCH equation-based corrections compatible with the highly parallel DD model explained in [37], used in the resolution of 3D finite element (FE) meshes. By implementing these corrections in a scalable manner, we are able to obtain accurate results without excessively extending the simulation time frame. As a benchmark device we have used a 12 nm Gate-All-Around Nanowire FET, one of the main contenders to replace FinFETs as the preferred device architecture for mass production [42] [28] [14] [9]. The structure of this paper is as follows: section 2 describes the simulation software and the benchmark device used in this study; section 3 presents the 2D SCH scheme and the implemented parallelization; section 4 presents different simulation results and an efficiency discussion; finally, section 5 draws the main conclusions.
An overall view of the VENDES framework workflow and its capabilities is shown in Fig. 1. The first step in the simulation process consists of generating a 3D finite element mesh made of tetrahedra that represents the structure of the device. The modelling of the device has been done with the open-source software Gmsh [16]. The geometry is constructed using parametric design according to the device dimensions (see Fig. 2 a). Then, using a Delaunay triangulation technique, the 3D FE mesh is generated and optimized, as shown in Fig. 2 b). Every node of the 3D FE mesh is given a set of physical properties needed for the simulation, such as material permittivities, doping profiles or carrier mobilities. Note that an FE scheme has been chosen, as opposed to the finite difference method, since it is able to describe complex irregular 3D geometries. At this point the user can choose to apply several built-in variability models to the device, which can either modify the properties of the nodes, such as Random Discrete Dopants and Metal Grain Granularity, or modify the geometry of the device, such as Line Edge Roughness or Gate Edge Roughness.
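As an illustration of this first step, the following is a minimal sketch of how a cylindrical nanowire body could be described parametrically and meshed with the Gmsh Python API; the dimensions, characteristic length and file name are illustrative placeholders, not the actual VENDES input.

```python
# Minimal sketch: parametric geometry + Delaunay 3D mesh with the Gmsh Python API.
# Dimensions and mesh size are illustrative placeholders only.
import gmsh

gmsh.initialize()
gmsh.model.add("gaa_nw_fet")

L_total = 32e-9   # source + channel + drain length (placeholder)
radius  = 3e-9    # nanowire radius (placeholder)

# Cylindrical silicon body along the x (transport) axis
gmsh.model.occ.addCylinder(0, 0, 0, L_total, 0, 0, radius)
gmsh.model.occ.synchronize()

# Target element size and Delaunay 3D algorithm
gmsh.option.setNumber("Mesh.MeshSizeMax", 0.5e-9)
gmsh.option.setNumber("Mesh.Algorithm3D", 1)  # 1 = Delaunay

gmsh.model.mesh.generate(3)
gmsh.model.mesh.optimize("Netgen")  # mesh optimization pass
gmsh.write("gaa_nw_fet.msh")
gmsh.finalize()
```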
The device that has been used in this work is a state-of-the-art 12 nm Si gate-all-around (GAA) nanowire (NW) FET based on experimental data from [7]. The main dimensions and doping profile characteristics can be seen in Table 1. The device channel has been uniformly doped, whereas the source/drain (S/D) regions have a Gaussian doping. These Gaussian doping profiles, reverse engineered from experimental data [3], are characterized by the S/D doping lateral straggle (σ_x), which describes the slope of the Gaussian profile, and the S/D doping lateral peak (x_p), which indicates the position where the Gaussian decay starts, measured from the middle of the channel (see Table 1).

Fig. 2 The first step in the device modelling process is to design the geometry in Gmsh [16] a) and generate the FE mesh b). Using the spatial coordinates of the FE nodes, 2D slices are extracted using a user-defined criterion to obtain a 2D mesh c). One of the FE 2D slices is presented in d).

After the device geometry and properties have been defined, at the beginning of every simulation an initial solution is calculated using only the Poisson equation at equilibrium, at V_G = 0.0 V and V_D = 0.0 V. The output of this initial routine is a purely classical electrostatic potential. However, a solution that seeks a compromise between sound results and short simulation times has been to include quantum corrections [13] [32] [26]. Some of the most commonly used quantum corrections are the DG corrections, which require calibration against a quantum mechanical simulation as explained in [15], or the SCH corrections, based on solving the FE 2D SCH equation, which do not demand any fitting parameters.
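For clarity, a plausible form of such a lateral Gaussian S/D profile, written only in terms of the two parameters defined above, is sketched below; the exact expression used in VENDES is not reproduced in the text, so this should be read as an assumption rather than the paper's formula.

$$
N_D(x) \approx
\begin{cases}
N_{\mathrm{peak}}, & |x| \ge x_p,\\[4pt]
N_{\mathrm{peak}}\,\exp\!\left[-\dfrac{\left(|x|-x_p\right)^2}{2\sigma_x^2}\right], & |x| < x_p,
\end{cases}
$$

where x is measured from the middle of the channel and N_peak is the S/D peak doping.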
Finally, to simulate the transport inside the channel of the device, there are two options available. The first, the DD approach, is a widely used scheme to calculate carrier transport in semiconductor device simulations [15] [8] [25]. This simulation method describes the relation between the electrostatic potential and the density of carriers in the device. The second, the ensemble MC technique, solves the Boltzmann transport equation (BTE) using the charge distribution throughout the device to calculate both the electrostatic potential and the electric field. A more detailed explanation of the implementation and solution of the DD model can be found in [15], and for the ensemble MC in [3].
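For reference, the stationary DD model referred to above couples the Poisson equation with the carrier continuity equations; written here for electrons in a standard textbook form (not copied from [15]), it reads:

$$
\nabla\cdot\left(\varepsilon\,\nabla V\right) = -q\,\left(p - n + N_D^{+} - N_A^{-}\right),
$$
$$
\nabla\cdot\mathbf{J}_n = q\,(R - G), \qquad
\mathbf{J}_n = -q\,\mu_n\, n\,\nabla V + q\,D_n\,\nabla n,
$$

where V is the electrostatic potential, n and p the carrier densities, μ_n the electron mobility and D_n the electron diffusion coefficient.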

2D Schrödinger workflow
After the first Poisson iteration has been completed and an initial solution for the potential has been found, the SCH quantum corrections routine can be selected, as shown in Fig. 3. Even though the device mesh is three-dimensional, the SCH algorithm is based on solving the 2D time-independent form of the equation, which ensures a good compromise between accuracy and speed: by removing one dimension from the SCH equation, the complexity and the resolution time decrease drastically. The routine starts by defining several x-coordinates along the transport direction where the SCH equation will be solved. Then, all 3D nodes with the same x-coordinate are grouped into a single 2D slice. Fig. 2 c) shows the full 2D mesh of a GAA NW FET containing 39 cuts transverse to the transport direction, and Fig. 2 d) shows the node structure of a single slice, i.e. the 2D triangular mesh where the SCH equation is solved.
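A minimal sketch of this slicing step is shown below; the node array layout, tolerance and choice of slice coordinates are illustrative assumptions rather than the actual VENDES data structures.

```python
# Sketch: group 3D mesh nodes into 2D slices along the transport (x) direction.
# `nodes` is assumed to be an (N, 3) array of node coordinates in metres.
import numpy as np

def build_slices(nodes: np.ndarray, x_cuts: np.ndarray, tol: float = 1e-12):
    """Return, for every requested x-coordinate, the indices of the 3D nodes
    that lie on that transversal plane (within a tolerance)."""
    slices = {}
    for x_cut in x_cuts:
        mask = np.abs(nodes[:, 0] - x_cut) < tol
        slices[float(x_cut)] = np.flatnonzero(mask)
    return slices

# Example: 40 equally spaced cuts along a 40 nm long device (placeholder values)
x_cuts = np.linspace(0.0, 40e-9, 40)
# slices = build_slices(nodes, x_cuts)
```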
Prior to the resolution of the 2D time-independent SCH equation, the FE discretization matrices are generated: the stiffness matrix and the mass matrix. This is done using the Galerkin method [10] [24]. However, since the SCH equation is solved repeatedly throughout the simulation, the mass lumping technique was implemented to diagonalize the mass matrix, which makes the numerical resolution faster [10] [18] [12]. Once the FEM scheme has been generated, the eigenproblem is solved using EB13 [17], a library routine based on Arnoldi's iterative method. The results of the generalized eigenproblem are the eigenstates of the 2D SCH equation, ψ_i(y, z), and the eigenenergies, E_i. These quantities are obtained for every node of the 2D slice and are used to calculate the quantum-mechanical electron density n_sc(y, z) following the Boltzmann approximation, where k_B is the Boltzmann constant, T is the temperature, m* is the electron effective transport mass and E_Fn is the quasi-Fermi level, as shown in [31].
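The displayed expression for n_sc appears to have been lost in the text. A standard form consistent with the variables listed above (discrete generalized eigenproblem followed by Boltzmann-weighted subband occupation for one free transport direction) is sketched below; the exact prefactor, including spin and valley degeneracy factors, may differ from the one used in VENDES and should be taken as an assumption:

$$
\mathbf{K}\,\boldsymbol{\psi}_i = E_i\,\mathbf{M}\,\boldsymbol{\psi}_i,
\qquad
n_{sc}(y,z) = \sum_i \left|\psi_i(y,z)\right|^2
\sqrt{\frac{2\,m^{*} k_B T}{\pi\hbar^{2}}}\,
\exp\!\left(\frac{E_{Fn}-E_i}{k_B T}\right),
$$

where K and M are the FE stiffness and (lumped) mass matrices of the slice.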
Now that the quantum-corrected electron concentration has been calculated in the nodes of the 2D slices, it is used to correct the electron concentration in the nodes of the 3D mesh. This process is done using high-order Lagrange polynomials. When the value of the quantum concentration is needed for a 3D node that is not contained in a 2D slice, the position of the node is projected onto the closest slices, four in the case of cubic interpolation. Then, using the shape functions from the finite element method, the quantum concentration is calculated at the triangular element of the 2D cut where the 3D node projection lies. Once the quantum concentration is known at the node projections, the concentration profile at those coordinates can be obtained using the interpolation polynomials, and the quantum concentration at the original 3D point calculated. A visual depiction of the 3D interpolation process can be seen in Fig. 4.

Fig. 4 Lagrange interpolation of the 3D quantum concentration. For a cubic interpolation, four slices are taken and the 2D quantum concentration in each one of them is calculated at the node projection using the FEM shape functions.
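As a sketch of the one-dimensional part of this step (assuming the per-slice values at the node projection have already been evaluated with the FE shape functions), a cubic Lagrange interpolation over the four closest slices could look as follows; function and variable names are illustrative, not VENDES routines.

```python
# Sketch: cubic Lagrange interpolation of the quantum density along x.
# x_slices: x-coordinates of the four closest 2D slices.
# n_slices: quantum density evaluated at the node projection in each slice.
import numpy as np

def lagrange_interp(x_slices: np.ndarray, n_slices: np.ndarray, x_node: float) -> float:
    """Evaluate the Lagrange polynomial through (x_slices, n_slices) at x_node."""
    result = 0.0
    for i, (x_i, n_i) in enumerate(zip(x_slices, n_slices)):
        basis = 1.0
        for j, x_j in enumerate(x_slices):
            if j != i:
                basis *= (x_node - x_j) / (x_i - x_j)
        result += n_i * basis
    return result

# Example: four slices 1 nm apart around a 3D node at x = 12.3 nm (placeholders)
x_slices = np.array([11.0, 12.0, 13.0, 14.0]) * 1e-9
n_slices = np.array([1.2e24, 1.5e24, 1.4e24, 1.1e24])  # m^-3, illustrative
n_at_node = lagrange_interp(x_slices, n_slices, 12.3e-9)
```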
The end result is a quantum-corrected potential assigned to every node of the full 3D mesh, V_node(r). It is calculated from the classical potential V_cl(r), obtained as the solution of the Poisson equation, the 3D SCH quantum correction to the potential, V_sc(r), and the effective intrinsic carrier concentration of electrons and holes, n_i(r).
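The displayed formula for V_node appears to have been lost in the text. A plausible reconstruction, consistent with the variables just defined and with how SCH corrections are typically fed back into a classical Poisson/DD solver, is the following; it is an assumption, not necessarily the exact VENDES expression:

$$
V_{node}(\mathbf{r}) = V_{cl}(\mathbf{r}) + V_{sc}(\mathbf{r}),
\qquad
V_{sc}(\mathbf{r}) = \frac{k_B T}{q}\,
\ln\!\left(\frac{n_{sc}(\mathbf{r})}{n_i(\mathbf{r})}\right) - V_{cl}(\mathbf{r}),
$$

i.e. the correction is chosen so that the classical Boltzmann relation evaluated with V_node reproduces the interpolated quantum density n_sc.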
A full in-depth description of the inclusion of the SCH quantum corrections in the main DD scheme can be found in [23].

2D Schrödinger parallel implementation
Prior to this work, most of the VENDES routines, except for the SCH ones, already included MPI directives and could be executed in a distributed manner, speeding up the simulation process. In VENDES, the Poisson and electron continuity equations are first decoupled using Gummel's method and then linearized using Newton's algorithm. The resulting linear system is solved using a domain decomposition technique [15]. The problem domain is partitioned into several subdomains that can be solved in parallel. In order to do this, the linear system is rearranged, as shown for a two-domain example in Fig. 5. The nodes of the whole domain are classified into internal nodes, whose solution is stored only locally in each process; internal boundary nodes, whose solution is obtained locally and then shared with adjacent processes; and external boundary nodes, whose solution is calculated in a neighbouring process and then received from it. The solution obtained at the internal and external boundary nodes is exchanged between adjacent domains after every iteration to ensure the consistency of the information stored in all the processes (see Fig. 5).
This node scheme not only distributes the problem domain; it has also been shown that placing the external boundary nodes at the end of the linear system matrices improves performance [34].
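To illustrate this reordering for the two-domain example of Fig. 5, the linear system can be written in the block form below; this is a generic arrowhead-style sketch of such a decomposition, not the exact matrix layout used in VENDES:

$$
\begin{pmatrix}
A_{11} & 0      & A_{1B} \\
0      & A_{22} & A_{2B} \\
A_{B1} & A_{B2} & A_{BB}
\end{pmatrix}
\begin{pmatrix}
x_1 \\ x_2 \\ x_B
\end{pmatrix}
=
\begin{pmatrix}
b_1 \\ b_2 \\ b_B
\end{pmatrix},
$$

where x_1 and x_2 collect the internal unknowns of each subdomain and x_B the boundary unknowns placed at the end of the system.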
One of the first issues that arose when parallelizing the SCH algorithm was that the SCH quantum concentration is obtained in 2D slices at several coordinates along the transport direction, some of which can be located exactly where the domain boundary is and thus be split between two adjacent processes. For a GAA NW FET with approximately 100k nodes, the SCH routine is solved 164 times in a single full IV curve simulation, so the additional synchronization after every iteration of the routine would have too much impact on the resolution performance. To avoid having a 2D slice divided between two processes, a block partitioning scheme was implemented (see Fig. 6).
After the 3D mesh is divided among the desired number of processes, the first step is to generate the 2D SCH mesh. As mentioned earlier, the user defines a set of coordinates along the x-axis where the 2D cuts will be performed. During this process, the boundary slices are duplicated so that both adjacent processes contain the slice with the external boundary nodes and do not have to share any information after every iteration, which reduces the overall simulation time. A visual representation of the parallel resolution can be seen in Fig. 7.

Fig. 6 Block partitioning scheme for the benchmark GAA NW FET. By using this decomposition method the domain boundaries are not divided between adjacent processes and additional synchronization is avoided.

Fig. 7 Parallel Schrödinger (SCH) scheme. a) In the parallel resolution of the SCH algorithm, the boundary slice between two domains is duplicated. This was done because the MPI communication between domains after every iteration was more time consuming than the extra computing expense of an additional slice per domain. b) After the quantum density (n_sc) is calculated in every node of the 2D slices, a set of high-order Lagrange interpolation polynomials is generated to evaluate n_sc at every node of the 3D mesh.
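A minimal sketch of this per-process slice assignment, using mpi4py and assuming a contiguous block partitioning along x with one duplicated boundary slice on each internal interface, is shown below; the partitioning helper and variable names are illustrative assumptions, not the VENDES implementation.

```python
# Sketch: contiguous block assignment of 2D SCH slices to MPI processes,
# duplicating the boundary slice shared by two adjacent domains.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_slices = 40                                  # user-defined x-cuts (placeholder)
bounds = np.linspace(0, n_slices, size + 1).astype(int)

start, stop = bounds[rank], bounds[rank + 1]
# Duplicate the slice sitting on each internal domain boundary so that no
# inter-process communication is needed after every SCH iteration.
if rank > 0:
    start -= 1

local_slices = list(range(start, stop))
# Each process now solves its local 2D SCH slices sequentially and
# interpolates n_sc onto its local portion of the 3D mesh.
```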
Once all the slices are distributed, each process solves all its local 2D slices sequentially. Next, every node of each 2D cut is assigned a value of the quantum concentration, which is then interpolated to all the nodes of the local 3D mesh. With the coupled Poisson-SCH equation solver, using the quantum-corrected value of the concentration in the 3D mesh, the quantum-corrected potential is calculated. Then, the continuity equation is solved in the 3D mesh using Newton's method until convergence is reached.
The linear solver algorithm is based on the additive Schwarz method, which first solves the linear system locally. After every local iteration, the global linear system needs to be updated. This means that after every iteration all the processes exchange information, and the higher the number of processes, the more information has to be synchronized. Therefore, the scalability of the code depends on the compromise between the reduction in local computation time and the increase in communication overhead as the number of processes grows.
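To make the local-solve/global-update structure explicit, a toy numpy sketch of a one-level additive Schwarz iteration is given below; in practice the method is typically used as a preconditioner for a Krylov solver, and this dense, sequential sketch is not the VENDES implementation.

```python
# Sketch: one-level additive Schwarz iteration for A x = b (dense, illustrative).
# `subdomains` is a list of index arrays, one per (possibly overlapping) subdomain.
import numpy as np

def additive_schwarz(A, b, subdomains, n_iter=50):
    x = np.zeros_like(b)
    for _ in range(n_iter):
        r = b - A @ x                       # global residual
        correction = np.zeros_like(b)
        for idx in subdomains:              # each local solve is independent,
            A_loc = A[np.ix_(idx, idx)]     # so it can run on a separate process
            correction[idx] += np.linalg.solve(A_loc, r[idx])
        x += correction                     # global update/exchange after every iteration
    return x
```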

Results and model efficiency
In this section we analyze the performance of the proposed parallel SCH methodology. The first step in validating the effectiveness of this scheme is to verify that it generates an I_D-V_G curve consistent with the sequential execution. This is shown in Fig. 8, where the I_D-V_G curves simulated with a number of processes ranging from 1 to 32 are almost identical.
The small disagreement between the different output currents is due to discrepancies in the quantum density (n_sc). Once the quantum density is obtained, it is used to calculate the quantum potential (V_sc), which can be seen in Fig. 9. The n_sc variable is calculated in each one of the 2D slices, and then its value is obtained at all the 3D nodes of the FE mesh of the device using a high-order Lagrange interpolation method. Depending on the number of slices per process, the interpolation can be linear, quadratic or cubic. For instance, if a process only contains two or three slices, the interpolation polynomials have to be of first or second order, whereas if a process contains four or more slices, a cubic interpolation can be used. For the GAA NW FET, 40 x-coordinates, approximately every 1 nm, are defined where the 2D slices are extracted during the simulation. In the case of 32 running processes, each process usually computes up to 2 slices and therefore the interpolation order can only be linear. Consequently, using more processes to decrease the simulation time induces a loss in accuracy because a lower order of interpolation has to be used. Fig. 10 shows the current percentage error with respect to the sequential execution, calculated at V_D,sat = 0.70 V, versus the number of processes. The disagreement between the sequential and parallel I_D-V_G values reaches a maximum for the 32-process case, with up to a 10% difference in the extracted current at V_G = 0.0 V. At this gate bias the potential barrier is at its maximum and the interpolation mechanism is not able to capture the steep slopes of the potential profile. When V_G increases, the potential barrier decreases, smoothing the potential profile and minimizing the interpolation error. Note that for high values of V_G the percentage error of the current decreases and becomes less than 1% in the range from 0.6 to 1.0 V for all cases.
When using 16 or fewer processes the current error percentage is always lower than 2%; therefore, for the rest of this work only cases from 1 to 16 processes have been simulated. The use of more processes was not considered, as the decrease in accuracy due to the interpolation would not be negligible and, in addition, adding more slices to mitigate the interpolation error does not produce more precise results while increasing the simulation times.

Table 2 Execution time measurements of parallel VENDES using 1, 2, 4, 8 and 16 processes and the S, M and L meshes with 40k, 100k and 150k nodes respectively, needed to obtain a current point at V_D = 0.05 V and V_G = 0.0 V. These results have been obtained using an HP ProLiant BL685c G7 @ 3.40 GHz with 64 cores and 256 GB of RAM.

Simulation times
After checking the integrity and validity of the new parallel SCH routines, their performance has been tested. The simulation results obtained for the S, M and L meshes (40k, 100k and 150k nodes respectively) can be seen in Table 2. The time measurements correspond to the execution times needed to obtain a current point at V_D = 0.05 V and V_G = 0.0 V. These data have been obtained using an HP ProLiant BL685c G7 @ 3.40 GHz with 64 cores and 256 GB of RAM.
The output simulation times show that the code is highly parallel, with a time reduction using 16 processes of 97.1%, 97.9% and 98.1% with respect to the sequential times for the S, M and L meshes. Similar results were found for the measurements of a full I_D-V_G curve, with times up to 98.2% faster using 16 processes, as shown in Fig. 11.
The increase in performance has two main reasons. First, the parallelization of the SCH routines, which are run several hundred times during a full I_D-V_G curve, reduces the overall simulation time. Second, the VENDES framework was already parallelized with MPI; therefore, with the proposed parallelization, the resolution of the linear systems, which is the most computationally expensive part of the main scheme, can now be executed in a distributed manner, decreasing simulation times drastically.

Efficiency of the SCH equation resolution
In order to calculate the efficiency results, the standard definition has been followed,

E(p) = t_1 / (p · t_p),

where t_1 is the sequential execution time, p the number of processes and t_p the execution time using p processes. The parallel efficiency results for one point of the I_D-V_G curve are shown in Fig. 12. These data have been simulated with and without activating the resolution of the SCH and continuity equation routines, so that the influence on the efficiency of the different routines that solve the Poisson, SCH and continuity equations can be estimated. First, if only the Poisson equation is solved to reach a solution for the electrostatic potential in equilibrium (i.e. V_D = V_G = 0.0 V), the parallel efficiency increases with the number of processes, up to approximately 2.8 for the 16-process case. In this scenario VENDES reaches super-linear efficiency values (higher than 1); for instance, an efficiency of 2.8 on 16 processes means the run is roughly 45 times faster than the sequential one. As explained in [35], this behavior appears because, when the number of processes increases, the size of the local subdomains decreases, reducing the computational time taken by the linear solver by more than the increase in communication time. We have also found that the results can vary depending on the number of subdomains the device is split into. For example, it has been seen that for 4 domains the boundaries between the subdomains are placed closer to material interfaces where the physical behavior changes, such as the gate-source or gate-drain boundaries. This domain-interface coincidence forces additional process synchronizations, increasing the time needed to achieve convergence of the linear system resolution.
Moreover, when more than 8 processes are used, the efficiency begins to saturate. After every iteration of the Poisson and SCH solvers, a synchronization of the results at the boundary nodes of adjacent domains is performed. When using a large number of processes, the overhead of this communication takes a toll on the efficiency.

Simulation scalability
A similar analysis of the parallel efficiency of the code has been performed, changing both the number of running processes and the number of mesh nodes, to show how scalable VENDES is. The results are shown in Fig. 13. The efficiency increases with the number of processes, particularly for meshes with higher node density. The larger the simulated mesh, the more it benefits from executing the code in parallel. This is why the maximum efficiency for the 40k mesh is 2.2, whereas for the 100k and 150k meshes the efficiency reaches 2.9 and 3.2 respectively. As explained before, beyond 8 processes the synchronization between processes after every iteration of the solver produces a significant overhead and starts to decrease the scalability of the code.
Finally, to assess the impact of this parallelization on the performance, we consider a typical variability study in which 300 simulations are performed. Considering that the sequential execution of a full IV curve takes close to 11 hours (300 × 11 h ≈ 137.5 days), obtaining the results for 300 curves would take more than 137 days of computational time. With the parallel software, since the results of a simulation can be obtained within approximately 12 minutes using 16 processes (300 × 12 min = 60 h), the time needed to simulate the 300 devices drops to 2.5 days for the L mesh.

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Conclusions
A fully parallel implementation of 3D finite element (FE) Schrödinger (SCH) quantum corrections was developed for the physical modeling of ultrascaled semiconductor devices such as GAA NW FETs. The incorporation of the SCH equation-based quantum corrections into the simulation framework increases the overall accuracy of the results, and its execution on distributed systems keeps the simulation time frame reasonable.
The new routines were tested on a state-of-the-art 12 nm Si GAA NW FET, showing an almost perfect agreement in the I_D-V_G curve with respect to the sequential simulation for up to 16 processes. Also, in order to reduce synchronization stalls in the current nested multilevel domain decomposition scheme, the boundary nodes have been duplicated and an additional one or two 2D SCH slices have been computed per process. This extra computation avoids a considerable overhead, because otherwise the SCH solution would need to be shared between adjacent processes after every iteration to check for consistency. Finally, a comparison of simulation times and code efficiency was performed. The new parallel code was tested on three different meshes of 40k, 100k and 150k nodes (S, M and L respectively). The execution times dropped by up to 98.2% for the L mesh using 16 processes with respect to the sequential case. The efficiency is maximum when the simulator runs on a distributed system with 16 processes, reaching a super-linear value of 3.2.
Overall, the new implementation allows for the rapid simulation of ultrascaled devices, a much desired characteristic in research studies such as variability analyses, where hundreds or thousands of non-ideal devices need to be simulated. For instance, a typical variability study with 300 device simulations, each one running for approximately 11 hours, may take over 137 days of computational time. With the proposed parallel software, the total elapsed time drops to 2.5 days using 16 processes for a mesh of 150k nodes.