Genomic prediction of ordinal traits via application of the BLASSO and BayesCπ methods in simulated data

This paper aims at evaluating the use of BLASSO and BayesCπ methods for the genomic prediction of ordinal traits, studying factors that influence the performance of the models, and if there is a difference in the ranking of individuals. Genotypic and phenotypic information from a simulated population of 4,100 animals, genotyped by 10k markers (QTL-MAS Workshop) were used. 3,000 animals were used for estimation of the predictive ability and bias accessed through 5-fold cross-validation with five repetitions. The other animals were used as a population of selection. One ANOVA and the Ryan-Einot-Gabriel-Welch test were performed to verify, respectively, which factors influence significantly the genomic prediction and if there is a statistical difference between the models. The results show that the four main factors s ignificantly (p < 0.05) affect the predictive ability of GEBVs (genomic estimated breeding values), and that heritability and the number of categories are the most influential factors. Only for ordinal trait 2, with a density of 9k, significant differences (p < 0.05) were observed between the predictive ability of the methods. In general, the BayesCπ method proved to be more efficient in the identification of relevant SNPs and in the ranking of individuals. Finally, there is a slight superiority of the BayesCπ method for the genomic prediction of ordinal


INTRODUCTION
Methods for genomic prediction have been extensively evaluated for continuous traits in real (Croiseau et al., 2012, Resende Junior et al., 2012 and simulated (Atefi et al., 2016) data. However, it is in the interest of plants and animals breeding to determine the genetic variants associated with production, disease or resistance traits, which are not of a continuous and normally distributed nature (González-Recio & Forni, 2011).
It is known that several traits of economic interest and importance in animal and plant production are measured in an ordinal scale. Kizilkaya et al. (2014) commented that there are phenotypic traits of low heritability that are ordinal in nature, such as susceptibility, resistance to a disease and calving difficulty. In plant breeding, ordinal scores are also observed for disease resistance or susceptibility. Montesinos-López et al. (2015) used genomic wide selection techniques to predict resistance to gray leaf spot, measured on an ordinal scale, in maize lines, considering three environments.
Methods for genomic prediction of categorical or ordinal traits have been proposed in recent times, as extensions of the methods used for continuous response. Wang et al. (2013) extended the Bayesian methods, BayesA and BayesB, proposed by Meuwissen et al. (2001), and the BayesCπ method proposed by Habier et al. (2011), for prediction of genomic estimated breeding values (GEBV) of threshold traits, which originated the BayesTA, BayesTB and BayesTCπ methods. The authors have shown that threshold methods are more accurate than their contemporaries when applied to predict genomic breeding values for threshold traits. Pérez & Campos (2014) proposed a class of models, called Bayesian generalized linear regression (BGLR), that allows modeling normal, binary, ordinal or censored traits. The approaches encompassed by BGLR, present the same standard theoretical model, differing in the a priori distribution assumed for the effect of the markers.
The accurate estimation of genomic breeding values is of extreme importance in breeding programs for the selection of genetically superior individuals and that they will act as progenitors in the following generations. According to Montesinos-Lopez et al. (2015), the performance of genomic prediction methods is affected by model choice, training population size, marker density, the heritability of the trait, the amplitude of linkage disequilibrium, among other factors. Although the influence of these factors on predictive accuracy for continuous traits has been extensively addressed in the literature, few studies have evaluated how these factors influence the genomic prediction of ordinal traits.
This paper aims at evaluating the use of BLASSO and BayesCπ methods for the genomic prediction of ordinal traits. Specifically, the objective was to evaluate: which factors significantly influence the performance of the models; whether the predictive abilities of the models assessed in each scenario differ statistically by the Ryan -Einot -Gabriel -Welch test, and if there is a difference in the ranking of individuals by estimating the Kendall's coefficient of concordance.

MATERIAL AND METHODS
In this study we used simulated data from the 16th QTL-MAS Workshop (Usai et al., 2014) composed of 4,100 animals, genotyped by 10,000 biallelic SNPs (single nucleotide polymorphisms) markers distributed by five chromosomes. The database is made up of genotypic, phenotypic, genealogical information and true breeding values (TBV). Of the 10,000 markers, 9,038 passed the quality control based on the lowest allele frequency of 5% and the call rate of 5%. Random samples of markers were taken from the genotype set to obtain different marker densities (3k, 4.5k, 6k, and 9k). Samples were taken sequentially, so all SNPs contained in the set of 6k markers are also part of the 9k, and so on.
The data were simulated in five generations (G0, G1, G2, G3, and G4), so that each generation was formed by 20 males and 1,000 females, given that in the generation zero only for the males the genotypic information was registered. As the simulated phenotypic traits are relative to milk production, only females were phenotyped (G1, G2, and G3), which totaled 3,000 animals. In the fourth generation, comprised of 1,020 animals, of which 20 are males and 1,000 are females, the animals were only genotyped, constituting themselves a population of selection (Usai et al., 2014).
The phenotypic traits 1, 2 and 3 were simulated, respectively, with heritability equal to 0.35, 0.35 and 0.50. These traits were expressed only by females, and correspond to milk and fat yields, and fat content (Usai et al., 2014). Only the traits 1 and 3 were used in this study, because they present greater differences between their heritabilities. The continuous responses of these traits were transformed into ordinals with two (1 or 2), three (1, 2 or 3) or four (1, 2, 3 or 4) ordinal categories by dividing the continuous responses into quantiles (Figure 1). After the transformation, trait 1 and 3 were renamed as ordinal trait 1 and ordinal trait 2. Phenotypic traits of ordinal nature can assume C possible values ∈ {1, … , }, to = 2 have the binary case. For ordinal response, the probit link function is used to link each category to the linear predictor, by means of ( = ) = ( − ) − ( − −1 ), where: (⋅) is the cumulative normal distribution function, = + is the linear predictor and the threshold parameter. The probit link function is implemented via data augmentation, through the insertion of the latent variable = + and the measurement model = if −1 < < (Pérez & Campos, 2014).
The BayesCπ method proposed by Habier et al. (2011), has the characteristic of assuming that the effects of the markers have homogeneous variances. This method allows some markers to have zero effects, i.e., it excludes from the analyses marker that is not associated with any gene. The following distribution is assumed for the effect of the markers | 2 = { 0 with probability (0, 2 ) with probability 1-, with priori distributions given by 2~−2 ( , ) and ~ Beta( 1 , 2 ), respectively for the additive genetic variance and for the proportion of markers with null effects. The hyper-parameters , , 1 and 2 were defined according to Pérez & Campos (2014).
Initially proposed by Park & Casella (2008), the BLASSO (Bayesian Least Absolute Shrinkage and Selection Operator) method was adapted for use in genomic wide selection by Campos et al. (2009). Unlike the BayesCπ method, the BLASSO method assumes heterogeneity of variances for the effects of the markers, in addition, to perform the selection of variables indirectly through a strong shrinkage of effects close to zero. The method assumes that the effects of markers follow a double exponential distribution, or equivalently, a mixture of normal distributions. The priori distributions assumed by the model are given by: | 2 , 2~( 0, 2 2 ), 2 |~(0.5 2 ), 2~ Gamma ( , ) and 2~−2 ( , ), being the hyper-parameters r, s, and , defined a priori according to Pérez & Campos (2014).
The models were compared by 5 -fold crossvalidation with five repetitions, through their predictive ability ( In the cross-validation process, subjects were equally divided into five subgroups. Four of these groups were used as training populations, in which the genomic prediction models were adjusted and used to estimate the effects of the markers. The fifth group was used as a validation population, in this group were accessed the predictive ability and bias of the models. The procedure was repeated five times so that each of the five groups was part of the validation population only once. The cross-validation process was repeated five times, then mean values for predictive ability and bias were stored for each repetition. In order to evaluate the significance of the effects of the main factors, heritability, model, the number of categories and density of the markers, a factorial design was used, with five replications, consisting on forty-eight treatments. The assumptions of normality and homogeneity of variances were tested respectively by the Shapiro-Wilk and Bartlett tests. Then, the Ryan-Einot-Gabriel-Welch test (Hsu, 1996) for multiple comparisons was applied at 5% probability.
The methodologies cited above were used to predict GEBVs of individuals in the selection population. At this point, the independent validation process was adopted, considering the 3,000 genotyped and phenotyped females, to estimate the effects of the markers. Then, the estimated effects were used to predict genomic breeding values of the individuals belonging to the selection population, through the expression =̂, where: represents the genotypic profile of individual i and ̂ the vector of estimated effects for the markers. Subsequently, individuals were organized in decreasing order of TBV (only the 300 largest TBVs were considered), this ranking was compared with the ranking suggested by each of the models through the Kendall's coefficient of concordance (Kendall & Babington, 1939).
All analyses were performed in software R (R Core Team, 2018), with the help of the BGLR (Pérez & Campos, 2014), DescTools (Signorell et al., 2019) and Agricolae (Mendiburu, 2019) packages. In Bayesian models, the Markov chains were run with 60,000 cycles of the Gibbs sampler, with the first 15,000 cycles being discarded as burn-in and a thin of 3.

RESULTS AND DISCUSSION
The significance of the influence of factors, heritability, number of SNPs, model and number of categories, on the prediction of GEBVs was evaluated by means of an analysis of variance (ANOVA) ( Table 1). The ANOVA results show that the four main factors and two interactions (h² × nSNP and Model × nSNP) significantly (p < 0.05) affect the prediction ability of GEBVs.
Among the factors evaluated, heritability is responsible for the greater variability in predictive ability (higher MS), followed by the number of categories, model used and the number of SNPs. Although the number of categories and the density of markers can be controlled by the researcher, heritability is inherent in the trait and cannot be easily controlled. Atefi et al. (2016) obtained results similar to those found in this study when evaluating the accuracy of parametric and semi-parametric models over the generations for continuous traits. According to the authors, the heritability and density of markers emerged as the most influential factors in the prediction accuracy of genomic estimated breeding values, also, a significant interaction (p < 0.05) between heritability and number of SNPs was observed.
The values for predictive ability and bias, obtained through 5-fold cross-validation for the BLASSO and BayesCπ methods, varied according to the scenario evaluated, respectively, belonging to the ranges [0.639, 0.818] and [1.010, 1.497] ( Table 2). When comparing the predictive ability of the methods within the number of categories, it was observed that in the binary case no significant differences between the models were observed for the two ordinal traits evaluated, considering the different marker densities. Already in the case of three or four categories, for the density of 9k, the BayesCπ method proved to be significantly superior to the BLASSO method for ordinal trait 2. In all the scenarios evaluated, consistently higher values for predictive ability were observed for the BayesCπ method, however, these values were not statistically different from those obtained for the BLASSO. Regarding the bias, we can see that in the binary case BayesCπ was the least biased method. For three or four categories, BLASSO showed the smallest bias most of the time.
In the context of continuous and normally distributed responses variables, several papers explored these methods for prediction in real and simulated data. Croiseau et al. (2012) evaluated productive traits of bulls from the database composed of genomic information of bulls from five European countries, totaling 50 scenarios. The authors showed that, in most of the evaluated scenarios (28), the BayesCπ and BLASSO methods presented similar predictive ability. In 20 of these scenarios, the BayesC π method presented a predictive ability superior to the BLASSO method. Resende Junior et al. (2012), comparing several methodologies regarding the accuracy of genomic prediction for 17 pine traits, corroborated with the results presented in this study, showing that for 11 of these traits, the BLASSO and BayesCπ methods presented the same predictive ability. The BayesCπ method was more accurate than BLASSO for only five traits.
In most of the evaluated scenarios, the increase in heritability and the number of categories implied a significant increase in the predictive ability of the methods. In general, the gain in predictive ability, with an increase in heritability was more remarkable for the density of 9k. For two or three categories, increasing the predictive ability of BLASSO and BayesCπ methods with heritability variation was significant at almost all marker densities. Most of the time, the change in the number of categories of binary for three categories was accompanied by an increase in the predictive ability of the models, this increase was less noticeable with the change from three to four categories, this difference being not significant in none of the scenarios. The estimated values for the bias were lower for the trait with higher heritability. These results are corroborated by Wang et al. (2013), which showed that increases in the number of categories of the phenotypic variable and heritability are accompanied by increases in prediction accuracy of genomic breeding values, BayesA, BayesB and BayesCπ methods.  For continuous phenotypic traits, the predictive ability of the methods is expected to increase, or remain constant, with an increasing number of markers (Grattapaglia & Resende, 2011). As shown in Table 1, marker density is the main factor that least influences the predictive ability of genomic breeding values, so it is expected that few significant differences will be found. For the binary case, the increase of the density of the markers did not imply significant gains in the predictive ability, for the two ordinal traits. In the case of three or four categories, only for ordinal trait 2, a significant difference was observed with the change in the number of markers from 6k to 9k for the BayesCπ method.
The estimated values for the effects of the markers were plotted for each of the methods, considering the density of 9k markers and the three forms of categorization of the phenotype for the two ordinal traits (Figure 2 and 3). Identifying markers with large effects is an important step in genomic selection, because, by the position of these markers, it is possible to check the existence of QTLs associated with the trait. When comparing the effects estimated for the markers by BLASSO and BayesCπ methods, it is noticed that in general, the position of the most relevant SNPs was similar for the two methods.  For the ordinal trait 1, the methods identified an SNP of great effect in the initial position of chromosome four, in the three forms of categorization of the phenotype. The methods suggest the existence of two QTLs, one at the beginning and the other at the end of chromosome one, by the presence of SNPs with moderate effects. For the ordinal trait 2, SNPs of large effects were identified by both methods on chromosome 1. With the increase in the number of categories, the BLASSO method starts to identify a large-effect SNP located at the end of chromosome 2, SNP identified by BayesCπ with two categories. The initial part of chromosome 3 presented SNPs with large effects by the BayesCπ method, which suggests the existence of a QTL in this region. However, the effects estimated by BLASSO were small. These results are in agreement with those obtained by Grosse-Brinkhaus et al. (2014) who, through Genomic Wide Association Studies (GWAs), identified QTLs for the two traits without categorizing the phenotype, in positions similar to those indicated in Figures 2 and 3.
The positions of the QTLs are more evident when evaluating the effects estimated by the BayesCπ method, with three or four categories, than with the effects estimated by the BLASSO method. It can be seen from the graphs that the effect of shrinkage on effects estimates is greater for the BLASSO method than for the BayesCπ method. With the variation of the number of categories, it can be seen that the effects estimated by the methods for the two ordinal traits suffered a stronger shrinkage in the binary case, for three or four categories, few differences were observed in the magnitudes of the estimated effects. The BayesCπ method was more sensitive to changes in the number of categories than BLASSO. As the number of categories increases, it is noticed that markers with moderate and small effects are more affected by the shrinkage effect than markers with great effect.
The estimated values for the Kendall's coefficient of concordance (W), for the ranking of the individuals according to the methods and with the TBVs two to two, were significant and greater than 0.65 in all scenarios evaluated (Table 3). The most assertive method is the one that shows the highest agreement with the rankings suggested by the TBVs. A high almost perfect for ordinal trait 1 (0.998) and 0.953 for ordinal trait 2 with two categories. However, with the increase in the number of categories, this agreement decreases, respectively, to 0.94 and 0.88. In all scenarios, the agreement of the BayesCπ method, with the raking suggested by the TBVs, was superior to that obtained by BLASSO. The ranking of individuals assists in the choice of genetically superior individuals, which is of paramount importance in animal and plant genetic improvement.

CONCLUSION
Few differences were found between the BLASSO and BayesCπ methods, regarding predictive ability. Only for the density of 9k markers, for three or four categories, the BayesCπ method proved to be significantly (p < 0.05) superior to BLASSO.
Among the evaluated factors, heritability and number of categories appear as factors that most influence the prediction of genomic breeding values.
The BayesCπ method is more efficient than the BLASSO method in identifying regions that influence ordinal traits and in the ranking of genetically superior individuals.