ability to explain the relevant clinical and histopathological information. Next, we characterized the factors based on three properties: (1) their ability to discriminate among tumor types, assessed using Linear Discriminant Analysis, a supervised classifier that finds the linear combination of factors which best separates two predefined classes; (2) their functional biological characterization, with the help of the literature and databases; and (3) their complex biological characterization, by searching for novel properties emerging from the joint analysis of miRNAs and mRNAs. The procedure is summarized in Figure 2.

Data Preprocessing

Data were transformed by computing the log2 of the mRNA expression intensity values. Quality selection filtering was performed by removing every row with a maximum fold change below 2.
This filtering reduced the dataset from 7182 IDs to 4966 IDs. The filter was chosen to select genetic elements with a strong signal of variation. This criterion follows naturally from the filtering performed by the authors of the dataset, who used the same conditions to reduce the number of IDs. Data were also normalized in two different ways. The two methods map the expression level into an interval between 0 and 1 for the first and between ui and ui 1 for the second. The two normalizations give identical results in the Factor Analysis step, as expected. In fact, expression signals obtained from qPCR differ from signals obtained from microarrays because of the extended dynamic range of the former.
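The preprocessing steps described above (log2 transform, fold-change filter, scaling to a unit interval) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the toy table are hypothetical, the fold change is assumed to be the ratio between the largest and smallest intensity in a row, the threshold of 2 is taken at face value, and min-max scaling stands in as an assumed form of the first normalization, whose exact formula is not given in this passage.

```python
import math


def log2_transform(table):
    """Return a new table with every intensity replaced by its log2."""
    return {gene: [math.log2(v) for v in values] for gene, values in table.items()}


def max_fold_change(log_values):
    """Largest fold change across samples, computed from log2 values.

    Assumption: fold change = max intensity / min intensity = 2 ** (max - min).
    """
    return 2 ** (max(log_values) - min(log_values))


def filter_by_fold_change(log_table, threshold=2.0):
    """Keep only rows whose maximum fold change reaches the threshold."""
    return {g: v for g, v in log_table.items() if max_fold_change(v) >= threshold}


def scale_unit_interval(values):
    """Min-max scale one row to [0, 1] (assumed normalization, see lead-in)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant row: no variation to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy table: rows are genetic elements, columns are samples.
raw = {
    "gene_a": [100.0, 400.0, 200.0],  # max fold change 4, kept
    "gene_b": [100.0, 120.0, 110.0],  # max fold change 1.2, removed
}

logged = log2_transform(raw)
kept = filter_by_fold_change(logged)
normalized = {gene: scale_unit_interval(values) for gene, values in kept.items()}
```

Filtering on the log2 values keeps the criterion symmetric for up- and down-regulated elements, since a fold change of 2 corresponds to a difference of 1 on the log2 scale.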
It is common, in order to validate a set of coding genes obtained by microarray, to express the mRNA level in each sample as a fraction of the expression level in the sample in which that mRNA is most abundant. So, from this point on, miRNA and mRNA expression data were analyzed together, as a single expression table with normalization x1.

Factor Analysis

The Factor Analysis model can be defined in matrix notation as $D = LF + \varepsilon$, where $D$ represents the data matrix, $L$ is the factor loadings matrix, $F$ is the factor scores matrix and $\varepsilon$ is the unique factors matrix. Furthermore, $m$ is the number of samples, $n$ the number of genetic elements and $l$ the number of factors. Our model assumes that $F$ and $\varepsilon$ are independent, $E(F) = 0$, and $\mathrm{Cov}(F) = I$. Under these conditions $\mathrm{Cov}(D) = LL^T + \mathrm{Cov}(\varepsilon)$; for the sake of clarity, $LL^T$ is named communality and $\mathrm{Cov}(\varepsilon)$ uniqueness.
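The covariance decomposition above can be illustrated numerically. The sketch below is not taken from the paper: the loadings are made-up numbers, the function names are hypothetical, and the uniqueness term is assumed to be diagonal (uncorrelated unique factors), which is the usual convention but is not stated in this passage.

```python
def communalities(loadings):
    """Diagonal of L L^T: variance of each variable explained by the factors."""
    return [sum(x * x for x in row) for row in loadings]


def implied_covariance(loadings, uniqueness):
    """Covariance implied by the model: L L^T + diag(uniqueness).

    `loadings` is a list of rows, one per variable and one column per factor.
    `uniqueness` holds one unique variance per variable (diagonal assumption).
    """
    n = len(loadings)
    cov = [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        cov[i][i] += uniqueness[i]
    return cov


# Hypothetical loadings for 3 variables on 2 factors.
L = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
]

h2 = communalities(L)            # variance shared through the factors
psi = [1.0 - h for h in h2]      # uniqueness, assuming unit-variance variables
sigma = implied_covariance(L, psi)
```

With unit-variance variables, each diagonal entry of `sigma` is the sum of a communality and a uniqueness, so it equals 1, while the off-diagonal entries come entirely from the shared factors.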
Variability in a human tumor expression dataset arises from several sources besides tumor type, including human variability and experimental variability. The available information concerns tumor types; therefore, our model explicitly involves tumor type variability and groups other causes within the $\varepsilon$ term, showing the power of the FA method. In our work, we were interested in discovering the hidden or latent structure within tumor types; therefore FA is applied using the model $D = X^T$. The R package HDMD, developed by Lisa McFerrin at North Carolina State University, was used to take advantage of the princip