Surveys, ipsative and compositional data analysis (CODA) |
|
Chair | Dr Berta Ferrer-Rosell (University of Lleida) |
Coordinator 1 | Dr Marina Vives-Mestres (University of Girona) |
Coordinator 2 | Dr Juan Jose Egozcue (Technical University of Catalonia) |
Compositional Data (CoDa) are positive vector variables carrying only information about the relative size of their D parts or components. Typical examples are chemical and geological analyses, in which only proportions of components are of interest. Accordingly, CoDa are usually presented with a fixed sum, in proportion, percentage or part-per-million units. Some serious problems that arise when using standard statistical analysis tools on CoDa include non-normality and heteroscedasticity. The simplest statistical concepts (centre, variation and distance) have to be redefined. For instance, Euclidean distance considers the pair of proportions 0.01 and 0.10 to be as mutually distant as 0.51 and 0.60, although the first pair differs by a factor of ten. Interpretationally, one component can increase only if some others decrease. This results in spurious negative correlations among components and prevents interpreting effects of linear models in the usual way, ‘keeping everything else constant’.
Aitchison’s approach to CoDa states that all needed information about relative magnitudes is in log ratios among components or among their geometric means. Geometric means, ratios and logarithms constitute a natural way of distilling information about relative size. Log ratios are unbounded and, once they have been computed, standard statistical analyses are appropriate. Nowadays CoDa is regarded not only as a means of solving the statistical and assumption-related problems, but also as a means of performing analyses whose research questions concern relative rather than absolute magnitudes.
The standard central tendency measure in CoDa is simply the geometric mean. As an alternative to correlation matrices, CoDa uses the variation matrix, whose elements are the variances of the logarithms of all possible pairwise ratios among components. Perfect proportionality between two components leads to a null log-ratio variance.
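As a rough illustration of these two descriptive tools, the following sketch computes the compositional centre and the variation matrix with NumPy. The helper names (`closure`, `centre`, `variation_matrix`) and the three-part data set are hypothetical and chosen only so that two parts are exactly proportional; the sample variance (`ddof=1`) is likewise an assumption.

```python
import numpy as np

def closure(x):
    """Rescale each row so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def centre(X):
    """Compositional centre: closed vector of column-wise geometric means."""
    return closure(np.exp(np.log(X).mean(axis=0)))

def variation_matrix(X):
    """T[i, j] = sample variance of log(x_i / x_j) over the rows of X."""
    logX = np.log(X)
    diffs = logX[:, :, None] - logX[:, None, :]  # shape (n, D, D)
    return diffs.var(axis=0, ddof=1)

# Hypothetical data: part 1 is always twice part 0 (perfect proportionality).
X = closure(np.array([
    [1.0, 2.0, 7.0],
    [2.0, 4.0, 4.0],
    [3.0, 6.0, 1.0],
]))
T = variation_matrix(X)
print("centre:", centre(X))
print("variation matrix:\n", T)
```

In this toy data set the entry `T[0, 1]` is zero up to rounding, which reflects the null log-ratio variance of two proportional parts.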
In CoDa it is often the case that a standard technique applied to log ratios is equivalent to a compositional technique applied to the raw data. Aitchison’s distance, which focuses on relative differences among compositions, equals the Euclidean distance computed from the so-called centred log ratio transformation. Once this transformation has been computed, cluster analysis can be performed in a standard way.
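A minimal sketch of this equivalence follows, taking the Aitchison distance as the Euclidean norm of the difference between centred log ratio (clr) vectors. It reuses the pairs of proportions 0.01 vs 0.10 and 0.51 vs 0.60; completing each proportion to a two-part composition with its complement is an assumption made for illustration, as are the function names.

```python
import numpy as np

def clr(x):
    """Centred log ratio: log of each part minus the mean log of all parts."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))

def euclidean_distance(x, y):
    """Ordinary Euclidean distance on the raw proportions."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

# Two-part compositions (proportion, complement); the complement is assumed.
low = ([0.01, 0.99], [0.10, 0.90])
mid = ([0.51, 0.49], [0.60, 0.40])

for name, (x, y) in (("0.01 vs 0.10", low), ("0.51 vs 0.60", mid)):
    print(name, "Euclidean:", euclidean_distance(x, y),
          "Aitchison:", aitchison_distance(x, y))
```

The raw Euclidean distances of the two pairs coincide, whereas the Aitchison distance is larger for the pair near zero. Because the clr vectors live in ordinary Euclidean space, they can be passed to any standard clustering routine.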
In order to illustrate the basics of CoDa and the main descriptive tools, we reanalyse the data from a social support survey. Respondents assessed 11 perceived social support functions provided by people belonging to the Antonucci social support network. To assess support network composition, support providers were classified into percentages of family/relatives, friends, neighbours and co-workers (D=4). We examine standard and CoDa measures of centre, variability and distance and we use distances for clustering. We compare results and show the interpretational flaws of standard analyses.
The Community Innovation Survey (CIS) is conducted biennially in most EU countries to assess business innovative practices and innovation results. The survey follows the guidelines of the Oslo Manual for innovation measurement and, since 1997, it has provided valuable standardized and comparable innovation data for scholarly research. Among other things, the CIS asks R&D or general managers about the percentages of turnover from a) innovative products launched by the firm for the first time in its market, b) innovative products launched by the firm but already extant in its market, and c) unchanged or only marginally modified products. These turnover data make it possible to study both the degree of innovativeness and the decisions of early or late entry timing into the market.
Thus far these data have been analysed without taking into account their fixed 100% sum. This leads to the violation of most assumptions of the linear model (e.g., non-normality, non-linearity, and heteroskedasticity). The resulting prediction interval limits often fall below 0 or above 100. The analysis also leads to unclear or wrong interpretations: the sum of coefficients predicting the different percentages is zero, and the issue about which percentages decrease when the percentage which is modelled increases remains unclear.
In this paper we reanalyse the CIS data by means of compositional data analysis methods and show how these shortcomings can be easily overcome by means of log-ratio transformations. These transformations can be tailored to test the hypotheses of interest to the researcher. In our case we compute a log-ratio of innovative over unchanged products (innovativeness) and one of new-to-the-market over new-to-the-firm products (early entry timing). These log-ratios are then regressed on variables related to the degree of openness in the innovation process and to the firm’s strategy.
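The two tailored log-ratios could be sketched as follows. The abstract does not state the exact formulas, so this version makes two assumptions: the two innovative shares are summarised by their geometric mean before being compared with the unchanged share, and no scaling constants are applied. The function name and the firm data are hypothetical, and zero percentages, which would need separate treatment, are rejected.

```python
import numpy as np

def cis_logratios(new_to_market, new_to_firm, unchanged):
    """Return (innovativeness, early_entry) log-ratios for turnover shares.

    innovativeness: log of the geometric mean of the two innovative shares
                    over the unchanged share (assumed form).
    early_entry:    log of new-to-the-market over new-to-the-firm share.
    """
    a = np.asarray(new_to_market, dtype=float)
    b = np.asarray(new_to_firm, dtype=float)
    c = np.asarray(unchanged, dtype=float)
    if np.any(a <= 0) or np.any(b <= 0) or np.any(c <= 0):
        raise ValueError("all shares must be strictly positive")
    innovativeness = np.log(np.sqrt(a * b) / c)
    early_entry = np.log(a / b)
    return innovativeness, early_entry

# Hypothetical turnover percentages for three firms (each row sums to 100).
a = np.array([10.0, 5.0, 30.0])   # new to the market
b = np.array([20.0, 5.0, 10.0])   # new to the firm only
c = np.array([70.0, 90.0, 60.0])  # unchanged or marginally modified

innovativeness, early_entry = cis_logratios(a, b, c)
print("innovativeness:", innovativeness)
print("early entry:   ", early_entry)
```

Both log-ratios are unbounded and unaffected by whether the shares are expressed as percentages or proportions, so they can serve as dependent variables in ordinary regression models.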
In August 2015 the Swedish newspaper Metro claimed that the Sweden Democrats were the largest political party in Sweden based on the results of a single poll. Inspired by this claim, we discuss how it may be tested and then move on to analyse the parameter space in question, the unit simplex. We show that the parameter space can be suitably partitioned, leading to a maximum likelihood ratio test of whether a specific proportion is the greatest, assuming that the observed frequencies come from a multinomial distribution. We derive the distribution of the test statistic under the null hypothesis. Under the null hypothesis the part (or proportion) of interest is not the greatest, i.e. there is at least one other part that is equal or greater. The boundary between the two partitions of the parameter space is the subset where the part of interest and one or more other parts are equal. The situation in which more than two parts are equal prevents the test statistic from having a simple chi-squared distribution. However, we present an approximation and argue by means of a simulation study that the approximation works well. Finally, we apply the test to the data presented in the Metro newspaper. Obtaining a p-value of approximately 0.1, we are not able to conclude that the Sweden Democrats are the largest party in Sweden at a five per cent level of significance.
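The flavour of such a test can be sketched for the simplest boundary case, in which the part of interest is tied with exactly one other part. Under that assumption the constrained multinomial estimate pools the count of interest with the runner-up count. The reference distribution used here, an equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom, is a common choice for a single one-sided constraint and is not necessarily the approximation derived by the authors. The function names and the counts are hypothetical and are not the Metro poll figures.

```python
import math

def _n_log_ratio(n, m):
    """n * log(n / m), with the convention 0 * log(0) = 0."""
    return n * math.log(n / m) if n > 0 else 0.0

def lr_statistic(counts, k):
    """Likelihood ratio statistic for H0: part k is not the largest."""
    n_k = counts[k]
    runner_up = max(c for i, c in enumerate(counts) if i != k)
    if n_k <= runner_up:
        return 0.0  # the unconstrained estimate already satisfies H0
    # Constrained estimate on the boundary: both parts share the pooled mean.
    pooled = (n_k + runner_up) / 2.0
    return 2.0 * (_n_log_ratio(n_k, pooled) + _n_log_ratio(runner_up, pooled))

def half_chi2_pvalue(stat):
    """P-value under the assumed 0.5*chi2(0) + 0.5*chi2(1) mixture."""
    if stat <= 0.0:
        return 1.0
    # P(chi2_1 > t) = erfc(sqrt(t / 2))
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))

# Hypothetical poll counts; the part of interest is index 0.
counts = [300, 270, 230, 200]
stat = lr_statistic(counts, 0)
print("LR statistic:", stat, "p-value:", half_chi2_pvalue(stat))
```

Only the runner-up enters the statistic because pooling with any smaller part would lower the constrained likelihood further. Ties among three or more parts are exactly the case this sketch does not cover.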
Many surveys consist of presenting D items and requiring the individual to assign each of them a value or score on a scale, for instance, from 1 to 10. This is frequently known as scoring on a (discrete or continuous) Likert scale. The meaning of these scales is the subject of frequent discussion. For instance, are scores on an absolute, although subjective, scale? Or is the information provided only ordinal? Consequently, the methods used to analyse this type of data are also controversial. Here, data on Likert scales are assumed to convey mainly ordinal information but, with some mild additional assumptions, they can be treated as compositional data.
The proposed way is as follows. The collection of D items can be ordered according to their scores on the Likert scale. Ties are admissible by allowing fractional ranks. According to Thurston, the ranks of the items can be viewed as a composition. The reason given was that the ranks add to a constant. From an updated compositional point of view, this constant sum is not critical for considering a vector as a composition. The main point is the invariance of the information under scaling of the ranks by positive constants, which in fact holds in this case. For instance, with D=4, items could be ordered by preference as (1, 2.5, 2.5, 4), but the ordering is equivalent to that in (10, 25, 25, 40). Once these ranking vectors are placed in the D-part simplex, the Aitchison geometry of the simplex and its derived log-ratio procedures are available. Under the assumption that the Aitchison geometry makes sense, the survey data can be analyzed using log-ratio methods.
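The steps from scores to a composition can be sketched as below. The function names and the example scores are hypothetical, and ranking in ascending order of score (a higher score receives a higher rank) is an assumption. Tied items receive the average of the ranks they span, which reproduces the fractional ranks (1, 2.5, 2.5, 4) of the example.

```python
import numpy as np

def fractional_ranks(scores):
    """Ascending ranks, with ties sharing the average of their positions."""
    s = np.asarray(scores, dtype=float)
    less = (s[None, :] < s[:, None]).sum(axis=1)    # items scored strictly lower
    equal = (s[None, :] == s[:, None]).sum(axis=1)  # items tied (including self)
    return less + (equal + 1) / 2.0

def closure(x):
    """Rescale a vector so that its parts sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    """Centred log ratio of a positive vector."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

# Hypothetical Likert scores of one respondent on D = 4 items.
scores = [3, 7, 7, 9]
ranks = fractional_ranks(scores)
print("ranks:      ", ranks)
print("composition:", closure(ranks))
print("clr:        ", clr(ranks))
```

Multiplying the ranks by any positive constant, for instance replacing (1, 2.5, 2.5, 4) with (10, 25, 25, 40), leaves both the closed composition and the clr vector unchanged.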
Some details may be important. When there is a non-response on the Likert scales, the ranks of the items are evaluated in a subcomposition (the constant sum is lost, but the data are still defined in a subcomposition). Association between items is measured by the variation matrix; the answers of an individual, coded as a D-part composition, can be represented in log-ratio coordinates; the relation between items and individuals can be visualized using compositional biplots; total variance can be decomposed into variances of simple log-ratios, of centered log-ratio components, or of ilr-coordinates, giving the opportunity of further traditional statistical analyses.
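The three decompositions of total variance can be checked numerically with a short sketch. The data are simulated from a Dirichlet distribution purely for illustration, and the ilr coordinates use one particular orthonormal basis (a pivot-type basis built in the hypothetical helper `pivot_basis`); any other orthonormal basis of the clr plane would give the same total.

```python
import numpy as np

def clr(X):
    """Row-wise centred log ratio transformation."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def variation_matrix(X):
    """T[i, j] = sample variance of log(x_i / x_j)."""
    logX = np.log(X)
    return (logX[:, :, None] - logX[:, None, :]).var(axis=0, ddof=1)

def pivot_basis(D):
    """D x (D-1) matrix with orthonormal columns, each summing to zero."""
    V = np.zeros((D, D - 1))
    for k in range(1, D):
        V[:k, k - 1] = 1.0 / k
        V[k, k - 1] = -1.0
        V[:, k - 1] *= np.sqrt(k / (k + 1.0))
    return V

# Simulated 4-part compositions (illustrative only, not survey data).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(4), size=50)
D = X.shape[1]

total_from_clr = clr(X).var(axis=0, ddof=1).sum()
total_from_pairs = variation_matrix(X).sum() / (2 * D)
total_from_ilr = (clr(X) @ pivot_basis(D)).var(axis=0, ddof=1).sum()
print(total_from_clr, total_from_pairs, total_from_ilr)
```

The three totals agree up to rounding, so each decomposition attributes the same overall variability to different sets of log-ratios.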
A simple case of a survey on Likert scales is used to illustrate the potential of the methods.