In recent years, there has been a growing awareness in the field of Second Language Acquisition (SLA) that research into some of the outstanding questions on language development is contingent upon the use of appropriate corpora. Previous studies in SLA have too often relied on cross-sectional corpora and have frequently been biased towards English (Myles 2005). Whereas cross-sectional corpora rely on group means to approximate developmental trajectories, longitudinal corpora allow for a more direct examination of linguistic development.
However, compiling longitudinal corpora poses numerous practical challenges: data collection is often limited in timespan, group size is difficult to maintain because of participant dropout, and it may be difficult to obtain equivalent data without repetition effects. In multilingual corpora, not only are these challenges repeated for each additional language, but the corpora should also be as equivalent as possible. For corpora of linguistic development, this equivalence should ideally hold for, among other things, participant background, the task completed, and proficiency levels or developmental stages.
Additionally, research departments have often accumulated a considerable body of data that, though not initially conceived as a multilingual corpus, may share enough characteristics to constitute one. Reasons for integrating such data into a larger corpus may be practical and ecological, such as the amount of time and resources originally invested in the data gathering, but may also stem from the untapped potential of these data.
On the basis of a case study, we will discuss some of the challenges encountered when constituting a multilingual corpus for SLA research, and possible solutions to these issues. More specifically, we will address the notion of corpus equivalence and the extent to which it is feasible and necessary. Additionally, we will consider a number of methods to triangulate linguistic proficiency. We will also discuss the implications for research, in terms of what can and cannot be expected from such a multilingual corpus, in light of our own data. The need to assemble a multilingual corpus arose from two related PhD projects that aim to compare linguistic development in oral production for English and French as second languages, across a timeframe spanning from early to fairly advanced proficiency.

Original language: English
Title of host publication: Actes des 12es Journées Internationales d'Analyse Statistique des Données Textuelles
Editors: Emilie Née, Jean Michel Daube, Mathieu Valette, Serge Fleury
Event: JADT 2014 - 12es Journées internationales d'Analyse statistique des Données Textuelles - Paris, France
Duration: 3 Jun 2014 - 6 Jun 2014


Conference: JADT 2014 - 12es Journées internationales d'Analyse statistique des Données Textuelles

Keywords: corpus linguistics, second language acquisition, multilingualism, multilingual corpus

