Formulaic Language Project

This project explores the factors involved in measuring repeated word sequences in language sampled from a range of corpora. We use a variety of corpora and sampling techniques to explore statistically the interaction between computationally identified recurrent sequences and a group of key independent variables.

Overview

We are examining how recurrent sequences of words are measured across various corpora. The aim is to better understand the factors that affect the use of formulaic language in spoken and written language of different text types, as produced by native speakers, nonnative speakers and children.

Background

The growth of research into phraseology, native-like formulaic language use, and the idiom principle in corpus linguistics, cognitive linguistics, psycholinguistics and applied linguistics (Ellis, 2008) calls for clarity of conceptualization across these fields (Gries, 2008) and, more so, for basic investigations into the metrics for operationalizing formulaicity in texts.

One of the primary aims of this project is to investigate the effect of a broad range of independent variables upon different measures for identifying formulaic (recurrent) sequences. These variables include text and corpus size, number of authors, register (spoken/written) and genre differences, vocabulary diversity (such as type-token ratio and entropy), speaker age, nativeness and proficiency. We are interested in contiguous sequences (n-grams), limited-span frameworks (phrase frames and skip-grams) and collocational clusters (item-sets and concgrams). The measures evaluated include frequency above a certain threshold, Mutual Information, t-score, gravity counts, various reference lists of formulas and various dispersion measures.
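To make a few of these quantities concrete, the sketch below computes two vocabulary-diversity figures (type-token ratio and unigram entropy) and, for each bigram above a frequency threshold, its Mutual Information and t-score. It is an illustration only, not the project's own implementation. The function names are hypothetical. The formulas are the common textbook formulations, assumed here rather than taken from the project: MI = log2(O/E) and t = (O - E)/sqrt(O), where O is the observed bigram count and E = f(w1) * f(w2) / N is the count expected if the two words were independent. As a simplification, N is taken to be the number of tokens in the sample.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(tokens):
    """Vocabulary diversity: distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def entropy(tokens):
    """Shannon entropy (in bits) of the unigram distribution."""
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bigram_scores(tokens, min_freq=2):
    """Score every bigram occurring at least min_freq times.

    Returns {bigram: (frequency, mutual_information, t_score)}.
    Assumed formulas: MI = log2(O / E), t = (O - E) / sqrt(O).
    """
    n = len(tokens)  # simplification: sample size taken as token count
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    scores = {}
    for (w1, w2), observed in bigram_counts.items():
        if observed < min_freq:
            continue
        # Count expected if w1 and w2 co-occurred only by chance.
        expected = unigram_counts[w1] * unigram_counts[w2] / n
        mi = math.log2(observed / expected)
        t = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, mi, t)
    return scores


# Toy sample; a real study would use tokenized corpus text.
sample = "in terms of the data in terms of the model".split()
for bigram, (freq, mi, t) in sorted(bigram_scores(sample).items()):
    print(bigram, freq, round(mi, 2), round(t, 2))
```

Even on a toy sample the two association measures behave differently: MI rewards pairs whose words rarely occur apart, while the t-score grows with raw frequency. That difference is one reason the measures need to be compared against variables such as text and corpus size.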

We want to determine how the different measures reflect the independent variables and how each measure contributes to the convergent and differential validity of operationalization. These exercises in corpus metrics further the standardization of the measurement of formulaicity in language texts, and they provide the foundations for triangulation with psycholinguistic definitions of formulaicity as well as for studies of first and second language acquisition, instruction, and evaluation.

References

  • Ellis, N. C. (2008). Phraseology: The periphery and the heart of language. In F. Meunier & S. Granger (Eds.), Phraseology in language learning and teaching (pp. 1-13). Amsterdam: John Benjamins.
  • Gries, S. T. (2008). Phraseology and linguistic theory: A brief survey. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective. Amsterdam: John Benjamins.

Project Members

Primary Researchers

  • Nick Ellis
  • Matt O’Donnell
  • Ute Römer
  • Stefan Gries
  • Stefanie Wulff

Research Assistants

  • Annie Devine (UROP student)
  • Kumud Bihani (UM School of Information intern)

Related presentations (2009)

Ellis, Nick C., Matthew Brook O’Donnell, Ute Römer, Stefan T. Gries & Stefanie Wulff. Measuring the formulaicity of language. Paper presented at the American Association for Applied Linguistics Annual Conference 2009, Denver, CO, 21-24 March.

O’Donnell, Matthew Brook & Ute Römer. Proficiency development and the phraseology of learner language. Paper presented at the 30th ICAME Conference 2009, Lancaster, UK, 27-31 May.

O’Donnell, Matthew Brook, Ute Römer & Nick C. Ellis. Examining formulaic sequences in corpora of second language writing. Paper presented at SLRF 2009, Michigan State University, East Lansing, MI, 29 October – 1 November.
