Program

Keynote speakers

Laurence ANTHONY (Waseda University, Japan)
Michael BARLOW (University of Auckland, New Zealand)
Stefan EVERT (Friedrich-Alexander-Universität, Germany)
Michaela MAHLBERG (University of Birmingham, UK)
Naoaki OKAZAKI (Tokyo Institute of Technology, Japan)

Plenary speech 1:

Corpus linguistics and the study of fiction – methodological and theoretical challenges
Michaela Mahlberg (University of Birmingham, UK)

Abstract:

Corpus research has traditionally focused on non-literary texts. Along with developments in Digital Humanities, there seems to be an increasing interest in the study of fictional texts, sometimes referred to under the umbrella term ‘corpus stylistics’ (Semino and Short 2004). In order to account as fully as possible for features of literary texts, we need to create new tools and develop methodologies that are tailored to the task at hand. In this paper, I will illustrate key functionalities of the web application CLiC (http://clic.bham.ac.uk/) that has been specifically designed for the corpus linguistic study of narrative fiction. The case studies that I will present look at textual patterns that contribute to the creation of fictional characters. The examples will be drawn from the CLiC corpora. The CLiC corpora comprise more than 130 books across four subcorpora: the corpus of Dickens’s Novels, the 19th Century Reference Corpus (19C), the Corpus of 19th Century Children’s Literature (ChiLit) and the Corpus of Additional Requested Texts (ArTs). For all CLiC texts, direct speech and specific places around speech have been marked up (Mahlberg et al. 2016). Hence, CLiC can run searches across defined textual subsets and support the analysis of features of narrative fiction. An important question is how a range of features and patterns in fiction can be brought together in a coherent theoretical framework. My suggestions towards such a framework focus on a lexically-driven approach to body language and raise more fundamental questions about how far corpus linguistics can change our theoretical perspective on fiction.

References

Mahlberg, M., Stockwell, P., Joode, J. de, Smith, C., & O’Donnell, M. B. (2016). CLiC Dickens: novel uses of concordances for the integration of corpus stylistics and cognitive poetics. Corpora, 11(3), 433–463.
https://www.euppublishing.com/doi/full/10.3366/cor.2016.0102
Semino, E., & Short, M. (2004). Corpus Stylistics. Speech, Writing and Thought Presentation in a Corpus of English Writing. London: Routledge.

Bio:

Michaela Mahlberg is Professor of corpus linguistics at the University of Birmingham, UK, where she is also the Director of the Centre for Corpus Research and the Director of Research and Knowledge Transfer for the College of Arts and Law. Michaela is the editor of the International Journal of Corpus Linguistics (John Benjamins) and together with Wolfgang Teubert she edits the book series Corpus and Discourse (Bloomsbury). One of her main areas of research is Dickens’s fiction and the socio-cultural context of the 19th century. Her publications include Corpus Stylistics and Dickens’s Fiction (Routledge, 2013), English General Nouns: a Corpus Theoretical Approach (John Benjamins, 2005) and Text, Discourse and Corpora. Theory and Analysis (Continuum, 2007, co-authored with Michael Hoey, Michael Stubbs and Wolfgang Teubert). Michaela was the Principal Investigator on the AHRC-funded project CLiC Dickens: Characterisation in the representation of speech and body language from a corpus linguistic perspective which led to the development of the CLiC web app.

Plenary speech 2:

Understanding and Advancing the Data-Driven Learning (DDL) Approach
Laurence Anthony (Waseda University, Japan)

Abstract:

Data-Driven Learning (DDL) is an inductive, self-directed approach to language learning in which learners interact directly or indirectly with a corpus under the guidance of the language instructor. Numerous empirical studies have shown DDL to be effective. DDL can empower learners to address their individual discipline-specific language needs in the classroom. It has also been shown to work with learners of different proficiency levels and ages, and in classrooms with and without computers installed. However, reports have revealed that both instructors and learners face major challenges when using the DDL approach. These include finding a suitable corpus, knowing what to search for when a corpus is available, reducing data overload when making observations, incorporating findings as part of learning, and contextualizing what is learned in the target field and discourse community. Other challenges relate to the download, installation and use of corpus analysis software. In this paper, I will first present a case for using DDL in the language learning classroom. I will then explain how the various challenges associated with DDL can be minimized or fully addressed through careful preparation, effective classroom practices, and the use of innovative support tools. Finally, I will discuss how language program administrators and instructors can contribute to the future development of DDL tools and methods.

Keywords

data-driven learning, DDL, AntConc, AntCorGen, AntWordProfiler, ESP, EAP, writing

Bio:

Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan. He has a BSc degree (Mathematical Physics) from the University of Manchester, UK, and MA (TESL/TEFL) and PhD (Applied Linguistics) degrees from the University of Birmingham, UK. He is a former Director and the current coordinator of graduate school English in the Center for English Language Education in Science and Engineering (CELESE). His main research interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design. He is the developer of various corpus tools including AntConc, AntCorGen, AntWordProfiler, FireAnt, ProtAnt, and TagAnt.

Plenary speech 3:

Distributional Methods in Corpus Linguistics: Towards a Hermeneutic Cyborg
Stefan Evert (Friedrich-Alexander-Universität, Germany)

Abstract:

Distributional approaches in natural language processing are based on the assumption that the meaning of a linguistic item (typically a word or phrase) can be inferred from its distribution in language. More precisely, the distributional hypothesis stipulates that two items differ in meaning to the same degree as they differ in their distribution, which is operationalized in the form of a high-dimensional co-occurrence feature vector. Such distributional models (DM) can flexibly be applied to a wide range of tasks depending on how the target and feature items are chosen (texts, sentences, phrases, words, morphemes, or word pairs).
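The idea of a co-occurrence feature vector can be illustrated with a minimal sketch. This is not the speaker's own implementation: the toy corpus, the window size and the function names `cooccurrence_vectors` and `cosine_similarity` are assumptions chosen for illustration, and real distributional models typically add frequency weighting and dimensionality reduction.

```python
from collections import Counter
import math


def cooccurrence_vectors(tokens, window=2):
    """Count, for every token, the tokens seen within `window` positions.

    Hypothetical helper: a symmetric, unweighted context window that
    ignores sentence boundaries, which is enough for a toy example.
    """
    vectors = {}
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        vectors.setdefault(target, Counter()).update(context)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus, for illustration only.
toy_corpus = (
    "the cat drinks milk . the dog drinks water . the cat chases the dog ."
).split()

vectors = cooccurrence_vectors(toy_corpus, window=2)
sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
print(f"cat ~ dog: {sim_cat_dog:.3f}")
```

Under the distributional hypothesis, a higher cosine value between two such vectors is read as a smaller difference in meaning; choosing texts or phrases instead of words as targets changes only what is counted.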

After giving an overview of the fundamental techniques and parameters of distributional approaches, my talk focuses on two applications in the field of corpus linguistics. I will use these examples to illustrate how DM can help overcome shortcomings of essential corpus analysis techniques.

At the level of texts, DM provide an unsupervised account of stylistic variation across authors, registers and domains. As a result, such effects can systematically be included in frequency comparisons and similar quantitative analyses, leading to much more accurate findings.

At the level of words, DM create a fine-grained semantic representation based on collocational profiles, which play a central role in computational lexicography and corpus-based discourse analysis. As a result, innovative corpus analysis tools can be developed that combine sophisticated computational techniques with human interpretation into an iterative hermeneutic procedure – merging man and machine into what I like to call the Hermeneutic Cyborg.

Bio:

Stefan Evert holds the Chair of Computational Corpus Linguistics at the University of Erlangen-Nuremberg, Germany. After studying mathematics, physics and English linguistics, he received a PhD degree in computational linguistics from the University of Stuttgart, Germany. His research interests include the statistical analysis of corpus frequency data (significance tests in corpus linguistics, statistical association measures, Zipf’s law and word frequency distributions), quantitative approaches to lexical semantics (collocations, multiword expressions and distributional semantics), multidimensional analysis (linguistic variation, language comparison, translation studies), as well as processing large text corpora (IMS Open Corpus Workbench, data models and query languages, Web as corpus, sentiment analysis, text clustering).

Web site:

www.stefan-evert.de
www.linguistik.fau.de

Plenary speech 4:

How Deep Learning Changes Natural Language Processing
Naoaki Okazaki (Tokyo Institute of Technology, Japan)

Abstract:

When I first saw the success of deep learning in other research areas, I did not expect that it would greatly influence research in Natural Language Processing (NLP). Although we are still not sure whether deep learning really advances the technologies needed to realize computers that can truly understand language, it is now very common in the NLP community, achieving state-of-the-art performance in most tasks. In this talk, I briefly review the advantages of deep learning over conventional methods of machine learning, e.g., automatic feature extraction, generic gradient-based learning, end-to-end learning, and versatile software frameworks. I then explain the key ideas of deep learning that have been widely accepted in NLP: distributed representations of words/phrases/sentences, encoder-decoder models, attention mechanisms, etc. Deep learning has not only provided an alternative approach to statistical NLP, but also bridged NLP to other research areas and increased the ‘bravery’ of NLP research. I will explain the recent trends in NLP research, including multi-modal processing and context modeling. I conclude this talk by summarizing the future prospects of NLP.

Bio:

Dr. Naoaki Okazaki is a professor at the School of Computing, Tokyo Institute of Technology. Prior to this position, he worked as a research fellow at the National Centre for Text Mining (NaCTeM) (in 2005), as a post-doctoral researcher at the University of Tokyo (in 2007-2011), and as an associate professor in the Graduate School of Information Sciences, Tohoku University (2011-2017). He obtained his PhD from the Graduate School of Information Science and Technology, University of Tokyo in 2007. He has served as a technical consultant at SmartNews Inc. since 2013. He is also a visiting research scholar of the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST). His research interests include natural language processing, text mining, and machine learning. He is also known as a developer of open-source software toolkits including CRFSuite (a fast implementation of Conditional Random Fields), SimString (an implementation for approximate string matching), and libLBFGS (a C library of L-BFGS).

His research is highly regarded nationally and internationally. Dr. Naoaki Okazaki has published more than 120 refereed papers in journals and conference proceedings including ACL, NAACL, EMNLP, COLING, IJCNLP, and AAAI. He is a recipient of various awards: the Young Scientists’ Prize, the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT); Funai Research Award; Docomo Mobile Science Award; and several Best Paper Awards at PACLIC 29 and domestic NLP conferences in Japan. In addition, he has contributed prominently to the academic and scientific community. He has served as area chair (ACL 2012 and 2016), workshop co-chair (IJCNLP 2013), and publication chair (EMNLP-CoNLL 2012). He has joined numerous program committees for international conferences including AAAI, IJCAI, ACL, NAACL, EMNLP, EACL, and COLING. He has also been in the TACL elite standing reviewing pool.

Web site:

http://www.chokkan.org/index.html.en

Plenary speech 5:

The Individual and the Group in Corpus Linguistics
Michael Barlow (University of Auckland, New Zealand)

Abstract:

In this presentation I examine samples of the spoken output of five White House Press Secretaries in order to investigate the nature and extent of individual differences in production. While there are many idiosyncratic differences in grammar and style in the speech of individuals, the aim here is to investigate very frequent components of grammar. In this study, we examine differences in the distribution of the most common bigrams in spoken usage. The results show that even with common bigrams such as ‘of the’, there are major distinctions in usage by different press secretaries, and we find that the inter-speaker variation for a range of bigrams is much greater than the intra-speaker variation. The results pose interesting questions for the relationship between the language of the individual and the language of the group and for the representation of grammar in relation to production and comprehension.
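The contrast between inter-speaker and intra-speaker variation can be sketched as follows. This is an illustrative assumption, not the procedure used in the study: the speaker labels and sentences are invented, the function names `bigram_rate` and `variation` are hypothetical, and variation is measured here simply as a population standard deviation of per-sample rates.

```python
from statistics import mean, pstdev


def bigram_rate(tokens, bigram, per=1000):
    """Occurrences of `bigram` per `per` bigrams in one sample."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return per * pairs.count(bigram) / len(pairs)


def variation(samples_by_speaker, bigram):
    """Return (inter_speaker, intra_speaker) variation for one bigram.

    inter: standard deviation of the speakers' mean rates.
    intra: mean of each speaker's standard deviation across samples.
    """
    rates = {
        speaker: [bigram_rate(text.lower().split(), bigram) for text in texts]
        for speaker, texts in samples_by_speaker.items()
    }
    inter = pstdev([mean(r) for r in rates.values()])
    intra = mean([pstdev(r) for r in rates.values()])
    return inter, intra


# Invented samples, for illustration only (not real transcripts).
samples = {
    "speaker_a": [
        "the view of the president is clear",
        "the position of the house is the same",
    ],
    "speaker_b": [
        "we will look into that for you",
        "i think the president said that",
    ],
}

inter, intra = variation(samples, ("of", "the"))
print(f"inter-speaker: {inter:.1f}, intra-speaker: {intra:.1f}")
```

With real data, each speaker would contribute many samples of comparable size, and a finding like the one reported above would correspond to the first value being consistently larger than the second across a range of bigrams.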

Bio:

Michael Barlow received his PhD in Linguistics from Stanford University. He is currently Associate Professor in Applied Language Studies at the University of Auckland in New Zealand. Dr. Barlow has written books and articles on corpus linguistics and regularly gives presentations, courses and workshops at institutions and conferences around the world. He has created several text analysis programs including concordancers MonoConc and ParaConc and a collocation extraction program, Collocate. A recently developed program, WordSkew, is designed to apply corpus analysis techniques while at the same time taking note of the structure of texts.