Language in social media is characterised by more formal (written-like) or more informal (spoken-like) style in different contexts, and thus shows high variability. In this project, we focus on one linguistic domain in pragmatics, the management of common ground between writers and readers, and identify the consistent patterns of discourse strategies employed by writers across different groups and channels. We explore three types of phenomena that relate to common ground management: question tags, coreferential expressions, and coherence markers.
Question tags are particles that attach to a clause, typically a declarative, to yield a kind of confirmation request. We conducted an extensive corpus study investigating the contexts and functions of different question tag variants in German. We found significant differences both between the functions of individual tags and in the use of tags across conversational corpora (Twitter and spoken corpora), showing that only some uses of tags carry over from speech to written conversation. We are currently working on both computational and formal linguistic models that capture this variability.
Regarding coreferential relations, the research literature has yielded partly conflicting results, but it is generally accepted that their behavior differs between spoken and written language, for example in the length of referential chains and in the type of expression used (such as a pronoun or a full noun phrase). We extended this research to include social media conversations from Twitter, showing that coreferential relations on Twitter are more similar to those in spoken data than to those in written data. Subsequently, we adapted a computational model for automatic coreference resolution to better capture the idiosyncrasies of social media conversations.
Finally, we are investigating the realization of coherence relations in different social media. In existing corpus research, it is often unclear whether differences between corpora are due to confounds such as the topic of discourse, the authors or speakers included in the corpus, the language, or the time of recording. We address this by studying texts from two types of social media (Twitter and blogs) written by the same authors on similar topics. This allows us to separate the effect of individual medium constraints, such as the mode (spoken vs. written) or the text type (narrative vs. interactive), from individual stylistic variation and topic effects, and to identify what stays stable with respect to coherence relation marking across all these dimensions.