Italian WhatsApp Corpus

Photo: Pixabay

Freya Hewett

The corpus consists of ca. 6640 Italian WhatsApp messages, collected from users based in Germany and Italy between January and March 2017 as part of Freya Hewett's bachelor thesis.

The data includes only written messages; audio and images are omitted. The messages originate from 16 chats: 11 one-on-one chats and 5 group chats. The corpus contains 45288 tokens and 12153 types (these figures were calculated on the edited conversations and also count the pseudonym of the person who sent each message).
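To make the distinction between tokens and types concrete, the following is a minimal sketch of how such counts could be computed. It is not the procedure used for the corpus: the tokenization method is not documented here, so the whitespace splitting and lowercasing are assumptions, as are the `pseudonym: message` line format, the sample pseudonyms, and the function name `count_tokens_and_types`.

```python
from collections import Counter


def count_tokens_and_types(lines):
    """Return (token_count, type_count) for an iterable of chat lines.

    Assumption: tokens are whitespace-separated strings, lowercased.
    The sender's pseudonym is part of each line and is therefore counted,
    mirroring the note that pseudonyms are included in the figures.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Tokens are all occurrences; types are the distinct forms.
    return sum(counts.values()), len(counts)


# Hypothetical sample in an assumed "pseudonym: message" format.
sample = [
    "Anna: ciao come stai",
    "Luca: ciao tutto bene",
]

tokens, types = count_tokens_and_types(sample)
print(tokens, types)
```

In this sample "ciao" occurs twice, so it contributes two tokens but only one type. A different tokenizer (for example one that splits off punctuation or emoji) would yield different totals.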

There are 39 participants in total, made up of 26 females and 13 males. 29 of the participants are native Italian speakers, 2 are bilingual with Italian as a native language and the remaining 8 are not native Italian speakers but are capable of composing and presumably also understanding Italian messages. The average (mean) age of the participants was 29 at the time the messages were submitted.


The corpus comprises approximately 6640 Italian-language messages, for a total of 45288 words. It includes both conversations between two participants (one-on-one) and group chats.