eBay petites annonces
A collection of 1,256 online auction listings (petites annonces) collected from the platform eBay.fr, covering a time span of 13 years. The corpus is split into four subcorpora. The first subcorpus consists of 300 listings from private users, collected in 2005. The second and third subcorpora are from 2017 and contain 300 listings each, from private and professional users respectively. The fourth subcorpus was created in 2018 and contains 356 listings from private users.
How the listings were collected
- The first wave of the corpus (e05p) consists of 300 listings. An empty search was submitted, which returned all active listings on the site and therefore a wide range of categories. To include only users who were not professional sellers on eBay, the listings were pruned so that only users with fewer than 200 ratings remained, and each user appears only once in the corpus. Listings with extensive delivery or returns information were also excluded, as were listings from shops.
- This corpus was replicated in 2017 (e17p), and an additional corpus of listings from professional users was created (e17x: users with a shop and more than 200 listings, checked manually). An empty search returning all listings was no longer possible in 2017; instead, a category had to be chosen. The 2017 corpus was therefore built to follow the distribution of categories found in 2005.
- In 2018, we used the web scraping tool ParseHub to collect more listings automatically (e18v). We used the search term vraiment to select listings that were more likely to be from private sellers. More than 10,000 listings were collected. We then filtered the results to keep only one listing per user and only users with no more than 1000 ratings. Listings containing descriptions that were copy-pasted from elsewhere, or expressions that indicated a professional activity (mon stock, mes photos, ma boutique, mes autres, shipping, tracklist, welcome, ask, regroupez, regroupe, ASUS GTX), were excluded. We manually annotated the use of vraiment as a stance marker and excluded listings which used vraiment in a different way. This left 356 listings.
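The automatic part of the 2018 filtering can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the `Listing` record, its field names, the choice to keep the first listing seen per user, and the whole-word, case-insensitive matching of the excluded expressions are all assumptions. The detection of copy-pasted descriptions and the manual annotation of vraiment are not reproduced.

```python
import re
from dataclasses import dataclass

# Expressions named in the description as indicating a professional activity.
EXCLUDED_EXPRESSIONS = (
    "mon stock", "mes photos", "ma boutique", "mes autres", "shipping",
    "tracklist", "welcome", "ask", "regroupez", "regroupe", "ASUS GTX",
)

# Assumption: expressions are matched as whole words, ignoring case.
EXCLUDED_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(e) for e in EXCLUDED_EXPRESSIONS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Listing:
    """Hypothetical record for one scraped listing."""
    user: str
    ratings: int
    text: str


def filter_listings(listings, max_ratings=1000):
    """Keep one listing per user, drop heavily rated users and flagged texts."""
    seen_users = set()
    kept = []
    for listing in listings:
        if listing.ratings > max_ratings:
            continue  # more than 1000 ratings: likely a professional seller
        if listing.user in seen_users:
            continue  # assumption: the first listing seen for a user is kept
        if EXCLUDED_PATTERN.search(listing.text):
            continue  # contains an expression indicating professional activity
        seen_users.add(listing.user)
        kept.append(listing)
    return kept


sample = [
    Listing("u1", 10, "vraiment joli, voir ma boutique"),
    Listing("u2", 50, "Vraiment très beau pull"),
    Listing("u2", 50, "vraiment top"),
    Listing("u3", 5000, "vraiment bien"),
    Listing("u4", 3, "basket vraiment neuve"),
]
kept_users = [listing.user for listing in filter_listings(sample)]
```

With whole-word matching, "ask" does not exclude a listing that merely contains "basket"; plain substring matching would.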
The corpus has been annotated for various features. The tags and their meanings are as follows:
- ann: abbreviations or ‘for sale’ equivalents (tbe, je vends, vds)
- bon: use of an evaluative attribute at the very beginning of the listing
- ego: use of je
- stn/sty: non-standard or standard usage of past participle agreement or negation
- pre: presentatives (il y a, c’est)
- vst: vraiment as a stance marker ("it’s really nice")
- emo: emoticons
- enc: use of bonnes enchères (happy bidding)
- imp: most frequent imperative forms (hésitez, consulter, regardez)
- att: evaluative attributes (not at the beginning of the listing)
In addition to these tags, which are used consistently throughout all four subcorpora, the first subcorpus (2005) contains extra tags:
- acc: accents which are missing or are non-standard
- ang: anglicisms
- con: contact details
- inf: information
- lex: informal lexical items
- ort: orthographical ‘mistakes’
- pub: marketing language
- slo: use of slogans
- syn: syntax, topicalisation
Average length of listings:
e05p: 43 tokens, e17p: 49 tokens, e17x: 177 tokens, e18v: 97 tokens.
The length of the listing varies depending on the category.
Distribution of categories in the first three subcorpora (e05p, e17p, e17x):
- maison: 41 listings
- voiture et moto: 21
- vêtements: 122
- PC et téléphone: 20
- enfant: 14
- collections: 39
- loisir: 41
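Since the 2005 category distribution served as the template for the 2017 subcorpora, it can help to express the counts above as shares. The following is a minimal sketch; the function name `category_shares` is made up for illustration.

```python
# Category counts as listed above.
CATEGORY_COUNTS = {
    "maison": 41,
    "voiture et moto": 21,
    "vêtements": 122,
    "PC et téléphone": 20,
    "enfant": 14,
    "collections": 39,
    "loisir": 41,
}


def category_shares(counts):
    """Return each category's fraction of the total number of listings."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


total_listed = sum(CATEGORY_COUNTS.values())
shares = category_shares(CATEGORY_COUNTS)
largest = max(shares, key=shares.get)
```

The listed counts sum to 298, and vêtements is by far the largest category, at roughly 41% of the listed total.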
The XML file contains various metadata for each listing: a unique ID, the year and month it was collected in and the category the listing belongs to.
Some subcorpora have additional metadata, listed below:
- e18v: number of ratings the user has
- e05p: ‘svo’ – 0/1, if the listing contains at least one well-formed sentence with subject-verb-object
- e17p: ‘text’ – Y/N, if the listing resembles a text with sentences and punctuation or not
- e17p: the listing is split into two categories, either ‘inf’ or ‘ad’ – ‘inf’ refers to information that is either copy-pasted or numerical details (e.g. dimensions), ‘ad’ refers to everything written by the user
The corpus is available to download in XML format. The file contains all four subcorpora, and each listing has a unique ID, the year and month it was collected in, the category the listing is from, and some additional tags (as explained above).
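Reading the per-listing metadata might look like the sketch below. The actual XML schema is not documented here, so the element name `listing`, the attribute names `id`, `year`, `month` and `category`, and the sample values are all hypothetical; they should be adapted to the names used in the downloaded file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real file's element and attribute names may differ.
SAMPLE_XML = """<corpus>
  <listing id="e05p_001" year="2005" month="11" category="maison">texte</listing>
  <listing id="e05p_002" year="2005" month="11" category="loisir">texte</listing>
  <listing id="e17p_001" year="2017" month="03" category="maison">texte</listing>
</corpus>"""


def read_metadata(xml_text):
    """Return one dict of attributes per <listing> element (assumed layout)."""
    root = ET.fromstring(xml_text)
    return [dict(listing.attrib) for listing in root.iter("listing")]


records = read_metadata(SAMPLE_XML)
per_category = Counter(record["category"] for record in records)
per_year = Counter(record["year"] for record in records)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring`, and the same counters give a quick check of the subcorpus sizes and category distribution.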
PDFs of screenshots of the listings are also available for all subcorpora, although for technical reasons only the first 249 are available for the 2018 subcorpus.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Source: Gerstenberg, Annette, Valerie Hekkel & Freya Hewett. 2019. Online Auction Listings Between Community and Commerce. In Julien Longhi & Claudia Marinica (eds.), Proceedings of the 7th Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora2019), 9–10 September 2019, Cergy-Pontoise University, France, 1–5. Cergy-Pontoise: scienceconf.org.
eBay.fr-corpus = Gerstenberg, Annette & Freya Hewett. 2019. A collection of online auction listings from 2005 to 2018 (anonymised). University of Potsdam: LA-bank. https://www.uni-potsdam.de/langage/la-bank/ebay.php
Please always cite this corpus, using the source given above, if you use it in further work.