Metadata

Average length of listings in the four subcorpora:

  • e05p: 43 tokens
  • e17p: 49 tokens
  • e17x: 177 tokens
  • e18v: 97 tokens

The length of the listing varies depending on the category.

The XML file contains metadata for each listing: a unique ID, the year and month in which it was collected, and the category the listing belongs to.
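
As a minimal sketch of how this metadata might be read, the Python snippet below parses a small inline sample. The element name `listing` and the attribute names `id`, `year`, `month` and `category` are assumptions made for illustration, as is the sample text; the actual names in the XML file may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumed,
# not taken from the corpus itself.
SAMPLE = """
<corpus>
  <listing id="a1" year="2017" month="03" category="maison">table en bois 120 cm</listing>
  <listing id="a2" year="2017" month="04" category="loisir">vélo enfant bon état</listing>
</corpus>
"""

def read_metadata(xml_text):
    """Return one dict per listing: ID, collection date, category, token count."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for node in root.iter("listing"):
        text = node.text or ""
        rows.append({
            "id": node.get("id"),
            "year": int(node.get("year")),
            "month": int(node.get("month")),
            "category": node.get("category"),
            # Whitespace tokens only; the corpus may tokenize differently.
            "tokens": len(text.split()),
        })
    return rows

rows = read_metadata(SAMPLE)
for row in rows:
    print(row)
```

Averaging the `tokens` field per subcorpus would give figures comparable to the averages listed above, provided the same tokenization is used.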


Distribution of categories in the first three subcorpora (e05p, e17c, e17p):

Category            Distribution
maison              41
voiture et moto     21
vêtêments           122
PC et téléphone     20
enfant              14
collections         39
loisir              41
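
A distribution like this one could be recomputed from the per-listing category labels. The sketch below assumes the Distribution column holds listing counts, which the source does not state explicitly; the `labels` list is hypothetical sample data, not corpus content.

```python
from collections import Counter

# Hypothetical category labels, one per listing; not real corpus data.
labels = ["maison", "loisir", "maison", "enfant", "maison", "loisir"]

def category_distribution(categories):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]

dist = category_distribution(labels)
for cat, n, share in dist:
    print(f"{cat:<10} {n:>3} {share:.1%}")
```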

Additional metadata

Some subcorpora have additional metadata, listed below:

  • e18v: number of ratings the user has
  • e05p: ‘svo’ (0/1): whether the listing contains at least one well-formed sentence with subject, verb and object
  • e17p: ‘text’ (Y/N): whether the listing resembles running text with sentences and punctuation
  • e17p: each listing is split into two parts, labelled ‘inf’ or ‘ad’; ‘inf’ covers information that is either copy-pasted or numerical detail (e.g. dimensions), and ‘ad’ covers everything written by the user
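
These flags can be used to select a subset of listings. The following sketch assumes that ‘text’ is stored as an attribute on a `listing` element; both the element name and the attribute placement are assumptions, and the sample listings are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical e17p-style sample; the real markup may differ.
SAMPLE = """<corpus>
  <listing id="p1" text="Y">Je vends une table en très bon état.</listing>
  <listing id="p2" text="N">table bois 120x80</listing>
</corpus>"""

def select_by_flag(xml_text, flag, value):
    """Return the IDs of listings whose attribute `flag` equals `value`."""
    root = ET.fromstring(xml_text)
    return [n.get("id") for n in root.iter("listing") if n.get(flag) == value]

# Listings that resemble running text
text_like = select_by_flag(SAMPLE, "text", "Y")
print(text_like)
```

The same helper would apply to the ‘svo’ flag in e05p by passing `"svo"` and `"1"`, again assuming it is stored as an attribute.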