Metadata
Average length of listings in the four subcorpora:
- e05p: 43 tokens
- e17p: 49 tokens
- e17x: 177 tokens
- e18v: 97 tokens
The length of the listing varies depending on the category.
The XML file records several metadata fields for each listing: a unique ID, the year and month in which it was collected, and the category it belongs to.
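The metadata can be read with any standard XML parser. Below is a minimal sketch, assuming each listing is a `<listing>` element with `id`, `year`, `month` and `category` attributes. These element and attribute names are hypothetical and the actual schema of the corpus file may differ. Token counts here use naive whitespace splitting, which may not match the tokenization behind the averages reported above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; element and attribute names are assumptions,
# not the documented schema of the corpus.
SAMPLE = """<corpus>
  <listing id="a1" year="2017" month="03" category="maison">Table en bois</listing>
  <listing id="a2" year="2017" month="04" category="loisir">Vélo enfant bon état</listing>
  <listing id="a3" year="2017" month="04" category="maison">Lampe</listing>
</corpus>"""


def read_metadata(xml_text):
    """Return one dict per listing with its metadata and a whitespace token count."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.iter("listing"):
        text = el.text or ""
        rows.append({
            "id": el.get("id"),
            "year": el.get("year"),
            "month": el.get("month"),
            "category": el.get("category"),
            "n_tokens": len(text.split()),  # naive whitespace tokenization
        })
    return rows


rows = read_metadata(SAMPLE)
# Number of listings per category in the sample.
category_counts = Counter(r["category"] for r in rows)
```

Grouping `rows` by `category` and averaging `n_tokens` would give per-category listing lengths, under the same tokenization assumption.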
Distribution of categories in the first three subcorpora (e05p, e17c, e17p):
| Category | Distribution |
|---|---|
| maison | 41 |
| voiture et moto | 21 |
| vêtements | 122 |
| PC et téléphone | 20 |
| enfant | 14 |
| collections | 39 |
| loisir | 41 |
Additional metadata
Some subcorpora have additional metadata, listed below:
- e18v: number of ratings the user has
- e05p: ‘svo’ – 0/1, whether the listing contains at least one well-formed subject-verb-object sentence
- e17p: ‘text’ – Y/N, whether the listing resembles running text with sentences and punctuation
- e17p: each listing is split into segments of two types, ‘inf’ and ‘ad’ – ‘inf’ covers information that is copy-pasted or consists of numerical details (e.g. dimensions); ‘ad’ covers everything written by the user
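The additional annotations can be used to filter listings or isolate user-written text. The sketch below assumes that the e17p ‘text’ flag is stored as a `text` attribute and that the ‘inf’/‘ad’ segments are child elements named `<inf>` and `<ad>`. This encoding is a guess for illustration only and should be checked against the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical e17p-style listings; the <inf>/<ad> child elements and the
# text attribute are assumptions about how the annotation is encoded.
SAMPLE = """<corpus>
  <listing id="p1" text="Y">
    <ad>Je vends ma table.</ad>
    <inf>120 x 80 cm</inf>
    <ad>Très bon état.</ad>
  </listing>
  <listing id="p2" text="N">
    <inf>Ref. 4512</inf>
  </listing>
</corpus>"""


def user_written_text(listing):
    """Join the 'ad' segments, i.e. everything written by the user."""
    return " ".join((seg.text or "").strip() for seg in listing.findall("ad"))


root = ET.fromstring(SAMPLE)
# Keep only listings flagged as resembling running text (text="Y").
text_like = [el for el in root.iter("listing") if el.get("text") == "Y"]
# Map listing ID to its user-written text, with 'inf' segments left out.
ad_text = {el.get("id"): user_written_text(el) for el in text_like}
```

The same pattern would apply to the e05p ‘svo’ flag by comparing the corresponding attribute against `"1"`, again assuming it is stored as an attribute.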