
Prototyping (and evaluating) a pipeline for Named Entity Recognition in "Mitteilungsblatt"

What is the project about?

Build a pipeline for Named Entity Recognition in the Mitteilungsblatt.

Who is part of the team?

Daniil Skorinkin - Potsdam University

Henny Sluyter-Gaethje - Potsdam University

Harald Lordick - Steinheim Institute

Mirjam Rürup - Moses Mendelssohn Center

Ursula Wallmeier - Moses Mendelssohn Center

Benjamin Schnabel - FID Jüdische Studien, JudaicaLink

What is the practical use of your research tool for Jewish studies? / What can researchers of Jewish studies learn/achieve from your project?

We turn the name registries from the Mitteilungsblatt into structured, machine-readable form (which can then be used for NER improvements and/or by other researchers).

We prototype a pipeline for extracting person mentions from the Mitteilungsblatt.

Which technical methods and tools are used for developing your project and what tools and/or methods do you use for reaching your goal?

OCR (FineReader, Tesseract)

Annotation interface (CATMA)

NER models (spaCy, Flair)

Regex and gazetteer matching

RDF Graph
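The regex and gazetteer matching step can be illustrated with a minimal sketch. It assumes the gazetteer is a plain list of name strings; the function names, the sample names, and the sample sentence are hypothetical and not taken from the project.

```python
import re

def build_gazetteer_pattern(names):
    """Compile one regex that matches any gazetteer name as a whole word."""
    # Longest names first, so a full name wins over a bare surname
    # when both could match at the same position.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds keep "Mustermann" from matching inside "Mustermanns".
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")

def find_person_mentions(text, pattern):
    """Return (start, end, surface form) for every gazetteer hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# Hypothetical gazetteer and sentence, for illustration only.
gazetteer = ["Max Mustermann", "Mustermann", "Erika Musterfrau"]
pattern = build_gazetteer_pattern(gazetteer)
sample = "Erika Musterfrau sprach mit Max Mustermann; Mustermanns Vortrag folgte."
mentions = find_person_mentions(sample, pattern)

for start, end, surface in mentions:
    print(start, end, surface)
```

Exact matching of this kind misses inflected forms (the genitive "Mustermanns" above is skipped on purpose) and OCR variants, so it is probably best seen as a high-precision baseline next to the statistical NER models.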

How have you approached the project so far?

We created an annotation task force, produced manual markup for some samples of the Mitteilungsblatt in German and in Hebrew, applied;

In the other direction, we OCR-ed the manually created ‘person registries’ from the library.

What can you learn from this project for your own personal research interests?

  • Organizing collaborative CATMA annotation 
  • Running evaluation pipelines 
  • Cleaning noisy OCR data

What do you expect to achieve in the Hackathon?

Output the personal data from the OCR-ed person registries for the Mitteilungsblatt in a clear, machine-readable form.
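As a rough sketch of what this conversion could look like, the snippet below parses registry lines into JSON records. The entry layout ("Surname, Forename" followed by page numbers) is an assumption made for illustration; the real registries may be laid out differently, and the names and sample lines are hypothetical.

```python
import json
import re

# Assumed (hypothetical) entry layout: "Surname, Forename  12, 45, 103".
# The real registries may look different; adjust the pattern accordingly.
ENTRY_RE = re.compile(
    r"^\s*(?P<surname>[^,\d]+?),\s*(?P<forename>[^,\d]+?)\s+"
    r"(?P<pages>\d+(?:\s*,\s*\d+)*)\s*$"
)

def parse_registry_line(line):
    """Turn one OCR-ed registry line into a dict, or None if it does not fit."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return {
        "surname": match.group("surname").strip(),
        "forename": match.group("forename").strip(),
        "pages": [int(p) for p in re.findall(r"\d+", match.group("pages"))],
    }

def parse_registry(lines):
    """Split lines into parsed records and rejected lines kept for manual review."""
    records, rejected = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines left over from OCR
        record = parse_registry_line(line)
        if record is None:
            rejected.append(line)
        else:
            records.append(record)
    return records, rejected

# Hypothetical OCR output, for illustration only.
ocr_lines = [
    "Mustermann, Max 12, 45",
    "",
    "Musterfrau, Erika Maria   7",
    "Inhaltsverzeichnis",
]
records, rejected = parse_registry(ocr_lines)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Keeping the rejected lines instead of dropping them makes it easier to see where OCR noise breaks the assumed layout.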

Test models that extract person mentions, evaluate their performance, and try to improve the quality (possibly with the help of the registry data from the previous point).
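One common way to evaluate extracted person mentions is exact-match precision, recall, and F1 over character spans. The sketch below assumes that gold annotations and model predictions are both available as (start, end) offsets for the same text; the function name and the sample offsets are hypothetical, and the project may use a different matching criterion.

```python
def evaluate_spans(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical offsets, for illustration only.
gold_spans = [(0, 16), (28, 42), (60, 70)]
predicted_spans = [(0, 16), (28, 42), (80, 90), (100, 110)]
scores = evaluate_spans(gold_spans, predicted_spans)
print(scores)
```

Exact matching is strict: a prediction that is off by one character counts as both a false positive and a false negative, which can matter on noisy OCR text.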