Interactive Labeling of Scan Segmentations

Subject: Bachelor's or Master's thesis with the goal of designing and developing an interactive labeling system for segmenting advertisements from scanned newspaper archives.
Type: Bachelor/Master Thesis
Supervisor: Merlin Knäble
Add on:
Status: Open

Problem

As the digitization of the world's libraries and print archives continues steadily, the demand for automated processing of such documents grows. Researchers and practitioners would like to process these documents digitally with tools from computer vision (CV) and optical character recognition (OCR). They would also like to search and filter by document metadata. However, all of this presumes that the extracted features and metadata are available. As state-of-the-art machine learning (ML) classifiers still do not reach the desired accuracy levels, especially on old documents or those from fringe contexts, manual labeling effort is required.

Goals

For the scope of this thesis, we limit the context to segmenting advertisements from scanned pages of newspapers and magazines. This is an interesting use case for advertising researchers, among others. Associated colleagues at the University of Mannheim (UniMA) have already manually created a labeled set of 9000 segmented pages of the US magazine "The Economist", ranging from the 1840s to today. We expect the thesis student to develop an interactive labeling system that supports extending this segmentation training dataset to many more pages. Interactive labeling strives to combine automatic steps (e.g. the trained model) with incremental user input. The work packages entail:

analyzing the state of the art of such segmentation tools
exchanging with the researchers at UniMA who created the training dataset regarding requirements and system evaluation
developing an interactive labeling system as part of a design science research process
- train an ML classifier based on the existing training data
- (potentially) include more training data from freely available datasets
- develop an interactive labeling tool that integrates the ML classifier with manual segmentation
- integrate novel interaction paradigms with the existing ML classifier into the tool (e.g. manually reviewing those instances in which the model was uncertain, retraining the model based on new user input, ...)
writing a thesis document according to research group requirements and participating in our thesis colloquium
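
To illustrate the idea of manually reviewing the instances in which the model was uncertain, the following is a minimal sketch and not a prescribed design. It assumes that the classifier outputs, for each page, a list of probabilities that a region is an advertisement, and it ranks pages by mean binary entropy. The function names, the page identifiers, and the probability values are hypothetical and chosen for illustration only.

```python
import math
from typing import Dict, List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def page_uncertainty(region_probs: List[float]) -> float:
    """Mean entropy over a page's 'is advertisement' probabilities."""
    if not region_probs:
        return 0.0
    return sum(binary_entropy(p) for p in region_probs) / len(region_probs)


def select_for_review(predictions: Dict[str, List[float]], k: int) -> List[str]:
    """Return the ids of the k pages with the most uncertain predictions."""
    ranked = sorted(
        predictions,
        key=lambda page_id: page_uncertainty(predictions[page_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model output: page id -> per-region ad probabilities.
predictions = {
    "page_001": [0.98, 0.02, 0.99],  # confident
    "page_002": [0.55, 0.48, 0.60],  # very uncertain
    "page_003": [0.90, 0.10, 0.70],  # somewhat uncertain
}

to_review = select_for_review(predictions, k=2)
print(to_review)  # ['page_002', 'page_003']
```

In such a setup, the selected pages would be shown to the user first, and the corrected segmentations could then be added to the training data before the model is retrained. Other uncertainty measures or selection strategies are equally possible.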

Design science research is a well-established methodology in the information systems field that takes a scientific view on artifacts, such as the labeling system to be developed during this thesis. So-called design knowledge can be derived from both the development process and the finished artifact.

Requirements

We expect the student to be familiar with web development. The system should be developed with a modern web application frontend framework (e.g. Vue with Vuetify) or be forked from an existing open source segmentation system. Furthermore, we expect the model to be trained with standard Python frameworks; experience with these is required as well.

Contact

If you are interested in this topic and want to apply for this thesis, please contact Merlin Knäble with a short motivation statement, your CV, and a current transcript of records. Feel free to reach out beforehand if you have any questions.