GROTOAP: GROund Truth for Open Access Publications

The field of digital document content analysis includes many important tasks, such as page segmentation and zone classification. It is impossible to build effective solutions to such problems and evaluate their performance without a reliable test set that contains both input documents and the expected results of segmentation and classification. In this paper we present GROTOAP, a test set for training and evaluating page segmentation and zone classification methods. The test set contains input articles in digital form and the corresponding ground truth files. All input documents included in the test set were selected from the DOAJ database, which indexes articles published under the CC-BY license. The whole test set is available under the same license.
D. Tkaczyk, A. Czeczko, K. Rusek, Ł. Bolikowski, and R. Bogacewicz, “GROTOAP: GROund Truth for Open Access Publications,” in Proceedings of the 2012 ACM/IEEE Joint Conference on Digital Libraries, 2012, pp. 381-382.