Sentence simplification corpus

This corpus consists of Medline abstracts concerning proteins and genes. We randomly selected 130 Medline abstracts (1,199 sentences in total) whose titles contain the words "protein" and "gene" (see table below). We asked five judges to mark the six constructs. To provide a high-quality annotated corpus, each sentence was annotated independently by two judges, and annotation conflicts (57 sentences in total) were resolved by a third-party opinion.

iSimp corpus for sentence simplification annotation
Types                    Fraction of sentences    Fraction of abstracts
Coordinations            0.43                     0.75
Relative clauses         0.19                     0.65
Appositions              0.04                     0.36
Parenthesized elements   0.16                     0.63
Introductory phrases     0.12                     0.53

The corpus contains XML files of passages, sentences, and annotations of simplification constructions. Each XML file is in BioC format and is therefore distributed with its Key file. For convenience, all BioC files use the same DTD, provided by BioC, so that the structure of the XML files can be verified.


The download is a zip file. When you unpack it, you should have a BioC XML file and a Key file. The BioC file contains the raw text of the PubMed abstracts, downloaded directly from Medline, together with the simplification constructs. For each construct, "annotation" marks its components, and "relation" links the components to show the simplification construct as a whole. The meanings of the tags are detailed in the Key file.
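As a rough illustration of how "annotation" and "relation" elements fit together, the following minimal sketch reads a BioC XML string with Python's standard library and resolves each relation's nodes to the text of the annotations they reference. The sample document, the infon values, and the role names are invented placeholders, not taken from the corpus; the actual keys and values are defined in the Key file. The function name read_constructs is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general BioC layout; infon values and roles are
# placeholders. Consult the Key file for the real tag meanings.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20130101</date>
  <key>example.key</key>
  <document>
    <id>0000000</id>
    <passage>
      <offset>0</offset>
      <text>Protein A, a kinase, binds B.</text>
      <annotation id="T1">
        <infon key="type">example-component</infon>
        <location offset="0" length="9"/>
        <text>Protein A</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">example-component</infon>
        <location offset="11" length="8"/>
        <text>a kinase</text>
      </annotation>
      <relation id="R1">
        <infon key="type">example-construct</infon>
        <node refid="T1" role="example-role-1"/>
        <node refid="T2" role="example-role-2"/>
      </relation>
    </passage>
  </document>
</collection>"""


def read_constructs(xml_text):
    """Return one (relation id, infons, [(role, component text), ...]) per relation."""
    root = ET.fromstring(xml_text)
    constructs = []
    for passage in root.iter("passage"):
        # iter() also finds annotations nested inside <sentence> elements.
        components = {a.get("id"): a.findtext("text")
                      for a in passage.iter("annotation")}
        for relation in passage.iter("relation"):
            infons = {i.get("key"): i.text for i in relation.findall("infon")}
            parts = [(n.get("role"), components.get(n.get("refid")))
                     for n in relation.findall("node")]
            constructs.append((relation.get("id"), infons, parts))
    return constructs


for rel_id, infons, parts in read_constructs(SAMPLE):
    print(rel_id, infons, parts)
```

To read the unpacked corpus file instead of the sample string, pass the file contents to read_constructs, or swap ET.fromstring for ET.parse(path).getroot().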

Download the corpus version 1.0.2

BioNLP-ST corpora

We have converted the training and development corpora of the BioNLP-ST 2011 and 2013 GE tasks into BioC format. The converted corpora, as well as the conversion program, are available from the links below. The test corpora are not provided because the event annotations in those data are not released.

In the converted data below, text files (.txt) in the BioNLP corpora are split at newlines and stored as BioCPassages. Entities (in .a1) and event triggers (in .a2) are stored in separate passages based on their positions in the text files. Target annotations (in .a2), including events, relations, event modifications, and equivalences, are annotated at the document level.
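The splitting and placement described above can be sketched roughly as follows. This is not the conversion program itself, only an illustration under stated assumptions: entity lines follow the usual BioNLP-ST standoff layout (ID, tab, type and character span, tab, surface text), spans are contiguous, and every line of the text file becomes one passage. The function names and the sample text are hypothetical.

```python
def split_passages(text):
    """Split text at newlines and return (document offset, passage text) pairs."""
    passages = []
    offset = 0
    for line in text.split("\n"):
        passages.append((offset, line))
        offset += len(line) + 1  # +1 accounts for the newline character
    return passages


def parse_entities(a1_text):
    """Parse 'T' lines of the form: ID<TAB>Type start end<TAB>surface text."""
    entities = []
    for line in a1_text.splitlines():
        if not line.startswith("T"):
            continue
        ident, type_span, surface = line.split("\t")
        etype, start, end = type_span.split(" ")  # assumes a contiguous span
        entities.append((ident, etype, int(start), int(end), surface))
    return entities


def assign_to_passages(passages, entities):
    """Group entities under the offset of the passage that contains their start."""
    assigned = {offset: [] for offset, _ in passages}
    for entity in entities:
        start = entity[2]
        for offset, passage_text in passages:
            if offset <= start < offset + len(passage_text) + 1:
                assigned[offset].append(entity)
                break
    return assigned


# Invented sample data for illustration only.
text = "IL-2 gene expression\nNF-kappa B activates IL-2."
a1 = ("T1\tProtein 0 4\tIL-2\n"
      "T2\tProtein 21 31\tNF-kappa B\n"
      "T3\tProtein 42 46\tIL-2\n")

passages = split_passages(text)
assigned = assign_to_passages(passages, parse_entities(a1))
for offset, passage_text in passages:
    print(offset, repr(passage_text), [e[0] for e in assigned[offset]])
```

Because each passage keeps its document-level offset, the entity spans do not need to be renumbered: the original character positions remain valid against the full text.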

2013 GE task
: Training set, Development set

2011 GE task
: Training set, Development set

Conversion program
: You can use Git to get the code. The distribution includes: