Sentence simplification is a technique that detects the various types of clauses and constructs used in a complex sentence in order to produce two or more simple sentences while maintaining both coherence and the communicated message. By reducing the syntactic complexity of a sentence, sentence simplification aims to ease the development of natural language processing and text mining tools. For this purpose, we developed iSimp. You can try out our simplifier online.
To illustrate the usefulness of sentence simplification, consider the following complex sentence from the biomedical literature:
Active Raf-1 phosphorylates and activates the mitogen-activated protein (MAP) kinase/extracellular signal-regulated kinase kinase 1 (MEK1), which in turn phosphorylates and activates the MAP kinases/extracellular signal regulated kinases, ERK1 and ERK2. (PMID-8557975)
The major syntactic constructs that we consider when simplifying a sentence are coordinations, relative clauses, appositions, parenthesized elements, and introductory phrases. Figure 1 shows the constructs that appear in the example. Figure 2 gives more detail on the components of each construct.
After identifying these constructs, the complex sentence can be broken into multiple simple sentences. Here we show only six examples, which require combining two coordinations with the relative clause and the apposition:
- Active Raf-1 phosphorylates MEK1
- MEK1 phosphorylates ERK1
- MEK1 phosphorylates ERK2
- Active Raf-1 activates MEK1
- MEK1 activates ERK1
- MEK1 activates ERK2
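The combination step can be illustrated with a minimal Python sketch. It assumes the constructs have already been identified and are supplied by hand; the dictionaries and the `expand` helper are hypothetical and are not part of iSimp, which detects these constructs automatically. Each clause is expanded by pairing every coordinated verb with every coordinated object.

```python
from itertools import product

# Hand-annotated constructs for the example sentence.
# These structures are assumptions for illustration, not iSimp output.
main_clause = {
    "subject": "Active Raf-1",
    "verbs": ["phosphorylates", "activates"],  # coordinated verbs
    "objects": ["MEK1"],                       # name taken from the parenthesized element
}
relative_clause = {
    "subject": "MEK1",                         # referent of "which"
    "verbs": ["phosphorylates", "activates"],  # coordinated verbs
    "objects": ["ERK1", "ERK2"],               # coordinated members of the apposition
}

def expand(clause):
    """Pair every verb with every object to form simple sentences."""
    return [
        f"{clause['subject']} {verb} {obj}"
        for verb, obj in product(clause["verbs"], clause["objects"])
    ]

simple_sentences = expand(main_clause) + expand(relative_clause)
for sentence in simple_sentences:
    print(sentence)
```

Under these assumptions, the two verbs and one object of the main clause yield two sentences, and the two verbs and two objects of the relative clause yield four, which gives the six sentences listed above.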
Suppose we use a straightforward rule, "phosphorylates X", to extract the theme of a phosphorylation relation in text, where X is a protein that appears as a head word. This rule matches straightforward mentions of phosphorylation, but it fails on the complex sentence above, where "phosphorylates" is followed by "and activates" rather than directly by a protein. After simplification, however, the rule applies to the three simplified sentences containing "phosphorylates" and extracts "MEK1", "ERK1", and "ERK2" as themes.
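The effect of the rule can be sketched in Python with a regular expression. This is only an illustration: the `PROTEINS` lexicon and the `extract_themes` helper are hypothetical stand-ins for a real protein recognizer and rule engine, and the simplified sentences are typed in by hand rather than produced by iSimp.

```python
import re

# Hypothetical protein lexicon; a real system would use a named-entity recognizer.
PROTEINS = {"MEK1", "ERK1", "ERK2"}

# The rule "phosphorylates X": capture the word immediately after the verb.
RULE = re.compile(r"\bphosphorylates (\w+)")

def extract_themes(sentence):
    """Return the captured words that are known proteins."""
    return [word for word in RULE.findall(sentence) if word in PROTEINS]

complex_sentence = (
    "Active Raf-1 phosphorylates and activates the mitogen-activated protein "
    "(MAP) kinase/extracellular signal-regulated kinase kinase 1 (MEK1), which "
    "in turn phosphorylates and activates the MAP kinases/extracellular signal "
    "regulated kinases, ERK1 and ERK2."
)
simplified = [
    "Active Raf-1 phosphorylates MEK1",
    "MEK1 phosphorylates ERK1",
    "MEK1 phosphorylates ERK2",
]

# In the complex sentence the verb is always followed by "and", so nothing is extracted.
themes_complex = extract_themes(complex_sentence)

# In the simplified sentences the verb is followed directly by the protein.
themes_simple = [theme for s in simplified for theme in extract_themes(s)]

print(themes_complex)
print(themes_simple)
```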
BioCreative IV Track 1
To make iSimp readily usable in NLP and text mining tools, we participated in BioCreative IV Track 1 and adopted the BioC format, a simple XML format for sharing text documents and annotations. Our contributions to this track include:
- The development of a Java BioC IO API that is independent of the particular XML parser used. This API, developed as part of the iSimp project, became part of the public release of the BioC package.
- A tag set for annotating simplification constructs. The tag set can be used in any sentence simplification system to exchange data with other NLP systems and, thus, make the system easily interoperable with other NLP applications.
- A unique mechanism that allows newly generated artificial text to be included and treated as if it were an original collection in subsequent processing.
- The construction of an iSimp corpus, provided in the BioC format. In addition to this corpus, we adapted the BioC format to the GENIA Event Extraction (GE) corpora of BioNLP-ST 2011, and these corpora were also used in the evaluation of iSimp.
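To give a feel for what a BioC document with a simplification annotation might look like, here is a minimal Python sketch that builds one with the standard library. The element names (collection, document, passage, annotation, infon, location) follow the general BioC layout, but the source, key, and id values and the "coordination" type label are placeholders chosen for illustration; they are assumptions and not the actual iSimp tag set or key file.

```python
import xml.etree.ElementTree as ET

def add_child(parent, tag, text=None, **attrib):
    """Append a child element with optional text and attributes."""
    child = ET.SubElement(parent, tag, attrib)
    if text is not None:
        child.text = text
    return child

sentence = "Active Raf-1 phosphorylates and activates MEK1."
span = "phosphorylates and activates"  # the coordination to annotate
start = sentence.index(span)           # character offset within the passage

collection = ET.Element("collection")
add_child(collection, "source", "example")       # placeholder value
add_child(collection, "date", "2013-01-01")      # placeholder value
add_child(collection, "key", "example.key")      # placeholder value

document = add_child(collection, "document")
add_child(document, "id", "example-1")           # placeholder value

passage = add_child(document, "passage")
add_child(passage, "offset", "0")
add_child(passage, "text", sentence)

annotation = add_child(passage, "annotation", id="T1")
add_child(annotation, "infon", "coordination", key="type")  # hypothetical label
add_child(annotation, "location", offset=str(start), length=str(len(span)))
add_child(annotation, "text", span)

xml_string = ET.tostring(collection, encoding="unicode")
print(xml_string)
```

Because the annotation stores only an offset and a length into the passage text, any tool that reads the file can recover the annotated span without knowing how it was produced.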
The main technical ideas behind how iSimp and BioC work appear in these papers. Feel free to cite one or more of the following papers depending on what you are using.
Peng,Y., Tudor,C., Torii,M., Wu,C.H., Vijay-Shanker,K. (2014) iSimp in BioC standard format: Enhancing the interoperability of a sentence simplification system. Database.
Peng,Y., Tudor,C., Torii,M., Wu,C.H., Vijay-Shanker,K. (2013) Enhancing the interoperability of iSimp by using the BioC format. In Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. 5-9.
Comeau,D.C., Doğan,R.I., Ciccarese,P., Cohen,K.B., Krallinger,M., Leitner,F., Lu,Z., Peng,Y., Rinaldi,F., Torii,M., Valencia,A., Verspoor,K., Wiegers,T.C., Wu,C.H., Wilbur,W.J. (2013) BioC: A minimalist approach to interoperability for biomedical text processing. Database: The Journal of Biological Databases and Curation.
Peng,Y., Tudor,C., Torii,M., Wu,C.H., Vijay-Shanker,K. (2012) iSimp: A Sentence Simplification System for Biomedical Text. In Proceedings of the 2012 IEEE International Conference on Bioinformatics and Biomedicine. 211-216.
Research reported in this website was supported by the National Library of Medicine of the National Institutes of Health under award number G08LM010720. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
This material is also based upon work supported by the National Science Foundation under Grant No. DBI-1062520. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.