Math-Aware POS Tagging
POS tagging is a standard NLP task, but how should it handle scientific documents that contain mathematical expressions?
- Traditional POS tagging methods can be adjusted to handle formulas.
The Penn Treebank POS scheme has no special classes for mathematics.
We can extend it with math-related classes:
- ID for identifiers (e.g. in "... where $E$ stands for energy", $E$ should be tagged as ID)
- MATH for formulas (e.g. in "$E = mc^2$ is the mass-energy equivalence formula", $E = mc^2$ should be tagged as MATH)
Mathematical expressions are usually enclosed in special delimiters, e.g. <math></math> tags on Wikipedia, or dollar signs ($...$ or $$...$$) in LaTeX documents.
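A minimal sketch of how such expressions could be located with a regular expression. The pattern and the helper name find_math are illustrative assumptions, not part of any cited system; the pattern covers only the three delimiter styles named above and ignores escaped dollar signs and other LaTeX math environments.

```python
import re

# Assumed delimiters: <math>...</math> (Wikipedia), $$...$$ and $...$ (LaTeX).
# Display math is listed before inline math so that "$$" is not read as two "$".
MATH_PATTERN = re.compile(
    r"<math>(.*?)</math>"   # Wikipedia markup
    r"|\$\$(.+?)\$\$"       # LaTeX display math
    r"|\$(.+?)\$",          # LaTeX inline math
    re.DOTALL,
)


def find_math(text):
    """Return a list of (start, end, content) for each math expression in text."""
    spans = []
    for m in MATH_PATTERN.finditer(text):
        # Exactly one of the three groups matched; take that one.
        content = next(g for g in m.groups() if g is not None)
        spans.append((m.start(), m.end(), content.strip()))
    return spans


if __name__ == "__main__":
    sample = "where $E$ stands for energy and <math>E = mc^2</math> holds"
    for start, end, content in find_math(sample):
        print(start, end, content)
```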
- First, find all such mathematical expressions and replace each with a unique single token "MATH_mathID".
- The mathID can be a randomly generated string or the result of a hash function applied to the content of the formula. The latter approach is preferred when the tokens must stay consistent across several runs.
- Then apply a traditional POS tagging technique to the textual data. It will typically annotate the "MATH_mathID" tokens as nouns.
- After that, we may want to re-annotate the math tokens: if the expression contains only one identifier, label it ID; if it contains several, label it MATH. In some cases, however, we want to keep the original annotation.
- Finally, bring the mathematical content back into the document.
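The steps above can be sketched roughly as follows. This is an illustration under several assumptions rather than the method of any of the cited papers: the function names, the 8-character MD5 prefix used as mathID, the regular expression that decides what counts as a single identifier, and the whitespace tokenizer are all hypothetical choices. toy_tagger is a stand-in with a tiny hand-made lexicon; in practice any tagger that maps a list of tokens to (token, tag) pairs, such as nltk.pos_tag, could be passed in its place.

```python
import hashlib
import re

# Assumed delimiters: <math>...</math>, $$...$$ and $...$.
MATH_PATTERN = re.compile(r"<math>(.*?)</math>|\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)

# Assumed heuristic for "only one identifier": a single letter or a LaTeX
# command such as \alpha, optionally followed by a subscript (x_i, x_{ij}).
IDENTIFIER = re.compile(r"(\\[A-Za-z]+|[A-Za-z])(_\{?\w+\}?)?")


def math_id(content):
    """Hash-based ID, so the same formula gets the same token on every run."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()[:8]


def mask_math(text):
    """Replace each formula with a MATH_<id> token.

    Returns the masked text and a dict mapping each token to its formula.
    """
    formulas = {}

    def repl(m):
        content = next(g for g in m.groups() if g is not None).strip()
        token = "MATH_" + math_id(content)
        formulas[token] = content
        return " " + token + " "  # pad so the token is split off by whitespace

    return MATH_PATTERN.sub(repl, text), formulas


def toy_tagger(tokens):
    """Hypothetical stand-in for a real POS tagger: tiny lexicon, default NN."""
    lexicon = {"where": "WRB", "stands": "VBZ", "for": "IN",
               "is": "VBZ", "the": "DT"}
    return [(t, lexicon.get(t.lower(), "NN")) for t in tokens]


def tag_with_math(text, tagger=toy_tagger, relabel=True):
    """Mask formulas, tag the text, re-annotate math tokens, restore formulas."""
    masked, formulas = mask_math(text)
    result = []
    for token, tag in tagger(masked.split()):
        if token in formulas:
            content = formulas[token]
            if relabel:  # set relabel=False to keep the tagger's annotation
                tag = "ID" if IDENTIFIER.fullmatch(content) else "MATH"
            token = content  # bring the mathematical content back
        result.append((token, tag))
    return result


if __name__ == "__main__":
    print(tag_with_math("where $E$ stands for energy"))
    print(tag_with_math("$E = mc^2$ is the mass-energy equivalence formula"))
```

With relabel=False the math tokens keep whatever tag the tagger assigned (usually a noun tag), which corresponds to the case where the original annotation should be kept. A truncated hash can in principle collide for different formulas, so a longer ID may be safer for large corpora.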
- Kristianto, Giovanni Yoko, et al. "Extracting definitions of mathematical expressions in scientific papers." 2012. 
- Pagel, Robert, and Moritz Schubotz. "Mathematical Language Processing Project." 2014.
- Schöneberg, Ulf, and Wolfram Sperber. "POS Tagging and its Applications for Mathematics." 2014.