Math-Aware POS Tagging
POS Tagging is a standard NLP task, but how should it handle scientific documents that contain mathematical expressions?
- We can adapt traditional POS Tagging methods to handle formulas.
Classification
The Penn Treebank POS scheme has no special classes for mathematics.
We can extend it with additional math-related classes:
- ID for identifiers (e.g. "... where $E$ stands for energy", $E$ should be tagged as ID)
- MATH for formulas (e.g. "$E = mc^2$ is the mass-energy equivalence formula", $E = mc^2$ should be tagged as MATH)
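The distinction between the two classes can be sketched with a small heuristic. This is only an illustration, not a method taken from the sources: the function name `math_tag` and the definition of "identifier" used here (a single Latin letter or a Greek-letter command, optionally with one simple subscript) are assumptions.

```python
import re

# Assumption: an identifier is one Latin letter or a LaTeX Greek-letter command,
# optionally followed by a simple subscript (x, \alpha, x_i, x_{ij}).
GREEK = r"\\(?:alpha|beta|gamma|delta|epsilon|theta|lambda|mu|pi|sigma|phi|omega)"
IDENTIFIER = re.compile(
    r"(?:[A-Za-z]|" + GREEK + r")(?:_(?:[A-Za-z0-9]|\{[A-Za-z0-9]+\}))?"
)

def math_tag(latex: str) -> str:
    """Return "ID" for a lone identifier and "MATH" for any other expression."""
    return "ID" if IDENTIFIER.fullmatch(latex.strip()) else "MATH"

print(math_tag("E"))         # ID
print(math_tag("E = mc^2"))  # MATH
```

A real system would probably need a richer notion of identifier (accents, bold symbols, multi-letter names), so the pattern above should be read as a starting point.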
Text Preprocessing
Mathematical expressions are usually enclosed in special delimiters, e.g. inside <math></math> tags on Wikipedia, or between $ signs in LaTeX documents.
- We find all such mathematical expressions and replace each with a single unique token "MATH_mathID".
- The mathID can be a randomly generated string or the result of a hash function applied to the content of the formula. The latter approach is preferable when the tokens must stay consistent across several runs.
- Then we apply traditional POS Tagging techniques to the textual data. They will typically annotate such "MATH_mathID" tokens as nouns.
- After that we may want to re-annotate all math tokens: if a formula contains only one identifier, we label it as ID; if it contains several, we label it as MATH. In some cases, however, we may want to keep the original annotation.
- Finally, we can bring the mathematical content back into the document.
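The steps above can be sketched as a short pipeline. Everything named here is an assumption made for illustration: the helper names (`mask_math`, `retag`, `tag_with_math`), the delimiter regex, the 8-character MD5 prefix used as mathID, and the simplified "single letter means ID" rule. The `toy_tagger` is only a placeholder; a real tagger that accepts a token list and returns (token, tag) pairs, such as NLTK's `pos_tag`, could be passed in instead.

```python
import hashlib
import re

# Assumption: formulas are delimited by <math>...</math> or by single $...$.
MATH_RE = re.compile(r"<math>(.*?)</math>|\$([^$]+)\$", re.DOTALL)

def mask_math(text):
    """Replace each formula with a MATH_<hash> token; return text and lookup table."""
    formulas = {}

    def repl(match):
        raw = match.group(1) if match.group(1) is not None else match.group(2)
        content = raw.strip()
        # Hashing the content keeps the token identical across runs.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()[:8]
        token = "MATH_" + digest
        formulas[token] = content
        return " " + token + " "  # pad so the token is split off from punctuation

    return MATH_RE.sub(repl, text), formulas

def toy_tagger(tokens):
    """Placeholder for a real POS tagger (hypothetical): tags everything as NN."""
    return [(tok, "NN") for tok in tokens]

def retag(tagged, formulas):
    """Re-annotate placeholders as ID or MATH and restore the original formula."""
    result = []
    for tok, tag in tagged:
        if tok in formulas:
            content = formulas[tok]
            # Simplified rule: a single letter counts as one identifier.
            tag = "ID" if re.fullmatch(r"[A-Za-z]", content) else "MATH"
            tok = content
        result.append((tok, tag))
    return result

def tag_with_math(text, tagger=toy_tagger):
    masked, formulas = mask_math(text)
    tokens = masked.split()  # naive whitespace tokenization
    return retag(tagger(tokens), formulas)

print(tag_with_math("where $E$ stands for energy"))
```

Truncating the hash keeps the tokens short but makes collisions between different formulas possible in large corpora; using the full digest avoids this at the cost of longer tokens.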
Usage
Sources
- Kristianto, Giovanni Yoko, et al. "Extracting definitions of mathematical expressions in scientific papers." 2012. [1]
- Pagel, Robert, and Moritz Schubotz. "Mathematical Language Processing Project." 2014. [2]
- Schöneberg, Ulf, and Wolfram Sperber. "POS Tagging and its Applications for Mathematics." 2014. [3]