Math-Aware POS Tagging

POS tagging is a standard NLP task, but how should it handle scientific documents that contain mathematical expressions?

  • We can adapt traditional POS tagging methods to handle formulas


Classification

The Penn Treebank POS scheme has no special classes for mathematics. We can extend it with math-related classes:

  • ID for identifiers (e.g. in "... where $E$ stands for energy", $E$ should be tagged as ID)
  • MATH for formulas (e.g. in "$E = mc^2$ is the mass-energy equivalence formula", $E = mc^2$ should be tagged as MATH)
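
The distinction between the two classes can be sketched as a small rule. This is only an illustration: the function name math_tag and the definition of "identifier" used here (a single letter or a single LaTeX command such as \alpha, optionally with one simple subscript) are assumptions, not part of the Penn Treebank scheme or of the cited papers.

```python
import re

# Assumption: an identifier is one letter or one LaTeX command (e.g. \alpha),
# optionally followed by a simple subscript such as _i or _{ij}.
IDENTIFIER = re.compile(
    r"^(?:[A-Za-z]|\\[A-Za-z]+)(?:_(?:[A-Za-z0-9]|\{[A-Za-z0-9]+\}))?$"
)


def math_tag(latex: str) -> str:
    """Return 'ID' for a lone identifier and 'MATH' for any other formula."""
    # Drop surrounding whitespace and $ / $$ delimiters before matching.
    body = latex.strip().strip("$").strip()
    return "ID" if IDENTIFIER.match(body) else "MATH"


print(math_tag("$E$"))         # lone identifier
print(math_tag("$E = mc^2$"))  # full formula
```

A stricter or looser notion of identifier (for example, allowing accents such as \hat{x}) would only require changing the pattern.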


Text Preprocessing

Mathematical expressions are usually enclosed in special delimiters, e.g. inside <math></math> tags on Wikipedia, or between $ delimiters in LaTeX documents.

  • We find all such mathematical expressions and replace each one with a unique single token "MATH_mathID"
  • The mathID can be a randomly generated string or the result of a hash function applied to the content of the formula. The latter is preferred when we want the tokens to stay consistent across several runs
  • Then we apply traditional POS tagging techniques to the textual data. They will typically annotate such "MATH_mathID" tokens as nouns
  • After that we may want to re-annotate all math tokens: if a token contains only one identifier, we label it as ID; if it contains several, we label it as MATH. In some cases, however, we may want to keep the original annotation
  • Finally, we can bring the mathematical content back into the document
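
The steps above can be sketched roughly as follows. Everything named here is an assumption made for illustration: the delimiter pattern, the helper names mask_math and tag_with_math, the 8-character SHA-256 prefix used as mathID, and the single-letter identifier test. The toy tagger is a stand-in that labels every token as a noun; in practice any function that maps a list of tokens to (token, tag) pairs, such as NLTK's pos_tag, could be passed instead. The sketch always re-annotates math tokens and does not cover the case where the original annotation is kept.

```python
import hashlib
import re

# Assumption: formulas are delimited by <math>...</math>, $$...$$ or $...$.
MATH_SPAN = re.compile(r"<math>(.+?)</math>|\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def mask_math(text):
    """Replace each formula with a MATH_<hash> token.

    Returns the masked text and a dict mapping each token to its formula.
    """
    formulas = {}

    def repl(match):
        content = next(g for g in match.groups() if g is not None).strip()
        # Hashing the content keeps the token stable across runs.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:8]
        token = "MATH_" + digest
        formulas[token] = content
        return f" {token} "  # pad so the token splits off from punctuation

    return MATH_SPAN.sub(repl, text), formulas


def tag_with_math(text, tagger, is_identifier):
    """Mask formulas, tag the text, re-annotate math tokens, restore formulas."""
    masked, formulas = mask_math(text)
    result = []
    for token, tag in tagger(masked.split()):
        if token in formulas:
            content = formulas[token]
            tag = "ID" if is_identifier(content) else "MATH"
            token = content  # bring the mathematical content back
        result.append((token, tag))
    return result


def toy_tagger(tokens):
    """Hypothetical stand-in for a real POS tagger: everything is a noun."""
    return [(t, "NN") for t in tokens]


def is_identifier(content):
    """Assumption: an identifier is a single Latin letter."""
    return re.fullmatch(r"[A-Za-z]", content) is not None


sentence = "where $E$ stands for energy and $E = mc^2$ holds"
for token, tag in tag_with_math(sentence, toy_tagger, is_identifier):
    print(token, tag)
```

Because the mathID is derived from the formula content, two occurrences of the same formula map to the same token, which is the consistency property mentioned above.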


Usage


Sources

  • Kristianto, Giovanni Yoko, et al. "Extracting definitions of mathematical expressions in scientific papers." 2012. [1]
  • Pagel, Robert, and Moritz Schubotz. "Mathematical Language Processing Project." 2014. [2]
  • Schöneberg, Ulf, and Wolfram Sperber. "POS Tagging and its Applications for Mathematics." 2014. [3]