Mathematical Word Problem Solving Using Natural Language Processing

  • Conference paper
  • First Online: 29 February 2020

  • Shounaak Ughade
  • Satish Kumbhar

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1077)

Natural language processing (NLP) is generally applied to large datasets. Because the data available for word problems is limited, solving them with NLP is challenging. Some proposed approaches can solve basic arithmetic problems involving addition and subtraction. Knowledge representation is the main task that NLP must perform, and each kind of problem has its own approach. In this paper three types of mathematical word problems have been solved. Two of them are aptitude problems while the other two are reasoning problems. The spaCy library has been used for Named Entity Recognition (NER) and word vectors. A stepwise solution is generated instead of just the answer, which helps improve understanding. The approach is fairly generic and intuitive, and it can be extended to solve other kinds of aptitude problems.

  • Knowledge representation
  • Natural language processing
  • Sentiment analysis
  • Named entity recognition

Author information

Authors and affiliations.

Department of Computer Engineering and IT, College of Engineering, Pune, India

Shounaak Ughade & Satish Kumbhar

Corresponding author

Correspondence to Shounaak Ughade.

Editor information

Editors and affiliations.

Belgrade, Serbia

ITM University, Gwalior, India

Shyam Akashe

Global Knowledge Research Foundation, Ahmedabad, India

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Ughade, S., Kumbhar, S. (2020). Mathematical Word Problem Solving Using Natural Language Processing. In: Tuba, M., Akashe, S., Joshi, A. (eds) ICT Systems and Sustainability. Advances in Intelligent Systems and Computing, vol 1077. Springer, Singapore. https://doi.org/10.1007/978-981-15-0936-0_46

DOI: https://doi.org/10.1007/978-981-15-0936-0_46

Published: 29 February 2020

Publisher Name: Springer, Singapore

Print ISBN: 978-981-15-0935-3

Online ISBN: 978-981-15-0936-0

eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)

Solving Arithmetic Word Problems Using Natural Language Processing and Rule-Based Classification

  • Sourav Mandal XIM University https://orcid.org/0000-0002-6066-8008
  • Swagata Acharya Dept. of Information Technology, Jadavpur University https://orcid.org/0000-0002-4942-5734
  • Rohini Basak Dept. of Information Technology, Jadavpur University https://orcid.org/0000-0001-9662-3074

In the modern era, Intelligent Tutoring Systems (ITS), Computer Based Training (CBT), and similar tools are rapidly gaining popularity in both the educational and professional sectors, and automatic math word problem solving is one of the crucial sub-fields of ITS. Solving mathematical word problems automatically is a challenging research problem in Artificial Intelligence (AI), Natural Language Processing (NLP) and Machine Learning (ML), since understanding and extracting relevant information from unstructured text requires considerable reasoning ability. To date, much research has been carried out in this domain, with each effort focusing on a particular type of mathematical word problem, such as arithmetic, algebraic, geometric, or trigonometric word problems. In this work, we present an approach to solving arithmetic word problems automatically. However, it is limited to single-operation, single-equation word problems. We use a rule-based approach to classify word problems. We propose various rules to establish the relationships and dependencies among key entities, in order to classify word problems into four broad categories (Change, Combine, Compare, and Division-Multiplication) and their sub-categories, and thereby identify the desired operation among +, -, *, and /. Irrelevant information is also filtered out of the input problem text using hand-crafted rules, so that only the relevant quantities are extracted. An equation is then formed from the relevant quantities and the predicted operation to generate the final answer. The proposed work performs well compared to most similar systems reported on the standard SingleOp dataset, achieving an accuracy of 93.02%.
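The pipeline described in the abstract (classify the problem, pick an operation, extract quantities, form an equation) can be illustrated with a minimal Python sketch. The cue phrases, the `RULES` table, and the function names below are hypothetical stand-ins chosen for illustration. They are not the authors' rules, which rely on relationships and dependencies among key entities rather than keyword matching.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# (category, operation, cue phrases). Illustrative cues only (an assumption),
# checked in order; the first rule with a matching cue wins.
RULES = [
    ("Compare", "-", ("more than", "fewer than", "less than")),
    ("Division-Multiplication", "/", ("equally", "split", "shared among")),
    ("Division-Multiplication", "*", ("each", "every", "per")),
    ("Change", "-", ("gave", "lost", "spent", "sold")),
    ("Change", "+", ("bought", "found", "received")),
    ("Combine", "+", ("altogether", "in all", "in total")),
]


def classify(problem):
    """Return (category, operation) for the first rule whose cue appears."""
    text = problem.lower()
    for category, op, cues in RULES:
        # Word boundaries stop "per" from matching inside "person", etc.
        if any(re.search(rf"\b{re.escape(cue)}\b", text) for cue in cues):
            return category, op
    raise ValueError("no rule matched")


def solve(problem):
    """Classify, take the first two quantities, and evaluate the equation."""
    category, op = classify(problem)
    quantities = [float(q) for q in re.findall(r"\d+(?:\.\d+)?", problem)]
    if len(quantities) < 2:
        raise ValueError("need at least two quantities")
    a, b = quantities[:2]
    return category, f"{a:g} {op} {b:g}", OPS[op](a, b)


print(solve("Sam had 9 marbles. He gave 4 marbles to Tom. "
            "How many marbles does Sam have now?"))
```

This sketch takes the first two numbers in reading order and does no filtering of irrelevant quantities, which is exactly the step the abstract handles with additional hand-crafted rules.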

Copyright (c) 2022 Sourav Mandal, Swagata Acharya, Rohini Basak

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Classifying and Solving Arithmetic Math Word Problems—An Intelligent Math Solver

1 Introduction

Introduction to Mathematical Language Processing: Informal Proofs, Word Problems, and Supporting Tasks

Action Editor: David Chiang

Jordan Meadows, André Freitas; Introduction to Mathematical Language Processing: Informal Proofs, Word Problems, and Supporting Tasks. Transactions of the Association for Computational Linguistics 2023; 11: 1162–1184. doi: https://doi.org/10.1162/tacl_a_00594

Automating discovery in mathematics and science will require sophisticated methods of information extraction and abstract reasoning, including models that can convincingly process relationships between mathematical elements and natural language, to produce problem solutions of real-world value. We analyze mathematical language processing methods across five strategic sub-areas (identifier-definition extraction, formula retrieval, natural language premise selection, math word problem solving, and informal theorem proving) from recent years, highlighting prevailing methodologies, existing limitations, overarching trends, and promising avenues for future research.

Prove that there is no function $f$ from the set of non-negative integers into itself such that $f(f(n)) = n + 1987$ for every $n$.
Show that the nearest neighbor interaction Hamiltonian of an electronic quasiparticle in Graphene can be written as $H = \hbar\Omega \sum_{q} \left( f_q b_q^{\dagger} a_q + f_q^{*} a_q^{\dagger} b_q \right)$.
How is the sun’s atmosphere hotter than its surface?

If we hope to use machines to derive mathematically rigorous and explainable solutions to address such questions, models must reason over both natural language and mathematical elements such as equations, expressions, and variables. Given some input problem description, the ideal model is at least capable of recalling relevant statements (premise selection), assigning contextual descriptions to math elements within that text (identifier-definition extraction), and performing robust manipulation of equations and expressions towards an explainable reasoning argument (informal theorem proving). Previous years have advanced many of the components required to deliver this vision. Transformer-based (Vaswani et al., 2017) large language models (LLMs) (Brown et al., 2020; Chen et al., 2021) have begun to exhibit mathematical (Rabe et al., 2020) and logical (Clark et al., 2020) capabilities. Graph-based models also show competence in premise selection (Ferreira and Freitas, 2020b), math question answering (Feng et al., 2021), and math word problems (MWPs) (Zhang et al., 2022b). The evolutionary path of mathematical language processing can be traced from MWPs (Feigenbaum and Feldman, 1963; Bobrow, 1964; Charniak, 1969) and linguistic analysis of formal proofs (Zinn, 1999, 2003), to the present day, where transformers and graph-based models deliver leading metrics in math and language reasoning tasks, complemented by symbolic methods (Zhong et al., 2022). This survey provides a synthesis of this recent evolutionary arc: we consider five representative tasks with examples, describe contributions leading to the current state of the art, discuss notable limitations of the current solutions, overarching trends, and promising research directions.

There is an abundance of tasks considering mathematical language, such as question answering (Hopkins et al., 2019; Feng et al., 2021; Lewkowycz et al., 2022; Mansouri et al., 2022b) and headline generation (Yuan et al., 2020; Peng et al., 2021). Mathematical language processing (MLP) itself has been described in the context of various targeted texts, such as linking variables to descriptions (Pagael and Schubotz, 2014), grading answers (Lan et al., 2015), and deriving abstract representations for downstream applications (Wang et al., 2021). We take an inclusive stance, selecting a few choice tasks spanning surface-level retrieval, as seen in identifier-definition extraction and formula retrieval tasks, through models which require the encoding of formal abstractions and implicit reasoning chains, such as solving MWPs and informal theorem proving. These areas are projected onto an inference spectrum displayed in Figure 1. Extractive tasks are positioned close to the surface form of the text (information retrieval perspective), including identification of relevant mathematical statements, ranking lists of formulae, and linking variables to contextual definitions. Logical puzzle solvers (Groza and Nitu, 2022) and informal reasoning generation models (Lewkowycz et al., 2022) exist far into the abstractive side, due to the step-wise and sometimes symbolic reasoning required to address them. The use of “formal” versus “informal” differentiates strict automated theorem prover (ATP) approaches requiring the use of a consistent formal language representation (Rudnicki, 1992) and hard-coded logic (Bansal et al., 2019), from approaches that input mathematical language and infer without necessary reliance on strict symbolic and logical inference mechanisms. Autoformalization (Szegedy, 2020; Wu et al., 2022) aims to cross this divide.
We consider informal methods for solving five representative tasks in this context, with examples given below and visually displayed in Figure 1.

Extractive tasks tend to not require inference chains to solve them, compared to more abstractive tasks. Identifier-definition extraction assigns identifiers (e.g., ψ(x)) to their context. Formula retrieval considers the structure of formulae, and scores them based on similarity to a query formula. Premise selection selects statements most likely to be useful for solving a proof. Solving MWPs (math word problems) involves calculating solutions to arithmetic problems. Informal theorem proving involves the production of proofs and inference chains combining natural and mathematical language.

Identifier-Definition Extraction.

The assignment of meaning to otherwise vague mathematical elements. Without context, equations such as $p = \hbar k$ are ambiguous. What meaning is attributed to $k$? This task involves finding (identifier, definition) pairs, such as ($k$, wavevector) (Kristianto et al., 2012; Stathopoulos et al., 2018).
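As a rough illustration of what an (identifier, definition) pair looks like in practice, the following Python sketch pulls pairs out of a sentence with a single surface pattern. The pattern, the example sentence, and the function name `extract_pairs` are assumptions made for this sketch. The cited systems use learned models and richer features, not this regular expression.

```python
import re

# Hypothetical surface pattern: "<single-letter identifier> is/denotes/represents
# [the|a|an] <description>", where the description ends at "and" or punctuation.
PATTERN = re.compile(
    r"\b(?P<identifier>[A-Za-z])\s+(?:is|denotes|represents)\s+"
    r"(?:the\s+|an?\s+)?"
    r"(?P<definition>[a-z][a-z -]*?)(?=\s+and\b|[,.;]|$)"
)


def extract_pairs(text):
    """Return (identifier, definition) pairs found by the surface pattern."""
    return [(m.group("identifier"), m.group("definition"))
            for m in PATTERN.finditer(text)]


print(extract_pairs("Here p is the momentum and k is the wavevector."))
```

A pattern like this only covers definitions stated explicitly next to the identifier, which is why it is best read as a baseline rather than a solution.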

Formula Retrieval.

Mathematical language includes math elements written in markup languages such as LaTeX. Given a query formula, the Wikipedia Formula Browsing task (Zanibbi et al., 2016a; Mansouri et al., 2022b) involves ranking a list of candidate formulae in terms of their similarity to that formula. For example, given the query $x^2 + y^2 = r^2$, the formula $a^2 + b^2 = c^2$ should rank higher than $y = mx + c$.
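The ranking behaviour in this example can be reproduced with a small structural-similarity sketch in Python. It renames identifiers to positional placeholders so that only the shape of the formula matters, then compares token sequences with `difflib.SequenceMatcher` from the standard library. The tokenizer, the single-letter identifier assumption, and the function names are illustrative choices for this sketch, not the method of any cited system.

```python
import re
from difflib import SequenceMatcher

# Single letters are treated as identifiers (an assumption), digit runs as
# numbers, and any other non-space character as an operator token.
TOKEN = re.compile(r"[A-Za-z]|\d+|\S")


def canonical_tokens(formula):
    """Tokenize and rename identifiers to V1, V2, ... in order of appearance."""
    names = {}
    tokens = []
    for tok in TOKEN.findall(formula):
        if tok.isalpha():
            names.setdefault(tok, f"V{len(names) + 1}")
            tokens.append(names[tok])
        else:
            tokens.append(tok)
    return tokens


def rank(query, candidates):
    """Sort candidates by structural similarity to the query, best first."""
    q = canonical_tokens(query)
    scored = [(c, SequenceMatcher(None, q, canonical_tokens(c)).ratio())
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for formula, score in rank("x^2 + y^2 = r^2", ["y = mx + c", "a^2 + b^2 = c^2"]):
    print(f"{score:.2f}  {formula}")
```

Because variable names are abstracted away, the query and $a^2 + b^2 = c^2$ map to the same token sequence and receive the maximum score, while $y = mx + c$ scores lower.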

Natural Language Premise Selection (NLPS).

Given a mathematical statement $s$ that requires proof, and a collection of premises $P$, this task consists of retrieving the premises in $P$ that are most likely to be useful for proving $s$ (Ferreira and Freitas, 2020a; Valentino et al., 2022). For example, given the purple claim statement in Figure 1, an NLPS model should select the green statements as premises, excluding the red.
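To make the retrieval setting concrete, here is a naive lexical baseline in Python: it scores each premise by the Jaccard overlap between its words and the words of the statement, and returns the top k. The claim, the premises, and the function names are invented for illustration, and word overlap is a deliberately simple stand-in for the learned models used in the cited work.

```python
import re


def words(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_premises(statement, premises, k=2):
    """Return the k premises with the highest word overlap with the statement."""
    target = words(statement)
    ranked = sorted(premises, key=lambda p: jaccard(target, words(p)), reverse=True)
    return ranked[:k]


claim = "The sum of two even integers is even."
pool = [
    "A triangle has three sides.",
    "An integer is even if it is divisible by 2.",
    "The sum of two integers is an integer.",
]
print(select_premises(claim, pool, k=2))
```

Lexical overlap misses premises that are relevant but phrased differently, which is one reason the task is usually approached with learned representations.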

Math Word Problem Solving.

Solving arithmetic (Roy and Roth, 2016) or algebra (Kushman et al., 2014) word problems. For example: Andrew has 3 dogs. If they each give birth to 2 others, how many dogs will he have? An example requiring premise selection and identifier-definition extraction is given in Figure 1.
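The dog example above can be worked through as a single equation. The reading assumed here is that Andrew keeps his original 3 dogs and that each of them produces 2 new dogs. The source does not state the intended answer, so this is one plausible interpretation (the `\text` command requires the `amsmath` package).

```latex
\[
\underbrace{3}_{\text{original dogs}}
+ \underbrace{3 \times 2}_{\text{newborn dogs}}
= 3 + 6 = 9
\]
```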

Informal Theorem Proving.

Outputting reasoning chains from premises in order to “prove” a mathematical language statement. From Figure 1, the energy of the particle is $E_k = \gamma m c^2$. Substituting $v = 0$ into the Lorentz factor gives $\gamma = 1$, and substituting $\gamma = 1$ into $E_k = \gamma m c^2$ gives $E_k = m c^2$. Such informal reasoning does not rely on formal frameworks, such as Fitch-style proofs, to infer quantitative results (Lewkowycz et al., 2022).
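The two substitution steps in this example can be written out explicitly. The standard definition of the Lorentz factor is supplied here as background. It is not quoted in the text above, and the remaining symbols follow the paragraph.

```latex
\begin{align*}
\gamma &= \frac{1}{\sqrt{1 - v^2/c^2}} \\
v = 0 \;&\Rightarrow\; \gamma = \frac{1}{\sqrt{1 - 0}} = 1 \\
E_k &= \gamma m c^2 = 1 \cdot m c^2 = m c^2
\end{align*}
```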

We highlight key points abstracted from task approaches in bold, give an overview of methods in Table 1, and discuss approach-specific limitations in the Appendix.

Summary of different approaches for addressing tasks related to mathematical language processing. The methods are categorized in terms of (i) Learning: Supervised (S), Self-supervised (SS), Unsupervised (UNS), Rule-based (R) (no learning); (iii) Approach; (iv) Dataset; (v) Metrics: MAP (Mean Average Precision), P@K (Precision at K), Perplexity, P (Precision), R (Recall), F1, Acc (Accuracy), BLEU, METEOR, MRR (Mean Reciprocal Rank), Edit (edit distance); (vi) Math format: MathML, LaTeX, natural language (NL), Isabelle formal language. Diagrammatic representations of approaches in identifier-definition extraction (Figure 3), formula retrieval (Figure 4), and MWP solving (Figure 5) can be found in the Appendix.

3.1 Identifier-Definition Extraction

A significant proportion of variables or identifiers in formulae or text are explicitly defined within a discourse context (Wolska and Grigore, 2010). Descriptions are usually local to the first instance of the identifiers in the discourse. It is the broad goal of identifier-definition extraction and related tasks to pair variables with their intended meaning.

The task has not converged to a canonical form.

Despite the clarity of its overall aim, the task has materialized into different forms: Kristianto et al. (2012) predict descriptions given expressions, Pagael and Schubotz (2014) predict descriptions given identifiers through identifier-definition extraction, Stathopoulos et al. (2018) predict if a type matches a variable through variable typing, and Jo et al. (2021) predict notation given context through notation auto-suggestion and notation consistency checking tasks. More concretely, identifier-definition extraction (Schubotz et al., 2016a) involves scoring identifier-definiens pairs, where a definiens is a potential natural language description of the identifier. Given graph nodes from predefined variables $V$ and types $T$, variable typing (Stathopoulos et al., 2018) is the task of classifying whether edges in $V \times T$ are either existent (positive) or non-existent (negative), where a positive classification means a variable matches with the type. Notation auto-suggestion (Jo et al., 2021) uses the text of both the sentence containing notation and the previous sentence to model future notation from the vocabulary of the tokenizer. This area can be traced from an early ranking task (Pagael and Schubotz, 2014) reliant on heuristics and rules (Alexeeva et al., 2020), through ML-based edge classification (Stathopoulos et al., 2018), to language modeling with Transformers (Jo et al., 2021). Different datasets are proposed for each task variant.
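The edge-classification framing of variable typing can be sketched in Python by enumerating every (variable, type) pair and labelling each one positive or negative. The labelling rule below, a "<variable> be/is a <type>" pattern, is a hypothetical placeholder for the trained classifier described in the cited work, and the example sentence and function name are likewise assumptions for illustration.

```python
import re
from itertools import product


def type_edges(sentence, variables, types):
    """Label every edge in variables x types as existent (True) or not (False).

    The decision rule is a toy surface pattern (an assumption for this sketch),
    standing in for a learned edge classifier.
    """
    edges = {}
    for var, typ in product(variables, types):
        pattern = (rf"\b{re.escape(var)}\s+(?:be|is)\s+(?:a|an|the)\s+"
                   rf"{re.escape(typ)}\b")
        edges[(var, typ)] = re.search(pattern, sentence) is not None
    return edges


sentence = "Let G be a group and let H be a subgroup of G."
for edge, positive in type_edges(sentence, ["G", "H"], ["group", "subgroup"]).items():
    print(edge, "positive" if positive else "negative")
```

Framing the task over the full Cartesian product makes the class imbalance visible: most candidate edges are negative, and only a few are existent.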

There is a high variability in scoping definitions.

The scope from which identifiers are linked to descriptions varies significantly, and it is difficult to compare model performance even when tackling the same variant of the task (Schubotz et al., 2017; Alexeeva et al., 2020). At a local context, models such as Pagael and Schubotz (2014) and Alexeeva et al. (2020) match identifiers with definitions from the same document “as the author intended”, while other identifier-definition extraction methods (Schubotz et al., 2016a, 2017) rely on data external to a given document, such as links to semantic concepts on Wikidata and NTCIR-11 test data (Schubotz et al., 2015). At a broader context, the variable typing model proposed in Stathopoulos et al. (2018) relies on an external dictionary of types (Stathopoulos and Teufel, 2015; Stathopoulos and Teufel, 2016; Stathopoulos et al., 2018) extracted from both the Encyclopedia of Mathematics and Wikipedia.

Vector representations have evolved to transfer knowledge from previous tasks, allowing downstream variable typing tasks to benefit from pre-trained embeddings.

Overall, vector representations of text have evolved from feature-based vectors learned from scratch for a single purpose, to the modern paradigm of pre-trained embeddings re-purposed for novel tasks. Kristianto et al. (2012) input pattern features into a conditional random fields model for the purpose of identifying definitions of expressions in LaTeX papers, while Kristianto et al. (2014a) learn vectors through a linear-kernel SVM with input features comprising sentence patterns, part-of-speech (POS) tags, and tree structures. Stathopoulos et al. (2018) extend this approach by adding type- and variable-centric features as a baseline, also with a linear kernel. Alternatively, Schubotz et al. (2017) use a Gaussian scoring function (Schubotz et al., 2016b) and pattern matching features (Pagael and Schubotz, 2014) as input to an SVM with a radial basis function (RBF) kernel, to account for non-linear feature characteristics. Alternative classification methods (Kristianto et al., 2012; Stathopoulos et al., 2018) do not use input features derived from non-linear functions, such as the Gaussian scoring function, and hence use linear kernels. Embedding spaces have been learned in this context for the purpose of ranking identifier-definiens pairs through latent semantic analysis at the document level, followed by the application of clustering techniques and methods of relating clusters to namespaces inherited from software engineering (Schubotz et al., 2016a). These cluster-based namespaces are later used for classification (Schubotz et al., 2017) rather than ranking, but do not positively impact SVM model performance, despite previous evidence suggesting they resolve co-references (Duval et al., 2002) such as “E is energy” and “E is expectation value”. Neither clustering nor namespaces have been further explored in this context.
More recent work learns context-specific word representations after feeding less specific pre-trained word2vec (Mikolov et al., 2013 ; Stathopoulos and Teufel, 2016 ) embeddings to a bidirectional LSTM for classification (Stathopoulos et al., 2018 ). The most recent work predictably relies on more sophisticated pre-trained BERT embeddings (Devlin et al., 2018 ) for the language modeling of mathematical notation (Jo et al., 2021 ). VarSlot (Ferreira et al., 2022 ) obtains SOTA results on variable typing (Stathopoulos et al., 2018 ), and demonstrates robustness to variable renaming, by fine-tuning the sentence transformers (Reimers and Gurevych, 2019 ) SciBERT (Beltagy et al., 2019 ) encoder on augmented data , learning separate representation spaces for variables and mathematical language statements. Four BERT encoder-based approaches (Lee and Na, 2022 ; Popovic et al., 2022 ; Ping and Chi, 2022 ; van der Goot, 2022 ) were submitted to the Symlink task (Lai et al., 2022 ), following the trend of knowledge transfer through pretrained embeddings.

3.2 Formula Retrieval

Formula retrieval is the task of retrieving equations similar to a query equation, with applications in math-aware search engines (Mansouri et al., 2022a). Guidi and Coen (2016) and Zanibbi and Blostein (2011) emphasize the encoding of formulae and their context for retrieval tasks.

Combining formula tree representations improves retrieval.

There are two prevalent types of tree representations of formulae: Symbol Layout Trees (SLTs) and Operator Trees (OPTs), shown in Figure 2 .

Figure 2: Formula (a) y = e^x with its Symbol Layout Tree (SLT) (b) and Operator Tree (OPT) (c). SLTs represent formula appearance by the spatial arrangements of math symbols, while OPTs define the mathematical operations represented in expressions. For more detail, see Mansouri et al. (2019).
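To make the distinction concrete, the following is a minimal sketch of how the two trees for y = e^x could be written down as plain data. The edge labels (`next`, `above`) and node labels (`eq`, `pow`, `var`, `const`) are illustrative assumptions, not the vocabulary of any particular retrieval system.

```python
# Illustrative encodings of y = e^x. All labels are assumptions made for
# this sketch, not the exact vocabulary of an existing system.

# Symbol Layout Tree: edges follow spatial relations between symbols.
slt_edges = [
    ("y", "next", "="),
    ("=", "next", "e"),
    ("e", "above", "x"),  # x sits in the superscript position of e
]

# Operator Tree: internal nodes are operations, leaves are operands.
opt = ("eq", ("var", "y"), ("pow", ("const", "e"), ("var", "x")))


def opt_leaves(node):
    """Return the operand leaves of an operator tree, left to right."""
    label, *children = node
    if label in ("var", "const"):
        return [children[0]]
    leaves = []
    for child in children:
        leaves.extend(opt_leaves(child))
    return leaves


# Symbols in layout order: the first source symbol, then each edge target.
slt_symbols = [slt_edges[0][0]] + [dst for _, _, dst in slt_edges]
print(slt_symbols)
print(opt_leaves(opt))
```

The SLT keeps the equals sign as just another symbol in the layout, whereas in the OPT it becomes the root operation and only the operands remain as leaves.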

Methods reliant solely on SLTs, such as the early versions of the Tangent retrieval system (Pattaniyil and Zanibbi, 2014; Zanibbi et al., 2015, 2016b), or solely on OPTs (Zhong and Zanibbi, 2019; Zhong et al., 2020) tend to return less relevant formulae from queries. OPTs capture formula semantics while SLTs capture visual structure (Mansouri et al., 2019). Effective representation of both formula layout and semantics within a single vector allows a model to exploit both representations. Tangent-S (Davila and Zanibbi, 2017) was the first evolution of the Tangent system to outperform the NTCIR-11 (Aizawa et al., 2014) overall best performer, MCAT (Kristianto et al., 2014b, 2016), which encoded path and sibling information from MathML Presentation (SLT-based) and Content (OPT-based). Tangent-S jointly integrated SLTs and OPTs by combining scores for each representation through a simple linear regressor. Later, Tangent-CFT (Mansouri et al., 2019) considered SLTs and OPTs through a fastText (Bojanowski et al., 2017) n-gram embedding model using tree tuples. MathBERT (Peng et al., 2021) does not explicitly account for SLTs, claiming that LaTeX markup somewhat accounts for them, and therefore encodes OPTs only. The authors pre-train the BERT (Devlin et al., 2018) model with targeted objectives, each accounting for different aspects of mathematical text. They account for OPTs by concatenating node sequences to formula + context BERT input sequences, and by formulating OPT-based structure-aware pre-training tasks learned in conjunction with masked language modeling (MLM).
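The score combination described for Tangent-S can be illustrated with a small sketch. The function name, the example scores, and the equal weights below are all hypothetical; in a real system the weights would be fitted by a regressor on relevance judgments rather than set by hand.

```python
def fuse_scores(slt_scores, opt_scores, w_slt=0.5, w_opt=0.5):
    """Rank candidates by a weighted sum of SLT-based and OPT-based scores.

    The weights are placeholders (an assumption of this sketch); they stand
    in for coefficients that would be learned from labelled data.
    """
    candidates = set(slt_scores) | set(opt_scores)
    fused = {
        c: w_slt * slt_scores.get(c, 0.0) + w_opt * opt_scores.get(c, 0.0)
        for c in candidates
    }
    # Sort by descending fused score; break ties by candidate id.
    return sorted(fused, key=lambda c: (-fused[c], c))


# Hypothetical similarity scores for three candidate formulae.
slt = {"f1": 0.9, "f2": 0.4}
opt = {"f1": 0.2, "f2": 0.8, "f3": 0.5}
ranking = fuse_scores(slt, opt)
print(ranking)
```

Here a candidate that looks similar but differs in meaning (`f1`) is overtaken by one that scores well on both views (`f2`), which is the intuition behind combining the two representations.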

Leaf-root path tuples deliver an effective mechanism for embedding relations between symbol pairs.

Leaf-root path tuples are now ubiquitous in formula retrieval (Zanibbi et al., 2015, 2016b; Davila and Zanibbi, 2017; Zhong and Zanibbi, 2019; Mansouri et al., 2019; Zhong et al., 2020) and their use for NTCIR-11/12 retrieval has varied since their conception (Stalnaker and Zanibbi, 2015). Initially, pair tuples were used within a TF-IDF weighting scheme (Pattaniyil and Zanibbi, 2014); Zanibbi et al. (2015, 2016b) then proposed an appearance-based similarity metric using SLTs, maximum subtree similarity (MSS). OPT tuples were integrated later (Davila and Zanibbi, 2017). Mansouri et al. (2019) treat tree tuples as words, extract n-grams, and learn fastText (Bojanowski et al., 2017) formula embeddings. Zhong and Zanibbi (2019) and Zhong et al. (2020) forgo machine learning altogether with an OPT-based heuristic search (Approach0) through a generalization of MSS (Zanibbi et al., 2016b). Leaf-root path tuples effectively map symbol-pair relations and account for formula substructure, but there is no consensus on how best to integrate them into existing machine learning or explicit retrieval frameworks.
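As a rough illustration of how path tuples relate symbol pairs, the sketch below enumerates every ancestor-descendant pair in a small layout tree for y = e^x, together with the edge labels on the path between them. The tree encoding, the single-letter edge labels (`n` for next, `a` for above), and the three-field tuple are simplifying assumptions; tuples in actual systems carry additional fields.

```python
def descendants(node):
    """Yield (symbol, path) for every descendant of node.

    A node is (symbol, [(edge_label, child_node), ...]).
    """
    _, children = node
    for edge, child in children:
        yield child[0], (edge,)
        for sym, path in descendants(child):
            yield sym, (edge,) + path


def symbol_pair_tuples(node):
    """Return (ancestor, descendant, path) for all such pairs in the tree."""
    symbol, children = node
    pairs = [(symbol, sym, "".join(path)) for sym, path in descendants(node)]
    for _, child in children:
        pairs.extend(symbol_pair_tuples(child))
    return pairs


# Hypothetical layout tree for y = e^x: y -> = -> e, with x above e.
tree = ("y", [("n", ("=", [("n", ("e", [("a", ("x", []))]))]))])
tuples = symbol_pair_tuples(tree)
for t in tuples:
    print(t)
```

Each tuple can then be treated as a token, for instance as a term in a TF-IDF index or as a "word" for an embedding model, which is the role the text above describes.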

Purely explicit methods still deliver competitive results.

Explicit representation methods are those that rely on prescribed representations (structural relations and associated types) rather than learned implicit relationships. Tangent-CFT (Mansouri et al., 2019 ) and MathBERT (Peng et al., 2021 ) are two models to employ learning techniques beyond the level of linear regression. Each model is integrated with Approach0 (Zhong and Zanibbi, 2019 ) through the linear combination of individual model scores. This respectively forms the TanApp and MathApp baselines in Peng et al. ( 2021 ). Approach0 achieves the highest full bpref score of the individual models. While we focus primarily on the NTCIR-12 dataset, recent work (Zhong et al., 2022 ) evaluates a selection of transformer-based models on both NTCIR-12 and ARQMath-2 (Mansouri et al., 2021b ) datasets. They confirm that MathBERT delivers SOTA performance on partial bpref, and Approach0 combined with a fine-tuned dense passage retrieval (DPR) model (Karpukhin et al., 2020 ) outperforms on full bpref (Approach0 + DPR). Combining explicit similarity-based search (Zhong and Zanibbi, 2019 ; Meadows and Freitas, 2021 ) with modern encoders (Khattab and Zaharia, 2020 ; Karpukhin et al., 2020 ) delivers leading performance.
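Since the comparisons above are reported in bpref, a short sketch of the metric may help. It follows one common formulation, in which each retrieved relevant item is penalised by the number of judged non-relevant items ranked above it and unjudged items are ignored; evaluation toolkits differ in details, so the exact normalisation here is an assumption. The document identifiers are made up.

```python
def bpref(ranking, relevant, non_relevant):
    """Compute bpref for one query.

    ranking: list of item ids, best first.
    relevant / non_relevant: sets of judged item ids.
    Items in neither set are unjudged and do not affect the score.
    """
    r_count = len(relevant)
    if r_count == 0:
        return 0.0
    denom = min(r_count, len(non_relevant))
    total = 0.0
    non_rel_seen = 0
    for item in ranking:
        if item in non_relevant:
            non_rel_seen += 1
        elif item in relevant:
            if denom == 0:
                total += 1.0
            else:
                # Cap the penalty count at the number of relevant items.
                total += 1.0 - min(non_rel_seen, r_count) / denom
    return total / r_count


ranking = ["d1", "d2", "d3", "d4", "d5"]
score = bpref(ranking, relevant={"d1", "d4", "d6"}, non_relevant={"d2", "d5"})
print(score)
```

In this example `d3` is unjudged and is skipped, `d4` is penalised for the one non-relevant item above it, and the unretrieved relevant item `d6` contributes nothing.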

3.3 Natural Language Premise Selection

Formal and informal premise selection both involve the selection of relevant statements for proving a given conjecture (Irving et al., 2016; Wang et al., 2017a; Ferreira and Freitas, 2020a). The difference lies in the language in which the premises and related proof elements are encoded (either conforming to a logical form or as they appear in mathematical text). Mathematical language as it occurs in papers and textbooks (Wolska and Kruijff-Korbayová, 2004) is not compatible with existing provers without autoformalization, a widely acknowledged bottleneck for the construction of formal proof libraries (Irving et al., 2016). Typically, when reasoning over large formal libraries comprising thousands of premises, the performance of automated theorem provers (ATPs) degrades considerably, while for a given proof only a fraction of the premises are required to complete it (Urban et al., 2010; Alama et al., 2014). Theorem proving is essentially a search problem with a combinatorial search space, and the goal of formal premise selection is to reduce the space, making theorem proving tractable (Wang et al., 2017a). While formal premises are written in the languages of formal libraries such as Mizar (Rudnicki, 1992), informal premises, as seen in ProofWiki, 2 are written in combinations of natural language and LaTeX (Ferreira and Freitas, 2020a; Welleck et al., 2021a). Proposed approaches either rank (Han et al., 2021) or classify (Ferreira and Freitas, 2020b, 2021) candidate premises for a given proof. Natural language premise selection was originally formulated as pairwise relevance classification, evaluated with F1 (Ferreira and Freitas, 2020b, 2021), but has since been evaluated with ranking metrics (Valentino et al., 2022). Alternatively, Welleck et al. (2021a) propose mathematical reference retrieval as an analogue of premise selection. Given a theorem, the goal is to retrieve the set of references (theorems, lemmas, definitions) that occur in its proof, formulated as a ranking problem.

Separate mechanisms for representing mathematics and natural language can improve performance.

Regardless of the task variation, most current methods do not fully discriminate the semantics of mathematics and natural language, not specifically accounting for aspects of each modality. Ferreira and Freitas ( 2020b ) extract a dependency graph representing dual-modality mathematical statements as nodes, and solve a link prediction task (Zhang and Chen, 2018 ). Recent transformer baselines (Ferreira and Freitas, 2020b ; Welleck et al., 2021a ; Han et al., 2021 ; Coavoux and Cohen, 2021 ), and those at the shared NLPS task (Valentino et al., 2022 ), also do not differentiate between mathematical elements and natural language (Tran et al., 2022 ; Kadusabe et al., 2022 ; Kovriguina et al., 2022 ). STAR (Ferreira and Freitas, 2021 ) purposefully separates the two modalities, encoding distinct representations through self-attention. Explicit disentanglement of the modalities forces STAR to exploit relationships between natural language and mathematics, through the BiLSTM layer. Neuroscience research suggests the brain handles mathematics separately to language (Butterworth, 2002 ; Amalric and Dehaene, 2016 ; Kulasingham et al., 2021 ).

3.4 Math Word Problems

Solving math word problems dates back to the dawn of artificial intelligence research (Feigenbaum and Feldman, 1963 ; Bobrow, 1964 ; Charniak, 1969 ). It can be defined as the task of translating a problem description paragraph into a set of equations to be solved (Li et al., 2020 ). We focus on trends in the task since 2019, as a detailed survey (Zhang et al., 2019 ) captures prior work.

Use of dependency graphs is instrumental to support inference.

In graph-based approaches to solving MWPs, embeddings of words, numbers, or relationship graph nodes are learned through graph encoders , which feed information through to tree (or sequence) decoders. Embeddings are decoded into expression trees which determine the problem solution. Li et al. ( 2020 ) learn the mapping between a heterogeneous graph representing the input problem, and an output tree. The graph is constructed from word nodes with relationship nodes of a parsing tree. This is either a dependency parse tree or constituency tree. Zhang et al. ( 2020 ) represent two separate graphs: a quantity cell graph associating descriptive words with problem quantities, and a quantity comparison graph which retains numerical qualities of the quantity, and leverages heuristics to represent relationships between quantities such that solution expressions reflect a more realistic arithmetic order. Shen and Jin ( 2020 ) also extract two graphs: a dependency parse tree and numerical comparison graph. Zhang et al. ( 2022b ) construct a heterogeneous graph from three subgraphs: a word-word graph containing syntactic and semantic relationships between words, a number-word graph , and a number comparison graph . Although other important differences exist (such as decoder choice), it seems models benefit from relating linguistic aspects of problem text through separate graphs.
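One of the auxiliary graphs mentioned above, the quantity (or number) comparison graph, is simple enough to sketch. The version below adds a directed edge from each quantity to every strictly smaller one; this is one plausible reading of the construction and is an assumption of the sketch, as are the function name and the example numbers.

```python
def quantity_comparison_edges(quantities):
    """Return directed edges (i, j) such that quantities[i] > quantities[j].

    Indices refer to the order in which quantities appear in the problem.
    """
    return [
        (i, j)
        for i, qi in enumerate(quantities)
        for j, qj in enumerate(quantities)
        if i != j and qi > qj
    ]


# Quantities from a hypothetical problem:
# "Tom has 3 apples, buys 5 more, and gives away 2."
edges = quantity_comparison_edges([3, 5, 2])
print(edges)
```

Such edges give an encoder access to the relative magnitude of the numbers, which helps it prefer expressions such as 5 - 2 over 2 - 5 when the problem implies a non-negative result.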

Multi-encoders and multi-decoders improve performance by combining complementary representations.

Another impactful decision is the choice of encoder/decoder, and whether to consider alternative representations of a problem. To highlight this, we consider the following comparison. Shen and Jin ( 2020 ) and Zhang et al. ( 2020 ) each extract two graphs from the problem text. One is a number comparison graph, and the other relates word-word pairs (Shen and Jin, 2020 ) or word-number pairs (Zhang et al., 2020 ). They both encode two graphs rather than one heterogeneous graph (Li et al., 2020 ; Zhang et al., 2022b ). They both use a similar tree-based decoder (Xie and Sun, 2019 ). A key difference is that Shen and Jin ( 2020 ) include an additional sequence-based encoder and decoder . The sequence-based encoder first obtains a textual representation of the input paragraph, then the graph-based encoder integrates the two encoded graphs. Then tree-based and sequence-based decoders generate different equation expressions for the problem with an additional mechanism for optimizing solution expression selection. In their own work, Shen and Jin ( 2020 ) demonstrate the impact of multi-encoders/decoders over each encoder/decoder option individually through ablation. Zhang et al. ( 2022a ) similarly combine top-down and bottom-up reasoning to achieve leading results.

Goal-driven decompositional tree-based decoders are a significant component in the state-of-the-art.

Introduced in Xie and Sun (2019) as part of their goal-driven tree-structured (GTS) model, this class of decoder is considered by most of the discussed approaches, including non-graph-based models (Qin et al., 2021; Liang et al., 2021). In GTS, goal vectors guide construction of expression subtrees (from token node embeddings) in a recursive manner, until a solution expression tree is generated. Some proposed models expand on the GTS-based decoder through the inclusion of semantically aligned universal expression trees (Qin et al., 2020, 2021), though this adaptation is not as widely used. Some state-of-the-art models (Liang et al., 2021; Zhang et al., 2022b) follow the GTS decoder closely.
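The decoder itself is a learned network and is not reproduced here. The sketch below only illustrates the output side: how a solution expression tree, assumed to be serialised as a pre-order (prefix) token sequence, decomposes recursively into a left and a right subgoal and is evaluated to a numeric answer. The function name and the token format are assumptions of this sketch.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_prefix(tokens):
    """Evaluate a binary expression tree given as a pre-order token list."""

    def walk(pos):
        # Returns (value of the subtree starting at pos, next unread position).
        tok = tokens[pos]
        if tok in OPS:
            left, pos = walk(pos + 1)   # left subgoal
            right, pos = walk(pos)      # right subgoal
            return OPS[tok](left, right), pos
        return float(tok), pos + 1      # leaf: a quantity from the problem

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("tokens do not form a single expression tree")
    return value


# (3 + 5) * 2 in pre-order.
answer = eval_prefix(["*", "+", "3", "5", "2"])
print(answer)
```

The recursion mirrors the top-down decomposition described above: an operator node splits the current goal into two subgoals, and a quantity leaf closes a subgoal.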

Language models that transfer knowledge learned from auxiliary tasks rival models based on explicit graph representation of problem text.

As an alternative to encoding explicit relations through graphs, other work (Kim et al., 2020; Qin et al., 2021; Liang et al., 2021) relies on pre-trained transformer-based models, and those which incorporate auxiliary tasks assumed relevant for solving MWPs, to learn such relations latently. However, it seems that auxiliary tasks alone do not deliver competitive performance (Qin et al., 2020) without extensive pre-training on large corpora, as seen with BERT-based transformer models. These use either the ALBERT (Lan et al., 2019) encoder and decoder (Kim et al., 2020), or a BERT-based encoder with a goal-driven tree-based decoder (Liang et al., 2021). More recent work (Cao et al., 2021; Jie et al., 2022; Zhang et al., 2022a) involves iterative relation extraction frameworks for predicting mathematical relations between numerical tokens.

3.5 Informal Theorem Proving

Formal automated theorem proving in logic is among the most abstract forms of reasoning materialised in the AI space. There are two major bottlenecks (Irving et al., 2016) that formal methods must overcome: (1) translating informal mathematical text into formal language (autoformalization), and (2) a lack of strong automated reasoning methods to fill in the gaps in already formalized human-written proofs. Informal methods either tackle autoformalization directly (Wang et al., 2020; Wu et al., 2022), or circumvent it through language modeling-based proof generation (Welleck et al., 2021a, 2021b), trading formal rigor and inference control for flexibility. Transformer-based models have been proposed for mathematical reasoning (Polu and Sutskever, 2020; Rabe et al., 2020; Wu et al., 2021). Converting informal mathematical text into forms which are interpretable by computers (Kaliszyk et al., 2015a, 2015b; Szegedy, 2020; Wang and Deng, 2020; Meadows and Freitas, 2021) can strategically impact the dialogue between knowledge expressed in natural text, and a large spectrum of solvers.

Autoformalization could be addressed through approximate translation and exploration rather than direct machine translation.

A long-studied and challenging endeavour (Zinn, 1999, 2003), autoformalization involves converting informal mathematical text into language interpretable by theorem provers (Kaliszyk et al., 2015b; Wang et al., 2020; Szegedy, 2020). Kaliszyk et al. (2015b) propose statistical learning methods for parsing ambiguous formulae over the Flyspeck formal mathematical corpus (Hales, 2006). Using machine translation models (Luong et al., 2017; Lample et al., 2018; Lample and Conneau, 2019), Wang et al. (2020) explore dataset translation experiments between LaTeX code extracted from ProofWiki, and formal libraries Mizar (Rudnicki, 1992) and TPTP (Sutcliffe and Suttner, 1998). The supervised RNN-based neural machine translation model (Luong et al., 2017) outperforms the transformer-based (Lample et al., 2018) and MLM pre-trained transformer-based (Lample and Conneau, 2019) models, with the performance boost stemming from its use of alignment data. Szegedy (2020) advises against such direct translation efforts, instead proposing a combination of exploration and approximate translation through predicting formula embeddings. In seq2seq models, embeddings are typically granular, encoding word-level or symbol-level (Jo et al., 2021) tokens. The proposed method consists of learning mappings from natural language input to premise statements near the desired statement in the embedding space, traversing the space between statements using a suitable prover (Bansal et al., 2019). Guided mathematical exploration for real-world proofs is still an unaddressed problem and does not scale well with step-distance between current and desired conjecture. Wu et al. (2022) directly autoformalize small competition problems to Isabelle statements using language models. In line with the earlier suggestion of Szegedy (2020), they also autoformalize statements as targets for proof search with a neural theorem prover.

The need for developing robust interactive natural language theorem provers.

We discuss the closest equivalent to formal theorem proving in an informal setting. Welleck et al. (2021a) propose a mathematical reference generation task. Given a mathematical claim, the order and number of references within a proof are predicted. A reference is a theorem, definition, or a page that is linked to within the contents of a statement or proof. Each theorem $x$ has a proof containing a sequence of references $y = (r_1, \ldots, r_{|y|})$, for references $r_m \in R$. Where the retrieval task assigns a score to each reference in $R$, the generation task produces a variable-length sequence of references $(\hat{r}_1, \ldots, \hat{r}_{|y|})$ with the goal of matching $y$, for which a BERT-based model is employed and fine-tuned on various data sources. Welleck et al. (2021b) expand on their proof generation work, proposing two related tasks: next-step suggestion, where a step from a proof $y$ (as described above) is defined as a sequence of tokens to be generated, given the previous steps and $x$; and full-proof generation, which extends this to generate the full proof. They employ BART (Lewis et al., 2019), an encoder-decoder model pre-trained with denoising tasks, and augment the model with reference knowledge using Fusion-in-Decoder (Izacard and Grave, 2020). The intermediate denoising training and knowledge-grounding improve model performance by producing better representations of (denoised) references for deployment at generation time, and by encoding reference-augmented inputs. Minerva (Lewkowycz et al., 2022) is a language model capable of producing step-wise reasoning with mathematical language (LaTeX). The authors fine-tune a PaLM decoder-only model (Chowdhery et al., 2022) on webpages containing MathJax formatted expressions, and evaluate on school-level math problems (Hendrycks et al., 2021; Cobbe et al., 2021), a STEM subset of problems (Hendrycks et al., 2020) of varying difficulty, undergraduate-level STEM problems, and the National Math Exam in Poland.
They evaluate for generalization capabilities by generating 20 alternative evaluation problems, perturbing problem wording and numerical values in the MATH (Hendrycks et al., 2021 ) dataset, and compare accuracy before and after the change. While they suggest “minimal memorization”, the numerical intervention comparison does less to support this claim.

Various datasets have been proposed for tasks related to identifier-definition extraction and variable typing (Schubotz et al., 2016a ; Alexeeva et al., 2020 ; Stathopoulos et al., 2018 ; Jo et al., 2021 ), with limited adoption. The Symlink shared task (Lai et al., 2022 ) is an emerging solution, with training data, annotations of 102 papers, and high inter-annotator agreement. Formula retrieval data exists through NTCIR-12 (Zanibbi et al., 2016a ), which has been expanded in the most recent ARQMath task (Mansouri et al., 2022b ), removing formula duplicates and balancing query complexity. Premise selection datasets include PS-ProofWiki (Ferreira and Freitas, 2020a ), used in the NLPS shared task (Valentino et al., 2022 ), and NaturalProofs (Welleck et al., 2021a ). The latter is more inclusive, comprising ProofWiki, text books, and other sources. Modern consensus MWP datasets include (easy) MAWPS (Koncel-Kedziorski et al., 2016 ), (medium) Math23K (Wang et al., 2017b ), and (hard) MathQA (Amini et al., 2019 ), comprising both Chinese and English problems. GSM8K (Cobbe et al., 2021 ) claims to resolve diversity, quality, and language (Huang et al., 2016 ) issues from previous datasets, involves step-wise reasoning and natural language solutions, with balanced difficulty. MATH (Hendrycks et al., 2021 ) is larger and more difficult than GSM8K. Informal theorem proving data includes NaturalProofs (Welleck et al., 2021a ), and some MWP datasets involving step-wise reasoning with mathematical language, such as MATH and GSM8K. However, there is no consensus data for autoformalization or theorem proving from mathematical language input involving sequence learning. ProofNet (Azerbayev et al., 2022 ) aims to remedy this, by providing 297 theorem statements expressed in both natural and formal (Moura et al., 2015 ) language, at undergraduate difficulty. Some are accompanied by informal proofs. 
MiniF2F (Zheng et al., 2021 ) is a neural theorem proving benchmark of Olympiad-level problems written in many formal languages. Lila (Mishra et al., 2022 ) provides data for 23 math reasoning tasks. Key datasets information is described in Table 2 .

Table 2: Key datasets for the representative tasks.

Data Scarcity.

Some datasets, such as MATH and the Auxiliary Mathematics Problems and Solutions (AMPS) (Hendrycks et al., 2021 ) datasets, include detailed workings at high school to undergraduate level difficulty. If we aim to use models to produce new mathematics, equivalent datasets composed of the research workings of actual mathematicians would be invaluable. Meadows and Freitas ( 2021 ) attempt to tackle this problem for a single research paper in a very limited setting.

State-of-the-art.

In identifier-definition extraction , leading performance is obtained on Symlink by Lee and Na ( 2022 ), using a SciBERT encoder and MRC-based model (Li et al., 2019 ). Importantly, rather than the BERT tokenizer, they use a rule-based symbol tokenizer , evidencing the benefits of discerning natural language from math elements. VarSlot (Ferreira et al., 2022 ) leads in variable typing, and echoes the importance of such discrimination (see Section 3.2). In formula retrieval , SOTA methods generally include linear combinations of scores obtained from symbolic and neural models. On NTCIR-12, Zhong et al. ( 2022 ) show that MathBERT leads on partial bpref, and Approach0 + DPR leads on full bpref (see Section 3.2). Approach0 + ColBERT (Khattab and Zaharia, 2020 ) leads on ARQMath-2 (Mansouri et al., 2021b ). This work reinforces the importance of including formula structure across multiple tasks. In premise selection , leading results are obtained on the shared NLPS task by a fine-tuned RoBERTa-large encoder (Liu et al., 2019b ), computing similarity scores between statements with Manhattan distance (Tran et al., 2022 ). However, none of the competing models discern mathematical elements from natural language, or include formula structure. In MWP solving , the multi-view model (Zhang et al., 2022a ) achieves state-of-the-art results on Math23K, MAWPS, and MathQA. Minerva, and the Diverse approach (Li et al., 2022 ) based on OpenAI code-davinci-002, lead on MATH. Minerva also beats the national 57% average by 8% on the Polish national math exam. In informal theorem proving , we discuss autoformalization and theorem proving from mathematical language. In the former, code-davinci-002 leads on ProofNet. In the latter, a BART-based model leads on NaturalProofs, and Codex (Chen et al., 2021 ) fine-tuned on autoformalized theorems (Wu et al., 2022 ), leads on MiniF2F. 
These later methods, particularly those that score highly on MATH, largely consist of fine-tuning generative LLMs also without distinctly considering mathematical content or structure.

Separate Representations for Math and Natural Language.

Many models do not yet take advantage of processing each modality separately, although those that do show clear gains. The leading model on Symlink uses a special tokenizer to extract math symbols from scientific documents (Lee and Na, 2022). VarSlot improves variable typing by learning separate representation spaces for variables and mathematical language statements (Ferreira et al., 2022). STAR (Ferreira and Freitas, 2021) improves on a self-attention baseline encoding combined math/language statements, by separately encoding math and language with the same encoder. MathBERT learns embeddings from tree and LaTeX representations of formulae, and natural language (Peng et al., 2021). The Approach0 + [encoder] models linearly combine scores from entirely different methods: one designed for formulae, and one for language (Zhong et al., 2022). Multi-view learns an embedding each for words, quantities, and operations (Zhang et al., 2022a). All of the above are state-of-the-art and show advantage over baselines that do not invoke separate mechanisms. Despite this evidence, methods related to informal theorem proving and premise selection, such as Minerva, IJS (Tran et al., 2022), and others, do not discriminate math from language. This is likely true for other subfields of MLP.

Math as Trees.

Many approaches do not incorporate formula structure. For problems involving multi-variate mathematical terms, obvious choices for this are OPTs and SLTs ( Figure 2 ). For example, Approach0 considers formula OPTs, without learning , to achieve competitive results. Inclusion of OPTs during BERT training has been shown to improve performance over BERT in formula retrieval, formula headline generation, and formula topic classification (Peng et al., 2021 ), and is also used in math question answering (Mansouri et al., 2021a ).

Combining Complementary Representations from the Same Input.

Combined use of OPTs and SLTs of the same formula has been suggested to improve formula retrieval performance (Davila and Zanibbi, 2017 ; Mansouri et al., 2019 ; Mansouri et al., 2021a ). This extends to dual-modality mathematical language input. Shen and Jin ( 2020 ) obtain sequence and graph encodings of MWPs, and use sequence and tree-based decoders in unison, with an ablation describing advantage over single encoder representations. The leading MWP solver (Zhang et al., 2022a ) generates two independent solution expression embeddings, by top-down decomposition (Xie and Sun, 2019 ) and bottom-up construction, which are projected into the same latent space.

Conclusion.

Delivering mathematical reasoning over discourse requires close integration between step-wise inference control over localised explicit representations (symbolic perspective), and distributed representations to approximate and cope with incomplete knowledge (neural perspective). The current spectrum of mathematical language processing techniques elicits the key components, representational choices and tasks which are central to the conceptualisation of mathematical inference. Integrating the best-performing representational choices across different subtasks, such as distinct mechanisms for processing natural language and formulae, learning complementary representations of mathematical problem text, and incorporating formula structure, represents a short-term opportunity to develop mathematically robust models capable of more coherent argumentation, reasoning, and retrieval.

This work was partially funded by the Swiss National Science Foundation (SNSF) project NeuMath (200021_204617).

1 https://encyclopediaofmath.org

2 https://proofwiki.org/wiki/Main_Page

Identifier-Definition Extraction Limitations.

Methods considering the link between identifiers and their definitions have split off into at least three recent tasks: identifier-definition extraction (Schubotz et al., 2017 ; Alexeeva et al., 2020 ), variable typing (Stathopoulos et al., 2018 ), and notation auto-suggestion (Jo et al., 2021 ). A lack of consensus on the framing of the task and data prevents a direct comparison between methods. Schubotz et al. ( 2017 ) advise against using their gold standard data for training due to certain extractions being too difficult for automated systems, among other reasons. They also propose future research should focus on recall due to current methods extracting exact definitions for only 1/3 of identifiers, and suggest use of multilingual semantic role labeling (Akbik et al., 2016 ) and logical deduction (Schubotz et al., 2016b ). Logical deduction is partially tackled by Alexeeva et al. ( 2020 ), which is based on an open-domain causal IE system (Sharp et al., 2019 ) with Odin grammar (Valenzuela-Escárcega et al., 2016 ), where temporal logic is used to obtain intervals referred to by pre-identified time expressions (Sharp et al., 2019 ). We assume the issues with superscript identifiers (such as Einstein notation, etc. ) from Schubotz et al. ( 2016b ) carry over into Schubotz et al. ( 2017 ). The rule-based approach proposed by Alexeeva et al. ( 2020 ) attempts to account for such notation (known as wildcards in formula retrieval). They propose that future methods should combine grammar with a learning framework, extend rule sets to account for coordinate constructions, and create well-annotated training data using tools such as PDFAlign and others (Asakura et al., 2021 ).

Formula Retrieval Limitations.

Zhong and Zanibbi (2019) propose supporting query expansion of math synonyms to improve recall, and note that Approach0 does not support wildcard queries. Zhong et al. (2020) later provide basic support for wildcards. Tangent-CFT is likewise not evaluated on wildcard queries, and its authors suggest extending the test selection to include more diverse formulae, particularly those that are not present as exact matches. They propose integrating nearby text into learned embeddings. MathBERT (Peng et al., 2021) performs such integration, but does not learn n-gram embeddings. MathBERT is evaluated on non-wildcard queries only.
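As an illustration of what a wildcard query involves, here is a minimal sketch that matches a query tree containing wildcards against a formula tree. It is not the algorithm used by Approach0 or Tangent-CFT; the nested-tuple formula encoding, the `?name` wildcard syntax, and the function name `match` are assumptions made for this example.

```python
def match(query, formula, bindings=None):
    """Match a query tree against a formula tree.

    Trees are strings (leaves) or tuples of the form (operator, child, ...).
    A leaf starting with "?" is a wildcard that matches any subtree, and
    must match the same subtree everywhere it recurs.
    Returns a dict of wildcard bindings, or None if there is no match.
    """
    bindings = {} if bindings is None else bindings
    if isinstance(query, str) and query.startswith("?"):
        if query in bindings:
            return bindings if bindings[query] == formula else None
        extended = dict(bindings)
        extended[query] = formula
        return extended
    if isinstance(query, str) or isinstance(formula, str):
        return bindings if query == formula else None
    if len(query) != len(formula):
        return None
    for q_child, f_child in zip(query, formula):
        bindings = match(q_child, f_child, bindings)
        if bindings is None:
            return None
    return bindings

# Query: ?a^2 + ?b^2, formula: x^2 + (y + 1)^2
query = ("+", ("^", "?a", "2"), ("^", "?b", "2"))
formula = ("+", ("^", "x", "2"), ("^", ("+", "y", "1"), "2"))
result = match(query, formula)
```

This sketch is order-sensitive and ignores commutativity, so `?a + 1` would not match `1 + x`. A retrieval system would also need indexing and ranking over partial matches, which are out of scope here.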

Informal Premise Selection Limitations.

Limitations involve a lack of structural consideration of formulae and limited variable typing abilities. Ferreira and Freitas (2020b) note that the graph-based approach to premise selection as link prediction struggles to encode mathematical statements which are mostly formulae, and suggest the inclusion of structural embeddings (e.g., MathBERT [Peng et al., 2021]) and training BERT on a mathematical corpus. They also describe value in formulating sophisticated heuristics for navigating the premises graph. Later, using a Siamese network architecture (Ferreira and Freitas, 2021) reliant on dual-layer word/expression self-attention and a BiLSTM (STAR), the authors demonstrate that STAR does not appropriately encode the semantics of variables. They suggest that variable typing and representation are a fundamental component of encoding mathematical statements. Han et al. (2021) plan to explore the effect of varying pre-training components, testing zero-shot performance without contrastive fine-tuning, and unsupervised retrieval. Coavoux and Cohen (2021) propose a statement-proof matching task akin to informal premise selection, with a solution reliant on a self-attentive encoder and a bilinear similarity function. The authors note model confusion due to proofs introducing new concepts and variables rather than referring to existing concepts.
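For readers unfamiliar with bilinear similarity, the sketch below scores a statement vector u against candidate proof vectors v as u^T W v and picks the highest-scoring candidate. It only illustrates the general form of a bilinear scoring function and is not the model of Coavoux and Cohen (2021); the vectors, the matrix W, and the function names are invented, and in practice the encodings and W would be learned.

```python
def bilinear_similarity(u, W, v):
    """Return u^T W v for vector u (length m), matrix W (m x n), vector v (length n)."""
    if len(W) != len(u) or any(len(row) != len(v) for row in W):
        raise ValueError("shape mismatch between u, W and v")
    return sum(
        u_i * sum(w_ij * v_j for w_ij, v_j in zip(row, v))
        for u_i, row in zip(u, W)
    )

def best_match(statement_vec, proof_vecs, W):
    """Return (index of best-scoring proof, list of all scores)."""
    scores = [bilinear_similarity(statement_vec, W, p) for p in proof_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy encodings (assumed, not learned).
statement = [1.0, 0.0]
W = [[2.0, 0.0],
     [0.0, 1.0]]
proofs = [[0.5, 1.0], [1.0, 0.0], [0.0, 3.0]]

best, scores = best_match(statement, proofs, W)
```

Because W need not be symmetric or square, a bilinear form lets statements and proofs live in different embedding spaces, unlike a plain dot product.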

Math Word Problem Limitations.

For Graph2Tree-Z, Zhang et al. (2020) suggest considering more complex relations between quantities and language, and introducing heuristics to improve solution expression generation from the tree-based decoder. For EPT, Kim et al. (2020) find that the probability of errors related to fragmentation issues increases exponentially with the number of unknowns, and propose generalizing EPT to other MWP datasets. For HGEN, Zhang et al. (2022b) note three areas of future improvement: combining models into a unified framework through ensembling multiple encoders (similar to Ferreira and Freitas, 2021); integrating external knowledge sources, e.g., HowNet (Dong and Dong, 2003) and Cilin (Hong-Minh and Smith, 2008); and real-world dataset development for unsupervised or weakly supervised approaches (Qin et al., 2020).

Informal Theorem Proving Limitations.

Wang et al. (2020) suggest the development of high-quality datasets for evaluating translation models, including structural formula representations, and jointly embedding multiple proof assistant libraries to increase formal dataset size. Szegedy (2020) argues that reasoning systems based on self-driven exploration without informal communication abilities would suffer usage and evaluation difficulties. Wu et al. (2022) note limitations with text window size and difficulty storing large formal theories with current language models. After proposing the NaturalProofs dataset, Welleck et al. (2021a) characterize error types for the full-proof generation and next-step suggestion tasks, noting issues with: (1) hallucinated references, meaning the reference does not occur in NaturalProofs; (2) non-ground-truth references, meaning the reference does not occur in the ground-truth proof; (3) undefined terms; (4) improper or irrelevant statements, meaning a statement that is mathematically invalid (e.g., 2/3 ∈ ℤ) or irrelevant to the proof; and (5) statements that do not follow logically from the preceding statements. Dealing with research-level physics, Meadows and Freitas (2021) note the significant cost of semi-automated formalization, which requires detailed expert-level manual intervention. They also call for a set of well-defined computer algebra operations such that robust mathematical exploration can be guided in a goal-based setting.
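The first two error types above differ only in which set the reference is missing from, which can be sketched as a simple set-membership check. The code below is an illustration of that distinction and not the evaluation code of Welleck et al. (2021a); the function name, the label strings, and the example theorem names (including the deliberately made-up one) are assumptions.

```python
def classify_references(predicted, corpus, ground_truth):
    """Label each predicted reference.

    "hallucinated":     the reference does not occur in the corpus at all.
    "non-ground-truth": it occurs in the corpus but not in the ground-truth proof.
    "correct":          it occurs in the ground-truth proof.
    """
    labels = {}
    for ref in predicted:
        if ref not in corpus:
            labels[ref] = "hallucinated"
        elif ref not in ground_truth:
            labels[ref] = "non-ground-truth"
        else:
            labels[ref] = "correct"
    return labels

# Toy stand-ins for the reference corpus and one ground-truth proof.
corpus = {"Pythagoras's Theorem", "Triangle Inequality",
          "Cauchy-Schwarz Inequality"}
ground_truth = {"Triangle Inequality"}
predicted = ["Triangle Inequality", "Cauchy-Schwarz Inequality",
             "Lemma of Imaginary Bounds"]  # last name is deliberately made up

labels = classify_references(predicted, corpus, ground_truth)
```

The remaining error types (undefined terms, improper or irrelevant statements, non-sequiturs) concern the mathematical content of the generated text and cannot be detected by membership checks of this kind.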

Categorisation of approaches related to identifier-definition extraction. Shorthand notation such as (K 2012) and (J 2021) refers to the references in the first four boxes, i.e., (Kristianto 2012) and (Jo 2021). The first four boxes are task variations; arrows then point to other categories that may group approaches. For example, (Stathopoulos 2018) is Variable Typing, considers a Classification task, involves a large machine Learning element, uses a BiLSTM, learns Vector representations of input text without pretraining, and relies on information outside of the instance text (Extra-doc), which is a Types Dictionary.

Categorisation of approaches in formula retrieval. The number at the bottom right of boxes refers to their respective Bpref score (Peng et al., 2021).

Categorisation of approaches in math word problem solving.

  • Online ISSN 2307-387X

A product of The MIT Press
