
Why Do LLMs Fail on Tables? A Deep Dive into TATTOO's Tool-Grounded Fix

Paper at a Glance
- Paper Title: TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
- Authors: Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
- Affiliations: UIUC, Amazon, Purdue University, Stanford University
- Published in: arXiv Preprint, Oct 2025
- Link to Paper: https://arxiv.org/abs/2510.06217
The Gist of It: TL;DR
In one sentence: This paper introduces TATTOO, a tool-grounded Process Reward Model that provides reliable step-by-step supervision to Large Reasoning Models for complex tabular reasoning, addressing critical failures in table retrieval and schema interaction found in existing models.
Why It Matters: The Big Picture
Large Language Models (LLMs) are becoming incredibly adept at processing and generating text. But the real world isn’t just paragraphs; it’s filled with structured data like spreadsheets, databases, and web tables. For an LLM to be a true “data analyst” or “fact-checker,” it must be able to reason logically over this tabular data.
The community has developed “Process Reward Models” (PRMs) to help. Think of a PRM as a specialized AI coach that watches an LLM solve a problem step-by-step and gives feedback on each step, rather than just grading the final answer. This fine-grained supervision helps the LLM find the correct reasoning path.
However, there’s a huge problem: existing PRMs, trained primarily on text, are terrible coaches for tabular reasoning. They struggle with table-specific actions, leading to performance bottlenecks where even throwing more compute at the problem doesn’t help. This paper dives deep into why this happens and proposes an elegant solution.
The Core Idea: How It Works
1. The Problem They’re Solving: Why PRMs Fail on Tables
The authors first conduct a detailed error analysis to pinpoint exactly where current PRMs go wrong. They identify two critical failure points:
- Table Retrieval Failure: When an LLM needs to solve a query, its first step is often to find and extract the relevant rows and columns from a large table. The paper shows (Figure 3, left) that existing PRMs are shockingly insensitive to this step. They often give a “correct” score even if the LLM retrieves a completely random or wrong part of the table. This initial error poisons all subsequent reasoning.
- Schema Interaction Failure: Due to the “locality bias” inherent in transformer architectures, an LLM can “forget” or misinterpret the table data it retrieved several steps ago. For example, it might correctly pull a column of numbers but later miss the last value when performing a sum. Current PRMs, suffering from the same locality bias, fail to catch these long-range dependency errors.
In short, existing PRMs can’t reliably ground their supervision in the actual content of the table.
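To make these two failure modes concrete, here is a tiny invented example (the table, the column names, and the specific mistakes are illustrations for this post, not data from the paper). A text-only PRM that never inspects the table has no way to flag either step:

```python
import pandas as pd

# A toy table standing in for the full table the LLM reasons over.
df = pd.DataFrame({
    "city":       ["Lyon", "Porto", "Graz", "Bergen"],
    "population": [513_000, 231_000, 289_000, 283_000],
    "area_km2":   [47.9, 41.4, 127.6, 445.0],
})

# Failure 1: table retrieval. The query asks about population, but the
# model's first step extracts area_km2. A text-only PRM that never looks
# at df often scores this step as "correct" anyway.
retrieved = df[["city", "area_km2"]]       # wrong sub-table
needed    = df[["city", "population"]]     # what the question actually requires

# Failure 2: schema interaction. Steps later, the model "adds up the
# populations" but silently drops the last row it retrieved earlier.
claimed_total = 513_000 + 231_000 + 289_000   # model's arithmetic: 1,033,000
true_total = int(df["population"].sum())      # ground truth: 1,316,000
print(claimed_total == true_total)            # False, yet a text-only PRM
                                              # has no way to notice
```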
2. The Key Innovation: Tool-Grounded Verification
The core idea behind TATTOO (Tool-Grounded Thinking PRM) is to stop treating verification as a text-only task. Instead, TATTOO acts more like a real data scientist: it uses external tools to actively check the LLM’s work.
Instead of just reading an LLM’s claim that “the sum of the ‘sales’ column is $5,000,” TATTOO can invoke a Python interpreter to actually run df['sales'].sum() and compare the results. This “tool-grounding” provides a much more accurate and objective reward signal, moving from weak supervision to precise, verifiable feedback.
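Here is a minimal sketch of what such a tool-grounded check could look like. The function, the use of eval, and the exact comparison are illustrative assumptions for this post rather than the paper’s actual verifier code:

```python
import pandas as pd

def verify_step_with_tool(df: pd.DataFrame, code: str, claimed_value: float) -> bool:
    """Execute a candidate reasoning step as code against the real table
    and compare the tool's output with the value the LLM claimed."""
    # Run the snippet in a restricted namespace that only exposes the table.
    namespace = {"df": df}
    tool_output = eval(code, {"__builtins__": {}}, namespace)  # illustrative only
    return float(tool_output) == float(claimed_value)

df = pd.DataFrame({"sales": [1200, 1800, 950, 1050]})

# The LLM's step claims the column sums to 5,000; the tool confirms it.
print(verify_step_with_tool(df, "df['sales'].sum()", 5000))   # True

# Had the model dropped the last row and claimed 3,950, the check fails.
print(verify_step_with_tool(df, "df['sales'].sum()", 3950))   # False
```

In practice, a real verifier would sandbox the execution and fold the tool output back into its reasoning before issuing a step-level score; the point here is only the mechanic of checking a claim against the table rather than re-reading it as text.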
3. The Method, Step-by-Step
Building TATTOO is a sophisticated two-stage process, neatly illustrated in the paper’s Figure 4.
Stage 1: Supervised Fine-Tuning (SFT) with a Custom Dataset
First, the team created a large-scale (60k+ examples) training dataset specifically for this task.
- They collected reasoning trajectories from expert LLMs on various table-based tasks.
- They then generated step-by-step “verification rationales” for each trajectory.
- Crucially, they augmented these rationales with tool calls. Instead of a rationale saying “I will now manually add the numbers,” it is replaced with a Python code block and its execution output (a hypothetical example of such a training record is sketched after this list).
- They used this dataset to fine-tune a base model, teaching it the fundamental patterns of table-aware, tool-integrated verification. For example, it learns to automatically prepend the relevant sub-table to a reasoning step to avoid the schema interaction problem.
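To give a feel for what one of these tool-augmented training records might look like, here is a hypothetical example; the field names and layout are my own guesses, not the paper’s released schema:

```python
# A hypothetical SFT training record for tool-grounded step verification.
# Field names and structure are illustrative, not the paper's actual schema.
training_example = {
    "question": "What is the total revenue of the three Q2 product lines?",
    "sub_table": (
        "| product | q2_revenue |\n"
        "| A       | 2100       |\n"
        "| B       | 1750       |\n"
        "| C       | 1150       |"
    ),
    "candidate_step": "Adding the Q2 revenues gives 2100 + 1750 + 1150 = 4900.",
    "verification_rationale": (
        "The step claims the Q2 revenues sum to 4900. Rather than re-adding "
        "by hand, run the computation against the retrieved sub-table."
    ),
    "tool_call": "df['q2_revenue'].sum()",
    "tool_output": "5000",
    "step_label": "incorrect",   # tool output (5000) contradicts the claim (4900)
}
```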
Stage 2: Reinforcement Learning (RL) with Reward Shaping
SFT teaches the model the basic patterns, but RL refines its ability to use tools effectively.
- The SFT model is further trained using Reinforcement Learning.
- The reward signal is carefully shaped to encourage three things:
  1. Label Matching: Is the final verdict (correct/incorrect) right?
  2. Confidence Calibration: Is the model confident in its correct predictions?
  3. Tool-Grounding: Does the verification rationale correctly incorporate and rely on tool outputs?
This second stage pushes the model beyond just mimicking patterns to actively leveraging tools for more robust and faithful verification. A rough sketch of how such a shaped reward could be composed follows.
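As a sketch, the three components could be combined into a single scalar reward along these lines; the weights and the Brier-style calibration term are assumptions made for illustration, not the paper’s exact formulation:

```python
def shaped_reward(predicted_label: str, gold_label: str,
                  predicted_confidence: float, used_tool_output: bool,
                  w_label: float = 1.0, w_calib: float = 0.5,
                  w_tool: float = 0.5) -> float:
    """Illustrative reward shaping for RL training of the verifier.
    Combines (1) label matching, (2) confidence calibration, and
    (3) whether the rationale is grounded in an executed tool call."""
    # (1) Label matching: did the verifier's verdict agree with the gold label?
    label_match = 1.0 if predicted_label == gold_label else 0.0

    # (2) Confidence calibration: reward confidence on correct verdicts,
    # penalize confidence on wrong ones (a simple Brier-style term).
    calibration = 1.0 - (label_match - predicted_confidence) ** 2

    # (3) Tool grounding: did the rationale actually invoke a tool and
    # rely on its output when forming the verdict?
    tool_bonus = 1.0 if used_tool_output else 0.0

    return w_label * label_match + w_calib * calibration + w_tool * tool_bonus

# Example: correct verdict, well-calibrated, grounded in a tool call.
print(shaped_reward("incorrect", "incorrect", 0.9, True))  # 1.0 + 0.5*0.99 + 0.5 = 1.995
```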
Key Experimental Results
The results demonstrate that TATTOO is not only effective but also remarkably efficient.
- State-of-the-Art Performance: Across five challenging tabular reasoning benchmarks, incorporating the 8B-parameter TATTOO as a verifier improved a powerful 14B LRM’s performance by an average of 30.9%.
- Parameter Efficiency: TATTOO consistently outperformed much larger PRM baselines, including the 72B-parameter Qwen2.5-Math-PRM, showcasing up to 9x greater parameter efficiency (Table 2).
- Strong Generalizability: The improvements weren’t limited to one evaluation method. TATTOO showed consistent gains across diverse test-time scaling strategies like Best-of-N, Beam Search, and DVTS (Figure 5); a minimal sketch of how a PRM slots into Best-of-N follows this list.
- Dual-Stage Training is Key: Ablation studies (Table 3) confirmed that the RL stage is vital, providing a 10.2% performance gain over the SFT-only model. The “tool-grounding” reward component was the single most important factor in the RL training.
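For readers less familiar with test-time scaling, here is a minimal sketch of how a step-level PRM like TATTOO can be used inside Best-of-N; the min-aggregation of step scores and the function signatures are common conventions I am assuming, not necessarily the paper’s exact setup:

```python
from typing import Callable, List

def best_of_n(candidates: List[List[str]],
              score_step: Callable[[str], float]) -> List[str]:
    """Pick the candidate trajectory whose weakest step scores highest.
    `candidates` holds reasoning trajectories (each a list of steps);
    `score_step` is the PRM: it maps one step to a score in [0, 1]."""
    def trajectory_score(steps: List[str]) -> float:
        # Aggregate step-level scores; taking the minimum penalizes any
        # single bad step (one common choice among min / product / last-step).
        return min(score_step(s) for s in steps)

    return max(candidates, key=trajectory_score)

# Toy usage with a stand-in scorer (a real PRM would be a model call).
fake_prm = lambda step: 0.2 if "guess" in step else 0.9
trajectories = [
    ["retrieve sales column", "guess the total is 4900"],
    ["retrieve sales column", "run df['sales'].sum() -> 5000", "answer 5000"],
]
print(best_of_n(trajectories, fake_prm)[-1])   # "answer 5000"
```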
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Excellent Problem Diagnosis: The paper provides one of the clearest explanations to date of why general-purpose PRMs fail on tabular data, grounding its motivation in a strong empirical analysis.
- Principled and Robust Solution: TATTOO isn’t just a fine-tuned model; it’s a well-designed framework. The combination of a tool-augmented data pipeline and a dual-stage SFT+RL training paradigm is a robust and powerful approach to building specialized verifiers.
- High Efficiency and Effectiveness: The empirical results are compelling. Achieving superior performance with a significantly smaller model is a major contribution, making advanced supervision more accessible.
- Scalable Data Curation Blueprint: The authors provide a detailed recipe for creating high-quality, tool-augmented verification data, which can guide future work in building verifiers for other specialized domains.
Limitations / Open Questions
- Computational Overhead of RL: As the authors acknowledge, the RL training stage, while effective, introduces significant computational complexity and cost compared to SFT-only methods. This may be a barrier to reproducibility and adoption for teams with limited resources.
- Dependency on Tool Reliability: The framework’s performance is fundamentally tied to the correctness and robustness of its external tools. It doesn’t explore scenarios where tools might fail, return errors, or have their own limitations.
- Limited Scope: The current work focuses on semi-structured tables. Extending this tool-grounded verification approach to more complex, multi-modal data (e.g., tables containing images, charts) is a non-trivial next step.
- Potential for Bias Propagation: The data curation pipeline relies on “expert LLMs” to generate initial trajectories and rationales. This creates a risk of inheriting and amplifying any subtle biases or systematic errors present in those expert models.
Contribution Level: Significant Improvement. This paper tackles a critical and underexplored bottleneck in AI reasoning. By clearly identifying the failure modes of existing PRMs and proposing a robust, tool-grounded solution, it substantially advances the state of the art in supervising LLMs on structured data tasks. It provides both a high-performing model and a valuable framework for future research.
Conclusion: Potential Impact
TATTOO represents a major step forward in making LLMs reliable and trustworthy partners for tasks involving structured data. By teaching the “coach” (the PRM) how to use the same tools as a data scientist, this work paves the way for more accurate AI-powered data analysis, fact-checking, and question answering. The core insight—that reliable verification in specialized domains requires domain-specific tools—is a powerful one that could be extended far beyond tables to fields like scientific discovery, code generation, and formal verification.
- Title: Why Do LLMs Fail on Tables? A Deep Dive into TATTOO's Tool-Grounded Fix
- Author: Jellyfish
- Created at: 2025-10-08 19:49:41
- Updated at: 2025-10-08 11:50:14
- Link: https://makepaperseasy.com/posts/20251008194941/
- License: This work is licensed under CC BY-NC-SA 4.0.









