Author: Denis Avetisyan
A new framework combines human intuition with the power of large language models to create better features for tabular data.

This research details a human-LLM collaborative system for feature engineering that leverages utility modeling, uncertainty estimation, and selective human feedback.
While large language models (LLMs) show promise in automating feature engineering for tabular data, current approaches often treat them as uncalibrated black-box optimizers, leading to inefficient exploration of potentially useful transformations. This paper, ‘Human-LLM Collaborative Feature Engineering for Tabular Data’, introduces a framework that decouples feature proposal from selection, explicitly modeling the utility and uncertainty of each operation and strategically incorporating human expert feedback. Our evaluations demonstrate improved performance across diverse tabular datasets and reduced user cognitive load, suggesting a more effective and intuitive approach to feature engineering. Could this collaborative paradigm unlock even greater potential in leveraging both LLM reasoning and human expertise for complex data analysis tasks?
Deconstructing the Tabular Labyrinth: The Challenge of Feature Engineering
The construction of effective features from raw tabular data frequently presents a substantial obstacle in machine learning workflows. Historically, this process has demanded considerable manual intervention, requiring data scientists to leverage both statistical knowledge and a deep understanding of the specific domain to identify and create variables that enhance model performance. This reliance on human expertise is not only time-consuming and resource-intensive but also introduces a potential for bias and limits scalability. The iterative nature of feature engineering – involving hypothesis generation, implementation, evaluation, and refinement – often becomes a central bottleneck, delaying model deployment and hindering the ability to rapidly adapt to changing data landscapes. Consequently, the need for more efficient and automated approaches to feature creation remains a critical challenge in the field.
While automated machine learning (AutoML) systems excel at algorithm selection and hyperparameter tuning, they frequently struggle with the intricacies of tabular data. These systems often rely on generalized feature transformations that fail to capture the subtle, non-linear relationships and interactions present within specific datasets. Consequently, AutoML may overlook crucial feature combinations or apply inappropriate transformations, leading to suboptimal model performance. The inherent challenge lies in the fact that tabular data, unlike images or text, rarely presents readily apparent patterns; meaningful features often require a deep understanding of the underlying data-generating process, a capability that current AutoML solutions often lack. This limitation highlights the continuing need for more sophisticated feature engineering techniques that go beyond simple, pre-defined operations and can adapt to the unique characteristics of each tabular dataset.
Realizing the full potential of tabular data in machine learning demands a shift beyond conventional feature engineering techniques. While manual approaches are time-consuming and reliant on specialized knowledge, current automated methods frequently struggle with the subtle interactions inherent in structured datasets. More sophisticated strategies are needed – those capable of dynamically identifying and constructing features that capture non-linear relationships, complex dependencies, and previously hidden patterns. This necessitates adaptive algorithms that not only combine existing features but also generate entirely new ones based on data characteristics, potentially leveraging techniques like genetic programming or deep feature synthesis to overcome the limitations of static, pre-defined feature sets and truly unlock predictive power.

The Language of Data: LLM-Powered Feature Engineering
Traditional feature engineering relies on manual creation or automated transformations based on pre-defined rules, both of which struggle with complex datasets and require significant domain expertise. LLM-Powered Feature Engineering bypasses these limitations by employing the generative capabilities of large language models to automatically synthesize new features from existing tabular data. This approach allows for the creation of features beyond simple mathematical combinations or established heuristics, effectively exploring a broader feature space without explicit human intervention. The method’s capacity to generate features based on semantic understanding of the data addresses the inherent scalability and adaptability issues present in manual and rule-based systems.
LLM-powered feature engineering automates the creation of new predictive variables from existing datasets. This process moves beyond pre-defined transformations by leveraging the generative capabilities of large language models to identify and construct features that may not be apparent through traditional methods. The automated generation of features allows for exploration of a wider feature space, potentially revealing previously hidden relationships within the data and resulting in improvements to machine learning model performance. This approach is particularly beneficial for tabular data, where complex interactions between variables can be difficult to identify manually.
Leveraging large language models (LLMs) for feature engineering enables the creation of complex features from tabular datasets, exceeding the capabilities of traditional methods like one-hot encoding or polynomial expansion. This is achieved by prompting LLMs to generate new feature representations based on the relationships within the existing data. Internal testing demonstrates that when LLM-generated features are integrated with the XGBoost algorithm, an average error reduction of up to 11.23% can be observed across a range of benchmark datasets, indicating a significant potential for performance gains in predictive modeling tasks.
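As a concrete illustration of this pattern, the minimal Python sketch below scores a single candidate feature against an XGBoost baseline via cross-validation. The proposal string stands in for an LLM's output, and the dataset and helper names are illustrative, not the paper's actual pipeline.

```python
# Minimal sketch: scoring one LLM-proposed feature with XGBoost (illustrative only).
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

def cv_rmse(features: pd.DataFrame) -> float:
    """Cross-validated RMSE of an XGBoost model on a given feature set."""
    model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
    scores = cross_val_score(model, features, y,
                             scoring="neg_root_mean_squared_error", cv=5)
    return -scores.mean()

baseline = cv_rmse(X)

# In the framework, an LLM would propose transformations like this one;
# here we hard-code a plausible proposal for illustration.
proposal = "bmi * bp"  # hypothetical LLM output
X_new = X.assign(llm_feat=X.eval(proposal))

print(f"baseline RMSE: {baseline:.2f}, with feature: {cv_rmse(X_new):.2f}")
```

In the full framework, many such proposals would be generated and scored, with only those that improve the utility estimate retained.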
The Human Algorithm: Collaborative Feature Refinement
Combining human expertise with Large Language Model (LLM)-driven feature generation presents a robust methodology for feature engineering. This collaborative framework leverages LLMs to automatically generate a diverse set of potential features from raw data, which are then evaluated and refined by human domain experts. The experts provide critical feedback, validating feature relevance, correcting inaccuracies, and incorporating nuanced domain knowledge that the LLM may lack. This iterative process of LLM generation and human refinement allows for efficient exploration of the feature space and the creation of optimized feature sets tailored to specific predictive modeling tasks. The resulting features are demonstrably more effective than those derived from either fully automated or purely manual approaches.
Utility modeling and Bayesian optimization are integral to optimizing feature sets generated through human-LLM collaboration. Utility modeling assigns a quantitative value to each feature based on its contribution to model performance, typically measured through cross-validation. This allows for ranking and prioritization of features, guiding the selection process. Bayesian optimization then efficiently explores the vast feature space by iteratively proposing new feature combinations based on a probabilistic model, specifically a Gaussian process, that balances exploration of untested combinations with exploitation of previously successful ones. This method minimizes the number of model training iterations required to identify high-performing feature subsets, significantly reducing computational cost compared to exhaustive search or random sampling, and ensures the most valuable features are identified for subsequent refinement.
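A minimal sketch of this selection loop follows, assuming a Gaussian-process surrogate over binary feature masks and an expected-improvement acquisition function. The toy utility function stands in for cross-validated model performance; all names and constants are illustrative, not the paper's actual objective.

```python
# Sketch: Bayesian optimization over feature subsets with a GP surrogate.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
n_features = 8

def utility(mask: np.ndarray) -> float:
    """Stand-in for cross-validated performance on a feature subset."""
    weights = np.linspace(1.0, 0.1, n_features)  # hypothetical relevance per feature
    return float(mask @ weights - 0.3 * mask.sum() + rng.normal(0, 0.05))

# Seed the surrogate with a few random masks.
masks = rng.integers(0, 2, size=(5, n_features)).astype(float)
utils = np.array([utility(m) for m in masks])

gp = GaussianProcessRegressor(alpha=1e-3, normalize_y=True)
for _ in range(20):
    gp.fit(masks, utils)
    candidates = rng.integers(0, 2, size=(64, n_features)).astype(float)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = utils.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    pick = candidates[np.argmax(ei)]                       # explore/exploit trade-off
    masks = np.vstack([masks, pick])
    utils = np.append(utils, utility(pick))

print("best subset:", masks[np.argmax(utils)].astype(int), "utility:", round(utils.max(), 3))
```

The acquisition step is where the exploration/exploitation balance described above is made concrete: high predicted utility and high surrogate uncertainty both raise a candidate's expected improvement.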
The integration of human expertise into the LLM-driven feature engineering process allows for iterative refinement of automatically generated features. Experts can evaluate LLM outputs and provide targeted feedback, correcting inaccuracies or biases and ensuring generated features are contextually relevant to the specific domain. This feedback loop directly improves model performance; testing demonstrated an average error reduction of 8.96% when utilizing a Multilayer Perceptron (MLP) model with features generated and refined through this collaborative approach.
Decoding the Signal: Validating and Interpreting Results
Performance Trajectory Analysis offers a dynamic lens through which to view the iterative process of feature engineering. By meticulously tracking how modifications to features impact model performance over time, researchers gain crucial insights into which approaches yield the most substantial improvements and which prove detrimental. This isn’t simply about identifying the ‘best’ features at a single point in time, but understanding how features evolve in their effectiveness as the model matures. Such analysis reveals patterns – perhaps an initial boost from a feature followed by diminishing returns, or synergistic effects between features discovered through sequential refinement. Consequently, Performance Trajectory Analysis facilitates a more informed and efficient optimization process, enabling the systematic development of robust and high-performing predictive models and a deeper comprehension of the underlying data itself.
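In practice, trajectory analysis can be as simple as logging the validation score after each engineering step. The sketch below illustrates the bookkeeping with hypothetical step names and a Ridge baseline; none of it reflects the paper's actual experiments.

```python
# Sketch: logging a performance trajectory across feature-engineering steps.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

# Hypothetical sequence of feature-engineering steps.
steps = {
    "baseline": X,
    "+bmi*bp": X.assign(inter=X.eval("bmi * bp")),
    "+bmi^2": X.assign(inter=X.eval("bmi * bp"), sq=X["bmi"] ** 2),
}

trajectory = []
for name, features in steps.items():
    score = cross_val_score(Ridge(), features, y, scoring="r2", cv=5).mean()
    trajectory.append((name, round(score, 4)))

print(trajectory)  # inspect how each step moved the score: gain, plateau, or regression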
Feature selection represents a crucial step in building robust and understandable predictive models. By strategically identifying and retaining only the most pertinent variables, these techniques address the common pitfalls of incorporating irrelevant or redundant data. This process not only simplifies the model, enhancing its interpretability for stakeholders, but also actively combats overfitting – a scenario where the model performs exceptionally well on training data but falters when presented with new, unseen data. Through methods like recursive feature elimination or techniques based on feature importance scores, the model focuses on the signals most indicative of the target variable, resulting in improved generalization performance and a more reliable predictive capacity. Ultimately, a streamlined feature set fosters a clearer understanding of the underlying data relationships and promotes the development of more effective and trustworthy machine learning applications.
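For instance, recursive feature elimination, one of the methods mentioned above, can be sketched in a few lines with scikit-learn; the toy dataset and estimator choice are illustrative.

```python
# Sketch: recursive feature elimination with scikit-learn's RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Repeatedly fit the model and drop the weakest feature until 5 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print("kept feature indices:", [i for i, keep in enumerate(selector.support_) if keep])
```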
Shapley Additive exPlanations, or SHAP values, offer a unified measure of feature importance by quantifying each feature’s contribution to a model’s prediction, moving beyond simple rankings to reveal nuanced impacts on individual outcomes. This approach doesn’t just identify which features matter, but how they matter – whether a feature consistently pushes predictions higher or lower, or its effect varies depending on the data instance. Recent user studies investigating this method in collaborative data science workflows demonstrate a significant reduction in cognitive load for analysts, allowing them to more rapidly understand model behavior and build trust in its predictions. By providing localized explanations for each prediction, SHAP values facilitate a more transparent and interpretable machine learning process, ultimately fostering better decision-making and accelerating the pace of discovery.
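A minimal sketch of computing per-prediction SHAP attributions follows, assuming the `shap` package and a fitted gradient-boosted model; this is a generic illustration, not the paper's exact setup.

```python
# Sketch: per-prediction feature attributions with SHAP for a tree model.
import shap
from sklearn.datasets import load_diabetes
from xgboost import XGBRegressor

data = load_diabetes(as_frame=True)
model = XGBRegressor(n_estimators=200, max_depth=3).fit(data.data, data.target)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)  # one attribution row per prediction

# Positive values push this prediction above the baseline, negative below.
print(dict(zip(data.data.columns, shap_values[0].round(2))))
```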
![Participants using the algorithm (Alg) demonstrated significantly improved feature engineering performance and user experience for flight satisfaction prediction compared to the control group, with statistically significant differences observed at the $p < 0.1$, $p < 0.05$, and $p < 0.01$ levels, though completion times were comparable.](https://arxiv.org/html/2601.21060v1/iclr2026/figures/figures/perception.png)
The study demonstrates a compelling need to challenge established methodologies, much like John Maynard Keynes observed, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” This research actively breaks from traditional, automated feature engineering pipelines by introducing selective human elicitation. Rather than passively accepting machine-generated features, the framework deliberately disrupts the process, leveraging human insight at key moments. This selective approach, mirroring a reverse-engineering process, isn’t about finding the best feature immediately, but about intelligently probing the solution space, effectively dismantling assumptions to build a more robust model, a process the paper explicitly frames through utility modeling and uncertainty estimation.
Beyond the Feature: Where This Leads
This work establishes a functional loop: human intuition guiding large language model generation, with utility and uncertainty serving as the feedback mechanism. However, the true exploit lies not in optimizing the process of feature engineering, but in questioning the fundamental assumption that tabular data represents a stable, knowable system. Every exploit starts with a question, not with intent. The current framework treats features as discrete entities to be added or discarded; future iterations should explore features as dynamic, interacting components, perhaps even allowing LLMs to propose changes to the data itself – to induce, rather than simply reflect, underlying patterns.
The selective elicitation strategy, while reducing cognitive load, implicitly assumes a human can accurately assess feature relevance with limited information. This is a comfortable, but potentially fragile, assumption. More robust approaches might involve modeling human error – quantifying how and why humans misjudge feature utility – and integrating that uncertainty into the Bayesian optimization process. The system doesn’t merely need better feedback; it needs to understand the flaws in the feedback source.
Ultimately, the success of human-LLM collaboration in this domain hinges on moving beyond incremental improvements to existing techniques. The interesting problems aren’t about making current feature engineering faster or cheaper; they’re about discovering whether the very notion of “features” is a useful abstraction when dealing with complex, evolving systems. The goal isn’t to engineer better tables, but to reverse-engineer the reality they attempt to represent.
Original article: https://arxiv.org/pdf/2601.21060.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/