Useful, Until It Isn't: LLMs for Data Science

Last week, I tested AI against a real data science use case to separate hype from what actually works. Full transparency: I’m ramping up on AI like everyone else, and a single week of exploration means some bias is baked in. But here’s what I found.

1. Know your AI LEGO box

Setting up your workflow is half the process. Before you write a single prompt, you need to understand the pieces at your disposal and how they fit together. There are three decisions to make upfront. First, how will you interact with the AI: through a terminal tool like Claude Code, or inside an IDE like VS Code? Both support extensions and integrations, but they create very different working styles. Second, which LLM provider will you use? Pick one (Anthropic’s Claude, OpenAI, or Google’s Gemini) and stay with it long enough to learn its quirks; you can always switch later. Start with a mid-tier model rather than jumping straight to the most powerful option. Third, which MCP servers do you need? Set up the Model Context Protocol (MCP) integrations that match where your work ends up. For example, if you’re producing work that goes into Confluence, set up a Confluence MCP server. These integrations are what allow the agent to take meaningful actions in your actual environment. Understanding these three pieces before you start is what separates a productive session from a frustrating one.

2. MCP is a game changer

The difference between running an AI agent with and without MCP access is stark. For example, in a SQL workflow, running the agent without access to a data dictionary produced poor results: the model was guessing what columns meant and hallucinating frequently. Once a clear data dictionary (column names, definitions, and context) was integrated through MCP, the quality improved dramatically and hallucinations dropped significantly. This matters beyond output quality. First impressions of AI are powerful: if your initial interaction produces unreliable results, you may write off the tool entirely. Getting MCP set up correctly before you start means your first experience is the right one.
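To make "column names, definitions, and context" concrete, here is a minimal sketch of the kind of content a data dictionary integration might hand to the agent. It is not MCP itself and not the setup I used: the `orders` table, its columns, and the helper names `render_dictionary` and `unknown_columns` are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str


# Hypothetical table and columns, invented purely for illustration.
DATA_DICTIONARY = {
    "orders": [
        Column("order_id", "INTEGER", "Unique identifier for each order."),
        Column("customer_id", "INTEGER", "Customer who placed the order."),
        Column("order_ts", "TIMESTAMP", "UTC time the order was placed."),
        Column("net_amount", "NUMERIC", "Order value after discounts, before tax."),
    ],
}


def render_dictionary(dictionary):
    """Render the dictionary as plain text that can be given to a model as context."""
    lines = []
    for table, columns in dictionary.items():
        lines.append(f"Table: {table}")
        for col in columns:
            lines.append(f"  - {col.name} ({col.dtype}): {col.description}")
    return "\n".join(lines)


def unknown_columns(table, referenced, dictionary):
    """Return referenced column names that the dictionary does not define."""
    known = {col.name for col in dictionary.get(table, [])}
    return sorted(set(referenced) - known)


if __name__ == "__main__":
    print(render_dictionary(DATA_DICTIONARY))
    # A column name the model made up shows up as unknown.
    print(unknown_columns("orders", ["order_id", "total_price"], DATA_DICTIONARY))
```

The second helper is a cheap guard: if generated SQL references a column the dictionary does not define, that is a likely hallucination worth flagging before the query runs.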

3. Interactive mode until you master the failure modes

There are two ways to work with an AI agent: fully autonomous, where you hand the wheel over to the agent entirely, or interactive, where you remain the architect and the AI plays the role of a junior analyst. For complex work (in my case, building a feature engineering pipeline), interactive mode was significantly more reliable. Running it autonomously would not have caught the cases where code failed silently, a particularly dangerous failure mode because nothing breaks visibly but the output is wrong. There were also SQL and architectural inefficiencies in the pipeline that required human judgment to spot and fix. The takeaway is not that autonomous mode is bad. It’s that you need to understand the failure modes of your specific use case before you hand over control. Interactive mode is how you learn those failure modes.
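Silent failures are easier to catch when a few invariants are checked after every agent-written step. The sketch below is one possible set of checks, not the ones from my pipeline: it assumes features arrive as a list of dicts keyed by a hypothetical `customer_id`, and the 20% null threshold is an arbitrary placeholder.

```python
import math


def _is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_features(input_keys, feature_rows, key="customer_id", max_null_rate=0.2):
    """Return a list of human-readable problems; an empty list means no check fired."""
    problems = []
    output_keys = [row[key] for row in feature_rows]

    # A join that fans out produces the same key more than once.
    if len(output_keys) != len(set(output_keys)):
        problems.append("duplicate keys in output (possible join fan-out)")

    # A filter or inner join that is too strict drops entities without raising.
    missing = set(input_keys) - set(output_keys)
    if missing:
        problems.append(f"{len(missing)} input keys missing from output")

    feature_names = sorted({name for row in feature_rows for name in row} - {key})
    for name in feature_names:
        values = [row.get(name) for row in feature_rows]
        nulls = sum(1 for v in values if _is_null(v))
        if feature_rows and nulls / len(feature_rows) > max_null_rate:
            rate = nulls / len(feature_rows)
            problems.append(f"{name}: null rate {rate:.0%} exceeds threshold")
        # A feature with at most one distinct value carries no signal.
        distinct = {v for v in values if not _is_null(v)}
        if len(feature_rows) > 1 and len(distinct) <= 1:
            problems.append(f"{name}: constant or empty feature")
    return problems


if __name__ == "__main__":
    rows = [
        {"customer_id": 1, "spend": 10.0},
        {"customer_id": 1, "spend": 10.0},
        {"customer_id": 2, "spend": None},
    ]
    for problem in check_features([1, 2, 3], rows):
        print(problem)
```

None of these checks proves the features are right. They only turn a few of the quiet ways a pipeline goes wrong into something visible, which is what makes them worth running after each step.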

4. Human in the Loop should remain the default

One of the genuine wins here is that AI handles the tedious parts of data science work: the repetitive setup, the boilerplate code, and the mechanical steps of feature engineering that consume time but don’t require strategic thinking. This frees you to focus on what matters: framing the right questions, structuring the problem, making good architectural decisions and, in my case, coming up with new features with predictive ability. That’s a real productivity gain. But it does not come free: based on what I’ve seen, AI gets you around eighty to ninety percent of the way. The last mile still requires human judgment. There are mistakes that need catching, outputs that need validating, and domain-specific calls that only a practitioner can make. Human in the loop is not a limitation to be engineered away. It’s the correct default until you have enough confidence in a specific workflow to reduce that oversight.

5. A rigorous eval tailored to your use case is essential

There is no universal measurement framework for AI in data science (at least as of today). Each workflow is different, each use case has its own definition of correctness, and each has different tolerances for error. Practically, that means that before you deploy AI on a workflow, you need to build a rubric for evaluating its output. What does good look like? What does a silent failure look like? How do you measure improvement over time? Without this, you’re flying blind. With it, you have a basis for iterating, improving prompts, and deciding when a workflow is ready to operate with more autonomy.
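A rubric does not need to be elaborate to be useful. Here is a minimal sketch of one as weighted pass/fail criteria, assuming the output being graded is a SQL string. The criteria, weights, and column names are hypothetical placeholders; a real rubric would encode your own definition of correctness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]


def score(output, rubric):
    """Return (weighted score between 0 and 1, per-criterion pass/fail results)."""
    results = {c.name: c.check(output) for c in rubric}
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if results[c.name])
    return earned / total, results


# Hypothetical criteria for agent-written SQL, invented for illustration.
SQL_RUBRIC = [
    Criterion("no_select_star", 1.0, lambda sql: "select *" not in sql.lower()),
    Criterion("filters_on_date", 2.0, lambda sql: "order_ts" in sql.lower()),
    Criterion("has_group_by", 1.0, lambda sql: "group by" in sql.lower()),
]

if __name__ == "__main__":
    candidate = "SELECT * FROM orders WHERE order_ts >= '2024-01-01'"
    overall, details = score(candidate, SQL_RUBRIC)
    print(overall, details)
```

Scoring the same fixed set of tasks after each prompt or setup change gives you a number to compare over time. String checks like these are crude, so treat them as a starting point and add checks on actual query results as the rubric matures.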

Where this goes

I’m confident AI will boost productivity for data science work, with two caveats. First, data scientists and engineers need to understand the failure modes and limitations of AI (as with any new tech) and build customized workflows with guardrails to capture this productivity gain without losing output quality, which will take time. Second, there is still a last-mile gap: human validation and insight remain essential to make sure the output is correct. It’s not a plug-and-play solution, but it’s powerful when you respect those constraints. More to come.