The Data Problem Underneath Every AI Initiative
Part of the Phase II — Understanding series
By Michael E. Ruiz
Every AI initiative I have observed runs into the same obstacle, usually six to eight weeks in, and usually to the same expression of surprise from the people who commissioned it. The models do not work the way the demos suggested. Not because the technology is deficient, but because the data feeding the models is not what the organization assumed it was. Data quality, structure, provenance, and lineage are chronic problems in enterprise environments, and acute ones in any context where a model is expected to learn from or reason over that data.
The AI industry has done the enterprise world a disservice by consistently positioning AI readiness as a technology procurement decision. Get the right platform, connect your data sources, deploy the model, measure results. The sequencing implies that data preparation is a preprocessing step, a one-time activity before the real work begins.
In practice, data preparation is the work. Model training, fine-tuning, and prompt engineering take days. Resolving the data problems that undermine model performance takes months — and that timeline reflects organizational and governance complexity rather than technical complexity.
The most common data problem is not what organizations expect. They expect missing data, dirty data, and mislabeled records. Those problems exist, but they are tractable. The harder problem is semantic inconsistency: the same business concept represented differently across systems that have never been required to agree on a definition. What is a customer in the CRM is not the same object as a customer in the ERP, which is not the same as a customer in the billing system. Each system has its own schema, its own business rules, its own history of exception handling and manual override. A model trained on any one of these gets a partial and potentially misleading picture of the concept it is supposed to reason about.
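To make the point concrete, here is a minimal sketch of what semantic inconsistency looks like in practice. The three record shapes and the matching heuristic are hypothetical, not drawn from any real system; the point is that each system's "customer" is a different object, and any reconciliation rests on fragile assumptions.

```python
# Hypothetical records: three systems each hold a "customer," but the
# concept does not line up across them.

# CRM: a customer is a contact person at an account.
crm_customer = {"contact_id": "C-1001", "name": "Dana Reyes", "account": "Acme Corp"}

# ERP: a customer is a legal entity that can receive shipments.
erp_customer = {"customer_no": 553, "legal_name": "Acme Corporation Ltd.",
                "sites": ["PLANT-01", "PLANT-02"]}

# Billing: a customer is whoever gets invoiced -- possibly a parent company.
billing_customer = {"payer_id": "P-77", "bill_to": "Acme Holdings",
                    "child_accounts": ["Acme Corp"]}

def looks_like_same_entity(crm, erp, billing):
    """Name-overlap matching: the kind of heuristic that works in a demo
    and silently fails at scale (renames, abbreviations, subsidiaries)."""
    return (
        crm["account"] in erp["legal_name"]          # substring match: fragile
        or crm["account"] in billing["child_accounts"]
    )

print(looks_like_same_entity(crm_customer, erp_customer, billing_customer))
```

The heuristic happens to return `True` here, but only because the hypothetical names cooperate. A model trained on any one of these records inherits that system's partial definition, which is exactly the partial and misleading picture described above.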
In operational environments, including manufacturing, energy, and utilities, the data problem has additional dimensions. Process historian data is timestamped to millisecond precision, but the metadata describing what each tag represents is often in a separate system, maintained manually, and inconsistently populated. Event logs from different control systems use different timestamp formats, different alarm codes, and different unit conventions. Integrating that data into a unified dataset for analysis is not primarily an engineering problem. It is a knowledge problem: you need someone who understands both the process and the data structure to make sense of the relationships. That person is rarely a data scientist.
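A sketch of what that integration work involves, under invented assumptions: two control systems emit events with different timestamp formats, different alarm vocabularies, and different engineering units. The field names, formats, and the code-to-alarm mapping are all hypothetical; note that the mapping table itself is the knowledge problem, because only someone who understands both processes can assert that the entries are equivalent.

```python
from datetime import datetime, timezone

# Hypothetical event records from two control systems.
system_a_event = {"ts": "2024-03-01T14:05:09.250Z", "alarm": "HI-HI", "temp_degF": 401.0}
system_b_event = {"ts": "01/03/2024 14:05:09", "code": 203, "temp_degC": 205.1}

# This mapping encodes process knowledge, not engineering: someone must
# assert that system B's code 203 means the same thing as system A's HI-HI.
B_ALARM_CODES = {203: "HI-HI"}

def normalize_a(event):
    """Normalize a system A event: ISO 8601 timestamps, Fahrenheit."""
    return {
        "ts": datetime.strptime(event["ts"], "%Y-%m-%dT%H:%M:%S.%f%z"),
        "alarm": event["alarm"],
        "temp_c": (event["temp_degF"] - 32.0) * 5.0 / 9.0,
    }

def normalize_b(event):
    """Normalize a system B event: day-first timestamps with no zone.
    Assuming UTC here -- an assumption to confirm with the site, not guess."""
    ts = datetime.strptime(event["ts"], "%d/%m/%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"ts": ts, "alarm": B_ALARM_CODES[event["code"]], "temp_c": event["temp_degC"]}

a, b = normalize_a(system_a_event), normalize_b(system_b_event)
print(a["alarm"] == b["alarm"], abs((a["ts"] - b["ts"]).total_seconds()))
```

Even in this toy version, every line of the normalizers embeds a judgment call (is the day-first format really day-first? is the unlabeled timestamp really UTC?) that a data scientist cannot make alone.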
None of this means AI projects in enterprise and industrial environments are doomed. It means that organizations that invest in data governance as a foundational capability, with standardized definitions, documented lineage, and stewardship accountability, will get more value from AI investments faster than organizations that treat data preparation as a project-level task. The irony is that the discipline required to make AI work well is the same discipline that makes other analytics investments work well. AI simply makes the cost of skipping that discipline visible more quickly, because the model will tell you, through its outputs, that the inputs were not what you thought they were.
The practical implication for leaders is that AI readiness assessments should begin with data, not technology. Not just data volume, because any large enterprise has enough data. What matters is data quality, data accessibility, data governance maturity, and data ownership clarity. An organization that cannot answer basic questions about where its data comes from, who is responsible for its accuracy, and how it moves between systems is not ready for production AI deployment, regardless of what the vendor demo suggested. Getting to ready is not a technical project. It is a governance project.