How MDS Broke Data Governance pt 1

Most companies today understand the immense opportunities the “Age of Data” offers, and an ecosystem of modern technologies has sprouted to help. However, building a comprehensive modern data ecosystem to extract value from these technologies is both confusing and difficult. Ironically, some technologies that simplify specific processes have made data governance and protection appear even more challenging.

Big Data and the Multiverse of Madness

“Using data to make decisions? That’s crazy talk!” was a common thought in IT back in the early 2000s. Information Technology groups didn’t fully grasp the value of data. They treated it like money in a bank—assuming that storing it perfectly in databases would automatically increase its value. But locked-up data doesn’t earn compound interest. A more fitting analogy would be food in a freezer: you need to cycle through it—use, update, and refresh it—or it loses value.

Over the past several years, we’ve developed a deeper understanding of how to manage and utilize data to maximize value. This understanding has been accompanied by disruptive technologies that simplify complex tasks and lower the cost of working with data.

Yet, the modern data ecosystem remains fragmented and overwhelming. If you try to organize the myriad companies into a coherent “tech stack,” it feels like playing “52-card pickup.” Few solutions fit neatly together, and the integration of “best-of-breed” offerings is fraught with challenges.

Consider Matt Turck’s data ecosystem diagrams from 2012 to 2020: they reveal an explosion of companies and categories, making the industry harder to navigate. While these diagrams provide a loosely organized catalog of players, they are far from a practical roadmap. Companies attempting to build their own modern stack often find themselves lost in this complexity. No one fully understands the entire ecosystem—it’s simply too vast.

A Saner (But Still Legacy) Approach

Andreessen Horowitz (a16z) offers a more structured perspective, based on the data lifecycle, which they term a “unified data infrastructure architecture.” Their model starts with data sources on the left, followed by ingestion/transformation, storage, historical processing, predictive processing, and finally output on the right. Along the bottom, they highlight data quality, performance, and governance as pervasive layers spanning the entire stack.

This model resembles the linear pipeline architectures of legacy systems. However, as with other models, many modern companies don’t fit neatly into a single category. For instance, some straddle multiple, non-adjacent areas (e.g., both ETL and visualization), offering valuable capabilities that don’t map onto a single stage of the pipeline.

Sources

On the left, data sources are foundational but always evolving. These include transactional databases, applications, and the other systems of record long discussed in Big Data presentations. The key takeaway here is the “Three Vs” of Big Data: Volume, Velocity, and Variety. Traditional platforms struggled with at least one of these dimensions, fueling the rise of modern data ecosystems.

Ingestion and Transformation

Ingestion and transformation are more convoluted. This segment includes traditional ETL platforms, newer ELT tools, and real-time or event-based streaming technologies. Innovations in this space address semi-structured data (e.g., JSON) and the need to optimize for flexibility, efficiency, or ease of use. But it’s impossible to achieve all three in a single tool.

Since data sources are dynamic, ingestion and transformation technologies must evolve continuously.
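To make the ELT pattern for semi-structured data concrete, here’s a minimal Python sketch (the event shape, field names, and flatten_order helper are all hypothetical) of the transform step: a raw JSON event lands as-is from an application source and is only later flattened into warehouse-ready rows.

```python
import json

# Hypothetical raw event as it might arrive from an application source:
# nested, semi-structured, and subject to change without notice.
raw_event = json.loads("""
{
  "order_id": "A-1001",
  "customer": {"id": 42, "region": "EMEA"},
  "items": [{"sku": "X1", "qty": 2}, {"sku": "Y9", "qty": 1}]
}
""")

def flatten_order(event: dict) -> list[dict]:
    """Transform step of an ELT flow: explode nested items into flat rows
    suitable for a warehouse table. Unknown fields are simply ignored,
    which keeps the loader tolerant of upstream schema drift."""
    return [
        {
            "order_id": event["order_id"],
            "customer_id": event["customer"]["id"],
            "region": event["customer"]["region"],
            "sku": item["sku"],
            "qty": item["qty"],
        }
        for item in event.get("items", [])
    ]

for row in flatten_order(raw_event):
    print(row)
```

The specific fields don’t matter; the point is that the raw payload is loaded first and reshaped later, so the pipeline can absorb upstream changes without breaking ingestion.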

Storage

Storage has also undergone significant innovation, driven by the need for elastic capacity. Historically, databases tightly coupled compute and storage, making upgrades cumbersome and costly. Today, platforms like Snowflake decouple compute from storage, enabling virtually unlimited scale and elastic resource allocation.

Snowflake is a fascinating case, defying categorization. While it is primarily a data warehouse, its Data Marketplace positions it as a data source, and tools like Snowpark allow it to function as a transformation engine. The key disruptions in this space are cheap, effectively unlimited storage and elastically scalable compute.
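To illustrate what decoupled compute looks like in practice, here’s a rough sketch, assuming the snowflake-connector-python package, placeholder credentials, and a hypothetical ANALYTICS_WH warehouse and orders table: compute is scaled up for a heavy query and back down afterward, while the storage layer is untouched throughout.

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Placeholder credentials: substitute your own account details.
conn = snowflake.connector.connect(
    user="REPORTING_USER",
    password="***",
    account="my_account",
    warehouse="ANALYTICS_WH",
)

cur = conn.cursor()
try:
    # Scale compute up for a heavy workload; the underlying storage
    # layer is completely unaffected by this change.
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")
    cur.execute("SELECT region, SUM(qty) FROM orders GROUP BY region")
    for row in cur.fetchall():
        print(row)
finally:
    # Scale back down and suspend so you only pay for compute while it runs.
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL'")
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SUSPEND")
    conn.close()
```

Because compute and storage bill separately, the warehouse can sit suspended between runs while the data stays put, which is exactly the elasticity that tightly coupled legacy databases couldn’t offer.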

BI and Data Science

The a16z model becomes less practical when categorizing Historical, Predictive, and Output processes. Many companies span all three, making these groupings largely academic. To simplify, I reduced them to two categories: Business Intelligence (BI) and Data Science.

Business Intelligence (BI)

BI has evolved dramatically in the past 15 years. Legacy platforms required extensive data modeling and semantic layers to overcome the performance limitations of OLAP databases. They were centrally managed, offered restricted access to pre-aggregated data, and were refreshed infrequently.

Modern BI, however, empowers the average office worker to create reports and analyses. Data is more granular and near real-time, enabling faster decision-making. Teams can now monitor performance metrics updated every 15 minutes, driving immediate behavioral adjustments and operational efficiency.

Data Science

Data Science has long been a powerful tool, but its democratization is relatively recent. Today, “citizen data scientists” (non-coders with domain expertise) can leverage user-friendly platforms to conduct sophisticated analyses without needing deep technical skills. However, this democratization has also increased the risk of sensitive data exposure.

For example, analyzing Personally Identifiable Information (PII) may be necessary for consumer insights, but it doesn’t require access to the raw values. Tokenization, which replaces sensitive data with secure tokens, allows analysis without compromising security. Deterministic tokenization, in which the same input always produces the same token, even preserves the ability to join tables on tokenized keys.
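As a simplified illustration of the idea, not any particular vendor’s product, the Python sketch below uses a keyed HMAC to tokenize an email address deterministically. Because identical inputs always yield identical tokens, two datasets tokenized with the same key can still be joined on that column without ever exposing the raw value.

```python
import hmac
import hashlib

# Secret key held by the tokenization service, never by analysts.
TOKENIZATION_KEY = b"rotate-me-and-store-me-in-a-vault"

def tokenize(value: str) -> str:
    """Deterministic tokenization: identical inputs always produce
    identical tokens, so tokens remain usable as join keys."""
    digest = hmac.new(TOKENIZATION_KEY, value.encode("utf-8"), hashlib.sha256)
    return "tok_" + digest.hexdigest()[:16]

# The same email tokenizes identically in both "tables",
# so a join on the tokenized column still matches.
orders = {tokenize("jane@example.com"): {"order_total": 120.50}}
profiles = {tokenize("jane@example.com"): {"segment": "loyal"}}

for token, order in orders.items():
    if token in profiles:
        print(token, order["order_total"], profiles[token]["segment"])
```

The analyst sees only tokens, yet counts, joins, and segmentation still work; the mapping back to real identities stays locked inside the tokenization service.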

The Modern Data Stack: Opportunities and Challenges

The Modern Data Stack (MDS) has revolutionized how organizations handle data, enabling agility and real-time decision-making. By democratizing access to analytics tools, businesses can quickly respond to market shifts and customer needs.

However, this transformation introduces new challenges:

Data Governance: Easier access to granular data increases the risk of misuse.

Security: Sensitive data must be protected as “citizen data scientists” gain broader access.

On the Horizon

As we move toward more advanced analytics, spanning BI, Data Science, Machine Learning, and AI, executives must address the governance implications. Future posts in this series will explore strategies for maintaining control and transparency in an increasingly complex and automated data environment: securing sensitive data through techniques like tokenization, establishing ethical guidelines for AI use, and implementing governance frameworks that balance innovation with risk management. By staying ahead of these governance challenges, executives can ensure their organizations harness the full potential of the Modern Data Stack while mitigating risk and maintaining compliance in a rapidly evolving data and AI landscape.