Discover how tagging leads to better data management and analytics, and all the strategies around it.
Stéphane Derosiaux
Jan 16, 2024
Imagine transforming disorganized data into a well-oiled analytical machine. Over 40% of organizations suffer from poor data quality, affecting their operations and decisions (source: Experian). There's a simple yet often overlooked solution: tagging resources.
Are you making the most of tags in your day-to-day job? This article will explore how powerful tags can be, how they can save resources, and how they can significantly improve the accuracy of data analytics.
A tag? It's just a string... right?
A "tag" is just a label or keyword assigned to a piece of information. The hashtags we use on X (formerly Twitter) or Instagram are simply tags with a # prefix. Click on one, e.g. #apachekafka, and within seconds you can find everyone talking about that topic!
Beyond social networks, we're using the same concept everywhere. Trello, Linear? Our tickets are grouped in columns. Isn't a column just a tag to help us be organized?
A tag on a resource can represent anything:
its owner, team, project name
its environment, update frequency (millis, daily, weekly!)
its internal name (often a cryptic code), its cost center
Tags can represent time and look like a key:value pair: shutdown_at:xxx, restart_every_ms:xxx. They are meant to be interpreted by someone or something (automation, GitOps). They can be added or removed to trigger any automation.
Tags are virtually free fields to store metadata on a resource. It's like having a key-value store associated with any resource.
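To make the "key-value store interpreted by automation" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the resource names, the resources dictionary, and the choice of an ISO 8601 timestamp as the value of shutdown_at (the article leaves that value unspecified).

```python
from datetime import datetime, timezone

# Hypothetical resources, each carrying free-form key:value tags.
# The ISO 8601 format for shutdown_at is an assumption of this sketch.
resources = {
    "orders-topic": {"owner": "sales", "env": "prod"},
    "load-test-cluster": {
        "owner": "platform",
        "env": "dev",
        "shutdown_at": "2024-01-16T18:00:00+00:00",
    },
}


def due_for_shutdown(tags: dict[str, str], now: datetime) -> bool:
    """Return True when a shutdown_at tag exists and its time has passed."""
    raw = tags.get("shutdown_at")
    if raw is None:
        return False
    return datetime.fromisoformat(raw) <= now


def resources_to_shut_down(now: datetime) -> list[str]:
    """List the resources an automation job would stop at time `now`."""
    return sorted(
        name for name, tags in resources.items() if due_for_shutdown(tags, now)
    )


now = datetime(2024, 1, 16, 19, 0, tzinfo=timezone.utc)
print(resources_to_shut_down(now))  # ['load-test-cluster']
```

Removing the shutdown_at tag from a resource is enough to take it out of the job's scope, which is what "added or removed to trigger any automation" could look like in practice.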
For instance, in Conduktor, tags can be added to topics to clarify ownership and make management easier.
Tags are a lossy compression
Tagging data is a straightforward concept necessary to organize, retrieve, and comprehend large-scale data. Tags serve as a metadata layer, facilitating efficient classification and location of data, thereby reducing cognitive effort.
Tagging works like lossy compression: a tag keeps the category and drops the detail. For instance, data tagged as "PII" tells you that personal identifiers such as email, name, or phone number are present, without saying which ones or where. Likewise, a library shelf labeled "A-B" does not pinpoint a book's exact location, but it narrows the search considerably, much as each step of a binary search discards part of the search space.
Tags act as versatile tools, offering multiple perspectives on data, similar to a prism. They simplify the selection of pertinent resources for analysis, organize information, and support visual representations such as charts and tables, enhancing human data handling efficiency.
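One way to picture tags as "multiple perspectives" on the same data is an inverted index from tag to resources. The sketch below is illustrative only: the topic names, the tag values, and the build_tag_index helper are made up for this example.

```python
from collections import defaultdict

# Hypothetical topics and their tags (names are illustrative only).
topic_tags = {
    "customers": {"PII", "team:sales"},
    "payments": {"PII", "team:finance"},
    "page-views": {"team:marketing"},
}


def build_tag_index(tagged: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a resource -> tags mapping into a tag -> resources mapping."""
    index: dict[str, set[str]] = defaultdict(set)
    for resource, tags in tagged.items():
        for tag in tags:
            index[tag].add(resource)
    return dict(index)


index = build_tag_index(topic_tags)
print(sorted(index["PII"]))  # ['customers', 'payments']
```

The lookup is fast precisely because it is lossy: the index says which topics contain PII, not which fields inside them do.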
Tags enable Data Mesh
At Conduktor, we are firm believers in the Data Mesh principles. Our mission is to help organizations and their platform teams adopt this model to unlock the potential of their teams and maximize the value of their data.
In Data Mesh, tags are critical in categorizing everything in Data Products. They are instrumental in defining various aspects of the product and the data:
ownership, description
semantic models, data schemas
Service Level Agreements (SLAs), lifecycle, intended use
data quality, freshness, lineage, versioning, data range, data's origin...
Tags form an integral component of data governance. They are predominantly curated manually, but the advancing field of AI is rapidly moving towards automating this process or at least suggesting tags.
The absence of tags often indicates subpar data quality and inadequate governance. Tags don’t just organize data; they imbue it with context and meaning, facilitating better accessibility, compliance, and utility across the data ecosystem. In a Data Mesh architecture, where decentralized data ownership and domain-oriented design are critical, tags act as the connective tissue, ensuring coherence and clarity in the sprawling data landscape.
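If missing tags are a signal of weak governance, a simple check can surface them. The following is a minimal sketch, assuming each data product exposes its tags as a plain dictionary; the REQUIRED_TAGS set and the missing_tags helper are hypothetical choices, not a Data Mesh standard.

```python
# Assumed governance baseline: tag keys every data product must carry.
REQUIRED_TAGS = {"owner", "description", "sla", "freshness"}


def missing_tags(tags: dict[str, str]) -> set[str]:
    """Return the required keys that are absent or have an empty value."""
    return {key for key in REQUIRED_TAGS if not tags.get(key, "").strip()}


# Hypothetical data product with an empty SLA and no freshness tag.
data_product = {"owner": "sales", "description": "Online orders", "sla": ""}
print(sorted(missing_tags(data_product)))  # ['freshness', 'sla']
```

A check like this could run in CI or in a GitOps pipeline so that untagged data products are flagged before they are published.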
Tags enable Security (ABAC)
Attribute-Based Access Control (ABAC) is an alternative approach to the classic RBAC (Role-Based Access Control) for authorization. Permissions are based on specific attributes (= tags) of resources and users.
Tags are not limited to resources; they can also be assigned to roles and users within a system. This is powerful as it allows the formulation of a diverse set of ABAC policies applicable to "who" the user is and "what" they are trying to access. The crux of these policies lies in their ability to authorize actions only when there's a match between the tags of the user (the "Principal") and the associated resource.
Example: Consider a scenario where specific sensitive data is tagged with "sensitive". An ABAC policy can be designed to ensure that only users or roles tagged "can_sensitive" can interact with this data.
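The example above can be sketched as a small policy check. This is a minimal illustration, not the API of any particular product: the Tagged class, the REQUIRED_PRINCIPAL_TAG table, and the is_allowed function are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """Anything that carries tags: a user, a role, or a resource."""
    name: str
    tags: frozenset[str] = frozenset()


# Assumed policy table: a resource tag -> the principal tag that unlocks it.
REQUIRED_PRINCIPAL_TAG = {"sensitive": "can_sensitive"}


def is_allowed(principal: Tagged, resource: Tagged) -> bool:
    """Allow access only if the principal holds the matching tag for
    every restricted tag found on the resource."""
    return all(
        REQUIRED_PRINCIPAL_TAG[tag] in principal.tags
        for tag in resource.tags
        if tag in REQUIRED_PRINCIPAL_TAG
    )


alice = Tagged("alice", frozenset({"can_sensitive"}))
bob = Tagged("bob")
payroll = Tagged("payroll-topic", frozenset({"sensitive"}))
page_views = Tagged("page-views")

print(is_allowed(alice, payroll))    # True
print(is_allowed(bob, payroll))      # False
print(is_allowed(bob, page_views))   # True
```

Note that no role is ever listed in the policy: granting or revoking access comes down to adding or removing a tag on the principal or on the resource.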
Tags can be hierarchical
Tags in data management systems are not confined to a singular, flat structure; they can have a hierarchical design, like folders. This nested arrangement enhances the clarity of relationships between various concepts.
A hierarchical structure proves significantly more effective than a flat tagging system in extensive ecosystems with many tagged items. It simplifies navigation and reduces potential conflicts or misunderstandings that often arise at scale.
Moreover, in diverse organizational environments, the same tags might acquire different meanings within various hierarchies due to distinct contexts created by different teams. Combining multiple tags to understand the big picture can become messy.
Here, a hierarchical approach to tagging becomes instrumental: it introduces order and makes data relationships easier to grasp, which in turn improves the efficiency and accuracy of data retrieval and interpretation.
e.g.
roles/owner
sales/onlinesalesteam/summary
geo:europe/france/paris
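Hierarchical tags like these can be queried the way folders are: by ancestor. Here is a minimal sketch, assuming "/" is the only separator; the is_under helper and the geo:europe/germany/berlin tag are made up for illustration.

```python
def is_under(tag: str, ancestor: str) -> bool:
    """True when `tag` equals `ancestor` or sits below it in the hierarchy.

    Comparing whole path segments (instead of a raw string prefix) avoids
    false matches such as "geo:europeans/..." for the ancestor "geo:europe".
    """
    tag_parts = tag.split("/")
    ancestor_parts = ancestor.split("/")
    return tag_parts[: len(ancestor_parts)] == ancestor_parts


tags = [
    "geo:europe/france/paris",
    "geo:europe/germany/berlin",  # hypothetical extra tag
    "sales/onlinesalesteam/summary",
]
print([t for t in tags if is_under(t, "geo:europe")])
# ['geo:europe/france/paris', 'geo:europe/germany/berlin']
```

A single query on geo:europe returns everything tagged at the country or city level, which a flat tagging scheme would need one tag per level to achieve.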
Going further? Ontology
We're not talking about the "ontology" that represents the branch of metaphysics dealing with the nature of being (!) but the ontology that represents "a set of concepts and categories in a subject area or domain that shows their properties and the relations between them".
An ontology is a structured framework for organizing information and knowledge. It defines:
categories
properties
relationships
An ontology carries more meaning than simple tags. Ontologies are used in complex systems like AI, databases, and the semantic web, where rich and formal structures are necessary. We won't explore them here, but Google is your friend (or ChatGPT now?).
Ontologies enable a deeper level of understanding and interaction with data. They go beyond the basic categorization provided by tags, establishing a more comprehensive, nuanced, and interconnected data model.
Conclusion: Tag everything, everywhere!
Tags are a must-have for any serious ecosystem of (data) resources. They are valuable to companies of any size, even though the goal may differ: metadata, automation, data analytics, or regulatory compliance. Whether you're a team of 5 working with hundreds of resources or 100 people working with five resources, tagging everything won't hurt.
Conduktor helps you tag main resources like topics to handle classification, quick search, and ownership. Tags will become central as they can provide quick insights into your whole ecosystem and lead to better (faster) decisions and a more organized workflow (via automation).