There are different opinions and a lot of confusion about the naming of topics. In this article, I present the best practices that have proven themselves in my experience and that scale best, especially for larger companies.
Right at the beginning of the development of new applications with Apache Kafka, the all-important question arises: what name do I give my topics? If each team or project has its own naming scheme, this can perhaps be tolerated at development time. However, it does not help collaboration if it is unclear which topic is to be used and which data it carries. A decision must be made before going live at the latest, in order to prevent a proliferation of naming schemes. After all, topics cannot be renamed afterward: if you decide on a new name later, you have to delete the old topic, create a new topic with the new name, and adapt all dependent applications. So how do you proceed, what scales best, and what should you pay attention to?
Naming things is always a very sensitive topic: I well remember meetings where a decision was to be made for the company-wide programming guidelines and this item on the agenda just wouldn’t disappear from meeting to meeting because of disputes about the naming of variables. With this article, I would like to provide you with a decision-making basis for topic naming in your project or company based on our experience at Xeotek. As a vendor of a datastream exploration and management software for Apache Kafka & Amazon Kinesis (Xeotek KaDeck), we have probably seen and experienced almost every variation in practical use.
The beer coaster rule
The “best practices” presented here have been gained from various projects with a wide range of customers and industries. However, one thing is crucial: don’t do too little, but don’t overdo it either! The methodology used for naming topics naturally depends on the size of the company and the system landscape. Over-engineering should be avoided as much as possible: if at the end of the day the guidelines for topic names fill pages and are only understood by a small group of people, then this is not useful. Regarding the scope, a quote from a colleague always comes to mind, which seems appropriate at this point:
“It has to fit on a beer coaster.”
The structural design
Since topics cannot technically be grouped into folders or groups, it is important to create a structure for grouping and categorization at least via the topic name. The question arises how the different “folders”, “properties” or simply “components” should be separated. This is primarily a matter of taste. The separation by a dot (.) and the structure in the sense of the Reverse Domain Name Notation (reverse-DNS) has proven itself.
This is the approach we have found most frequently with our customers, followed by underscores. CamelCase or comparable approaches, on the other hand, are found rather rarely.
When separating with dots, it is recommended (as with domains) to avoid capitalization: write everything in lower case. This is a simple rule and avoids philosophical questions such as whether “MyIBMId”, “MyIbmId” or “MyIBMid” is the better spelling.
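The two rules above (dot-separated components, everything in lower case) are easy to check automatically. The following is a minimal sketch, assuming that each component may contain only lowercase letters and digits; that restriction is my own simplification, since Kafka itself also permits uppercase letters, underscores and hyphens. The function name is hypothetical.

```python
import re

# Assumption: a component consists of lowercase letters and digits only.
# Kafka itself is more permissive (it also allows A-Z, "_" and "-").
TOPIC_NAME = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*")

# Kafka limits topic names to 249 characters.
MAX_LENGTH = 249


def is_valid_topic_name(name: str) -> bool:
    """Return True if the name is lowercase and dot-separated."""
    return len(name) <= MAX_LENGTH and TOPIC_NAME.fullmatch(name) is not None
```

With this check, sales.ecommerce.shoppingcarts is accepted, while Sales.Ecommerce.ShoppingCarts, sales_ecommerce and names with empty components such as sales..shoppingcarts are rejected.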
What is the name of the data?
Once the structural design has been determined, the question is what we want to structure in the first place: what belongs in the topic name? Of course, the topic should bear the name of the data. But what is the name of the data contained in the topic?
Readers who have already experienced the attempt to create a uniform, company-wide data model (there are many legends about it!) know the problem: not only can technical and business names differ, but one and the same data set can also have completely different names in different departments (each with its own “ubiquitous language”). Therefore, data ownership must be clarified at this point: who is the data producer or who owns the data? And in terms of domain-driven design (DDD): in which domain is the data located?
In order to be able to name the data, it is, therefore, necessary to specify the domain and, if applicable, the context. The actual, functional, or technical name of the data set is appended at the end.
risk.portfolio.analysis.loans.csvimport or sales.ecommerce.shoppingcarts
As the example shows, this is also a question of company size and system landscape: you may only need to specify one domain, or you may even need several subdomains.
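The structure described above (domain, optional subdomains, then the name of the data set) can be sketched as a small value type. This is only an illustration: the class and field names are hypothetical, and which components of the example names count as subdomains is my own reading.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TopicName:
    """Hypothetical helper: domain, optional subdomains, data set name."""

    domain: str
    dataset: str
    subdomains: Tuple[str, ...] = ()

    def __str__(self) -> str:
        # Order: domain first, then subdomains, data set name last.
        parts = (self.domain, *self.subdomains, self.dataset)
        return ".".join(part.lower() for part in parts)
```

For example, str(TopicName("sales", "shoppingcarts", ("ecommerce",))) yields sales.ecommerce.shoppingcarts, and a topic with only one domain simply leaves the subdomains empty.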
Who may use the data?
In the previous section, data was structured on the basis of domains and subdomains. Particularly in larger companies, it can make sense to mark cross-domain topics and thus control access and use. In this way, it is already clear from the topic name whether it is data that is only intended for internal processing within an area (domain), or whether the data stream (for example, after measures have been taken to ensure data quality) can be used by others as a reliable data source. Of course, this does not replace rights management and it is not intended to do so. However, explicitly marking the data as “private” or “public” with a corresponding prefix prevents other users from mistakenly working with “unofficial”, perhaps even experimental data without knowing it.
public.sales.ecommerce.shoppingcarts
private.risk.portfolio.analysis.loans.csvimport
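Because the marker is simply the first component of the name, tools can read it without any extra metadata. A minimal sketch, assuming the two prefixes “public” and “private” from the examples; the function names are hypothetical.

```python
from typing import Tuple

VISIBILITIES = ("public", "private")


def split_visibility(topic: str) -> Tuple[str, str]:
    """Split 'public.sales.x' into ('public', 'sales.x')."""
    prefix, _, rest = topic.partition(".")
    if prefix not in VISIBILITIES or not rest:
        raise ValueError(f"topic {topic!r} lacks a public/private prefix")
    return prefix, rest


def is_public(topic: str) -> bool:
    """Return True if the topic is marked for use by other domains."""
    return split_visibility(topic)[0] == "public"
```

Raising an error for names without a prefix is a design choice of this sketch: it makes unmarked topics visible early instead of silently treating them as private.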
What should be avoided?
In addition to the above recommendations that have worked well in the past, there are also a number of approaches that do not work so well. You should have good reasons for these approaches (and there may well be), otherwise, it is best to avoid them.
One of these negative experiences is appending a version number to the topic name. This approach quickly leads to countless topics that may not be deletable as quickly as they were created. Especially with a topic or partition limit, as is common with many managed Apache Kafka providers, this can become a real problem. In the worst case, other users of the topic also have to deploy one instance per topic version if their application can only read from or write to one topic. If the application can read from several topics at the same time (e.g. from all versions), the next problem arises when writing data back to a topic: do you write to only one topic, or do you split the outgoing topics into the respective versions again because downstream processes might depend directly on the different versions of the topic? As you can see, this will quickly get you into hot water. The better way is to add the version number of the schema used to the header of the respective record. This does not solve the problem of handling versions in downstream processes, but the overview is not lost. It is even better to use a schema registry in which all information about the schema, versioning, and compatibility is stored centrally.
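Carrying the schema version in a record header could look like the following sketch. It only builds and reads the header list and does not talk to a broker. The header key “schema-version” and both function names are assumptions of mine; the list of (key, bytes) tuples is, to my knowledge, the shape that Python clients such as kafka-python accept for record headers.

```python
from typing import List, Optional, Tuple

Header = Tuple[str, bytes]

# Hypothetical header key; agree on one name across all producers.
SCHEMA_VERSION_HEADER = "schema-version"


def with_schema_version(version: int,
                        headers: Optional[List[Header]] = None) -> List[Header]:
    """Return a copy of the headers with the schema version set."""
    # Drop any earlier version header so the key appears only once.
    kept = [(k, v) for k, v in (headers or []) if k != SCHEMA_VERSION_HEADER]
    return kept + [(SCHEMA_VERSION_HEADER, str(version).encode("utf-8"))]


def read_schema_version(headers: List[Header]) -> Optional[int]:
    """Return the schema version from the headers, or None if absent."""
    for key, value in headers:
        if key == SCHEMA_VERSION_HEADER:
            return int(value.decode("utf-8"))
    return None
```

A consumer can then branch on read_schema_version(...) per record while all versions continue to share a single topic.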
Using application names as part of the topic name can also be problematic: a stronger coupling is hardly possible. However, there are exceptions here, for example for applications in the company that are set in stone anyway. In such a case, it makes no sense to create a large abstraction layer, especially if everyone in the company asks for the data of application X anyway and the “neutral” name causes confusion. However, the name of the domain service (e.g. “pricingengine”) can often be used as a good alternative in the sense of Domain-Driven Design.
Example: using the domain service name “pricingengine” instead of an application name to avoid coupling.
What about namespaces or company names?
You should only use namespaces if there is really no other way. For example, if you have different clients in an Apache Kafka environment, it makes sense to prepend the company name to the topic name.
If there is no such reason, then you should avoid this unnecessary information: your colleagues usually know the name of the company where they work. So no need to repeat this in every topic name.
Enforcing topic naming rules and administrative tasks
To enforce topic naming rules, be sure to set the auto.create.topics.enable setting of your Apache Kafka brokers to false. This means that topics can only be created explicitly, which from an organisational point of view requires a request process. For example, the responsible infrastructure team can act as the contact for the manual creation of topics. Topics can be created with the console tool kafka-topics.sh (using its --create option) supplied with Apache Kafka, although a look at third-party tools with a graphical interface is recommended, not only because they are easier to understand but above all because of the enormous time savings for this and other typical tasks.
In KaDeck Web, for example, the various teams can be granted rights for the independent creation of topics, provided that the topics correspond to a defined naming scheme. This means that teams within their own area (domain) can avoid a bureaucratic process and create and delete topics at short notice, e.g. for testing purposes, without outside help. The user, the action and the affected topic can be traced via an audit log integrated in KaDeck.
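The underlying idea, letting a team create topics on its own as long as the name matches the scheme and lies within the team's domain, can be sketched in a few lines. This is not how KaDeck implements it; the team names, the mapping and the function are purely hypothetical, and the pattern assumes the public/private prefix followed by the domain as described earlier.

```python
import re

# Hypothetical mapping of teams to the domain they own.
TEAM_DOMAINS = {"team-risk": "risk", "team-sales": "sales"}

# Assumed scheme: visibility prefix, domain, then at least one more component.
SCHEME = re.compile(r"(public|private)\.([a-z0-9]+)(?:\.[a-z0-9]+)+")


def may_create(team: str, topic: str) -> bool:
    """Return True if the topic matches the scheme and the team owns its domain."""
    match = SCHEME.fullmatch(topic)
    if match is None:
        return False
    return TEAM_DOMAINS.get(team) == match.group(2)
```

Under these assumptions, team-risk may create private.risk.portfolio.analysis.loans.csvimport, while team-sales may not, and a name without a visibility prefix is rejected for every team.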
By the way, Apache Kafka supports pattern matching when selecting topics, for example regular-expression subscriptions when consuming data (i.e. in the consumer) and wildcard or prefix patterns when assigning rights via ACLs. The proposed naming scheme for topics works very well in this combination: both the recommended separation of “private” and “public” topics and the use of domain names as part of the name allow access for teams from different domains to be set up and controlled very intuitively and quickly.
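How well the naming scheme combines with patterns can be tried out without a cluster. The sketch below only simulates the selection with Python's re module and a prefix check; it assumes that a subscription pattern has to match the entire topic name and that a prefix rule matches every topic starting with the given string. The function names and the topic private.sales.ecommerce.checkout are made up for illustration.

```python
import re
from typing import Iterable, List


def topics_matching(pattern: str, topics: Iterable[str]) -> List[str]:
    """Return the topics whose full name matches the regular expression."""
    compiled = re.compile(pattern)
    return [t for t in topics if compiled.fullmatch(t)]


def topics_with_prefix(prefix: str, topics: Iterable[str]) -> List[str]:
    """Return the topics a prefix-based rule would cover."""
    return [t for t in topics if t.startswith(prefix)]


TOPICS = [
    "public.sales.ecommerce.shoppingcarts",
    "private.sales.ecommerce.checkout",  # hypothetical topic
    "private.risk.portfolio.analysis.loans.csvimport",
]

# All public topics of the sales domain.
public_sales = topics_matching(r"public\.sales\..*", TOPICS)

# Everything the risk domain keeps private.
private_risk = topics_with_prefix("private.risk.", TOPICS)
```

Note the trailing dot in the prefix: without it, private.risk would also cover a domain whose name merely starts with “risk”.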
This article is a list of recommendations that have proven useful in the past when naming topics. The exception proves the rule: perhaps another dimension for structuring your topics makes sense, or some of the approaches I’ve listed as ones to avoid make sense in your case. Feel free to let me know (Twitter: @benjaminbuick or the Xeotek team via @xeotekgmbh)!