Do you remember when we used to build applications by creating data models for the business domain and using those models as reflections of relational database objects -mostly tables- in order to do CRUD actions?
Business requirements poured down the waterfall and soaked us so thoroughly that we could not easily respond to change: new business requirements, bug fixes, enhancements, etc.
When Agile methodologies came along, they eventually made us more flexible and able to respond to change very quickly, and ideas like SOA, service buses, and distributed state management emerged; yet business domains stayed largely merged, and monoliths survived.
Monolithic applications -which are actually not an anti-pattern- ruled the world for a considerable number of years, with different kinds of architectures that each have their own benefits and drawbacks.
Figure 1. Traditional Application Design. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=4
1.1. Benefits and Drawbacks
Easy to start with
Easy transaction management
Can be powered-up with the modular architecture
Hard to change business domain and data model
Hard to scale
Tightly coupled components
2. Cloud-Native Applications and Microservices
One of the main motivations for approaches like Domain-Driven Design is that monoliths are tightly coupled across business domains, and those domains need to be separated -as bounded contexts- in order to loosen the coupling and give each domain a single responsibility.
These approaches led to the creation of microservices, motivated by loose coupling between bounded contexts, being polyglot -using the best-fitting tools for each service- easy horizontal scalability, and, most importantly, the ability to adapt easily to the cloud-native world.
Despite all these benefits, microservices and cloud-native application architectures have some challenges that may turn a developer's life into a nightmare.
Microservice architectures have many challenges -manageability of the services, traceability, monitoring, service discovery, distributed state and data management, and resilience- some of which are handled automatically by cloud-native platforms like Kubernetes. For example, service discovery is one of the requirements of an application that consists of microservices, and Kubernetes provides this service discovery mechanism out of the box.
What cloud-native platforms cannot provide -and leave to the guru developers- is state management itself.
Keeping state in a distributed system and making it flow through the microservices poses some challenges. Keeping the state in a distributed cache like Infinispan and creating a kind of single source of truth for it is a common pattern, but in the mesh of services it is tough to manage, since there will be an Inception of caches.
Keeping the state distributed across services is even tougher.
Since database-per-microservice is a common pattern, and each bounded context should own and handle its own data, the need to share state/data between services makes direct point-to-point communication between microservices all the more important.
2.1.2. Synchronous Communication
Synchronous data retrieval is one way for a microservice to get the data it needs from another microservice. One can use comparatively new technologies like HTTP+REST and gRPC, or some old-school technologies like RMI and IIOP, but all these synchronous point-to-point styles of data retrieval have some costs.
Figure 4. Synchronous Data Retrieval. https://speakerdeck.com/yanaga/distribute-your-microservices-data-with-events-cqrs-and-event-sourcing?slide=5
Latency is one of the key concerns in messaging between services. With synchronous communication, if the service whose data is being retrieved has performance problems of its own, it will serve the data with some latency, which may cause data-retrieval delays or timeout exceptions.
Or a service may have an internal failure and be unavailable for a specific period, during which the synchronous data call simply won't work.
Also, any performance issue on the network directly affects either latency or service availability, so everything depends on the network being reliable.
We know there are patterns like distributed caching, bulkhead, and circuit breaker for handling these failure scenarios by implementing fault-tolerance strategies, but is it really the right way to do it?
2.2. Challenging the Challenges
So there are some solutions -some invented years ago but still solid enough to count as a 'solution', others brand-new architectures- that will help us challenge the challenges of cloud-native application messaging and communication.
Let’s start by taking a look at the common asynchronous messaging architectures before jumping into the solutions.
2.2.1. Asynchronous Messaging and Messaging Architectures
Like synchronous communication, asynchronous communication -or, in other words, messaging- has to be done over a protocol. Both sides of the communication should agree on the protocol, so that the message data being produced or consumed can be understood by the consumer.
While HTTP+REST is the most used protocol for synchronous communication, there are several other protocols that asynchronous messaging systems widely use, like AMQP, STOMP, XMPP, MQTT, or the Kafka protocol.
There are three main types of messaging models:
Point-to-point messaging is like sending a parcel via a postal service. You go to a post office, write the address you want the parcel to be delivered to, and post the parcel knowing it will be delivered sometime later. The receiver does not have to be at home when the parcel is sent, and at some point later the parcel will be received at the address.
Point-to-point messaging systems are mostly implemented as queues that deliver messages in first-in, first-out (FIFO) order, and a given message is received by only one subscriber of the queue.
This brings up queue durability: if there are no active subscribers, the messaging system retains the messages until a subscriber connects and consumes them.
Point-to-point messaging is generally used when a message must be acted upon only once, as queues can best provide an at-least-once delivery guarantee.
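The queue semantics described above can be sketched in a few lines of Python; this is a toy in-memory model to show the behavior, not a real broker:

```python
from collections import deque

class Queue:
    """Point-to-point channel: FIFO order, and each message is delivered
    to exactly one receiver. Messages are retained until consumed."""
    def __init__(self):
        self._messages = deque()

    def send(self, msg):
        self._messages.append(msg)

    def receive(self):
        # Removes and returns the oldest message, so no other subscriber
        # can ever see it again; returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None

q = Queue()
q.send("parcel-1")
q.send("parcel-2")
assert q.receive() == "parcel-1"   # FIFO order
assert q.receive() == "parcel-2"   # consumed once, by one receiver only
assert q.receive() is None         # nothing is left behind
```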
In order to understand how publish-subscribe (pub-sub) works, imagine you are an attendee of a webinar. While you are connected, you can hear and watch what the speaker says, along with the other participants. When you disconnect you miss what the speaker says, but when you connect again you are able to hear what is being said.
So the webinar works like a pub-sub mechanism: all the attendees are subscribers, and the speaker is the broadcaster/publisher.
Pub-sub mechanisms are generally implemented through topics, which act like the webinar broadcast to be subscribed to. When a message is produced on a topic, all the subscribers get it, since the message is distributed to each of them.
Topics are nondurable, unlike queues. This means that a subscriber/consumer that is not consuming messages -because it is not running, for example- misses the messages broadcast during that period. So topics can provide an at-most-once delivery guarantee for each subscriber.
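The nondurable fan-out behavior can be sketched the same way; again a toy model, just to contrast with the queue above:

```python
class Topic:
    """Nondurable pub-sub channel: a published message is fanned out to the
    subscribers connected at publish time; everyone else misses it."""
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        self._subscribers[name] = []

    def publish(self, msg):
        # Every currently connected subscriber gets a copy of the message.
        for inbox in self._subscribers.values():
            inbox.append(msg)

    def inbox(self, name):
        return self._subscribers.get(name, [])

t = Topic()
t.subscribe("alice")
t.publish("slide-1")          # alice is connected, bob is not
t.subscribe("bob")
t.publish("slide-2")          # now both receive a copy
assert t.inbox("alice") == ["slide-1", "slide-2"]
assert t.inbox("bob") == ["slide-2"]   # bob missed slide-1: at-most-once
```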
Hybrid messaging models include both point-to-point and publish-subscribe, since real-world use cases generally require a messaging system with many consumers that each want a copy of the message with full durability; in other words, without message loss.
Technologies like ActiveMQ and Apache Kafka both implement this hybrid model with their own ways of persistence and distribution.
Durability is a key factor, especially in cloud-native distributed systems, since persisting the state and being able to somehow replay it plays a key role in component communication. Adding durability to the capabilities of the publish-subscribe mechanism decreases the dependencies between components/services/microservices, because a message can be persisted and read again, either by the same subscriber or by another one.
So hybrid messaging systems are vital when it comes to passing state as events between cloud-native microservices, since event-driven distributed architectures require exactly these capabilities.
2.2.2. Events & Event Sourcing
In the process of developing microservice-based cloud-native architectures, approaches like Domain-Driven Design (DDD) make it easy to divide the bounded contexts and see the sub-domains related to the parent domain.
One of the best techniques to separate and define the bounded contexts is Event Storming, which takes events as entry points and surfaces everything: commands, data relationships, communication styles, and most importantly the 'combobulators', which are mostly mapped to bounded contexts.
Once most of the event-storming map emerges, one can see all the communication points between the bounded contexts, which are mostly mapped to microservices or services that own their own database and data structure.
Figure 8. A real-life Event Storming example;)
Since this structure consists entirely of events, it suggests using asynchronous communication via a publish-subscribe system that queues the events to be consumed -in other words, doing Event Sourcing.
Event Sourcing is a state-event-message pattern that captures all changes to application state as a sequence of events that can be consumed by other applications -in this case, microservices.
Event Sourcing is very important in a distributed cloud-native environment, because in the cloud-native world microservices can easily be scaled, new microservices can join, or -from an application-modernization perspective- a microservice can be separated from its big monolith mother in order to live its own life.
So having an asynchronous publish-subscribe system with durable data and the ability to replay it is very important. Additionally, queueing the events rather than the final data makes things flexible for other services, which means implementing dependency inversion in an asynchronous environment, with eventual consistency.
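The core idea -state as a replayable sequence of events- can be sketched in a few lines. This is a toy example with hypothetical bank-account events, not tied to any framework:

```python
# Toy event-sourcing sketch: current state is never stored directly;
# it is derived by replaying the full event sequence from the beginning.
events = []  # the append-only event log, the single source of truth

def append(event):
    events.append(event)

def replay(event_log):
    """Rebuild account balances from scratch by folding over all events."""
    balances = {}
    for e in event_log:
        if e["type"] == "deposited":
            balances[e["account"]] = balances.get(e["account"], 0) + e["amount"]
        elif e["type"] == "withdrawn":
            balances[e["account"]] = balances.get(e["account"], 0) - e["amount"]
    return balances

append({"type": "deposited", "account": "acc-1", "amount": 100})
append({"type": "withdrawn", "account": "acc-1", "amount": 30})
assert replay(events) == {"acc-1": 70}
# A brand-new microservice joining later reaches the exact same state
# simply by replaying the same log from the beginning.
```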
The question here is: How to create/trigger those events?
There are many programmatic ways to create and publish events, independent of any particular language or framework library. One can create database listeners or interceptors programmatically (as Hibernate Envers does), or handle events in the DAO (Data Access Object) or service layer of the application. Even so, creating an Event Sourcing mechanism is not easy.
At this point, a relatively new pattern comes as a savior: Change Data Capture.
2.2.3. Change Data Capture
Change Data Capture (CDC) is a pattern that is used to track the data change -mostly in databases- in order to take any action on it.
Figure 10. Change Data Capture. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=15
A CDC mechanism should listen for data changes and create an event that includes both the action -insert, update, or delete- and the changed data itself. After the event is created, it can be published to any durable pub-sub system in order to be consumed.
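As a rough illustration, a consumer applying such change events to its local store might look like this. The envelope below is a simplified assumption, loosely modeled on the `op`/`before`/`after` fields of a Debezium event; real events also carry schema and source metadata:

```python
def apply_change_event(store, event):
    """Apply a simplified CDC event to a local dict store.
    'op' is c=create, u=update, d=delete, r=snapshot read;
    'before'/'after' are row images keyed by 'id'."""
    op = event["op"]
    if op in ("c", "u", "r"):
        row = event["after"]
        store[row["id"]] = row          # upsert the new row image
    elif op == "d":
        del store[event["before"]["id"]]  # remove the deleted row
    return store

store = {}
apply_change_event(store, {"op": "c", "after": {"id": 1, "title": "Hello"}})
apply_change_event(store, {"op": "u",
                           "before": {"id": 1, "title": "Hello"},
                           "after": {"id": 1, "title": "CHANGED - Hello"}})
assert store[1]["title"] == "CHANGED - Hello"
apply_change_event(store, {"op": "d",
                           "before": {"id": 1, "title": "CHANGED - Hello"}})
assert store == {}
```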
Because it decouples the database-change listening capability from the application code, CDC is one of the best patterns for the event-driven architectures of cloud-native applications.
Debezium is probably the most popular open-source implementation nowadays because of its easy integration with a popular set of databases and Apache Kafka, especially on platforms like Kubernetes/OpenShift.
Figure 11. CDC with Debezium. https://developers.redhat.com/blog/2020/04/14/capture-database-changes-with-debezium-apache-kafka-connectors/
Now we have our event-sourcing listener and creator in the form of a CDC implementation like Debezium, and let's say we use Apache Kafka for event distribution between microservices.
Since the data we are producing is not the entity itself but the change, the subscribed microservice should consume the change and reflect it to its own database. This reflected data -again, each microservice has a database of its own because it is a bounded context- is generally used for reads rather than writes, because it is the reflection of an event that was just triggered by a write on another database.
This brings to mind a pattern -already implemented by design in our sample- that has been used for many years, especially with enterprise-level relational databases: CQRS.
CQRS (Command Query Responsibility Segregation) is a pattern that separates the read model from the write model, mainly with the motivation of separation of concerns and performance.
Figure 12. CQRS with Separate Datastores. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=20
In a cloud-native system with a set of polyglot microservices, each with its own database -relational or NoSQL- the CQRS pattern fits well, since each microservice has to keep a data reflection of the applications it depends on.
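A minimal sketch of the CQRS idea, assuming a hypothetical blog-post domain; in a real system the event would travel through Kafka between the two sides:

```python
# Minimal CQRS sketch: commands mutate the write model and emit events;
# the read model is a denormalized projection updated from those events.
write_model = {}   # authoritative store, optimized for writes
read_model = {}    # projection, optimized for queries

def handle_create_post(post_id, title):
    """Command side: perform the write and return the resulting event."""
    write_model[post_id] = {"title": title}
    return {"type": "post_created", "id": post_id, "title": title}

def project(event):
    """Query side: keep a simple lowercase title index up to date."""
    if event["type"] == "post_created":
        read_model[event["id"]] = event["title"].lower()

event = handle_create_post(42, "Strimzi Kafka CLI")
project(event)   # in a real system this hop goes through the event log
assert read_model[42] == "strimzi kafka cli"
```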
2.3. Solutions Assemble: State Propagation
So in our imaginary, well-architected, distributed cloud-native system, in order to make our microservices communicate and transfer their state, we called on the best-of-breed, super-heroic patterns -some of which have great implementations.
Figure 13. State Propagation & Outbox Pattern. https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
This state transfer -through an asynchronous system with durability and pub-sub capabilities like Apache Kafka, triggered by a change data capture mechanism like Debezium, with one component writing the event data that another component reads- is called State Propagation, and it is backed by the Outbox pattern on the microservices side.
State propagation and event-driven mechanisms built with CDC, Event Sourcing, and CQRS help us solve the challenges of the cloud-native microservices era in a very elegant way.
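The outbox side of this can be sketched with plain SQL: the business write and the event write share one local transaction, so there is no dual-write consistency gap, and a CDC connector later picks up the outbox rows. This is a toy model using SQLite; the table and column names are illustrative:

```python
import sqlite3

# The business row and its outbox event are written in the SAME local
# transaction; a CDC tool such as Debezium then tails the outbox table
# and publishes its rows to Kafka.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "aggregate TEXT, type TEXT, payload TEXT)")

def place_order(order_id, total):
    with conn:  # one atomic transaction for both writes
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (aggregate, type, payload) VALUES (?, ?, ?)",
            ("order", "OrderPlaced", f'{{"id": {order_id}, "total": {total}}}'))

place_order(1, 99.9)
# The connector would read (and typically clean up) the published rows:
rows = conn.execute("SELECT aggregate, type FROM outbox").fetchall()
assert rows == [("order", "OrderPlaced")]
```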
A few days ago, I had the chance to speak about "Change Data Capture with Debezium and Apache Kafka" at an Istanbul Java User Group event. After the presentation, I did a small demo that I think was very beneficial for the audience, so I thought it would be best to improve it and kind of "storify" it, both to have fun and to spread it to a wider audience. So here is the demo, and here are the resources that you might need. Enjoy:)
ASAP! – The Storyfied Demo of Introduction to Debezium and Kafka on Kubernetes
Install the prereqs:
Strimzi Kafka CLI:
sudo pip install strimzi-kafka-cli
oc or kubectl
Login to a Kubernetes or OpenShift cluster and create a new namespace/project.
Let's say we create a namespace called debezium-demo by running the following command on OpenShift:
By clicking on the route of the application in the browser you should see a page like this:
And for the overall applications, before the demo, you should have something like this (the OpenShift Developer perspective is used here):
So you should have a Django application that uses a MySQL database, and an Elasticsearch instance that has no data connection to the application -yet:)
So you are working at a company called NeverEnding Inc. as a Software Person, and you are responsible for the company's blog application, which runs on Django and uses MySQL as its database.
One day your boss comes and tells you this:
So, getting the command from your boss, you think that this is a good use case for the Change Data Capture (CDC) pattern.
Since the boss wants it ASAP, and you don't want to do dual writes -which may cause consistency problems- you have to find a way to implement this request easily, and you decide it is best to do it with Debezium on your OpenShift Office Space cluster, along with Strimzi: Kafka on Kubernetes.
In order to install a Strimzi cluster on OpenShift, you decide to use Strimzi Kafka CLI, with which you can also install the Strimzi cluster operator.
First install the Strimzi operator:
kfk operator --install -n debezium-demo
If you already have an operator installed, please check its version. If the Strimzi version you've been using is older than 0.20.0, you have to set the right version as an environment variable, so that you will be able to use the right version of the cluster custom resource.
Let's create a Kafka cluster called demo on our OpenShift namespace debezium-demo.
In the opened editor you may choose the 3-broker, 3-ZooKeeper configuration, which is the default. After saving the configuration file of the Kafka cluster, you should see the resources that are created for it in the Developer perspective of OpenShift:
Deploying a Kafka Connect Cluster for Debezium
Now it's time to create a Kafka Connect cluster using Strimzi custom resources. Since Strimzi Kafka CLI is not yet capable of creating Connect objects at the time of writing this article, we will create it by using the sample resources in the demo project.
Go to the blog admin page again but this time let's change one of the blog posts instead of adding one.
Edit the post titled Strimzi Kafka CLI: Managing Strimzi in a Kafka Native Way and put a "CHANGED -" at the very start of the body for example.
When you change the data, a relatively smaller JSON payload should have been consumed in your console, something like this:
"title": "Strimzi Kafka CLI: Managing Strimzi in a Kafka Native Way",
"text": "CHANGED - Strimzi Kafka CLI is a CLI that helps traditional Apache Kafka users -mostly administrators- to easily adapt Strimzi, a Kubernetes operator for Apache Kafka.\r\n\r\nIntention here is to ramp up Strimzi usage by creating a similar CLI experience with traditional Apache Kafka binaries. \r\n\r\nkfk command stands for the usual kafka-* prefix of the Apache Kafka runnable files which are located in bin directory. There are options provided like topics, console-consumer, etc. which also mostly stand for the rest of the runnable file names like kafka-topic.sh.\r\n\r\nHowever, because of the nature of Strimzi and its capabilities, there are also unusual options like clusters which is used for cluster configuration or users which is used for user management and configuration.",
So this will be the data that you will index in Elasticsearch. Now let's go for it!
Deploying a Kafka Connect Cluster for Camel
In order to use another connector that consumes the data from Kafka and puts it into Elasticsearch, we first need another Kafka Connect cluster, this time for a Camel connector.
Simple ACL Authorization on Strimzi by using Strimzi Kafka CLI
In the previous example we implemented TLS authentication on a Strimzi Kafka cluster with Strimzi Kafka CLI. In this example, we will continue by enabling ACL authorization, so that we will be able to restrict access to our topics and allow only the users or groups we want.
You should have a cluster called my-cluster on the kafka namespace we created before. If you don't have the cluster and haven't yet done the authentication part, please go back to the previous example and do it first, since authorization requires authentication to be set up beforehand.
Also, please copy the truststore.jks and user.p12 files, or recreate them as explained in the previous example, and put them in the example folder, which we ignore in git.
Considering you have the cluster my-cluster on namespace kafka, let's list our topics to see the topic we created before:
kfk topics --list -n kafka -c my-cluster
NAME PARTITIONS REPLICATION FACTOR
consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a 50 3
my-topic 12 3
Lastly, let's list the user we created previously, for which we will be setting up authorization.
kfk users --list -n kafka -c my-cluster
NAME AUTHENTICATION AUTHORIZATION
As you can see we have the my-user user that we created and authenticated in the previous example.
Now let's configure our cluster to enable ACL authorization. We have to alter the cluster for this:
ERROR Error when sending message to topic my-topic with key: null, value: 4 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [my-topic]
ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [my-topic]
Processed a total of 0 messages
As you can see, both the producer and the consumer returned a TopicAuthorizationException saying Not authorized to access topics: [my-topic]. So let's define authorization access to this topic for the user my-user.
In order to enable the user's authorization, we have to both define the user's authorization type as simple -so that it uses Apache Kafka's SimpleAclAuthorizer- and add the ACL definitions for the relevant topic, in this case my-topic. To do this, we need to alter the user with the following command options:
--operation TEXT Operation that is being allowed or denied.
--host TEXT Host which User will have access. (default:
--type [allow|deny] Operation type for ACL. (default: allow)
--resource-type TEXT This argument is mutually inclusive with
--resource-name TEXT This argument is mutually inclusive with
In this example we only used --resource-type and --resource-name, since those are the required fields; the others have defaults we could use.
So in this case we used the defaults type:allow, host:*, and operation:All. The equivalent command should look like this:
ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: console-consumer-96150
Processed a total of 0 messages
Whoops! It did not work like the producer did. But why? Because the consumer group that was randomly generated for us (because we did not define one anywhere) doesn't have at least read permission on the my-topic topic.
In Apache Kafka, if you want to consume messages you have to do it via a consumer group. You might say, "we did not specify any consumer group while using the console consumer." Well, the traditional console consumer of Kafka uses a randomly created consumer group id, so you do have a consumer group, but it was created for you (like console-consumer-96150 above), since we did not define one ourselves.
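To make the group idea concrete, here is a toy sketch of what group assignment achieves: each partition is owned by exactly one member of a group, so the group as a whole processes every message once. Real Kafka assignment strategies (range, round-robin, sticky) are more elaborate than this:

```python
def assign_partitions(partitions, members):
    """Toy round-robin sketch of consumer-group assignment: every partition
    is owned by exactly one member, so messages are processed once per group."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment

a = assign_partitions(range(4), ["consumer-1", "consumer-2"])
assert a == {"consumer-1": [0, 2], "consumer-2": [1, 3]}
# No partition appears twice across members: each message has one owner.
```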
OK then. Now let's add an ACL for a group in order to give read permission on the my-topic topic. Let's call this group my-group, and we will also use it as the group id in our consumer client configuration. This time let's use the kfk acls command, which works like the kfk users --alter --add-acl command. In order to give the best traditional experience to Strimzi CLI users, kfk acls works mostly the same as the traditional bin/kafka-acls.sh command.
With the following command, we give the my-group group read permission for consuming messages.
After making sure we can produce and consume messages without a problem, let's now enable TLS authentication. In Strimzi, if you want to enable authentication, there are listener configurations that provide a couple of authentication methodologies, like scram-sha-512, oauth, and tls.
In order to enable authentication, we have to alter our Kafka cluster.
An editor will be opened in order to change the Strimzi Kafka cluster configuration. Since the Strimzi Kafka cluster resource has many items inside, for now we don't have any special property flag to directly set a value while altering. That's why we only open the cluster custom resource for editing.
In the opened editor we have to add the following listeners:
If you want to fully secure your cluster, you also have to change the plain listener to require authentication, because with the configuration above, unless we use a client configuration with the SSL security protocol, the client will use the plain listener, which doesn't require any authentication. To do that, we can tell the plain listener in the cluster config to use one of the authentication methodologies, scram-sha-512 or oauth. In this example we will set it to scram-sha-512, but we will demonstrate authentication via scram-sha-512 in another example.
So the latest listener definition should look like this:
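Just as a sketch -assuming the older, map-style Strimzi listener syntax; Strimzi 0.20.0 and newer define listeners as a list of named entries instead- the listener block might look roughly like this:

```yaml
# Assumed pre-0.20 map-style syntax; field layout differs in newer Strimzi.
listeners:
  plain:
    authentication:
      type: scram-sha-512
  tls:
    authentication:
      type: tls
```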
2020-09-22 11:18:33,122 INFO [SocketServer brokerId=0] Failed authentication with /10.130.2.58 (Unexpected Kafka request of type METADATA during SASL handshake.) (org.apache.kafka.common.network.Selector) [data-plane-kafka-network-thread-0-ListenerName(PLAIN-9092)-SASL_PLAINTEXT-3]
Since we are not yet using SSL for authentication, but the PLAIN connection method, which we set up as scram-sha-512, we cannot authenticate to the Strimzi Kafka cluster.
In order to log in to this cluster via SSL authentication, we have to:
Create a user that uses TLS authentication
Create truststore and keystore files by getting the certificates from Openshift/Kubernetes cluster
Create a client.properties file that is to be used by producer and consumer clients in order to be able to authenticate via TLS
Let's first create the user with the name my-user:
In order to create the truststore and keystore files, just run the get_keys.sh script in the example directory:
chmod a+x ./get_keys.sh;./get_keys.sh
This will generate two files:
truststore.jks for the client's truststore definition
user.p12 for the client's keystore definition
TLS authentication is done via a bidirectional TLS handshake. So, apart from a truststore with the public key imported, a keystore file that has both the public and private keys has to be created and defined in the client configuration file.
So let's create our client configuration file.
Our client configuration should have a few definitions like:
Truststore location and password
Keystore location and password
The security protocol should be SSL, and since the truststore and keystore files are located in the example directory, the client config file should look something like this:
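A minimal sketch of such a client configuration, using the standard Kafka client SSL properties; the password values are placeholders that you must replace with the ones used when the stores were created:

```properties
security.protocol=SSL
ssl.truststore.location=./truststore.jks
ssl.truststore.password=<truststore-password>
ssl.keystore.location=./user.p12
ssl.keystore.type=PKCS12
ssl.keystore.password=<keystore-password>
```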
Apache Kafka, today, is a popular distributed streaming technology providing a pub/sub mechanism and storing and processing streams in a distributed way.
As Kubernetes becomes more and more popular day by day, it was inevitable for a technology like Apache Kafka to run natively on the Kubernetes platform. Using capabilities of Kubernetes like the Operator Framework and StatefulSets, a group of people from Red Hat started an open-source community project called Strimzi, which is now one of the most popular and reliable Kubernetes operators and a CNCF sandbox project.
It has been almost a year since I gave a talk about Strimzi, titled "Strimzi: Distributed Streaming with Apache Kafka in Kubernetes", at the Istanbul JUG community. Since then a lot has changed in both the Apache Kafka and the Strimzi worlds. ZooKeeper now has TLS support, and Strimzi has been improved with many features like SCRAM-SHA authentication, MirrorMaker 2, and many more. Strimzi is more powerful now, and this -as the upstream project- is what makes Red Hat AMQ Streams stronger.
Besides the many features and improvements in Strimzi (aka the Strimzi Kafka Operator), today I want to introduce you to a tool that I've been working on for a while, aiming to make traditional Apache Kafka administrators' or developers' lives easier while using Kubernetes-native Kafka with Strimzi. But before that, let me tell you my short story about meeting Kafka and Strimzi, and my experiences afterward. Don't worry, we will come to a point:)
Meeting Apache Kafka
It was a few years ago, when I was working as a developer at sahibinden.com, a classified-ads and e-commerce company. At Sahibinden, we welcomed very high numbers of concurrent visitors at that time, including ones who were not logged in. This was actually one of the biggest challenges, because anyone could surf all the site content anonymously and interact with classified ads and other components by viewing details, clicking, etc. And my Apache Kafka journey began with one of those challenges.
The job was this: each classified-ad visit had to be saved to a list, created for each user -including the anonymous ones- as Last Visited Classified Ads. Before that project, I had no idea about messaging technologies, their usage, or their differences. Kafka had already been used in the company for some time, and the architects had already decided which technology to use for that kind of project. So, after getting the basic info and discussing the architecture with the fellow architects, I started the project. I won't go into details, but the structure was roughly this: when a user visits a classified ad, put it into a Kafka topic, then consume it and save it into a MongoDB database. Easy-peasy!
I made the first implementation. The code was reviewed and released with the next release, and BOOM! A guy from the architecture/system team came to me and said something like "consumer lag". What the hell was this consumer lag?!
I went with him to his desk to see what was happening, and he ran a couple of commands starting with "kafka-consumer-groups" and showed me the offset difference for each partition per topic. I knew what a topic was, but what was a partition? Had I used an ID for data consistency? What? The guy was talking in a language I'd never heard of.
Yeah, I have to admit that since all the code structure was ready, and I thought of Kafka as a simple message broker without thinking about its distributed structure, I did not even move a finger to learn more about Kafka itself. Anyway, the situation was bad: since MongoDB could not handle the flood of upserts coming from the Kafka consumers, the producer could keep producing data -because visitors kept visiting classified ads- but after some time the consumer could not consume, because of the non-responding MongoDB.
As for the "how did we solve this?" part, which is not too important for this article: we put the data into a cache while producing it to Kafka, saved the data in bulks on each consume, and did bulk upserts to MongoDB, plus a few tweaks like changing the data retention time later.
Importance of a CLI
However, the reason I wanted to share this short story with you is not the solution. Apart from being a system-design problem that had to be solved, I just wanted you to notice that -even if you are very experienced with Kafka itself- administering and monitoring Apache Kafka is important, and Kafka bundles very basic, easy-to-use tools for it. I am not talking about detailed monitoring, but mostly about administration itself, because with just one command -having just the Kafka binaries and their relevant files- you can do anything allowed in the whole Kafka cluster.
There are of course many tools that help people with the administration of Kafka, but it all starts with setting up the cluster and creating a few topics with the commands (actually sh files) that Apache Kafka provides, or adding ACLs, changing topic configurations, etc., via the Command Line Interface (CLI). CLIs have always been the simplest approach, holding the procedural benefits of the administration world. In the example I gave you, the consumer-lag checks and the topic configuration changes like "log.retention.ms" were done via the CLI at that time, because it was the simplest and most reliable approach while everything was on fire!
The AMQ Streams Journey
After my first interaction and stormy experience with Kafka, I did not do anything with it until I had my AMQ Streams on RHEL engagement for a customer at Red Hat, this time not as a developer but as a consultant. At that time -about 1.5 years ago- Red Hat had this brand-new product called AMQ Streams which provided (and still provides) enterprise-level Apache Kafka support both on RHEL (bare metal or VM) and on its enterprise-level Kubernetes platform: OpenShift. While AMQ Streams on RHEL is pure Apache Kafka that is hardened, extended, and versioned by Red Hat with full support, AMQ Streams on OpenShift is nothing but the Strimzi project, which successfully became a CNCF sandbox project in a short time.
I don’t want to branch off this topic and steer into another one by talking about the preparation phase for the customer engagement, which was tough, but I have to say briefly that I learned all the Kafka stuff -starting with creating a topic, continuing with administration, security, and lastly monitoring- in just one week! It was fun and challenging. The first phase of my customer engagement took about a week or two. It was an important customer who wanted Apache Kafka capabilities supported by a strong vendor like Red Hat, because the project was a very important and vital government project.
We finished the project successfully. A couple of problems occurred while I was in London for training, but thanks to the time difference between London and Istanbul I was lucky to have a couple of extra hours to discuss and solve the issues with the help of my colleagues from the support team and our Kafka and Strimzi hero Jakub Scholz from the Kafka/Strimzi engineering team.
At the end of the engagement -including the preparation phase- I realized, for the second time, that Apache Kafka is a middleware that has a strong binary set as a CLI, which you can use both for setting up the cluster and topics and, most importantly, for day-two operations.
Strimzi in Action
While playing with AMQ Streams and Kafka -because I am not only a middleware consultant but also a so-called “AppDev” consultant whose playground is OpenShift- Strimzi got my attention; I started to play with it and fell in love with it and the idea behind it: running Apache Kafka on a Kubernetes platform natively!
It used the Operator Framework, which was pretty new for the Red Hat Middleware world back then (I remember doing stuff with, for example, Infinispan/Data Grid via OpenShift templates even though the Operator Framework existed. Gosh, what a struggle that was..!). The operator(s) of Strimzi managed the Kubernetes custom resources -bare YAML definitions- for the Kafka cluster itself, users, topics, mirroring, Kafka Connect, etc. It was true magic! I had to learn about this, and I had to talk about this; show this to people. So I got my invitation from my fellows at Istanbul JUG and did a talk called “Strimzi: Distributed Streaming with Apache Kafka in Kubernetes“. It ran a little bit long, but it was all fun, because most of the audience stayed till the end, and we had some mind-bending conversations about messaging and Kafka.
I had a demo part at the end, and while doing the demo I realized that, as a traditional Apache Kafka user -which I could call myself by then, because I had some real hands-on experience so far- changing the YAML file of a specific custom resource for a specific configuration (for example a topic configuration) and calling oc or kubectl apply -f with that YAML file felt like I was not accessing Kafka, or it was not Kafka, but something else; something different.
Because I had been doing stuff around OpenShift AppDev and CI/CD, creating resources or custom resources for OpenShift applications or components was not a new thing for me. But from a Kafka admin’s perspective it may be pretty hard at first, because you want her/him to focus on both the middleware and the platform.
As a consultant and supervisor for customers, I mostly propose to the customers I work with to break down the silos by starting with the person(s) first, which means: don’t say “this is not part of my business, I can not do OpenShift, Kubernetes, etc. while writing Java, Python, etc. code or dealing with middleware”. Since infrastructure, middleware, and application are closer now -because of the Cloud/Kubernetes Native Era- the people who create or manage them should also be closer, which is a reflection of the DevOps culture and change, and this should start from personal sentience and responsibility. So in short, a Kafka administrator should learn about OpenShift or Kubernetes if her/his company has started to use Strimzi. She/he should learn about “kubectl” or “oc”, and should learn how a Kubernetes platform or the Operator Framework works.
Or even further, she/he should learn about GitOps (we will visit this topic later again for Strimzi:) ), writing and dealing with YAMLs, or a source control system like git. Isn’t this too much? Or is there any company that could do this DevOps transformation with a finger snap of Thanos?
DevOps transformation is tough because it is (and must be) a cultural transformation and has to start with people -or even a single person-, then process, and lastly technology, which is where most companies prefer to start.
I remember a couple of customers who had already implemented most of the DevOps practices but still had huge silos that were hard to break because of the organizational structure -because of the cultural structure. I remember the begging eyes of the Kafka admins or developers in meetings about AMQ Streams on OpenShift (Strimzi), asking “isn’t there any other way to do this, like traditional Kafka?”
Using the classic Kafka shell binaries for accessing the Strimzi Kafka cluster is, of course, partially possible. But as I said, “partially”: the best practice for managing Strimzi is to use its custom resources and the operator framework, because this is the way to manage it in a Kubernetes-native way, which is the main intention. This is not a classic “best-practice” case, because some parts of Strimzi (for example ACLs) don’t support two-way binding, intentionally, by design and architecture. So accessing the Strimzi Kafka cluster in a Kubernetes-native way is vital.
Well, besides its being a part of the “transformation” to DevOps, even if in reverse order (technology -> process -> people), I felt like these folks needed a kind of “ferry” that they could use while building a “bridge” from the traditional middleware coast to the Cloud Native and DevOps coast -a bridge that needs time to build because, as I’ve said again and again, it is a cultural change at its core.
Idea of a CLI for Strimzi
It was actually about 6 months ago. While I was preparing the workshop CI/CD with Tekton in a Multi-cluster OpenShift Environment, I was surprised to see that Tekton could be managed both via custom resource definition YAMLs (kubectl/oc apply -f) and via the Tekton CLI (tkn), which one could use to create or manipulate any object of Tekton itself; both would be handled by Tekton’s operator.
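As a rough sketch, the same Tekton pipeline can be handled either declaratively or through the CLI (the pipeline name and file here are hypothetical):

```shell
# Declarative way: apply the Pipeline custom resource YAML
kubectl apply -f my-pipeline.yaml

# CLI way: start the same pipeline with the Tekton CLI and stream its logs
tkn pipeline start my-pipeline --showlog
```

Either way, it is the Tekton operator that reconciles the resulting resources.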
So the idea evolved in my mind: why not use the same strategy for Strimzi too, and create an interface -actually, a command-line interface- with the intention of helping traditional Kafka users (administrators, developers, system engineers) by providing the user experience that Apache Kafka has in terms of the command executables we talked about before?
A ferry ready to use, before building the bridge (or for those who don’t have any intention to build one).
To keep everything Kubernetes/OpenShift native and provide Strimzi/AMQ Streams users the closest and most familiar experience for accessing and managing Kafka, I started an open-source project that provides a command-line interface which creates/manipulates Strimzi custom resources and applies them by using -mostly- the familiar parameters of traditional Kafka commands.
With Strimzi Kafka CLI, you can create, alter, and delete topics and users, manage ACLs, and create and change the configuration of the Kafka cluster, topics, and users. Most importantly, you can do most of these -with a few differences and additions- just like you do with the Kafka shell files. Let’s see an example:
For example, to create a Kafka topic named “messages” with, let’s say, 24 partitions and a replication factor of 3, you would normally write this command:
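A sketch of both commands side by side (the broker address, cluster, and namespace names are examples; the kfk flags follow the project’s documented pattern, so double-check them against the version you use):

```shell
# Native Apache Kafka way
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic messages --partitions 24 --replication-factor 3

# Strimzi Kafka CLI way: -c selects the Strimzi cluster, -n the namespace
kfk topics --create --topic messages --partitions 24 --replication-factor 3 \
  -c my-cluster -n kafka
```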
Please notice that the “kafka-topics.sh” command is transformed into a similar command for Strimzi: “kfk” with the “topics” option, which together form the “kfk topics” command.
Inspired by the CLI of the Tekton project, I wanted to use a three-letter main command that would not be hard to remember -like Tekton’s tkn- and would evoke the usage of “kafka-*.sh”.
Basically, for the current version, which is 0.1.0-alpha25, the command options are like the following (we get it by running the “kfk --help” command):
```
Usage: kfk [OPTIONS] COMMAND [ARGS]...

  Strimzi Kafka CLI

Options:
  --help  Show this message and exit.

Commands:
  acls              This tool helps to manage ACLs on Kafka.
  clusters          The kafka cluster(s) to be created, altered or...
  configs           Add/Remove entity config for a topic, client, user or...
  console-consumer  The console consumer is a tool that reads data from...
  console-producer  The console producer is a tool that reads data from...
  topics            The kafka topic(s) to be created, altered or described.
  users             The kafka user(s) to be created, altered or described.
  version           Prints the version of Strimzi Kafka CLI
```
While having similar commands like acls, configs, console-consumer, console-producer, and topics, Strimzi Kafka CLI has some additional commands like clusters and users for managing Strimzi custom resources directly, since the commands ultimately run against the Strimzi operator itself. I believe this will also improve users’ adaptation to Kafka living natively in a Kubernetes/OpenShift context: they write the same kind of commands, with almost the same options that the Kafka binaries provide, while learning the management of the few objects that belong to Strimzi.
For example, one can create an ACL for a user either with the following command, which is more familiar to native Kafka users, or by altering the user directly:
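A hedged sketch of the two styles; the exact flags may differ between CLI versions, so treat these as illustrative rather than definitive:

```shell
# kafka-acls.sh-like style, familiar to native Kafka users
kfk acls --add --allow-principal User:my-user --operation Read \
  --topic my-topic -c my-cluster -n kafka

# Altering the user directly, which edits the same Strimzi User custom resource
kfk users --alter --user my-user --add-acl --resource-type topic \
  --resource-name my-topic --operation Read -c my-cluster -n kafka
```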
Both commands will add the ACL to user “my-user” for the topic “my-topic” in the same way: by changing the User custom resource of Strimzi.
Strimzi Kafka CLI uses the YAML examples from the original Strimzi package, so each Strimzi Kafka CLI version uses a relevant version -usually the latest one, which is currently 0.19.0- of the Strimzi custom resource files.
Apart from this, as a dependency, Strimzi Kafka CLI uses the latest version of the kubectl binary, which is downloaded -depending on the operating system- at first usage. Strimzi Kafka CLI uses kubectl for accessing the Kubernetes/OpenShift cluster(s) and for applying the relevant resources of the Strimzi Kafka Operator.
To see the versions of these external dependencies, along with the current version of Strimzi Kafka CLI itself, the following command can be used:
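Per the command list above, this is the version command:

```shell
kfk version
```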
Strimzi Kafka CLI is currently in Alpha version since it is not feature-complete yet (but will probably be Beta in a short period of time; fingers crossed:) ).
For each version a PyPi release (since this is a Python project:) ) is created automatically -via Github actions- where anybody can install Strimzi Kafka CLI easily with pip command:
```shell
pip install strimzi-kafka-cli
```
Each release creation also triggers a container image build, tagged with the same version of the application, for possible usages such as a Tekton step or any other container-native usage. The containerized versions of Strimzi Kafka CLI are located on quay.io and can be pulled with any container CLI:
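For example (the registry path here is an assumption; check the project’s quay.io page for the exact repository name):

```shell
docker pull quay.io/systemcraftsman/strimzi-kafka-cli:latest
```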
So it is possible to get Strimzi Kafka CLI via PyPI as a binary package for direct usage, or as an image in order to run it in a container.
There are lots of things to tell here, but I think the best way to explain something is not by telling but by showing. Therefore, I created this introductory video about Strimzi Kafka CLI, which will be the start of a potential video series:
I hope Strimzi Kafka CLI will be useful for those who need a CLI approach to help them while being in their DevOps transformation journey and for those who want to try out a different approach.
And from an open-source developer’s perspective -and of course, as a proud Red Hatter-, I hope Strimzi community can benefit from this project at the end.
Whether you become a user of Strimzi Kafka CLI or just a source-code viewer, please feel free to contribute to the project so we can improve it together.
So, since this is about the end of the article, let’s finish it here with a memorable quote by Franz Kafka -the well-known author after whom Apache Kafka is named- from one of his most famous books, “The Metamorphosis”:
“As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.”
― Franz Kafka, The Metamorphosis
For me, it was first Apache Kafka. Then Kafka on Kubernetes with Strimzi. And the creation of a CLI for it. What a process of metamorphosis🙂
Just as software people like to find bugs in software systems, they also tend to find anomalies or bugs in processes and fix them or propose other ways; other procedures. This comes from an instinct that continuously pushes them to find the best: continuous improvement.
For a long time, a lot of methodologies have come along in the software industry, like the Spiral model, the V-model, and the Waterfall model. All these came out because software development is a process that includes “production” in it, and because of that production aspect, it has to have a well-working process to create good outputs. Of course, the perspective on these “outputs” changed from time to time (sometimes it was “time and revenue”, sometimes “product quality”, sometimes risks, sometimes all of them), and this is what pushed software people to change methodologies as time went by, in search of better processes. And these methodologies were labeled “heavyweight”.
So, a few “lightweight” software development methodologies emerged between the late 90s and the early millennium as a solution to the heavyweight ones. Some software experts, as the creators of these “lightweight” methodologies, decided to get together, share what they were doing, and discuss a more common alternative to heavyweight, high-ritual, document-based approaches like Waterfall or the Rational Unified Process (RUP) that had created a lot of problems back in the day.
In February 2001, in Snowbird, Utah, seventeen influential software experts gathered around a whiteboard to talk over the wretched state of software development and to create a manifesto, and a few principles around that manifesto, that would deeply influence the software world.
And a manifesto came out…
The following is an excerpt from the Agile Manifesto website:
We are uncovering better ways of developing software by doing it and helping others to do it. Through this work we have come to value:
Individuals and interactions over process and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
Besides the Agile Manifesto, the originators of Agile also came up with twelve principles. In this article, we won’t list or go deep into these principles, but it is important that you read and understand those principles by heart before talking about the practices of agile.
Agile itself is not a problem solver, it is a problem exposer. It is all about quick and short feedback loops that make us see problems sooner. Every time you receive feedback, you have a chance to react or not; this is what makes you more agile or less agile or non-agile.
Agile is not a single thing. Agile is a combination of methodologies and techniques that, according to the context, can help teams and companies to adapt to the ever-changing nature of software projects and also reduce the risks associated with them. The Agile disciplines and methodologies can be divided into two main groups: process-oriented and technical-oriented.
Process-oriented agile methodologies affect how teams, groups, and organizations work, organize things, and collaborate, and they focus on the business processes themselves. Lean Software Development and Scrum are two well-known examples of process-oriented methodologies.
As for the technical-oriented agile methodologies, these are more focused practices for developing, maintaining, and delivering the software itself. Extreme Programming is one of the most popular and well-known methodologies amongst other technical ones. This popularity comes from its containing successful practices like Test-Driven Development, Pair Programming, Simple Design, and of course Continuous Integration.
Extreme Programming (XP)
In 1996, Kent Beck introduced and gathered the set of practices that emerged as XP, and as time passed, with a few important projects like C3 (Chrysler Comprehensive Compensation), it took its final form with subsequent contributions. Since the Agile summit in 2001, XP has been considered an Agile methodology.
XP has many technical practices like Test-Driven Development, pair programming, refactoring, collective ownership, and Continuous Integration, all of which have their own benefits for the software development world.
By adopting the Continuous Integration method that Grady Booch first proposed, and extending it, XP maybe put the key in the keyhole of the door that opens to the DevOps world. So let’s take a short trip to that world.
The term “DevOps” was first coined by Patrick Debois and Andrew Shafer in 2008. As stated in the book The Phoenix Project, it began its public journey at the Velocity Conference in 2009, with John Allspaw and Paul Hammond’s presentation “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”.
DevOps is a way of improving collaboration and shortening the system development life cycle by providing continuous methodologies that combine cultural philosophies (people), practices (process), and tools (technology), all of which address the items of the CALMS acronym -Culture, Automation, Lean (as in Lean management), Measurement, and Sharing- a useful acronym for remembering the key points of the DevOps philosophy.
For Kief Morris, in his book Infrastructure as Code, DevOps is a movement to remove barriers and friction between organizational silos – development, operations, and other stakeholders involved in planning, building, and running software. Although technology is the most visible, and in some ways simplest face of DevOps, it’s culture, people, and processes that have the most impact on flow and effectiveness.
DevOps has benefited greatly from the work of the Agile community. Methodologies like Lean Software Development and XP helped small releases, frequent release cycles, and -maybe most importantly- small teams operating in harmony to emerge.
So, while building upon XP’s Continuous Integration and other extended practices like Continuous Deployment (by Timothy Fitz) and Continuous Delivery (by Jez Humble and David Farley), which are essential for achieving fast development flow, DevOps also extended those practices with the help of other concepts like “Infrastructure as Code”, which should be used to bridge gaps and improve collaboration.
We will come to that shortly, but first, let’s fly a little bit over the clouds.
Cloud Age & Cloud Native
It is for sure that if we stop what we are doing right now and ask some people what Cloud Native is and how they define it, we will surely get different answers. It is not easy to define Cloud Native, but I like -and prefer- to define it as:
Any technology or methodology that is adapted to run or be used on cloud systems and take full advantage of it.
But of course, Cloud Native is more than this.
I think it is best to get the definition from the Cloud Native Computing Foundation itself:
Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.
Since “Cloud Age” encourages a dynamic and fast-moving technology era -most probably because of the Agile and DevOps effect- and exploits faster-paced technology to reduce risk and improve quality, Cloud Native is in this case much about speed and agility to deliver better reliability, security, and quality.
Unlike the Iron Age -a definition by Kief Morris from the book Infrastructure as Code-, everything is more dynamic and agile in the Cloud Native Era (aka. Cloud Age). However, this agility and the faster pace of change -and also the risk that comes with this package- can not be efficiently managed with the traditional methodologies that belong to Pre-Cloud Native Era (aka. Iron Age).
So an approach like Infrastructure As Code comes as a savior.
Infrastructure As Code
Infrastructure as Code is an approach for building automated infrastructures that comprise continuous change for high reliability and quality with the help of practices from software development.
According to Kief Morris’ book, the core practices of Infrastructure as Code are:
Define everything as code
Continuously validate all work in progress
Build small, simple pieces that can be changed independently
The “everything”, in the first practice “define everything as code” is the whole infrastructure system that can be defined in three layers:
Application Runtime Platform
While defining everything as code helps reusability, consistency, and transparency, continuously validating all work in progress helps to improve quality. The practice of building small, simple pieces that can be changed independently is all about loosely coupled systems. Just like the microservices mentality, each piece will be easy to change, deploy, and test in isolation.
Continuous change is inevitable in the Cloud Native Era, and according to the “Accelerate State of DevOps Report” research, making these changes both rapidly and reliably is correlated with organizational success.
Kief Morris states in his book “Infrastructure As Code”:
A fundamental truth of the Cloud Age is: Stability comes from making changes.
And since Infrastructure as Code applies software engineering practices to the infrastructure for better reliability, security, and quality, why not use it?!
Since we’ve been saying “software engineering practices” and “code” -quite a few times, actually- how about taking a step or two back and looking at how the “Agile” world is doing while all these kinds of technological and cultural transformations are coming up rapidly, in a short period of time?
We got excited when we all first heard about Agile. Many of us -the developers- came from the “Waterfall Factory” mentality and it was a kind of hope for us.
Companies purchased consultancy services for Agile methodologies, training and the related certifications were done, use-cases became user stories, and Project Managers became Scrum Masters.
Many of the companies benefited from the Agile transformation and could pass to the next phase -DevOps- and can now deploy software to production multiple times a day as a single unit; as a single team.
But for many others it did not end like this. A couple of problems arose:
Doing Agile ceremonies like standup meetings, estimations, sprints, using post-it notes, etc. made teams feel like they were Agile. The ceremonies made those practices useless because the real purpose was forgotten. And yes, the real purpose was forgotten.
Managers, and Agile coaches who became “secret managers”, started to push developers to work faster, because from their Agile perspective the process was more important than engineering and technical practices. So the mentality became “fix the process and the engineering will fix itself”, which is not even AgileBut.
As developers wrote code fast to finish their work in time -within the current sprint- code quality dropped, so bugs and the related operational problems became the hot topics of the daily standup meetings and especially the retrospective meetings.
In short, the focus was not on the technical work, design, architecture, and engineering, which are the core of the project, but on the highest-priority item in the backlog, which had to be done as soon as possible. They thought they were Agile, or were doing Agile like this, but they were not, and they weren’t even aware of that until “the Agile Hangover”.
“And then one day, after a few months, or years in some cases, of having fun in the Post-It party, teams and companies woke up with a massive headache—the Agile hangover,” says Sandro Mancuso, in his book The Software Craftsman: Professionalism, Pragmatism, Pride.
However, I prefer to call this the “Agile Hallucination”, because, according to Sandro’s definition, you have to have been really Agile -or have at least done AgileBut- in the past to be in an Agile Hangover presently. Mostly it is not like this: the series of ceremonies that many companies perform is not even Agile, so there is no “hangover” afterward.
That’s why I call this Agile Hallucination; because in many companies they can not be Agile -or even DO Agile-, but they think they are…
“At the Snowbird meeting in 2001, Kent Beck said that Agile was about the healing of the divide between development and business. Unfortunately, as the project managers flooded into the Agile community, the developers—who had created the Agile community in the first place—felt dispossessed and undervalued. So, they left to form the Craftsmanship movement. Thus, the ancient distrust continues” says Robert C. Martin (Uncle Bob) in his book Clean Agile: Back to Basics
A group of software people met in November 2008 in Chicago to create a new movement to raise the bar of software development and heal some of the Agile goals. The movement’s name was: Software Craftsmanship.
Software Craftsmanship is a journey to mastery. It is about responsibility, professionalism, pragmatism, and pride in software development.
Uncle Bob Martin, in the same book, Clean Agile, defines Craftsmanship as follows, which gives a clearer understanding:
Craftsmanship promotes software development as a profession. There is a difference between having a job and having a profession. A job is a thing we do but it is not part of who we are. A profession, on the other hand, is part of who we are. When asked, “What do you do?”, a person with a job would normally say something like “I work for company X,” or “I work as a software developer.” But a person with a profession would generally say, “I am a software developer.” A profession is something we invest in. It’s something we want to get better at. We want to gain more skills and have a long-lasting and fulfilling career
Software Craftsmanship doesn’t have any practices of its own. So it is not -or is not a set of-:
A specific set of technologies or methodologies
Software Architecture or Design
A selected group of people
Religion or Cult
Software Craftsmanship itself, on the other hand, promotes a perpetual search for better practices and ways of working. Practices are just tools, and good practices are kept only until better ones are discovered. For example, since 2008 -the foundation year of the Software Craftsmanship community- Extreme Programming (XP) has been strongly advocated, since it still provides the best set of Agile development practices. However, the practices of XP are still the practices of XP, not the practices of Software Craftsmanship. And XP is not the only thing that is proposed: there are principles like Clean Code and SOLID, and methodologies like Continuous Delivery, small releases, etc. that are promoted in Software Craftsmanship.
Rather than technical practices, Software Craftsmanship is about putting the “craftsmanship” mindset into software development. Think about this: if you were an apprentice working with a master of handmade combs, what would you do? Watch? Learn? Ask? Be better? Reflect as you learn? Expose your ignorance? Practice?
As Uncle Bob says in his book Clean Agile, “Craftsmanship is not only about technical practices, engineering, and self-improvement. It is also about professionalism and enabling clients to achieve their business goals.”
Responsibility. Professionalism. Pragmatism. Pride. Practices. Those all are shaped and gathered in a manifesto at the end of the meeting that happened in November 2008, just like the one that formed the Agile Manifesto in 2001.
Software Craftsmanship Manifesto
“In that meeting, similar to what happened during the Agile summit in 2001, they agreed on a core set of values and came up with a new manifesto that was built on top of the Agile Manifesto” says Uncle Bob Martin, in his book Clean Agile.
As aspiring Software Craftsmen, we are raising the bar of professional software development by practicing it and helping others learn the craft. Through this work we have come to value:
Not only working software, but also well-crafted software
Not only responding to change, but also steadily adding value
Not only individuals and interactions, but also a community of professionals
Not only customer collaboration, but also productive partnerships
That is, in pursuit of the items on the left we have found the items on the right to be indispensable.
Well-crafted software means software that is well tested and well designed. It means it can be changed without any fear of breaking things -because of the tests and other mechanisms- and enables any business to respond fast. It is the legacy code that every software developer would want to work with.
Steadily Adding Value
“Value” here is not just about the investments in the relevant project or, more technically, adding new features and fixing current bugs. It is about being committed to continuously providing increasing value to clients, employers, project stakeholders, etc. This value can be increased in both ways: collaborative improvement (a community of professionals) and improving the structure of the code -keeping it clean, extendable, testable, and, most importantly, easy to maintain.
A Community of Professionals
The only way to move the software industry forward is through continuous collaborative improvement. But what does this mean? It means that we all are expected to share what we learn, learn from each other, and mentor the newcomers of the industry. We are responsible for raising and preparing the next generation of craftspeople, and one way to do this is by sharing your experiences and exposing what you know and what you don’t know.
Having a professional relationship with clients and employers is very important. Behaving both ethically and respectfully is the base of productive partnerships. Think of this kind of partnership as an invisible contract, a symbol of the mutual professionalism that is essential to make any project succeed.
Agile and Software Craftsmanship
Software Craftsmanship is not a movement that replaces Agile. The two are not mutually exclusive: both movements want to achieve the same things, like customer satisfaction and collaboration, and both value short feedback loops, delivering high-quality valuable work, and professionalism.
People in the Agile movement may criticize Software Craftsmanship for its lack of focus on business and people, and people in the Craftsmanship movement may criticize the Agile movement for its lack of focus on engineering and low-level processes. However, these two movements complement each other nicely. As Sandro Mancuso says in his book The Software Craftsman, while Agile methodologies help companies do the right thing, Software Craftsmanship -by focusing on low-level things like writing good, well-crafted code and promoting doing more for customers than just writing code- helps companies do the thing right.
So the cure for the Agile Hallucination is the Software Craftsmanship mindset and culture. But what about the DevOps world? Do you think everything is wonderful in DevOps-land?
Think of a company -let’s call it ACME; it may even be one you once worked at- that purchased some Scrum training, ran some internal workshops, and certified some people as Scrum Masters. How cool is that!
With these Scrum Masters, they created a pilot group and selected a pilot project. They started with the kick-off meeting and did all the planning, the sprints, etc. All Agile, right?
After some time, when things seemed to be OK with the pilot project and the team, even though the only topics at the retrospectives were the bugs, the problems, and the bloated backlog of the current sprint, they reported to the managers: “Agile is perfect, we can start doing the other projects like this; sprints are cool!”
However, as we discussed before, they were not doing Agile, not even from the start. It was an Agile Hallucination.
Now think of the same company starting to adopt DevOps with this kind of “Agility”. Do you think they will be able to really do DevOps?
DevOps needs agility and collaboration, which come from the Agile culture. Of course, one can go backward and try to implement DevOps first, but be sure that he/she will end up being Agile.
Back to the company: if we ask the “DevOps team” how they implemented their DevOps processes, they will probably first explain to us how they automated things -if anything is automated-, which tools they used, how many deployments they do -or, more likely, try to do- in a day, etc. Where is CALMS here? Is this DevOps?
No, because like Agile, DevOps is a culture change: it starts with people, continues with the process, and ends with technology. It is not only tools and practices. It is a loose set of practices, guidelines, and culture designed to break down silos in IT units.
If you have just implemented some Continuous Integration and Deployment tool and are using it to automate your deployments with a single “git commit/push”, that is cool. But what you are doing is not DevOps.
I call this situation “DevOps Hallucination”. Like the Agile Hallucination, companies may be in a state where they think they are doing DevOps; but like Agile, DevOps is not a thing that you can just do. You can BE in a DevOps culture, but you cannot DO DevOps.
It is the situation of seeing “Dancing Elephant(s)”.
I use this metaphor because the “Elephant” metaphor is widely used for explaining DevOps (as we did above as well), and a “Dancing Elephant” is most of the time used for a well-working DevOps mechanism -and, of course, a well-structured DevOps culture.
DevOps Hallucination occurs in two ways:
You are Agile and do all the practices perfectly while developing your software. You automated things; the CI/CD mechanisms work perfectly, the automation works perfectly. But there are no tests for the automation code, no single source of truth, and no code review mechanisms. Or there are DevOps teams who just “do DevOps” under the title of DevOps Engineer: no cross-functional teams, and no organizational reflection of the DevOps culture itself, like collaboration. This is also an AgileBut alert!
You are not Agile. You are in an Agile Hallucination, and you automated everything just like in the first way above. Because of the non-Agile situation, there is no base on which to create a real DevOps culture. Your sprints and daily standup meetings make you think you are Agile, and the automation you have written makes you think you are “doing” DevOps.
Either way, there is no DevOps, just its hallucination. Creating a real Agile culture in the team will lead to a real DevOps culture, but this is hard to achieve unless companies start to focus on people rather than on processes and technologies.
People, Process, Technology
The focus in DevOps should always follow this order: people, process, technology.
If you want to build a culture, start at the lowest level, the people; then focus on the process; and the last focus should always be the technology.
Nowadays, most companies tend to start in the reverse order. They start with technology or process -or both in parallel- and then, when problems come up, try to fix the organizational problems, which are tightly related to each “person”’s problems.
Focusing on people doesn’t mean recruiting some DevOps Engineers -which is a totally fake title, sorry for that- who can “do” DevOps, putting them in an isolated group (a.k.a. a silo), and making them automate stuff around a couple of cool tools that do CI/CD or IaC, like Jenkins, Ansible, Chef, Terraform, ArgoCD, GoCD, etc.
But how do we focus on people? Before explaining this, let’s answer the question “who are these people?”
Who are these “people”?
These people are Software Engineers/Developers, Software Architects, System Engineers, System Administrators, and Site Reliability Engineers (SREs) -and, of course, many others who are closely involved in the DevOps process as technical people. For now, we will just focus on some of them.
Software Developer/Engineer: Mostly develops applications. Highly aware of the Agile and CI/CD processes. If an Agile culture is constructed, he/she is in it. If a DevOps culture is constructed, he/she is in it as well.
System Engineer: The person in the triangle between the system (including the cloud), the middleware, and the applications. We know them as Software Architects in some companies and as DevOps Engineers in others.
System Administrator: A person who knows a specific system or infrastructure and is most probably certified in it. He/she is the system infrastructure folk who likes “God” access. He/she doesn’t write as much code as System Engineers do (I don’t count the exceptions), and mostly uses UIs for administration, management, and operation purposes.
Site Reliability Engineer (SRE): According to an article by David Hixon and Betsy Beyer (one of the authors of the famous SRE book), the roles of Software Engineer and Systems Engineer lie at the two poles of the SRE continuum of skills and interests. This means that an SRE is a person who can do both software development at the application level and systems engineering at the infrastructure level. This title is the one most often confused with the title “DevOps Engineer”, because, as stated in the short article How SRE Relates to DevOps by Murphy, Jones, and Beyer, in many ways DevOps and SRE sit, in both practice and philosophy, very close to each other in the overall landscape of IT operations.
DevOps Engineer: I am joking:)
Did you notice which of the roles above are related to application development and which are not?
When an Agile transformation process is started in a company without any plan for DevOps, the transformation gets stuck at the application level. This is mostly not a problem from the software project stakeholders’ perspective, because as long as you can run sprints without any problems and do CI/CD and the other technical stuff, there is no issue for them. But what about the perspective of the “system people” or the operations people?
DevOps culture is, of course, a way to do this: merging development and operations in a cultural way. But in that case, isn’t it a kind of starting with the “process” rather than the “people”?
If you directly merge Dev and Ops by starting with technology -or tools- (a very, very common case, by the way), or by starting with processes like “guys, we will include you System Engineers, or SREs, in our project X, where we use Agile methodologies, and we will do some awesome sprints together”, or by recruiting -yes, again :)- DevOps Engineers and making them part of this “new” kind of “cross-functional team” just to do the “automation stuff”, you end up with a DevOps Hallucination!
Within this DevOps Hallucination, let’s put the Cloud into the equation and see what happens.
When the Software Engineers of, let’s say, “Project Roadrunner” realize they have to write some code around the new cloud technology -for example, Kubernetes resource YAML files- they begin to say: “That is the system/infrastructure. I am a Java Developer; I can do magic with the Spring Framework, but this Kubernetes thing is not my responsibility.”
All of this goes in a loop, while the “system people” or “DevOps Engineers” who are merged into the Agile workflow of Project Roadrunner try to “automate” things to “do” DevOps, without any tests, reviews, or any kind of quality. It is a loop that will never break unless someone focuses on the “inner” level of the problem: focusing on the “people” before trying to work out how to structure the process and technology flows.
Project Roadrunner’s current status:
SW Engineers don’t want to write cloud-related code, even though it is actually related to the application. They think it is the responsibility of the “DevOps team” (a legendary team made up of DevOps Engineers).
System Admins/Engineers don’t know how they will manage some cloud-related shared resources. They don’t know who will be responsible for resource A and resource B.
Regarding this responsibility issue, some people gave a “heads up”, and the teams held a couple of meetings to create a RACI matrix -which is part of the “process”, not the “people”. After those meetings, it is still not clear who will be responsible for what, or how to achieve these organizational goals.
System Engineers/SREs or “DevOps Engineers”, who are responsible for building the automation just so the whole process can be labeled “DevOps”, write poor-quality infrastructure code that has no tests and no validation. That’s because Project Roadrunner is in an Agile Hallucination, so the managers -and, of course, the Mighty Scrum Masters- micromanage these “system people” just as they do the “software people”. This makes them write code as fast as they can, just to finish the current sprint on time.
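The missing tests and validation need not be heavyweight. As a minimal sketch -the manifest, the container name, and the specific checks here are hypothetical examples, not anyone’s real pipeline- even a handful of plain-Python assertions over a rendered Kubernetes Deployment manifest would catch the most common mistakes before they reach a cluster:

```python
# A minimal validation pass for a Kubernetes Deployment manifest, held here
# as a plain dict (in practice it would be loaded from the rendered YAML).
# The manifest below and the names in it are invented for illustration.

def validate_deployment(manifest: dict) -> list:
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    if manifest.get("kind") != "Deployment":
        problems.append("kind must be 'Deployment'")
    spec = manifest.get("spec", {})
    if spec.get("replicas", 0) < 2:
        problems.append("replicas should be >= 2 for basic availability")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for c in containers:
        image = c.get("image", "")
        if ":" not in image or image.endswith(":latest"):
            problems.append("container '%s' must pin an image tag" % c.get("name"))
        if "resources" not in c:
            problems.append("container '%s' has no resource limits" % c.get("name"))
    return problems


manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {
        "replicas": 2,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "roadrunner-api",            # hypothetical app
                        "image": "roadrunner-api:1.4.2",     # pinned tag, not :latest
                        "resources": {"limits": {"memory": "256Mi"}},
                    }
                ]
            }
        },
    },
}

# Run in CI: fail the pipeline if any problem is reported.
assert validate_deployment(manifest) == []
```

A check like this takes minutes to write, runs in the same CI pipeline as the application tests, and is exactly the kind of engineering discipline that the “automate it and ship it” sprint pressure skips.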
Whether there is an Agile culture or not, maybe the System Engineers/SREs or “DevOps Engineers” just don’t care about tests or validation, because they are not Software Engineers! They are not developing applications, after all(!)
There is no real collaboration between Dev and Ops, and no single source of truth shared across the teams.
Do you think they can solve these problems? Do you think addressing these problems only at the organizational level will fix everything? Or should we focus on the “people” and start the transformation from there?
As we discussed before, Software Craftsmanship came out as a response to the problems with the implementation of Agile. Not Agile itself; the implementations were so bad that, as Sandro Mancuso once said, “Many Agile projects are now, steadily and iteratively, producing crap code.”
Software Craftsmanship is a mindset change that leads you to think like a crafter from the old times: someone who loves his/her profession, who cares deeply about the quality and beauty of his/her product, and about the “value” it gives to the public.
Because the problems we have in DevOps are similar to the ones we had while implementing Agile, the same mindset and ideology can be applied to the DevOps culture, which is inevitably in a Cloud Native Era right now.
I wanted to call this ideology System Craftsmanship.
System Craftsmanship’s purpose is to put passion and craft into the DevOps culture and to raise the bar even further, creating more value by caring about quality and professionalism in the DevOps ecosystem.
System Craftsmanship is not “yet another Craftsmanship”. It is an extension of Software Craftsmanship; that is, in pursuit of the items of the Software Craftsmanship Manifesto, I have found the following extension of the first item indispensable:
Not only well-crafted software, but also well-crafted infrastructure
This means accepting everything that Software Craftsmanship stands for -all the principles and values- while expanding its boundaries from “software people” to “system people” and encouraging them to become System Crafters, in order to gain a better DevOps culture.
Like Software Craftsmanship, System Craftsmanship -as a matter of course- has no practices of its own, but it benefits from a lot of technical practices and patterns that come from the craftsmanship and apprenticeship cultures.
For example, Clean Code and SOLID are two of the practices that the Craftsmanship movement benefits from. From the Infrastructure as Code (IaC) perspective, you have to write clean infrastructure code to create a clean infrastructure. In his book Infrastructure as Code, Kief Morris agrees with us by saying the following:
To keep an infrastructure codebase clean, you need to treat it as a first-class concern. Too often, people don’t consider infrastructure code to be “real” code. They don’t give it the same level of engineering discipline as application code.
Design and manage your infrastructure code so that it is easy to understand and maintain.
Besides, he discusses other practices like code review, pair programming, and automated testing, saying: “Follow code quality practices, such as code reviews, pair programming, and automated testing. Your team should be aware of technical debt and strive to minimize it.”
Do you recall the problems that Project Roadrunner’s System Engineers face? They had quality issues that damaged the DevOps environment: they either had no time to follow the best practices for writing maintainable code, or felt no need to do so. With the System Craftsmanship mindset, they would be aware that they have to produce well-crafted code in order to create a well-crafted infrastructure. They would implement CI/CD, maybe do Test Driven Development, and write some specifications to get their own code clean and keep it clean, which also leads to a collaboration culture. When it comes to collaboration, they could even do Pair Programming, both to improve the quality of their code and to build collective ownership. This collective ownership solves the resource A and B responsibility problem they had, because by implementing a “single source of truth” mechanism and collaborating, everybody becomes mutually responsible for everything.
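To make the “single source of truth” and test-first ideas concrete, here is a small hypothetical sketch -the naming convention, the team names, and the helper itself are invented for illustration- of one shared function that every team derives its resource names from, with the expectations written first, TDD style, so they double as the specification:

```python
# Hypothetical shared naming convention: "<env>-<team>-<app>", lowercase,
# hyphen-separated, DNS-label friendly. Keeping this one function as the
# single source of truth means no team hand-crafts names in its own scripts.

import re

# DNS-label style: lowercase alphanumerics and hyphens, no leading/trailing hyphen.
_NAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def resource_name(env: str, team: str, app: str) -> str:
    """Build a resource name that follows the shared convention, or fail loudly."""
    name = "-".join(part.strip().lower() for part in (env, team, app))
    if not _NAME_RE.match(name) or len(name) > 63:
        raise ValueError("invalid resource name: %r" % name)
    return name


# These checks were written before the implementation; they state the
# convention for every team that imports this helper.
assert resource_name("Prod", "platform", "roadrunner") == "prod-platform-roadrunner"
try:
    resource_name("prod", "platform", "road runner")  # spaces are not allowed
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for an invalid name")
```

Because the convention lives in one reviewed, tested place instead of being copy-pasted across team scripts, changing it is a pull request with a diff and a failing-then-passing test, not a hunt through everyone’s automation.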
Apart from the technical practices, the Craftsmanship mindset improves professionalism. By expanding a known term, “Software Craftsmanship”, into something new, “System Craftsmanship”, developers will understand that this is not just a software or application matter; it is a system-wide thing to implement, because -as we discussed before- we are in a Cloud Native Era right now: the infrastructure has been transformed into code, and the cloud has been transformed into code, so it is no longer possible for application developers to write only application code. That application is now a Cloud Native Application: it is a piece of the Cloud, and it has pieces from the Cloud.
Remember the application developers of Project Roadrunner, who don’t want to write cloud-related code within the project. Does the definition above fix this problem and the responsibility issue? Responsibility is just a matter of professionalism in many cases, so in this case it just needs a bit of professionalism from the System Craftsmanship culture.
As it is about professionalism and passion, treating one’s job as a profession requires following a bunch of Apprenticeship Patterns as practice.
For example, an SRE can become a better SRE by finding mentors, working with them, or simply following them through pairing sessions. This will improve that SRE’s way of working. Remember the famous Zen poem that Eric S. Raymond also mentions in his famous article How To Become A Hacker:
To follow the path:
look to the master,
follow the master,
walk with the master,
see through the master,
become the master.
Or what about exposing one’s ignorance? It is all about putting your ego under your shoes. It sounds quite painful, right? Some Software Developers already do this because they are aware of the craftsmanship movement. But what about the “system people”?
Let’s take System Administrators, for example. We discussed -maybe in a joking way- that they have “God” access to the systems. They can do whatever they want in a system because of their privileges. Because of this, some of these people may have a rather big ego, and it is very hard for them, from time to time, to tell what they don’t know. The same goes for Security Administrators/Engineers. Because of their duty, they have to question the reason for every action, and this makes them feel “in control” of the flow, just as it is for the System Admins. Well, it is not their fault, it is just human nature; but it can be fixed with practices like “exposing your ignorance” and activities like “pair programming” or “code review”.
Notice that at the beginning of the Cloud Native Era, these people were a little bit “shocked” by the adoption of the cloud: “A place where you can define everything as code! How will I keep the privileges to myself/my team?” or “How will I keep this component secure? It is not even in my network topology!” Collective Ownership -which doesn’t mean that everybody can change security configurations-, a Single Source of Truth, and Code Reviews -while exposing their ignorance-, and maybe more: these are all they have to do.
I believe all these patterns and practices, which originally come from Agile practices like Extreme Programming and from others like Clean Code or Apprenticeship Patterns, are beneficial for System Engineers, System Admins, Security Admins/Engineers, SREs, and the many others whom I have kept calling “system people” throughout this long-ish article.
And I hope that these practices and patterns, and the “Craftsmanship” mindset -which I believe has raised the bar of software development for many years- will raise the bar even further in this Cloud Native Era: for the people and communities who are passionate about DevOps and Infrastructure as Code, and for the companies that are planning (or are in the middle of) a DevOps transformation and need to give it the right start by focusing on the “people” first.