
Messaging Architectures for Cloud-Native Applications

1. Traditional Applications and Monoliths

Do you remember the days when we built applications by creating data models for our business domains and mapping those models onto relational database objects -mostly tables- in order to perform CRUD operations?

Business requirements poured down from the waterfall and got us so soaking wet that we could not easily respond to change: new business requirements, bug fixes, enhancements, etc.

When Agile methodologies came along, they eventually made us more flexible and able to respond to change very quickly. Ideas like SOA, the service bus, and distributed state management emerged along the way, but business domains stayed more or less merged, and monoliths survived.

Monolithic applications -which are actually not an anti-pattern in themselves- ruled the world for a considerable number of years, with different kinds of architectures that each have their own benefits and drawbacks.

Figure 1. Traditional Application Design. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=4

1.1. Benefits and Drawbacks

The Benefits:

  • Easy to start with
  • Easy transaction management
  • Sync communication
  • Can be powered up with a modular architecture

The Drawbacks:

  • Hard to change business domain and data model
  • Hard to scale
  • Tightly coupled components

2. Cloud-Native Applications and Microservices

One of the main motivations for approaches like Domain-Driven Design is that monoliths are tightly coupled across business domains: those domains need to be separated in order to loosen the coupling and give each domain a single responsibility as a bounded context.

Figure 2. Bounded Contexts. https://martinfowler.com/bliki/BoundedContext.html

These kinds of approaches led to the creation of microservices, motivated by loose coupling between bounded contexts, being polyglot (using the best-fitting tools for each service), easy horizontal scalability, and, most importantly, the ability to adapt easily to the cloud-native world.

Apart from all these benefits, microservices and cloud-native application architectures have some challenges that may turn a developer's life into a nightmare.

2.1. Challenges

Microservice architectures have many challenges -manageability of the services, traceability, monitoring, service discovery, distributed state and data management, and resilience- some of which are handled automatically by cloud-native platforms like Kubernetes. For example, service discovery is a requirement of any application that consists of microservices, and Kubernetes provides a service discovery mechanism out of the box.
What cloud-native platforms cannot provide, and leave to the guru developers, is state management itself.

2.1.1. State

Keeping state in a distributed system and making it flow through the microservices has its own challenges. Keeping the state in a distributed cache system like Infinispan and creating a kind of single source of truth for it is a common pattern, but in a mesh of services it is tough to manage, since you end up with an Inception of caches.

Keeping the state distributed through services is even tougher.

Figure 3. Microservices & Data. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=5

Since database-per-microservice is a common pattern, and since each bounded context should own and handle its own data, the need to share state/data across services makes direct point-to-point communication between microservices all the more important.

2.1.2. Synchronous Communication

Synchronous data retrieval is one way to get the data a microservice needs from another microservice. One can use comparatively new technologies like HTTP+REST and gRPC, or some old-school technologies like RMI and IIOP, but all of these synchronous, point-to-point styles of data retrieval have costs.

Figure 4. Synchronous Data Retrieval. https://speakerdeck.com/yanaga/distribute-your-microservices-data-with-events-cqrs-and-event-sourcing?slide=5

Latency is one of the key concerns of messaging between services. With synchronous communication, if the service whose data is being retrieved has performance problems of its own, it will serve the data with added latency, which may cause slow data retrieval or timeout exceptions.

Or a service may fail and be unavailable; during that time period the synchronous data call simply won't work.

Also, any performance issue on the network will directly affect either the latency or the service availability. So everything depends on the network being reliable.

We know that there are patterns like distributed caching, bulkhead, and circuit breaker for handling these failure scenarios by implementing fault-tolerance strategies, but is it really the right way to do it?
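
To make the cost concrete, here is a minimal sketch of a synchronous call guarded by a timeout and a naive circuit breaker. It is written in Python; the requests library and the order-service URL are assumptions for illustration, not something from the original setup.

import time

import requests  # assumed HTTP client; any equivalent would do


class CircuitBreaker:
    """Naive circuit breaker: open after N failures, retry after a cooldown."""

    def __init__(self, max_failures=3, reset_seconds=30):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While open, fail fast instead of waiting on a sick service
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise


breaker = CircuitBreaker()


def get_order(order_id):
    # The timeout bounds the latency cost of the synchronous call
    resp = requests.get(f"http://order-service/orders/{order_id}", timeout=2)
    resp.raise_for_status()
    return resp.json()


# order = breaker.call(get_order, 42)

This bounds how long a caller can be stalled, but notice that it only mitigates the problem; the point-to-point coupling is still there.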

2.2. Challenging the Challenges

So there are solutions, some of which were invented years ago but are still solid enough to count as a 'solution', while others are brand-new architectures that will help us challenge the challenges of cloud-native application messaging and communication.

Let’s start by taking a look at the common asynchronous messaging architectures before jumping into the solutions.

2.2.1. Asynchronous Messaging and Messaging Architectures

Like synchronous communication, asynchronous communication -or in other words, messaging- has to be done over a protocol. Both sides of the communication should agree on the protocol, so that the message data that is produced can be understood by the consumer.

While HTTP+REST is the most used protocol for synchronous communication, there are several other protocols that asynchronous messaging systems widely use, like AMQP, STOMP, XMPP, MQTT, or the Kafka protocol.

There are three main types of messaging models:

  • Point-to-point
  • Publish-subscribe (Pub-sub)
  • Hybrid

Point-to-Point

Point-to-point messaging is like sending a parcel via the postal service. You go to a post office, write the address you want the parcel to be delivered to, and post the parcel knowing it will be delivered sometime later. The receiver does not have to be at home when the parcel is sent; at some point later the parcel will be received at the address.

Point-to-point messaging systems are mostly implemented as queues that use first-in, first-out (FIFO) ordering, and a specific message can be received by only one subscriber of the queue.

Figure 5. Point-to-Point Messaging. https://docs.oracle.com/cd/E19340-01/820-6424/aerbj/index.html

This brings up the durability of queues: if there are no active subscribers, the messaging system retains the messages until a subscriber connects and consumes them.

Point-to-point messaging is generally used when a message must be acted upon once only, as queues can best provide an at-least-once delivery guarantee.
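
To illustrate, here is a minimal sketch of a durable queue with manual acknowledgements, which gives exactly this single-consumer, at-least-once behavior. A RabbitMQ broker on localhost and the pika client are assumptions for the example; they are not part of the original setup.

import pika  # assumed AMQP client, talking to a RabbitMQ broker on localhost

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A durable queue retains messages even while no consumer is active
channel.queue_declare(queue="parcels", durable=True)

channel.basic_publish(
    exchange="",
    routing_key="parcels",
    body=b"parcel for the receiver's address",
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)


def on_message(ch, method, properties, body):
    print("delivered:", body)
    # Acknowledge only after processing; an unacknowledged message is
    # redelivered, which is what makes the guarantee at-least-once
    ch.basic_ack(delivery_tag=method.delivery_tag)


# Each message is delivered to exactly one of the queue's consumers
channel.basic_consume(queue="parcels", on_message_callback=on_message)
channel.start_consuming()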

Publish-Subscribe

In order to understand how publish-subscribe (pub-sub) works, imagine you are an attendee of a webinar. While you are connected you can hear and watch what the speaker says, along with the other participants. When you disconnect you miss what the speaker says, but when you connect again you are able to hear what is being said from that moment on.

So the webinar works like a pub-sub mechanism: all the attendees are subscribers, and the speaker is the broadcaster/publisher.

Pub-sub mechanisms are generally implemented as topics, which act like the webinar broadcast being subscribed to. When a message is produced on a topic, all the subscribers get it, since the message is distributed to each of them.

Figure 6. Publish-Subscribe Messaging. https://engineering.carsguide.com.au/laravel-pub-sub-messaging-with-apache-kafka-3b27ed1ee5e8

Unlike queues, topics are nondurable. A subscriber/consumer that is not consuming messages -because it is not running, for example- misses the messages broadcast while it is off. This means that topics can provide only an at-most-once delivery guarantee for each subscriber.

Hybrid Model

Hybrid models of messaging systems combine point-to-point and publish-subscribe, since real-world use cases generally require a messaging system in which many consumers each want a copy of the message with full durability -in other words, without message loss.

Technologies like ActiveMQ and Apache Kafka both implement this hybrid model, each with its own persistence and distribution mechanisms.

Durability is a key factor, especially in cloud-native distributed systems, since persisting the state and being able to somehow replay it plays a key role in component communication. Adding durability to the capabilities of the publish-subscribe mechanism decreases the dependencies between components/services/microservices, because a message can be persisted and consumed again, either by the same subscriber or by another one.

So hybrid messaging systems are vital when it comes to passing state through cloud-native microservices as events, since event-driven distributed architectures require exactly these capabilities.
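
Kafka's consumer groups are a handy illustration of the hybrid model: consumers sharing a group id split a topic's messages among themselves like a queue, while consumers in different groups each receive a full copy, like pub-sub, replayed from the retained log. A minimal sketch, assuming a broker at localhost:9092 and the kafka-python client:

from kafka import KafkaConsumer  # assumed client library: kafka-python

# Consumers sharing the SAME group_id split the topic's partitions:
# each message is processed by only one of them (queue semantics).
# A consumer with a DIFFERENT group_id gets its own full copy of the
# stream (pub-sub semantics), replayed from the retained log.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",    # change the group id to get a full copy
    auto_offset_reset="earliest",  # replay retained messages on first run
)

for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)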

2.2.2. Events & Event Sourcing

When developing microservice-based cloud-native architectures, approaches like Domain-Driven Design (DDD) make it easy to divide the system into bounded contexts and to see the sub-domains related to the parent domain.

One of the best techniques for separating and defining the bounded contexts is Event Storming, which takes the events as entry points and surfaces everything else, including commands, data relationships, communication styles, and, most importantly, the aggregates -playfully called 'combobulators' in the recipe referenced below- which are mostly mapped to bounded contexts.

Figure 7. Events Storming Components. https://medium.com/@springdo/a-facilitators-recipe-for-event-storming-941dcb38db0d

When most of the Event Storming map has emerged, one can see all the communication points between the bounded contexts, which are mostly mapped to microservices or services that each have their own database and data structure.

Figure 8. A real-life Event Storming example;)

Since this structure consists entirely of events, it suggests the main idea of using asynchronous communication via a publish-subscribe system that queues the events to be consumed -in other words, doing Event Sourcing.

Event Sourcing is a state-event-message pattern that captures all changes to an application's state as a sequence of events, which can then be consumed by other applications -in this case, microservices.

Figure 9. Event Sourcing. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=14
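
A minimal, in-memory sketch of the idea in plain Python (no framework assumed): the state is never stored directly; it is rebuilt by replaying the append-only sequence of events.

from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str  # e.g. "Deposited" or "Withdrawn"
    amount: int


@dataclass
class Account:
    events: list = field(default_factory=list)  # the append-only event log

    def deposit(self, amount):
        self.events.append(Event("Deposited", amount))

    def withdraw(self, amount):
        self.events.append(Event("Withdrawn", amount))

    @property
    def balance(self):
        # Current state = a replay of all events from the beginning; any
        # other service consuming the same log derives the same state
        total = 0
        for e in self.events:
            total += e.amount if e.kind == "Deposited" else -e.amount
        return total


acct = Account()
acct.deposit(100)
acct.withdraw(30)
print(acct.balance)  # 70 -- and acct.events is the full audit trail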

Event Sourcing is very important in a distributed cloud-native environment, because in the cloud-native world microservices can easily be scaled, new microservices can join, or -from an application modernization perspective- microservices can be split off from their big monolith mother in order to live their own lives.

So having an asynchronous publish-subscribe system with durable data and the ability to replay it is very important. Additionally, queueing the events rather than the final data keeps things flexible for the other services, which amounts to implementing dependency inversion in an asynchronous environment, with eventual consistency as the trade-off.

The question here is: How to create/trigger those events?

There are many programmatic ways, as well as framework libraries, to create events and publish them. One can create database listeners or interceptors programmatically (as Hibernate Envers does), or handle them in the DAO (Data Access Object) or service layer of the application. Even so, creating an Event Sourcing mechanism is not easy, as the sketch below shows.
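
Handling it in the service layer typically looks like this minimal Python sketch (save_to_db and publish are hypothetical stand-ins for the real DAO and broker client). The catch is that the database write and the event publish are two separate operations, so a crash between them leaves the two sides inconsistent -the classic dual-write problem:

import json


def save_to_db(post):
    print("DB write:", post)  # stand-in for the real DAO/ORM write


def publish(topic, payload):
    print("publish to", topic, ":", json.dumps(payload))  # stand-in broker client


def create_post(post):
    save_to_db(post)  # write #1: the application database
    # If the process crashes right here, the row exists in the database
    # but no event was ever published -- and there is no shared
    # transaction across a database and a message broker to prevent it.
    publish("posts.created", post)  # write #2: the message broker


create_post({"id": 1, "title": "Hello"})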

At this point, a relatively new pattern comes as a savior: Change Data Capture.

2.2.3. Change Data Capture

Change Data Capture (CDC) is a pattern used to track data changes -mostly in databases- in order to take action on them.

Figure 10. Change Data Capture. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=15

A CDC mechanism should listen for the data changes and create events that include both the action -insert, update, or delete- and the changed data itself. After the event is created, it can be published to any durable pub-sub system in order to be consumed.

Because it decouples the change-event listening capability from the application code, CDC is one of the best patterns for the event-driven architectures of cloud-native applications.

Debezium is probably the most popular open-source implementation nowadays because of its easy integration with a popular set of databases and Apache Kafka, especially on platforms like Kubernetes/OpenShift.

Figure 11. CDC with Debezium. https://developers.redhat.com/blog/2020/04/14/capture-database-changes-with-debezium-apache-kafka-connectors/

Now we have our event-sourcing listener and creator in the form of a CDC implementation like Debezium, and let's say we use Apache Kafka for event distribution between the microservices.

Since what we publish is not the data itself but the change, the subscribing microservice should consume the change and reflect it in its own database. Because each microservice, being a bounded context, has a database of its own, this reflected change is generally used for reads rather than writes: it is merely the reflection of an event that was triggered by a write on another database.

This makes us recall a pattern -one that our example already implements by design- that has been used for many years, especially by enterprise-level relational databases: CQRS.

2.2.4. Command Query Responsibility Segregation (CQRS)

CQRS is a pattern that suggests separating the read model from the write model, mainly for reasons of separation of concerns and performance.

Figure 12. CQRS with Separate Datastores. https://speakerdeck.com/mbogoevici/data-strategies-for-microservice-architectures?slide=20

In a cloud-native system with a set of polyglot microservices, each with its own database -relational or NoSQL- the CQRS pattern fits well, since each microservice has to keep a read-side reflection of the data owned by the services it depends on.
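
In practice the read side can be as small as a consumer that projects change events into a local, query-optimized store. Here is a minimal sketch, assuming the kafka-python client and flattened, Debezium-style change events like the ones produced later in this post (the __op field: c = create, u = update, d = delete):

import json

from kafka import KafkaConsumer  # assumed client library: kafka-python

read_model = {}  # the query side: a local, denormalized view of posts

consumer = KafkaConsumer(
    "db.neverendingblog.posts",
    bootstrap_servers="localhost:9092",
    group_id="posts-read-model",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    change = record.value
    # Project each change event onto the local read model
    if change.get("__op") in ("c", "u"):
        read_model[change["id"]] = change
    elif change.get("__op") == "d":
        read_model.pop(change.get("id"), None)
    print("read model now has", len(read_model), "posts")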

2.3. Solutions Assemble: State Propagation

So, in our imaginary, well-architected, distributed cloud-native system, in order to make our microservices communicate and transfer their state, we called on the best-of-breed, superheroic patterns -some of which have great implementations.

Figure 13. State Propagation & Outbox Pattern. https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/

This transfer of state through an asynchronous system with durability and pub-sub capabilities like Apache Kafka -triggered by a change data capture mechanism like Debezium, with one component writing the event data to be read by another- is what we call State Propagation, and it is backed by the Outbox pattern on the microservice side.
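
On the writing side, the Outbox pattern boils down to inserting the business row and the corresponding event row in the same local transaction, and letting a CDC tool like Debezium publish the event from the outbox table. Here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the service's real database; the column layout follows the convention of the Debezium outbox post referenced in Figure 13:

import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")  # stand-in for the service's real database
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, aggregatetype TEXT,
    aggregateid TEXT, type TEXT, payload TEXT)""")


def create_post(post_id, title):
    with db:  # ONE local transaction covers both writes
        db.execute("INSERT INTO posts VALUES (?, ?)", (post_id, title))
        db.execute(
            "INSERT INTO outbox VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "post", str(post_id), "PostCreated",
             json.dumps({"id": post_id, "title": title})),
        )
    # No dual write: Debezium tails the outbox table and publishes the
    # event to Kafka, so database state and event stream cannot diverge.


create_post(1, "Javaday Istanbul 2020")
print(db.execute("SELECT type, payload FROM outbox").fetchall())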

To sum it all up, bringing these solutions together -state propagation and event-driven mechanisms built with CDC, Event Sourcing, and CQRS- helps us solve the challenges of the cloud-native microservices era in a very elegant way.

Resources

Books
  • Jakub Korab. ‘Understanding Message Brokers’. O’Reilly Media, Inc. ISBN 9781491981535.
  • Todd Palino, Gwen Shapira, Neha Narkhede. ‘Kafka: The Definitive Guide’. O’Reilly Media, Inc. ISBN 9781491936160.

ASAP! – The Storified Demo of Introduction to Debezium and Kafka on Kubernetes

A few days ago, I had a chance to speak about "Change Data Capture with Debezium and Apache Kafka" at an Istanbul Java User Group event. After the presentation, I did a small demo that I think was very beneficial for the audience, so I thought it would be best to improve it and kind of "storify" it, in order to both have fun and reach a wider audience. So here is the demo, along with the resources you might need. Enjoy:)

Prerequisites

Install the required tools

  • Strimzi Kafka CLI:

sudo pip install strimzi-kafka-cli

  • oc or kubectl
  • helm

Login to a Kubernetes or OpenShift cluster and create a new namespace/project.

Let’s say we create a namespace called debezium-demo by running the following command on OpenShift:

oc new-project debezium-demo

Install demo application ‘The NeverEnding Blog’

Clone the repository:

git clone https://github.com/mabulgu/the-neverending-blog.git

Checkout the debezium-demo branch:

git checkout debezium-demo

Go into the application directory:

cd the-neverending-blog

Install the helm template:

helm template the-neverending-blog chart | oc apply -f - -n debezium-demo

Start the s2i build for the application:

oc start-build neverending-blog --from-dir=. -n debezium-demo

…and OpenShift will take care of the rest and you should have a blog application called ‘The NeverEnding Blog’ in the end:

Install Elasticsearch

Apply Elasticsearch resources to OpenShift:

oc apply -f resources/elasticsearch.yaml -n debezium-demo

Expose the route for Elasticsearch:

oc expose svc elasticsearch-es-http -n debezium-demo

By clicking on the route of the application in the browser you should see a page like this:

And as for the overall applications before the demo, you should have something like this (the OpenShift Developer Perspective is used here):

So you should have a Django application that uses a MySQL database, and an Elasticsearch instance that has no data connection to the application -yet:)

ASAP!

So you are working at a company called NeverEnding Inc. as a Software Person, and you are responsible for the company's blog application, which runs on Django and uses MySQL as its database.

One day your boss comes and tells you this:

Getting this command from your boss, you think that this is a good use case for the Change Data Capture (CDC) pattern.

Since the boss wants it ASAP, and you don't want to do dual writes that may cause consistency problems, you have to find a way to apply this request easily, and you decide it is best to implement it with Debezium on your OpenShift Office Space cluster, along with Strimzi: Kafka on Kubernetes.

Oh, you can wear a Hawaiian shirt and jeans while you are doing all these even if it’s not Friday:)

Deploying a Kafka cluster with Strimzi Kafka CLI

To install a Strimzi cluster on OpenShift, you decide to use Strimzi Kafka CLI, which can also install the Strimzi cluster operator for you.

First install the Strimzi operator:

kfk operator --install -n debezium-demo

IMPORTANT

If you already have an operator installed, please check its version. If the Strimzi version you've been using is older than 0.20.0, you have to set the right version as an environment variable, so that you will be able to use the matching version of the cluster custom resource:

export STRIMZI_KAFKA_CLI_STRIMZI_VERSION=0.19.0

Let’s create a Kafka cluster called demo on our OpenShift namespace debezium-demo.

kfk clusters --create --cluster demo -n debezium-demo

In the editor that opens you may keep the default configuration of 3 brokers and 3 ZooKeeper nodes. After saving the configuration file of the Kafka cluster, you should see the resources that are created for the Kafka cluster in the OpenShift Developer Perspective:

Deploying a Kafka Connect Cluster for Debezium

Now it's time to create a Kafka Connect cluster using Strimzi custom resources. Since Strimzi Kafka CLI is not yet capable of creating Connect objects at the time of writing this article, we will create it by using the sample resources in the demo project.

Create a custom resource like the following:

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  annotations:
    strimzi.io/use-connector-resources: 'true'
  name: debezium
spec:
  bootstrapServers: 'demo-kafka-bootstrap:9092'
  config:
    config.storage.replication.factor: '1'
    config.storage.topic: debezium-cluster-configs
    group.id: debezium-cluster
    offset.storage.replication.factor: '1'
    offset.storage.topic: debezium-cluster-offsets
    status.storage.replication.factor: '1'
    status.storage.topic: debezium-cluster-status
  image: 'quay.io/hguerreroo/rhi-cdc-connect:2020-Q3'
  jvmOptions:
    gcLoggingEnabled: false
  replicas: 1
  resources:
    limits:
      memory: 2Gi
    requests:
      memory: 2Gi

And apply it to the debezium-demo namespace on OpenShift (or just apply the one you have in this demo repository):

oc apply -f resources/kafka-connect-debezium.yaml -n debezium-demo

This will create a Kafka Connect cluster with the name debezium on your namespace:

Deploy a Debezium connector for MySQL

So now you have a Kafka Connect cluster to use with Debezium. It's time for the real magic: the Debezium connector for MySQL.

Create a custom resource like the following, paying attention to the configuration keys that start with database.

Since you have to capture the changes in the neverendingblog database, which contains the posts table, your configuration should look like this:

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaConnector
metadata:
  labels:
    strimzi.io/cluster: debezium
  name: debezium-mysql-connector
spec:
  class: io.debezium.connector.mysql.MySqlConnector
  config:
    database.server.name: db
    database.hostname: mysql
    database.user: debezium
    database.password: dbz
    database.server.id: '184054'
    database.port: '3306'
    database.dbname: neverendingblog
    database.history.kafka.topic: db.history
    database.history.kafka.bootstrap.servers: 'demo-kafka-bootstrap:9092'
  tasksMax: 1

Apply this YAML by saving it, or just run the following command from the repository:

oc apply -f resources/kafka-connector-mysql-debezium.yaml -n debezium-demo

You should have some action in your Kafka cluster by now, and the big picture should look like this:

To see whether any new topics have been created in your Kafka cluster, run this command to list the topics in the debezium-demo namespace for the demo Kafka cluster:

kfk topics --list -n debezium-demo -c demo

You should see that some topics have been created for you:

NAME                                                                                PARTITIONS   REPLICATION FACTOR
consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a                         50           1
db                                                                                  1            1
db.history                                                                          1            1
db.neverendingblog.auth-permission---68ff3df4ec8e6a44b01288a87974b27990a559d2       1            1
db.neverendingblog.auth-user---a76d163ac9b98b60f06bfda76e966523ee9ffad              1            1
db.neverendingblog.django-admin-log---889a02bc079f08f8adf60c1b1f1cc6782dd99531      1            1
db.neverendingblog.django-content-type---79cc865eac5ac5b439174d2165a8035d52062610   1            1
db.neverendingblog.django-migrations---adc510d5c63e7b6ccbbf460dfa8c03408559591d     1            1
db.neverendingblog.django-session---38f5de04ea83f7a9add8be00a2d695a9503505c6        1            1
db.neverendingblog.posts                                                            1            1
debezium-cluster-configs                                                            1            1
debezium-cluster-offsets                                                            25           1
debezium-cluster-status                                                             5            1

Now let's check whether the connector works. Start a consumer that listens to the db.neverendingblog.posts topic, which is where the data captured from the posts table is put:

kfk console-consumer --topic db.neverendingblog.posts -n debezium-demo -c demo

After starting the consumer, let's make some changes in the NeverEnding Blog. Open the Django admin page by getting the route URL of the blog and appending "/admin" to it.


INFO

You can get the route URL of your application with the following command:

oc get routes -n debezium-demo

So log in to the admin page with the credentials mabulgu/123456, click on Posts, add a new one by clicking Add Post, enter these values as a test, and save it:

You must already have seen some movement in the consumer, right? Copy that message into a JSON beautifier and see what you have. It should be something like this:

{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "struct",
        "fields": [
          {
            "type": "int32",
            "optional": false,
            "field": "id"
          },
          {
            "type": "string",
            "optional": false,
            "field": "title"
          },
          {
            "type": "string",
            "optional": false,
            "field": "text"
          },
          {
            "type": "int64",
            "optional": false,
            "name": "io.debezium.time.MicroTimestamp",
            "version": 1,
            "field": "created_date"
          },
          {
            "type": "int64",
            "optional": true,
            "name": "io.debezium.time.MicroTimestamp",
            "version": 1,
            "field": "published_date"
          },
          {
            "type": "int32",
            "optional": false,
            "field": "author_id"
          }
        ],
        "optional": true,
        "name": "db.neverendingblog.posts.Value",
        "field": "before"
      },
      {
        "type": "struct",
        "fields": [
          {
            "type": "int32",
            "optional": false,
            "field": "id"
          },
          {
            "type": "string",
            "optional": false,
            "field": "title"
          },
          {
            "type": "string",
            "optional": false,
            "field": "text"
          },
          {
            "type": "int64",
            "optional": false,
            "name": "io.debezium.time.MicroTimestamp",
            "version": 1,
            "field": "created_date"
          },
          {
            "type": "int64",
            "optional": true,
            "name": "io.debezium.time.MicroTimestamp",
            "version": 1,
            "field": "published_date"
          },
          {
            "type": "int32",
            "optional": false,
            "field": "author_id"
          }
        ],
        "optional": true,
        "name": "db.neverendingblog.posts.Value",
        "field": "after"
      },
      {
        "type": "struct",
        "fields": [
          {
            "type": "string",
            "optional": false,
            "field": "version"
          },
          {
            "type": "string",
            "optional": false,
            "field": "connector"
          },
          {
            "type": "string",
            "optional": false,
            "field": "name"
          },
          {
            "type": "int64",
            "optional": false,
            "field": "ts_ms"
          },
          {
            "type": "string",
            "optional": true,
            "name": "io.debezium.data.Enum",
            "version": 1,
            "parameters": {
              "allowed": "true,last,false"
            },
            "default": "false",
            "field": "snapshot"
          },
          {
            "type": "string",
            "optional": false,
            "field": "db"
          },
          {
            "type": "string",
            "optional": true,
            "field": "table"
          },
          {
            "type": "int64",
            "optional": false,
            "field": "server_id"
          },
          {
            "type": "string",
            "optional": true,
            "field": "gtid"
          },
          {
            "type": "string",
            "optional": false,
            "field": "file"
          },
          {
            "type": "int64",
            "optional": false,
            "field": "pos"
          },
          {
            "type": "int32",
            "optional": false,
            "field": "row"
          },
          {
            "type": "int64",
            "optional": true,
            "field": "thread"
          },
          {
            "type": "string",
            "optional": true,
            "field": "query"
          }
        ],
        "optional": false,
        "name": "io.debezium.connector.mysql.Source",
        "field": "source"
      },
      {
        "type": "string",
        "optional": false,
        "field": "op"
      },
      {
        "type": "int64",
        "optional": true,
        "field": "ts_ms"
      },
      {
        "type": "struct",
        "fields": [
          {
            "type": "string",
            "optional": false,
            "field": "id"
          },
          {
            "type": "int64",
            "optional": false,
            "field": "total_order"
          },
          {
            "type": "int64",
            "optional": false,
            "field": "data_collection_order"
          }
        ],
        "optional": true,
        "field": "transaction"
      }
    ],
    "optional": false,
    "name": "db.neverendingblog.posts.Envelope"
  },
  "payload": {
    "before": null,
    "after": {
      "id": 3,
      "title": "Javaday Istanbul 2020",
      "text": "It was perfect as always!",
      "created_date": 1606400139000000,
      "published_date": null,
      "author_id": 1
    },
    "source": {
      "version": "1.2.4.Final-redhat-00001",
      "connector": "mysql",
      "name": "db",
      "ts_ms": 1606400180000,
      "snapshot": "false",
      "db": "neverendingblog",
      "table": "posts",
      "server_id": 223344,
      "gtid": null,
      "file": "mysql-bin.000003",
      "pos": 27078,
      "row": 0,
      "thread": 221,
      "query": null
    },
    "op": "c",
    "ts_ms": 1606400180703,
    "transaction": null
  }
}

So congratulations! You can capture changes on your neverendingblog database.

But your boss still wants you to put these changes on your search system Elasticsearch.

Before rolling up our sleeves to send this change data to Elasticsearch, let's simplify the data, since all you need to index are the new record state plus the operation type and table fields of this Debezium JSON data.

Simple Data Transformation

To transform the data, some key/value converters have to be set, along with Debezium's ExtractNewRecordState single message transform, which will produce a different, flattened data model in the end.

So add these lines to the connector configuration and apply it on your OpenShift cluster:

    key.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: 'false'
    value.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable: 'false'
    transforms: extract
    transforms.extract.add.fields: 'op,table'
    transforms.extract.type: io.debezium.transforms.ExtractNewRecordState

Or just run this sample in the repository:

oc apply -f resources/kafka-connector-mysql-debezium_transformed.yaml -n debezium-demo

This means that the transform will extract the new record state, add the op and table values as fields, and return a new, flattened JSON.

After applying the changes, let's consume the messages again (in case the consumer was stopped):

kfk console-consumer --topic db.neverendingblog.posts -n debezium-demo -c demo

Go to the blog admin page again but this time let’s change one of the blog posts instead of adding one.

Edit the post titled Strimzi Kafka CLI: Managing Strimzi in a Kafka Native Way and put a "CHANGED -" at the very start of the body, for example.

When you change the data, a much smaller JSON document should be consumed in your console, something like this:

{
  "id": 2,
  "title": "Strimzi Kafka CLI: Managing Strimzi in a Kafka Native Way",
  "text": "CHANGED - Strimzi Kafka CLI is a CLI that helps traditional Apache Kafka users -mostly administrators- to easily adapt Strimzi, a Kubernetes operator for Apache Kafka.\r\n\r\nIntention here is to ramp up Strimzi usage by creating a similar CLI experience with traditional Apache Kafka binaries. \r\n\r\nkfk command stands for the usual kafka-* prefix of the Apache Kafka runnable files which are located in bin directory. There are options provided like topics, console-consumer, etc. which also mostly stand for the rest of the runnable file names like kafka-topic.sh.\r\n\r\nHowever, because of the nature of Strimzi and its capabilities, there are also unusual options like clusters which is used for cluster configuration or users which is used for user management and configuration.",
  "created_date": 1594644431000000,
  "published_date": 1594644489000000,
  "author_id": 1,
  "__op": "u",
  "__table": "posts"
}

So this will be the data that you will index in Elasticsearch. Now let’s go for it!

Deploying a Kafka Connect Cluster for Camel

To use another connector that consumes the data from Kafka and puts it into Elasticsearch, we first need another Kafka Connect cluster, this time for a Camel connector.

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  annotations:
    strimzi.io/use-connector-resources: 'true'
  name: camel
spec:
  bootstrapServers: 'demo-kafka-bootstrap:9092'
  config:
    config.storage.replication.factor: '1'
    config.storage.topic: camel-cluster-configs
    group.id: camel-cluster
    offset.storage.replication.factor: '1'
    offset.storage.topic: camel-cluster-offsets
    status.storage.replication.factor: '1'
    status.storage.topic: camel-cluster-status
  image: 'quay.io/hguerreroo/camel-kafka-connect:0.5.0'
  jvmOptions:
    gcLoggingEnabled: false
  replicas: 1
  resources:
    limits:
      memory: 2Gi
    requests:
      memory: 2Gi

Save and apply this YAML to your OpenShift namespace, or simply run this sample:

oc apply -f resources/kafka-connect-camel.yaml -n debezium-demo

This will create a Kafka Connect cluster with the name camel on your namespace:

Now let's put a connector on this Connect cluster.

Deploy a Camel Sink connector for Elasticsearch

To send the consumed data to Elasticsearch, we can use the Apache Camel project's connectors for Kafka Connect.

The following is a sample Camel Elasticsearch sink connector, which takes Kafka as the source and Elasticsearch as the sink.

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaConnector
metadata:
  labels:
    strimzi.io/cluster: camel
  name: elasticsearch-connector
spec:
  class: >-
    org.apache.camel.kafkaconnector.elasticsearchrest.CamelElasticsearchrestSinkConnector
  config:
    camel.sink.endpoint.hostAddresses: 'elasticsearch-es-http:9200'
    camel.sink.endpoint.indexName: posts
    camel.sink.endpoint.operation: Index
    camel.sink.path.clusterName: elasticsearch
    key.converter: org.apache.kafka.connect.storage.StringConverter
    value.converter: org.apache.kafka.connect.storage.StringConverter
    topics: db.neverendingblog.posts
  tasksMax: 1

By saving and applying this resource, you tell the Connect cluster to consume the db.neverendingblog.posts topic of Kafka and put the messages into a posts index in Elasticsearch.

Or just run this command to create the connector:

oc apply -f resources/kafka-connector-elastic-camel.yaml -n debezium-demo

Now the big picture should look like this:

So let's test Elasticsearch by running some curl search requests.

Try out Elasticsearch

To access Elasticsearch externally -just like any other application in OpenShift- you should get its route with this command:

oc get routes -n debezium-demo

Let’s say that we get the route as http://elasticsearch-es-http-debezium-demo.apps.cluster-jdayist-6d29.jdayist-6d29.example.opentlc.com.

To see whether the index has been created and whether it has anything inside, run the following command to search for everything in the index:

curl -X GET \
  http://elasticsearch-es-http-debezium-demo.apps.cluster-jdayist-6d29.jdayist-6d29.example.opentlc.com/posts/_search

You should get a response that has all the changes, including the one for Javaday Istanbul. Let's see if we can find it:

curl -X GET \
  'http://elasticsearch-es-http-debezium-demo.apps.cluster-jdayist-6d29.jdayist-6d29.example.opentlc.com/posts/_search?q=title:Javaday%20Istanbul%202020'

You should see something like this in return:

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 4.852654,
        "hits": [
            {
                "_index": "posts",
                "_type": "_doc",
                "_id": "8VI-FnYBP8VChxowl2Pr",
                "_score": 4.852654,
                "_source": {
                    "id": 3,
                    "title": "Javaday Istanbul 2020",
                    "text": "It was perfect as always!",
                    "created_date": 1606690949000000,
                    "published_date": null,
                    "author_id": 1,
                    "__op": "c",
                    "__table": "posts"
                }
            }
        ]
    }
}

Congratulations! You finished it ASAP! Now you can relax and maybe feel a little bit like a gangsta:)

By the way, if you are interested in the event presentation and the demo video, here they are! (P.S. The event was in Turkish.)
