Apache Kafka: How to delete data from a Kafka topic?
When working with Apache Kafka, we may need to delete data from a topic, because e.g. junk data was sent during testing, and we have not…
Often you will want strong typing on your data in Spark. The best way to get it is to use a Dataset instead of a DataFrame. In this post I give…
In this post I will introduce you to the main differences between the reduceByKey and groupByKey methods and explain why you should avoid the latter. But why? The answer is…
In this short post I will show how you can run the Cloudera QuickStart using Docker. As you know from my previous post, I am a big fan of Docker and…
In this post I will show you a few ways to export data from Hive to a CSV file. For this tutorial I have prepared a Hive table "test_csv_data" with few…
In this post I will show you how to create a fully operational environment in 5 minutes, which will include: Apache Airflow WebServer, Apache Airflow Worker, Apache Airflow Scheduler, Flower - is a…
In today's world, we often meet requirements for real-time data processing. There are quite a few tools on the market that allow us to achieve this. At the forefront we…
In this short post I will show you how to change the name of the file or files that Apache Spark writes to HDFS, or simply rename or delete any file.
In this post I will show you how to run a shell command programmatically in Scala and how you can use it in Apache Spark.
If you want to save a DataFrame as a file on HDFS, you may find that it is saved as many files. This is the expected behavior and it results from the parallel processing in Apache Spark.