Big Data & ETL

Apache Spark

Apache Spark: Convert DataFrame to DataSet – Scala

  • Post author: pawel.ciesla
  • Post published: 26 March 2020
  • Post category: Apache Spark/Articles
  • Post comments: 0 Comments

You will often want strong typing on your data in Spark. The best way to get it is to use a DataSet instead of a DataFrame. In this post I give…


Apache Spark: ReduceByKey vs GroupByKey – differences and comparison

  • Post author: pawel.ciesla
  • Post published: 26 March 2020
  • Post category: Apache Spark/Articles
  • Post comments: 0 Comments

In this post I will introduce the main differences between the reduceByKey and groupByKey methods and explain why you should avoid the latter. Why? The answer is…


Apache Spark: how to rename or delete a file from HDFS?

  • Post author: pawel.ciesla
  • Post published: 10 February 2019
  • Post category: Apache Spark/Articles/Big Data/Programming Languages/Tips
  • Post comments: 2 Comments

In this short post I will show you how to rename the file or files that Apache Spark writes to HDFS, or simply rename or delete any file.


Scala: How to run a shell command from the code level?

  • Post author: pawel.ciesla
  • Post published: 10 February 2019
  • Post category: Apache Spark/Articles/Big Data/Programming Languages/Tips
  • Post comments: 0 Comments

In this post I will show you how to run a shell command from Scala code and how you can use it in Apache Spark.


Apache Spark: How to save DataFrame as a single file on HDFS?

  • Post author: pawel.ciesla
  • Post published: 10 February 2019
  • Post category: Apache Spark/Articles/Big Data
  • Post comments: 0 Comments

If you save a DataFrame as a file on HDFS, you may find that it is written as many files rather than one. This is the expected behavior and results from Apache Spark writing data in parallel.


Apache Spark: How to check if the file exists on HDFS?

  • Post author: pawel.ciesla
  • Post published: 10 February 2019
  • Post category: Apache Spark/Articles/Big Data
  • Post comments: 0 Comments

We will use the FileSystem and Path classes from the org.apache.hadoop.fs package to check whether a file exists on HDFS.


Apache Spark: Machine Learning – predicting diabetes in patients

  • Post author: pawel.ciesla
  • Post published: 7 February 2019
  • Post category: Apache Spark/Articles/Big Data/Machine Learning
  • Post comments: 0 Comments

Today I will show you how to use the machine learning (ML) library that is available in Spark under the name Spark MLlib.


How to install Apache Spark Standalone in CentOs?

  • Post author: pawel.ciesla
  • Post published: 20 August 2018
  • Post category: Apache Spark/Articles/Big Data
  • Post comments: 2 Comments

In this tutorial I will show you how to easily install Apache Spark on CentOS.


How to check if table exists in Apache Hive using Apache Spark?

  • Post author: pawel.ciesla
  • Post published: 20 August 2018
  • Post category: Apache Hive/Apache Spark/Big Data
  • Post comments: 2 Comments

A short tip on how to check if a table exists in Hive using Spark.


Spark SQL, is there a difference in performance when executing a SQL query and using the DataFrame / DataSet API?

  • Post author: pawel.ciesla
  • Post published: 15 August 2018
  • Post category: Apache Spark/Articles/Big Data
  • Post comments: 0 Comments

As in the title :)



About the Authors

Ewelina & Paweł

Hi, good to see you on our blog! :)
We hope you will find solutions to your questions here and learn new skills.
Ewelina is a Data Engineer with a passion for nature and landscape photography.
Paweł works as a Big Data Engineer and spends most of his free time playing the guitar and attending CrossFit classes.

Copyright 2021 - by BigData-ETL
Icon made by Freepik from www.flaticon.com