How to copy files from one directory to another on HDFS (Hadoop)? – Start doing it the 1 right way!

We often need to copy data between directories on HDFS. There are several ways to do it, and the right one depends on how complex the copy is: you may want to copy only the files that do not yet exist in the target (without overwriting anything), or you may want every file to be copied again.

How to copy files from one directory to another on HDFS?

hdfs dfs -cp

First, let’s consider the simpler method: copying files with the hdfs client and its -cp command. Take a look at the following command:

hdfs dfs -cp -f /source/path/* /target/path

This command copies data from the source directory to the target directory. The star (*) matches every file in the source directory, so all of them are copied. The -f parameter means that files which already exist in the target directory are overwritten; without it, the copy of such a file fails with a "File exists" error instead.
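If you want to copy only the files that are not in the target yet, one option is to check each file before copying it. The following is a minimal sketch, not a definitive implementation: copy_missing is a hypothetical helper name, the paths are placeholders, and it assumes a Hadoop release in which hdfs dfs -ls supports the -C option (print paths only) and file names without newlines.

```shell
#!/bin/sh
# Sketch: copy only the files that do not yet exist in the target directory.
# copy_missing is a hypothetical helper; the paths are placeholders.
copy_missing() {
    src_dir=$1
    dst_dir=$2
    # -ls -C prints bare paths, one per line (assumes a release that supports -C).
    hdfs dfs -ls -C "$src_dir" | while IFS= read -r src_path; do
        name=${src_path##*/}
        # -test -e exits with status 0 when the path exists.
        # stdin is redirected so the client cannot consume the file list.
        if hdfs dfs -test -e "$dst_dir/$name" < /dev/null; then
            echo "skip $name"
        else
            hdfs dfs -cp "$src_path" "$dst_dir/" < /dev/null && echo "copied $name"
        fi
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# copy_missing /source/path /target/path
```

Each check and each copy starts a separate client process, so this approach is only practical for a modest number of files.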

hadoop distcp

The second (more complex) method is to use the hadoop client with the distcp tool. Take a look at the following command:

hadoop distcp /source/path /target/path

When you run this command, a MapReduce job is launched. Before copying, distcp checks whether the target directory exists, how many files there are to copy, and so on. The job is submitted to the default queue.
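Once the job has finished, a quick sanity check is to compare the file count and total size of the two directories. The sketch below relies on hdfs dfs -count, which prints the directory count, file count, content size and path name. The helper names count_summary and compare_dirs are hypothetical, and the paths you pass are whichever two directories you expect to hold the same data.

```shell
#!/bin/sh
# Sketch: compare file count and total size of two HDFS directories.
# count_summary and compare_dirs are hypothetical helper names.
count_summary() {
    # hdfs dfs -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    hdfs dfs -count "$1" | awk '{ print $2 " files, " $3 " bytes" }'
}

compare_dirs() {
    src_summary=$(count_summary "$1")
    dst_summary=$(count_summary "$2")
    if [ "$src_summary" = "$dst_summary" ]; then
        echo "match: $src_summary"
    else
        echo "mismatch: source has $src_summary, target has $dst_summary"
        return 1
    fi
}

# Example call (commented out so that sourcing this file has no side effects):
# compare_dirs /source/path /target/path
```

Matching counts and sizes do not prove that the contents are identical; they only catch missing or truncated files.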

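You can specify the job name, the queue the job is submitted to, and essentially any other MapReduce job property by passing -D parameters. The mapred.* names used here are the older spellings; newer Hadoop releases call them mapreduce.job.name and mapreduce.job.queuename and treat the old names as deprecated aliases. For example: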

hadoop distcp -Dmapred.job.name=my-first-distcp-job -Dmapred.job.queue.name=my-queue /source/path /target/path
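If you submit such jobs often, a small wrapper can assemble the command for you. The sketch below only prints the command so that you can review it first. distcp_cmd is a hypothetical helper name, and it uses the property names mapreduce.job.name and mapreduce.job.queuename, which replaced the deprecated mapred.job.name and mapred.job.queue.name.

```shell
#!/bin/sh
# Sketch: build a distcp command line with a job name and a queue.
# distcp_cmd is a hypothetical helper; names and paths are placeholders.
distcp_cmd() {
    job_name=$1
    queue=$2
    src=$3
    dst=$4
    # Generic -D options go right after "distcp", before the paths.
    echo "hadoop distcp -Dmapreduce.job.name=$job_name -Dmapreduce.job.queuename=$queue $src $dst"
}

# Print the command for review; pipe the output to sh to actually run it.
distcp_cmd my-first-distcp-job my-queue /source/path /target/path
```

Printing instead of executing keeps the helper safe to experiment with on a machine that has no cluster access.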

distcp accepts many more options. Here is the full list, as printed in the tool’s usage message:

usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, files with more
                               blocks than this value will be split into
                               chunks of <blocksperchunk> blocks to be
                               transferred in parallel, and reassembled on
                               the destination. By default,
                               <blocksperchunk> is 0 and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target
                               file system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <copybuffersize> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattr preservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missing files or
                               directories
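
A few of these options cover the cases mentioned at the beginning: -update copies what is missing from the target (per the DistCp documentation it also replaces files that differ from the source), -overwrite copies everything again, and -delete, which is only valid together with -update or -overwrite, removes target files that no longer exist in the source. The sketch below maps three copy modes to these flags and prints the resulting command. The helper names, the mode names, and the values for -m and -bandwidth are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: pick distcp flags for three common copy modes.
# sync_flags, build_sync and the mode names are hypothetical.
sync_flags() {
    case "$1" in
        missing) echo "-update" ;;          # copy what is missing or differs in the target
        all)     echo "-overwrite" ;;       # copy everything again, replacing existing files
        mirror)  echo "-update -delete" ;;  # also delete target files gone from the source
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print the full command. -m caps the number of concurrent maps and
# -bandwidth caps the bandwidth per map in MB; 10 and 50 are example values.
build_sync() {
    flags=$(sync_flags "$1") || return 1
    echo "hadoop distcp $flags -m 10 -bandwidth 50 $2 $3"
}

build_sync missing /source/path /target/path
```

Be careful with the mirror mode: -delete removes data from the target, so it is worth reviewing the printed command before running it.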

That’s all on how to copy files from one directory to another on HDFS (Hadoop). Enjoy!
