We often need to copy data between directories on HDFS. There are several ways to do it, and the right one depends on how complex the copy is: you may want to copy only the files that do not yet exist in the target (without overwriting anything), or you may want every file to be copied again.
How to copy files from one directory to another on HDFS?
hdfs dfs -cp
First, let’s consider a simpler method, which is copying files using the hdfs client and the -cp command. Please take a look at the following command:
hdfs dfs -cp -f /source/path/* /target/path
This command copies data from the source location to the destination. The star (*) selects every file in the source directory, and the -f parameter makes the copy overwrite files that already exist in the target directory.
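If you want the opposite behaviour (copy only the files that are not in the target yet), one option is to check each file before copying it. The function below is only a sketch of that idea: the name copy_missing is made up for this post, the paths are placeholders, and it assumes your hdfs client supports the -C flag of -ls (bare paths only), which older releases may not have.

```shell
# Sketch: copy only entries that do not yet exist in the target directory.
# copy_missing is a hypothetical helper name; it assumes `hdfs` is on PATH.
copy_missing() {
  local src_dir="$1"
  local dst_dir="$2"
  local src name
  # -ls -C prints one bare path per line (assumed to be supported).
  hdfs dfs -ls -C "$src_dir" | while IFS= read -r src; do
    name="${src##*/}"                      # keep only the last path component
    # </dev/null stops the client from consuming the list we are reading.
    if hdfs dfs -test -e "$dst_dir/$name" </dev/null; then
      echo "skip $name"                    # already in the target, left alone
    else
      hdfs dfs -cp "$src" "$dst_dir/" </dev/null && echo "copied $name"
    fi
  done
}

# Usage (placeholder paths):
#   copy_missing /source/path /target/path
```

Only the top-level entries of the source directory are checked here, and every check is a separate client call, so this is reasonable for a handful of files but slow for large directories.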
hadoop distcp
The second (more complex) method is to use the hadoop client with the distcp tool. Take a look at the following command:
hadoop distcp /source/path /target/path
When you run this command, a MapReduce job is launched. It checks whether the target directory exists, how many files there are to copy, and so on. Unless you say otherwise, the job is submitted to the default queue.
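If you would like to look at the same things yourself before starting the job, a small pre-flight check can help. This is just a sketch: preflight is a made-up helper name, the paths are placeholders, and it relies on hdfs dfs -test -d and on the second column of hdfs dfs -count being the file count.

```shell
# Sketch: report whether the target exists and how many files the source holds.
# preflight is a hypothetical helper name; it assumes `hdfs` is on PATH.
preflight() {
  local src="$1"
  local dst="$2"
  local files
  if hdfs dfs -test -d "$dst"; then
    echo "target exists: $dst"
  else
    echo "target missing: $dst"
  fi
  # -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  files="$(hdfs dfs -count "$src" | awk '{print $2}')"
  echo "files to copy: $files"
}

# Usage (placeholder paths):
#   preflight /source/path /target/path
```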
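You can specify the job name, the queue the job is submitted to, and essentially any other -Dmapred.* parameter. On Hadoop 2 and later, mapred.job.name and mapred.job.queue.name are deprecated aliases of mapreduce.job.name and mapreduce.job.queuename, but they are still accepted. For example: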
hadoop distcp -Dmapred.job.name=my-first-distcp-job -Dmapred.job.queue.name=my-queue /source/path /target/path
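If you run such copies regularly, it may be handy to wrap the command in a small function that fills in the job name and queue and reports the outcome. The sketch below assumes the hadoop client is on PATH; run_distcp and the "distcp-<last path component>" naming scheme are invented for this example.

```shell
# Sketch: run distcp with a job name and queue, then report the exit status.
# run_distcp and the job naming scheme are hypothetical, not part of Hadoop.
run_distcp() {
  local queue="$1"
  local src="$2"
  local dst="$3"
  local job_name="distcp-${src##*/}"   # e.g. /source/path -> distcp-path
  if hadoop distcp \
       -Dmapred.job.name="$job_name" \
       -Dmapred.job.queue.name="$queue" \
       "$src" "$dst"; then
    echo "distcp OK: $src -> $dst"
  else
    echo "distcp FAILED: $src -> $dst" >&2
    return 1
  fi
}

# Usage (placeholder queue and paths):
#   run_distcp my-queue /source/path /target/path
```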
You can specify more options. Let’s see the full list below:
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
 -append                      Reuse existing data in target files and append new data to them if possible
 -async                       Should distcp execution be blocking
 -atomic                      Commit all changes or none
 -bandwidth <arg>             Specify bandwidth per map in MB
 -blocksperchunk <arg>        If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination. By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements getBlockLocations method and the target file system implements concat method
 -copybuffersize <arg>        Size of the copy buffer to use. By default <copybuffersize> is 8192B.
 -delete                      Delete from target, files missing in source
 -diff <arg>                  Use snapshot diff report to identify the difference between source and target
 -f <arg>                     List of files that need to be copied
 -filelimit <arg>             (Deprecated!) Limit number of files copied to <= n
 -filters <arg>               The path to a file containing a list of strings for paths to be excluded from the copy.
 -i                           Ignore failures during copy
 -log <arg>                   Folder on DFS where distcp execution logs are saved
 -m <arg>                     Max number of concurrent maps to use for copy
 -mapredSslConf <arg>         Configuration for ssl config file, to use with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>  Number of threads to use for building file listing (max 40).
 -overwrite                   Choose to overwrite target files unconditionally, even if they exist.
 -p <arg>                     preserve status (rbugpcaxt)(replication, block-size, user, group, permission, checksum-type, ACL, XATTR, timestamps). If -p is specified with no <arg>, then preserves replication, block size, user, group, permission, checksum type and timestamps. raw.* xattrs are preserved when both the source and destination paths are in the /.reserved/raw hierarchy (HDFS only). raw.* xattr preservation is independent of the -p flag. Refer to the DistCp documentation for more details.
 -rdiff <arg>                 Use target snapshot diff report to identify changes made on target
 -sizelimit <arg>             (Deprecated!) Limit number of files copied to <= n bytes
 -skipcrccheck                Whether to skip CRC checks between source and target paths.
 -strategy <arg>              Copy strategy to use. Default is dividing work based on file sizes
 -tmp <arg>                   Intermediate work path to be used for atomic commit
 -update                      Update target, copying only missing files or directories
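Coming back to the two cases from the beginning of this post: -update roughly covers "copy only what is missing" (it also re-copies files that differ from the source), while -overwrite covers "copy everything again". The sketch below picks the flags from a mode name. sync_dir and the mode names are invented for this example, and the -m and -bandwidth values are arbitrary placeholders you would tune for your cluster.

```shell
# Sketch: choose distcp flags for the copy behaviour you want.
# sync_dir and its mode names are hypothetical; 20 maps and 50 MB per map
# are illustrative values only.
sync_dir() {
  local mode="$1"
  local src="$2"
  local dst="$3"
  case "$mode" in
    missing) set -- -update ;;            # copy missing or changed files only
    all)     set -- -overwrite ;;         # copy every file again
    mirror)  set -- -update -delete ;;    # also delete target files not in source
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  hadoop distcp "$@" -m 20 -bandwidth 50 "$src" "$dst"
}

# Usage (placeholder paths):
#   sync_dir missing /source/path /target/path
```

Keep in mind that -delete only works together with -update or -overwrite, and that with either of those two flags distcp copies the contents of the source directory into the target rather than the source directory itself, so it is worth trying the command on a test path first.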
That’s all on the topic of copying files from one directory to another on HDFS (Hadoop). Enjoy!