The secret behind the parallelism of Data Pump import & export jobs


Data Pump is one of the most important utilities Oracle has provided to database administrators (since 10g). Data migration, schema migration, and logical backups are everyday tasks in the database industry, so this post is about using Data Pump to its fullest. Let's take a look!

How the Parallel Parameter Works for Export Operations:

For the export operations described in this section, it is assumed that the user is using Oracle Database Enterprise Edition and the wildcard option with the DUMPFILE parameter. In a typical export that includes both data and metadata, the first worker process will unload the metadata: tablespaces, schemas, grants, roles, tables, indexes, and so on. This single worker unloads the metadata, and all the rest unload the data, all at the same time. If the metadata worker finishes and there are still data objects to unload, it will start unloading the data too. The examples in this document assume that there is always one worker busy unloading metadata while the rest of the workers are busy unloading table data objects.
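As a minimal sketch of such an export (the credentials, directory object, schema, and file names below are hypothetical), the %U wildcard lets Data Pump create a separate dump file for each worker:

    # Hypothetical credentials, directory object, and schema.
    # %U expands to a two-digit number, giving each worker its own file.
    expdp system/password SCHEMAS=scott DIRECTORY=dpump_dir \
          DUMPFILE=scott_%U.dmp LOGFILE=scott_exp.log PARALLEL=4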

For every export operation, Data Pump estimates how much disk space each table data object in the export job will consume (in bytes). This is done whether or not the user uses the ESTIMATE parameter. The estimate is printed in the log file and displayed on the client’s standard output device. The estimate is for table row data only; it does not include metadata. This estimate is used to determine how many PX processes will be applied to the table data object, if any.

The columns of the tables are examined to determine if direct path, external tables, or both methods can be used. For direct path, the parallel number for the table data object is always one, since direct path does not support parallel unload of a table data object. PX processes are only used with external tables. If the external tables method is chosen, Data Pump determines the maximum number of PX processes that can work on a table data object by dividing the estimated size of the table data object by 250 MB and rounding the result down. If the result is zero or one, PX processes are not used to unload the table.

For example, if the table size is 600 MB, there will be two Parallel Execution Processes: 600 MB divided by the 250 MB parallel threshold, rounded down, is two. In this case there will also be two worker processes: one for the metadata and one for the data. The worker process for the data acts as the coordinator for the PX processes and does not count toward the parallel total, so the degree of parallelism used by the export job is three: one for the metadata worker and two for the PX processes. If the user looks at the log file or uses the interactive STATUS command to see what the workers are doing, only two worker processes will be visible; the PX processes are not visible in expdp and impdp.

Consider another example where the table size is 400 MB. In this case there will be no Parallel Execution Process: when 400 MB is divided by 250 MB, the resulting parallel value is one. The worker will unload the data using either direct path or external tables, without any parallelism.

If a job is not big enough to make use of the maximum parallel number, the user will not see the maximum number of active workers and Parallel Execution Processes. For example, if there is one 800 MB table and it has been determined that external tables will be used, there will be one worker for the metadata, one worker for the data, and three PX processes (800 MB divided by 250 MB, rounded down). As mentioned above, the worker process for the data acts as the coordinator for the PX processes and does not count toward the parallel total. So, if a user specifies PARALLEL=10, the degree of parallelism is actually four, and the user will see only one active worker in the STATUS display. Data Pump is working optimally; the job is simply too small for the specified degree of parallelism.

What happens if there is a large job and PARALLEL=4? Can the Master Control Process automatically add more workers if they are needed? No: the PARALLEL parameter can only go up to the maximum number that was specified by the user. If the degree of parallelism for a table would cause the job to exceed the limit specified by the PARALLEL parameter, the number of Parallel Execution Processes is scaled down to fit within the limit. The user can, however, increase the parallelism of the job via the interactive command line.

In a RAC environment, Parallel Execution Processes can run on other instances, so the user might not see them running on the same instance where the Data Pump job is running. This can make it even less obvious that Data Pump is running optimally.
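As a sketch of that interactive adjustment (the credentials are hypothetical, and the job name shown follows the typical system-generated pattern but is made up here):

    # Attach to the running export job.
    expdp system/password ATTACH=SYS_EXPORT_SCHEMA_01

    # At the interactive prompt, inspect the workers and raise the limit:
    Export> STATUS
    Export> PARALLEL=8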
How the Parallel Parameter Works for Import Operations:

The PARALLEL parameter works a bit differently in Import than in Export. Because various dependencies exist when creating objects during import, everything must be done in order. For Import, no data loading can occur until the tables are created, because data cannot be loaded into tables that do not yet exist. Data Pump Import processes the database objects in the following order:
• The first worker begins to load all the metadata (tablespaces, schemas, etc.) until all the tables are created.
• Once the tables are created, the first worker starts loading data instead of metadata, and the rest of the workers start loading data too.
• Once the table data is loaded, the first worker returns to loading metadata again. The rest of the workers are idle until the first worker loads all the metadata up to package bodies.
• Multiple workers load package bodies in parallel.
• One worker loads metadata up to and including secondary tables.
• Multiple workers load secondary table data.
• One worker loads the remaining metadata.

Note: One worker creates all the indexes but uses PX processes up to the PARALLEL value so that indexes get created faster.

Thus, an import job can be started with PARALLEL=10, and the user will see only one worker being utilized at certain points during job execution. No other workers or Parallel Execution Processes will be working until all the tables are created. When the tables are created, a burst of workers and possibly PX processes will execute in parallel until the data is loaded; then the worker processes become idle. When loading table data objects, Data Pump divides the size of the data in the dump file by 250 MB to estimate the number of potential PX processes that can be used to load the data. If the number is greater than one, Data Pump can use PX processes to load the data, assuming there is enough parallelism available. Note that table data objects that are large enough can be loaded in parallel even if they were unloaded by one worker.
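A matching parallel import sketch under the same assumptions (hypothetical credentials, directory object, schema, and dump file names):

    # The %U wildcard picks up the numbered files produced by the export.
    impdp system/password SCHEMAS=scott DIRECTORY=dpump_dir \
          DUMPFILE=scott_%U.dmp LOGFILE=scott_imp.log PARALLEL=4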
Getting the Most From the Parallel Parameter:

Here is a general guideline for what should be considered when using the PARALLEL parameter:

• Set the degree of parallelism to two times the number of CPUs, then tune from there.
• For Data Pump Export, the PARALLEL parameter value should be less than or equal to the number of dump files.
• For Data Pump Import, the PARALLEL parameter value should not be much larger than the number of files in the dump file set.
• A PARALLEL value greater than one is available only in Oracle Database Enterprise Edition.
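Putting those guidelines together, here is a sketch for a hypothetical 8-CPU host, so PARALLEL = 2 x 8 = 16 (all names below are made up):

    # FILESIZE caps each piece, and %U creates as many files as the workers
    # need, so the PARALLEL value never exceeds the number of dump files.
    expdp system/password FULL=Y DIRECTORY=dpump_dir \
          DUMPFILE=full_%U.dmp FILESIZE=5G LOGFILE=full_exp.log PARALLEL=16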
