SSIS – Running single SSIS packages in Parallel to get faster performance for multiple data loads
What we are going to show you below, is that by using ONE SSIS package, we can re-use this one SSIS package multiple times to get data from multiple sources, possibly from different databases hosted in different locations around the world. So in effect we are loading our data from one SSIS package in parallel.
NOTE: By completing the steps below we have seen our packages perform and complete a lot quicker due to running in parallel.
The thing to note with my example below is that all the database have exactly the same structure within the databases, where ever they are hosted.
· Databases are hosted in different locations around the world.
o And there were multiple databases on each database system.
· All the databases have exactly the same table and schema structures.
· Within each hosting location we have different IP Addresses and database names.
Our requirement was to load the data as quickly as possible from our multiple locations. This is how we achieved this.
NOTE: I am not going to go into the details on how we got our connection details to connect to each system. If someone has a requirement for this then I can possibly put this into another blog post.
What I am going to explain below is how we use a control SSIS package, which controls how we run our SSIS packages in parallel.
1. The first thing that we did was to create the following variable called:
c. Data Type:
d. NOTE: The reason that we create this is because of the following:
i. This is going to be used in the steps further down in our For Loop container to know when to exit the For Loop.
ii. It is also going to be used in the actual SSIS package that runs in parallel, and then it has completed to use a script task to populate this variable from our parent package.
2. As with normal data warehousing practice we first truncate our staging tables.
3. Next is where we have our sequence container.
a. It is within this sequence container that we put how many packages we want to run in parallel.
4. So within our Sequence container we put the following
a. First we dragged in a For Loop container inside our Sequence Container.
b. We then configured our For Loop container with the following:
ii. What we are saying above is when our variable from step 1 @LoopQuitparent not equals zero then finish or complete.
c. Next inside our For Loop Container we put an Execute Package Task
i. NOTE: This is our SSIS Package that is going to run multiple times or in Parallel.
ii. As per our example if you look at the Package Window on the left hand side you will see where we have configured our SSIS Package
d. Then all that you need to do is to copy and paste the For Loop Container as many times as you want to run in Parallel.
i. In our picture below we wanted to run it in parallel x 4
ii. NOTE: We just renamed each For Loop container so that we know how many we have.
iii. NOTE 2: It might be good to test how many will be the optimal number for you, as it is dependent on server resources, network interface and bandwidth to your locations of where the data is.
5. In the next steps we will explain how we configure our one SSIS package so that it can run in parallel as well as know when to exit.
Configuring our SSIS Package so that it can get the required details and run in parallel (SSIS-Parallel-LoadStaging.dtsx)
In the next steps I will explain how we get our one SSIS package (SSIS-Parallel-LoadStaging.dtsx) to get the required data into variables and then run in parallel.
1. It is going to be easiest to explain if I can show you an example of what our Source Table looks like below. This is where we store the following which will be explained after the picture for clarity.
i. We use this to uniquely identify our Source Systems. This is used throughout our SSIS Packages and is put into our data warehouse and Cube so that we know from which system the data came from.
ii. NOTE: This also enables us to get an overview of a particular system when it has more than one database.
i. This is the actual IP address that we will pass as an expression in our Connection manager in order to connect to the SQL Instance.
i. This is the standard SQL Port number, but if for some reason it was on a different port we would specify it here.
ii. NOTE: This can also get passed as an expression in our Connection manager.
i. This is used later within our SSIS package (SSIS-Parallel-LoadStaging.dtsx) and what we will explain later, is that when the row has been used and the data processed it changes the value from a 2 to a 1.
i. This is the database name within the SQL Instance system we are connecting to.
ii. NOTE: This is gets passed as an expression in our Connection Manager.
i. This is used in our dimensions later if we want to know where the actual data is coming from.
i. This is the actual SQL Instance name that we are connecting to.
i. This is part of our process, where we test to see if we can connect to the actual SQL Instance and databasename. If we can the IsActive is set to 1, and if not it is set to 0 (Zero)
ii. NOTE: We did not explain how we get the data into this table, as this would make this blog post a whole lot longer.
2. So what we do in our first step is to get the SourceSystemSK into a variable.
b. But we need to explain what else we do within our Execute SQL Task.
c. Below is the actual query with an explanation afterwards
— Section 1
— Section 2
— Section 3
— Section 4
— Section 5
— Section 6
— Section 7
— Section 8
i. — Section 1
ii. The first thing that we do is declare our @ServerID
iii. — Section 2
iv. Next we start a Begin Tran
1. NOTE: The reason for this is so that we can actually lock the table whilst this transaction is happening.
2. NOTE 2: This is so that when the SSIS Packages are running in parallel, we know for a fact that each package will only get a valid SourceSystemSK when trying to get another value. If it tries whilst another SSIS Package is busy it will wait until the transaction has been completed.
v. — Section 3
vi. We then set the RowCount to 1, this is so that we will only get one row back.
1. NOTE: we are doing this because we only ever want to get back one SourceSystemSK
vii. — Section 4
viii. Next is our TSQL Select statement
1. Here we are getting any SourceSystemSK from our table, where the IsProcess is set to 2.
2. We are then putting this into our @ServerID variable.
ix. — Section 5
x. Next we are updating our table using the TABLOCK hint so that we can once again lock the entire table, to ensure that for our SourceSystemSK row it gets updated from a 2 to 1.
1. NOTE: This then means that this data is and has been processed when the next package comes along looking for data to process.
xi. — Section 6
xii. We then set the RowCount back to 0 (zero)
xiii. — Section 7
xiv. Then we run a TSQL Select statement so that we can then pass our @ServerID as SourceSystemSK
1. NOTE: This is because we are using the SourceSystemSK and passing this into a variable in our Execute SQL Task.
2. NOTE 2: We also set the isNull to 0 (zero) so that when there are no more rows to process it defaults to zero, which is explained later.
xv. — Section 8
xvi. Finally we are then committing our transaction.
d. Within our Execute SQL Task we also map the result set to our variable called:
3. The next thing that we need to look into is our Exit Loop
b. NOTE: The reason that we have this is because we need to pass a variable from our child package back to our Parent Package.
c. So we put in a Script Task using Visual Basic 2010 and configured it with the following:
d. NOTE: Click on Edit script to open the Script Task editor
‘ Add your code here
Dts.Variables(“LoopQuitParent”).Value = 1
Dts.TaskResult = ScriptResults.Success
ii. What the above does is pass the Variable name LoopQuitParent value of 1 back to the Parent package.
iii. NOTE: This was configured in our For Loop Container in step 4 from the section above, where it was set to not equal to zero. So when it equals one then exit or complete
4. Next we creating another Execute SQL Task, and in this task it then uses a Query within a variable to get the IP Address and Database name for our server that we want to connect to.
b. NOTE: This is based on our SourceSystemSK variable which we populated in step 2 above.
c. This is what our Variable looks like, where we have configured it as an Expression
ii. As you can see from above when we run the above query, if our SourceSystemSK = 1 we would get the following details back
d. Then in our Result set for our Execute SQL Task we would have the following:
e. Now we currently have the following in our variables for our SSIS Package
i. SourceSystemSK, IPAddress and Database Name.
5. Now you need to create a local OLE DB Connection Manager in your SSIS Package.
a. NOTE: The reason that it must be a local connection, is so that when each package is running in parallel it will only populate the expressions for the current package.
b. With our example we created a local OLEDB connection with the following name:
c. Next right click and go into the Properties for your ServerConnection.local-OLE.
d. Where it says Expressions click on the Ellipses button and configure it with the following:
2. NOTE: This is the actual DatabaseName
2. NOTE: This is the actual IPAddress to connect to the server.
v. IT will look like the following:
e. Then click Ok.
f. NOTE: Ensure that your DatabaseName and IPAddress variables have valid default Values in your Variables
i. This is because when you put in your expressions above, it attempts to connect to the system to ensure that it is valid.
ii. This will save you a lot of time waiting for it to timeout and will ensure that it is indeed valid.
6. Now the next step is to put in what you want your SSIS Package to complete on each and every server.
b. With our example we would then get the Max RowID for each system which is stored in a separate table.
c. Within our table we simply have the Max RowID, SourceSystemSK and the date it got inserted.
d. Here is an example below:
f. This would then be in the Result Set
g. Then we have our Data Flow task which based on our Max RowID inserts the data into our Staging_tb_SalesInformation
7. Now the last thing to configure in this SSIS package (SSIS-Parallel-LoadStaging.dtsx) is the Precedence Constraint which can be seen in the picture below
b. The reason that we have this is so that we know for our package when it has completed going through our entire list of servers (and table with the connection details) and populate our LoopQuitParent variable back to our Parent Package, and let that particular For Loop Container complete.
c. NOTE: What we did in step 2 above is your will see that we set an IsNull in section 7 and set this to 0 (zero).
d. This is now used in our Precedence constraint and configured with the following for the “Get the IPAddress and DatabaseName into Variable based on SourceSystemSK” which is on the left hand side.
i. We configured the Precedence Constraint with the following:
iii. And here what we are saying is if we have any value greater than zero then get our values, and get the data down.
e. This is now used in our Precedence constraint and configured with the following for the “Exit Loop” which is on the right hand side.
i. We configured the Precedence Constraint with the following:
iii. And here what we are saying is when there are no more values for the SourceSystemSK, which will be set to zero, then exit out and populate the LoopQuitParent so that this can go back to the parent package.
8. Now we have configured our SSIS Package which can run in parallel.
9. This is what the entire package looks like
· Now when you run the Control package, and it runs, you will see all your For Loop containers running at once.
· The thing that you will not essentially see is if you have configured it to run in parallel with 4 SSIS packages, you will only see one running, but actually all 4 are running at once.
· Once it has all completed you will see your For Loop Container complete with the green arrow.
o You can then go into your Staging_tb_SalesInformation table and validate it to see that you have got all your required information from your Source Systems.
If you have any questions or queries then please do not hesitate to contact me.