As we have promised earlier, we have measured the performance of our new components Data Flow Source and Data Flow Destination. We have used the package provided by Jamie Thomson who has done similar test, comparing the performance of raw data file vs. ADO recordset. We have extended the package to include test for our components and another test without intermediate data flow step. The test without intermediate step is important because it establishes the baseline comparison. The current performance cannot be improved beyond the baseline, unless Microsoft makes changes to the SSIS data flow pipeline implementation. Here are the results:
Number of rows | Baseline execution time | Raw file execution time | Data Flow execution time | Data Flow vs. Baseline | Data Flow vs. Raw |
100000 | 5.531 | 6.156 | 6.203 | -10.83% | -0.76% |
200000 | 7.359 | 8.5 | 8.313 | -11.48% | 2.25% |
300000 | 9.14 | 10.891 | 10.438 | -12.44% | 4.34% |
400000 | 11 | 13.218 | 12.625 | -12.87% | 4.70% |
500000 | 12.765 | 15.563 | 14.75 | -13.46% | 5.51% |
600000 | 14.563 | 17.797 | 16.828 | -13.46% | 5.76% |
700000 | 16.391 | 20.218 | 18.875 | -13.16% | 7.12% |
800000 | 18.282 | 22.578 | 21.297 | -14.16% | 6.01% |
1000000 | 21.781 | 27.218 | 25.453 | -14.43% | 6.93% |
Also in chart form:
The performance improvement compared to the raw data file approach is evident. But what is even better is that the Data Flow Source and Data Flow Destination components are implemented to be non-blocking. So if the same test is executed on a dual or more processor machine, it will execute the reading and writing steps in parallel and show even better performance compared to the sequential raw data file approach. Another shortcoming of the raw data file approach is the fact that it depends on the file system and requires space to transfer the data. Some might say these days you have powerful subsystems with advanced caching, which is true of course if you can afford it.
If you want to check the test package, you may download it from here.
Leave a Reply