16 elements means a 4x4 table. Sorting each column iteratively means splitting it into 2x2 tables and combining them back into a 4-row column. However, with 9 elements one could instead make a 3x3 table and sort each column directly with 3-4 comparisons, rather than doing 1 comparison followed by a full insertion sort of the 3 elements, with all the overhead that comes with it.

Also, after sorting the 4 columns, what comes next? Should the next pass use a 2x8 or a 3x6 table, or go straight to insertion sort? According to the article it must be insertion sort, but maybe in this particular case 2x8 would be better? And which table should one start with? When is 3x6 better than 4x4? What about 5x4?
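A minimal Python sketch for comparing these layouts, assuming the tables are the usual Shellsort picture: with gap h the array is laid out in rows of width h, so "C x R" means C columns (gap C), and sorting all the columns is one gapped insertion-sort pass. The names `h_sort`, `shellsort` and `avg_comparisons` are made up for this sketch. It only counts key comparisons (not loop overhead), and it sorts every column with plain insertion sort, so it does not model a hand-tuned 3-comparison column sort for the 3x3 case.

```python
import random
from typing import List, Sequence


def h_sort(a: List[int], h: int) -> int:
    """Insertion-sort each column a[c], a[c+h], a[c+2h], ... in place.

    Laying the array out in rows of width h, these are the table's h columns.
    Returns the number of key comparisons performed."""
    comparisons = 0
    for i in range(h, len(a)):
        x = a[i]
        j = i
        # Shift larger elements of the same column down by one row.
        while j >= h:
            comparisons += 1
            if a[j - h] <= x:
                break
            a[j] = a[j - h]
            j -= h
        a[j] = x
    return comparisons


def shellsort(a: List[int], gaps: Sequence[int]) -> int:
    """Run one h_sort pass per gap; the last gap must be 1 (plain insertion sort)."""
    if not gaps or gaps[-1] != 1:
        raise ValueError("last gap must be 1 to guarantee a sorted result")
    return sum(h_sort(a, h) for h in gaps)


def avg_comparisons(n: int, gaps: Sequence[int], trials: int = 2000, seed: int = 0) -> float:
    """Average comparison count over random permutations of range(n)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = list(range(n))
        rng.shuffle(a)
        total += shellsort(a, gaps)
        assert a == list(range(n))
    return total / trials


if __name__ == "__main__":
    # Hypothetical candidate schedules for 16 elements ("C x R" = gap C).
    candidates = {
        "insertion sort only": [1],
        "4x4, insertion": [4, 1],
        "2x2 per column, 4x4, ins.": [8, 4, 1],
        "4x4, 2x8, insertion": [4, 2, 1],
        "3x6, insertion": [3, 1],
        "5x4, insertion": [5, 1],
    }
    for name, gaps in candidates.items():
        print(f"{name:28s} {avg_comparisons(16, gaps):.1f}")
```

Swapping `n` and the gap lists lets you check the 9-element 3x3 case or any other layout the same way.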
Sorry for the rant, but I would be interested in some input.
Meanwhile, I will do some testing...