USING MERGE TO IMPLEMENT THE BEST SORT ALGORITHM

Sorting by the process of merging have been used for a long while now mainly to deal with the problem of large data files where internal Random Access Memory (RAM) is too expensive. In this case, each merge operation is done by placing the output merged list on a less expensive external storage medium. Because the time used for Input and Output operations is significantly larger than that used for the internal merge operation, this approach to sorting makes merge-sort significantly slower than other internal sorting techniques.

However, as we show in these pages, there are techniques that can be used to make the merge completely executed in RAM with no additional memory while at the same time using about the same number of comparisons to merge-sort. In the case of k-way merge-sort, we show an implementation that gives optimum comparisons with O(N) data moves.

 

MERGING

Merging is the process whereby two pre-sorted lists of data values are combined in a systematic manner to create a single sorted list. Originally, most sorting was done on large database systems using a technique known as merging. Merging is used because it easily allows a compromise between expensive internal Random Access Memory (RAM) and external magnetic mediums where the cost of RAM is an important factor. Originally, merging is an external method for combining two or more pre-sorted lists to create a single sorted list on an external medium.

The most natural way to merge two lists of m and n pre-sorted data values is to compare the values at the lowest (smallest) locations in both lists and then output the smaller of the two. The next two smallest values are compared in a similar manner and the smaller value outputted. This process is repeated until one of the lists becomes exhausted. Merging in this way requires m+n-1 comparisons and m+n data moves to create a sorted list of m+n data values from 2 pre-sorted lists of m and n data values each. The figure below illustrates an instance of this natural merge operation. The order in which each value in the A and B list is selected for output to the C list is labeled by their subscript in the second listing of A and B. Merging in this manner is easily made stable. In addition, this lower limit is guaranteed regardless of the nature of the input permutation. It has been shown that the lower limit on the number of comparison and data move operations when merging a set of m and n pre-sorted data values is m+n-1 and m+n respectively. However, the lower limit on the number of comparisons is further reduced by use of a binary search technique as explained in our book to be published soon.

A merge or sort algorithm is said to be in-place whenever it does merging or sorting with the use of no more than O(1) extra memory. In this case, the amount of memory required to implement the algorithm does not depend on the number of records in the input list(s). In addition, a merge algorithm is stable if it always leaves sequences of equal data values in the same order at the end of merging. For example, if we have the set of indexed values in the A list and the B list as follows.

INPUT LISTS

Presorted lists to be merged

If the merge algorithm is such that it keeps the indices of a sequence of equal values from a given list in sorted order, i.e. they are kept in the same order as they were before the merge as illustrated in the output example for the above input, the merge algorithm is said to be stable. Otherwise, the algorithm is said to be unstable. Output when the merge operation is stable is illustrated as follows

OUTPUT MERGE

Example stable merge

Note that, the merge operation that produces the output list C, selects values from A in preference to that from B whenever a tie occurs between them. We define stability in the general sense to mean that values from A are always given preference whenever a tie occurs between values in A and in B, and that the indices of equal set of values are kept in sorted order at the end of the merge operation. Therefore, an iterative or recursive merge-sorting scheme that uses such a stable merge and that split a larger list L into smaller sublists A and B is stable whenever the values in the A list always occupy smaller locations to those in the B list.

As a note of caution, the reader should be aware that there are many other techniques for merging pre-sorted lists apart from the classical methods listed herein. In fact, there are additional techniques as listed on these web page.


What about k-way Merge?

We can generalise this merge operation for the case where we have k pre-sorted lists. Where k is some positive integer > 2. We can use k pointers to keep track of the next smallest value in each list during the merge operation. However, if we are not careful about the manner in which we determine the next smallest value in the pre-sorted lists then the number of comparisons is no longer optimum. Therefore, we see that k-way merge suffers from about the same problem as the standard 2-way merge. We show that the minimum number of comparisons you can do when merging k sorted lists, each consisting of m data items is nlog2(k) - (k-1), where n = mk. More general, the number of comparisons is given by (sum ni)log2(k)-(k-1)over i=1, 2, ...k, where ni is the number of element in the ith pre-sorted list. We also show that there are techniques that can be used to perform k-way merge with the use of k additional pointers in-place, while at the same time achieving optimum comparisons and data moves. We use these techniques along with the ideas of block partitioning, data encoding and the internal buffer to implement an in-place k-way merge-sort that does optimum comparisons and O(N) data moves. Actually, what I have done is that I have unravelled a faulty default assumption we tend to make whenever we think about the use of pointers. I show 2 techniques in which k pointers can be implemented in-place, but at a certain cost! This cost differ in each case.


MERGE-SORT - A General Statement as Related to Sorting

In some place, sorting is described as a process of information gathering followed by data movement. Sorting by merging clearly demonstrate this process. In this case, the gathered information is inherent within the lists to be merged. Additional information is also gathered during the process of merging the sorted lists. Merging has the favourable attributes of using no more than m+n-1 comparisons and m+n data moves to merge two pre-sorted lists of m and n data values each. In addition, merging is stable. If you are not familiar with this subject, then you should bear in mind that we are talking about natural merge, which uses O(m+n) extra memory. Actually, you can do stable 2-way merge with the use of no more than O(Min(m, n)) extra memory. Where Min(m, n) is the smaller of m and n. Stability has proven to be a critical attribute for correctness of in-place merge algorithms that uses block rearrangement and the internal buffer. Using merge in an iterative or recursive manner, one can achieve a close hit of the theoretical lower bound on the number of comparisons and data moves for sorting. With classical 2-way merge-sort, this lower bound turns out to be Nlog2(N)-(N-1) for the number of comparisons and Nlog2(N) for the number of data moves when sorting a list of N data values. The operational complexity on the number of comparisons changes slightly but the number of data moves significantly when optimum k-way merge-sort is used for reasonable values of k >2. In fact, we have found this to be Nlog2(N)-Ne+NE/k and NE for the number of comparisons and the number of moves respectively, where e = log_2(N)/log2(k).

If the data file to be sorted is exceptionally large, it may not be practical to load all records at once into internal memory of the computer system in order to sort them. In this case, merging is the most natural means whereby the data file can be sorted from external memory. Merging is not normally used for sorting large data files in internal RAM because of the extra memory space required to merge pairs of sorted lists. This extra memory space is actually O(Min(m, n)).But is traditionally taken to be O(m+n). Please, do not misunderstand what I am saying here, I am talking about RAM, where it is easy to manipulate memory locations references.

THE BIG QUESTION

What is the most efficient general algorithm to order a set of data values in the internal memory of a computer base on their relative size? Is there a stable in-place algorithm that achieves the theoretical lower bound on the number of comparisons and data move operations?

To solve the ranking problem by comparison sort, we would like to use a general algorithm that has the following attributes
(1) Does ranking by comparison
(2) Not difficult to program

We would like our sort algorithm to have the following attributes.
(a) Number of comparison done is proportional to Nlog2(N). With small proportionality constant.
(b) Number of data moves must be less than or about the same as the number of comparisons. Ideally this should be O(N).
(c) Uses the same amount of extra memory regardless of the value of N.
(d) Keep equal set of data values in the same order at the end of sorting.

During a natural merge operation, each output value is place in its final location by a single data move. We define a data move as the process whereby a data value is moved from a selected location to a new location in main memory. A data move operation on most computer systems require a fixed number of memory reference and Central Processing Unit (CPU) register manipulation operations depending on the length of each data item and the length of each register in the CPU. We see that natural merge does not require the use of the swap( s, t) operator. We use the following macro to define a swap operation between the two data variables s and t. A variable is defined as a selected sequence of contiguous memory locations that will hold 2p different data values where p is the number of bits that constitute these memory locations. From this macro, we see that a swap operation requires three data moves.

It has been shown that the lower limit on the number of comparison and data move operations when merging a set of m and n pre-sorted data values is m+n-1 and m+n respectively. However, the lower limit on the number of comparisons is further reduce to m x t + n/2t, where t approx. log2 (m/n) and m < n. This is done by taking advantage of the fact that a fast search can be done on the list containing n data values with selected values from the list containing m by use of the binary search technique.


The Problem with Natural Merging

The merge operation requires an extra m+n memory location for output. Therefore, this method is not in-place. Therefore, for a long time the following question went unanswered. Is it possible to do optimum merge of two pre-sorted lists in-place? We see how this question has been answered with the advent of the ideas of Block Rearrangement and the Internal Buffer. We now look at Kronrod's original internal merge technique and related algorithms.

Make MERGE an optimum in-place operation! If you are very concern about this problem, you need not worry your self too much. I have found some solution base on existing ideas of Block Rearrangement, the Internal Buffer along with some other techniques. However, this may not be the end of the matter. I have not done any experiment to do comparisons apart from a rigorous theoretical analysis. In addition, you should not assume that the classical 2-way or k-way merge techniques are the only ones. You may very well be able to develop slightly better k-way techniques by use of other known 2-way merge techniques. In fact, you may be creative enough to be able to develop a non-classical k-way merge technique that has better attributes than that I have done so far!

Click Next button for information about Kronrod's Technique for Optimum In-place Merge.

Page 2 of 7

Information about in-place mergeInformation about Kronrod's merge technique
Home Purchase LogicCoder Programming Books Free Downloads About LogicExtractor