Get Our Extension

Move-to-front transform

From Wikipedia, in a visual modern way

The move-to-front (MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression. When efficiently implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression algorithm.

This algorithm was first published by B. Ryabko under the name of "book stack" in 1980.[1] Subsequently, it was rediscovered by J.K. Bentley et al. in 1986,[2] as attested in the explanatory note.[3]

Discover more about Move-to-front transform related topics



In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is an invention of language, which enabled a person, through speech, to communicate what they thought, saw, heard, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time.



In the pursuit of knowledge, data are a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted. A datum is an individual value in a collection of data. Data are usually organized into structures such as tables that provide additional context and meaning, and which may themselves be used as data in larger structures. Data may be used as variables in a computational process. Data may represent abstract ideas or concrete measurements. Data are commonly used in scientific research, finance, and in virtually every other form of human organizational activity. Examples of data sets include stock prices, crime rates, unemployment rates, literacy rates, and census data. In this context, data represents the raw facts and figures which can be used in such a manner in order to capture the useful information out of it.



The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit of memory in many computer architectures. To disambiguate arbitrarily sized bytes from the common 8-bit definition, network protocol documents such as The Internet Protocol refer to an 8-bit byte as an octet. Those bits in an octet are usually counted with numbering from 0 to 7 or 7 to 0 depending on the bit endianness. The first bit is number 0, making the eighth bit number 7.

Data compression

Data compression

In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder.



In mathematics and computer science, an algorithm is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can perform automated deductions and use mathematical and logical tests to divert the code execution through various routes. Using human characteristics as descriptors of machines in metaphorical ways was already practiced by Alan Turing with terms such as "memory", "search" and "stimulus".

The transform

The main idea is that each symbol in the data is replaced by its index in the stack of “recently used symbols”. For example, long sequences of identical symbols are replaced by as many zeroes, whereas when a symbol that has not been used in a long time appears, it is replaced with a large number. Thus at the end the data is transformed into a sequence of integers; if the data exhibits a lot of local correlations, then these integers tend to be small.

Let us give a precise description. Assume for simplicity that the symbols in the data are bytes. Each byte value is encoded by its index in a list of bytes, which changes over the course of the algorithm. The list is initially in order by byte value (0, 1, 2, 3, ..., 255). Therefore, the first byte is always encoded by its own value. However, after encoding a byte, that value is moved to the front of the list before continuing to the next byte.

An example will shed some light on how the transform works. Imagine instead of bytes, we are encoding values in a–z. We wish to transform the following sequence:


By convention, the list is initially (abcdefghijklmnopqrstuvwxyz). The first letter in the sequence is b, which appears at index 1 (the list is indexed from 0 to 25). We put a 1 to the output stream:


The b moves to the front of the list, producing (bacdefghijklmnopqrstuvwxyz). The next letter is a, which now appears at index 1. So we add a 1 to the output stream. We have:


and we move the letter a back to the top of the list. Continuing this way, we find that the sequence is encoded by:

Iteration Sequence List
bananaaa 1 (abcdefghijklmnopqrstuvwxyz)
bananaaa 1,1 (bacdefghijklmnopqrstuvwxyz)
bananaaa 1,1,13 (abcdefghijklmnopqrstuvwxyz)
bananaaa 1,1,13,1 (nabcdefghijklmopqrstuvwxyz)
bananaaa 1,1,13,1,1 (anbcdefghijklmopqrstuvwxyz)
bananaaa 1,1,13,1,1,1 (nabcdefghijklmopqrstuvwxyz)
bananaaa 1,1,13,1,1,1,0 (anbcdefghijklmopqrstuvwxyz)
bananaaa 1,1,13,1,1,1,0,0 (anbcdefghijklmopqrstuvwxyz)
Final 1,1,13,1,1,1,0,0 (anbcdefghijklmopqrstuvwxyz)

It is easy to see that the transform is reversible. Simply maintain the same list and decode by replacing each index in the encoded stream with the letter at that index in the list. Note the difference between this and the encoding method: The index in the list is used directly instead of looking up each value for its index.

i.e. you start again with (abcdefghijklmnopqrstuvwxyz). You take the "1" of the encoded block and look it up in the list, which results in "b". Then move the "b" to front which results in (bacdef...). Then take the next "1", look it up in the list, this results in "a", move the "a" to front ... etc.


Details of implementation are important for performance, particularly for decoding. For encoding, no clear advantage is gained by using a linked list, so using an array to store the list is acceptable, with worst-case performance O(nk), where n is the length of the data to be encoded and k is the number of values (generally a constant for a given implementation).

The typical performance is better because frequently-used symbols are more likely to be at the front and will produce earlier hits. This is also the idea behind a Move-to-front self-organizing list.

However, for decoding, we can use specialized data structures to greatly improve performance.


This is a possible implementation of the move-to-front algorithm in Python.

from typing import List, Tuple, Union
# Instead of always transmitting an "original" dictionary, it is simpler to just agree on an initial set.
# Here we use the 256 possible values of a byte:
common_dictionary = list(range(256))

def encode(plain_text: str) -> List[int]:
    # Change to bytes for 256.
    plain_text = plain_text.encode('utf-8')

    # Changing the common dictionary is a bad idea. Make a copy.
    dictionary = common_dictionary.copy()

    # Transformation
    compressed_text = list()          # Regular array
    rank = 0

    # Read in each character
    for c in plain_text:
        rank = dictionary.index(c)    # Find the rank of the character in the dictionary [O(k)]
        compressed_text.append(rank)  # Update the encoded text

        # Update the dictionary [O(k)]
        dictionary.insert(0, c)

    return compressed_text            # Return the encoded text

The inverse will recover the original text:

def decode(compressed_data: List[int]) -> str:
    compressed_text = compressed_data
    dictionary = common_dictionary.copy()
    plain_text = []

    # Read in each rank in the encoded text
    for rank in compressed_text:
        # Read the character of that rank from the dictionary

        # Update the dictionary
        e = dictionary.pop(rank)
        dictionary.insert(0, e)

    return bytes(plain_text).decode('utf-8')  # Return original string

Example output:

>>> import mtfwiki
>>> mtfwiki.encode('Wikipedia')
[87, 105, 107, 1, 112, 104, 104, 3, 102]
>>> mtfwiki.decode([119, 106, 108, 1, 113, 105, 105, 3, 103])

In this example we can see the MTF code taking advantage of the three repetitive i's in the input word. The common dictionary here, however, is less than ideal since it is initialized with more commonly used ASCII printable characters put after little-used control codes, against the MTF code's design intent of keeping what's commonly used in the front. If one rotates the dictionary to put the more-used characters in earlier places, a better encoding can be obtained:

>>> import mtfwiki
>>> block32 = lambda x : [x + i for i in range(32)]
>>> # Sort the ASCII blocks: first lowercase, then uppercase, punctuation/number, and finally the control code and the non-ASCII stuff
>>> mtfwiki.common_dictionary = block32(0x60) + block32(0x40) + block32(0x20) + block32(0x00) + list(range(128, 256))
>>> mtfwiki.encode('Wikipedia')
[55, 10, 12, 1, 17, 9, 9, 3, 7]

Discover more about Implementation related topics

Linked list

Linked list

In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence. In its most basic form, each node contains: data, and a reference to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. More complex variants add additional links, allowing more efficient insertion or removal of nodes at arbitrary positions. A drawback of linked lists is that access time is linear. Faster access, such as random access, is not feasible. Arrays have better cache locality compared to linked lists.

Big O notation

Big O notation

Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by Paul Bachmann, Edmund Landau, and others, collectively called Bachmann–Landau notation or asymptotic notation. The letter O was chosen by Bachmann to stand for Ordnung, meaning the order of approximation.

Python (programming language)

Python (programming language)

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.



ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of technical limitations of computer systems at the time it was invented, ASCII has just 128 code points, of which only 95 are printable characters, which severely limited its scope. All modern computer systems instead use Unicode, which has millions of code points, but the first 128 of these are the same as the ASCII set.

Use in practical data compression algorithms

The MTF transform takes advantage of local correlation of frequencies to reduce the entropy of a message. Indeed, recently used letters stay towards the front of the list; if use of letters exhibits local correlations, this will result in a large number of small numbers such as "0"'s and "1"'s in the output.

However, not all data exhibits this type of local correlation, and for some messages, the MTF transform may actually increase the entropy.

An important use of the MTF transform is in Burrows–Wheeler transform based compression. The Burrows–Wheeler transform is very good at producing a sequence that exhibits local frequency correlation from text and certain other special classes of data. Compression benefits greatly from following up the Burrows–Wheeler transform with an MTF transform before the final entropy-encoding step.


As an example, imagine we wish to compress Hamlet's soliloquy (To be, or not to be...). We can calculate the size of this message to be 7033 bits. Naively, we might try to apply the MTF transform directly. The result is a message with 7807 bits (higher than the original). The reason is that English text does not in general exhibit a high level of local frequency correlation. However, if we first apply the Burrows–Wheeler transform, and then the MTF transform, we get a message with 6187 bits. Note that the Burrows–Wheeler transform does not decrease the entropy of the message; it only reorders the bytes in a way that makes the MTF transform more effective.

One problem with the basic MTF transform is that it makes the same changes for any character, regardless of frequency, which can result in diminished compression as characters that occur rarely may push frequent characters to higher values. Various alterations and alternatives have been developed for this reason. One common change is to make it so that characters above a certain point can only be moved to a certain threshold. Another is to make some algorithm that runs a count of each character's local frequency and uses these values to choose the characters' order at any point. Many of these transforms still reserve zero for repeat characters, since these are often the most common in data after the Burrows Wheeler Transform.

Discover more about Use in practical data compression algorithms related topics

Entropy (information theory)

Entropy (information theory)

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable , which takes values in the alphabet and is distributed according to :

Burrows–Wheeler transform

Burrows–Wheeler transform

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.

Plain text

Plain text

In computing, plain text is a loose term for data that represent only characters of readable material but not its graphical representation nor other objects. It may also include a limited number of "whitespace" characters that affect simple arrangement of text, such as spaces, line breaks, or tabulation characters. Plain text is different from formatted text, where style information is included; from structured text, where structural parts of the document such as paragraphs, sections, and the like are identified; and from binary files in which some portions must be interpreted as binary objects.

To be, or not to be

To be, or not to be

"To be, or not to be" is the opening phrase of a soliloquy given by Prince Hamlet in the so-called "nunnery scene" of William Shakespeare's play Hamlet, Act 3, Scene 1. In the speech, Hamlet contemplates death and suicide, weighing the pain and unfairness of life against the alternative, which might be worse. The opening line is one of the most widely known and quoted lines in modern English literature, and the soliloquy has been referenced in many works of theatre, literature, and music. Hamlet is not alone as he speaks—Ophelia is present, and Claudius and Polonius have concealed themselves. Claudius and Polonius have placed Ophelia in Hamlet's way in order to overhear their conversation and find out if Hamlet is really mad or only pretending. Even so, Hamlet seems to consider himself alone, and there is no indication that the others hear him before he addresses Ophelia.

Move-to-front linked-list

  • The term Move To Front (MTF) is also used in a slightly different context, as a type of a dynamic linked list. In an MTF list, each element is moved to the front when it is accessed.[4] This ensures that, over time, the more frequently accessed elements are easier to access.

Source: "Move-to-front transform", Wikipedia, Wikimedia Foundation, (2022, November 25th),

Enjoying Wikiz?

Enjoying Wikiz?

Get our FREE extension now!

  1. ^ Ryabko, B. Ya Data compression by means of a "book stack”, Problems of Information Transmission, 1980, v. 16: (4), pp. 265-269
  2. ^ J. L. Bentley; D. D. Sleator; R. E. Tarjan; V. K. Wei (1986). "A Locally Adaptive Data Compression Scheme". Communications of the ACM. 29 (4): 320–330. CiteSeerX doi:10.1145/5684.5688. S2CID 5854590.
  3. ^ Ryabko, B. Ya.; Horspool, R. Nigel; Cormack, Gordon V. (1987). "Comments to: "A locally adaptive data compression scheme" by J. L. Bentley, D. D. Sleator, R. E. Tarjan and V. K. Wei". Comm. ACM. 30 (9): 792–794. doi:10.1145/30401.315747. S2CID 16138142.
  4. ^ Rivest, R. (1976). "On self-organizing sequential search heuristics". Communications of the ACM. 19 (2): 63–67. doi:10.1145/359997.360000. S2CID 498886.
External links

The content of this page is based on the Wikipedia article written by contributors..
The text is available under the Creative Commons Attribution-ShareAlike Licence & the media files are available under their respective licenses; additional terms may apply.
By using this site, you agree to the Terms of Use & Privacy Policy.
Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization & is not affiliated to