Rounding Algorithms 101 Redux

By   06.22.2006

Way back in the mists of time (in the early days of 2006), I penned a “How To” column entitled “An introduction to different rounding algorithms.” In order to provide some real-world examples, that article included some test cases in MATLAB that illustrated the types of errors associated with the different rounding schemes applied at various stages throughout a digital filter.

You can only imagine my surprise when this article was picked up by sites such as Slashdot.org, with the result that the little scamp received more than 30,000 hits in just a few days. Since that time, I’ve been working away in the background on a somewhat different presentation of this material (including some interesting nuggets of new information) on the More Cool Stuff page of my DIY Calculator website. For your delectation and delight, an abstracted version of that paper is presented below.


Introduction
We all remember being taught the concept of rounding in our younger years at school. Common problems involved monetary values, such as rounding some amount like $5.19 to the nearest dollar (which would be $5 in the case of this example). However, although this may seem simple at a first glance, there’s a lot more to rounding than might at first meet the eye.

One key fact associated with rounding is that it involves transforming some quantity from a greater precision to a lesser precision. For example, suppose that we average out a range of prices and end up with a dollar value of $5.19286. In this case, rounding the more precise value of $5.19286 to the nearest cent would result in $5.19, which is less precise.

This means that if we have to perform rounding, we would prefer to use an algorithm that minimizes the effects of this loss of precision. (The term “algorithm,” which is named after the legendary Persian astrologer, astronomer, mathematician, scientist, and author al-Khwarizmi [circa 780-850], refers to a detailed sequence of actions that are used to accomplish or perform a specific task.)

We especially wish to minimize the loss of precision if we are performing some procedure that involves cycling round a loop performing a series of operations (including rounding) on the same data over and over again. If we fail, the result will be so-called “creeping errors,” which refers to errors that increase over time as we iterate around the loop.

So how hard can this be? Well, things can become quite interesting, because there is a plethora of different rounding algorithms that we might use. These include round-toward-nearest (which itself encompasses round-half-up and round-half-down), round-up, round-down, round-toward-zero, round-away-from-zero, round-ceiling, round-floor, truncation (chopping), … and the list goes on.

Just to increase the fun and frivolity, some of these terms can sometimes refer to the same thing, while at other times they may differ (this can depend on who you are talking to, the particular computer program or hardware implementation you are using, and so forth). Furthermore, the effect of a particular algorithm may vary depending on the form of number representation to which it is being applied, such as unsigned values, sign-magnitude values, and signed (complement) values.

In order to introduce the fundamental concepts, we’ll initially assume that we are working with standard (sign-magnitude) decimal values, such as +3.142 and -3.142. (For the purposes of these discussions, any number without an explicit sign is assumed to be positive.) Also, we’ll assume that we wish to round to integer values, which means that +3.142 will round to +3 (at least, it will if we are using the round-toward-nearest algorithm as discussed below). Once we have the basics out of the way, we’ll consider some of the implications associated with applying rounding to different numerical representations, such as signed binary numbers.

Round-Toward-Nearest
As its name suggests, this algorithm rounds toward the nearest significant value (this would be the nearest whole number, or integer, in the case of these particular examples). In many ways, this is the most intuitive of the various rounding algorithms, because values such as 5.1, 5.2, 5.3, and 5.4 will round down to 5, while values of 5.6, 5.7, 5.8, and 5.9 will round up to 6:

But what should we do in the case of a “half-way” value such as 5.5? Well, there are two obvious options: we could round it up to 6 or we could round it down to 5; these schemes, which are introduced below, are known as Round-Half-Up and Round-Half-Down, respectively.

Round-Half-Up (Arithmetic Rounding)
This algorithm, which may also be referred to as arithmetic rounding, is the one that we typically associate with the concept of rounding that we learned at school. In this case, a “half-way” value such as 5.5 will round up to 6:

One way to view this is that (at this level of precision and for this particular example) we can consider there to be ten values that commence with a “5” in the most-significant place (5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, and 5.9). On this basis, it intuitively makes sense for five of the values to round down and for the other five to round up; that is, for the five values 5.0 through 5.4 to round down to 5, and for the remaining five values 5.5 through 5.9 to round up to 6.

Before we move on, it’s worth taking a few moments to note that “half-way” values will always round up when using this algorithm, irrespective of whether the final result is odd or even:

As we see from the above example, the 5.5 value rounds up to 6, which is an even number; and the 6.5 value rounds up to 7, which is an odd number. The reason we are emphasizing this point is that the Round-Half-Even and Round-Half-Odd algorithms – which are introduced later in this paper – would treat these two values differently. (Note that the only reason we’ve omitted the values 5.2, 5.3, 5.7, 5.8, 6.2, 6.3, 6.7, and 6.8 from the above illustration is to save space so that we can display both odd and even values.)

The tricky point with this round-half-up algorithm arrives when we come to consider negative numbers. There is no problem in the case of the values like -5.1, -5.2, -5.3, and -5.4, because these will all round to the nearest integer, which is -5. Similarly, there is no problem in the case of values like -5.6, -5.7, -5.8, and -5.9, because these will all round to -6. The problem arises in the case of “half-way” values like -5.5 and -6.5 and our definition as to what “up” means in the context of “round-half-up.” Based on the fact that positive values like +5.5 and +6.5 round up to +6 and +7, respectively, most of us would intuitively expect their negative equivalents of -5.5 and -6.5 to round to -6 and -7, respectively. In this case, we would say that our algorithm was symmetric (with respect to zero) for positive and negative values:

Observe that the above illustration refers to negative numbers as “increasing” in size toward negative infinity. In this case, we are talking about the absolute size of the values (by “absolute,” we mean the size of the number if we disregard its sign and consider it to be a positive quantity; for example, +1 is bigger than -2 in real-world terms, but the absolute value of -2, which is written as |-2|, is +2, which is bigger than +1).

But we digress. The point is that some applications (and some mathematicians) would regard “up” as referring to positive infinity. Based on this, -5.5 and -6.5 would actually round to -5 and -6, respectively, in which case we would class this as being an asymmetric (with respect to zero) implementation of the round-half-up algorithm:

The problem is that different applications may treat things differently. For example, the round method of the Java Math Library provides an asymmetric implementation of the round-half-up algorithm, while the round function in the mathematical modeling, simulation, and visualization tool MATLAB from The MathWorks provides a symmetric implementation. (Just for giggles and grins, the round function in Visual Basic for Applications 6.0 actually implements the Round-Half-Even [Banker’s rounding] algorithm discussed later in this paper.)
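These differing conventions are easy to reproduce. The following minimal Python sketch (the function names are my own, purely for illustration) implements both interpretations of round-half-up; note that adding 0.5 to a binary floating-point value can itself introduce representation error for some inputs, so this is a sketch of the concept rather than production code:

```python
import math

def round_half_up_asymmetric(x):
    """'Up' means toward positive infinity, so -5.5 rounds to -5 (Java-style)."""
    return math.floor(x + 0.5)

def round_half_up_symmetric(x):
    """'Up' means away from zero, so -5.5 rounds to -6 (MATLAB-style)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

print(round_half_up_asymmetric(-5.5))  # -5
print(round_half_up_symmetric(-5.5))   # -6
print(round_half_up_asymmetric(5.5))   # 6
print(round_half_up_symmetric(5.5))    # 6
```

For positive values the two functions always agree; they only diverge on negative “half-way” values.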

As a point of interest, the symmetric versions of rounding algorithms are sometimes referred to as Gaussian implementations. This is because the theoretical frequency distribution known as a “Gaussian Distribution” – which is named after the German mathematician and astronomer Carl Friedrich Gauss (1777-1855) – is symmetrical about its mean value.

Round-Half-Down
Perhaps not surprisingly, this incarnation of the Round-Toward-Nearest algorithm acts in the opposite manner to its Round-Half-Up counterpart discussed above. In this case, “half-way” values such as 5.5 and 6.5 will round down to 5 and 6, respectively:

Once again, we run into a problem when we come to consider negative numbers, because what we do with “half-way” values depends on what we understand the term “down” to mean. On the basis that positive values of +5.5 and +6.5 round to +5 and +6, respectively, a symmetric implementation of the round-half-down algorithm will round values of -5.5 and -6.5 to -5 and -6, respectively:

By comparison, in the case of an asymmetric implementation of the algorithm, in which “down” is understood to refer to negative infinity, values of -5.5 and -6.5 will actually be rounded to -6 and -7, respectively:
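The two interpretations of “down” can be sketched in Python as follows (again, the function names are mine, for illustration only):

```python
import math

def round_half_down_symmetric(x):
    """Half-way values round toward zero: +5.5 -> +5, -5.5 -> -5."""
    return int(math.copysign(math.ceil(abs(x) - 0.5), x))

def round_half_down_asymmetric(x):
    """Half-way values round toward negative infinity: -5.5 -> -6."""
    return math.ceil(x - 0.5)

print(round_half_down_symmetric(-5.5))   # -5
print(round_half_down_asymmetric(-5.5))  # -6
```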

As was noted earlier, such symmetric implementations are sometimes referred to as Gaussian implementations.

Round-Half-Even (Banker’s Rounding)
If half-way values are always rounded in the same direction (for example, if +5.5 rounds up to +6 and +6.5 rounds up to +7, as is the case with the Round-Half-Up algorithm presented earlier), the result can be a bias that grows as more and more rounding operations are performed. One solution toward minimizing this bias is to sometimes round up and sometimes round down.

In the case of the round-half-even algorithm (which is often referred to as “Banker’s Rounding” because it is commonly used in financial calculations), “half-way” values are rounded toward the nearest even number. Thus, +5.5 will round up to +6 and +6.5 will round down to +6:

This algorithm is, by definition, symmetric for positive and negative values, so both -5.5 and -6.5 will round to the nearest even value, which is -6:
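As it happens, Python 3’s built-in round() function implements exactly this round-half-even scheme, which makes the behavior easy to demonstrate (values like 5.5 and 6.5 are exactly representable in binary floating point, so these particular examples are safe from representation error):

```python
# Python 3's round() rounds half-way values to the nearest even integer
print(round(5.5))   # 6
print(round(6.5))   # 6
print(round(-5.5))  # -6
print(round(-6.5))  # -6
```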

In the case of data sets that feature a relatively large number of “half-way” values (financial records often provide a good example of this), the round-half-even algorithm performs significantly better than the Round-Half-Up scheme in terms of total bias (it is for this reason that the use of round-half-even is a legal requirement for many financial calculations around the world).

By comparison, in the case of data sets containing a relatively small number of “half-way” values – such as real-world values being applied to hardware implementations of digital signal processing (DSP) algorithms – the overhead involved in performing the round-half-even algorithm in hardware does not justify its use (see also the discussions on Rounding Signed Binary Values later in this paper).

Round-Half-Odd
This is the theoretical counterpart to the Round-Half-Even algorithm; in this case, however, “half-way” values are rounded toward the nearest odd number. For example, +5.5 and +6.5 will round to +5 and +7, respectively:

As with its “even” counterpart, the round-half-odd algorithm is, by definition, symmetric for positive and negative values. Thus, -5.5 will round to -5 and -6.5 will round to -7:

In practice, the round-half-odd algorithm is rarely (if ever) used because it will never round to zero, and rounding to zero is often a desirable attribute for rounding algorithms.
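For completeness, here is a hypothetical Python sketch of round-half-odd (the name is mine); observe that ±0.5 rounds to ±1, confirming that this algorithm can never produce zero from a half-way value:

```python
import math

def round_half_odd(x):
    """Round to nearest; half-way values go to the nearest odd integer."""
    f = math.floor(x)
    if x - f == 0.5:                      # exactly half-way
        return f if f % 2 == 1 else f + 1
    return math.floor(x + 0.5)            # ordinary round-toward-nearest

print(round_half_odd(5.5))   # 5
print(round_half_odd(6.5))   # 7
print(round_half_odd(-5.5))  # -5
print(round_half_odd(0.5))   # 1 (never rounds to zero)
```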

Round-Ceiling (Toward Positive Infinity)
This refers to always rounding towards positive infinity. In the case of a positive number, the result will remain unchanged if the digits to be discarded are all zero; otherwise it will be rounded up. For example, +5.0 will be rounded to +5, but +5.1, +5.2, +5.3, +5.4, +5.5, +5.6, +5.7, +5.8, and +5.9 will all be rounded up to +6 (similarly, +6.0 will be rounded to +6, while +6.1 through +6.9 will all be rounded up to +7):

By comparison, in the case of a negative number, the unwanted digits are simply discarded. For example, the values -5.0, -5.1, -5.2, -5.3, -5.4, -5.5, -5.6, -5.7, -5.8, and -5.9 will all be rounded to -5:
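In Python, math.ceil() provides exactly this round-ceiling behavior, so a quick sanity check is easy:

```python
import math

print(math.ceil(5.0))   # 5  (all-zero fraction: value unchanged)
print(math.ceil(5.1))   # 6  (any nonzero fraction rounds up)
print(math.ceil(-5.9))  # -5 (negative fractions are simply discarded)
```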

From the above illustrations, it’s easy to see that the round-ceiling algorithm results in a cumulative positive bias. Thus, in the case of software implementations, or during analysis of a system using software simulation, this form of rounding is sometimes employed to determine the upper limit of the algorithm (that is, the upper limit of the results generated by the algorithm for a given data set) for use in diagnostic functions.

In the case of hardware (a physical realization in logic gates), this algorithm requires a significant amount of additional logic – especially when weighed against its lack of compelling advantages – and is therefore rarely used in hardware implementations.

Round-Floor (Toward Negative Infinity)
This is the counterpart to the Round-Ceiling algorithm introduced above; the difference being that in this case we round toward negative infinity. This means that, in the case of a negative value, the result will remain unchanged if the digits to be discarded are all zero; otherwise it will be rounded toward negative infinity. For example, -5.0 will be rounded to -5, but -5.1, -5.2, -5.3, -5.4, -5.5, -5.6, -5.7, -5.8, and -5.9 will all be rounded to -6 (similarly, -6.0 will be rounded to -6, while -6.1 through -6.9 will all be rounded to -7):

By comparison, in the case of a positive value, the unwanted digits are simply discarded. For example, the values +5.0, +5.1, +5.2, +5.3, +5.4, +5.5, +5.6, +5.7, +5.8, and +5.9 will all be rounded to +5:
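Correspondingly, Python’s math.floor() provides round-floor behavior directly:

```python
import math

print(math.floor(-5.0))  # -5 (all-zero fraction: value unchanged)
print(math.floor(-5.1))  # -6 (any nonzero fraction rounds toward negative infinity)
print(math.floor(5.9))   # 5  (positive fractions are simply discarded)
```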

From the above illustrations, it’s easy to see that the round-floor algorithm results in a cumulative negative bias. Thus, in the case of software implementations, or during analysis of a system using software simulation, this form of rounding is sometimes employed to determine the lower limit of the algorithm (that is, the lower limit of the results generated by the algorithm for a given data set) for use in diagnostic functions.

Furthermore, when working with signed binary numbers this approach is “cheap” in terms of hardware (a physical realization in logic gates) since it involves only a simple truncation (the reason why this should be so is discussed later in this paper); this technique is therefore very often used with regard to hardware implementations.

Round-Toward-Zero
As its name suggests, this refers to rounding in such a way that the result heads toward zero. For example, +5.0, +5.1, +5.2, +5.3, +5.4, +5.5, +5.6, +5.7, +5.8, and +5.9 will all be rounded to +5 (this works the same for odd and even values, so +6.0 through +6.9 will all be rounded to +6):

Similarly, -5.0, -5.1, -5.2, -5.3, -5.4, -5.5, -5.6, -5.7, -5.8, and -5.9 will all be rounded to -5 (again, this works the same for odd and even values, so -6.0 through -6.9 will all be rounded to -6):

Another way to think about this form of rounding is that it acts in the same way as the Round-Floor algorithm for positive numbers and as the Round-Ceiling algorithm for negative numbers.
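In Python, math.trunc() implements round-toward-zero, and converting a float with int() does the same thing:

```python
import math

print(math.trunc(5.9))   # 5
print(math.trunc(-5.9))  # -5
print(int(-5.9))         # -5 (int() on a float also truncates toward zero)
```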

Round-Away-From-Zero
This is the counterpart to the Round-Toward-Zero algorithm introduced above; the difference being that in this case we round away from zero. For example, +5.1, +5.2, +5.3, +5.4, +5.5, +5.6, +5.7, +5.8, and +5.9 will all be rounded to +6 (this works the same for odd and even values, so +6.1 through +6.9 will all be rounded to +7):

Similarly, -5.1, -5.2, -5.3, -5.4, -5.5, -5.6, -5.7, -5.8, and -5.9 will all be rounded to -6 (again, this works the same for odd and even values, so -6.1 through -6.9 will all be rounded to -7):

Another way to think about this form of rounding is that it acts in the same way as the Round-Ceiling algorithm for positive numbers and as the Round-Floor algorithm for negative numbers.
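Python has no built-in for this mode, but it can be sketched in one line (the function name is mine, for illustration only):

```python
import math

def round_away_from_zero(x):
    """Any nonzero fraction pushes the result away from zero."""
    return int(math.copysign(math.ceil(abs(x)), x))

print(round_away_from_zero(5.1))   # 6
print(round_away_from_zero(-5.1))  # -6
print(round_away_from_zero(5.0))   # 5
```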

Round-Up
The actions of this rounding mode depend on what one means by “up”. Some applications understand “up” to refer to heading towards positive infinity; in this case, round-up acts the same way as the Round-Ceiling algorithm. Alternatively, some applications regard “up” as referring to an absolute value heading away from zero; in this case, round-up acts in the same manner as the Round-Away-From-Zero algorithm.

Round-Down
This is the counterpart to the Round-Up algorithm introduced above. As you might expect by now, the actions of this mode depend on what one means by “down”. Some applications understand “down” to refer to heading towards negative infinity; in this case, round-down acts the same way as the Round-Floor algorithm. Alternatively, some applications regard “down” as referring to an absolute value heading toward zero; in this case, round-down acts in the same manner as the Round-Toward-Zero algorithm.

Truncation (Chopping)
Also known as chopping, truncation simply means discarding any unwanted digits. This means that in the case of the standard (sign-magnitude) decimal values we’ve been considering thus far – and also when working with unsigned binary values – the actions of truncation are identical to those of the Round-Toward-Zero algorithm.

With regard to the previous point, as a rule of thumb, when working with software applications that are inputting, manipulating, and displaying decimal values, performing a truncation operation on -3.5 will return -3 when working with most programming languages.

But things are never simple; when working with signed binary values in hardware implementations of accelerator and digital signal processing (DSP) functions, the actions of truncation typically reflect those of the Round-Floor algorithm (that is, truncating -3.5 will result in -4, for example).
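The two flavors of truncation are easy to contrast in Python:

```python
import math

print(int(-3.5))         # -3 (round-toward-zero: typical software truncation)
print(math.floor(-3.5))  # -4 (round-floor: what discarding the fraction bits
                         #     of a two's-complement value actually does)
```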

See also the discussions on Rounding Sign-Magnitude Binary Values and Rounding Signed Binary Values later in this paper.

Round-Alternate
Also known as alternate rounding, this is similar in concept to the Round-Half-Even and Round-Half-Odd schemes discussed earlier, in that the purpose of this algorithm is to minimize the bias that can be caused by always rounding “half-way” values in the same direction.

In the case of the Round-Half-Even algorithm, for example, it would be possible for a bias to occur if the data being processed contained a disproportionate number of odd and even “half-way” values. One solution is to use the round-alternate approach, in which the first “half-way” value is rounded up (for example), the next is rounded down, the next up, the next down, and so on.
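One way to sketch this in Python is with a rounder that remembers which direction it used last (the names here are mine, purely for illustration):

```python
import math

def make_round_alternate():
    """Return a rounding function whose half-way direction flips on each use."""
    state = {"up": True}
    def rounder(x):
        f = math.floor(x)
        if x - f == 0.5:                   # exactly half-way
            result = f + 1 if state["up"] else f
            state["up"] = not state["up"]  # flip for the next half-way value
            return result
        return math.floor(x + 0.5)         # non-half-way: ordinary nearest
    return rounder

r = make_round_alternate()
print([r(v) for v in (0.5, 1.5, 2.5, 3.5)])  # [1, 1, 3, 3]
```

Note that only “half-way” values toggle the direction; all other values round to the nearest integer as usual.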

Round-Random (Stochastic Rounding)
This may also be referred to as random rounding or stochastic rounding, where the term “stochastic” comes from the Greek stokhazesthai, meaning “to guess at.” In this case, when the algorithm is presented with a “half-way” value, it effectively tosses a metaphorical coin in the air and randomly (or pseudo-randomly) rounds the value up or down.

Although this technique typically gives the best overall results over large numbers of calculations, it is only employed in very specialized applications, because the nature of this algorithm makes it difficult to implement and tricky to verify the results.
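The coin-toss behavior described above can be sketched as follows (function name mine; a real implementation would need a carefully chosen and verifiable random source):

```python
import math
import random

def round_half_random(x, rng=random):
    """Half-way values are rounded up or down on a (pseudo-)random coin toss."""
    f = math.floor(x)
    if x - f == 0.5:
        return f + 1 if rng.random() < 0.5 else f
    return math.floor(x + 0.5)

random.seed(42)
print(round_half_random(5.5))  # 5 or 6, depending on the coin toss
```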

Rounding Sign-Magnitude Binary Values
All of our previous discussions have been based on our working with standard sign-magnitude decimal values. Now let’s consider what happens to our rounding algorithms in the case of binary representations. For the purposes of these discussions, we will focus on the truncation and round-half-up algorithms, because these are the techniques commonly used in hardware implementations constructed out of physical logic gates (in turn, this is because these algorithms entail relatively little overhead in terms of additional logic).

We’ll start by considering an 8-bit binary sign-magnitude fixed-point representation comprising a sign bit, three integer bits, and four fractional bits:

Note that the differences between unsigned, sign-magnitude, and signed binary numbers are introduced in our book How Computers Do Math (ISBN: 0471732788).

For our purposes here, we need only note that – in the case of a sign-magnitude representation – the sign bit is used only to represent the sign of the value (0 = positive, 1 = negative). In the case of this particular example, we have three integer bits that can be used to represent an integer in the range 0 to 7. Thus, the combination of the sign bit and the three integer bits allows us to represent positive and negative integers in the range -7 through +7:

As we know (at least, we know if we’ve read the book {hint, hint}), one of the problems with this format is that we end up with both positive and negative versions of zero, but that need not concern us here.

When we come to the fractional portion of our 8-bit field, the first fractional bit is used to represent 1/2 = 0.5; the second fractional bit is used to represent 0.5/2 = 0.25; the third fractional bit is used to represent 0.25/2 = 0.125, and the fourth fractional bit is used to represent 0.125/2 = 0.0625.

Thus, our four-bit fractional field allows us to represent fractional values in the range 0.0 through 0.9375 as shown below (of course, more fractional bits would allow us to represent more precise fractional values, but four bits will serve the purposes of our discussions):

Truncation: So, now let’s suppose that we have a sign-magnitude binary value of 0101.0100, which equates to +5.25 in decimal. If we simply truncate this by removing the fractional field, we end up with an integer value of 0101 in binary, or +5 in decimal, which is what we would expect. Similarly, if we started with a value of 1101.0100, which equates to -5.25 in decimal, then truncating the fractional part leaves us with an integer value of 1101 in binary, or -5 in decimal, which – again – is what we expect. Let’s try this with some other values as follows:

The end result is that, as we previously noted, in the case of sign-magnitude binary numbers, using truncation (simply discarding the fractional bits) is the same as performing the Round-Toward-Zero algorithm as discussed earlier in this paper. Also, as we see, this works exactly the same way for both positive and negative (and odd and even) values. (Compare these results to those obtained when performing this operation on signed binary values as discussed in the next section.)
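The bit-level mechanics can be modeled directly. In this hypothetical helper (the name is mine), the 8-bit pattern is treated as one sign bit plus a 7-bit magnitude counted in sixteenths, so discarding the fractional field is just a 4-bit right shift of the magnitude:

```python
def sm_truncate(raw):
    """Truncate an 8-bit sign-magnitude fixed-point value
    (1 sign bit, 3 integer bits, 4 fraction bits)."""
    sign = -1 if raw & 0x80 else 1
    magnitude_16ths = raw & 0x7F          # magnitude in units of 1/16
    return sign * (magnitude_16ths >> 4)  # discard the four fractional bits

print(sm_truncate(0b01010100))  # 5  (0101.0100 = +5.25)
print(sm_truncate(0b11010100))  # -5 (1101.0100 = -5.25)
```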

Round-Half-Up: In addition to truncation, the other common rounding algorithm used in hardware implementations is that of Round-Half-Up. The reason this is so popular (as compared to a Round-Half-Even approach, for example) is that it doesn’t require us to perform any form of comparison; all we have to do is to add 0.5 to our original value and then truncate the result.

For example, let’s suppose that we have a binary value of 0101.1100, which equates to +5.75 in decimal. If we now add 0000.1000 (which equates to 0.5 in decimal) we end up with 0110.0100, which equates to +6.25 in decimal. And if we then truncate this value, we end up with 0110, or +6 in decimal, which is exactly what we would expect from a Round-Half-Up algorithm. Let’s try this with some other values as follows:

Thus, the end result is that, in the case of sign-magnitude binary numbers, adding a value of 0.5 and truncating the result gives exactly the same results as performing a symmetrical version of the Round-Half-Up algorithm as discussed earlier in this paper. As we would expect from the symmetrical version of this algorithm, this works exactly the same way for both positive and negative (and odd and even) values. (Compare these results to those obtained when performing this operation on signed binary values as discussed in the next section.)
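Extending the bit-level model above, adding 0.5 means adding 8 sixteenths to the magnitude before the shift (again, the helper name is mine, and overflow of the 3-bit integer field is ignored in this sketch):

```python
def sm_round_half_up(raw):
    """Round-half-up an 8-bit sign-magnitude fixed-point value by adding
    0.5 (i.e., 8/16) to the magnitude and then truncating.
    (Overflow of the 3-bit integer field is not handled here.)"""
    sign = -1 if raw & 0x80 else 1
    magnitude_16ths = (raw & 0x7F) + 8    # add 0.5 to the magnitude
    return sign * (magnitude_16ths >> 4)  # then discard the fraction bits

print(sm_round_half_up(0b01011100))  # 6  (+5.75 -> +6)
print(sm_round_half_up(0b11011000))  # -6 (-5.5 -> -6: symmetric)
```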

Rounding Signed Binary Values
For this portion of our discussions, we will base our examples on an 8-bit signed binary fixed-point representation comprising four integer bits (the most-significant of which also acts as a sign bit) and four fractional bits:

Once again, the differences between unsigned, sign-magnitude, and signed binary numbers are introduced in our book How Computers Do Math (ISBN: 0471732788).

For our purposes here, we need only note that – in the case of a signed binary representation – the sign bit is used to signify a negative quantity (not just the sign), while the remaining bits continue to represent positive values. Thus, in the case of our 4-bit integer field, a “1” in the sign bit reflects a value of -(2^3) = -8 (that is, minus two raised to the power of three), which therefore allows us to use this 4-bit field to represent integers in the range -8 through +7:

The fractional portion of our 8-bit field behaves in exactly the same manner as for the sign-magnitude representations we discussed in the previous section; the first fractional bit is used to represent 1/2 = 0.5; the second fractional bit is used to represent 0.5/2 = 0.25; the third fractional bit is used to represent 0.25/2 = 0.125, and the fourth fractional bit is used to represent 0.125/2 = 0.0625. Thus, our four-bit fractional field allows us to represent fractional values in the range 0.0 through 0.9375 as shown below (once again, more fractional bits would allow us to represent more precise fractional values, but four bits will serve the purposes of our discussions):

Truncation: So, now let’s suppose that we have a signed binary value of 0101.1000, which equates to +5.5 in decimal. If we simply truncate this by removing the fractional field, we end up with an integer value of 0101 in binary, or +5 in decimal, which is what we would expect.

However, now consider what happens if we wish to perform the same operation on the equivalent negative value of -5.5. In this case, our binary value will be 1010.1000, which equates to -8 + 2 + 0.5 = -5.5 in decimal. Thus, truncating the fractional part leaves us with an integer value of 1010 in binary, or -6 (as opposed to the -5 we received in the case of the sign-magnitude representations in the previous section). Let’s try this with some other values as follows:

Observe the values shown in red/bold and compare these values to those obtained when performing this operation on sign-magnitude binary values as discussed in the previous section. The end result is that, as we previously noted, in the case of signed binary numbers, using truncation (simply discarding the fractional bits) is the same as performing the Round-Floor algorithm as discussed earlier in this paper.
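This, too, can be modeled at the bit level. Here the 8-bit pattern is first reinterpreted as a two’s-complement integer, after which discarding the four fractional bits is a single arithmetic shift (Python’s right shift on negative integers conveniently floors, just as the hardware does; the helper names are mine):

```python
def s8(raw):
    """Interpret an 8-bit pattern as a signed (two's-complement) integer."""
    return raw - 256 if raw & 0x80 else raw

def tc_truncate(raw):
    """Discard the four fractional bits of an 8-bit signed fixed-point value."""
    return s8(raw) >> 4  # arithmetic shift right: floors for negative values

print(tc_truncate(0b01011000))  # 5  (0101.1000 = +5.5)
print(tc_truncate(0b10101000))  # -6 (1010.1000 = -5.5: round-floor behavior)
```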

Round-Half-Up: As we mentioned earlier, the other common rounding algorithm used in hardware implementations is that of Round-Half-Up.

Once again, the reason this algorithm is so popular (as compared to a Round-Half-Even approach, for example) is that it doesn’t require us to perform any form of comparison; all we have to do is to add 0.5 to our original value and then truncate the result.

As you will doubtless recall, in the case of positive values, we expect this algorithm to round to the next integer for fractional values of 0.5 and higher. For example, +5.5 should round to +6. Let’s see if this works. If we have an initial signed binary value of 0101.1000 (which equates to +5.5 in decimal) and we add 0000.1000 (which equates to 0.5 in decimal) we end up with 0110.0000, which equates to +6.0 in decimal. And if we now truncate this value, we end up with 0110 or +6 in decimal, which is what we would expect.

But what about negative values? If we have an initial value of 1010.1000 (which equates to -8 + 2 + 0.5 = -5.5 in decimal) and we add 0000.1000 (which equates to 0.5 in decimal) we end up with 1011.0000, which equates to -8 + 3 = -5 in decimal. If we then truncate this value, we end up with 1011 or -5 in decimal, as opposed to the value of -6 that we might have hoped for. Just to make the point a little more strongly, let’s try this with some other values as follows:

As we see, in the case of signed binary numbers, adding a value of 0.5 and truncating the result gives exactly the same results as performing an asymmetrical version of the Round-Half-Up algorithm as discussed earlier in this paper. As we would expect from the asymmetrical version of this algorithm, this works differently for positive and negative values that fall exactly on the “half-way” (0.5) boundary. (Compare these results to those obtained when performing this operation on sign-magnitude binary values as discussed in the previous section.)
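The same bit-level model confirms this: adding 8 sixteenths (0.5) to the raw two’s-complement value and then shifting reproduces the asymmetric behavior described above (helper names mine; integer-field overflow is again ignored in this sketch):

```python
def s8(raw):
    """Interpret an 8-bit pattern as a signed (two's-complement) integer."""
    return raw - 256 if raw & 0x80 else raw

def tc_round_half_up(raw):
    """Add 0.5 (8/16) and truncate (overflow handling omitted)."""
    return (s8(raw) + 8) >> 4

print(tc_round_half_up(0b01011000))  # 6  (+5.5 -> +6)
print(tc_round_half_up(0b10101000))  # -5 (-5.5 -> -5: asymmetric)
```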

Summary and Further Reading
The following table reflects a summary of the main rounding modes discussed above as applied to standard (sign-magnitude) decimal values:

With regard to the above illustration, it’s important to remember the following points (all of which are covered in more detail in our earlier discussions):

  1. The generic concept of Round-Toward-Nearest encompasses both the Round-Half-Up and Round-Half-Down modes.
  2. The term “arithmetic rounding” is another name for the Round-Half-Up algorithm (of which there can be both symmetric and asymmetric versions).
  3. Depending on the application/implementation, the actions of the Round-Up algorithm may either equate to those of the Round-Ceiling mode or to those of the Round-Away-From-Zero mode.
  4. Depending on the application/implementation, the actions of the Round-Down algorithm may either equate to those of the Round-Floor mode or to those of the Round-Toward-Zero mode.
  5. In the case of sign-magnitude or unsigned binary numbers, the actions of Truncation are identical to those of the Round-Toward-Zero mode. By comparison, in the case of signed binary numbers, the actions of Truncation are identical to those of the Round-Floor mode.

There are many additional resources available to anyone who is interested in knowing more about this – and related – topics. The following should provide some “food for thought,” and please feel free to let us know of any other items you feel should be included in this list:

  1. Mike Cowlishaw, an IBM fellow, has created an incredibly useful site that introduces a fantastic range of topics – including rounding – pertaining to the use of Decimal Arithmetic in computers. Also included are links to related resources.
  2. David Goldberg has created a tremendously useful site entitled What Every Computer Scientist Should Know About Floating-Point Arithmetic , which, you may not be surprised to learn, provides a great introduction to Floating-Point representations and operations (including rounding algorithms, of course).
  3. Employed by many computers, microprocessors, and hardware accelerators, IEEE 754 is the most widely used standard for floating-point arithmetic. Published by the Institute of Electrical and Electronics Engineers (IEEE), this standard covers all aspects of floating-point computation, including a number of rounding and truncation schemes. The related IEEE 854 standard generalizes the original IEEE 754 to cover decimal arithmetic as well as binary.
  4. With regard to the previous point, a useful overview to the IEEE 754 floating-point representation is provided by Steve Hollasch. Also, a useful tool for performing IEEE 754 conversions from decimal floating-point to their 32-bit and 64-bit hexadecimal counterparts is available from Queens College in New York.
  5. Regarded by many as being a definitive reference, volume 2 of the classic Art of Computer Programming by Donald Knuth focuses on Seminumerical Algorithms and covers a wide range of topics from random number generators to floating point operations and rounding techniques.
  6. Last but certainly not least (if I may dare to mention this one final time), this entire paper and its predecessor were spawned from just a single topic in our book How Computers Do Math (ISBN: 0471732788). So, if you’ve found this paper to be useful and interesting, just imagine how much fun you’d have reading the book itself!

Clive “Max” Maxfield is president of TechBites Interactive, a marketing consultancy firm specializing in high technology. Max is the author and co-author of a number of books, including Bebop to the Boolean Boogie (An Unconventional Guide to Electronics) and How Computers Do Math (ISBN: 0471732788) featuring the pedagogical and phantasmagorical virtual DIY Calculator.

<From https://www.eetimes.com/rounding-algorithms-101-redux/#A7>


An introduction to different rounding algorithms

By   01.04.2006

Editor’s Note: I just thought I should mention that there’s an ongoing, ever-evolving, and dramatically different presentation of this topic entitled Rounding 101 on my personal website.

It is often said that it’s only when you try to explain something to someone else that you come to realize that there are “holes” in your understanding of the topic in question. Such was the case when it came to presenting the concept of rounding in our recently-published book: How Computers Do Math (ISBN: 0471732788) featuring a virtual 8-bit computer-calculator called the DIY Calculator. In fact, the mind soon boggles at the variety and intricacies of the rounding algorithms that may be used for different applications.

For example, we have round-up , round-down , round-toward-nearest , arithmetic rounding , round-half-up , round-half-down , round-half-even , round-half-odd , round-toward-zero , round-away-from-zero , round-ceiling , round-floor , truncation (chopping), round-alternate , and round-random (stochastic rounding), to name but a few. What makes things even more exciting is that some of these terms can refer to the same thing (or not, depending on the application and to whom you are talking). Similarly, some rounding schemes work one way on sign-magnitude values (like standard decimal numbers and sign-magnitude binary numbers) and a different way on complement representations (like twos complement [signed binary] and tens complement [signed decimal]).

But, nothing daunted, we are going to rip the veils asunder and discover more about rounding than most of us ever wanted to know. We will commence by introducing a smorgasbord of basic concepts. Then, in order to provide some “meat,” Tim Vanevenhoven at AccelChip was kind enough to create and run some test cases in MATLAB® from The MathWorks to illustrate the types of errors associated with different rounding schemes applied at various stages throughout a fixed-point digital filter implemented in hardware (the MATLAB source files are available for you to download and play with as described later in this article).

Rounding in decimal
We’ll start the ball rolling by considering the various rounding schemes in the context of the decimal numbers we know and love so well. The most fundamental fact associated with rounding is that it involves transforming some quantity from a greater precision to a lesser precision; for example, rounding a reasonably precise value like $3.21 to the nearest dollar would result in $3.00, which is a less precise entity.

Given a choice, we would generally prefer to use a rounding algorithm that minimizes the effects of this loss of precision, especially in the case where multiple processing iterations – each involving rounding – can result in “creeping errors” (by this we mean that errors increase over time due to performing rounding operations on data that has previously been rounded). However, in the case of hardware implementations targeted toward tasks such as digital signal processing (DSP) algorithms, for example, we also have to be cognizant of the overheads associated with the various rounding techniques so as to make appropriate design trade-offs.

For the purposes of the following discussions, we will assume that the goal is to round to an integer value. In real-world applications we might wish to round to any particular digit (usually a fractional digit), but the principles are exactly the same.

A summary of the actions of the main rounding modes as applied to standard (sign-magnitude) decimal values is provided in Table 1.

Round-toward-nearest: This is perhaps the most intuitive of the various rounding algorithms. In this case, values such as 3.1, 3.2, 3.3, and 3.4 would round down to 3, while values of 3.6, 3.7, 3.8, and 3.9 would round up to 4. The trick, of course, is to decide what to do in the case of the half-way value 3.5. In fact, round-toward-nearest may be considered to be a superset of two complementary options known as round-half-up and round-half-down , each of which treats the 3.5 value in a different manner as discussed below.

Round-half-up: This algorithm, which may also be referred to as arithmetic rounding , is the one that we typically associate with the rounding we learned at grade-school. In this case, a half-way value such as 3.5 will round up to 4. One way to view this is that, at this level of precision and for this particular example, we can consider there to be ten values that commence with a 3 in the most-significant place (3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9). On this basis, it intuitively makes sense for five of the values to round down and for the other five to round up; that is, for the five values 3.0 through 3.4 to round down to 3, and for the remaining five values 3.5 through 3.9 to round up to 4.

The tricky point with the round-half-up algorithm arrives when we come to consider negative numbers. In the case of the values -3.1, -3.2, -3.3, and -3.4, these will all round to the nearest integer, which is -3; similarly, in the case of values like -3.6, -3.7, -3.8, and -3.9, these will all round to -4. The problem arises in the case of -3.5 and our definition as to what “up” means in the context of round-half-up . Based on the fact that a value of +3.5 rounds up to +4, most of us would intuitively expect a value of -3.5 to round to -4. In this case, we would say that our algorithm was symmetric for positive and negative values.

However, some applications (and mathematicians) regard “up” as referring to positive infinity. Based on this, -3.5 will actually round to -3, in which case we would class this as being an asymmetric implementation of the round-half-up algorithm. For example, the round method of the Java Math Library provides an asymmetric implementation of the round-half-up algorithm, while the round function in MATLAB provides a symmetric implementation. (Just to keep us on our toes, the round function in Visual Basic for Applications 6.0 actually implements the round-half-even [Banker’s rounding] algorithm discussed below.)
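To make the symmetric/asymmetric distinction concrete, here is a quick Python sketch (the function names are mine, purely for illustration; they are not taken from Java, MATLAB, or Visual Basic):

```python
import math

def round_half_up_asymmetric(x):
    # "Up" means toward positive infinity, as in Java's Math.round():
    # +3.5 rounds to 4, but -3.5 rounds to -3.
    return math.floor(x + 0.5)

def round_half_up_symmetric(x):
    # "Up" means away from zero, as in MATLAB's round():
    # +3.5 rounds to 4 and -3.5 rounds to -4.
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

print(round_half_up_asymmetric(-3.5))  # -3
print(round_half_up_symmetric(-3.5))   # -4
```

Both functions agree everywhere except on negative half-way values, which is exactly the point of contention described above.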

Round-half-down: This acts in the opposite manner to its round-half-up counterpart. In this case, a half-way value such as 3.5 will round down to 3. Once again, we run into a problem when we come to consider negative numbers, depending on what we assume “down” to mean. In the case of a symmetric implementation of the algorithm, a value of -3.5 will round to -3. By comparison, in the case of an asymmetric implementation of the algorithm, in which “down” is understood to refer to negative infinity, a value of -3.5 will actually round to -4.

As a point of interest, the symmetric versions of rounding algorithms are sometimes referred to as “Gaussian implementations.” This is because the theoretical frequency distribution known as a Gaussian distribution – which is named for the German mathematician and astronomer Carl Friedrich Gauss (1777-1855) – is symmetrical about its mean value.

Round-half-even: If half-way values are always rounded in the same direction (for example 3.5 rounds to 4 and 4.5 rounds to 5), the result can be a bias that grows as more rounding operations are performed. One solution toward minimizing this bias is to sometimes round up and sometimes round down.

In the case of the round-half-even algorithm (which is often referred to as Banker’s Rounding because it is commonly used in financial calculations), half-way values are rounded toward the nearest even number. Thus, 3.5 will round up to 4 and 4.5 will round down to 4. This algorithm is, by definition, symmetric for positive and negative values, so both -3.5 and -4.5 will round to -4.
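As it happens, Python 3’s built-in round function and its decimal module both implement round-half-even, which makes the behavior easy to demonstrate:

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Python 3's built-in round() performs Banker's Rounding:
print(round(3.5), round(4.5))    # 4 4
print(round(-3.5), round(-4.5))  # -4 -4

# The decimal module makes the rounding mode explicit:
print(Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # 2
```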

In the case of data sets that feature a relatively large number of “half-way” values (financial records provide a good example of this), the round-half-even algorithm performs significantly better than the round-half-up scheme in terms of total bias. However, in the case of data sets containing a relatively small number of “half-way” values – such as real-world values being applied to DSP algorithms – the overhead involved in performing the round-half-even algorithm in hardware does not justify its use (see also the filter examples shown later in this paper).

Round-half-odd: This is the theoretical counterpart to the round-half-even algorithm, in which half-way values are rounded toward the nearest odd number. In this case, 3.5 will round to 3 and 4.5 will round to 5 (similarly, -3.5 will round to -3, and -4.5 will round to -5). The reason we say “theoretical” is that, in practice, the round-half-odd algorithm is rarely (if ever) used, because it will never round to zero (rounding to zero is often a desirable attribute for rounding algorithms).

Round-alternate: Also known as alternate rounding , this is similar in concept to the round-half-even and round-half-odd schemes discussed above, in that the purpose of the round-alternate algorithm is to minimize the bias that can be caused by always rounding half-way values in the same direction.

In the case of the round-half-even approach, for example, it would be possible for a bias to occur if the data being processed contained a disproportionate number of odd and even half-way values. One solution is to use the round-alternate algorithm, in which the first half-way value is rounded up (for example); the next is rounded down, the next up, the next down, and so on.
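A minimal Python sketch of the round-alternate idea might look like the following (the class name and structure are mine; real implementations vary):

```python
import math

class AlternateRounder:
    """Round-alternate: half-way values are rounded up and down in turn."""
    def __init__(self):
        self.round_up_next = True

    def round(self, x):
        frac = x - math.floor(x)
        if frac != 0.5:                   # not a half-way value:
            return math.floor(x + 0.5)    # ordinary round-to-nearest
        result = math.ceil(x) if self.round_up_next else math.floor(x)
        self.round_up_next = not self.round_up_next   # flip for next time
        return result

r = AlternateRounder()
print(r.round(3.5), r.round(4.5), r.round(5.5))  # 4 4 6
```

Note that the rounder carries state between calls, which is exactly what a hardware realization would need (typically a single toggling flip-flop).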

Round-random: This may also be referred to as random rounding or stochastic rounding , where the term “stochastic” comes from the Greek stokhazesthai , meaning “to guess at.” With this technique, in the case of “half-way” values, we effectively toss a metaphorical coin in the air and randomly (or pseudo-randomly) round the value up or down.

Although this technique typically gives the best overall result over a large number of calculations, it is only employed in very specialized applications, because the nature of this algorithm makes it difficult to implement and tricky to verify the results.
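A sketch of the half-way coin toss described above, in Python (the function name is illustrative):

```python
import math
import random

def round_half_random(x, rng=random):
    """Half-way values are rounded up or down on a (pseudo-)random coin toss."""
    frac = x - math.floor(x)
    if frac != 0.5:
        return math.floor(x + 0.5)   # ordinary round-to-nearest
    return math.ceil(x) if rng.random() < 0.5 else math.floor(x)

print(round_half_random(3.2))   # 3
print(round_half_random(3.5))   # 3 or 4, at random
```

The non-determinism is precisely what makes results hard to reproduce and verify, as noted above; passing an explicitly seeded rng is one common mitigation.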

Round-ceiling: This refers to rounding towards positive infinity. In the case of a positive number, the result will remain unchanged if the digits to be discarded are all zero; otherwise it will be rounded up. For example, 3.0 will be rounded to 3, but 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9 will all be rounded to 4. By comparison, in the case of a negative number, the unwanted digits are simply discarded. For example, -3.0, -3.1, -3.2, -3.3, -3.4, -3.5, -3.6, -3.7, -3.8, and -3.9 will all be rounded to -3.

This algorithm (which is implemented using the ceil function in MATLAB) results in a cumulative positive bias and also requires additional logic when realized in hardware. For these reasons, the round-ceiling algorithm is not often used in hardware implementations. (In the case of non-hardware implementations, or during analysis of a system using software simulation, the round-ceiling approach is sometimes employed to determine the upper limit of the algorithm for use in diagnostic functions.)

Round-floor: The counterpart to round-ceiling , this refers to rounding towards negative infinity. In the case of a negative number, the result will remain unchanged if the digits to be discarded are all zero; otherwise it will be rounded down. For example, -3.0 will be rounded to -3, but -3.1, -3.2, -3.3, -3.4, -3.5, -3.6, -3.7, -3.8, and -3.9 will all be rounded to -4. By comparison, in the case of a positive number, the unwanted digits are simply discarded. For example, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9 will all be rounded to 3.

This algorithm (which is implemented using the floor function in MATLAB) results in a cumulative negative bias. However, round-floor is “cheap” in terms of a hardware implementation since it involves only a simple truncation; this technique is therefore very often used with regard to hardware implementations. (In the case of non-hardware implementations, or during analysis of a system using software simulation, the round-floor approach is sometimes employed to determine the lower limit of the algorithm for use in diagnostic functions.)

Round-toward-zero: As its name suggests, this refers to rounding in such a way that the result heads toward zero. For example, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9 will all be rounded to 3. Similarly, -3.1, -3.2, -3.3, -3.4, -3.5, -3.6, -3.7, -3.8, and -3.9 will all be rounded to -3.

Another way to think about this is that round-toward-zero (which is implemented using the fix function in MATLAB) acts in the same way as the round-floor algorithm for positive numbers and as the round-ceiling algorithm for negative numbers.
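Python’s math module offers the same three primitives, with floor, ceil, and trunc corresponding to MATLAB’s floor, ceil, and fix respectively:

```python
import math

# floor rounds toward negative infinity, ceil toward positive
# infinity, and trunc toward zero:
for x in (3.7, -3.7):
    print(x, math.floor(x), math.ceil(x), math.trunc(x))
# 3.7  ->  3,  4,  3
# -3.7 -> -4, -3, -3
```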

Round-away-from-zero: The counterpart to round-toward-zero , this refers to rounding in such a way that the result heads away from zero. For example, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, and 3.9 will all be rounded to 4. Similarly, -3.1, -3.2, -3.3, -3.4, -3.5, -3.6, -3.7, -3.8, and -3.9 will all be rounded to -4.

Another way to think about this is that round-away-from-zero acts in the same manner as a round-ceiling algorithm for positive numbers and as a round-floor algorithm for negative numbers.

Round-up: The actions of this rounding mode depend on what one means by “up”. Some applications understand “up” to refer to heading towards positive infinity; in this case, round-up is synonymous with round-ceiling.

Alternatively, some applications regard “up” as referring to an absolute value heading away from zero; in this case, round-up acts in the same manner as the round-away-from-zero algorithm.

Round-down: The counterpart to the round-up algorithm, the actions of this mode depend on what one means by “down”. Some applications understand “down” to refer to heading towards negative infinity; in this case, round-down is synonymous with round-floor.

Alternatively, some applications regard “down” as referring to an absolute value heading toward zero; in this case, round-down acts in the same manner as the round-toward-zero algorithm.

Truncation: Also known as chopping , truncation simply means discarding any unwanted digits. Although this would appear – on the surface – to be relatively simple, things become a little more interesting when we realize that the actions resulting from truncation vary depending on whether we are working with sign-magnitude, unsigned, or signed (complement) values. In the case of sign-magnitude values (like the standard decimal values we’ve been considering thus far), and also when working with unsigned binary values, the actions of truncation are identical to those of the round-toward-zero mode. In the case of signed binary values, however, truncation works somewhat differently for negative values (we’ll return to this point shortly).

Hardware implementations and implications
The next point we need to consider is when and where we may wish to actually perform rounding operations. In order to do this, we first need to take a step back to define a couple of concepts, such as the differences between integer, fractional, fixed-point, and floating-point values.

Let’s start with the term “fixed-point,” which refers to a way of writing (or otherwise representing, including storing in the case of computers) numerical quantities with a predetermined number of digits and with the “point” located at a single, unchanging position (this would be the “decimal point” in the case of decimal numbers, the “binary point” in the case of binary numbers, and so forth).

For example, a fixed-point sign-magnitude decimal number supporting three digits to the left of the decimal point and two digits to the right could be used to represent values in the range −999.99 to +999.99. From this, we realize that integer representations are a special case of fixed-point in which there are no fractional digits. Similarly, purely fractional representations are a special case of fixed-point in which there are no integer digits.

As opposed to fixed-point, a floating-point value is one that is expressed as a multiple of an appropriate power of the base of the number system, which thereby allows the point to move around. For example, using floating-point notation, the value 12.34 can also be written as 123.4 × 10⁻¹, 1234 × 10⁻², 12340 × 10⁻³, and so forth; and also as 1.234 × 10¹, 0.1234 × 10², 0.01234 × 10³, and so forth.

So, one case where we might wish to perform rounding would be when working only with integers and dividing two integer values, such as 9 / 2 = 4.5, at which point we have to round the result back to an integer. Similarly, it may be necessary to perform a rounding operation after multiplying two fractional numbers together.

One very common case where rounding proves necessary is when converting from floating-point representations into their fixed-point counterparts. For example, DSP algorithms are typically first analyzed and evaluated using floating-point representations; these algorithms are subsequently recast into fixed-point representations for implementation in hardware. In this case, it is possible to experience what are known as quantization errors or computational noise (where the term “quantization” refers to the act of limiting the possible values of some quantity or magnitude into a discrete set of values or quanta.)

When we come to hardware implementations of rounding algorithms, the most common techniques are truncation (which is often referred to as round-toward-zero , but this would be incorrect in the case of signed binary values as discussed below), round-half-up (which is commonly referred to as round-to-nearest , but this would be inexact as discussed earlier in this article), and round-floor (rounding towards negative infinity).

In the case of unsigned binary representations, truncation , round-half-up , and round-floor work as discussed above. However, there are some interesting nuances when it comes to signed binary representations. Purely for the sake of discussion, let’s consider an 8-bit signed binary fixed-point representation comprising four integer bits and four fractional bits (Fig 1).


1. An 8-bit signed binary 4.4 fixed-point representation.

For the purpose of this portion of our discussions (and for the sake of simplicity), let’s assume that we wish to perform our various rounding algorithms so as to be left with only an integer value. Now, let’s suppose that our 8-bit field contains a value of 0011.1000, which equates to +3.5 in decimal. In the case of a truncation (or chopping) algorithm, we will simply discard the fractional bits, resulting in an integer value of 0011; this equates to 3 in decimal, which is what we would expect.

Next, assume a value of 1100.1000. The integer portion of this equates to -4, while the fractional portion equates to +0.5, resulting in a total value of -3.5. Thus, when we perform a truncation and discard our fractional bits, we end up with an integer value of 1100, which equates to -4 in decimal. Thus, in the case of a signed binary value, performing a truncation operation actually results in implementing a round-floor algorithm.
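Because Python’s right-shift operator is an arithmetic shift, this behavior can be modeled directly by treating the 4.4 field as an integer scaled by 2⁴ = 16 (the helper function here is mine, for illustration; it assumes the input value is an exact multiple of 1/16):

```python
def truncate_4_4(value):
    """Truncate a real value stored in 4.4 two's-complement fixed point."""
    fixed = int(value * 16)   # e.g. -3.5 -> -56 (assumes exact 1/16 steps)
    # Discarding the four fractional bits is an arithmetic right shift,
    # which rounds toward negative infinity (round-floor):
    return fixed >> 4

print(truncate_4_4(3.5))    # 3  (0011.1000 -> 0011)
print(truncate_4_4(-3.5))   # -4 (1100.1000 -> 1100, i.e. round-floor)
```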

One of the reasons the round-half-up algorithm is popular for hardware implementations is that it doesn’t require us to perform a comparison operation. Instead, all we have to do is add a value of 0.5 and truncate the result. (Note that, when we say “add 0.5,” this is based on our earlier assumption that we are rounding to the nearest integer; by comparison, if we were rounding to the nearest half, we would add 0.25 [which is 0.01 in binary] and truncate; and so forth.)

For example, suppose that our 8-bit field contains a value of 0011.0110, which equates to 3.375 in decimal. In this case, if we add a value of 0000.1000 (which equates to 0.5 in decimal), the result is 0011.1110; so when we now truncate this result, we end up with 0011, which equates to 3 in decimal. And this is, of course, what we would expect, because 3.375 should round to 3 in the case of the round-half-up algorithm.

Now remember that, in the case of positive values, we expect the round-half-up algorithm to round to the next integer for fractional values of 0.5 and higher. Suppose we have an initial value of 0011.1000, which equates to 3.5. Adding 0000.1000 and truncating the result leaves us with 0100, or 4 in decimal, which is what we expect. Similarly, when we take an initial value of 0011.1100, which equates to 3.75 in decimal, adding 0000.1000 and truncating the result again leaves us with 0100, which is what we expect.

But what about negative values? For example, consider an initial value of 1100.1000. Once again, the integer portion equates to -4, while the fractional portion equates to +0.5, resulting in a total value of -3.5. In this case, adding 0000.1000 and truncating results in a value of 1101, which equates to -3 in decimal. From this we see that the simple round-half-up algorithm favored by many hardware implementations actually results in an asymmetrical realization of this rounding function.
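The add-and-truncate trick on the 4.4 representation can be sketched in a few lines of Python (again, the helper function is mine and assumes the input is an exact multiple of 1/16):

```python
def round_half_up_4_4(value):
    """Hardware-style round-half-up on a 4.4 fixed-point value:
    add 0.5 (0000.1000) and truncate the fractional bits."""
    fixed = int(value * 16)   # scale into the 4.4 representation
    return (fixed + 8) >> 4   # 8 is 0000.1000, i.e. 0.5 in 4.4 format

print(round_half_up_4_4(3.375))   # 3
print(round_half_up_4_4(3.5))     # 4
print(round_half_up_4_4(-3.5))    # -3 (an asymmetric realization)
```

The -3.5 case confirms that this cheap implementation rounds half-way values toward positive infinity, i.e. it is the asymmetric form of the algorithm.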

Applying different rounding schemes to a filter design
In order to provide some real-world examples, Tim Vanevenhoven at AccelChip was kind enough to create and run some test cases in MATLAB to illustrate the types of errors associated with the different rounding schemes applied at various stages throughout a digital filter.

These examples are based on a 32-tap low-pass FIR filter. The screenshot in Fig 2 shows a noisy input applied to the filter (top) compared to the golden output from a floating-point version of the filter (bottom).


2. Noisy input versus golden output.

Next, a fixed-point version of the filter was created with the following characteristics:

  • The input has 12 bits, 10 of which are fractional.
  • The coefficients have 12 bits, 11 of which are fractional.
  • The internal accumulator has 11 bits, 10 of which are fractional.

The idea is to apply some of the more common rounding algorithms to different portions of the filter so as to compare the quantization errors and computational noise associated with these schemes. Note that all of the quantization errors are derived from the rounding in this example, because it has been designed such that there is no overflow.

The first test involved performing a round-floor algorithm on the filter coefficients (note that a floating-point version of the accumulator was used in this test, so as to isolate any effects to the coefficients themselves). As shown in Fig 3 , the result was a maximum rounding error of 0.003187 and a DC bias of +5.8258e-005 as compared to the ideal (golden) output.


3. Round-floor on coefficients.

Next, we applied a round-ceiling algorithm to the filter coefficients, which resulted in a maximum rounding error of 0.0043851 and a DC bias of -7.6429e-005 (Fig 4).


4. Round-ceiling on coefficients.

We then applied a round-half-up algorithm to the coefficients; as illustrated in Fig 5, this resulted in a maximum rounding error of 0.00052307 (an order of magnitude better than the round-floor and round-ceiling algorithms) and a maximum bias of +7.4045e-007 (a two-orders-of-magnitude improvement).


5. Round-half-up on coefficients.

Rounding the filter coefficients is an effective way to reduce the overall quantization error in a design. Of particular interest is the fact that – assuming the coefficients to be constant values – this doesn’t add any additional hardware to the design. Thus, it would be interesting to experiment with alternative rounding schemes on the coefficients, such as round-half-even , round-alternate , and even round-random .

It’s important to note that exactly how much the more sophisticated rounding schemes will improve the filter’s performance versus the round-floor or round-ceiling algorithms is dependent on the actual values of the coefficients and “how far” the floating point values are from the nearest quantization steps. In this particular example, round-ceiling introduced more quantization error than round-floor ; however, the results could be different with different coefficients.

Rounding to the nearest neighbor (using the round-half-up algorithm in this example) will usually give the best results, because the absolute value of the difference between the floating-point and fixed-point coefficients will be minimized. In addition, some of the rounding errors will be positive and some will be negative and will – in a sense – tend to cancel each other out. By comparison, using round-floor or round-ceiling will shift all of the coefficients toward negative or positive infinity, resulting in the positive and negative DC biases reflected in Figs 3 and 4 , respectively.
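This effect can be sketched with a quick quantization experiment, with Python standing in for MATLAB (the coefficient values below are hypothetical, chosen only to illustrate the point; they are not the filter’s actual coefficients):

```python
import math

def quantize(coeffs, frac_bits, mode):
    """Quantize floating-point coefficients onto a fixed-point grid of
    2**-frac_bits steps using the supplied rounding function."""
    scale = 2 ** frac_bits
    return [mode(c * scale) / scale for c in coeffs]

# A few illustrative (hypothetical) filter coefficients:
coeffs = [0.10293, -0.04417, 0.33025, 0.49881]

floored = quantize(coeffs, 11, math.floor)                    # round-floor
nearest = quantize(coeffs, 11, lambda x: math.floor(x + 0.5)) # round-half-up

def max_err(q):
    return max(abs(a - b) for a, b in zip(coeffs, q))

# Round-floor can be off by up to one full step (2**-11), while
# round-half-up is off by at most half a step (2**-12):
print(max_err(floored))
print(max_err(nearest))
```

Exactly as described above, rounding to the nearest grid point halves the worst-case coefficient error, and its errors fall on both sides of zero rather than all in one direction.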

Our next experiment was to apply a round-floor algorithm to the entire datapath (except for the filter coefficients, which we maintained using the round-half-up algorithm from Fig 5 ). The result was a maximum rounding error of 0.021075 and a DC bias of +0.015377 as illustrated in Fig 6 .


6. Round-floor on entire datapath (except filter coefficients).

Next, we applied a round-ceiling algorithm to the accumulator while maintaining the round-half-up on the coefficients and the round-floor on the remainder of the datapath. This resulted in a maximum rounding error of 0.020344 and a DC bias of -0.015342 as illustrated in Fig 7.


7. Round-ceiling on accumulator (RF on datapath, RHU on coefficients).

As we see, rounding the accumulator using round-floor or round-ceiling can – not surprisingly – introduce a DC bias into the output. These schemes also increase the overall error, because the rounding will cause creeping during the accumulation; the larger the number of accumulation iterations, the more the creep will add up.

In order to address this, we first modified the accumulator to use a round-half-up algorithm, resulting in a maximum rounding error of 0.0047283 and a DC bias of +0.000043169, as shown in Fig 8 .


8. Round-half-up on accumulator (RF on datapath, RHU on coefficients).

We then changed this to the round-half-even algorithm (which MATLAB refers to as a “convergent round”), resulting in a maximum rounding error of 0.0047283 and a DC bias of +0.000047076, as shown in Fig 9.


9. Round-half-even on accumulator (RF on datapath, RHU on coefficients).

Both the round-half-up and round-half-even algorithms provided an order of magnitude improvement in the maximum rounding error as compared to the round-floor and round-ceiling functions; they both also provided three orders of magnitude improvement in the DC bias.

The interesting point here is that the results from the round-half-up and round-half-even algorithms were almost identical in this case (this tells us that the data in this example didn’t contain many values that fell exactly on “half-way” [0.5] boundaries). However, a hardware implementation of the round-half-even would require the additional overhead of a comparison operation (to identify data values that exactly fell on 0.5 boundaries). Thus, the round-half-up algorithm, which doesn’t require this comparison, gives a much better “bang for the buck” in this example, because the only additional “cost” in hardware is the inclusion of a simple adder.

There are of course many more experiments that one could perform along these lines. For those who are interested, Tim has kindly made the MATLAB Source Files for the examples described here available for your delectation and delight. If you come up with some additional rounding facts and considerations (for these examples or for your own test cases), please feel free to tell me about them, and maybe we’ll write another article!

Clive “Max” Maxfield is president of TechBites Interactive, a marketing consultancy firm specializing in high technology. Max is the author and co-author of a number of books, including Bebop to the Boolean Boogie (An Unconventional Guide to Electronics), The Design Warrior’s Guide to FPGAs (Devices, Tools, and Flows), and – most recently – How Computers Do Math (ISBN: 0471732788) featuring the pedagogical and phantasmagorical virtual DIY Calculator.

<From https://www.eetimes.com/an-introduction-to-different-rounding-algorithms/>


How to Create a Multilayer Perceptron Neural Network in Python

by Robert Keim

This article takes you step by step through a Python program that will allow us to train a neural network and perform advanced classification.

This is the 12th entry in AAC’s neural network development series. See what else the series offers below:

  1. How to Perform Classification Using a Neural Network: What Is the Perceptron?
  2. How to Use a Simple Perceptron Neural Network Example to Classify Data
  3. How to Train a Basic Perceptron Neural Network
  4. Understanding Simple Neural Network Training
  5. An Introduction to Training Theory for Neural Networks
  6. Understanding Learning Rate in Neural Networks
  7. Advanced Machine Learning with the Multilayer Perceptron
  8. The Sigmoid Activation Function: Activation in Multilayer Perceptron Neural Networks
  9. How to Train a Multilayer Perceptron Neural Network
  10. Understanding Training Formulas and Backpropagation for Multilayer Perceptrons
  11. Neural Network Architecture for a Python Implementation
  12. How to Create a Multilayer Perceptron Neural Network in Python

In this article, we’ll be taking the work we’ve done on Perceptron neural networks and learn how to implement one in a familiar language: Python.

 

Developing Comprehensible Python Code for Neural Networks

Recently I’ve looked at quite a few online resources for neural networks, and though there is undoubtedly much good information out there, I wasn’t satisfied with the software implementations that I found. They were always too complex, or too dense, or not sufficiently intuitive. When I was writing my Python neural network, I really wanted to make something that could help people learn about how the system functions and how neural-network theory is translated into program instructions.

However, there is sometimes an inverse relationship between the clarity of code and the efficiency of code. The program that we will discuss in this article is most definitely not optimized for fast performance. Optimization is a serious issue within the domain of neural networks; real-life applications may require immense amounts of training, and consequently thorough optimization can lead to significant reductions in processing time. However, for simple experiments like the ones that we will be doing, training doesn’t take very long, and there’s no reason to stress about coding practices that favor simplicity and comprehension over speed.

The entire Python program is included as an image at the end of this article, and the file (“MLP_v1.py”) is provided as a download. The code performs both training and validation; this article focuses on training, and we’ll discuss validation later. In any case, though, there’s not much functionality in the validation portion that isn’t covered in the training portion.

As you’re pondering the code, you may want to look back at the slightly overwhelming but highly informative architecture-plus-terminology diagram that I provided in Part 10.

 

Preparing Functions and Variables

 

The NumPy library is used extensively for the network’s calculations, and the Pandas library gives me a convenient way to import training data from an Excel file.

As you already know, we’re using the logistic sigmoid function for activation. We need the logistic function itself for calculating postactivation values, and the derivative of the logistic function is required for backpropagation.
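A minimal sketch of these two helper functions (the names here are illustrative, not necessarily the ones used in the downloadable file):

```python
import numpy as np

def logistic(x):
    # Logistic sigmoid: squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    # Derivative of the logistic function, needed during backpropagation
    return logistic(x) * (1.0 - logistic(x))
```

Note that the derivative peaks at 0.25 when the input is zero and shrinks toward zero for large positive or negative inputs, which is why sigmoid gradients fade as preactivation values grow.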

Next we choose the learning rate, the dimensionality of the input layer, the dimensionality of the hidden layer, and the epoch count. Training over multiple epochs is important for real neural networks, because it allows you to extract more learning from your training data. When you’re generating training data in Excel, you don’t need to run multiple epochs because you can easily create more training samples.

The np.random.uniform() function fills our two weight matrices with random values between –1 and +1. (Note that the hidden-to-output matrix is actually just an array, because we have only one output node.) The np.random.seed(1) statement causes the random values to be the same every time you run the program. The initial weight values can have a significant effect on the final performance of the trained network, so if you’re trying to assess how other variables improve or degrade performance, you can uncomment this instruction and thereby eliminate the influence of random weight initialization.
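A hedged sketch of this setup step; the learning rate and layer dimensions below are hypothetical placeholders, not the values used in the actual file:

```python
import numpy as np

np.random.seed(1)   # optional: a fixed seed makes every run repeatable

learning_rate = 1.0   # hypothetical values, for illustration only
I_dim = 3             # input-layer dimensionality
H_dim = 4             # hidden-layer dimensionality
epoch_count = 1

# Input-to-hidden weights: one row per input node, one column per hidden node
weights_ItoH = np.random.uniform(-1, 1, (I_dim, H_dim))
# Hidden-to-output weights: a 1-D array, because there is a single output node
weights_HtoO = np.random.uniform(-1, 1, H_dim)

# Empty arrays for the hidden layer's pre- and postactivation values
preActivation_H = np.zeros(H_dim)
postActivation_H = np.zeros(H_dim)
```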

Finally, I create empty arrays for the preactivation and postactivation values in the hidden layer.

 

Importing Training Data

 

 

This is the same procedure that I used back in Part 3. I import training data from Excel, separate out the target values in the “output” column, remove the “output” column, convert the training data to a NumPy matrix, and store the number of training samples in the training_count variable.
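That sequence of steps can be sketched as follows. A small in-memory DataFrame stands in here for the Excel workbook (in the article the data comes from pandas.read_excel), and the feature-column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Stand-in for: training_data = pd.read_excel('...') -- hypothetical values
training_data = pd.DataFrame({
    'x1': [0.2, 0.8, 0.5],
    'x2': [0.7, 0.1, 0.9],
    'output': [1, 0, 1],
})

target_output = training_data['output'].values       # separate out the targets
training_data = training_data.drop(columns=['output'])  # remove the "output" column
training_data = np.asarray(training_data)            # convert to a NumPy matrix
training_count = len(training_data[:, 0])            # number of training samples
```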

 

Feedforward Processing

The computations that produce an output value, and in which data are moving from left to right in a typical neural-network diagram, constitute the “feedforward” portion of the system’s operation. Here is the feedforward code:

 


The first for loop allows us to have multiple epochs. Within each epoch, we calculate an output value (i.e., the output node’s postactivation signal) for each sample, and that sample-by-sample operation is captured by the second for loop. In the third for loop, we attend individually to each hidden node, using the dot product to generate the preactivation signal and the activation function to generate the postactivation signal.

After that, we’re ready to calculate the preactivation signal for the output node (again using the dot product), and we apply the activation function to generate the postactivation signal. Then we subtract the target from the output node’s postactivation signal to calculate the final error.
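A self-contained sketch of those three nested loops, with random data standing in for the imported training set (variable names follow the article’s conventions but are assumptions about the actual file):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions and data, for illustration only
rng = np.random.default_rng(1)
I_dim, H_dim = 3, 4
weights_ItoH = rng.uniform(-1, 1, (I_dim, H_dim))
weights_HtoO = rng.uniform(-1, 1, H_dim)
training_data = rng.uniform(0, 1, (5, I_dim))   # 5 samples
target_output = rng.integers(0, 2, 5)
epoch_count = 1

preActivation_H = np.zeros(H_dim)
postActivation_H = np.zeros(H_dim)

for epoch in range(epoch_count):                     # first loop: epochs
    for sample in range(len(training_data)):         # second loop: samples
        for node in range(H_dim):                    # third loop: hidden nodes
            preActivation_H[node] = np.dot(training_data[sample, :],
                                           weights_ItoH[:, node])
            postActivation_H[node] = logistic(preActivation_H[node])
        # Output node: dot product, activation, then the final error
        preActivation_O = np.dot(postActivation_H, weights_HtoO)
        postActivation_O = logistic(preActivation_O)
        FE = postActivation_O - target_output[sample]   # final error
```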

 

Backpropagation

After we have performed the feedforward calculations, it’s time to reverse directions. In the backpropagation portion of the program, we move from the output node toward the hidden-to-output weights and then the input-to-hidden weights, bringing with us the error information that we use to effectively train the network.

 

 

We have two layers of for loops here: one for the hidden-to-output weights, and one for the input-to-hidden weights. We first generate SERROR, which we need for calculating both gradientHtoO and gradientItoH, and then we update the weights by subtracting the gradient multiplied by the learning rate.

Notice how the input-to-hidden weights are updated within the hidden-to-output loop. We start with the error signal that leads back to one of the hidden nodes, then we extend that error signal to all the input nodes that are connected to this one hidden node:

 


After all of the weights (both ItoH and HtoO) associated with that one hidden node have been updated, we loop back and start again with the next hidden node.

Also note that the ItoH weights are modified before the HtoO weights. We use the current HtoO weight when we calculate gradientItoH, so we don’t want to change the HtoO weights before this calculation has been performed.
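The update logic described above can be sketched like this; the starting state is fabricated for illustration, and the names mirror the article’s terminology rather than being copied from the downloadable file:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    return logistic(x) * (1.0 - logistic(x))

# Hypothetical state left over from one feedforward pass (illustration only)
rng = np.random.default_rng(1)
LR, I_dim, H_dim = 1.0, 3, 4
weights_ItoH = rng.uniform(-1, 1, (I_dim, H_dim))
weights_HtoO = rng.uniform(-1, 1, H_dim)
sample_input = rng.uniform(0, 1, I_dim)
preActivation_H = sample_input @ weights_ItoH
postActivation_H = logistic(preActivation_H)
preActivation_O = postActivation_H @ weights_HtoO
postActivation_O = logistic(preActivation_O)
FE = postActivation_O - 1.0        # final error, assuming a target of 1

for H_node in range(H_dim):
    # S_ERROR feeds both gradientHtoO and gradientItoH
    S_error = FE * logistic_deriv(preActivation_O)
    gradient_HtoO = S_error * postActivation_H[H_node]
    for I_node in range(I_dim):
        # Extend the error signal back to each input-to-hidden weight
        gradient_ItoH = (S_error * weights_HtoO[H_node]
                         * logistic_deriv(preActivation_H[H_node])
                         * sample_input[I_node])
        weights_ItoH[I_node, H_node] -= LR * gradient_ItoH
    # The HtoO weight is updated last, after gradient_ItoH has used its
    # current (pre-update) value
    weights_HtoO[H_node] -= LR * gradient_HtoO
```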

 

Conclusion

It’s interesting to think about how much theory has gone into this relatively short Python program. I hope that this code helps you to really understand how we can implement a multilayer Perceptron neural network in software.

You can find my full code below:

 

 

Download Code

<from https://www.allaboutcircuits.com/technical-articles/how-to-create-a-multilayer-perceptron-neural-network-in-python>

Posted in Uncategorized | Leave a comment

Signal Processing Using Neural Networks: Validation in Neural Network Design

 by Robert Keim

This article explains why validation is particularly important when we’re processing data using a neural network.

AAC’s series on neural network development continues here with a look at validation in neural networks and how NNs function in signal processing.

  1. How to Perform Classification Using a Neural Network: What Is the Perceptron?
  2. How to Use a Simple Perceptron Neural Network Example to Classify Data
  3. How to Train a Basic Perceptron Neural Network
  4. Understanding Simple Neural Network Training
  5. An Introduction to Training Theory for Neural Networks
  6. Understanding Learning Rate in Neural Networks
  7. Advanced Machine Learning with the Multilayer Perceptron
  8. The Sigmoid Activation Function: Activation in Multilayer Perceptron Neural Networks
  9. How to Train a Multilayer Perceptron Neural Network
  10. Understanding Training Formulas and Backpropagation for Multilayer Perceptrons
  11. Neural Network Architecture for a Python Implementation
  12. How to Create a Multilayer Perceptron Neural Network in Python
  13. Signal Processing Using Neural Networks: Validation in Neural Network Design

The Nature of Neural-Network Signal Processing

A neural network is fundamentally different from other signal-processing systems. The “normal” way to achieve some sort of signal-processing objective is to apply an algorithm.

In this model, a researcher creates a mathematical method for analyzing or modifying a signal in some way. There are methods for removing noise from audio, finding edges in images, calculating temperature from the resistance of a thermistor, determining the frequency content of an RF waveform, and so forth. The designer then builds upon the work of the researcher by converting that method into an algorithm that can be carried out by a processor and adapted to the needs of a given application.

 

An FIR filter is an example of a signal-processing system that we can assess and understand in a precise mathematical way.

A trained neural network, on the other hand, is an empirical system.

The mathematical processes that occur in the network do not constitute a specific algorithm that is intended to classify handwritten characters, or predict the formation of tornadoes, or develop control procedures for extreme aeronautical maneuvers. Rather, the math in the neural network is a framework that enables the network to create a customized computational model based on training data.

We understand the mathematical framework that allows a neural network to learn and achieve its required functionality, but the actual signal-processing algorithm is specific to the training data, the learning rate, the initial weight values, and other factors.

 

 

A neural network, unlike a FIR filter, is dependent upon many different factors.

 

It’s like the difference between learning a language as a child and studying a language as an adult.

A child who has never even heard the word “grammar” can repeatedly produce the correct verb form because his or her brain has naturally recognized and retained patterns contained in the enormous quantity of linguistic input data that children receive from older people with whom they interact.

Adults, however, usually don’t have access to all this input and may not assimilate patterns in the same way, and consequently we memorize and implement the linguistic “algorithms” that enable us to correctly conjugate verbs and choose tenses.

 

The Importance of Validation

Neural networks can solve extremely complex problems because when given abundant input, they “naturally” find mathematical patterns similar to how children find linguistic patterns. But this approach to signal processing is by no means infallible.

Consider English-speaking children who say “goed” instead of “went” or “holded” instead of “held.” These are called overregularization errors. They’ve picked up on the -ed pattern for past tense, but for some reason—maybe insufficient data, or cognitive idiosyncrasies—they haven’t yet refined their linguistic model to account for verbs that are irregular in the past tense.

No one, of course, is going to castigate a four-year-old for saying “I goed to the park.” But if a prominent politician were delivering an important speech and repeatedly said “goed,” “holded,” “finded,” “knowed,” and so forth, the audience would be seriously displeased (or utterly perplexed) and the speaker’s political career might come to an abrupt end.

These overregularization errors are a good example of how a trained neural network might have unexpected gaps in its ability to achieve the desired signal-processing functionality. And though little gaps may seem unimportant, or even interesting, when we’re merely performing experiments, the example of the politician reminds us that they could be catastrophic in a real application.

 

Both undertraining and overtraining can result in unexpected and problematic behavior when the network confronts real application data. See Part 4 for more information.

And now we see why validation is a crucial aspect of neural-network development. Training isn’t enough, because a training data set is inherently limited and therefore the network’s response to this data set is also limited.

Furthermore, training results in a “black box” computational system that we cannot analyze and assess as though it were a typical formula or algorithm. Thus, we need to validate, which I would define as doing everything we reasonably can to ensure that the network will successfully process typical real-life input data and will not produce spectacular failures when presented with atypical data.

 

Sorting Through the Terminology

The procedure that I identify as “validation” might also be called “verification” or simply “testing.”

In the context of software development, the first two terms have distinct meanings. Wikipedia, citing Barry Boehm, says that verification seeks to determine if the product is being built correctly, and validation seeks to determine if the correct product is being built. Since both of these issues are essential, you will see the abbreviation “V&V” for “verification and validation.”

I’m not a software engineer, so hopefully that means I’m not obligated to adopt this paradigm. I am simply using the term “validation” to refer to the testing, analysis, and observation that we perform in an attempt to ensure that the trained neural network meets system requirements.

 

Closing Thoughts: What Exactly Is Validation?

Well, that depends.

NASA, for example, published a fairly long document entitled “Verification & Validation of Neural Networks for Aerospace Systems.” If you’re more interested than I am in neural-network V&V, you might want to start with this document. If you’re a true V&V zealot, you should consider the book Methods and Procedures for the Verification and Validation of Artificial Neural Networks; it’s 293 pages long and surely exceeds my knowledge of this topic by at least three orders of magnitude.

In my world of simple neural networks developed for experimental or instructional purposes, validation primarily means running the trained network on new data and assessing classification accuracy, and we could also include fine-tuning that helps us to determine if and how the overall performance can be improved.
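In that spirit, a minimal sketch of what “running the trained network on new data and assessing classification accuracy” might look like for a single-output classifier; the outputs and targets below are made up:

```python
import numpy as np

def classification_accuracy(predicted, targets, threshold=0.5):
    # Threshold the network's continuous outputs and compare with targets
    decisions = (np.asarray(predicted) >= threshold).astype(int)
    return float(np.mean(decisions == np.asarray(targets)))

# Hypothetical outputs from a trained network on fresh validation samples
outputs = [0.91, 0.12, 0.78, 0.45]
targets = [1, 0, 1, 1]
acc = classification_accuracy(outputs, targets)   # 3 of 4 correct -> 0.75
```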

We’ll look at specific validation techniques in future articles.

<from https://www.allaboutcircuits.com/technical-articles/neural-network-signal-processing-validation-in-neural-network-design>


Inside Apple’s New Audio Adapter

The decision to axe the iPhone’s built-in headphone port and simply put an adapter in the box has provoked reactions ranging from amusement to near panic. Why did they do it? Was it worth it? Will other manufacturers copy it? Today we’re going to ignore all of these questions. Instead we’re asking, How did they do it? And since we like taking things apart, we’ll answer with some exploratory surgery and some X-rays.

All Your Digitals Are Belong To Us

Apple feels the 3.5 mm audio jack is an antique whose time has passed. But we’re not all prepared to shell out for new headphones just yet, so to ease the transition Apple gave iPhone 7 owners a deal worthy of Oprah—you get a headphone adapter! And you get a headphone adapter! Everyone gets a headphone adapter!

Separately, this little adapter retails for $9.00—making it pretty much the cheapest thing in the Apple store, where you can drop $35 for a simple screen protector. So, you’d probably think a $9 dongle doesn’t have much going on.

Imagine our surprise, then, when our pals over at Creative Electron gave Apple’s new adapter the X-ray treatment:

Thanks to Creative Electron for this X-ray image of Apple’s audio adapter.

There’s actually a lot going on in there. As expected, one end is a simple female 3.5 mm headphone jack, and the other end is a male Lightning connector. But what’s all that silicon around the Lightning connector end? Most of the real estate near the connector is taken up by a single mystery IC.

Image courtesy of the amazing folks at Creative Electron.

We needed a closer look. Thankfully, long-time iFixit contributor and gadgeteer extraordinaire oldturkey03 sliced his adapter open so we could all get a peek inside. He uncovered that mystery IC by the Lightning connector, marked 338S00140 A0SM1624 TW—which doesn’t tell us much, other than it’s an Apple part number.

Thanks to longtime iFixit community member OldTurkey03 for his teardown of the audio adapter.

While the official purpose of this IC is unknown, at minimum we can guess that it contains a digital-to-analog converter (DAC) and amplifier, and its counterpart, an analog-to-digital converter (ADC).

We know this because audio accessories like earphones (as well as human ears) need analog signals to work—and unlike ye olde analog headphone jack, Apple’s Lightning connector is all digital. The DAC bridges that gap. By the same logic, this chip must also contain an ADC circuit to convert the analog signal from your headphones’ built-in mic into something that can pass back through the Lightning port so your iPhone can make use of it.

In past iPhones like the 6s, both DAC and ADC functions were handled internally. The analog inputs and outputs from the headphone jack (and other components) were wrangled by a single chip on the logic board, a custom Apple/Cirrus Logic IC labeled 338S00105. (In the iPhone 7 and 7 Plus, that same exact chip still exists—because even without a headphone jack, the phone still has to shake hands with other built-in analog components.)

Apple/Cirrus Logic 338S00105 audio codec in the iPhone 6s (left) and iPhone 7 (right).

Is All Well In Appleville?

The moment Apple’s plans for a headphone adapter first came to light, audiophiles began questioning whether such a tiny dongle—and, presumably, the DAC + amp buried inside—could possibly supply a quality audio experience. Speculation was that, in order to fit aboard the adapter, the audio hardware would have to be so small that corners would inevitably be cut.

Well, here’s a visual comparison of the audio chip on the iPhone 7’s logic board, photographed right next to the exposed chip on the new adapter:

This isn’t an oranges-to-oranges comparison however, because each of these chips handles more than just DAC/ADC. The larger chip is also a codec, and is not believed to contain an amplifier (there are three amps located elsewhere on the iPhone 7 logic board).

In short, a more scientific approach is called for. So upon its release, hi-fi enthusiasts at German computer tech magazine c’t ran a battery of sound quality tests on Apple’s new adapter. After taking baseline measurements from the old-school, built-in headphone jacks on an iPhone 6s and iPad Air, they compared the adapter’s output on iPhone 6s, iPad Air, and iPhone 7:

Hi-fi enthusiasts at German computer tech magazine c’t ran a battery of sound quality tests on Apple’s new adapter and broke down the results.

The takeaway seems to be that in some areas, the sound quality does measure a bit worse from the adapter than we might be accustomed to. For instance, when playing an uncompressed 16-bit audio file on the iPhone 6s, the dynamic range dropped from 99.1 dB at the headphone jack to 97.3 dB at the adapter. Though keep in mind, this slightly lower measurement is still higher than the theoretical maximum you get from a compact disc (which is 96 dB). So, is it a difference you are likely to notice? If you sit in a quiet room with a really, really good pair of headphones … and you’re a canine, the answer is: maybe.
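For reference, the CD’s 96 dB ceiling follows from the rule of thumb that each bit of resolution contributes roughly 6.02 dB of dynamic range:

```python
def dynamic_range_db(bits):
    # Rule of thumb: ~6.02 dB of dynamic range per bit of resolution
    # (the full SNR formula for a full-scale sine adds about 1.76 dB)
    return 6.02 * bits

cd_range = dynamic_range_db(16)   # roughly 96 dB for 16-bit CD audio
```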

But it appears Apple’s engineers did their job, and this tiny adapter performs better than most people expected or even thought possible.

Why did they do it? Was it worth it? Will other manufacturers copy it? Give us your thoughts.

<from http://ifixit.org/blog/8448/apple-audio-adapter-teardown/>


Train and Deploy Deep Learning Applications with NVIDIA DIGITS 5 and New TensorRT

March 8, 2017

 

Trained models can be deployed to cloud, PC, embedded, or automotive GPU platforms with the TensorRT inference engine, which maximizes throughput and efficiency.

 

Highlights from this release:

  • NVIDIA DIGITS 5 introduces image segmentation workflow for partitioning images into regions of interest, and Model Store for downloading pre-trained models that can decrease training time and improve model accuracy.
  • NVIDIA TensorRT is a high performance deep learning inference engine for production deployment of applications such as image classification, segmentation, and object detection that delivers up to 14x more images/sec than CPU-only inference.

DIGITS 5 and TensorRT are available as a free download to the members of the NVIDIA Developer Program.

<from https://news.developer.nvidia.com/train-and-deploy-deep-learning-applications-with-nvidia-digits-5-and-new-tensorrt/?ncid=em-ded-ndcgnr-10524&mkt_tok=eyJpIjoiWmpFNE16VTFNelkwTWpBNCIsInQiOiJEWlplZjgzT01HaHJtbmpWZDZsU1YyWjAyNjZWZFpvWFM5emZqajd3SXg5MysyZGo2QXVta1JFNzRVbm51T3hOcTFUVGZBVzgyZGE3V2VxT1lyeDc2NUxkbEZ1VnpJaHdhdkwxb2dhdEtUYmN3cjNSdlwvRnBzVXAxc0NKNEFTTzgifQ%3D%3D>

 


DeepFace: Closing the Gap to Human-Level Performance in Face Verification

By: Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf

In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to date, an identity-labeled dataset of four million facial images belonging to more than 4,000 identities.

The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.

<from https://research.fb.com/publications/deepface-closing-the-gap-to-human-level-performance-in-face-verification/>


NOT VR BUT MR

At the Technology Forum presented by the imec research centre in late May (“ITF”), an invited presentation was given by Microsoft on the topic of its HoloLens headset product. The HoloLens is conceived as a means of providing its wearer (operator? host? user? – there is, perhaps, a whole new niche of terminology to evolve here) with what Microsoft terms “mixed reality”. Rather than transport the wearer (I’ll stick with “wearer” for now) into a different space, it aims to insert convincing holographic images into the space that he or she is actually occupying. So: not virtual reality (there is lots of that around at present), but mixed reality. The HoloLens is worth outlining, if for no other reason than to convey some measure of the scale of engineering effort that is required for a project of this ambition.

The headset inserts an image, with full depth information and focussed at the correct distance (hence, “holographic”), into the wearer’s visual field. The image therefore needs to be solidly located in that space. As well as capturing that space and placing its image data into it, the device accepts input from the wearer in gestures, movements and voice; it is a head-mounted, full-featured computer; an autonomous unit that doesn’t rely on external markers or beacons, cameras, wired connection or host PC.

The Microsoft presentation included an exploded view which I can’t bring you here, but those who know anything of preparing a consumer product for production would recognise that the plastics moulding tooling costs alone must have been spectacular – before even considering the electronics NRE. The headset contains four environmental cameras (imaging the surroundings) plus a depth camera aided by time-of-flight sensing – and a video camera and ambient light sensor. Depth information is gathered at short range to capture the wearer’s gestures.

You might suppose that the image would be some form of miniature projection head-up display; it is not. The lenses carry fine gratings, with R/G/B feeds via optical waveguides, comprising 2.3 million light points. Resolution has to be expressed with a new metric, as 2,500 pixels/radian. That’s enough, Microsoft says, to render small fonts on virtual documents. Almost incidentally, there’s an audio channel with speakers at each ear.

We come, inevitably, to the silicon. The basic architecture is X86, with a CPU plus HPU – that is, a holographic processing unit which is an accelerator with an instruction set and functionality focussed on the holographic image generation task. There is 64 GB of flash memory, and 2 GB of RAM, with a custom memory architecture that can handle the flood of data incoming from the array of sensors. It’s always-on, has no sleep mode and has no heatsinking or fan. Naturally, this simply would not be possible were it not for the latest generations of silicon process technology – which is somewhat the point of imec inviting the presentation as part of its ITF. The main CPU is in 14 nm FinFET, while the HPU is in (mature by these standards) 28 nm – this will be migrated to a newer process, Microsoft assures us.
It is perhaps a statement of the obvious that only the largest corporations can contemplate the research and NRE investment needed to realise such a product – especially in the context of so much activity around VR and with no certainty which of the many VR use cases will “fly”. For those of us not in that league, we can reflect that a whole new space is opening up in HMI terms (human/machine interfaces) that could create a major opportunity in add-on products and accessories.

<from http://www.edn-europe.com>


Understanding Convolutional Neural Networks for NLP

<from http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/>

When we hear about Convolutional Neural Network (CNNs), we typically think of Computer Vision. CNNs were responsible for major breakthroughs in Image Classification and are the core of most Computer Vision systems today, from Facebook’s automated photo tagging to self-driving cars.

More recently we’ve also started to apply CNNs to problems in Natural Language Processing and gotten some interesting results. In this post I’ll try to summarize what CNNs are, and how they’re used in NLP. The intuitions behind CNNs are somewhat easier to understand for the Computer Vision use case, so I’ll start there, and then slowly move towards NLP.

What is Convolution?

For me, the easiest way to understand a convolution is to think of it as a sliding window function applied to a matrix. That’s a mouthful, but it becomes quite clear when you look at a visualization:

Convolution with 3×3 Filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

Imagine that the matrix on the left represents a black-and-white image. Each entry corresponds to one pixel, 0 for black and 1 for white (typically values are between 0 and 255 for grayscale images). The sliding window is called a kernel, filter, or feature detector. Here we use a 3×3 filter, multiply its values element-wise with the original matrix, then sum them up. To get the full convolution we do this for each element by sliding the filter over the whole matrix.
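That sliding-window description translates almost directly into code. The sketch below applies a 3×3 filter with no padding; strictly speaking it computes a cross-correlation (the kernel is not flipped), which is the convention most CNN implementations use, and for a symmetric kernel the two are identical:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image; at each position, multiply
    # element-wise and sum (no padding, so the output shrinks).
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

# The 5x5 binary image and 3x3 filter from the Stanford visualization
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)
result = convolve2d_valid(image, kernel)   # 3x3 output, top-left value is 4
```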

You may be wondering what you can actually do with this. Here are some intuitive examples.

Averaging each pixel with its neighboring values blurs an image:

 

 

 

Taking the difference between a pixel and its neighbors detects edges:

(To understand this one intuitively, think about what happens in parts of the image that are smooth, where a pixel color equals that of its neighbors: The additions cancel and the resulting value is 0, or black. If there’s a sharp edge in intensity, a transition from white to black for example, you get a large difference and a resulting white value)
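These two effects correspond to simple kernels. The sketch below shows a 3×3 averaging kernel and a classic edge-detection kernel whose weights sum to zero, so smooth regions cancel to black exactly as described (the specific kernel values are common examples, not taken from the figures above):

```python
import numpy as np

# Averaging (box-blur) kernel: each output pixel is the mean of its 3x3 patch
blur = np.full((3, 3), 1.0 / 9.0)

# Edge-detection kernel: neighbors subtracted from a weighted center,
# so uniform regions cancel to 0 (black) and sharp transitions stand out
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)

# On a perfectly smooth patch the edge kernel produces exactly 0:
smooth_patch = np.full((3, 3), 0.5)
edge_response = np.sum(smooth_patch * edge)
```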

 

 

 

The GIMP manual has a few other examples. To understand more about how convolutions work I also recommend checking out Chris Olah’s post on the topic.

What are Convolutional Neural Networks?

Now you know what convolutions are. But what about CNNs? CNNs are basically just several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results. In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That’s also called a fully connected layer, or affine layer. In CNNs we don’t do that. Instead, we use convolutions over the input layer to compute the output. This results in local connections, where each region of the input is connected to a neuron in the output. Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results. There’s also something called pooling (subsampling) layers, but I’ll get into that later. During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform. For example, in Image Classification a CNN may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers. The last layer is then a classifier that uses these high-level features.

Convolutional Neural Network (Clarifai)

There are two aspects of this computation worth paying attention to: Location Invariance and Compositionality. Let’s say you want to classify whether or not there’s an elephant in an image. Because you are sliding your filters over the whole image you don’t really care where the elephant occurs. In practice, pooling also gives you invariance to translation, rotation and scaling, but more on that later. The second key aspect is (local) compositionality. Each filter composes a local patch of lower-level features into a higher-level representation. That’s why CNNs are so powerful in Computer Vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.

So, how does any of this apply to NLP?

Instead of image pixels, the input to most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.
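As a toy sketch of building such an input matrix, with random vectors standing in for real word2vec or GloVe embeddings (the vocabulary and sentence here are made up):

```python
import numpy as np

# Hypothetical 100-dimensional embeddings; in practice these would come
# from a trained model such as word2vec or GloVe
rng = np.random.default_rng(0)
sentence = "the cat sat on the mat".split()
vocab = {w: rng.standard_normal(100) for w in sentence}

# One row per token: a 6-word sentence gives a 6x100 "image"
sentence_matrix = np.stack([vocab[w] for w in sentence])
```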

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2–5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this (take a few minutes and try to understand this picture and how the dimensions are computed. You can ignore the pooling for now, we’ll explain that later):

Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
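A toy NumPy sketch of that pipeline, with random filters standing in for learned ones: three region sizes, two filters per size, filters spanning the full embedding width, and 1-max pooling over each feature map, yielding the six-element feature vector:

```python
import numpy as np

rng = np.random.default_rng(0)
sent_len, embed_dim = 7, 5                  # toy sentence matrix ("image")
sentence = rng.standard_normal((sent_len, embed_dim))

features = []
for region in (2, 3, 4):                    # three filter region sizes
    for _ in range(2):                      # two filters per region size
        filt = rng.standard_normal((region, embed_dim))
        # Filter spans the full embedding width, so it slides vertically
        # only, producing a (variable-length) 1-D feature map
        fmap = np.array([np.sum(sentence[i:i+region] * filt)
                         for i in range(sent_len - region + 1)])
        features.append(fmap.max())         # 1-max pooling per feature map

# Six pooled values, concatenated for the (here omitted) softmax layer
feature_vector = np.array(features)
```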

What about the nice intuitions we had for Computer Vision? Location Invariance and local Compositionality made intuitive sense for images, but not so much for NLP. You probably do care a lot where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn’t always true for words. In many languages, parts of phrases could be separated by several other words. The compositional aspect isn’t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works and what higher-level representations actually “mean” isn’t as obvious as in the Computer Vision case.

Given all this, it seems like CNNs wouldn’t be a good fit for NLP tasks. Recurrent Neural Networks make more intuitive sense. They resemble how we process language (or at least how we think we process language): reading sequentially from left to right. Fortunately, this doesn’t mean that CNNs don’t work. All models are wrong, but some are useful. It turns out that CNNs applied to NLP problems perform quite well. The simple Bag of Words model is an obvious oversimplification with incorrect assumptions, but has nonetheless been the standard approach for years and led to pretty good results.

A big argument for CNNs is that they are fast. Very fast. Convolutions are a central part of computer graphics and implemented on a hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive. Even Google doesn’t provide anything beyond 5-grams. Convolutional Filters learn good representations automatically, without needing to represent the whole vocabulary. It’s completely reasonable to have filters of size larger than 5. I like to think that many of the learned filters in the first layer are capturing features quite similar (but not limited) to n-grams, but represent them in a more compact way.

CNN Hyperparameters

Before explaining how CNNs are applied to NLP tasks, let’s look at some of the choices you need to make when building a CNN. Hopefully this will help you better understand the literature in the field.

Narrow vs. Wide convolution

When I explained convolutions above I neglected a little detail of how we apply the filter. Applying a 3×3 filter at the center of the matrix works fine, but what about the edges? How would you apply the filter to the first element of a matrix that doesn’t have any neighboring elements to the top and left? You can use zero-padding. All elements that would fall outside of the matrix are taken to be zero. By doing this you can apply the filter to every element of your input matrix, and get a larger or equally sized output. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. An example in 1D looks like this:

Narrow vs. Wide Convolution. Filter size 5, input size 7. Source: A Convolutional Neural Network for Modelling Sentences (2014)

You can see how wide convolution is useful, or even necessary, when you have a large filter relative to the input size. In the above, the narrow convolution yields an output of size (7 - 5) + 1 = 3, and the wide convolution an output of size (7 + 2*4 - 5) + 1 = 11. More generally, the formula for the output size is n_out = (n_in + 2*n_padding - n_filter) + 1.
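To make the arithmetic concrete, here is a minimal NumPy sketch of the narrow and wide cases above (filter size 5, input size 7); the moving-average filter is just a placeholder:

```python
import numpy as np

x = np.arange(1, 8, dtype=float)  # input of size 7
w = np.ones(5) / 5.0              # filter of size 5 (a simple moving average)

# Narrow convolution: no padding, output size (7 - 5) + 1 = 3
narrow = np.convolve(x, w, mode="valid")

# Wide convolution: zero-pad with (filter_size - 1) = 4 zeros on each side,
# output size (7 + 2*4 - 5) + 1 = 11
wide = np.convolve(np.pad(x, 4), w, mode="valid")

print(len(narrow), len(wide))  # 3 11
```

Note that NumPy’s `convolve` flips the filter (true convolution); with a symmetric filter like this one the distinction doesn’t matter, and deep learning libraries typically skip the flip anyway.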

Stride Size

Another hyperparameter for your convolutions is the stride size, defining by how much you want to shift your filter at each step. In all the examples above the stride size was 1, and consecutive applications of the filter overlapped. A larger stride size leads to fewer applications of the filter and a smaller output size. The following figure from the Stanford cs231n course notes shows stride sizes of 1 and 2 applied to a one-dimensional input:

Convolution Stride Size. Left: Stride size 1. Right: Stride size 2. Source: http://cs231n.github.io/convolutional-networks/

In the literature we typically see stride sizes of 1, but a larger stride size may allow you to build a model that behaves somewhat similarly to a Recursive Neural Network, i.e. looks like a tree.
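A quick NumPy sketch of how the stride changes the number of filter applications; the filter values are arbitrary, and this computes the unflipped (cross-correlation) variant that deep learning frameworks use:

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Slide filter w over x with the given stride (no padding)."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(w)], w)
                     for i in range(n)])

x = np.array([1., 2., 3., 4., 5., 6., 7.])
w = np.array([1., 0., -1.])  # filter of size 3

print(len(conv1d(x, w, stride=1)))  # 5 overlapping applications
print(len(conv1d(x, w, stride=2)))  # 3 applications, smaller output
```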

Pooling Layers

A key aspect of Convolutional Neural Networks is pooling layers, typically applied after the convolutional layers. Pooling layers subsample their input. The most common way to do pooling is to apply a max operation to the result of each filter. You don’t necessarily need to pool over the complete matrix, you could also pool over a window. For example, the following shows max pooling for a 2×2 window (in NLP we typically apply pooling over the complete output, yielding just a single number for each filter):

Max pooling in CNN. Source: http://cs231n.github.io/convolutional-networks/#pool

Why pooling? There are a couple of reasons. One property of pooling is that it provides a fixed size output matrix, which typically is required for classification. For example, if you have 1,000 filters and you apply max pooling to each, you will get a 1000-dimensional output, regardless of the size of your filters, or the size of your input. This allows you to use variable size sentences, and variable size filters, but always get the same output dimensions to feed into a classifier.

Pooling also reduces the output dimensionality but (hopefully) keeps the most salient information. You can think of each filter as detecting a specific feature, such as detecting if the sentence contains a negation like “not amazing” for example. If this phrase occurs somewhere in the sentence, the result of applying the filter to that region will yield a large value, but a small value in other regions. By performing the max operation you are keeping information about whether or not the feature appeared in the sentence, but you are losing information about where exactly it appeared. But isn’t this information about locality really useful? Yes, it is, and it’s a bit similar to what a bag of n-grams model is doing. You are losing global information about locality (where in a sentence something happens), but you are keeping local information captured by your filters, like “not amazing” being very different from “amazing not”.

In image recognition, pooling also provides basic invariance to translation (shifting) and rotation. When you are pooling over a region, the output will stay approximately the same even if you shift/rotate the image by a few pixels, because the max operations will pick out the same value regardless.

Channels

The last concept we need to understand is channels. Channels are different “views” of your input data. For example, in image recognition you typically have RGB (red, green, blue) channels. You can apply convolutions across channels, either with different or equal weights. In NLP you could imagine having various channels as well: you could have separate channels for different word embeddings (word2vec and GloVe for example), or you could have a channel for the same sentence represented in different languages, or phrased in different ways.
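As a sketch of convolving across channels, here is a single filter that spans two embedding channels and sums their contributions at each position; all shapes and values below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# A sentence of 6 words in two "views", e.g. word2vec and GloVe
# embeddings of dimension 4: shape (channels, words, embedding_dim).
channels = rng.normal(size=(2, 6, 4))

# One filter spanning 3 words, the full embedding width, and both channels.
filt = rng.normal(size=(2, 3, 4))

# The filter slides over the word axis only; at each position the
# contributions from both channels are summed into a single number.
feature_map = np.array([np.sum(channels[:, i:i + 3, :] * filt)
                        for i in range(6 - 3 + 1)])
print(feature_map.shape)  # (4,)
```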

Convolutional Neural Networks applied to NLP

Let’s now look at some of the applications of CNNs to Natural Language Processing. I’ll try to summarize some of the research results. Invariably I’ll miss many interesting applications (do let me know in the comments), but I hope to cover at least some of the more popular results.

The most natural fit for CNNs seems to be classification tasks, such as Sentiment Analysis, Spam Detection or Topic Categorization. Convolutions and pooling operations lose information about the local order of words, so sequence tagging as in PoS Tagging or Entity Extraction is a bit harder to fit into a pure CNN architecture (though not impossible, you can add positional features to the input).

[1] Evaluates a CNN architecture on various classification datasets, mostly comprised of Sentiment Analysis and Topic Categorization tasks. The CNN architecture achieves very good performance across datasets, and new state-of-the-art on a few. Surprisingly, the network used in this paper is quite simple, and that’s what makes it powerful. The input layer is a sentence comprised of concatenated word2vec word embeddings. That’s followed by a convolutional layer with multiple filters, then a max-pooling layer, and finally a softmax classifier. The paper also experiments with two different channels in the form of static and dynamic word embeddings, where one channel is adjusted during training and the other isn’t. A similar, but somewhat more complex, architecture was previously proposed in [2]. [6] adds an additional layer that performs “semantic clustering” to this network architecture.

 

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification
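A rough NumPy forward pass of this kind of architecture: convolution with three filter region sizes, a ReLU nonlinearity, 1-max pooling, concatenation, and a softmax layer on top. All dimensions and weights are made-up placeholders, and this sketch omits training entirely:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_words = 50, 9                         # embedding dim, sentence length
sentence = rng.normal(size=(n_words, d))   # stacked word embeddings

filter_sizes, n_filters = (3, 4, 5), 100   # 100 filters per region size
W = {h: rng.normal(size=(n_filters, h, d)) * 0.01 for h in filter_sizes}

features = []
for h, w in W.items():
    # Slide each size-h filter over the word axis...
    fmap = np.array([[np.sum(sentence[i:i + h] * w[f])
                      for i in range(n_words - h + 1)]
                     for f in range(n_filters)])
    # ...then apply ReLU and 1-max pool each feature map.
    features.append(np.maximum(fmap, 0).max(axis=1))

feature_vec = np.concatenate(features)     # 300-dimensional sentence vector

# Softmax classifier for binary classification.
W_out = rng.normal(size=(2, feature_vec.size)) * 0.01
logits = W_out @ feature_vec
probs = np.exp(logits) / np.exp(logits).sum()
print(feature_vec.shape, round(probs.sum(), 6))  # (300,) 1.0
```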

[4] Trains a CNN from scratch, without the need for pre-trained word vectors like word2vec or GloVe. It applies convolutions directly to one-hot vectors. The author also proposes a space-efficient bag-of-words-like representation for the input data, reducing the number of parameters the network needs to learn. In [5] the author extends the model with an additional unsupervised “region embedding” that is learned using a CNN predicting the context of text regions. The approach in these papers seems to work well for long-form texts (like movie reviews), but their performance on short texts (like tweets) isn’t clear. Intuitively, it makes sense that using pre-trained word embeddings for short texts would yield larger gains than using them for long texts.

Building a CNN architecture means that there are many hyperparameters to choose from, some of which I presented above: input representations (word2vec, GloVe, one-hot), number and sizes of convolution filters, pooling strategies (max, average), and activation functions (ReLU, tanh). [7] performs an empirical evaluation on the effect of varying hyperparameters in CNN architectures, investigating their impact on performance and variance over multiple runs. If you are looking to implement your own CNN for text classification, using the results of this paper as a starting point would be an excellent idea. A few results that stand out are that max-pooling always beat average pooling, that the ideal filter sizes are important but task-dependent, and that regularization doesn’t seem to make a big difference in the NLP tasks that were considered. A caveat of this research is that all the datasets were quite similar in terms of their document length, so the same guidelines may not apply to data that looks considerably different.

[8] explores CNNs for Relation Extraction and Relation Classification tasks. In addition to the word vectors, the authors use the relative positions of words to the entities of interest as an input to the convolutional layer. This model assumes that the positions of the entities are given, and that each example input contains one relation. [9] and [10] have explored similar models.

Another interesting use case of CNNs in NLP can be found in [11] and [12], coming out of Microsoft Research. These papers describe how to learn semantically meaningful representations of sentences that can be used for Information Retrieval. The example given in the papers includes recommending potentially interesting documents to users based on what they are currently reading. The sentence representations are trained based on search engine log data.

Most CNN architectures learn embeddings (low-dimensional representations) for words and sentences in one way or another as part of their training procedure. Not all papers though focus on this aspect of training or investigate how meaningful the learned embeddings are. [13] presents a CNN architecture to predict hashtags for Facebook posts, while at the same time generating meaningful embeddings for words and sentences. These learned embeddings are then successfully applied to another task – recommending potentially interesting documents to users, trained based on clickstream data.

Character-Level CNNs

So far, all of the models presented were based on words. But there has also been research in applying CNNs directly to characters. [14] learns character-level embeddings, joins them with pre-trained word embeddings, and uses a CNN for Part of Speech tagging. [15][16] explore the use of CNNs to learn directly from characters, without the need for any pre-trained embeddings. Notably, the authors use a relatively deep network with a total of 9 layers, and apply it to Sentiment Analysis and Text Categorization tasks. Results show that learning directly from character-level input works very well on large datasets (millions of examples), but underperforms simpler models on smaller datasets (hundreds of thousands of examples). [17] explores the application of character-level convolutions to Language Modeling, using the output of the character-level CNN as the input to an LSTM at each time step. The same model is applied to various languages.

What’s amazing is that essentially all of the papers above were published in the past 1-2 years. Obviously there has been excellent work with CNNs on NLP before, as in Natural Language Processing (almost) from Scratch, but the pace of new results and state-of-the-art systems being published is clearly accelerating.

Questions or Feedback? Let me know in the comments. Thanks for reading!

 

Paper References

 


Can LiFi stave off the wireless spectrum crunch?

by Rebecca Pool

With LiFi’s arrival on the horizon, does this mean lights out for WiFi? Exploring the LED’s future as a communication link.

17 December 2015, SPIE Newsroom. DOI: 10.1117/2.2201512.01
LiFi in the office (Velmenni)

In July 2011, UK researcher Harald Haas (Figure 1) showed how a humble LED bulb, equipped with a transmitter driver chipset, could stream high-definition video to a computer. Digital-to-analog converters within the unit rapidly modulated the LED — at speeds unnoticeable to the naked eye — to encode data. This was then transmitted to a photosensitive detector and receiver on a desktop, decoded and converted to a high-speed data stream.

The world was startled, visible light communications was thrown into the limelight, and Haas’ technology, dubbed LiFi, had just taken center stage.

“At the time we showed people what we could do, in principle, with the visible light spectrum,” Haas, Chair of Mobile Communications at the University of Edinburgh, tells SPIE Newsroom. “I’m now hoping we will have LiFi in all of our everyday light sources from streetlights to headlights, and in our homes, offices, museums, and more.”

“Do this and we will have unlocked the Internet of Things,” he adds.

Harald Haas

Figure 1. Harald Haas with Li-Fi equipment.

Since 2011, progress has been as rapid as LiFi’s data rates. In the lab, as part of the Ultra-Parallel Visible Light Communications Project (UP-VLC), Haas and colleagues have reached blisteringly fast 10Gbit/s data transmission speeds.

Here they developed micro-LED arrays to transmit 3.5Gbit/s via each of the red, green and blue micro-LEDs in parallel. They also applied novel spatial modulation orthogonal frequency division multiplexing so the micro-LED elements within the array could beam thousands of streams of light in parallel, multiplying the volume of data transmitted at any one time.

“LEDs have been the bottleneck in data rate, so as part of this project we wanted to develop a technology that would unlock the vast data rates available in the visible light spectrum,” says Haas. “We’ve achieved 10Gbit/s with these LEDs, but can reach 100Gbit/s using red, green and blue laser diodes.”

At the same time, the researchers have also fabricated sensitive photodetector and receiver arrays using high-gain avalanche photodetectors to detect the encoded light streams from the LEDs.

According to Haas, a detector-receiver array may have direct line of sight with a LED transmitter, but will also receive “residual” light that has reflected off surrounding walls, objects, the ground and ceiling.

“If you block the strongest incoming ray, residual light still reaches the receiver, and a good photodetector on the receiver side will still make sense out of the weakest of signals,” he says.

“We can achieve the sensitivity we need with off-the-shelf avalanche photodiodes and PIN diodes, but colleagues at Edinburgh have pioneered single-photon avalanche diodes,” he adds. “These detect single photons, achieving sensitivities that are orders of magnitude higher than those from off-the-shelf avalanche photodiodes.”

But while point-to-point demonstrations have proven LiFi to be a serious alternative to RF-based communications, the technology’s real strength lies in networking. Detector-receiver arrays also house infrared diodes to provide an uplink connection back to the LED-based LiFi access point. And by combining numerous micro-LED transmitters and receiver assemblies (Figure 2), the researchers can take LiFi beyond a straightforward point-to-point communications system, and create a multiple-input multiple-output (MIMO) transmission network. Transmitters and receivers can even be combined in a single unit to create a single-input, single-output link.

Optical data downlink from the light fixtures above. (Velmenni)

Figure 2. Optical data downlink from the light fixtures above. (Velmenni)

As Haas emphasizes: “The LiFi communications system can serve many users. It allows multi-user access, has an uplink and downlink, and allows handover.”

“The coverage of each LED lamp is up to 10 m2 and when you leave that space, you are illuminated by another lamp, handover takes place, and you don’t lose wireless connectivity,” he adds.

Haas is also confident ambient light will not interfere with transmissions as the receivers are only sensitive to the modulating LED light.

“Even the 100 Hz flicker from an incandescent light bulb is not an issue, as our modulation frequency starts at 1 MHz,” he says. “As long as fluctuations are outside our modulation, these aren’t an issue.”

What’s more, since light can’t penetrate walls, signals are less likely to be intercepted, which Haas reckons makes LiFi more secure than WiFi.

Commercializing LiFi

In the meantime, Haas and colleagues have also set up a company, pureLiFi, to commercialize the technology. Their first transmitter, ‘Li-1st’, provides point-to-point Internet access, delivering data at 5MB/s in both the downlink and uplink, for up to 3 meters.

Meanwhile, ‘Li-Flame’ delivers a multi-access wireless network to many users, providing a glimpse of what has been developed in the lab. It comprises two key components: a ceiling transmitting unit that connects to an LED light fixture and serves as a Li-Fi access point; meanwhile, a desktop receiving unit connects to a client’s device via a USB port, so users can roam within a room or building while connected.

The 10MB/s uplink and downlink data rates match average WiFi speeds but are a far cry from the breathtaking gigabit laboratory speeds. However, Haas points out: “We’ve developed these products to work with LEDs that are already on the market.”

Yet many commercial LEDs use a phosphor coating to convert blue light to white light, and this coating limits how fast the devices can be modulated, slowing down data rates. Haas isn’t fazed, saying: “Even with slow, phosphor-coated LEDs, we can exploit parallelism in the spatial dimension.”

Still, he is eager to highlight how swapping these LEDs for off-the-shelf laser diodes — as experimented with in the UP-VLC — boosts data rates beyond 100Gb/s.

“Laser LEDs have bandwidths up to 1 GHz, as opposed to 20 MHz for the phosphor-coated LEDs, so we can also encode data in the frequency domain to achieve fast data rates,” he says.

“But working with existing infrastructure is key, and we will closely monitor commercial LED devices and adopt new technologies so our data rates match state-of-the-art technology.”

Indeed, pureLiFi has recently joined forces with France-based LED lighting business, Lucibel, to develop a LiFi luminaire. The luminaire should be ready for market by late 2016, with the first installation taking place at the headquarters of a Paris-based property developer, Sogeprom.

Haas is not alone in his hot pursuit to commercialize a light-based alternative to RF wireless communications. An India-based startup, Velmenni, recently set up a two-way LiFi network across a series of streetlights spanning 30m in Estonia.

According to chief executive Deepak Solanki, its LiFi network was based on a transceiver that included an LED array and LED driver, as well as a standard photodiode receiver. Transceivers were embedded into the streetlights to transmit data, from streetlight to streetlight, at 10Mbit/s.

“We also integrated a wireless mesh networking protocol through the lighting system to ensure the sending of data from one street lamp to another,” he says. “Using this network we can send high-definition videos over relatively large distances.”

At the same time, Velmenni has developed a smart LED bulb — Jugnu — to downstream data to smartphones at 10Mbit/s (Figure 3). Other key players, including LVX System (US), Israel-based RiT Technologies, IBSENTelecom of Norway and Japan-based Nakagawa Labs, are also demonstrating wireless optical networks. And not a moment too soon.

smart LED bulb - Jugnu (Velmenni)

Figure 3. Velmenni’s Jugnu bulb transmits data to smartphones at 10 Mbit/s.

The US Federal Communications Commission has long warned of a looming spectrum crisis as data-hungry mobile devices eat into ever more radio frequency bandwidth. Meanwhile, Haas himself anticipates that by 2025, the radio frequency spectrum will not provide sufficient resources for wireless data traffic.

Mark Leeson from the Communications Research Group at Warwick University (UK) agrees. “Our increase in data produced is simply incredible,” he says. “In terms of coding and implementing more efficiencies, we’ve still got a long way to go, but we are going to run out of bandwidth.”

While Leeson questions how quickly industry will adopt a new technology, he believes wireless optical networks will have a place beside saturated RF-based networks.

“[LiFi] isn’t going to save us from a spectrum crunch; we will also run out of optical space,” he says. “But it will help to put this off.”

Indeed, Haas and colleagues have already carried out “extensive research” on a hybrid WiFi/LiFi system, he says.

“Let me be clear, we will never see a LiFi or WiFi situation; it’s going to be WiFi and LiFi. And as [industry] installs new lighting systems, we will see more and more of the Internet enabled by a light source. Incandescent light bulbs just serve one purpose, light, but in 20 years I see a world where ceiling lights have more applications,” he adds.

“My ambition is to have iLight in the same way that we have the iPhone.”

Rebecca Pool is a freelance writer based in the UK.

From: http://spie.org/newsroom/technical-articles/1215-lifi?highlight=x2414&ArticleID=x116615
