## Data Formats

There are only four types of data:

• Numbers
• Characters (text)
• Sound
• Pictures (all types of pictures: charts, drawings, photos, animations, video, etc.)

To use a computer to process all four of these data types, we must have a way of representing all four of these data types in binary. Chapter 2 discussed how to represent numbers in binary. So, if we only had a way of converting non-numeric data types (characters, sound, and pictures) into numbers, we would have a way of representing any data type on a computer. This is what we are going to do.

## 3.2 ALPHANUMERIC CHARACTER DATA

Assume that you want to design a coding system for representing characters with numbers. All you have to do is decide on:

• Which characters you want to represent
• Which numbers stand for which characters

How do you decide which characters to represent? If you're working in an English-speaking country, that's easy. Just look at a keyboard: A..Z, a..z, 0..9, punctuation, special characters, and the blank (space).

A possible coding scheme is: A=1, B=2, C=3, etc. This is only one of many possible coding schemes. The code you use doesn't matter. But it would make everyone's life simpler if everyone agreed on the same code. Such a standard code is the ASCII code. ASCII is a 7-bit code. A byte is 8 bits -- just big enough to hold the ASCII code for a single character (with a single bit left over), and that's what is done -- a byte is used to hold a single character (the leftmost bit is set to 0). So byte and character are pretty much interchangeable terms.
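To make the idea concrete, here is a minimal Python sketch that uses ASCII as the coding scheme. It relies on Python's built-in `ord` and `chr` functions; the helper names `to_codes` and `from_codes` are made up for this illustration.

```python
def to_codes(text):
    """Return the ASCII code number for each character in text."""
    return [ord(ch) for ch in text]


def from_codes(codes):
    """Turn a list of ASCII code numbers back into text."""
    return "".join(chr(n) for n in codes)


codes = to_codes("Cab 9")
print(codes)  # [67, 97, 98, 32, 57]

# Each code fits in 7 bits, so when it is stored in an 8-bit byte
# the leftmost bit is 0.
print([format(n, "08b") for n in codes])

print(from_codes(codes))  # Cab 9
```

Notice that the digit "9" is stored as the code 57, not as the number 9. A character that looks like a number is still just a character.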

There are two other codes you should be aware of: EBCDIC (an 8-bit code used only on IBM mainframes) and Unicode. The ASCII code is a subset of Unicode: the first 128 Unicode characters, a block called "Basic Latin", are exactly the ASCII characters. Unicode is a universal coding system. It was originally designed as a 16-bit code (this means that 65,536 different characters can be represented), and it has since been extended beyond 16 bits so that it can provide a code for every character in every major language in the world. With the World-Wide Web suddenly making all computer applications world-wide applications, you will be seeing a great deal of Unicode in the future. Java, a relatively new (since 1995) computer programming language that is used for web applications, uses Unicode for representing all character data.
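As a quick illustration of how ASCII fits inside Unicode, the following Python sketch looks up the Unicode numbers of a few characters. Python strings are Unicode, so `ord` returns a Unicode number. The sample characters and the helper name `is_ascii` are arbitrary choices made for this example.

```python
def is_ascii(ch):
    """True if the character's Unicode number falls in the ASCII range 0-127."""
    return ord(ch) < 128


# A, z, e with an acute accent, and the euro sign (written as escapes).
samples = ["A", "z", "\u00e9", "\u20ac"]

for ch in samples:
    # ascii() shows non-ASCII characters as escapes so they print anywhere.
    print(ascii(ch), ord(ch), "ASCII" if is_ascii(ch) else "beyond ASCII")

# A character in the ASCII range has the same number in both codes.
ascii_byte = "A".encode("ascii")[0]
print(ascii_byte == ord("A"))  # True
```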

## NOTABLE FEATURES OF THE ASCII CODE

Note that the codes are assigned in alphabetical order, so "A" < "B" < "C", etc., and "a" < "b" < "c", etc. This makes it easy to sort characters, provided that the characters are all upper case or all lower case.

Note that all upper-case letters are considered less than all lower-case letters. This causes problems when sorting data where the letters are mixed case. Sorting based on the ASCII code would end up with all of the words beginning with upper-case letters coming before all words that begin with lower-case letters.
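The mixed-case sorting problem is easy to reproduce. The sketch below uses Python's built-in `sorted`, which compares strings by character code. The word list is made up, and sorting on a lower-cased copy of each word is one common workaround, not the only one.

```python
words = ["banana", "Cherry", "apple", "Date"]

# Plain sort: compares character codes, so "C" (67) and "D" (68)
# come before "a" (97) and "b" (98).
code_order = sorted(words)
print(code_order)  # ['Cherry', 'Date', 'apple', 'banana']

# Workaround: compare lower-cased copies, but keep the original spelling.
dictionary_order = sorted(words, key=str.lower)
print(dictionary_order)  # ['apple', 'banana', 'Cherry', 'Date']
```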

Note that the first 32 characters (corresponding to the numbers 0-31) and the last character (DEL -- corresponding to the number 127) are not printable characters, but are control characters. Control characters, when sent to an output device (a printer or monitor) do not cause a character to be printed, but cause an action to take place, such as ringing a bell (character 7 -- "bell"), or moving the cursor to the beginning of a line (character 13 -- "carriage return"), or moving to the beginning of a new page (character 12 -- "form feed").
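The split between control characters and printable characters can be sketched in a few lines of Python. The function name `is_control` and the small `CONTROL_NAMES` table are assumptions made for this example; the table lists only the three control characters mentioned above.

```python
CONTROL_NAMES = {7: "bell", 12: "form feed", 13: "carriage return"}


def is_control(code):
    """True for ASCII codes 0-31 and for 127 (DEL)."""
    return 0 <= code <= 31 or code == 127


control_codes = [n for n in range(128) if is_control(n)]
printable_codes = [n for n in range(128) if not is_control(n)]

print(len(control_codes))    # 33
print(len(printable_codes))  # 95 (this count includes the blank, code 32)

# Python's "\r" escape is the carriage return character, code 13.
print(CONTROL_NAMES[ord("\r")])  # carriage return
```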

## SOURCES OF ALPHANUMERIC INPUT

Keyboard. Keyboards do not generate ASCII codes directly. Instead, they generate what are called "scan codes". Each key on the keyboard is assigned a number (usually it's as simple as starting in the upper left of the keyboard with key #1 and assigning numbers in sequence from left to right, top to bottom). When the key is pressed, the number is generated, and when the key is released, another number is generated (the original number plus 128). Note that this method treats all keys the same -- even the shift, alt, and control keys. Note also that by generating a scan code both when a key is pressed and when it is released, it can be determined whether one key is being held down (like shift, alt, or control) while another is being pressed.
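The press/release scheme described above can be sketched as follows. This is only an illustration of the idea, not the behavior of any particular keyboard: the key numbers `SHIFT_KEY` and `A_KEY` are hypothetical, and the function names are made up for this example.

```python
RELEASE_OFFSET = 128  # a release code is the press code plus 128

# Hypothetical key numbers, chosen only for this example.
SHIFT_KEY = 42
A_KEY = 30


def decode(scan_code):
    """Split a scan code into (key number, 'press' or 'release')."""
    if scan_code >= RELEASE_OFFSET:
        return scan_code - RELEASE_OFFSET, "release"
    return scan_code, "press"


def keys_held(scan_codes):
    """Return the set of keys still held down after a stream of scan codes."""
    down = set()
    for code in scan_codes:
        key, action = decode(code)
        if action == "press":
            down.add(key)
        else:
            down.discard(key)
    return down


# Shift pressed, A pressed, A released: shift is still held down,
# so the A should be treated as a capital letter.
stream = [SHIFT_KEY, A_KEY, A_KEY + RELEASE_OFFSET]
print(keys_held(stream))  # {42}
```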

Optical character recognition (OCR). This involves the use of a scanner and OCR software to convert an image of black and white pixels representing text into the equivalent ASCII characters. Today's software is capable of reading many different fonts and is more than 99% accurate on good-quality input. This software is especially useful if you have a fax modem. It allows you to receive a fax as a scanned image (a bit map taking up hundreds of thousands of bytes) and convert it into a text file of only a few thousand bytes. OCR does not work as well on handwriting.

Bar code readers. This involves converting bar codes into equivalent ASCII characters (digits). Bar code readers are examples of source data automation. Source data automation is the automation of inputting data into a computer to eliminate human keyboarding errors.

Voice input. Since the late 1990s, we have finally had both the software and powerful-enough hardware to do continuous speech recognition on a PC! Products that allow speaker-independent continuous speech recognition have been on the market since 1997. This means that you may be using a keyboard less and a microphone more in the future.

## 3.3 IMAGE DATA

In recent years, PCs have finally gotten enough computing horsepower to begin doing serious work with graphical data (charts, drawings, photographs, animations, and video). PCs could always do image processing, but not quickly enough to be practical. There are generally two ways to store image data:

Bit map images. Each pixel ("picture element" -- a dot making up a picture) is converted into a number. The quality of a picture is determined by two things: resolution and the number of colors.

• If you want higher resolution (more pixels), you need more numbers (and the memory to store them). An 800x600 picture requires 56% more memory than a 640x480 picture (480,000 pixels vs. 307,200 pixels).

• If you want more colors, you will need bigger numbers (which means more bits and more memory to store them). A true color (photographic quality) picture stores the color information for each pixel in three bytes (sometimes called 24-bit color). A picture with fewer colors may only require a single byte (or even less) to store each pixel. The number of bits used to store each pixel is called the color depth of the image.
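The two factors, resolution and color depth, multiply together. Here is a rough Python sketch of the arithmetic. It assumes an uncompressed bit map with no file header or padding, and the function name `bitmap_bytes` is made up for this example.

```python
def bitmap_bytes(width, height, bits_per_pixel):
    """Bytes needed for the raw pixels of an uncompressed bit map."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # round up to a whole number of bytes


small = bitmap_bytes(640, 480, 24)   # true color
large = bitmap_bytes(800, 600, 24)   # true color, higher resolution
print(small)  # 921600
print(large)  # 1440000
print(large / small)  # 1.5625, i.e. about 56% more memory

# The same 800x600 picture at one byte per pixel needs a third as much.
print(bitmap_bytes(800, 600, 8))  # 480000
```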

Object images. Rather than a collection of pixels, an object image is a collection of graphical objects, each with its own attributes that may be manipulated by the user. For example, the user may drag a circle from one part of the image to another, enlarge or shrink it, change its color, etc.

Bit maps vs. objects. A program that allows manipulation of graphical objects is like a bulletin board, where each object can be moved around, moved to the front or back, rotated, etc. A bit-map image is like a painted wall. Once you paint a red circle on the wall, you cannot easily move it or change its size or color. And once you paint over a red circle on the wall, you cannot move the red circle to the front. This is why programs that manipulate bit-maps are called paint programs (like the paint accessory that comes with Windows).

## EXAMPLES OF BIT MAP IMAGE FORMATS

Digitizing: Pictures need to be "digitized" before they can be displayed on a computer. Digitizing a picture involves breaking the picture down into small dots called "pixels", determining the color of each pixel, assigning a corresponding number to the pixel, and then saving all of the numbers (hundreds of thousands or millions) in a file. There are many file formats for digitizing pictures, but only three are widely used on the Web: GIF, JPEG, and PNG.

GIF: Graphics Interchange Format (GIF) is a standard format for representing digitized pictures. GIFs can contain up to 256 colors (how many bits per pixel is that?). GIF uses lossless compression, which means that no detail is lost when the picture is compressed. The GIF format supports three special features: (1) transparent backgrounds, (2) interlaced GIFs, and (3) animated GIFs.

JPEG: JPEG (or JPG), named for the Joint Photographic Experts Group that created it, is a graphics format that can display up to 16 million colors (quick, how many bits per pixel is that?). JPEG uses lossy compression. This means that some detail is lost when the picture is compressed, but because of the greater number of colors, JPEG pictures generally look better than GIFs.
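If you want to check your answers to the two "bits per pixel" questions, the relationship is that n bits can distinguish 2^n colors. A small Python sketch follows; the function name `bits_needed` is made up for this example.

```python
def bits_needed(num_colors):
    """Smallest number of bits that can give each color its own number."""
    # Colors are numbered 0 .. num_colors - 1, so count the bits
    # in the largest number that must be stored.
    return (num_colors - 1).bit_length()


print(bits_needed(256))      # 8  (one byte per pixel)
print(bits_needed(2 ** 24))  # 24 (three bytes per pixel)
print(2 ** 24)               # 16777216, the "16 million" colors
```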

PNG: Portable Network Graphics (PNG) is a flexible graphics format that can store pictures at multiple color depths.

## EXAMPLE OF OBJECT IMAGE FORMATS

PostScript. PostScript is an example of a page description language. The statements in this language look similar to statements in a programming language except that each statement describes the attributes of an object on a page. PostScript files are plain ASCII files. Today's laser printers have PostScript interpreters built in. PostScript includes support for scalable fonts.

A scalable font is a font that can be drawn at any size without losing quality. If you've ever seen fonts that look like they're made up of big ugly pixels when they're expanded, they are not scalable fonts. Posters created with PrintShop before Windows came along often looked like this. Windows uses scalable fonts called TrueType fonts (this is what the "T T" stands for in front of a font name). Scalable fonts are defined as sequences of lines and arcs rather than as a collection of pixels. To double the size of a character, just double the lengths of the lines and the radii of the arcs.
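The idea of scaling lines and arcs can be sketched in Python. The outline representation here is hypothetical and far simpler than real PostScript or TrueType data: each element is either `("line", x1, y1, x2, y2)` or `("arc", center_x, center_y, radius)`, and the function name `scale` is made up for this example.

```python
# A hypothetical outline: one vertical stroke and one arc.
letter = [
    ("line", 0, 0, 0, 10),
    ("arc", 0, 7, 3),
]


def scale(outline, factor):
    """Multiply every coordinate, length, and radius by factor."""
    return [(kind, *[n * factor for n in numbers]) for kind, *numbers in outline]


doubled = scale(letter, 2)
print(doubled)  # [('line', 0, 0, 0, 20), ('arc', 0, 14, 6)]
```

Because the outline is stored as numbers describing shapes rather than as pixels, the doubled character is drawn fresh at the new size and stays smooth.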

## 3.4 AUDIO DATA

How can sound be represented as numbers? Consider a child's xylophone where each note is numbered. This is a pretty simple example, but it's still a way of converting sounds into numbers.

Most sound is more complex than just the simple notes on a xylophone. What a music player sends to the speaker is a voltage that continuously varies as the sound changes. If the voltage changes rapidly, the speaker vibrates rapidly and you get a high-pitched sound. If the voltage changes slowly, you get a low-pitched sound. If the voltage has very high values, you get a loud sound, and if it has very low values, you get a soft sound.

What is needed is a way of converting this continuous (analog) wave into a series of (digital) numbers. How do you convert a continuously changing voltage into numbers? What you do is sample the voltage at fixed intervals (many thousands of times each second). Each sample is a number. The sequence of numbers will approximate the voltage. A device that performs this conversion is called an analog-to-digital converter, or ADC for short.
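Here is a rough software imitation of what an ADC does, written in Python. A real converter measures a voltage in hardware; this sketch assumes instead a made-up 440 Hz tone whose voltage ranges from -1.0 to 1.0, and turns each sample into a 16-bit whole number. All function names are assumptions made for this example.

```python
import math


def tone(t):
    """The 'voltage' at time t (in seconds) for a 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)


def sample(signal, samples_per_second, seconds):
    """Measure the signal at fixed intervals."""
    count = round(samples_per_second * seconds)
    return [signal(i / samples_per_second) for i in range(count)]


def to_number(voltage, bits=16):
    """Convert a voltage between -1.0 and 1.0 to a signed whole number."""
    largest = 2 ** (bits - 1) - 1  # 32767 for 16 bits
    return round(voltage * largest)


# One thousandth of a second of voice-quality sound (8,000 samples per second).
numbers = [to_number(v) for v in sample(tone, 8000, 0.001)]
print(len(numbers))  # 8
print(numbers)
```

Sampling more often gives more numbers per second and a closer approximation of the original wave.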

When it's time to play the sound back, the process is reversed. The digital numbers that represent the sound are converted from digital form back into analog. A device that performs this conversion is a digital-to-analog converter, or DAC for short.

How often do you need to sample the voltage to get an accurate depiction of the original wave form? The answer depends on the quality of sound that you want. Voice-quality sound (as required for phone calls) is sampled 8,000 times per second. However, to get CD-quality sound, the sound wave is sampled 44,100 times each second. If you want stereo sound (2 tracks), this means that you need 88,200 numbers just to represent one second of music!

How many numbers do you need to get one hour's worth of stereo music? At 44,100 samples per second, for 2 tracks (stereo), with 60 minutes at 60 seconds per minute, we need: 44,100 * 60 * 60 * 2 = 317,520,000 numbers (they're 2 bytes each, so the total number of bytes is: 635,040,000 bytes!). This is approximately the capacity of a single CD. So if you don't get close to an hour of music on a CD, it's not because the publishers couldn't fit the data on the CD, it's just because they chose not to put any more music on it.
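The same calculation can be written as a short Python sketch, which makes it easy to try other lengths of music. The function names are made up for this example, and the 2 bytes per number is the figure used in the calculation above.

```python
def audio_numbers(samples_per_second, tracks, seconds):
    """How many sample numbers are needed."""
    return samples_per_second * tracks * seconds


def audio_bytes(samples_per_second, tracks, seconds, bytes_per_number=2):
    """How many bytes are needed to store those numbers."""
    return audio_numbers(samples_per_second, tracks, seconds) * bytes_per_number


one_hour = 60 * 60  # seconds

print(audio_numbers(44100, 2, 1))        # 88200 numbers per second of stereo
print(audio_numbers(44100, 2, one_hour)) # 317520000
print(audio_bytes(44100, 2, one_hour))   # 635040000
```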