
Silicon Forest
If the type is too small, Ctrl+ is your friend

Saturday, May 14, 2011

Pattern Matching

Pattern matching is one of those things people do really well, and something that seems to be kind of difficult for computers. Recognizing anything, from a toothbrush to a song, involves pattern matching. For a while I've been kicking around the idea of writing a computer program to look for patterns in digital data. All digital data is stored in some kind of structure, and it is only useful when it is decoded and displayed or used for something. Much digital data is stored as a sequence of bytes. Text files are one example: each byte represents one character. Most audio and video data is compressed, so the individual bytes are just part of a stream of bits. Lose your place in the stream and your data may as well go in the bit bucket.

We got an example of this last night while we were watching The Closer. Every so often there would be a little skritch-skritch sound. I didn't notice it until my wife pointed it out, and then I had to replay it five or six times before I was able to pick up on it (we were watching a recording off of the Verizon/Frontier/Motorola DVR). After that I started noticing it. Well, I noticed it at least once more. I suspect something got corrupted in the audio bit stream, the DVR lost its place and had to wait for a synchronization pattern to appear, and in those few milliseconds between the two events the audio circuit got fed garbage, which we heard as skritch-skritch.

Anyway, a couple of days ago I got an error message in my browser, along with a block of data to show the "highly trained monkeys" who were supposedly looking into the problem, and I got to thinking about running a pattern recognition program on that block of data.

Pattern recognition can become very complex very quickly, but I thought I would just start with some elementary analysis: how many of each bit pattern are there in a given block of data? The first problem is deciding how many bits to look at. A bit pattern can be anywhere from one bit long to a zillion bits; well, it can be as long as your data sample. Any file you can store on your hard drive is going to have a fixed length, and it probably doesn't make sense to look for bit patterns longer than a megabyte or so. Still, that is a very long pattern, and counting patterns that long would require a large amount of working memory. So let's start at the other end and see where it leads.

We start with one bit, go through the file and count the number of bits set to one and the number of bits set to zero. If all of the bits are set to one, or they are all set to zero, there is no pattern and there is no further analysis to be done.
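
Here is a minimal sketch of that first step, in Python (my choice of language, picked just for the sketch):

    # Count the one bits and zero bits in a file.
    import sys

    def count_bits(path):
        ones = 0
        total = 0
        with open(path, "rb") as f:
            while chunk := f.read(65536):          # read 64 KB at a time
                for byte in chunk:
                    ones += bin(byte).count("1")   # one bits in this byte
                total += len(chunk) * 8            # 8 bits per byte
        return ones, total - ones

    if __name__ == "__main__":
        ones, zeros = count_bits(sys.argv[1])
        print("ones:", ones, "zeros:", zeros)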

Now we can move up to two bits, which have four possible values. You also have two starting positions. You can start at the beginning of the file, or you can start one bit in and count all the combinations that start on odd boundaries.

Next we have three bits, which have eight possible values, and three possible starting offsets. Small amounts of data like these can be easily stored in working memory.
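
A sketch of this counting-by-offset idea, for any small pattern width n: one table of counts per starting offset. Treating the whole file as one big integer is my shortcut here, and it is only reasonable for small files, which is all this is meant for.

    from collections import Counter

    def count_patterns(data: bytes, n: int):
        """Count n-bit patterns at each of the n starting offsets."""
        bits = int.from_bytes(data, "big")        # whole file as one big integer
        nbits = len(data) * 8
        mask = (1 << n) - 1
        counts = [Counter() for _ in range(n)]    # one count table per offset
        for start in range(nbits - n + 1):
            value = (bits >> (nbits - n - start)) & mask
            counts[start % n][value] += 1
        return counts

    # 'somefile.bin' is a stand-in name, not anything from the post.
    tables = count_patterns(open("somefile.bin", "rb").read(), 3)
    for offset, table in enumerate(tables):
        print(offset, table.most_common(3))       # three most common patterns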

This works well for a while, but by the time we reach 16 bits we have 65,536 possible values and 16 starting offsets, which works out to about a million counters, a megabyte or so of memory to use for counting. When we reach 32 bits we would need over 100 billion bytes (2^32 possible values times 32 offsets, at a byte per counter). We would need this much memory even if the file we are analyzing is only a few thousand bytes long. Perhaps there is a better way.

At some point, which we may want to choose based on the size of the file we are analyzing, we could start using a content addressable array. That is, instead of just using the value of the current bit pattern as an index into an array, we would assign an index to each bit pattern as we encounter it. As new bit patterns are discovered, new indices would be created. So we would need a function to convert a bit pattern to an index.
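
In Python a dictionary gives us this kind of content addressing almost for free; here is one way it might look:

    class PatternIndex:
        """Assign a small integer index to each distinct bit pattern."""
        def __init__(self):
            self.indices = {}    # pattern value -> index
            self.counts = []     # counts[i] = occurrences of pattern i

        def note(self, pattern):
            i = self.indices.get(pattern)
            if i is None:        # a pattern we haven't seen before
                i = len(self.counts)
                self.indices[pattern] = i
                self.counts.append(0)
            self.counts[i] += 1

    idx = PatternIndex()
    for value in (0b1011, 0b0001, 0b1011):
        idx.note(value)
    print(idx.indices, idx.counts)   # {11: 0, 1: 1} [2, 1]

A dictionary lookup runs in roughly constant time, which also dodges the list-scanning problem discussed next.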

One way is to just create a list. Whenever the function is called, the list is scanned to see if that pattern already exists, and if so, then the index to that pattern is returned. If we are analyzing a lot of data, and there are a great many bit patterns, scanning the list could prove to be time consuming. In that case we would probably need to maintain a sorted list. This would make checking the list very quick, but adding each new pattern would mean considerable work (time) to rearrange the list to make room for the new entry. Since our first subject is rather small, we can probably skip this sorting business for the time being.
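
For comparison, the scan-the-list version is only a few lines; the sorted variant would instead use Python's standard bisect module for the binary search and insertion:

    def pattern_index(patterns, pattern):
        """Return the index of pattern in the list, appending it if new.
        A linear scan: simple, but slow once the list gets long."""
        for i, p in enumerate(patterns):
            if p == pattern:
                return i
        patterns.append(pattern)
        return len(patterns) - 1

    patterns = []
    print(pattern_index(patterns, 0b1011))   # 0 (new)
    print(pattern_index(patterns, 0b0001))   # 1 (new)
    print(pattern_index(patterns, 0b1011))   # 0 (seen before)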

So now all I have to do is write the program and see what pops out.

3 comments:

Ole Phat Stu said...

Just take the Fourier Transform of your bitfile and look for significant peaks.
That'll tell you if there is a pattern and what its length is. You could use the autocorrelation instead. I recommend the FFT algorithm.

Chuck Pergiel said...

Have you ever tried running an FFT on an encrypted message? Did it provide any useful information?

Anonymous said...

Depends on the 'quality' of the encryption. Kasiski attacks are good for repeating-key substitutions. During WW2, kappa attacks were used to extract key lengths. So some information can be found.