Intel's Ronler Acres Plant

Silicon Forest
If the type is too small, Ctrl+ is your friend

Saturday, April 20, 2013

Regular Expressions

Scott sent me a link to a list of movies, and I go look at it because I like movies, but I don't like the list. They've formatted it with all kinds of extraneous nonsense, so I decide to copy the data into a spreadsheet because, well, because I wanna. And you can't stop me.

Once I do the original copy and paste I have one column of data. The cells in the this column alternate between the title and description. How can I get them into two columns without cutting and pasting them individually? Hmmm. I screw around for a bit, and then I hit on scheme. Insert an extra row at the top. Copy Column A, starting at the second row, and then paste into Column B, starting at row 1. Now each title in Column A is followed by it's description in Column B. The descriptions in Column A are followed by the wrong title in Column B. We need to eliminate them. How do we do that? I notice that the titles are preceded by numbers, so we simply sort the sheet by the data in Column A. This puts all the titles at the top (since they all start with numbers) and all the mismatched data at the bottom, where we can easily delete it.

So we've gotten the basic transfer and layout done, but now we need to do a little trimming. We could do it by hand, but that's tedious, and why spend 15 minutes cutting and pasting when we can spend hours figuring out how to do it automatically? Funny how the mind works.

I looked at the available functions, and it looks like I am going to need to use some kind of Regular Expression. I don't like Regular Expressions [tm]. They are the worst kind of cryptic mumbo-jumbo. They are even worse than AEDIT macros, and those can be pretty bad. I know, I've been there. But that's what's in the toolbox here, so let's see if we can make it work.

In short order I was able to come up with an expression to extract the order number from the title, but then I couldn't use the order number as number. When I tried to sort the spreadsheet by these numbers, it treated them like letters, which meant that while 1 was in row one, 10 was in row two, 11 was in row three and 2 was way down in row twelve. I tried different formatting options but nothing seemed to work. I finally posted a question on the Forum and in short order I got a couple of answers, both of which struck me as bizarre overkill for what I was trying to do, but I did pick up on the VALUE function, which converts a string of digits to a number. It does not change the appearance of the data one whit, but it changes it's behavior. Why the formatting controls could not do this is one of the reasons I hate computers.

So now I have a (somewhat tenuous) grip on Regular Expressions and I muck around a bit more with extracting numeric data from the text. I spend a little more time extracting some text by hand. I could have done some of it with more Regular Expressions, but I have had enough, and this portion was not onerous, and presto! We have a nice spreadsheet with all 50 movies, with titles, descriptions and other fun data. The first sheet has the extracted data. The second sheet has all the horrible Regular Expressions. You know, just in case you are curious.

As for the movies on the list, well, the only one that struck me as a great film was The Big Lebowski. There were a few films I remembered specifically not liking, like Reservoir Dogs and The Shining. Not that they were bad films, I just didn't like them. Most of the others are so old I couldn't tell you whether I have seen them or not. The titles sound familiar, but I really can't remember the stories.

Someone who was really ambitious could probably write a script to look up the title on IMDB, find the year the movie was released and add it to the spreadsheet. I'm not that ambitious. At least not today.

No comments: