Unusual Tools in Data Science

2014-05-28

Isn’t all science “Data Science?”

I’m an applied research mathematician in my day job, which on a day-to-day basis puts me somewhere between mathematician, software developer, machine learning practitioner, and statistician. In other words, I’m probably what’s called a Data Scientist nowadays. See the image below for how some people feel about that term;[1] I’m a bit ambivalent about it myself.

The term seems a bit trendy and lacks a solid definition. Some people even think the title should be killed off entirely. That being said,

  1. I’m mostly OK with applying a trendy label to myself
  2. I don’t have a better term for the “person who uses software to do interesting things with data” right now

so I’ll stick with it for the time being.

[Image: “Data Scientist”]

A series idea

In my time working in “data science,” I’ve noticed that there are a number of interesting tools out there that tend to go unnoticed. Perhaps it’s because people who fit under the data science umbrella have wide and varied backgrounds, but there seems to be a lack of agreement on best practices, or a failure to follow them even where folks agree. Moreover, there doesn’t seem to be much awareness of the wide array of new, interesting, or just plain different (computational) tools that are out there.[2]

There are a number of tools out there that can solve certain problems much more cleanly, elegantly, or efficiently than the ones people usually reach for. Though of course the usual tools have a lot of merit—this book is particularly nice—there’s definitely room for more tools in the data science tool belt.

Post the first

The first tool I’ll mention is Haskell for parsing various types of logs and files. Several times I’ve ended up having to parse a log or other type of file that, while it follows a specific format, isn’t one of the usual suspects like SQL or CSV. Rather than reaching for regexes,[3] I contend that this is exactly the type of problem that Haskell, with Parsec or (even better) Attoparsec, can easily handle.
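To give a flavor of what that looks like, here is a minimal sketch. It assumes a made-up log format with lines such as `2014-05-28 12:34:56 [WARN] disk usage at 91%`. The format, the `LogEntry` and `Level` types, and the function names are all hypothetical, chosen only for illustration. The combinators themselves come from the attoparsec package's `Data.Attoparsec.ByteString.Char8` module.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8
  ( Parser, char, decimal, endOfInput, endOfLine
  , parseOnly, sepBy, skipSpace, string, takeTill )
import Data.ByteString.Char8 (ByteString)

-- Hypothetical severity levels for the made-up format.
data Level = Info | Warn | Error
  deriving (Show, Eq)

-- One parsed line of the hypothetical log.
data LogEntry = LogEntry
  { entryDate    :: (Int, Int, Int)  -- (year, month, day)
  , entryTime    :: (Int, Int, Int)  -- (hour, minute, second)
  , entryLevel   :: Level
  , entryMessage :: ByteString
  } deriving (Show, Eq)

-- Three decimal numbers separated by a given character,
-- e.g. "2014-05-28" with '-' or "12:34:56" with ':'.
triple :: Char -> Parser (Int, Int, Int)
triple sep = (,,) <$> decimal <* char sep <*> decimal <* char sep <*> decimal

-- Recognise one of the three level keywords.
level :: Parser Level
level =  (Info  <$ string "INFO")
     <|> (Warn  <$ string "WARN")
     <|> (Error <$ string "ERROR")

-- A full line: date, time, bracketed level, then the rest as the message.
logEntry :: Parser LogEntry
logEntry = LogEntry
  <$> triple '-' <* char ' '
  <*> triple ':' <* char ' '
  <*> (char '[' *> level <* char ']') <* char ' '
  <*> takeTill (\c -> c == '\n' || c == '\r')

-- A whole file: entries separated by newlines, optional trailing whitespace.
logFile :: Parser [LogEntry]
logFile = logEntry `sepBy` endOfLine <* skipSpace <* endOfInput

-- Run the parser over a complete input.
parseLog :: ByteString -> Either String [LogEntry]
parseLog = parseOnly logFile
```

Loaded into GHCi (with `:set -XOverloadedStrings` so that string literals are read as `ByteString`s), `parseLog "2014-05-28 12:34:56 [WARN] disk usage at 91%\n"` should come back as a `Right` holding a single `LogEntry`, while a malformed line produces a `Left` with an error message. The grammar reads almost like a description of the format, which is most of the appeal over a regex.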

I won’t make this post into an introduction to parsing with Attoparsec; there are already great intros out there that do a better job than I could. I’ll just echo the experiences of others out there and let you know that parsing ad hoc (or just new) file formats with Attoparsec leads to solutions that are

Moral of the story: if you’re doing data science, don’t be afraid to use different tools than you’re used to; they just might lead to better solutions.

  1. The post from which I borrowed that picture is also a pretty good read.

  2. Or perhaps that’s just at my workplace.

  3. Which I’ve also done because Haskell wasn’t an option (the people I was working with only liked Perl), and it led to some horrific circumstances similar to this.