tehgeekmeister’s blog

February 6, 2009

google code project, google group, and darcs repository for WikimediaParser

Filed under: programming — Tags: , — tehgeekmeister @ 10:38 am

So if anyone wants to contribute, they can now, easily!

The google code project is here, I intend to use it only for the wiki and bug tracker.  The darcs repository is on patch-tag.  The google group, if anyone has any suggestions or anything of the sort, is here.

In any case, expect at least a few more releases and a reasonably general, clean version from me soon; but help from others is more than welcome!

February 5, 2009

releasing WikimediaParser 0.1

Filed under: programming — Tags: , , — tehgeekmeister @ 8:17 pm

It’s a cabalized version of some very rough, but functional, tools I’m using for parsing wikimedia markup.  Currently it has some french wikipedia specific code, but by release 0.2 (which should come soon) I intend to have it general enough to be used for at least any language of wikipedia (and in later releases any wikimedia markup at all).  Anyone who wants to contribute some patches and get it there quicker, feel free to submit patches using darcs send (for now, that is.  I’ll have a darcs repository up on code.haskell.org soon, but for right now I’m locked out of my account).

hackage page

January 17, 2009

what to do when you can’t solve a problem with a hackage library you need?

Filed under: programming — Tags: — tehgeekmeister @ 8:12 pm

in trying to work on my graded reader project recently I encountered two problems that were beyond my ability to solve. the first was that HDBC’s quickQuery was loading all results into memory, while fetchAllRows was not loading any rows (in fact the query was never reaching postgresql). this made it so I couldn’t use HDBC with postgresql for my project. then I decided to try takusen, but couldn’t figure out the problems with the current cabal file for it, despite quite a lot of googling. I asked around on #haskell about both of these problems, and got some help, but neither got resolved; so my question for planet haskell is what should I do when I reach this sort of an impass? i’ll gladly fix any problems I can on my own and share the wealth, but I hate being stuck when it’s absolutely beyond me.

December 22, 2008

fast string appending/concatenation in haskell

Filed under: programming — tehgeekmeister @ 9:58 pm

working on my graded reader project yesterday, i was really frustrated by how slow it was going.  after profiling, i realized i was spending over 95% of my time appending strings; uh oh.  i did something stupid.

the code in question was something like this:

appendToContent str page = page {pageContent = newContent}
where newContent = (pageContent page) ++ str

which looks innocent enough to the unwary — after all, you’re just using the normal haskell append operator, right?

to see why this is so bad, let’s take a look at what it’s doing:

[]    ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)

this means that to concatenate two lists, we have to recurse thru EVERY value of the first list.  this means that you’re essentially calling (++) once for every element in the first list.  why does haskell do this?  let’s take a look at the definition of a list in haskell

data List a = [] | a : List a — (this isn’t how it’s really defined; it’s built into GHC, but this is how it’d look.)

this means that a List is either the empty list, [], or a pair of any value of type a and a list of values of type a.  this means there’s no direct way to access any element in the list but the very first one, so to concatenate two lists we have to traverse every element in the first list, deconstruct it, and reconstruct it.  that’s not so bad if you are always adding onto the front of a list, because each time you only have to traverse the new elements; but if you’re adding to the end of a list (what i was doing), you have to traverse all the old elements of the list.  and each time you append, there are now more elements to traverse the next time.

there are two ways around this:

  1. don’t append to the end of a list multiple times.
  2. if you have to, use DList or ShowS

since i was working with a String (just a list of characters in haskell), i used ShowS.  so the original code became

appendToContent str page = page {pageContent = newContent}
where newContent = (pageContent page) . showString str

the difference being that now str is of type ShowS, which is to say that for the initial value of pageContent we started with showString “” (showString is a function that takes two strings, and prepends the first to the second).  since pageContent starts as a value of type ShowS (or String -> String), we can compose it with more partial applications of showString.  when we finally are done appending, we simply apply pageContent to “” (the empty string), and it does all those appends once only, so we only traverse all the elements of each string once.

so.  if you find your haskell code involving a lot of appends/concatenations is slow, make sure you’re not doing what i did.  if you are, use the showString trick, and it’ll be much faster.  or if you’re working with lists of other types, the DList library on hackage should solve your problem in the same way.  just remember to make sure you’re composing calls, and then applying them to “”, because if you do each call one at a time it’ll be just as slow as regular appends.

March 18, 2008

how to grok a multi-file project?

Filed under: programming — Tags: , , — tehgeekmeister @ 12:38 am

i’ve decided that my next step in learning haskell and programming will be to get familiarized with and start contributing to a few open source projects.  the ones i’m most interested in and feel the most capable to use/contribute to in any way are yi and happs, and to get started i’ve built and looked around the source of both of them a bit: but i’ve found it very difficult to get oriented in projects of this size.  being that the most complicated things i’ve ever coded are somewhere along the lines of maybench or trivial exercises, i’m not used to figuring out where to start or how to go about understanding larger projects.  so, if anyone out there (hi planet haskell!) has any suggestions for how to get acquainted with a larger project, or can help in any other way, it’d be muchly appreciated.  even better if you’re involved in the development of either yi or happs and would be willing to answer some questions/point me in the right direction until i’m up to speed and able to contribute on my own!

p.s.: one idea i’d had was to attempt to document both projects, seeing as they both are in need of more documentation, and i’d absolutely have to become intricately familiar with the source in order to do that — but the problem remains of how to manage the complexity of a multi-file project, where one starts in order to figure out the whole contraption.

February 8, 2008

checkquick (and hello, planet haskell!)

Filed under: programming — Tags: , , — tehgeekmeister @ 12:51 pm

i’ve just uploaded a preliminary (and hackish) version of checkquick as requested here by eric kow.  it’s already usable, but will likely need a lot more polishing and changes before it reaches any reasonable state.

usage for right now is as follows:

  1. once you’ve checked out the repository, compile Main.hs to your liking.  (i’ll assume it’s compiled to checkquick and is in your $PATH, for simplicity)
  2.  run as follows: checkquick –setup=CMD –cleanup=CMD “CMD1” “CMD2”, where the commands supplied to setup and cleanup should be scripts that respectively setup and cleanup the environment, and CMD1, CMD2 (or more, it will accept an arbitrary number of commands to be timed) are the commands to be timed.

any suggestions or comments are very welcome, but please bear in mind — this is the first time i’ve written any software intended to be used by anyone else before!

January 12, 2008

better one line binary readers in haskell, thanks to commenters and redditors.

Filed under: programming — Tags: , — tehgeekmeister @ 6:38 pm

update to my previous post on binary readers.

from roconnor:

readBin s = x

[(x,"")] = Numeric.readInt 2 (`elem` "01") Char.digitToInt s -- probably the best way of doing it; reinventing what's available in the standard libraries isn't a good thing.

and two from pjdelport:

readBin = foldM (\a d -> (a*2 +) <$> elemIndex d "01") 0 -- as elemIndex evaluates to a Maybe, (<$>) or fmap is necessary to lift the arithmetic section into the Maybe. bonus points for using Maybe instead of error. readBin = (foldl' ((+) . (*2)) 0 <$>) . mapM (`elemIndex` "01") — this version separates “parsing” and summing the values of each bit. otherwise same benefits as the previous versions.

that’s it for now.

January 11, 2008

one line binary reader in haskell

Filed under: computer science, math, programming — Tags: , , — tehgeekmeister @ 4:22 pm

update: apparently this is now in the first page of google results for “binary reader”, so i feel obligated to point out that i posted an update on this, with better versions thanks to commenters both here and on reddit.

i post this in the tradition of pointlessly showing how concisely one’s favorite language can perform a given task. also, i’m slightly proud of this version, as my previous wasn’t nearly as to the point (and wasn’t point free!). so, here it is:

readBin = (`div` 2) . foldl (\x y -> (x + y) * 2) 0 . map (\c -> case c of {‘0’ -> 0; ‘1’ -> 1; _ -> error “Input is not a binary string.”})

previous to this, i was using:

readBin xs = core (map bin xs) 0
core [x] n = n + x
core (x:xs) n = core xs ((x + n) * 2)
bin c = case c of {‘0’ -> 0; ‘1’ -> 1; _ -> error ” Input is not a binary string.”}

which, while absolutely equivalent, is overly verbose. now i’ll explain the algorithm for those of you to whom it’s not blindingly obvious (i’m not doing anything tricky or special here, i swear. i’m not that cool.). so, first off, we’re mapping over this string (strings are lists of chars in haskell), and converting each ‘0’ or ‘1’ to the 0 or 1, respectively. also, if we come across something aside from a ‘0’ or ‘1’, we raise an error. once the string is converted to a list of ints, we consume the first value of the list, add it to an accumulator, and multiply that by two. why? because by definition any binary integer has the decimal value of the sum of all the digits, going from right to left, to the nth power, where n is the index starting from zero of the right to left position in the binary representation.

so, wait, if we’re summing and raising to the nth power each digit from right to left, then why are we using a left fold (foldl)? because, in this case, adding the digit to the accumulator (x in the lambda passed to foldl) and then multiplying that times two is exactly equivalent to the previous definition i gave of the decimal value of a binary integer. i’ll leave figuring out exactly why that’s the case to the reader. also, in case you didn’t notice, the fold i’m using in the first definition does one more multiplication by two than the other version, which is why, to reap the benefit of a more elegant, pointfree version, we must divide by two at the end.

the point, on the whole, of this? well, as i already said, it’s pointless. but if there were to be one, it’d be this: damn, haskell is cool. i couldn’t have ever done that in other languages i’ve used. (but i’m sure someone else could. i’m just not that cool.)

p.s.: if anyone finds any flaws with any of what i’ve said here, please take the time to comment — while this algorithm has passed a few test runs, and it seems right intuitively, i’m just not experienced enough to be sure it actually is.

January 5, 2008

obligatory monthly post

Filed under: math, Uncategorized — tehgeekmeister @ 10:18 pm

it seems i’ve been posting just about monthly so far. this is a decent pace — i expect it’ll get better.

nearly done with my preliminary studies of linear algebra: have ditched the previous course in favor of serge lang’s (hmm, have i mentioned this already? *checks. yes, i have.) okay, more up to date then, i’m actually reading lang’s introduction to linear algebra. while beezer’s text was good, it seems to be aimed more towards those studying math as a requisite for other studies rather than those interested in math itself (i don’t say math major or non-major, because, well, i’m not in college). this just wasn’t what i wanted. at the same time, lang’s book is much to my liking in being concise (at times a touch too concise, but nothing unconquerable if you’re dedicated and ready for linear algebra itself).

i’ve been finding myself bored with the internet lately; this is probably a good sign, as it means i’m using the internet as a tool for studies and communications and not as an end in itself. oddly, this might mean more posts. we’ll see. not like there’s anyone really dying for the next post.

i’m off to begin reading the hobbit and go to bed now.  potentially more later.

December 4, 2007

brief update

Filed under: academics, computer science, math, programming — tehgeekmeister @ 10:08 pm

i’ve just finished an initial version of a utility which i’ll have a chance to describe and develop further after i get done with the next bout of work.  i just got back from a three week job out of town, which occupied almost all of my time — now i’m at a local job that’s more hours, so it’s taking all my time as well.  will give more details as they’re available.  for now, here‘s what i developed with the help of those in #haskell today.

my parents have also bought me serge lang‘s algebra and linear algebra, which is very exciting to me.  they should be the start to getting me properly into advanced mathematics.

with that, i must leave you for now.  more later.

Older Posts »

Blog at WordPress.com.