
    extra special bitter
    hops are bitter. life is bitter. coincidence?

    Saturday, June 23, 2012

    Puppet and Poetry

    I have written literally thousands of haiku since the mid-90s. Most of them are archived, thanks to a program called Hypermail, which converts individual emails into individual HTML files. Ever since September 1999 I have been emailing my haiku to a special email address, which in turn files them into a dedicated email folder. Every so often I'll execute a shell script that processes the haiku in that folder with the aforementioned Hypermail. I put the resulting HTML files in a directory and then FTP them to my website, so that I can view the archived email at www.haikupoet.com/archive/.

    You'll notice that the archive starts at the beginning of the year. That's because Hypermail runs more slowly as more files are added to the folder. I got around this by renaming the archive folder at the end of each year and starting from scratch on January 1st; I have a separate folder for each year dating back to 1999. This made it easy to look up archived haiku by date, or sorted by subject (which in my case is also the first line). But if I wanted to search on a word appearing anywhere else in a haiku, I had to resort to onerous finds piped into greps - or to primitive shell scripts, as sketched below.
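
    To make "onerous" concrete, here is roughly the kind of command I mean. This is a sketch, not my actual workflow: the folder layout and the search word are made up, but the shape of it - a find piped into a grep across every yearly folder - is the point.

        # list every archived haiku file, in any year, that mentions a given word
        find ~/haiku/archive-* -name '*.html' -print0 \
          | xargs -0 grep -il 'dragonfly'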

    Not one to reinvent the wheel, I contacted David G. Lanoue, whose Kobayashi Issa website includes a nifty search feature. The Haiku Master's 10,000 poems are all saved in a single CSV file, which is then searched using PHP code. My haiku, however, were not saved in a CSV file, but in individual HTML files stored across multiple folders.

    I busied myself with writing yet another shell script to process the HTML files into a single text file. I added some post-processing using sed to translate unprintable characters and to strip extraneous text from the file. Then I taught myself just enough PHP to write a very simple search function, which I then added to my website. Victory at last!
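
    In outline, that script does something like the following. Treat this as a sketch rather than the real thing: the paths, file names, and sed expressions are stand-ins, and the actual script does a bit more cleanup.

        #!/bin/sh
        # flatten each Hypermail-generated HTML file into one line of plain
        # text and append it to a single searchable archive file
        ARCHIVE="$HOME/haiku/haiku-archive.txt"
        for f in "$HOME"/haiku/archive-2012/*.html; do
          # strip the HTML tags, drop unprintable characters, join the lines
          sed -e 's/<[^>]*>//g' -e 's/[^[:print:]]//g' "$f" \
            | tr '\n' ' ' >> "$ARCHIVE"
          printf '\n' >> "$ARCHIVE"   # one haiku per line of the archive
        done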

    ...except that I still had to email each new haiku to myself and then use Hypermail to convert it into a new HTML file, and I still had to process the resulting HTML file into a new line of text to be appended to the ever-growing archive. The fact that I still write haiku daily - often several times a day - means that this is not a static archive but a living document. I couldn't help thinking that the primitive techniques used to aggregate my haiku and make it searchable mirrored some of the challenges I saw every day in the workplace. Scope creep: what had been a simple archive had evolved into a searchable archive. Scalability: what worked for dozens or hundreds of haiku is insufficient for thousands. Maintainability: the tools being used may not be around forever, after which the whole process breaks down.

    There's also the issue of execution - it happens in two parts. The shell script that invokes Hypermail was written in 1999. I usually run it manually at the command line, but I used to run it via cron - that is, until I decided to make the archive searchable. Now I have another, more recent script that calls the first script and then concatenates all of the HTML files created this year into a single text file. I could "automate" this by running it once a night via cron, but what if I write several haiku during the course of a day and want the archive to be as up to date as possible at all times? What if I don't write anything for a day or two? Then the cron job runs in vain. Why isn't there an easy way to sense that I've added a haiku and append it to the existing archive, without a time trigger or a manual process?
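
    For reference, the "automated" version amounts to little more than a crontab entry along these lines (the script name here is made up):

        # rebuild the searchable archive every night at 2:00 AM,
        # whether or not any new haiku arrived that day
        0 2 * * * $HOME/bin/rebuild-haiku-archive.sh

    It fires on a schedule, not on an event - which is exactly the mismatch I'm complaining about.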

    Enter Puppet Labs. Their flagship product, Puppet, is software that enables systems administrators to automate configuration management and application deployment. My employer uses it for this and more, deploying and maintaining system and application software on hundreds of servers in a sprawling, complex enterprise. Surely it's up to the task of automating updates to my haiku archive.

    So here's what it needs to do: 1) detect a new email sent to my haiku archive address, 2) convert the email into a format readable and searchable on my website, and 3) append it to the existing data. Pretty easy, huh?
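
    Steps 2 and 3, at least, are easy to picture in isolation. A minimal sketch, leaving aside whatever ends up driving step 1 and assuming the raw email arrives on standard input (the archive path is made up):

        #!/bin/sh
        # given one raw email on stdin: strip the headers (everything up to
        # the first blank line), drop unprintable characters, flatten the
        # body to a single line, and append it to the searchable archive
        ARCHIVE="$HOME/haiku/haiku-archive.txt"
        sed '1,/^$/d' | sed 's/[^[:print:]]//g' | tr '\n' ' ' >> "$ARCHIVE"
        printf '\n' >> "$ARCHIVE"

    The hard part - and the part I want Puppet's help with - is step 1: noticing that a new email has arrived in the first place.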

    To do this, I'm going to need to know Puppet much better than I do now. Like most lazy sys admins (which I realize might be a redundant term), I tend to copy an existing Puppet configuration file and modify it for my own use. The Puppet ecosystem we use in my workplace was put together by another team and handed to us. I've never built it out from scratch.

    Hypermail is still available for free download from SourceForge, but it hasn't been updated since 2004. Who knows whether it will continue to be available? Besides, now that the goal is a single repository of searchable content, there's no need for the interim step of converting individual emails to individual HTML files that are then concatenated into a text file. Instead, each email should be processed as it arrives, directly into a searchable format, and added to the existing repository.

    Puppet will work with an ever-increasing number of tools, so as the technology changes, the Puppet code can change with it. I'll use Puppet to detect the new email and to orchestrate its inclusion into the archive. Under the hood, I have some ideas on how to replace Hypermail (ActiveMQ? RSS?), as well as alternatives to a flat text file (MySQL? NoSQL?). The PHP code would need to change in order to search a database instead of a text file, but maybe I'll use a programming language like Ruby instead.

    I don't know how to do many of the things I've suggested above, but I can't wait to get started...

    2 Comments:

    Blogger Unknown said...

    Hello,

    It seems to me that using Puppet for this task is over-complicating things...

    You could set up a mailbox dedicated to new poems (you could, for example, set it up to accept mail only from you, to avoid any spam), and then have a cron job on your server process each batch of email and delete the messages it has processed.

    This way you're only transforming new poems, but the setup stays simple (e.g. you can re-use your shell script and have it process a minimal number of poems at a time, and posting a new poem to your site then means nothing more than sending an e-mail to the right place).


    Also, for storing and searching, you might want to investigate a way of indexing the text of your poems and running your searches against that index. There are multiple solutions for that. For example:
    * storing in text files
    * or in rows of a database table
    * indexing via Sphinx
    * via Lucene
    * or via Apache Solr

    2:29 PM  
    Blogger extraspecialbitter said...

    Thanks for the comment. Using Puppet for this task does seem a bit like using a forklift to move a sheet of paper, but part of my intention is to learn how to build out a puppet master from scratch. On the other hand, my current method involves a dedicated mailbox, Hypermail, a pair of shell scripts and some PHP code - lots of bubble gum and duct tape. And while invoking it via a cron job does automate it to some degree, it's only masking the inefficiency - in my opinion. If I get technology fatigue somewhere down the line I may simplify my approach, but first I'll see if I can make this work.

    3:32 PM  
