<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>underdog-blog &#187; Uncategorized</title>
	<atom:link href="http://blog.underdog-projects.net/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.underdog-projects.net</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sat, 23 Oct 2010 18:43:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>URL encode / decode in JavaScript</title>
		<link>http://blog.underdog-projects.net/2010/02/url-encode-decode-in-javascript-2/</link>
		<comments>http://blog.underdog-projects.net/2010/02/url-encode-decode-in-javascript-2/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 19:57:13 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[urldecode]]></category>
		<category><![CDATA[urlencode]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=535</guid>
		<description><![CDATA[Decoding and encoding URLs in JavaScript should be pretty easy, especially since every browser has the functionality built in. At the time I did not find a way to use it from the JavaScript runtime, so I wrote it myself. The code I came up with is far from perfect but it worked [...]]]></description>
			<content:encoded><![CDATA[<p>Decoding and encoding URLs in JavaScript should be pretty easy, especially since every browser has the functionality built in. At the time I did not find a way to use it from the JavaScript runtime, so I wrote it myself. (The standard decodeURIComponent and encodeURIComponent functions do cover the common case.)<br />
The code I came up with is far from perfect, but it worked for me. To decode a URL, call url_decode(url). To go the other way, call utf16to8 to turn the string into UTF-8 bytes; the browser does the rest for you.</p>
<pre class="prettyprint javascript">
function url_decode(str){
var hex = /^[0-9a-fA-F]{2}/;
var arr = str.split('%');
if(arr.length&lt;2) return str;
/* the text before the first '%' is literal, never an escape */
var out = arr[0];
for(var i=1;i&lt;arr.length;i++)
{
  /* a '%' followed by two hex digits is an escaped byte */
  if(hex.test(arr[i])) {
    out += String.fromCharCode(parseInt(arr[i].substring(0,2),16))+arr[i].substring(2);
  } else {
    /* not an escape: keep the '%' as it was */
    out += '%'+arr[i];
  }
}
/* the decoded bytes are UTF-8; convert them to a JavaScript string */
return utf8to16(out);
}

function utf16to8(str) {
    var out, i, len, c;

    out = "";
    len = str.length;
    for(i = 0; i &lt; len; i++) {
	c = str.charCodeAt(i);
	if ((c &gt;= 0x0001) &#038;&#038; (c &lt;= 0x007F)) {
	    out += str.charAt(i);
	} else if (c &gt; 0x07FF) {
	    out += String.fromCharCode(0xE0 | ((c &gt;&gt; 12) &#038; 0x0F));
	    out += String.fromCharCode(0x80 | ((c &gt;&gt;  6) &#038; 0x3F));
	    out += String.fromCharCode(0x80 | ((c &gt;&gt;  0) &#038; 0x3F));
	} else {
	    out += String.fromCharCode(0xC0 | ((c &gt;&gt;  6) &#038; 0x1F));
	    out += String.fromCharCode(0x80 | ((c &gt;&gt;  0) &#038; 0x3F));
	}
    }
    return out;
}

function utf8to16(str) {
    var out, i, len, c;
    var char2, char3;

    out = "";
    len = str.length;
    i = 0;
    while(i &lt; len) {
	c = str.charCodeAt(i++);
	switch(c &gt;&gt; 4)
	{
	  case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
	    // 0xxxxxxx
	    out += str.charAt(i-1);
	    break;
	  case 12: case 13:
	    // 110x xxxx   10xx xxxx
	    char2 = str.charCodeAt(i++);
	    out += String.fromCharCode(((c &#038; 0x1F) &lt;&lt; 6) | (char2 &#038; 0x3F));
	    break;
	  case 14:
	    // 1110 xxxx  10xx xxxx  10xx xxxx
	    char2 = str.charCodeAt(i++);
	    char3 = str.charCodeAt(i++);
	    out += String.fromCharCode(((c &#038; 0x0F) &lt;&lt; 12) |
					   ((char2 &#038; 0x3F) &lt;&lt; 6) |
					   ((char3 &#038; 0x3F) &lt;&lt; 0));
	    break;
	}
    }
    return out;
}
</pre>
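<p>For comparison, here is a minimal sketch of the same round trip using the standard encodeURIComponent and decodeURIComponent functions, which write and read UTF-8 percent-escapes. The sample string is only an illustration and is not part of the code above.</p>
<pre class="prettyprint javascript">
// Built-in percent-encoding: characters outside the unreserved set
// are written as UTF-8 bytes in %XX form.
var original = 'caf\u00e9 au lait';
var encoded = encodeURIComponent(original);
var decoded = decodeURIComponent(encoded);

console.log(encoded); // caf%C3%A9%20au%20lait
console.log(decoded === original); // true
</pre>
<p>Unlike url_decode above, decodeURIComponent throws a URIError on a malformed escape such as a lone '%', so wrap it in try/catch when the input is not trusted.</p>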
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2010/02/url-encode-decode-in-javascript-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>tracking virtual links with google analytics</title>
		<link>http://blog.underdog-projects.net/2010/01/tracking-virtual-links-with-google-analytics/</link>
		<comments>http://blog.underdog-projects.net/2010/01/tracking-virtual-links-with-google-analytics/#comments</comments>
		<pubDate>Mon, 04 Jan 2010 17:57:00 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cloudtheweb.com]]></category>
		<category><![CDATA[google analytics]]></category>
		<category><![CDATA[javascript]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=521</guid>
		<description><![CDATA[Tracking dynamic sites is sometimes a bit tricky. Tracking systems typically specialize in page views. More sophisticated systems have their own way of tracking custom events (as shown here). Unfortunately, I needed to track clicks on an HTML canvas. To make these clicks visible to a tracking system, I wanted to transform each [...]]]></description>
			<content:encoded><![CDATA[<p>Tracking dynamic sites is sometimes a bit tricky. Tracking systems typically specialize in page views. More sophisticated systems have their own way of tracking custom events (<a href="http://code.google.com/apis/analytics/docs/tracking/eventTrackerOverview.html">as shown here</a>).<br />
Unfortunately, I needed to track clicks on an HTML canvas. To make these clicks visible to a tracking system, I wanted to transform each click into a virtual URL. That way I could use Google Analytics not only for tracking but also for popularity statistics of certain content.<br />
The script for doing so is actually pretty simple.</p>
<pre class='prettyprint'>
function trace(url){
var tracker = _gat._getTracker("UA-XXXXXXX-X");
tracker._trackPageview(url);
}
</pre>
<p>Now, every time I need to track something, I call this function with a custom-built URL.</p>
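<p>As a minimal sketch of how such a URL could be put together: the helper below turns a content type and a label into a virtual path. The name virtualUrl and the /virtual/ prefix are assumptions chosen for illustration; they are not part of Google Analytics.</p>
<pre class="prettyprint javascript">
// Hypothetical helper: builds a virtual path such as /virtual/tag/tomcat
// for something that has no real URL (for example a click on the canvas).
function virtualUrl(kind, label) {
  return '/virtual/' + encodeURIComponent(kind) + '/' + encodeURIComponent(label);
}

console.log(virtualUrl('tag', 'tomcat tutorial')); // /virtual/tag/tomcat%20tutorial
</pre>
<p>The result can then be handed to the trace function above, for example trace(virtualUrl('tag', 'tomcat')).</p>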
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2010/01/tracking-virtual-links-with-google-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>searching for hash strings in postgres</title>
		<link>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/</link>
		<comments>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/#comments</comments>
		<pubDate>Wed, 23 Sep 2009 13:54:14 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=442</guid>
		<description><![CDATA[For one of my projects I have a database with a rather large table consisting of just a URL and a corresponding id. For performance reasons I added an md5 column which holds a hash of the URL. With this column it should be a lot faster to look up a URL. CREATE TABLE pages ( id [...]]]></description>
			<content:encoded><![CDATA[<p>For one of my projects I have a database with a rather large table consisting of just a URL and a corresponding id. For performance reasons I added an md5 column which holds a hash of the URL. With this column it should be a lot faster to look up a URL.</p>
<pre class="prettyprint lang-sql">CREATE TABLE pages
(
  id bigint NOT NULL,
  url character varying(255),
  md5 character(32),
  CONSTRAINT pages_pkey PRIMARY KEY (id)
)</pre>
<p>The faster lookup should mainly come from the shorter column length (and therefore a smaller index). I don&#8217;t actually know whether the fixed width is good or bad here, but hashes don&#8217;t vary in length anyway. After creating this table I added a unique B-tree index on the md5 column to enable fast lookups.<br />
After a while I noticed a rather high CPU load on lookups against this table, so I tried to analyze the problem. First I tried the obvious through psql.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 ='abc';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using i_pages_md5 on pages  (cost=0.00..8.50 rows=1 width=166) (actual time=0.046..0.046 rows=0 loops=1)
   Index Cond: (md5 = 'abc'::bpchar)
 Total runtime: 0.157 ms
(3 rows)</pre>
<p>As the explain output shows, everything works fine and lookups shouldn&#8217;t be a problem. So something else had to be going on.</p>
<p>After that I tried the same select with an actual md5.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 = md5('abc');
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on pages  (cost=0.00..32017.63 rows=3994 width=166) (actual time=1203.699..1203.699 rows=0 loops=1)
   Filter: ((md5)::text = '900150983cd24fb0d6963f7d28e17f72'::text)
 Total runtime: 1203.769 ms
(3 rows)</pre>
<p>Now the plan changes quite a lot: I get a full table scan instead of an index scan. The total runtime goes from 0.157 ms to 1203.769 ms, a factor of roughly 7,700.</p>
<p>The reason for this dramatic change is a simple type mismatch. The md5 function returns a value of type text. To compare it with the md5 column, every column value has to be cast to text, which is the (md5)::text in the filter above. As a side effect the index can no longer be used, because it was built on the bpchar values of the column, not on their text form.</p>
<p>To solve this I just had to cast the result of the md5 function back to something compatible with the index type. My column is a fixed-width character field, which the database represents as bpchar (blank-padded character). After modifying the query as follows, the planner used the index again.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 = md5('abc')::bpchar;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using i_pages_md5 on pages  (cost=0.00..8.50 rows=1 width=166) (actual time=0.141..0.141 rows=0 loops=1)
   Index Cond: (md5 = '900150983cd24fb0d6963f7d28e17f72'::bpchar)
 Total runtime: 0.199 ms
(3 rows)</pre>
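<p>The same fix applies when the hash is computed in the application instead of in the database. The following is a minimal Node.js sketch: it hashes the URL with the standard crypto module and builds a parameterized query whose parameter is cast to bpchar explicitly. The function names and the $1 placeholder style are assumptions for illustration, and the cast is a precaution in case the driver sends the parameter as text.</p>
<pre class="prettyprint javascript">
var crypto = require('crypto');

// Returns the md5 of a string as 32 lowercase hex characters,
// the same format the md5() function in Postgres produces.
function md5Hex(input) {
  return crypto.createHash('md5').update(input, 'utf8').digest('hex');
}

// Hypothetical helper: query text plus parameter list for a lookup by URL.
// The cast keeps the comparison on the bpchar type of the indexed column.
function pageLookup(url) {
  return {
    text: 'select id, url from pages where md5 = $1::bpchar',
    values: [md5Hex(url)]
  };
}

console.log(md5Hex('abc')); // 900150983cd24fb0d6963f7d28e17f72
</pre>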
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>cloud-the-web &#8211; my new web project</title>
		<link>http://blog.underdog-projects.net/2009/08/cloud-the-web-my-new-web-project/</link>
		<comments>http://blog.underdog-projects.net/2009/08/cloud-the-web-my-new-web-project/#comments</comments>
		<pubDate>Sat, 08 Aug 2009 10:28:01 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[canvas]]></category>
		<category><![CDATA[cloudtheweb.com]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[javascript]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=379</guid>
		<description><![CDATA[A little while ago I started experimenting with some of the new HTML 5 features. Some seem pretty impressive, although others are rather unnecessary in my opinion. But one thing got me really hooked &#8211; the HTML canvas. The possibilities of this control are only limited by the performance of JavaScript and the missing 3D [...]]]></description>
			<content:encoded><![CDATA[<p>A little while ago I started experimenting with some of the new HTML 5 features. Some seem pretty impressive, although others are rather unnecessary in my opinion. But one thing got me really hooked &#8211; the HTML canvas.<br />
The possibilities of this control are only limited by the performance of JavaScript and the missing 3D feature (hopefully this comes pretty soon). With that technology I finally had a way to implement something I had wanted to try for some time. So here is the basic idea.</p>
<p>A lot of people out there use services like delicious, where you tag your favorite sites and make the tags available to other users. I started to grab that data and began to build a massive tag cloud. After some time the site had collected hundreds of thousands of links with their corresponding tags. So now you can start on the site and search for tags that interest you. These search tags are then correlated against the cloud database and you get the most active links for your tags. Here is an example.<br />
Let&#8217;s say you are interested in a Tomcat tutorial.<br />
tags:</p>
<blockquote><p> tutorial, tomcat</p></blockquote>
<p>results: </p>
<blockquote><p>howtoforge.com<br />
howtogeek.com<br />
java.sun.com</p></blockquote>
<p>Of course each result links to the concrete tutorial (not just the entry page).<br />
So much for the official part. For me this is more of a fun project. I prefer to start with some random tag and then wander around. It&#8217;s more like browsing, because you start at points that you don&#8217;t already know. You have the chance to break out of your existing network of most-used sites and see something new.</p>
<p>So have fun with it.<br />
<strong><br />
<a href="http://www.cloudtheweb.com">www.cloudtheweb.com</a><br />
</strong><br />
PS: As for the implementation &#8211; if you have any questions, just ask. I&#8217;m planning to explain some details about how it works in later posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/08/cloud-the-web-my-new-web-project/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>shell-tools.net version2</title>
		<link>http://blog.underdog-projects.net/2009/07/shell-tools-net-version2/</link>
		<comments>http://blog.underdog-projects.net/2009/07/shell-tools-net-version2/#comments</comments>
		<pubDate>Fri, 31 Jul 2009 21:10:27 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[shell-tools.net]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=403</guid>
		<description><![CDATA[After nearly a full year, I finally got the time to rework my existing shell-tools.net site. The changes are rather small, because I think the site already works well, so why destroy a working site. So here is what I have done. I have added some new features like evaluating XPaths, pretty print JSON and [...]]]></description>
			<content:encoded><![CDATA[<p>After nearly a full year, I finally got the time to rework my existing <a href="http://www.shell-tools.net">shell-tools.net</a> site. The changes are rather small, because I think the site already works well, so why destroy a working site. So here is what I have done.<br />
I have added some new features such as evaluating XPaths, pretty-printing JSON and transforming XML files into JSON. In addition, I modified the CSS slightly so that the current location is now highlighted in the navigation. I also removed some rarely used features because otherwise the menu would get too crowded.<br />
There are still some features which I would really like to implement but haven&#8217;t had the time for so far, for example resizing the input and output boxes on demand.</p>
<p>So enjoy the new features, and let&#8217;s hope the next release will not take another year.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/07/shell-tools-net-version2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>installing GLUEscript on debian squeeze 64bit</title>
		<link>http://blog.underdog-projects.net/2009/07/installing-gluescript-on-debian-squeeze-64bit/</link>
		<comments>http://blog.underdog-projects.net/2009/07/installing-gluescript-on-debian-squeeze-64bit/#comments</comments>
		<pubDate>Wed, 15 Jul 2009 14:06:15 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[64-bit]]></category>
		<category><![CDATA[debian]]></category>
		<category><![CDATA[GLUEscript]]></category>
		<category><![CDATA[javascript]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=381</guid>
		<description><![CDATA[The GLUEscript runtime is still in a pretty early development stage. Basically they use the Firefox SpiderMonkey JavaScript engine and build some useful libraries on top of that (like curl, mysql, filesystem support). They also provide a little help in the form of a small text file, but even with this, it still took me half a day [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="https://sourceforge.net/projects/gluescript/">GLUEscript</a> runtime is still in a pretty early development stage. Basically they use the Firefox SpiderMonkey JavaScript engine and build some useful libraries on top of that (like curl, mysql, filesystem support).<br />
They also provide a little help in the form of a small text file, but even with this, it still took me half a day to get my first installation done. Most of the issues I ran into were version mismatches, because Debian and Ubuntu ship older versions of the required libraries.</p>
<p>First download the GLUEscript source from SourceForge.<br />
The second tool you need to get this running is <a href="http://premake.sourceforge.net/">premake</a>. This is also a SourceForge project (you can use the binary version of it right away).<br />
After downloading, I copied the premake binary into the glue/src folder.</p>
<p>Now we can start by fetching the dependencies that Debian can satisfy.</p>
<pre class='prettyprint lang-shell'>
sudo apt-get install libnspr4-0d libnspr4-0d-dbg libnspr4-dev libcurl4-openssl-dev libwxgtk2.8-dev libssl-dev libiodbc2-dev libmysql++-dev
</pre>
<p>In addition to that I needed a library called <a href="http://pocoproject.org/">poco</a> in version 1.3.5 (all repositories I found only provided versions up to 1.3.3, and those don&#8217;t work). So get the source from <a href="http://pocoproject.org/download/">http://pocoproject.org/download/</a> (the complete version). Compiling it should cause no trouble because all the dependencies are already installed.</p>
<pre class='prettyprint lang-shell'>
/tmp$ cd poco-1.3.5-all/
/tmp/poco-1.3.5-all$ ./configure
Configured for Linux
/tmp/poco-1.3.5-all$ make
/tmp/poco-1.3.5-all$ sudo make install
</pre>
<p>Now let&#8217;s get back to configuring GLUEscript. All configuration is done via a Lua script which is then consumed by premake. The config file I needed to edit was premake.lua:</p>
<blockquote><pre>
-- Check NSPR
if ( string.len(nspr_dir) == 0 ) then
  print("Using the NSPR library which is part of GLUEscript")
  dopackage("nspr") -- build our own NSPR
  nspr_dir = "../nspr/include"
  nspr_lib = "nspr"
  nspr_lib_dir = project.libdir
else
  print("You are using your own NSPR library: ")
  nspr_dir = "/usr/include/nspr"
  print("nspr include: " .. nspr_dir)
  print("nspr lib: " .. nspr_lib_dir .. "/" .. nspr_lib)
end
</pre>
</blockquote>
<p>I copied the whole block just to make the position easier to find. The important part is the line added in the else branch:</p>
<blockquote><p>  nspr_dir = &#8220;/usr/include/nspr&#8221;</p></blockquote>
<p>This is needed because Debian puts the header files in a different place than the script expects.</p>
<p>After that we are done with configuring. To actually start the build process you have to run premake first.</p>
<pre class='prettyprint lang-shell'>
./premake gnu
make
</pre>
<p>The output will be generated to the following directory:</p>
<blockquote><p>glue/bin/Debug</p></blockquote>
<p>So far the makefile has no install target. So if you want to install this, you have to do it yourself.</p>
<p>PS: This only works for the 0.0.1 version. So far I didn&#8217;t get any more recent svn version running.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/07/installing-gluescript-on-debian-squeeze-64bit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>why sub-selects can be faster than inner joins</title>
		<link>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/</link>
		<comments>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/#comments</comments>
		<pubDate>Thu, 25 Jun 2009 12:28:26 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[join]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[sub-select]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=355</guid>
		<description><![CDATA[So here is my situation. I have 2 tables with the following DDL. CREATE TABLE tags ( id bigint NOT NULL, "value" character varying(150), CONSTRAINT tags_pkey PRIMARY KEY (id), CONSTRAINT tags_value_key UNIQUE (value) ) &#160; CREATE TABLE sites_tags ( sites_id bigint NOT NULL, pages_id bigint NOT NULL, tags_id bigint NOT NULL, count integer, updated timestamp [...]]]></description>
			<content:encoded><![CDATA[<p>So here is my situation. I have 2 tables with the following DDL.</p>
<pre class="prettyprint lang-sql">
CREATE TABLE tags
(
  id bigint NOT NULL,
  "value" character varying(150),
  CONSTRAINT tags_pkey PRIMARY KEY (id),
  CONSTRAINT tags_value_key UNIQUE (value)
)</pre>
<p>&nbsp;</p>
<pre class="prettyprint lang-sql">
CREATE TABLE sites_tags
(
  sites_id bigint NOT NULL,
  pages_id bigint NOT NULL,
  tags_id bigint NOT NULL,
  count integer,
  updated timestamp without time zone,
  CONSTRAINT sites_tags_pkey PRIMARY KEY (sites_id, pages_id, tags_id)
)
</pre>
<p>As you can see, the tags table is a simple id-to-value table. The second table is a join table between pages and tags.</p>
<p>The goal of my query is to get the most used tags from the join table. Only the first x rows are of interest to me, so I used a simple limit clause. For comparison, here is a simple query on the join table without the actual values.</p>
<pre class="prettyprint lang-sql">
select sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13185.22..13185.35 rows=50 width=12) (actual time=1974.893..1975.033 rows=50 loops=1)
   ->  Sort  (cost=13185.22..13192.27 rows=2819 width=12) (actual time=1974.888..1974.941 rows=50 loops=1)
         Sort Key: (sum(count))
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  HashAggregate  (cost=13056.34..13091.58 rows=2819 width=12) (actual time=1766.681..1876.092 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.120..690.719 rows=570756 loops=1)
 Total runtime: 1975.669 ms
(7 rows)
</pre>
<p>This statement just gives you a picture of the cost of a simple query (without fetching any actual values).</p>
<p>To make this query useful I needed to add the values. All the values are joined in from the tags table.</p>
<p>Here is the first implementation I came up with.</p>
<pre class="prettyprint lang-sql">
select t.value,sum(st.count) as anzahl from sites_tags st inner join tags t on t.id=st.tags_id group by st.tags_id,t.value order by anzahl desc limit 50;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123002.69..123002.82 rows=50 width=22) (actual time=12640.100..12640.239 rows=50 loops=1)
   ->  Sort  (cost=123002.69..124429.58 rows=570756 width=22) (actual time=12640.095..12640.153 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  GroupAggregate  (cost=91200.58..104042.59 rows=570756 width=22) (actual time=10165.002..12537.121 rows=66136 loops=1)
               ->  Sort  (cost=91200.58..92627.47 rows=570756 width=22) (actual time=10162.562..11564.604 rows=570756 loops=1)
                     Sort Key: st.tags_id, t.value
                     Sort Method:  external merge  Disk: 18808kB
                     ->  Hash Join  (cost=1877.06..24921.63 rows=570756 width=22) (actual time=259.674..3080.093 rows=570756 loops=1)
                           Hash Cond: (st.tags_id = t.id)
                           ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.070..781.449 rows=570756 loops=1)
                           ->  Hash  (cost=1050.36..1050.36 rows=66136 width=18) (actual time=259.518..259.518 rows=66136 loops=1)
                                 ->  Seq Scan on tags t  (cost=0.00..1050.36 rows=66136 width=18) (actual time=0.027..115.197 rows=66136 loops=1)
 Total runtime: 12647.403 ms
(14 rows)
</pre>
<p>As you can see, simply joining the tables makes this query a lot more expensive. Most of the cost comes from the more complicated group by clause: the execution engine has to join the tables and then sort all 570756 joined rows by id and string value, and that sort spills to disk (external merge, 18808kB).</p>
<p>The obvious way to avoid this is to remove the join. That raises the question of how to get the values from the second table. One way would be to let the program (in my case a PHP web application) query again for every line of the result set.<br />
Another approach is a sub-select in the select list. This way you don&#8217;t have the additional round trips from the application, and the grouping only needs the tags_id. Ideally the database would also run these sub-selects only for the rows actually returned (respecting the limit).</p>
<p>So here is the query I came up with (with its execution plan).</p>
<pre class="prettyprint lang-sql">
select (select value from tags t where t.id=st.tags_id),sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36511.94..36512.07 rows=50 width=12) (actual time=2682.650..2682.790 rows=50 loops=1)
   ->  Sort  (cost=36511.94..36518.99 rows=2819 width=12) (actual time=2682.645..2682.705 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  HashAggregate  (cost=13056.34..36418.30 rows=2819 width=12) (actual time=1752.934..2570.690 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.109..713.541 rows=570756 loops=1)
               SubPlan
                 ->  Index Scan using tags_pkey on tags t  (cost=0.00..8.27 rows=1 width=10) (actual time=0.006..0.007 rows=1 loops=66136)
                       Index Cond: (id = $0)
 Total runtime: 2683.478 ms
(10 rows)
</pre>
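<p>As you can see, it still costs a lot. The estimated cost is almost three times that of the query without the values (36512 vs. 13185), although the actual runtime is only about a third higher (2683 ms vs. 1976 ms). Compared to the join, the runtime drops to roughly a fifth (2683 ms vs. 12647 ms). Note that the sub-plan shows loops=66136, so the lookup runs once per group and not only for the 50 rows returned; the gain comes from grouping by tags_id alone with a HashAggregate instead of sorting all joined rows. The planner does not work out on its own that it would be enough to aggregate and apply the limit first and join the values afterwards. So far I have found no way to tell Postgres to do this more efficiently.<br />
So the simplest solution for me was the sub-select.</p>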
<p>So as this example shows, it is always a good idea to try different approaches to one and the same query. Often you can see lots of differences in the execution plan which can have a major impact on performance.</p>
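<p>To make the idea of aggregating and limiting first and fetching the values afterwards concrete, here is a small JavaScript model of that strategy. It is an illustration with made-up data and a hypothetical function name, not what Postgres executes internally.</p>
<pre class="prettyprint javascript">
// Made-up sample data standing in for the tags and sites_tags tables.
var tags = { 1: 'tomcat', 2: 'tutorial', 3: 'java' };
var sitesTags = [
  { tags_id: 1, count: 5 },
  { tags_id: 2, count: 7 },
  { tags_id: 1, count: 4 },
  { tags_id: 3, count: 2 }
];

function topTags(rows, limit) {
  // Step 1: sum(count) group by tags_id, using the id alone as the key.
  var sums = {};
  rows.forEach(function (row) {
    sums[row.tags_id] = (sums[row.tags_id] || 0) + row.count;
  });
  // Step 2: order by the sum, descending, and apply the limit.
  var top = Object.keys(sums).map(function (id) {
    return { tags_id: Number(id), anzahl: sums[id] };
  }).sort(function (a, b) {
    return b.anzahl - a.anzahl;
  }).slice(0, limit);
  // Step 3: fetch the tag value only for the rows that survived the limit.
  return top.map(function (row) {
    return { value: tags[row.tags_id], anzahl: row.anzahl };
  });
}

// tomcat (9) comes first, then tutorial (7)
console.log(topTags(sitesTags, 2));
</pre>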
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>get hostname from url as stored procedure in plpgsql</title>
		<link>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/</link>
		<comments>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 22:38:35 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[plpgsql]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=346</guid>
		<description><![CDATA[I just needed a simple stored procedure to extract the hostname from any given URL. So here is what I came up with. CREATE OR REPLACE FUNCTION getHostFromUrl(p_url character varying) RETURNS character varying AS $BODY$ declare begin return substring(p_url from 'http.?://(.*?)/(.*)'); end; $BODY$ LANGUAGE 'plpgsql' VOLATILE COST 100;]]></description>
			<content:encoded><![CDATA[<p>I just needed a simple stored procedure to extract the hostname from any given URL. So here is what I came up with.</p>
<pre class="prettyprint lang-sql">
CREATE OR REPLACE FUNCTION getHostFromUrl(p_url character varying)
  RETURNS character varying AS
$BODY$
declare
begin
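  -- matches http and https only; the host ends at the first '/', '?' or '#',
  -- so URLs without a trailing slash work as well
  return substring(p_url from '^https?://([^/?#]+)');
end;
$BODY$
  LANGUAGE 'plpgsql' IMMUTABLE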
  COST 100;</pre>
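<p>If the hostname is needed outside the database as well, the same idea can be sketched in JavaScript. The function name hostFromUrl is an assumption for illustration; its pattern accepts http and https and stops at the first slash, question mark or hash, so a port (if present) stays part of the result.</p>
<pre class="prettyprint javascript">
// Returns the host part of an http or https URL, or null if there is none.
function hostFromUrl(url) {
  var match = /^https?:\/\/([^\/?#]+)/.exec(url);
  return match ? match[1] : null;
}

console.log(hostFromUrl('http://blog.underdog-projects.net/2009/06/')); // blog.underdog-projects.net
console.log(hostFromUrl('ftp://example.org/file')); // null
</pre>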
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>extracting information from websites through xslt</title>
		<link>http://blog.underdog-projects.net/2009/06/extracting-information-from-websites-through-xslt/</link>
		<comments>http://blog.underdog-projects.net/2009/06/extracting-information-from-websites-through-xslt/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 17:05:40 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[XSLT]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=300</guid>
		<description><![CDATA[Today, most websites offer some kind of feed, so every user who wants to stay in touch can follow new publications very easily. Some sites support RSS feeds or mail notification. Although this is pretty common, there are still sites out there that don&#8217;t. For those I tried to find an easy solution. First problem [...]]]></description>
			<content:encoded><![CDATA[<p>Today, most websites offer some kind of feed, so every user who wants to stay in touch can follow new publications very easily. Some sites support RSS feeds or mail notification. Although this is pretty common, there are still sites out there that don&#8217;t. For those I tried to find an easy solution.<br />
The first problem is how to get the information into a usable format. HTML is not meant for complex data mining operations. So the first thing to look at would be to ignore the HTML part and do string analysis of the content. This can be really difficult, because you lose the structure of the site completely.<br />
Another approach would be to somehow use the DOM tree that the browser builds to obtain the data. As a side effect, the data mining could then easily be done via DOM operations. But even for that solution I found no engine that provides DOM support for HTML pages and can be built into an application.<br />
Keeping the DOM approach in mind, I started to look around for XML solutions which can also parse HTML data (because the two are not so different from one another). That brought me to the <a href="http://xmlsoft.org/">libXML</a> project. They implemented an open-source XML parser which can also parse HTML. Although the HTML part is still a bit shaky, it looked quite promising. That still left the problem of how to retrieve the information. LibXML provides no DOM interface at all. One thing it does provide is XSLT support. As an EAI developer, I found this a good compromise.</p>
<p>So here is a little tutorial on how to make this work in the shell.</p>
<p><strong>First, choose your site.</strong><br />
I chose <a href="http://www.deraktionaer.de/xist4c/web/Online---Musterdepot_id_1261_.htm">this one</a> (from a German stock magazine I subscribe to) just out of curiosity. I should also mention that this site already has e-mail notification (so there is no need to actually use this on that site).</p>
<p><strong>Second, get the XPath</strong> of the data you want to extract. If you know XPath this should be easy; if not, use something like Firebug to get there.<br />
<img class="aligncenter size-full wp-image-305" title="depot_xpath" src="http://blog.underdog-projects.net/wp-content/uploads/2009/06/depot_xpath.png" alt="depot_xpath" width="80%" /><br />
<strong>After that you can start creating your XSL</strong> script for the transformation. The complete XSLT should look something like this:</p>
<pre class="prettyprint lang-xml">&lt;?xml version="1.0"?&gt;
&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"&gt;
&lt;xsl:output method="text"/&gt;

&lt;xsl:template match="/"&gt;
	&lt;xsl:apply-templates select="*"/&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match="*"&gt;
	&lt;xsl:apply-templates select="*"/&gt;
&lt;/xsl:template&gt;

&lt;xsl:template match='//tr[@class="odd" and string-length(td[@class="date"])&gt;0]'&gt;
	&lt;xsl:text&gt;date: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="date"]'/&gt;
	&lt;xsl:text&gt;,action: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="action"]'/&gt;
	&lt;xsl:text&gt;,wkn: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="wkn"]'/&gt;
	&lt;xsl:text&gt;,name: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="name"]/a'/&gt;
	&lt;xsl:text&gt;,amount: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="amount"]'/&gt;
	&lt;xsl:text&gt;,value: &lt;/xsl:text&gt;
	&lt;xsl:value-of select='td[@class="value"]'/&gt;
	&lt;xsl:text&gt;
&lt;/xsl:text&gt;
	&lt;xsl:apply-templates/&gt;
&lt;/xsl:template&gt;

&lt;/xsl:stylesheet&gt;</pre>
<p>If you need more detail on XSLT, you should check out <a href="http://www.w3schools.com/xsl/default.asp">w3schools</a>. They have some good tutorials for beginners.<br />
The important part of this XSLT is the last template. This is the part that actually extracts your information. First comes the template match: here you insert the XPath you obtained before. Inside the template, each xsl:value-of selects (via XPath, relative to the matched row) one piece of information, and the xsl:text elements define how it is written out.</p>
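<p>To see what the match and select expressions of the last template pick out, the same selection can be mimicked in plain Python. This is only a sketch with a swapped-in technique (the standard library's ElementTree and its limited XPath subset instead of XSLT). The helper names make_row and extract_rows and the sample rows are made up for illustration, and the real page layout may differ.</p>

```python
import xml.etree.ElementTree as ET

# Field order used by the stylesheet's output line.
FIELDS = ["date", "action", "wkn", "name", "amount", "value"]


def make_row(css_class, cells):
    """Build a hypothetical tr element; the real page layout may differ."""
    row = ET.Element("tr", {"class": css_class})
    for field in FIELDS:
        td = ET.SubElement(row, "td", {"class": field})
        # In the stylesheet the name sits inside a link: td[@class="name"]/a
        target = ET.SubElement(td, "a") if field == "name" else td
        target.text = cells.get(field, "")
    return row


def extract_rows(root):
    """Mimic the last template: one output line per matching table row."""
    lines = []
    for tr in root.iter("tr"):
        # Template match: class "odd" and a non-empty td with class "date".
        if tr.get("class") != "odd":
            continue
        if not (tr.findtext("td[@class='date']") or ""):
            continue
        parts = []
        for field in FIELDS:
            path = "td[@class='name']/a" if field == "name" else f"td[@class='{field}']"
            parts.append(f"{field}: {tr.findtext(path) or ''}")
        lines.append(",".join(parts))
    return lines


# Made-up sample data, only to show which rows are selected.
table = ET.Element("table")
table.append(make_row("odd", {"date": "05.06.2009", "action": "buy", "wkn": "123456",
                              "name": "Example AG", "amount": "10", "value": "100"}))
table.append(make_row("odd", {"action": "row without a date"}))
table.append(make_row("even", {"date": "04.06.2009", "action": "sell"}))

for line in extract_rows(table):
    print(line)
```

<p>Only the first sample row is printed: the second one has no date and the third one is not an "odd" row, which is exactly what the predicate in the template match filters out.</p>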
<p><strong>Now you just have to put two and two together</strong> and you have your data mining solution.<br />
I added the following shell command to my crontab and now have a subscription to this site.</p>
<pre class="prettyprint lang-shell">curl -s http://www.deraktionaer.de/xist4c/web/Online---Musterdepot_id_1261_.htm | sed -e 's/&amp;/&amp;amp;/g' - | xsltproc -html online.xslt -</pre>
<p>Now that you have the raw information, it should be no problem to get it into a mailing list or database for future use.</p>
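<p>As one possible way to do the database part, here is a minimal Python sketch that parses the lines written by the stylesheet and stores them in SQLite. The table name depot, its column names and the sample line are assumptions for illustration. Because values may contain commas themselves (for example decimal commas), the line is matched against the known labels instead of being split on commas.</p>

```python
import re
import sqlite3

# Labels in the order the stylesheet writes them.
FIELDS = ("date", "action", "wkn", "name", "amount", "value")

# Builds "^date: (.*?),action: (.*?),...,value: (.*?)$" from the labels.
LINE = re.compile("^" + ",".join(f"{name}: (.*?)" for name in FIELDS) + "$")


def parse_line(line):
    """Return a dict of the six fields, or None if the line does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in zip(FIELDS, match.groups())}


def store(lines, connection):
    """Insert all parsable lines into a hypothetical 'depot' table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS depot ("
        "entry_date TEXT, entry_action TEXT, wkn TEXT, "
        "name TEXT, amount TEXT, entry_value TEXT, "
        "UNIQUE (entry_date, entry_action, wkn))"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    # OR IGNORE skips rows already stored by an earlier cron run.
    connection.executemany(
        "INSERT OR IGNORE INTO depot "
        "VALUES (:date, :action, :wkn, :name, :amount, :value)",
        rows,
    )
    connection.commit()
    return len(rows)


# Made-up sample line in the stylesheet's output format.
sample = ["date: 05.06.2009,action: buy,wkn: 123456,name: Example AG,amount: 10,value: 1.234,56"]
conn = sqlite3.connect(":memory:")
store(sample, conn)
store(sample, conn)  # a second run with the same data adds nothing
print(conn.execute("SELECT COUNT(*) FROM depot").fetchone()[0])
```

<p>The UNIQUE constraint is a design choice for the cron use case: the page is fetched again on every run, so most lines will already be in the table.</p>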
<p>PS: In case you wonder about the sed in the command: as I already mentioned, libXML is not the most flexible solution for parsing HTML. Especially when it comes to HTML entities like &amp;copy; (©), libXML gives up with an error. To avoid this, I turn every ampersand into an escaped ampersand before the page reaches the parser.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/06/extracting-information-from-websites-through-xslt/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>gnome 3.0, just my two cents</title>
		<link>http://blog.underdog-projects.net/2009/05/gnome-30-just-my-two-cents/</link>
		<comments>http://blog.underdog-projects.net/2009/05/gnome-30-just-my-two-cents/#comments</comments>
		<pubDate>Fri, 08 May 2009 12:10:59 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[gnome 3.0]]></category>
		<category><![CDATA[gnomeshell]]></category>
		<category><![CDATA[gtk]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=231</guid>
		<description><![CDATA[Over the last few weeks there have been several discussions about how to handle the GNOME 3.0 release. There are a lot of different views on what has to be done in GNOME to deserve a 3.0 release. One thing that often seems to be forgotten is that the timeframe for all this [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last few weeks there have been several discussions about how to handle the GNOME 3.0 release. There are a lot of different views on what has to be done in GNOME to deserve a 3.0 release.<br />
One thing that often seems to be forgotten is that the timeframe for all this is about a year. So some of the suggestions, like switching the <a href="http://gnomejournal.org/article/63/letter-from-the-editor-gnome-30">core language</a> of GTK, seem like overkill to me. Personally I think the C language is a good and solid base for GTK. It provides the performance and flexibility you need for such a framework.</p>
<p><strong>GTK</strong><br />
So from my point of view, the complexity and structure of the existing framework are a necessity. What can and should be done is to clean up, remove deprecated functionality, and maybe change little things to make the framework smoother.</p>
<p><strong>File management</strong><br />
Today file management is mainly done through applications like Nautilus and gnome-terminal. Managing information through folders and files has been the default for a long time. To me this seems a perfectly fine way to handle information. On the other hand, with increasing storage capacity I see more and more people wandering around the filesystem without any plan where to search for which information. There are multiple approaches to address this issue.<br />
One would be to do everything through a search engine (like <a href="http://desktop.google.com">google desktop search</a>, <a href="http://beagle-project.org/">beagle</a>). This would require that every piece of information is crawlable and that the indexer understands its structure and content. In reality this is not easily achievable. Most crawlers just don&#8217;t understand the file format or can&#8217;t analyze the content correctly. If I have a text file in my home folder and it is a header file, the information I want from it is completely different than if the file were a mailbox backup in mbox format. Today&#8217;s crawlers (at least the ones I have seen) ignore this fact and try to match strings against the plain content. This can help, but it is a very primitive way and the results are pretty random. So to me this does not seem like the way to go as a general approach.<br />
Another approach could be to embrace a level of meta information that gives the user a custom structure on top of the filesystem. Basically this means: just tag everything the way you want. This way you create a different abstraction level for finding information. For MP3 files in particular, this is already the default way to organize music (ID3 tags). For pictures it is emerging through EXIF and XMP. Both are useful and standardized formats, but as always, let the format wars begin. For other file types it is pretty thin out there. So far I have seen no general system for tagging arbitrary binary files. It would be nice to see some progress on that end.<br />
So there are quite a few things that could be done on the file management side, but as in the GTK section, the interesting question is what can be done in such a short timeframe. For me, a tag system which plugs into Nautilus (or other file managers) would be the ideal solution, because that way you have the chance of bringing it to other desktop environments like KDE or (let&#8217;s think big) maybe even Windows Explorer (yes, it is extensible through an API).</p>
<p><strong>UI</strong><br />
My final point is the UI, and by UI I mean the complete environment, from the desktop to application design. Let&#8217;s start with the desktop. It hasn&#8217;t changed for a very long time. So now there are two factions out there.<br />
One, everything should stay the same forever.<br />
Two, everything should be done in some new way (although nobody says how).<br />
To me both seem a bit drastic. The proposal to change the desktop to GNOME Shell seems like a good compromise. You get a new interface but basically keep the classic applet bar and application menu structure. One side effect of GNOME Shell is that it is implemented in JavaScript, so it should get easier to build extensions for it (at least in theory). Another plus would be the integration of graphical effects into the system. I know that this is somewhat controversial, because there are a lot of legacy systems out there which can&#8217;t handle it, but for me eye candy is always a good thing. So there should at least be a switch to disable the eye candy for those who don&#8217;t want it (so far I haven&#8217;t seen anything like this during my first try).<br />
A few weeks ago I was at a Microsoft conference (yes I know, evil) and they showed some of the new features of Windows 7. There was one thing which really impressed me. Aside from the new &#8216;superbar&#8217; (the former taskbar), they could integrate widgets on the fly anywhere on the desktop. These widgets were partly some of the old stuff known from Vista&#8217;s sidebar, but some were quite interesting and could easily be built via simple scripting. I hope that GNOME will also bring some widget support which is usable outside the gnome-applet bar. Especially support for scripting languages as a widget platform would really help to bring more developers to extending the desktop.<br />
I know there are a lot of projects out there which have the goal of bringing GTK to different languages like <a href="http://www.pygtk.org/">Python</a>, <a href="http://live.gnome.org/Gjs">JavaScript</a>, <a href="http://gtk2-perl.sourceforge.net/">Perl</a>, <a href="http://www.mono-project.com/">C#</a>, to name just a few, but so far these are still relatively complex to use for simple widgets or applications.<br />
So far I am no big fan of introducing another abstraction layer for designing applications (like XUL or XAML), but these systems encourage less experienced developers to create new content (as seen in the massive market of Firefox plugins). So maybe it should be considered.</p>
<p>That&#8217;s it so far. To me it seems there is a lot to cover for the GNOME 3.0 release. I think most of it is just visionary and not implementable in the short term. In the long term, however, I really would like to see some changes to the way users interact with computers (no, I do not mean multitouch).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/05/gnome-30-just-my-two-cents/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

