shell-tools.net version2

After nearly a full year, I finally got the time to rework my existing shell-tools.net site. The changes are rather small, because I think the site already works well, so why destroy a working site. So here is what I have done.
I have added some new features like evaluating XPaths, pretty print JSON and transform XML files into JSON. Further to that I modified the css slightly so that from now on the current location will be highlighted in the navigation. I also removed some features which were rarely used because otherwise the menu would get to crowded.
There are still some features which I really would like to implement, but haven’t got the time so far, for example resizing the input and output boxes on demand.

So enjoy the new features and lets hope the next release will not take another year.

1 Comment

installing GLUEscript on debian squeeze 64bit

The GLUEscript runtime is still in an pretty early development stage. Basically they use the Firefox spidermonkey javascript engine and build some useful libraries on top of that (like curl,mysql, filesystem support).
They also provide a little help in form of a little text file, but with this, it still took me half a day for my first installation. Most issues I got were based on version mismatches, because debian and also ubuntu use older versions of the required libraries.

First download the GLUEscript source from sourceforge.
A second tool you will need to get this running is premake. This is also a sourceforge project (you can use the binary version of it right away).
After downloading, I copied the premake binary into the glue/src folder.

So now we can start with fetching the dependencies which debian can fulfill.

sudo apt-get install libnspr4-0d libnspr4-0d-dbg libnspr4-dev libcurl4-openssl-dev libwxgtk2.8-dev libssl-dev libiodbc2-dev libmysql++-dev

In addition to that I needed a library called poco version 1.3.5 (all repositories I found just provided versions up to 1.3.3 -> those don’t work). So get the source from http://pocoproject.org/download/ (the complete version). Compiling that should make no trouble cause all the dependencies are already installed.

/tmp$ cd poco-1.3.5-all/
/tmp/poco-1.3.5-all$ ./configure
Configured for Linux
/tmp/poco-1.3.5-all$ make
/tmp/poco-1.3.5-all$ sudo make install

Now let’s get back to configuring GLUEscript. All configuration is done via lua script which will than be consumed by premake. The config file I needed to edit was the premake.lua file:

-- Check NSPR
if ( string.len(nspr_dir) == 0 ) then
  print("Using the NSPR library which is part of GLUEscript")
  dopackage("nspr") -- build our own NSPR
  nspr_dir = "../nspr/include"
  nspr_lib = "nspr"
  nspr_lib_dir = project.libdir
else
  print("You are using your own NSPR library: ")
  nspr_dir = "/usr/include/nspr"
  print("nspr include: " .. nspr_dir)
  print("nspr lib: " .. nspr_lib_dir .. "/" .. nspr_lib)
end

I copied the whole paragraph to just make it easier to find the position. Important is the added row in the else part.

nspr_dir = “/usr/include/nspr”

This is needed because debian has a different file structure for header files than the script expects it.

After that we are done with configuring. To actually start the build process you have to run premake first.

./premake gnu
make

The output will be generated to the following directory:

glue/bin/Debug

So far the makefile does not a an installation part. So if you want to install this you have to do it by yourself.

PS: This only works for the 0.0.1 version. So far I didn’t get any more recent svn version running.

, , ,

No Comments

why sub-selects can be faster than inner joins

So here is my situation. I have 2 tables with the following DDL.

CREATE TABLE tags
(
  id bigint NOT NULL,
  "value" character varying(150),
  CONSTRAINT tags_pkey PRIMARY KEY (id),
  CONSTRAINT tags_value_key UNIQUE (value)
)

 

CREATE TABLE sites_tags
(
  sites_id bigint NOT NULL,
  pages_id bigint NOT NULL,
  tags_id bigint NOT NULL,
  count integer,
  updated timestamp without time zone,
  CONSTRAINT sites_tags_pkey PRIMARY KEY (sites_id, pages_id, tags_id)
)

As you can see, the tags table is a simple value-id-table. The second table represents a join table between pages and tags.

The goal of my Query should be to get the most used tag from the join table. Only the first x-Rows are of interest to me. To get there I used a simple limit command. So just for comparison here a simple query of the join table without the actual values.

select sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13185.22..13185.35 rows=50 width=12) (actual time=1974.893..1975.033 rows=50 loops=1)
   ->  Sort  (cost=13185.22..13192.27 rows=2819 width=12) (actual time=1974.888..1974.941 rows=50 loops=1)
         Sort Key: (sum(count))
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  HashAggregate  (cost=13056.34..13091.58 rows=2819 width=12) (actual time=1766.681..1876.092 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.120..690.719 rows=570756 loops=1)
 Total runtime: 1975.669 ms
(7 rows)

This is just a statement to get you the picture of cost for a simple query (without fetching any actual values).

To make this query useful I needed to add the values. All the values will be joined through the tags table.

Here the first implementation I came up with.

select t.value,sum(st.count) as anzahl from sites_tags st inner join tags t on t.id=st.tags_id group by st.tags_id,t.value order by anzahl desc limit 50;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123002.69..123002.82 rows=50 width=22) (actual time=12640.100..12640.239 rows=50 loops=1)
   ->  Sort  (cost=123002.69..124429.58 rows=570756 width=22) (actual time=12640.095..12640.153 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  GroupAggregate  (cost=91200.58..104042.59 rows=570756 width=22) (actual time=10165.002..12537.121 rows=66136 loops=1)
               ->  Sort  (cost=91200.58..92627.47 rows=570756 width=22) (actual time=10162.562..11564.604 rows=570756 loops=1)
                     Sort Key: st.tags_id, t.value
                     Sort Method:  external merge  Disk: 18808kB
                     ->  Hash Join  (cost=1877.06..24921.63 rows=570756 width=22) (actual time=259.674..3080.093 rows=570756 loops=1)
                           Hash Cond: (st.tags_id = t.id)
                           ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.070..781.449 rows=570756 loops=1)
                           ->  Hash  (cost=1050.36..1050.36 rows=66136 width=18) (actual time=259.518..259.518 rows=66136 loops=1)
                                 ->  Seq Scan on tags t  (cost=0.00..1050.36 rows=66136 width=18) (actual time=0.027..115.197 rows=66136 loops=1)
 Total runtime: 12647.403 ms
(14 rows)

As you can see, simply joining the table makes this query quite complex. The part which consumes most of the cost is the more complicated group by clause. Now the execution engine has to join these tables and then sort all values by id and string (mostly the value is the important part).

To avoid this there only could be one solution – remove the join. With removing the join there comes the question how to get the values from the second table. One way to do this would be to use the program (in my case a php web application) to query again for every line of the result set.
Another way to approach this would be to do a sub-select in the select section. This way you don’t have the additional round trip of doing it in the application. Another advantage would be that the database would only do these sub-selects for the actually returning rows (with respect of the limit).

So here the query I came up with (with the query execution plan)

select (select value from tags t where t.id=st.tags_id),sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36511.94..36512.07 rows=50 width=12) (actual time=2682.650..2682.790 rows=50 loops=1)
   ->  Sort  (cost=36511.94..36518.99 rows=2819 width=12) (actual time=2682.645..2682.705 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  HashAggregate  (cost=13056.34..36418.30 rows=2819 width=12) (actual time=1752.934..2570.690 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.109..713.541 rows=570756 loops=1)
               SubPlan
                 ->  Index Scan using tags_pkey on tags t  (cost=0.00..8.27 rows=1 width=10) (actual time=0.006..0.007 rows=1 loops=66136)
                       Index Cond: (id = $0)
 Total runtime: 2683.478 ms
(10 rows)

As you can see i still costs a lot. It is still 3 times more expensive then doing it without the values. On the other hand the cost is only a fourth of the cost of the join. This is mostly owed to the limit clause. The join has no way of knowing that it would be enough to run the limit without the join and later join the values. So far I found no way to tell postgres to do this more efficient.
So the simplest solution for that would be to do sub-queries. With that, the limit clause will be honored.

So as this example shows, it is always a good idea to try different approaches to one and the same query. Often you can see lots of differences in the execution plan which can have a major impact on performance.

, , ,

1 Comment