<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>underdog-blog &#187; postgres</title>
	<atom:link href="http://blog.underdog-projects.net/tag/postgres/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.underdog-projects.net</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Sat, 23 Oct 2010 18:43:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>searching for hash strings in postgres</title>
		<link>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/</link>
		<comments>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/#comments</comments>
		<pubDate>Wed, 23 Sep 2009 13:54:14 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[hash]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=442</guid>
		<description><![CDATA[For one of my projects I have a database with a rather large table consisting of just a URL and a corresponding id. For performance reasons I added an md5 column which holds a hash of the URL. With this column it should be a lot faster to look up a URL. CREATE TABLE pages ( id [...]]]></description>
			<content:encoded><![CDATA[<p>For one of my projects I have a database with a rather large table consisting of just a URL and a corresponding id. For performance reasons I added an md5 column which holds a hash of the URL. With this column it should be a lot faster to look up a URL.</p>
<pre class="prettyprint lang-sql">CREATE TABLE pages
(
  id bigint NOT NULL,
  url character varying(255),
  md5 character(32),
  CONSTRAINT pages_pkey PRIMARY KEY (id)
)</pre>
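<p>For reference, a minimal sketch of how the hash column and its index could be set up. The index name i_pages_md5 is taken from the query plans in this post; the update statement is an assumption about how the column gets filled.</p>
<pre class="prettyprint lang-sql">-- Fill the hash column from the existing urls (assumed; not part of the original setup).
UPDATE pages SET md5 = md5(url);

-- Unique B-tree index on the hash column, named as in the query plans.
CREATE UNIQUE INDEX i_pages_md5 ON pages USING btree (md5);</pre>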
<p>The faster lookup should mainly come from the shorter column length (and therefore the smaller index). I don&#8217;t actually know whether the fixed width helps or hurts here, but hashes don&#8217;t vary in length anyway. After creating this table I added a unique B-tree index on the md5 column to enable fast lookups.<br />
After a while I noticed a rather high CPU load on lookups against this table, so I tried to analyze the problem. First I tried the obvious thing in psql.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 ='abc';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using i_pages_md5 on pages  (cost=0.00..8.50 rows=1 width=166) (actual time=0.046..0.046 rows=0 loops=1)
   Index Cond: (md5 = 'abc'::bpchar)
 Total runtime: 0.157 ms
(3 rows)</pre>
<p>As the explain output shows, everything works perfectly fine and lookups shouldn&#8217;t be a problem. So something else had to be going on.</p>
<p>After that I tried the same select with an actual md5.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 = md5('abc');
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on pages  (cost=0.00..32017.63 rows=3994 width=166) (actual time=1203.699..1203.699 rows=0 loops=1)
   Filter: ((md5)::text = '900150983cd24fb0d6963f7d28e17f72'::text)
 Total runtime: 1203.769 ms
(3 rows)</pre>
<p>Now the plan changes quite a lot: I get a full table scan instead of an index scan. The query time also goes up from 0.157 ms to 1203.769 ms, a factor of several thousand.</p>
<p>The reason for this dramatic change is a simple type mismatch. The md5 function returns a string of type text, while the md5 column is a fixed-width character column. To compare the two, postgres casts the column value of every row to text (see the (md5)::text in the filter). The side effect is that the index can no longer be used, because it was built on the original column type and not on the cast value.</p>
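<p>A quick way to see the two types side by side, assuming a server that provides pg_typeof (PostgreSQL 8.4 or later):</p>
<pre class="prettyprint lang-sql">-- Type of the stored column: character(32), which postgres calls bpchar internally.
select pg_typeof(md5) from pages limit 1;

-- Type of the function result: md5() returns text.
select pg_typeof(md5('abc'));</pre>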
<p>To solve this I just had to cast the result of the md5 function back to something that is compatible with the index type. In my case I used a fixed width character field which is represented in the database as bpchar (blank padded character). So after modifying the query to the following I was back on index usage.</p>
<pre class="prettyprint lang-sql">cloud=# explain analyze select * from pages where md5 = md5('abc')::bpchar;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using i_pages_md5 on pages  (cost=0.00..8.50 rows=1 width=166) (actual time=0.141..0.141 rows=0 loops=1)
   Index Cond: (md5 = '900150983cd24fb0d6963f7d28e17f72'::bpchar)
 Total runtime: 0.199 ms
(3 rows)</pre>
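<p>An alternative sketch that avoids the cast in every query is to store the hash as text, so the column type matches what md5() returns. This is an assumption and not something measured in this post; changing the column type rewrites the table and rebuilds its indexes, so it is best done at a quiet moment on a large table.</p>
<pre class="prettyprint lang-sql">-- Change the column from character(32) to text; the index is rebuilt automatically.
ALTER TABLE pages ALTER COLUMN md5 TYPE text;

-- Both sides of the comparison are now text, so no cast on the column is needed.
explain select * from pages where md5 = md5('abc');</pre>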
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/09/searching-for-hash-strings-in-postgres/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>why sub-selects can be faster than inner joins</title>
		<link>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/</link>
		<comments>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/#comments</comments>
		<pubDate>Thu, 25 Jun 2009 12:28:26 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[join]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[sub-select]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=355</guid>
		<description><![CDATA[So here is my situation. I have 2 tables with the following DDL. CREATE TABLE tags ( id bigint NOT NULL, "value" character varying(150), CONSTRAINT tags_pkey PRIMARY KEY (id), CONSTRAINT tags_value_key UNIQUE (value) ) &#160; CREATE TABLE sites_tags ( sites_id bigint NOT NULL, pages_id bigint NOT NULL, tags_id bigint NOT NULL, count integer, updated timestamp [...]]]></description>
			<content:encoded><![CDATA[<p>So here is my situation. I have 2 tables with the following DDL.</p>
<pre class="prettyprint lang-sql">
CREATE TABLE tags
(
  id bigint NOT NULL,
  "value" character varying(150),
  CONSTRAINT tags_pkey PRIMARY KEY (id),
  CONSTRAINT tags_value_key UNIQUE (value)
)</pre>
<p>&nbsp;</p>
<pre class="prettyprint lang-sql">
CREATE TABLE sites_tags
(
  sites_id bigint NOT NULL,
  pages_id bigint NOT NULL,
  tags_id bigint NOT NULL,
  count integer,
  updated timestamp without time zone,
  CONSTRAINT sites_tags_pkey PRIMARY KEY (sites_id, pages_id, tags_id)
)
</pre>
<p>As you can see, the tags table is a simple value-id-table. The second table represents a join table between pages and tags.</p>
<p>The goal of my query is to get the most used tags from the join table. Only the first x rows are of interest to me, so I added a simple limit clause. For comparison, here is a query on the join table alone, without the actual tag values.</p>
<pre class="prettyprint lang-sql">
select sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13185.22..13185.35 rows=50 width=12) (actual time=1974.893..1975.033 rows=50 loops=1)
   ->  Sort  (cost=13185.22..13192.27 rows=2819 width=12) (actual time=1974.888..1974.941 rows=50 loops=1)
         Sort Key: (sum(count))
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  HashAggregate  (cost=13056.34..13091.58 rows=2819 width=12) (actual time=1766.681..1876.092 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.120..690.719 rows=570756 loops=1)
 Total runtime: 1975.669 ms
(7 rows)
</pre>
<p>This statement is only meant to give you a picture of the cost of the plain query (without fetching any actual values).</p>
<p>To make this query useful I needed to add the values, which have to be fetched from the tags table.</p>
<p>Here is the first implementation I came up with.</p>
<pre class="prettyprint lang-sql">
select t.value,sum(st.count) as anzahl from sites_tags st inner join tags t on t.id=st.tags_id group by st.tags_id,t.value order by anzahl desc limit 50;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123002.69..123002.82 rows=50 width=22) (actual time=12640.100..12640.239 rows=50 loops=1)
   ->  Sort  (cost=123002.69..124429.58 rows=570756 width=22) (actual time=12640.095..12640.153 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  GroupAggregate  (cost=91200.58..104042.59 rows=570756 width=22) (actual time=10165.002..12537.121 rows=66136 loops=1)
               ->  Sort  (cost=91200.58..92627.47 rows=570756 width=22) (actual time=10162.562..11564.604 rows=570756 loops=1)
                     Sort Key: st.tags_id, t.value
                     Sort Method:  external merge  Disk: 18808kB
                     ->  Hash Join  (cost=1877.06..24921.63 rows=570756 width=22) (actual time=259.674..3080.093 rows=570756 loops=1)
                           Hash Cond: (st.tags_id = t.id)
                           ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.070..781.449 rows=570756 loops=1)
                           ->  Hash  (cost=1050.36..1050.36 rows=66136 width=18) (actual time=259.518..259.518 rows=66136 loops=1)
                                 ->  Seq Scan on tags t  (cost=0.00..1050.36 rows=66136 width=18) (actual time=0.027..115.197 rows=66136 loops=1)
 Total runtime: 12647.403 ms
(14 rows)
</pre>
<p>As you can see, simply joining the table makes this query far more expensive. The most expensive part is the more complicated group by clause: the execution engine now has to join the tables and then sort all 570,756 joined rows by id and string value, and that sort spills to disk (external merge, 18808kB).</p>
<p>To avoid this I saw only one solution: remove the join. That raises the question of how to get the values from the second table. One way would be to have the program (in my case a PHP web application) query again for every row of the result set.<br />
Another approach is a sub-select in the select list. This way you don&#8217;t pay for the additional round trips from the application. A further advantage would be that the database only has to run these sub-selects for the rows that are actually returned (with respect to the limit).</p>
<p>So here is the query I came up with (with its execution plan).</p>
<pre class="prettyprint lang-sql">
select (select value from tags t where t.id=st.tags_id),sum(st.count) as anzahl from sites_tags st group by st.tags_id order by anzahl desc limit 50;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36511.94..36512.07 rows=50 width=12) (actual time=2682.650..2682.790 rows=50 loops=1)
   ->  Sort  (cost=36511.94..36518.99 rows=2819 width=12) (actual time=2682.645..2682.705 rows=50 loops=1)
         Sort Key: (sum(st.count))
         Sort Method:  top-N heapsort  Memory: 20kB
         ->  HashAggregate  (cost=13056.34..36418.30 rows=2819 width=12) (actual time=1752.934..2570.690 rows=66136 loops=1)
               ->  Seq Scan on sites_tags st  (cost=0.00..10202.56 rows=570756 width=12) (actual time=0.109..713.541 rows=570756 loops=1)
               SubPlan
                 ->  Index Scan using tags_pkey on tags t  (cost=0.00..8.27 rows=1 width=10) (actual time=0.006..0.007 rows=1 loops=66136)
                       Index Cond: (id = $0)
 Total runtime: 2683.478 ms
(10 rows)
</pre>
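<p>As you can see, it still costs a lot. The estimated cost is almost 3 times that of the query without the values (36512 versus 13185), although the measured runtime only grows from 1976 ms to 2683 ms. On the other hand it takes roughly a fifth of the time of the join (2683 ms versus 12647 ms). I expected this to come from the limit clause, but the plan reports loops=66136 for the SubPlan, so the sub-select still ran once per group; the saving comes from grouping on tags_id alone with a HashAggregate instead of sorting all joined rows on disk. The join has no way of knowing that it would be enough to apply the limit first and join the values afterwards. So far I have found no way to tell postgres to do this more efficiently.<br />
So the simplest solution for me is the sub-query.</p>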
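<p>One variant that may be worth trying is to aggregate and limit in a derived table first and join only the 50 surviving rows. This is a sketch that was not part of the measurements above, so its plan and runtime are not shown; the alias top_tags is just an illustrative name.</p>
<pre class="prettyprint lang-sql">
select t.value, top_tags.anzahl
from (
  -- aggregate on the join table alone and keep only the top 50 tag ids
  select st.tags_id, sum(st.count) as anzahl
  from sites_tags st
  group by st.tags_id
  order by anzahl desc
  limit 50
) as top_tags
-- look up the tag values for those 50 rows only
inner join tags t on t.id = top_tags.tags_id
order by top_tags.anzahl desc;
</pre>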
<p>So as this example shows, it is always a good idea to try different approaches to one and the same query. Often you can see lots of differences in the execution plan which can have a major impact on performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/06/why-sub-selects-can-be-faster-than-inner-joins/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>get hostname from url as stored procedure in plpgsql</title>
		<link>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/</link>
		<comments>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 22:38:35 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[plpgsql]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=346</guid>
		<description><![CDATA[I just needed a simple stored procedure to extract the hostname from any given URL. So here is what I came up with. CREATE OR REPLACE FUNCTION getHostFromUrl(p_url character varying) RETURNS character varying AS $BODY$ begin return substring(p_url from '^https?://([^/]+)'); end; $BODY$ LANGUAGE plpgsql IMMUTABLE COST 100;]]></description>
			<content:encoded><![CDATA[<p>I just needed a simple stored procedure to extract the hostname from any given URL. So here is what I came up with.</p>
<pre class="prettyprint lang-sql">
CREATE OR REPLACE FUNCTION getHostFromUrl(p_url character varying)
  RETURNS character varying AS
$BODY$
begin
  -- substring(... from pattern) returns the text matched by the first
  -- parenthesized group: everything between "http://" or "https://" and
  -- the next "/" (or the end of the string when the URL has no path).
  return substring(p_url from '^https?://([^/]+)');
end;
$BODY$
  LANGUAGE plpgsql IMMUTABLE
  COST 100;</pre>
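<p>A usage sketch; the URL is just an example value:</p>
<pre class="prettyprint lang-sql">
select getHostFromUrl('http://blog.underdog-projects.net/tag/postgres/feed/');</pre>
<p>For this input the call should return the host part, blog.underdog-projects.net.</p>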
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/06/get-hostname-from-url-as-stored-procedure-in-plpgsql/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>copy a table across databases via dblink</title>
		<link>http://blog.underdog-projects.net/2009/02/copy-a-table-across-databases-via-dblink/</link>
		<comments>http://blog.underdog-projects.net/2009/02/copy-a-table-across-databases-via-dblink/#comments</comments>
		<pubDate>Fri, 06 Feb 2009 13:32:40 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[dblink]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=108</guid>
		<description><![CDATA[Recently I ran into the situation that I needed to copy a large subset of data from one database to another. Normally I would say, make a dump and then re-import the data into the new schema. But this solution has some serious drawbacks. First you have to copy the complete database. Second you have [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I needed to copy a large subset of data from one database to another. Normally I would say: make a dump and then re-import the data into the new schema. But this solution has some serious drawbacks. First, you have to copy the complete database. Second, you have to keep the structure of the data. Third, you have to copy the complete dump to the target location (a problem if it is not the same machine and your database is somewhat larger, e.g. a few gigabytes). With these drawbacks in mind I started searching for an alternative solution.</p>
<p>Here are some facts to describe my situation more precisely.</p>
<ul>
<li>database containing multiple tables</li>
<li>only one has relevant data</li>
<li>only the subset of 1 month is needed</li>
</ul>
<p><strong>DDL for the original table:</strong></p>
<pre class="prettyprint  lang-sql">CREATE TABLE realtime
(
name varchar(10),
date timestamp,
bid numeric,
ask numeric
)</pre>
<p><strong>DDL for the target table:</strong></p>
<pre class="prettyprint  lang-sql">CREATE TABLE realtime
(
symbol varchar(10),
date timestamp,
price numeric,
"day" char(5),
max numeric,
avg numeric,
atr numeric
)</pre>
<p><strong>here the mapping:</strong><br />
realtime.name -&gt; realtime.symbol<br />
realtime.date -&gt; realtime.date<br />
(realtime.bid + realtime.ask) /2 -&gt; realtime.price<br />
-&gt; other columns filled by trigger</p>
<p>To get this task done I decided to use a dblink between those two database instances (<a href="http://www.postgresql.org/docs/current/static/contrib-dblink.html">how-to here</a>).</p>
<p>So here is the statement I used to transfer the month of January to the new db:</p>
<pre class="prettyprint  lang-sql">-- Copy January 2009 only: the half-open range [2009-01-01, 2009-02-01)
-- keeps rows from 31 December out, since date is a timestamp.
insert into realtime (symbol,date,price)
select * from dblink('dbname=stocks',
              'select name,date,(bid+ask)/2 as price
              from realtime
              where date &gt;= to_date(''20090101'',''YYYYMMDD'') and date &lt; to_date(''20090201'',''YYYYMMDD'')')
         as t1 (name character varying,date timestamp,price numeric);</pre>
<p>As you can see, this approach is pretty straightforward. You basically write an insert statement for the new table and use a dblink call as the source. In the query passed to dblink you can apply any SQL criteria you like.</p>
<p>This solution has one real drawback: because of the way dblink operates, it is pretty slow. Here is what the postgres documentation has to say about this:</p>
<p>&#8220;dblink fetches the entire remote query result before returning any of it to the local system. If the query is expected to return a large number of rows, it&#8217;s better to open it as a cursor with dblink_open and then fetch a manageable number of rows at a time.&#8221;</p>
<p>For me the performance was OK because I only copied several hundred megabytes.</p>
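<p>The cursor approach from the documentation could look roughly like the following sketch. The connection name src, the cursor name jan_cur and the batch size of 10000 are assumptions chosen for illustration; the fetch statement has to be repeated (by hand or from a script) until it inserts no more rows.</p>
<pre class="prettyprint  lang-sql">-- Open a named connection to the source database and a cursor on it.
select dblink_connect('src', 'dbname=stocks');
select dblink_open('src', 'jan_cur',
       'select name,date,(bid+ask)/2 as price
        from realtime
        where date &gt;= ''2009-01-01'' and date &lt; ''2009-02-01''');

-- Fetch and insert one batch; repeat this statement until it inserts 0 rows.
insert into realtime (symbol,date,price)
select * from dblink_fetch('src', 'jan_cur', 10000)
         as t1 (name character varying,date timestamp,price numeric);

-- Clean up the cursor and the connection.
select dblink_close('src', 'jan_cur');
select dblink_disconnect('src');</pre>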
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2009/02/copy-a-table-across-databases-via-dblink/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Tibco EMS with database backend (postgresql)</title>
		<link>http://blog.underdog-projects.net/2008/11/tibco-ems-with-database-backend-postgresql/</link>
		<comments>http://blog.underdog-projects.net/2008/11/tibco-ems-with-database-backend-postgresql/#comments</comments>
		<pubDate>Sun, 16 Nov 2008 18:48:11 +0000</pubDate>
		<dc:creator>jens</dc:creator>
				<category><![CDATA[TIBCO]]></category>
		<category><![CDATA[EMS]]></category>
		<category><![CDATA[hibernate]]></category>
		<category><![CDATA[postgres]]></category>

		<guid isPermaLink="false">http://blog.underdog-projects.net/?p=27</guid>
		<description><![CDATA[I recently tried to build a JMS Server with database backend. The chosen product was the TIBCO EMS Server. The Server brings its own database support over hibernate. Unfortunately TIBCO supports only Oracle,Mysql and DB2 by default. Lucky me, I needed an installation for Postgres but this shouldn&#8217;t be a big deal, because hibernate supports [...]]]></description>
			<content:encoded><![CDATA[<p>I recently tried to set up a JMS server with a database backend. The chosen product was the TIBCO EMS server. The server brings its own database support via Hibernate.</p>
<p>Unfortunately TIBCO supports only Oracle, MySQL and DB2 by default. I needed an installation for Postgres, but this shouldn&#8217;t be a big deal, because Hibernate supports Postgres as well. You just have to modify the Hibernate configuration.</p>
<p>First install the EMS server and Hibernate (the version provided by TIBCO). After that, let&#8217;s start configuring.</p>
<p>To keep everything in the database you need 3 databases (3 separate EMS stores). I used the following 3 stores:</p>
<ol>
<li>emsmeta -&gt; for metadata content</li>
<li>emsnf -&gt; for nonfailsafe data</li>
<li>emsf -&gt; for failsafe data</li>
</ol>
<p>I used the same account for all 3 databases (just easier for testing purposes).</p>
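<p>A minimal sketch of how the account and the three databases could be created on the Postgres side, run as a superuser. The role name and password match the values used in the store configuration in this post; making the role the owner of each database is an assumption.</p>
<pre class="prettyprint lang-sql">-- One shared login role for all three stores (testing setup only).
CREATE ROLE ems LOGIN PASSWORD 'ems';

-- One database per EMS store.
CREATE DATABASE emsmeta OWNER ems;
CREATE DATABASE emsnf OWNER ems;
CREATE DATABASE emsf OWNER ems;</pre>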
<p>Set up the stores in the stores.conf file in the EMS folder:</p>
<pre class="prettyprint">[$sys.meta]
type=dbstore
dbstore_driver_url=jdbc:postgresql://localhost/emsmeta
dbstore_driver_username=ems
dbstore_driver_password=ems

[$sys.nonfailsafe]
type=dbstore
dbstore_driver_url=jdbc:postgresql://localhost/emsnf
dbstore_driver_username=ems
dbstore_driver_password=ems

[$sys.failsafe]
type=dbstore
dbstore_driver_url=jdbc:postgresql://localhost/emsf
dbstore_driver_username=ems
dbstore_driver_password=ems</pre>
<p>After that you have to change your tibemsd.conf file. The following lines are a sample of what you should add. You also have to add the path to your JDBC driver to the classpath (in this sample it is the last entry; change it to match your setup) and the path to your JVM (here it is the Debian lenny default).</p>
<pre class="prettyprint">dbstore_classpath       = ../../../components/eclipse/plugins/com.tibco.tpcl.org.hibernate_3.2.5.001/hibernate3.jar:../../../components/eclipse/plugins/com.tibco.tpcl.org.com.mchange.c3p0_0.9.1.001/c3p0-0.9.1.jar:antlr-2.7.6.jar:asm-attrs.jar:asm.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ehcache-1.2.3.jar:jta.jar:/usr/local/bin/oracledrivers/lib/ojdbc5.jar:/home/jens/tibco/tpcl/5.6/jdbc/postgresql-8.3-603.jdbc3.jar

## db section
#dbstore_driver_name     = oracle.jdbc.driver.OracleDriver
#dbstore_driver_dialect  = org.hibernate.dialect.Oracle10gDialect
dbstore_driver_name     = org.postgresql.Driver
dbstore_driver_dialect  = org.hibernate.dialect.PostgreSQLDialect
jre_library             = /usr/lib/jvm/java-6-sun/jre/lib/i386/server/libjvm.so</pre>
<p>As you can see, changing databases is easy. You just have to change the driver name and the dialect for Hibernate.</p>
<p>After that you have the basic configuration ready. The next thing to do is to initialize the databases. For that task TIBCO provides a tool which generates the proper SQL statements.</p>
<p>You can run the following command in your shell and get the sql as output.</p>
<pre class="prettyprint lang-sh">java -jar tibemsd_util.jar -tibemsdconf tibemsd.conf -createall</pre>
<p>For some reason the generated create statements had no &#8216;;&#8217; at the end of each command, so you can&#8217;t just pipe the output to psql. You have to do it the old-school way, by hand.</p>
<p>After that you can start the EMS server, and it stores all data in the selected databases.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.underdog-projects.net/2008/11/tibco-ems-with-database-backend-postgresql/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

