I was really pulling my hair out over some code that did programmatic transaction handling using Spring's PlatformTransactionManager on top of Hibernate. It is fairly complicated, with up to 4 different transactions running concurrently. At one point it needs to process 3 different result sets, like this:
TransactionStatus status1 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement1 = sessionFactory.getCurrentSession().connection().prepareStatement(...);
TransactionStatus status2 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement2 = sessionFactory.getCurrentSession().connection().prepareStatement(...);
TransactionStatus status3 = transactionManager.getTransaction(new DefaultTransactionDefinition(TransactionDefinition.PROPAGATION_REQUIRES_NEW));
PreparedStatement statement3 = sessionFactory.getCurrentSession().connection().prepareStatement(...);
... // set parameters on the statements
ResultSet rs1 = statement1.executeQuery();
ResultSet rs2 = statement2.executeQuery();
ResultSet rs3 = statement3.executeQuery();
... // process the result sets
... // close the result sets and statements
transactionManager.commit(status1);
transactionManager.commit(status2);
transactionManager.commit(status3);
This part of the code ran fine the first time, but the next time it would throw an exception saying "Session is closed" when trying to get the first transaction. Do you see the problem? I didn't see it for way too long. I spent time removing sections of code, staring at logs, and pulling lots of hair out before I finally noticed in the Spring logging that, after finishing that section of code, it was falling back to the wrong session.
Finally it dawned on me that Spring must be using a stack to keep track of the current session. When a transaction is started with PROPAGATION_REQUIRES_NEW, the current session is pushed on the stack, and a new session is created and becomes the current session. When that transaction finishes, the current session is closed, and the previous session is popped off of the stack to resume as the now current session.
Looking at the documentation about transaction propagation, it does talk about "inner" and "outer" transactions, but I don't think it's quite explicit enough at explaining the nesting relationship. And my problem is that I wasn't really considering them to be nested, thinking of them more as simply independent transactions (in my defense, the docs do say that they are "completely independent" transactions). That is why it took me so long to realize my mistake.
My problem was that I was creating transaction 1, then transaction 2, then 3, but I was committing them in that same order: transaction 1 first, then 2, then 3. Doing this messed up the stack, and in the end Spring's current session was one that had already been closed. So the next attempt to use the session would throw an exception telling me that the "Session is closed".
Rearranging the code to commit transaction 3 first, then 2, then 1, fixed the problem. I now have a better sense of how Spring works, working code, and less hair. I thought I'd write this up in case it helps someone else avoid this simple mistake.
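To make the ordering problem concrete, here is a toy model in plain Java. It is not Spring's code: `SessionStackDemo`, `Session`, and `Status` are made-up names, and the only assumption is the behaviour described above, where starting a transaction suspends the current session, and committing closes the transaction's own session and restores the one it suspended.

```java
/** Toy model of the session handling described above; not Spring's implementation. */
class SessionStackDemo {

    /** Hypothetical stand-in for a Hibernate session. */
    static final class Session {
        final String name;
        boolean open = true;
        Session(String name) { this.name = name; }
    }

    /** What a transaction remembers: its own session and the one it suspended. */
    static final class Status {
        final Session own;
        final Session suspended;
        Status(Session own, Session suspended) {
            this.own = own;
            this.suspended = suspended;
        }
    }

    private Session current = new Session("outer");
    private int counter = 0;

    /** Like PROPAGATION_REQUIRES_NEW: suspend the current session, make a new one current. */
    Status begin() {
        Session suspended = current;
        current = new Session("tx" + (++counter));
        return new Status(current, suspended);
    }

    /** Close the transaction's own session and resume the one it suspended. */
    void commit(Status status) {
        status.own.open = false;
        current = status.suspended;
    }

    boolean currentSessionIsOpen() {
        return current.open;
    }

    /** Begins three transactions, commits them in the given order (indexes 0..2),
     *  and reports whether the session left as current is still open. */
    static boolean survives(int... commitOrder) {
        SessionStackDemo manager = new SessionStackDemo();
        Status[] statuses = { manager.begin(), manager.begin(), manager.begin() };
        for (int index : commitOrder) {
            manager.commit(statuses[index]);
        }
        return manager.currentSessionIsOpen();
    }

    public static void main(String[] args) {
        System.out.println("commit 1, 2, 3 -> usable session: " + survives(0, 1, 2)); // false
        System.out.println("commit 3, 2, 1 -> usable session: " + survives(2, 1, 0)); // true
    }
}
```

Committing in creation order ends with the (already closed) session of transaction 2 as the current one, while committing in reverse order unwinds cleanly back to the outer session.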
nathan's noggin
mainly thoughts on: Java, MySQL, Spring, Anything else I'm working on
Wednesday, March 9, 2011
Wednesday, September 15, 2010
CruiseControl to Hudson
At work we recently switched from using CruiseControl to Hudson for continuous builds.
Originally I was just trying to upgrade CruiseControl so that we could add in a plugin to support Mercurial (we also recently switched from Subversion to Mercurial for version control). We were on a fairly old version of CruiseControl. After upgrading, we were having lots of headaches with the web interface freezing up, and I had to write a Mercurial label incrementer plugin, and I didn't really like the new web interface anyway. I finally got frustrated enough to try something new.
I threw Hudson on the machine, and had builds up and running so much more quickly than in CruiseControl that I was sold almost immediately. Some of the differences that I really liked were:
* No editing XML. All configuration in Hudson can be done through the web interface. It's also easy to create build configs for new clones/branches. When you create a new Job, you can create it based on an existing Job, and then just change a few names and paths.
* Plugins are listed and installed through the web interface. This was probably the single best part of Hudson compared to CruiseControl. With CruiseControl, I was left searching for plugins to see what existed, and going through the hassle of researching and installing plugins to try to get things working. Hudson lets you see all the available plugins in one place. That was really handy.
* Hudson automatically detects if test cases fail, and will mark the build as "unstable". It's also really good at showing you the test case failures within the Hudson web interface.
* More features. I could probably get all of these things in CruiseControl, but it's so hard to find plugins, that I didn't try very hard. In Hudson it was easy to get it to host/expose my javadocs. It's easy to link to the most recent artifacts with static urls. It's also possible to have "slave" build machines so that long builds don't hold everything up.
* The email notifications are better. I only get emailed the first time a build breaks, and not for every subsequent failure. It also emails me when the build is fixed.
Now I'm guessing that I could probably get CruiseControl to do most or maybe all of those things, but when Hudson makes it so easy, why try to figure CruiseControl out?
The only trick I had to work out was getting Hudson to show the build revision. We use build revisions in our bug tracking system, so that developers can let testers know what revision a bug fix was made in. The solution I found was to use the "Hudson Description Setter Plugin". I installed the plugin in Hudson, then modified my build script to output the Mercurial revision. I created an Ant target that my build target depends on:
<!-- The <if> task comes from Ant-Contrib, which must be loaded with a taskdef. -->
<target name="revision" description="Stores the latest revision number in revision property">
    <echo message="os: ${os.name}"/>
    <condition property="unix">
        <os family="unix"/>
    </condition>
    <echo message="is unix: ${unix}"/>
    <if>
        <equals arg1="${unix}" arg2="true"/>
        <then>
            <exec executable="sh" outputproperty="revision" errorproperty="revision-error" dir="${basedir}">
                <arg value="-c"/>
                <!-- Pass the pipeline as one argument so that sh -c receives it as a single command string. -->
                <arg value="hg identify -n | tr -d '+'"/>
            </exec>
        </then>
        <else>
            <property name="revision" value="unknown"/>
        </else>
    </if>
    <echo message="rev: ${revision}"/>
</target>
Then I activated the plugin in my build configuration in Hudson, under "Post-build Actions" (the checkbox named "Set build description"), and set the regular expression to "rev: (.*)" and the Description to "[version] \1". All subsequent builds will have a description with the Mercurial revision in it (e.g. "[version] 12345").
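If you want to check what that regular expression captures before wiring it into a job, here is a small sketch using `java.util.regex`. It only illustrates the expression; I'm assuming the plugin searches each line of the build log in a similar way, and `DescriptionRegexDemo` and `describe` are names made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Shows what "rev: (.*)" captures from a line of build output. */
class DescriptionRegexDemo {
    private static final Pattern REVISION_LINE = Pattern.compile("rev: (.*)");

    /** Returns the description for a log line, or null if the line has no revision. */
    static String describe(String logLine) {
        Matcher matcher = REVISION_LINE.matcher(logLine);
        if (!matcher.find()) {
            return null;
        }
        // Group 1 is what the plugin's \1 refers to in the Description field.
        return "[version] " + matcher.group(1);
    }

    public static void main(String[] args) {
        // Ant prefixes echo output with the task name, so find() is used rather than matches().
        System.out.println(describe("     [echo] rev: 12345")); // [version] 12345
        System.out.println(describe("BUILD SUCCESSFUL"));       // null
    }
}
```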
Tuesday, October 27, 2009
Storing a tree structure in a database
NOTE: I fixed a bug in the node update trigger on 16 August 2011.
I think that one of the tougher problems in database design is how to store tree data (parent-child relationships of arbitrary depth, where a child has at most one parent).
The two most common approaches are the Adjacency List model, and the Nested Set model. Both are explained and compared here. This forum post also has some good links to information on the two models.
In my opinion the major advantage and disadvantage of each are:
Adjacency Lists:
- Major advantage: Simple and easy to understand.
- Major disadvantage: It takes multiple queries to find all ancestors or all descendants of a node.
Nested Sets:
- Major advantage: A single query can find all ancestors or all descendants of a node.
- Major disadvantage: Modifying the tree structure affects half the nodes in the tree, on average.
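The approach I describe here keeps the adjacency list and adds a denormalized ancestry table on top of it. Let's say we have a simple node table: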
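CREATE TABLE node (
    -- UNSIGNED so the type matches the columns in node_ancestry_link that reference it
    node_id INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    parent_id INTEGER UNSIGNED,
    CONSTRAINT fk_node__parent FOREIGN KEY (parent_id) REFERENCES node (node_id) ON DELETE CASCADE
) ENGINE=InnoDB;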
Each node has at most one parent. Root nodes have a parent of null. Now we create an ancestor list, which is our denormalization table:
CREATE TABLE node_ancestry_link (
    node_id INTEGER UNSIGNED NOT NULL,
    ancestor_id INTEGER UNSIGNED NOT NULL,
    PRIMARY KEY(node_id, ancestor_id),
    INDEX ix_node_anc__anc_node (ancestor_id, node_id),
    CONSTRAINT fk_node_anc__node FOREIGN KEY (node_id) REFERENCES node(node_id) ON DELETE CASCADE,
    CONSTRAINT fk_node_anc__anc FOREIGN KEY (ancestor_id) REFERENCES node(node_id) ON DELETE CASCADE
) ENGINE=InnoDB;
Each node has an entry in the ancestry table for itself and for each of its ancestors. This table will grow much more quickly than the node table, especially for deep trees. This denormalization allows us to write a single query to get all ancestors of a node (the node itself included):
SELECT l.ancestor_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.node_id
WHERE n.node_id = :nodeid;
and all descendants of a node:
SELECT l.node_id FROM node n
JOIN node_ancestry_link l on n.node_id = l.ancestor_id
WHERE n.node_id = :nodeid;
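To see what ends up in the ancestry table, here is a small in-memory sketch in Java. It is only an illustration of the idea, not part of the schema: `AncestryLinkDemo`, `sampleTree`, and the other names are made up, and the four-node tree is an assumed example. As in the table, every node is linked to itself as well as to each of its ancestors.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory illustration of the node / node_ancestry_link tables. */
class AncestryLinkDemo {

    /** Assumed example tree as node_id -> parent_id: 1 is the root, 2 and 3 are its children, 4 is a child of 2. */
    static Map<Integer, Integer> sampleTree() {
        Map<Integer, Integer> parentOf = new HashMap<>();
        parentOf.put(1, null);
        parentOf.put(2, 1);
        parentOf.put(3, 1);
        parentOf.put(4, 2);
        return parentOf;
    }

    /** Builds node_id -> ancestor_ids (self included) by walking up the parent pointers. */
    static Map<Integer, Set<Integer>> buildLinks(Map<Integer, Integer> parentOf) {
        Map<Integer, Set<Integer>> links = new TreeMap<>();
        for (Integer node : parentOf.keySet()) {
            Set<Integer> ancestors = new TreeSet<>();
            for (Integer cursor = node; cursor != null; cursor = parentOf.get(cursor)) {
                ancestors.add(cursor);
            }
            links.put(node, ancestors);
        }
        return links;
    }

    /** Same idea as the ancestor query: all links whose node_id is the given node. */
    static Set<Integer> ancestorsOf(Map<Integer, Set<Integer>> links, int nodeId) {
        return links.getOrDefault(nodeId, Collections.emptySet());
    }

    /** Same idea as the descendant query: all links whose ancestor_id is the given node. */
    static Set<Integer> descendantsOf(Map<Integer, Set<Integer>> links, int nodeId) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<Integer, Set<Integer>> entry : links.entrySet()) {
            if (entry.getValue().contains(nodeId)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> links = buildLinks(sampleTree());
        System.out.println(ancestorsOf(links, 4));   // [1, 2, 4]
        System.out.println(descendantsOf(links, 1)); // [1, 2, 3, 4]
    }
}
```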
Now, to help us keep the ancestry table in sync as changes are made in the node table, we define some triggers:
DELIMITER |
CREATE TRIGGER tr_node_ins AFTER INSERT ON node
FOR EACH ROW
BEGIN
    -- Every node is linked to itself
    INSERT INTO node_ancestry_link (node_id, ancestor_id) VALUES (NEW.node_id, NEW.node_id);
    -- and inherits all of its parent's links
    IF NEW.parent_id IS NOT NULL THEN
        INSERT INTO node_ancestry_link (node_id, ancestor_id) SELECT NEW.node_id, l.ancestor_id FROM node_ancestry_link l WHERE l.node_id = NEW.parent_id;
    END IF;
END
|
CREATE TRIGGER tr_node_upd AFTER UPDATE ON node
FOR EACH ROW
BEGIN
    IF NEW.parent_id <> OLD.parent_id OR ((NEW.parent_id IS NULL) <> (OLD.parent_id IS NULL)) THEN
        -- Remove the links between the moved subtree and the ancestors of the old parent
        IF OLD.parent_id IS NOT NULL THEN
            DELETE FROM links USING node_ancestry_link links
                JOIN node_ancestry_link anclinks ON links.ancestor_id = anclinks.ancestor_id
                JOIN node_ancestry_link deslinks ON links.node_id = deslinks.node_id
                WHERE anclinks.node_id = OLD.parent_id
                AND deslinks.ancestor_id = NEW.node_id;
        END IF;
        -- Link every node in the moved subtree to every ancestor of the new parent
        IF NEW.parent_id IS NOT NULL THEN
            INSERT INTO node_ancestry_link (node_id, ancestor_id)
                SELECT desnodes.node_id, ancnodes.ancestor_id
                FROM node_ancestry_link ancnodes
                CROSS JOIN node_ancestry_link desnodes
                WHERE ancnodes.node_id = NEW.parent_id
                AND desnodes.ancestor_id = NEW.node_id;
        END IF;
    END IF;
END
|
DELIMITER ;
With the triggers in place, we can edit the node hierarchy without having to worry about the ancestry table. We only need to worry about inserts and updates because of the CASCADE DELETEs on the ancestry table foreign keys.
If you need to know how many descendants a particular node has, you may want to track the descendant count in the node table, since the count queries will be expensive for large trees.
You could augment the triggers to update the descendant counts when nodes are inserted, updated, or deleted. This would require adding a delete trigger. The one gotcha with doing this in MySQL is that you can't depend on cascade deletes when you delete nodes. MySQL has a bug/feature that cascade deletes don't fire delete triggers. For keeping descendant counts up-to-date this isn't a problem, as long as you take it into account, and when a node is deleted, subtract its descendant count from its ancestors.
Tuesday, September 1, 2009
re-dispatching events in Flash
I recently ran into a head-scratcher, and couldn't find any help online, so I thought I'd post my solution.
I had a custom Flex component, ZoomControl:
<mx:VBox>
    <mx:Metadata>
        [Event(name="zoomChanged", type="ZoomEvent")]
    </mx:Metadata>
    ...
</mx:VBox>
ZoomControl can dispatch a "zoomChanged" event, of a custom class ZoomEvent. I then created a Toolbar custom component that contained the ZoomControl:
<mx:VBox>
    <mx:Metadata>
        [Event(name="zoomChanged", type="ZoomEvent")]
    </mx:Metadata>
    ...
    <ZoomControl
        zoomChanged="dispatchEvent(event)"
    />
</mx:VBox>
When the ZoomControl dispatches a ZoomEvent, I want my Toolbar to re-dispatch the event. Seems simple, right? But when the Toolbar calls dispatchEvent, it throws an exception:
TypeError: Error #1034: Type Coercion failed: cannot convert flash.events::Event@19ea2df1 to footnote.imageviewer.events.ZooomEvent.
at flash.events::EventDispatcher/dispatchEventFunction()
at flash.events::EventDispatcher/dispatchEvent()
at mx.core::UIComponent/dispatchEvent()[E:\dev\3.0.x\frameworks\projects\framework\src\mx\core\UIComponent.as:9051]
at Toolbar/__zoomControl_zoomChanged()...
After some wasted time trying to make sure that it really was a ZoomEvent being passed to dispatchEvent, I finally took time to read the documentation on UIComponent.dispatchEvent:
* If the event is being redispatched, a clone of the event is created automatically.
* After an event is dispatched, its target property cannot be changed,
* so you must create a new copy of the event for redispatching to work.
My problem was that I needed to override the clone method in my ZoomEvent. The dispatchEvent method was calling the base Event.clone(), which was returning an Event, of course. Overriding the clone method to return a ZoomEvent solved the problem.
Friday, June 12, 2009
SuperDuper "Smart Update" doesn't stay smart
I use SuperDuper for backups on my work machine (MacBook Pro), and I have it set up to back up daily to an external drive using "Smart Update", which is supposed to be fast and only copy things that have changed. I really have no idea how it works, nor do I care, as long as it's working.
The problem is that it has started to take longer and longer to run. It had reached the point where it was taking 2 hours or more to finish (my drive is 120G and I don't back all of it up). I found very little help by searching on Google, so I asked the Sysadmin, who also uses SuperDuper, if he had seen the same problem. He said "I don't know, I only do full backups once a week".
That made me think that maybe SuperDuper just doesn't handle backing up repeatedly using Smart Update only. So I revised my schedule to do a full backup once a week, and Smart Updates daily. Success! My daily backups are back down to 15 minutes or less! I thought I'd post this in case it helps someone else.
Friday, April 24, 2009
SELECT DISTINCT with ORDER BY
I recently wrote a query in MySQL that didn't seem to be returning the right results, and at first I couldn't figure out why. Here is a toy example, where we are tracking pages and page views (one-to-many relationship):
create table page (
    page_id integer unsigned primary key,
    name varchar(32) not null,
    created datetime not null
) engine=InnoDB;
create table page_view (
    page_view_id integer unsigned primary key,
    page_id integer unsigned not null,
    created datetime not null,
    foreign key (page_id) references page (page_id) on delete cascade
) engine=InnoDB;
What I want to get is the most recently viewed pages. Let's say I have the following data in my tables:
mysql> select * from page;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       1 | page 1 | 2000-01-01 00:00:00 |
|       2 | page 2 | 2000-01-02 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
mysql> select * from page_view;
+--------------+---------+---------------------+
| page_view_id | page_id | created             |
+--------------+---------+---------------------+
|            1 |       3 | 2000-01-01 00:00:00 |
|            2 |       1 | 2000-01-02 00:00:00 |
|            3 |       1 | 2000-01-03 00:00:00 |
|            4 |       3 | 2000-01-04 00:00:00 |
|            5 |       2 | 2000-01-05 00:00:00 |
|            6 |       4 | 2000-01-06 00:00:00 |
|            7 |       2 | 2000-01-07 00:00:00 |
+--------------+---------+---------------------+
7 rows in set (0.00 sec)
What I want to get back is page 2 (most recently viewed), then page 4, then page 3, then page 1.
So I write my query:
mysql> select distinct p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       4 | page 4 | 2000-01-04 00:00:00 |
|       2 | page 2 | 2000-01-02 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
That's not right at all! What's going on?
The problem is that I'm applying distinct to columns from the page table only, but ordering by a column from the page_view table. Since the relationship is one-to-many, what is the database supposed to do when a page has multiple views? Which view should it use for the order by?
What I wanted the query to do is first join, then order, then apply the distinct. That's not what MySQL does, though. It first joins, then applies the distinct, then orders the results (or something like that). You can think of it like MySQL going sequentially through the page_view table, finding rows with distinct page ids. So it would pick rows 1,2,5,6:
+--------------+---------+---------------------+
| page_view_id | page_id | created             |
+--------------+---------+---------------------+
|            1 |       3 | 2000-01-01 00:00:00 |
|            2 |       1 | 2000-01-02 00:00:00 |
|            5 |       2 | 2000-01-05 00:00:00 |
|            6 |       4 | 2000-01-06 00:00:00 |
+--------------+---------+---------------------+
4 rows in set (0.00 sec)
You can see that if you order those by created, you get the page order that the (badly written) query returned (4,2,1,3).
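The same mental model can be written out as a few lines of Java. This is only a sketch of the "distinct first, then order" picture described above, not a claim about how MySQL actually executes the query; `DistinctThenOrderDemo` is a made-up name, and the rows are the page_view data from the example with the timestamp reduced to the day of the month.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Keeps the first view seen for each page, then orders pages by that view, newest first. */
class DistinctThenOrderDemo {
    // page_view rows as {page_id, day of January 2000}, in page_view_id order.
    static final int[][] PAGE_VIEWS = {
        {3, 1}, {1, 2}, {1, 3}, {3, 4}, {2, 5}, {4, 6}, {2, 7}
    };

    static List<Integer> distinctThenOrder() {
        // "Distinct": the first row seen for a page wins, later views are ignored.
        Map<Integer, Integer> keptViewDay = new LinkedHashMap<>();
        for (int[] view : PAGE_VIEWS) {
            keptViewDay.putIfAbsent(view[0], view[1]);
        }
        // "Order by created desc", applied to whichever view happened to be kept.
        List<Integer> pageIds = new ArrayList<>(keptViewDay.keySet());
        pageIds.sort((a, b) -> keptViewDay.get(b).compareTo(keptViewDay.get(a)));
        return pageIds;
    }

    public static void main(String[] args) {
        System.out.println(distinctThenOrder()); // [4, 2, 1, 3]
    }
}
```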
We can force MySQL to do things in the order we want by changing the query to:
mysql> select distinct p.page_id, p.name, p.created from (select p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc) as p;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       2 | page 2 | 2000-01-02 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
But I think that's kind of a hack, and it depends on MySQL applying the distinct in a certain order (I don't think order by in a subquery is standard SQL, and it shouldn't necessarily constrain the order of the entire query). So what's the "right" way to write this type of query?
Before I tackled that, I thought, "What would a strict database like PostgreSQL do with this type of query?" My hope was that it would throw it out altogether. And it does. Here's what I get:
postgres=# select distinct t1.id, t1.name, t1.created from table1 t1 join table2 t2 on t1.id = t2.table1_id order by t2.created desc;
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
That's much better, and the error message is very helpful, and makes sense. So here's the query I came up with that will give the correct results, and is correct SQL, in MySQL...:
mysql> select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       2 | page 2 | 2000-01-02 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
...and in PostgreSQL:
postgres=# select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
 page_id |  name  |       created
---------+--------+---------------------
       2 | page 2 | 2000-01-02 00:00:00
       4 | page 4 | 2000-01-04 00:00:00
       3 | page 3 | 2000-01-03 00:00:00
       1 | page 1 | 2000-01-01 00:00:00
(4 rows)
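For comparison, here is what the group by / max(created) subquery computes, written as a short Java sketch over the same example rows. `LatestViewOrderDemo` is a made-up name and the timestamps are again reduced to the day of the month; it mirrors the logic of the query, not its execution plan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Finds the latest view per page, then orders pages by that latest view, newest first. */
class LatestViewOrderDemo {
    // page_view rows as {page_id, day of January 2000}, in page_view_id order.
    static final int[][] PAGE_VIEWS = {
        {3, 1}, {1, 2}, {1, 3}, {3, 4}, {2, 5}, {4, 6}, {2, 7}
    };

    static List<Integer> mostRecentlyViewedFirst() {
        // Equivalent of: select page_id, max(created) from page_view group by page_id
        Map<Integer, Integer> lastViewDay = new HashMap<>();
        for (int[] view : PAGE_VIEWS) {
            lastViewDay.merge(view[0], view[1], Integer::max);
        }
        // Equivalent of: order by v.created desc
        List<Integer> pageIds = new ArrayList<>(lastViewDay.keySet());
        pageIds.sort((a, b) -> lastViewDay.get(b).compareTo(lastViewDay.get(a)));
        return pageIds;
    }

    public static void main(String[] args) {
        System.out.println(mostRecentlyViewedFirst()); // [2, 4, 3, 1]
    }
}
```

Because each page is reduced to exactly one timestamp before sorting, there is no ambiguity left about which view decides the order.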
Is there a better performing query out there to do the same thing? I'd love to know, please leave a comment! :)
create table page (
page_id integer unsigned primary key,
name varchar(32) not null,
created datetime not null
) engine=InnoDB;
create table page_view (
page_view_id integer unsigned primary key,
page_id integer unsigned not null,
created datetime not null,
foreign key (page_id) references page (page_id) on delete cascade
) engine=InnoDB;
What I want to get is the most recently viewed pages. Let's say I have the following data in my tables:
mysql> select * from page;
+---------+--------+---------------------+
| page_id | name | created |
+---------+--------+---------------------+
| 1 | page 1 | 2000-01-01 00:00:00 |
| 2 | page 2 | 2000-01-02 00:00:00 |
| 3 | page 3 | 2000-01-03 00:00:00 |
| 4 | page 4 | 2000-01-04 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
mysql> select * from page_view;
+--------------+---------+---------------------+
| page_view_id | page_id | created |
+--------------+---------+---------------------+
| 1 | 3 | 2000-01-01 00:00:00 |
| 2 | 1 | 2000-01-02 00:00:00 |
| 3 | 1 | 2000-01-03 00:00:00 |
| 4 | 3 | 2000-01-04 00:00:00 |
| 5 | 2 | 2000-01-05 00:00:00 |
| 6 | 4 | 2000-01-06 00:00:00 |
| 7 | 2 | 2000-01-07 00:00:00 |
+--------------+---------+---------------------+
7 rows in set (0.00 sec)
What I want to get back is page 2 (most recently viewed), then page 4, then page 3, then page 1.
So I write my query:
mysql> select distinct p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc;
+---------+--------+---------------------+
| page_id | name | created |
+---------+--------+---------------------+
| 4 | page 4 | 2000-01-04 00:00:00 |
| 2 | page 2 | 2000-01-02 00:00:00 |
| 1 | page 1 | 2000-01-01 00:00:00 |
| 3 | page 3 | 2000-01-03 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
That's not right at all! What's going on?
The problem is that I'm using distinct just on the page table, but ordering by the page_view table. Since there is a many-to-one, what is the database supposed to do when a page has multiple views? which view should it use for the order by?
What I wanted the query to do is first join, then order, then apply the distinct. That's not what MySQL does, though. It first joins, then applies the distinct, then orders the results (or something like that). You can think of it like MySQL going sequentially through the page_view table, finding rows with distinct page ids. So it would pick rows 1,2,5,6:
+--------------+---------+---------------------+
| page_view_id | page_id | created |
+--------------+---------+---------------------+
| 1 | 3 | 2000-01-01 00:00:00 |
| 2 | 1 | 2000-01-02 00:00:00 |
| 5 | 2 | 2000-01-05 00:00:00 |
| 6 | 4 | 2000-01-06 00:00:00 |
+--------------+---------+---------------------+
4 rows in set (0.00 sec)
You can see that if you order those by created, descending, you get the page order that the (badly written) query returned (4, 2, 1, 3).
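The difference between the two evaluation orders can be sketched in a few lines of plain Python. This is only a model of the mental picture described above (the function names are made up for illustration), not a description of how MySQL is actually implemented:

```python
# Rows of page_view as (page_view_id, page_id, created), copied from the table above.
page_views = [
    (1, 3, "2000-01-01"),
    (2, 1, "2000-01-02"),
    (3, 1, "2000-01-03"),
    (4, 3, "2000-01-04"),
    (5, 2, "2000-01-05"),
    (6, 4, "2000-01-06"),
    (7, 2, "2000-01-07"),
]


def distinct_then_order(rows):
    """Keep the first row seen per page_id, then sort by created descending."""
    first_seen = {}
    for _view_id, page_id, created in rows:
        first_seen.setdefault(page_id, created)
    return sorted(first_seen, key=first_seen.get, reverse=True)


def order_then_distinct(rows):
    """Sort by created descending first, then keep the first row per page_id."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    seen = []
    for _view_id, page_id, _created in ordered:
        if page_id not in seen:
            seen.append(page_id)
    return seen


print(distinct_then_order(page_views))  # [4, 2, 1, 3]  (what the bad query returned)
print(order_then_distinct(page_views))  # [2, 4, 3, 1]  (what I actually wanted)
```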
We can force MySQL to do things in the order we want by changing the query to:
mysql> select distinct p.page_id, p.name, p.created from (select p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc) as p;
+---------+--------+---------------------+
| page_id | name | created |
+---------+--------+---------------------+
| 2 | page 2 | 2000-01-02 00:00:00 |
| 4 | page 4 | 2000-01-04 00:00:00 |
| 3 | page 3 | 2000-01-03 00:00:00 |
| 1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
But I think that's kind of a hack, and it depends on MySQL doing the distinct in a certain order (I don't think order by in a subquery is standard SQL, and it shouldn't necessarily constrain the order of the entire query). So what's the "right" way to write this type of query?
Before I tackled that, I thought, "What would a strict database like PostgreSQL do with this type of query?" My hope was that it would throw it out altogether. And it does. Here's what I get:
postgres=# select distinct t1.id, t1.name, t1.created from table1 t1 join table2 t2 on t1.id = t2.table1_id order by t2.created desc;
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
That's much better: the error message is helpful and makes sense. So here's the query I came up with that gives the correct results, and is valid SQL, in MySQL...:
mysql> select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
+---------+--------+---------------------+
| page_id | name | created |
+---------+--------+---------------------+
| 2 | page 2 | 2000-01-02 00:00:00 |
| 4 | page 4 | 2000-01-04 00:00:00 |
| 3 | page 3 | 2000-01-03 00:00:00 |
| 1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)
...and in PostgreSQL:
postgres=# select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
page_id | name | created
---------+--------+---------------------
2 | page 2 | 2000-01-02 00:00:00
4 | page 4 | 2000-01-04 00:00:00
3 | page 3 | 2000-01-03 00:00:00
1 | page 1 | 2000-01-01 00:00:00
(4 rows)
Is there a better performing query out there to do the same thing? I'd love to know, please leave a comment! :)
Monday, April 6, 2009
safely editing MySQL triggers in a production database
MySQL does not provide an atomic CREATE OR REPLACE TRIGGER, or an ALTER TRIGGER statement that will safely modify a trigger on a database while it is in use. The only way to update a TRIGGER is with a DROP and then a CREATE.
Why is that a big deal? Say, for example, you are using triggers to keep row counts up-to-date in a summary table. You may miss some inserts while you are issuing the DROP and then the CREATE. To verify this, I used mysqlslap. Here is my schema script:
drop table if exists triggertest.record_count;
create table triggertest.record_count
(
id INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT,
count_name VARCHAR(64) NOT NULL,
count_value INTEGER UNSIGNED NOT NULL DEFAULT 1,
UNIQUE (count_name)
) ENGINE=InnoDB;
drop table if exists triggertest.record_table;
create table triggertest.record_table
(
id INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT,
some_value VARCHAR(64) NOT NULL
) ENGINE=InnoDB;
DROP PROCEDURE IF EXISTS triggertest.sp_increment_record_count;
DELIMITER |
CREATE PROCEDURE triggertest.sp_increment_record_count(IN countname VARCHAR(64))
BEGIN
INSERT INTO triggertest.record_count(count_name, count_value) VALUES (countname,1) ON DUPLICATE KEY UPDATE count_value = count_value + 1;
END
|
DELIMITER ;
DROP TRIGGER IF EXISTS triggertest.tr_record_table_ins;
CREATE TRIGGER triggertest.tr_record_table_ins AFTER INSERT ON triggertest.record_table
FOR EACH ROW CALL triggertest.sp_increment_record_count('record_table');
I used mysqlslap to run a lot of inserts against the record_table, and while that was running, I re-created the trigger by running this script a bunch of times:
DROP TRIGGER IF EXISTS triggertest.tr_record_table_ins;
CREATE TRIGGER triggertest.tr_record_table_ins AFTER INSERT ON triggertest.record_table
FOR EACH ROW CALL triggertest.sp_increment_record_count('record_table');
I then verified that the count_value in record_count was smaller than the number of records in record_table:
mysql> select * from record_count;
+----+--------------+-------------+
| id | count_name | count_value |
+----+--------------+-------------+
| 1 | record_table | 9944 |
+----+--------------+-------------+
1 row in set (0.00 sec)
mysql> select count(*) from record_table;
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
1 row in set (0.00 sec)
mysql>
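The gap is easy to reproduce in miniature. The sketch below uses Python's built-in sqlite3 module on a single connection, so it is only an illustration of the DROP/CREATE window and not a reproduction of the mysqlslap test; the trigger body is simplified to a plain UPDATE on a pre-seeded row, which is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table record_count (count_name text primary key, count_value integer not null);
    create table record_table (id integer primary key autoincrement, some_value text not null);
    insert into record_count values ('record_table', 0);
""")

# Simplified stand-in for tr_record_table_ins: bump the counter on every insert.
CREATE_TRIGGER = """
    create trigger tr_record_table_ins after insert on record_table
    begin
        update record_count set count_value = count_value + 1
        where count_name = 'record_table';
    end
"""


def insert_rows(n):
    """Insert n dummy rows into record_table."""
    conn.executemany("insert into record_table (some_value) values (?)", [("x",)] * n)


conn.execute(CREATE_TRIGGER)
insert_rows(5)  # counted by the trigger
conn.execute("drop trigger tr_record_table_ins")
insert_rows(2)  # these land between the DROP and the CREATE, so nothing counts them
conn.execute(CREATE_TRIGGER)
insert_rows(3)  # counted again

counted = conn.execute("select count_value from record_count").fetchone()[0]
actual = conn.execute("select count(*) from record_table").fetchone()[0]
print(counted, actual)  # 8 10
```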
At first, I was not sure there would be a solution to this. I knew that you can lock tables, but my first guess was that since DDL statements (DROP, CREATE, etc.) cause an implicit commit, my locks would be released.
Fortunately, as the MySQL documentation explains, if you use LOCK TABLES, implicit commits don't release your locks. From the docs:
...statements that implicitly cause transactions to be committed do not release existing locks.
So the safe way to recreate my trigger is like this:
set autocommit=0;
lock tables triggertest.record_table write;
DROP TRIGGER IF EXISTS triggertest.tr_record_table_ins;
CREATE TRIGGER triggertest.tr_record_table_ins AFTER INSERT ON triggertest.record_table FOR EACH ROW CALL triggertest.sp_increment_record_count('record_table');
unlock tables;
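One thing to keep in mind when scripting this: table locks belong to the session that took them, so all of these statements have to go over the same connection. A minimal sketch of that, assuming a DB-API style cursor (the helper names and the RecordingCursor stand-in are hypothetical, used here so the example runs without a MySQL server):

```python
class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor; it only records what it is given."""

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)


def trigger_swap_statements(table, trigger, action):
    """Return the lock / drop / create / unlock sequence from the post, in order."""
    return [
        "SET autocommit=0",
        f"LOCK TABLES {table} WRITE",
        f"DROP TRIGGER IF EXISTS {trigger}",
        f"CREATE TRIGGER {trigger} AFTER INSERT ON {table} FOR EACH ROW {action}",
        "UNLOCK TABLES",
    ]


def run_in_one_session(cursor, statements):
    """Run everything on one cursor and always unlock, even if a statement fails."""
    *work, unlock = statements
    try:
        for sql in work:
            cursor.execute(sql)
    finally:
        cursor.execute(unlock)


statements = trigger_swap_statements(
    "triggertest.record_table",
    "triggertest.tr_record_table_ins",
    "CALL triggertest.sp_increment_record_count('record_table')",
)
cursor = RecordingCursor()
run_in_one_session(cursor, statements)
print("\n".join(cursor.executed))
```

With a real driver, the cursor would come from a single open connection; the try/finally is there so a failed CREATE does not leave the table locked.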
I had some trouble testing this with mysqlslap, even with only one thread running inserts, because of some locking issues (the inserts would error out with a 'Lock wait timeout'), but I did get a few tests to make it through, so I could verify that the count_value in record_count matched the number of rows in record_table. So the worst case seems to be that some of the inserts on the production database may hit a lock wait timeout, but no inserts will miss firing the trigger!
UPDATE: Soon after writing this post, I came across this: http://code.openark.org/blog/mysql/why-of-the-week, which may explain why I was having so many problems with deadlocks when I tried to run against a database that was in use. I'm not talking about deadlocks that MySQL detects and resolves by rolling back a transaction. Things would just lock up. No deadlock detected, no lock wait timeout, just locked up, until I killed a query. So while the solution above should work in theory, beware of MySQL locking bugs...