I was helping out PMI SFBAC (http://www.pmi-sfbac.org), whose web site was down.
The server was unresponsive, and Larry Van Cantfort (the Director of Operations for PMI SFBAC) sent me this clue:
It appears that httpd has a lot of running processes and when I try to kill them I am unable to. This is causing the server to sloooow way down. There are also error messages from mysqld with a corrupt table:

I couldn’t even get into the control panel to get things going, so the cleanest approach was to simply rebuild the server.
Before doing that, I shut down the database and set about getting it backed up so we’d at least save all the hard work of the volunteers.
I tried to run the dump with the following command, which I got from the Parallels support site:

But that gave me the same error:

A quick Google search turned up the “fix” for this, which is to run a repair on the tables: http://www.daveperrett.com/articles/2009/06/18/mysql-table-is-marked-as-crashed/

That fixed the issue, and I was able to create the “dumpall.sql”.
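
As a rough illustration of the repair-then-dump sequence (sketched in Python, and not the exact commands from the Parallels article or the linked post), the two steps could be scripted as below. The `admin` user name and the function names are assumptions made for the example. `mysqlcheck --repair` issues `REPAIR TABLE`, which only applies to engines such as MyISAM.

```python
import subprocess


def repair_command(user="admin"):
    """Build a mysqlcheck call that repairs every table in every database.

    The user name is an assumption; --password with no value makes the
    client prompt for it instead of putting it on the command line.
    """
    return ["mysqlcheck", "--repair", "--all-databases",
            "--user=" + user, "--password"]


def dump_command(outfile="dumpall.sql", user="admin"):
    """Build a mysqldump call that writes all databases to a single file."""
    return ["mysqldump", "--all-databases",
            "--user=" + user, "--password",
            "--result-file=" + outfile]


def repair_then_dump(outfile="dumpall.sql", user="admin"):
    """Run the repair first and only dump if the repair exits cleanly."""
    subprocess.run(repair_command(user), check=True)
    subprocess.run(dump_command(outfile, user), check=True)
```

Calling `repair_then_dump()` on the database host would prompt for the password twice, once per tool, and stop before the dump if the repair fails.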

Once I had ALL of the tables fixed, I rebuilt the server and migrated the data into a new database as described in the Parallels KB article mentioned above. There were a lot more steps to make sure the backup was good, but once it was, the newly imaged server ran without incident.


I just shot my blog in the foot, or more accurately, I didn’t follow IT 101 and back things up before making a change.

I had moved my site to be completely WordPress based a while ago, and as a result I had a bit of a convoluted setup on my server.

When I first set up my WordPress blog it was as a sub-domain of accuweaver.com, and was housed at http://wordpress.accuweaver.com/ (also aliased to http://blog.accuweaver.com/). The http://www.accuweaver.com/ site was just static pages that hadn’t changed for years.

So when I finally got my blog set up to host the few static pages I had, I just changed the directory on my server to be a symbolic link to the directory where wordpress.accuweaver.com had its content:

  1. Removed the directory httpdocs from /var/www/vhosts/accuweaver.com
  2. Added a link in that folder to /var/www/vhosts/accuweaver.com/subdomains/wordpress/httpdocs.
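
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the commands actually run on the server: the function name is hypothetical, and unlike step 1 it renames the old directory to a `.bak` copy instead of removing it, so the change can be undone.

```python
import os


def replace_dir_with_link(docroot, target):
    """Swap the directory at docroot for a symlink pointing at target.

    Deviation from the steps in the post: the old directory is kept as
    docroot + ".bak" rather than deleted. Returns the backup path.
    """
    backup = docroot + ".bak"
    os.rename(docroot, backup)      # keep the old content around
    os.symlink(target, docroot)     # docroot now resolves to target
    return backup
```

For the layout described here, the call would be `replace_dir_with_link("/var/www/vhosts/accuweaver.com/httpdocs", "/var/www/vhosts/accuweaver.com/subdomains/wordpress/httpdocs")`.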

This actually worked really well, since the content was only in one place, and all I had to do was change the host name in WordPress.

I am a big fan of Test Driven Development (TDD), and tools like Hudson/Jenkins that automate a continuous integration build are key to it.

On my current project we recently started moving things to Amazon EC2, and rather than put everything on one big server, I thought I’d follow the best practices in cloud computing and make a number of small special purpose servers to take care of the project’s needs.

We’ve had a Jenkins server running for a bit, so rather than reinventing the wheel, I figured I could copy my Jenkins configuration to a new server and get things up and running.

I fired up a new Tomcat server on Amazon Elastic Beanstalk, and loaded up the Jenkins WAR file, which quickly got me to a working Jenkins server. This project is written in PHP, so I had to install PHP after that, which meant logging in to the server and running through the whole PHP and PHPUnit setup.

Once that was done, I scp’d the Jenkins folders from the old server, edited the Tomcat startup files to include the environment variable to point Jenkins to the right place, changed a few permissions, and everything appeared to be working.

I could log in, I fired up the build, and it appeared to be running – very cool.

But that was when the flaw in the design of the PHP unit tests was exposed ….

I was watching the output of the phpunit tests, and noticed two things:

  1. The tests seemed to be taking a really long time
  2. Every test was failing

Watching the console, each time a test would fail, the little “E” would print, then a few seconds would go by and another “E” would appear. Finally after many minutes (because we have a LOT of classes to test) the error output appeared, and looked something like this for EVERY test:

And of course there were 5297 of these … I did some Google searches for the PHP_Invoker_TimeoutException, which mostly pointed to issues with upgrades from one version of PHPUnit to another, but the versions on the old server and this one were the same.

So my next step was debugging an individual test. Running the test from the command line gave me the same error, odd. But then I ran the test using php instead of the phpunit call, and found the problem: I was getting a timeout trying to open a database connection.

The issue, as it turns out, is a design flaw in our code that hadn’t shown up before: all the classes invoke a database connection class that sets up the connection to the database as soon as they are loaded.
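
To make the flaw concrete, here is a small sketch in Python (the project itself is PHP, and every class name below is hypothetical). It models "connects as soon as it is loaded" as "connects in the constructor" and contrasts that with a version that only connects when the connection is first used.

```python
class CountingDriver:
    """Hypothetical stand-in for a database driver; counts connect() calls."""

    def __init__(self):
        self.connect_calls = 0

    def connect(self):
        self.connect_calls += 1
        return object()  # placeholder for a real connection handle


class EagerModel:
    """Mirrors the flaw: creating the object always opens a connection."""

    def __init__(self, driver):
        self.connection = driver.connect()

    def full_name(self, first, last):
        return first + " " + last  # needs no database at all


class LazyModel:
    """Defers the connection until something actually asks for it."""

    def __init__(self, driver):
        self._driver = driver
        self._connection = None

    @property
    def connection(self):
        if self._connection is None:
            self._connection = self._driver.connect()
        return self._connection

    def full_name(self, first, last):
        return first + " " + last
```

With the eager version, even a test of `full_name` pays for a connection attempt, which is what turns an unreachable database into a failure of every test.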

Since the Elastic Beanstalk server was in a security group that wasn’t allowed to connect to the RDS database, it was unable to connect at all, and PHPUnit would simply time out before the connection attempt failed (by default phpunit sets 1 second as the acceptable time for a test to run, in order to catch endless loops).

Now in theory our tests shouldn’t be hitting the database (at least not for these unit tests since we don’t want them updating anything on the backend), so this problem turned out to be very fortuitous. Because the Jenkins server couldn’t reach the database, it exposed a flaw in our unit tests: we weren’t mocking all the things we needed to, so the tests were actually opening connections to the database.

With some refactoring of the test classes to mock the database access layer, the tests all succeeded. Next we’ll need to do the actual DBUnit tests for the database, and Selenium or HTTPUnit tests for all the front-end and AJAX stuff.
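
The shape of that refactoring, sketched in Python with `unittest.mock` and not taken from our PHP test suite, looks roughly like this. `UserRepository` and `fetch_one` are hypothetical names chosen for the example; the point is that the class receives its database layer from outside, so the test can hand it a mock and never open a connection.

```python
import unittest
from unittest import mock


class UserRepository:
    """Hypothetical class under test; the database layer is injected."""

    def __init__(self, db):
        self._db = db

    def display_name(self, user_id):
        row = self._db.fetch_one(
            "SELECT first, last FROM users WHERE id = %s", (user_id,)
        )
        return row["first"] + " " + row["last"]


class UserRepositoryTest(unittest.TestCase):
    def test_display_name_uses_mocked_db(self):
        db = mock.Mock()  # replaces the real database access layer
        db.fetch_one.return_value = {"first": "Ada", "last": "Lovelace"}

        repo = UserRepository(db)

        self.assertEqual(repo.display_name(7), "Ada Lovelace")
        db.fetch_one.assert_called_once()  # exactly one query, no real I/O
```

Saved as a module, this runs with `python -m unittest` and passes whether or not any database is reachable.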